6. Combine, Split & Index
This section covers operations that manipulate samples (columns) in a .cx file:
- yame index: build or update a sample index
- yame split: split a multi-sample .cx file into individual .cx files
- yame subset: extract selected samples or states
- Combining files: achieved using standard Unix cat
These tools are essential for workflows that involve multiple .cx samples, such as merging epigenomic features, analyzing groups of samples, or reorganizing sample structure.
6.1 Generating and Updating Index Files (yame index)
A .cx file containing multiple samples stores its samples sequentially, and YAME uses an accompanying index file (.cx.idx) to record the byte offset of each sample.
You can generate an index using:
yame index yourfile.cx
This produces:
yourfile.cx.idx
with two columns:
sample_name byte_offset
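Because the index is a plain two-column text file, standard Unix tools can inspect it. The following is a minimal sketch, assuming the columns are separated by whitespace (the delimiter is not stated here) and using a hand-written, hypothetical demo.cx.idx rather than output from a real run:

```shell
# Hypothetical index contents for illustration only; a real index
# would be produced by `yame index yourfile.cx`.
workdir=$(mktemp -d)
idx="$workdir/demo.cx.idx"
printf 'SampleA\t0\nSampleB\t1024\nSampleC\t4096\n' > "$idx"

# One index line per sample, so the line count is the sample count.
n_samples=$(awk 'END { print NR }' "$idx")
echo "samples: $n_samples"

# Look up the recorded byte offset of one sample by name.
offset_b=$(awk -v name="SampleB" '$1 == name { print $2 }' "$idx")
echo "SampleB offset: $offset_b"
```

The sample names and offsets in this sketch are made up; only the two-column layout comes from the description above.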
Assigning Sample Names from a List
If the .cx has N samples but no index file, provide a sample-name list:
yame index -s sample_names.tsv yourfile.cx
sample_names.tsv contains names in its first column.
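One way to prepare such a list is to cut the name column out of an existing metadata table. This is a sketch under assumptions: metadata.tsv is a hypothetical tab-separated table with the sample name in column 1, and the names are assumed to be assigned to samples in file order, so the rows should follow the order in which samples were written to the .cx file.

```shell
workdir=$(mktemp -d)

# Hypothetical metadata table: column 1 = sample name, column 2 = a note.
printf 'Liver_1\tcontrol\nLiver_2\ttreated\nKidney_1\tcontrol\n' \
  > "$workdir/metadata.tsv"

# Keep only the first column as the name list.
cut -f1 "$workdir/metadata.tsv" > "$workdir/sample_names.tsv"

# Report duplicated names, which would make later lookups by name ambiguous.
dups=$(sort "$workdir/sample_names.tsv" | uniq -d)
if [ -n "$dups" ]; then
  echo "duplicate sample names: $dups" >&2
fi

n_names=$(awk 'END { print NR }' "$workdir/sample_names.tsv")
echo "$n_names names written"
```

The resulting sample_names.tsv can then be passed to yame index -s as shown above.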
Appending a New Sample (-1)
If you have added an extra cdata block to the end of a .cx file, you may append it to the existing index:
yame index -1 NewSampleName yourfile.cx
YAME locates the final block and records the offset of the newly appended sample.
Output to Console
yame index -c yourfile.cx
This prints the index to stdout instead of writing yourfile.cx.idx.
6.2 Example: Merging Multiple .cm Feature Files
This example demonstrates how to:
- Convert many BED files into .cm feature files
- QC them
- Merge them together
- Generate a combined index
Step 1 — Prepare the table describing sample IDs and BED paths
268 GSM648494 human_hm/268_sort_peaks.narrowPeak.bed
269 GSM648495 human_hm/269_sort_peaks.narrowPeak.bed
272 GSM575295 human_hm/272_b_sort_peaks.broadPeak.bed
...
Step 2 — Convert each BED into .cm
cat controlfiles.tsv \
  | parallel --colsep '\t' -j 72 '
      id={1}; path={3};
      sortbed "$path" \
        | bedtools intersect -a cpg_nocontig.bed.gz -b - -sorted -c \
        | cut -f4 \
        | yame pack -f b - "$id.cm"
    '
Each $id.cm is a binary feature mask aligned to the CpG coordinate list.
Step 3 — QC each .cm
awk '{print ""$1".cm", $2";"$4;}' controlfiles.tsv \
| while read fn anno; do yame summary $fn; done \
> qc.txt
Example filter: keep feature files with more than 5000 overlapping CpGs (the threshold applied to column 6 of qc.txt in Step 4).
Step 4 — Merge and index
awk '$1!~/QFile/ && $6>5000' qc.txt \
  | awk 'NR==FNR{a[$1]=1;}NR!=FNR&&($1".cm" in a){print $0;}' - controlfiles.tsv \
  | awk '{print ""$1".cm", $2";"$4;}' \
  | sort -k2,2 \
  | while read -r fn anno; do
      cat "$fn" >> merged.cm
      yame index -1 "$anno" merged.cm
    done
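merged.cm is the concatenation of all retained .cm samples, and its index is updated after each file is appended. Because the loop appends with >>, remove any existing merged.cm and its index before rerunning.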
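The two-file awk filter in Step 4 can be tried in isolation. The sketch below uses a placeholder qc.txt in which only column 1 (file name) and column 6 (the thresholded count) carry meaning; the remaining columns are dummy values and do not reproduce the real yame summary layout. The counts are invented for illustration.

```shell
workdir=$(mktemp -d)

# Placeholder QC table: header starts with "QFile"; column 6 is the count.
printf '%s\n' \
  'QFile c2 c3 c4 c5 c6' \
  '268.cm . . . . 12000' \
  '269.cm . . . . 300' \
  '272.cm . . . . 8000' > "$workdir/qc.txt"

# Sample table in the layout of Step 1 (ID, accession, BED path).
printf '%s\t%s\t%s\n' \
  268 GSM648494 human_hm/268_sort_peaks.narrowPeak.bed \
  269 GSM648495 human_hm/269_sort_peaks.narrowPeak.bed \
  272 GSM575295 human_hm/272_b_sort_peaks.broadPeak.bed \
  > "$workdir/controlfiles.tsv"

# Pass 1 (stdin, where NR == FNR): remember file names above the threshold.
# Pass 2 (controlfiles.tsv): print "<id>.cm" for IDs that were remembered.
kept=$(awk '$1 !~ /QFile/ && $6 > 5000' "$workdir/qc.txt" \
  | awk 'NR == FNR { a[$1] = 1; next }
         ($1 ".cm") in a { print $1 ".cm" }' - "$workdir/controlfiles.tsv")
echo "$kept"
```

Here 269.cm falls below the threshold, so only 268.cm and 272.cm would go on to the concatenation loop.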
6.3 Splitting Multi-Sample Files (yame split)
yame split takes a multi-sample .cx file and produces one .cx file per sample.
Basic usage:
yame split input.cx output_prefix
If sample names are present in the index, the output naming scheme becomes:
output_prefix<SampleName>.cx
Otherwise:
output_prefix_split_1.cx
output_prefix_split_2.cx
...
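Before splitting a large file, it can help to preview the file names that the index-based naming scheme would yield. This sketch only mirrors the scheme described above with awk; it does not call yame, and the index contents and the name input.cx.idx are hypothetical.

```shell
workdir=$(mktemp -d)
idx="$workdir/input.cx.idx"

# Hypothetical index with two named samples.
printf 'Cell_01\t0\nCell_07\t2048\n' > "$idx"

# Expected output names: <prefix><SampleName>.cx, one per index line.
prefix="output_prefix"
expected=$(awk -v p="$prefix" '{ print p $1 ".cx" }' "$idx")
echo "$expected"

# Duplicate sample names would map to the same output file name.
dup_names=$(awk '{ print $1 }' "$idx" | sort | uniq -d)
if [ -n "$dup_names" ]; then
  echo "names that would collide: $dup_names" >&2
fi
```

Note that the prefix is joined to the sample name directly, so a trailing separator such as an underscore has to be part of the prefix itself.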
Providing a Sample List
If the .cx has no index file but you know sample names:
yame split -s sample_list.txt input.cx prefix_
sample_list.txt should contain one name per line.
This preserves sample naming and ensures the prefix_<sample>.cx files correspond correctly.
For more help:
6.4 Subsetting Samples (yame subset)
yame subset extracts a subset of samples from a multi-sample .cx file. It uses the .cx.idx file to locate and extract the requested samples efficiently.
Basic syntax:
yame subset -l sample_list.txt input.cx > subset.cx
or:
yame subset input.cx SampleA SampleB SampleC > subset.cx
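When the samples of interest follow a naming pattern, the list for -l can be derived from the index instead of being typed by hand. This is a minimal sketch, assuming the list file holds one sample name per line and the index columns are whitespace-separated; the index contents below are hypothetical.

```shell
workdir=$(mktemp -d)
idx="$workdir/singlecell.cx.idx"

# Hypothetical index mixing single-cell and bulk samples.
printf 'Cell_01\t0\nCell_07\t2048\nBulk_A\t4096\nCell_33\t6144\n' > "$idx"

# Select every sample whose name starts with "Cell_".
awk '$1 ~ /^Cell_/ { print $1 }' "$idx" > "$workdir/sample_list.txt"

n_selected=$(awk 'END { print NR }' "$workdir/sample_list.txt")
echo "$n_selected samples selected"

# With yame installed and a real singlecell.cx, the list would be used as:
#   yame subset -l sample_list.txt singlecell.cx > subset.cx
```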
If you specify an output file via -o, YAME writes both:
- the new subset .cx
- a new index .cx.idx
Example:
yame subset -o cluster1.cx singlecell.cx Cell_01 Cell_07 Cell_33
Head / Tail extraction
Useful for inspecting the first or last N samples:
yame subset -H 10 input.cx > first10.cx # first 10 samples
yame subset -T 5 input.cx > last5.cx # last 5 samples
Subsetting Format 2 states (-s)
If the input is a format 2 .cx file (categorical states), you may split states into binary masks:
yame subset -s -l state_list.txt -o states.cx chromatin_states.cx
This produces one binary vector per selected state.
For more help:
6.5 Combining .cx Files
YAME does not provide a dedicated combine command because combining .cx files is equivalent to concatenation:
cat sample1.cx sample2.cx sample3.cx > combined.cx
yame index combined.cx
Rules:
- All .cx files must have the same format and same row dimension.
- After combining, run yame index to regenerate the sample index.
This pattern is used in the .cm merging example above.
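To keep sample names attached when combining, a name list can be built from the input file names in the same order as the concatenation. The sketch below assumes each input holds exactly one sample and uses stand-in files so that it runs anywhere; real inputs would be valid .cx files, and the final yame call is left as a comment.

```shell
workdir=$(mktemp -d)

# Stand-in inputs for illustration; these are not valid .cx files.
for s in sample1 sample2 sample3; do
  printf '%s' "$s" > "$workdir/$s.cx"
done

# Fix the input order once and reuse it for both the names and the data.
set -- "$workdir/sample1.cx" "$workdir/sample2.cx" "$workdir/sample3.cx"

: > "$workdir/names.tsv"
for f in "$@"; do
  basename "$f" .cx >> "$workdir/names.tsv"
done

cat "$@" > "$workdir/combined.cx"

# With real .cx inputs and yame installed:
#   yame index -s names.tsv combined.cx
```

Deriving both the name list and the concatenation from the same argument list keeps the two in step, which matters because the names are matched to samples by position.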
Summary of Commands
| Command | Purpose | Notes |
|---|---|---|
| yame index | Build or update a .cx.idx sample index | Required for fast sample lookup |
| yame split | Produce one .cx per sample | Naming uses index or user-supplied list |
| yame subset | Extract selected samples or states | Supports head/tail and format-2 filtering |
| cat | Combine .cx files | Must re-index after concatenation |