Format 0 – Binary Presence/Absence Data
Format 0 stores binary data, one bit per CpG, representing presence/absence or a simple methylation state.
- Typical extension: .cg
- Input: a vector of 0/1 values, one per CpG
- Best for:
  - Binary methylation calls
  - Presence/absence features
  - DMR indicators
  - Peak overlap masks
  - Cell-level binary CpG accessibility
Format 0 is the smallest and fastest CX format: 8 CpGs per byte (~32× compression over text).
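As a rough illustration of the one-bit-per-CpG arithmetic, the sketch below computes how many payload bytes a given number of CpGs would need. The helper name packed_bytes is hypothetical, and the figure ignores any header or additional compression in the real file, so treat it as a lower-bound estimate rather than the actual on-disk size.

```shell
# Hypothetical helper: payload bytes needed for N one-bit values, i.e. ceil(N / 8).
# Ignores any header or extra compression YAME may apply on disk.
packed_bytes() {
  echo $(( ($1 + 7) / 8 ))
}

packed_bytes 5          # 5 CpGs fit in a single byte
packed_bytes 28000000   # an example count of 28 million CpGs
```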
1. Input Requirements
Your input must be:
- Aligned to a CpG reference file (.cr, format 7)
- Exactly one line per CpG
- Each line containing 0 or 1
Example input (binary_data.txt):
1
0
1
1
0
Interpretation:
- 1 → presence, methylated, peak overlap, or “TRUE”
- 0 → absence, unmethylated, no peak, or “FALSE”
Format 0 has no NA value. Use Format 4 if you need NA handling, or Format 6 if you need universe masking.
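Because Format 0 cannot represent anything other than 0 or 1, it can help to check the text input before packing. The following is a minimal sketch using plain awk; check_binary_input is a hypothetical helper, not a YAME command, and the expected line count is passed in by the caller.

```shell
# Hypothetical helper: succeed only if every line of FILE ($1) is exactly 0 or 1
# and the file has EXPECTED ($2) lines, i.e. one line per CpG in the reference.
check_binary_input() {
  awk -v n="$2" '
    $0 != "0" && $0 != "1" { printf "line %d: invalid value \"%s\"\n", NR, $0; bad = 1 }
    END {
      if (NR != n) { printf "expected %d lines, found %d\n", n, NR; bad = 1 }
      exit bad + 0
    }' "$1" >&2
}

# Small demo on a temporary file with five valid values.
tmp=$(mktemp)
printf '1\n0\n1\n1\n0\n' > "$tmp"
check_binary_input "$tmp" 5 && echo "input looks valid"
rm -f "$tmp"
```

In practice the expected count would come from the reference, for example the line count of `yame unpack cpg_nocontig.cr`, assuming that unpacking the reference yields one line per CpG.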
2. Packing to Format 0
2.1 From a raw 0/1 text vector
yame pack -fb binary_data.txt > binary_output.cg
-fb selects Format 0 (b).
2.2 From BED features or presence/absence annotations
This is the canonical way to create a .cm/.cg mask from BED:
bedtools sort -i dmr_sites.bed \
| bedtools intersect -a cpg_ref.bed.gz -b - -sorted -c \
| cut -f4 \
| yame pack -fb - > dmr_binary.cg
Explanation:
- bedtools intersect -c counts how many times each CpG overlaps the BED input
- cut -f4 extracts that count (0 or >0)
- yame pack -fb converts the result into a compressed CX file
If a CpG overlaps ≥ 1 region, the value becomes 1; otherwise 0.
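The pipeline above hands the raw overlap counts to yame pack -fb. If you would rather make the "≥ 1 becomes 1" rule explicit before packing, a small awk filter can do it. This is an optional sketch; counts_to_binary is a hypothetical helper name, and it assumes the count sits in column 4 as in the example above.

```shell
# Hypothetical helper: collapse overlap counts (column 4 of tab-separated
# `bedtools intersect -c` output) to 0/1, one value per line.
counts_to_binary() {
  awk -F '\t' '{ print (($4 > 0) ? 1 : 0) }'
}

# Demo: three CpGs with overlap counts 0, 3 and 1.
printf 'chr1\t100\t102\t0\nchr1\t200\t202\t3\nchr1\t300\t302\t1\n' | counts_to_binary
```

In the pipeline it would take the place of the cut -f4 step, since it reads the fourth column itself.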
3. Unpacking Format 0
To get the 0/1 vector back:
yame unpack binary_output.cg | head
Example output:
1
0
1
1
0
Format 0 does not include sample-level metadata unless packed from multiple samples; YAME will unpack each sample sequentially if multiple samples exist.
4. Integration with Other YAME Commands
Format 0 integrates well with many downstream YAME tools.
4.1 yame summary
Format 0 is ideal for:
- Feature masks
- Peak overlaps
- Region presence indicators
Example:
yame summary -m promoter.cm sample.cg
Outputs for each mask:
- N_mask
- N_overlap
- log2OddsRatio
- Binary fraction (Beta)
- Universe counts depending on the query
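To give a feel for what these columns describe, the sketch below computes comparable quantities from two plain 0/1 text columns (mask and query). It is an interpretation, not YAME's implementation: it assumes N_overlap counts CpGs that are 1 in both, takes Beta as N_overlap / N_mask, and uses the standard 2×2 odds ratio with no pseudo-counts. The helper name overlap_stats and the extra N_universe and N_query fields are hypothetical.

```shell
# Hypothetical helper. Input: two tab-separated 0/1 columns per CpG (mask, query).
# Output: counts, fraction of mask CpGs with query == 1, and log2 odds ratio.
overlap_stats() {
  awk -F '\t' '
    { n++; if ($1 == 1) nm++; if ($2 == 1) nq++; if ($1 == 1 && $2 == 1) nmq++ }
    END {
      # 2x2 table: a = both 1, b = mask only, c = query only, d = neither.
      a = nmq + 0; b = nm - nmq; c = nq - nmq; d = n - nm - nq + nmq
      beta = (nm > 0) ? sprintf("%.4f", a / nm) : "NA"
      lor = (a > 0 && b > 0 && c > 0 && d > 0) ? sprintf("%.4f", log((a * d) / (b * c)) / log(2)) : "NA"
      printf "N_universe=%d N_mask=%d N_query=%d N_overlap=%d Beta=%s log2OR=%s\n", n, nm, nq, a, beta, lor
    }'
}

# Demo: 8 CpGs, 4 in the mask, 4 set in the query, 3 shared.
printf '1\t1\n1\t1\n1\t1\n1\t0\n0\t1\n0\t0\n0\t0\n0\t0\n' | overlap_stats
```

With real text vectors the two columns could be joined with `paste mask.txt query.txt` before piping into the helper.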
4.2 yame rowop
Useful operations on binary data:
# Sum across samples (pseudobulk for binary)
yame rowop -o binasum multi_sample.cg > pseudobulk.cg
# Convert to binary string representation
yame rowop -o binstring multi_sample.cg > patterns.txt
binasum → format 3 output, with M = #1 votes and U = #0 votes.
binstring → one binary string per row, e.g.:
01011001
11100011
...
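The two operations can be pictured on a plain-text matrix with one row per CpG and one 0/1 column per sample. The sketch below is only an illustration of the semantics described above (a per-row binary string, plus M as the number of 1 votes and U as the number of 0 votes); it does not read .cg files, and row_votes is a hypothetical helper.

```shell
# Hypothetical helper. Input: one row per CpG, one tab-separated 0/1 column per sample.
# Output per row: binary string, M (#1 votes), U (#0 votes), tab-separated.
row_votes() {
  awk -F '\t' '{
    m = 0; s = ""
    for (i = 1; i <= NF; i++) { m += $i; s = s $i }
    printf "%s\t%d\t%d\n", s, m, NF - m
  }'
}

# Demo: two CpGs across three samples.
printf '0\t1\t0\n1\t1\t1\n' | row_votes
```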
4.3 yame dsample
Downsample a binary file to N “present” sites:
yame dsample -N 10000 -s 1 binary.cg > dsampled.cg
For Format 0:
- Eligible sites are those with value 1
- Non-selected sites become 0
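The rule above can be mimicked on a plain 0/1 text vector as follows. This is a sketch of the described behavior, not YAME's code: downsample_ones is a hypothetical helper, the random number generator is awk's and will not reproduce YAME's selection for the same seed, and keeping every 1 when N exceeds the number of eligible sites is an assumption.

```shell
# Hypothetical helper: downsample_ones N SEED
# Reads one 0/1 value per line; keeps N randomly chosen 1-sites, zeroes the rest.
downsample_ones() {
  awk -v n="$1" -v seed="$2" '
    BEGIN { srand(seed) }
    { if ($1 == 1) ones[++k] = NR }
    END {
      if (n > k) n = k   # assumption: keep all 1s when fewer than N exist
      # Partial Fisher-Yates shuffle: after step i, ones[1..i] are the kept rows.
      for (i = 1; i <= n; i++) {
        j = i + int(rand() * (k - i + 1))
        t = ones[i]; ones[i] = ones[j]; ones[j] = t
        keep[ones[i]] = 1
      }
      for (r = 1; r <= NR; r++) print ((r in keep) ? 1 : 0)
    }'
}

# Demo: keep 2 of the 4 present sites, using seed 1.
printf '1\n0\n1\n1\n0\n1\n' | downsample_ones 2 1
```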
4.4 yame rowsub
Row subsetting works for Format 0 the same way as for other formats:
# By row IDs
yame rowsub -l row_ids.txt binary.cg > subset.cg
# By coordinate list and reference
yame rowsub -R cpg_nocontig.cr -L CpG_sites.txt binary.cg > subset.cg
# By mask
yame rowsub -m promoter.cm binary.cg > subset.promoters.cg
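For intuition, subsetting by mask can be pictured on plain text as keeping only the rows where the mask is 1. The sketch below shows that idea on two pasted columns; subset_by_mask is a hypothetical helper, and it assumes the mask and the data are aligned to the same CpG reference.

```shell
# Hypothetical helper. Input: two tab-separated columns per CpG: mask (0/1), value.
# Output: the value column for rows where the mask is 1.
subset_by_mask() {
  awk -F '\t' '$1 == 1 { print $2 }'
}

# Demo: four CpGs, two of them inside the mask.
printf '1\t0\n0\t1\n1\t1\n0\t0\n' | subset_by_mask
```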
4.5 yame mask
Format 0 plays especially well with yame mask:
# Mask out positions where mask == 1
yame mask binary.cg lowquality.cm -o masked_binary.cg
If you use -c, Format 0 can be contextualized into Format 6 to define a universe for sparse annotations.
5. When to Use Format 0 vs Other Formats
Use Format 0 when:
- Your data is intrinsically binary (e.g., peak/no-peak)
- You want minimal storage footprint
- You want maximum speed for summarization / enrichment
- You are constructing feature files (promoters, enhancers, windows)
Consider alternatives if:
- You need M/U counts → Format 3
- You need NA handling or fractions → Format 4
- You need structured query + universe semantics → Format 6
6. Minimal End-to-End Example
# 1. Prepare CpG reference (only once)
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz
# 2. Create binary mask from BED peaks (sort first: -sorted requires sorted input)
bedtools sort -i H3K27ac.bed \
| bedtools intersect -a cpg_ref.bed.gz -b - -sorted -c \
| cut -f4 \
| yame pack -fb - > H3K27ac.cm
# 3. Summarize enrichment of a methylation sample over H3K27ac peaks
yame summary -m H3K27ac.cm sample.cg > enrich.txt
# 4. Subset sample to only CpGs in promoter regions
yame rowsub -m promoters.cm sample.cg > sample.promoters.cg
# 5. Downsample the promoter mask to 5000 sites
yame dsample -N 5000 promoters.cm > promoters_5k.cm