Format 3 – Methylated & Unmethylated Counts
Format 3 stores paired M/U (methylated/unmethylated) read counts per CpG from bisulfite-based assays (e.g., WGBS, RRBS).
- Typical extension: .cg
- Input: two integers per CpG: M (methylated) and U (unmethylated)
- Best for: high-coverage sequencing where you want to preserve read counts and coverage
1. Input Requirements
Before packing into format 3, your data must be:
- Aligned to a CpG reference coordinate file (format 7, e.g. cpg_nocontig.cr)
- One row per CpG, in the same order and of the same length as the reference
- Each row: two non-negative integers, M and U
Example input (mu_counts.txt):
12 3
8 8
0 15
20 2
0 0
5 10
Interpretation per line:
- Column 1: methylated count (M)
- Column 2: unmethylated count (U)
- Both must be ≥ 0; M = U = 0 means “no coverage / missing”
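The input rules above can be checked before packing. The following is a minimal sketch in plain awk, assuming a whitespace-separated text table; `check_mu` is a hypothetical helper name, not a yame command.

```shell
# Sketch: validate an M/U table before packing.
# check_mu is a hypothetical helper; it only inspects the text table.
check_mu() {
  # Succeeds only if every line holds exactly two non-negative integers.
  awk '
    NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {
      printf "line %d: expected two non-negative integers, got: %s\n", NR, $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Toy data: the example table from this section, written to a temp directory.
tmp=$(mktemp -d)
printf '12 3\n8 8\n0 15\n20 2\n0 0\n5 10\n' > "$tmp/mu_counts.txt"
check_mu "$tmp/mu_counts.txt" && echo "mu_counts.txt: OK"
```

A non-zero exit status means at least one line would not be a valid M/U pair; the offending lines are printed with their line numbers.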
2. Packing to Format 3
2.1 From an aligned M/U table
If you already have a table of M U pairs, aligned to the CpG reference:
yame pack -f3 mu_counts.txt > sample.cg
This creates a compressed .cg file in format 3.
2.2 From BED-like methylation calls
If your pipeline outputs per-CpG counts as BED:
# mu_calls.bed columns (example):
# chr start end M U
bedtools intersect -a cpg_ref.bed.gz -b mu_calls.bed -loj -sorted \
| awk '{ if ($8 == ".") print "0\t0"; else print $8"\t"$9 }' \
| yame pack -f3 - > sample.cg
Key points:
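- -loj (left outer join) keeps every reference CpG in the output, even when no call overlaps it. Each CpG appears exactly once provided mu_calls.bed has at most one record per CpG.
- CpGs not covered in mu_calls.bed are assigned 0 0 (no coverage).
- The awk step produces two columns (M, U) suitable for yame pack -f3. The field numbers $8 and $9 depend on how many columns cpg_ref.bed.gz has, so check them against a line of the bedtools output.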
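Because the packed file must have one row per reference CpG, a row-count comparison between the reference and the aligned table is a cheap check before running yame pack. This is a minimal sketch, assuming both inputs are uncompressed text files; `same_rows` is a hypothetical helper name.

```shell
# Sketch: confirm an aligned M/U table has one row per reference CpG.
# same_rows is a hypothetical helper; both arguments are plain-text files.
same_rows() {
  ref_n=$(wc -l < "$1")
  tab_n=$(wc -l < "$2")
  if [ "$ref_n" -eq "$tab_n" ]; then
    echo "OK: $((ref_n)) rows"
  else
    echo "MISMATCH: reference has $((ref_n)) rows, table has $((tab_n))"
    return 1
  fi
}

# Toy data: three reference CpGs and three aligned M/U rows.
tmp=$(mktemp -d)
printf 'chr1\t100\t102\nchr1\t200\t202\nchr1\t300\t302\n' > "$tmp/ref.bed"
printf '12\t3\n0\t0\n5\t10\n' > "$tmp/aligned.txt"
same_rows "$tmp/ref.bed" "$tmp/aligned.txt"
```

For a gzipped reference, decompress it first (for example with gzip -dc) and pass the resulting file. A table with more rows than the reference usually points to duplicate records in the calls file.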
3. Unpacking Format 3
By default, yame unpack converts M/U counts into beta and coverage:
yame unpack sample.cg | head
Example output:
0.800 15
0.500 16
0.000 15
0.909 22
NA 0
0.333 15
Columns:
- Beta – M / (M + U); NA if M + U == 0
- Coverage – M + U
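The two formulas can be reproduced on a plain M/U text table with awk. This is a sketch of the arithmetic only, not yame's implementation; `mu_to_beta` is a hypothetical helper name, and the three-decimal, tab-separated output is an assumption made for this illustration.

```shell
# Sketch: derive beta and coverage from an M/U table.
# mu_to_beta is a hypothetical helper that mirrors the formulas above.
mu_to_beta() {
  awk '{
    cov = $1 + $2                      # coverage = M + U
    if (cov == 0) printf "NA\t0\n"     # no reads: beta is undefined
    else printf "%.3f\t%d\n", $1 / cov, cov
  }' "$@"
}

# Toy data: the mu_counts.txt example from the input section.
printf '12 3\n8 8\n0 15\n20 2\n0 0\n5 10\n' | mu_to_beta
```

Run on the mu_counts.txt example, this yields the same beta and coverage values as the example output shown above.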
You can filter by minimum coverage with -f:
# Only output CpGs with coverage ≥ 5
yame unpack -f 5 sample.cg > sample_cov5.txt
This is convenient for QC and downstream tools that expect beta + coverage.
4. Integration with Other YAME Commands
Format 3 works with most downstream operations:
4.1 yame summary
yame summary sample.cg
yame summary -m feature.cm sample.cg
For each sample and each feature, summary reports:
- Number of CpGs in universe / query / mask
- Overlap counts and log2 odds ratio
- Average beta in feature regions
- Average depth in feature regions
4.2 yame rowop (row-wise operations)
Some common operations on format 3:
# Per-CpG mean beta across samples
yame rowop -o mean sample.cg > mean_beta.tsv
# Per-CpG standard deviation of beta across samples
yame rowop -o std sample.cg > beta_std.tsv
# Sum M and U across samples (true pseudobulk counts)
yame rowop -o musum sample.cg bulk.cg
# Binarize and sum (vote-based pseudobulk)
yame rowop -o binasum -c 3 sample.cg bulk_binasum.cg
Quick interpretations:
- -o mean: For each CpG, computes the mean beta across samples that have coverage ≥ -c (default 1). Output: beta_mean n_samples_used.
- -o std: For each CpG, computes the standard deviation of beta across samples with coverage ≥ -c. Output: beta_sd n_samples_used.
- -o musum: Sums M and U directly across samples, preserving counts. Output is a new format 3 .cg file.
- -o binasum: Converts each sample to a binary call by comparing M vs U (and coverage ≥ -c), then counts methylated vs unmethylated “votes” across samples.
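The difference between averaging betas and summing counts is easier to see on toy data. This sketch works on per-sample M/U text tables with paste and awk; `musum` and `mean_beta` are hypothetical helper names that mimic the descriptions above, and printing NA when no sample passes the coverage cutoff is an assumption, not confirmed yame behavior.

```shell
# Sketch: per-CpG musum and mean beta across samples (not yame code).
tmp=$(mktemp -d)
printf '12 3\n0 0\n5 10\n' > "$tmp/s1.txt"   # sample 1: M U per CpG
printf '1 3\n0 0\n0 0\n'   > "$tmp/s2.txt"   # sample 2: M U per CpG

# musum: add M and U separately across all samples.
musum() {
  paste "$@" | awk '{
    m = 0; u = 0
    for (i = 1; i < NF; i += 2) { m += $i; u += $(i + 1) }
    printf "%d\t%d\n", m, u
  }'
}

# mean_beta: average beta over samples with coverage >= mincov.
# Prints beta_mean and n_samples_used; NA is an assumed placeholder.
mean_beta() {
  mincov=$1; shift
  paste "$@" | awk -v c="$mincov" '{
    s = 0; n = 0
    for (i = 1; i < NF; i += 2) {
      cov = $i + $(i + 1)
      if (cov > 0 && cov >= c) { s += $i / cov; n++ }
    }
    if (n) printf "%.3f\t%d\n", s / n, n
    else   printf "NA\t0\n"
  }'
}

musum "$tmp/s1.txt" "$tmp/s2.txt"
mean_beta 1 "$tmp/s1.txt" "$tmp/s2.txt"
```

For the first CpG, the summed counts are 13 and 6 (pooled beta about 0.684), while the mean of the two per-sample betas is 0.525. Summing counts weights each sample by its coverage; averaging betas weights each sample equally.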
4.3 yame dsample (downsampling)
You can create downsampled versions of format 3 data to simulate reduced coverage or increased sparsity:
# Keep 50,000 covered CpGs per sample, fixed seed
yame dsample -N 50000 -s 1 sample.cg > sample_N50k.cg
Behavior for format 3:
- Eligible sites: CpGs with M + U > 0
- Randomly selects up to N eligible sites; the rest are masked by setting M = U = 0.
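The masking behavior described above can be sketched on a plain M/U text table. `dsample_mu` is a hypothetical helper name; it uses awk's own random number generator, so it illustrates the logic but will not select the same sites as yame for a given seed.

```shell
# Sketch: keep N randomly chosen covered CpGs and mask the rest as 0 0.
# dsample_mu is a hypothetical helper, not yame's implementation.
dsample_mu() {   # usage: dsample_mu N SEED table.txt
  awk -v n="$1" -v seed="$2" '
    { line[NR] = $0; if ($1 + $2 > 0) elig[++k] = NR }   # record covered rows
    END {
      srand(seed)
      if (n > k) n = k + 0          # cannot keep more sites than are covered
      # Partial Fisher-Yates shuffle: the first n slots become the kept rows.
      for (i = 1; i <= n; i++) {
        j = i + int(rand() * (k - i + 1))
        t = elig[i]; elig[i] = elig[j]; elig[j] = t
        keep[elig[i]] = 1
      }
      # Row count and order are unchanged; unselected rows become 0 0.
      for (r = 1; r <= NR; r++) print ((r in keep) ? line[r] : "0\t0")
    }
  ' "$3"
}

tmp=$(mktemp -d)
printf '12 3\n8 8\n0 15\n20 2\n0 0\n5 10\n' > "$tmp/mu_counts.txt"
dsample_mu 3 1 "$tmp/mu_counts.txt"
```

The output always has the same number of rows as the input, which keeps it aligned to the CpG reference.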
4.4 yame rowsub (row subsetting)
Subset CpGs (rows) from a format 3 file:
# By index list
yame rowsub -l row_ids.txt sample.cg > subset.cg
# By coordinate list and row coordinate file
yame rowsub -R cpg_nocontig.cr -L CpG_sites.txt sample.cg > subset.cg
# By binary mask
yame rowsub -m mask.cx sample.cg > masked_subset.cg
The output remains format 3.
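As a rough illustration of mask-based subsetting, the sketch below keeps the rows of an M/U text table whose mask value is 1. It assumes a plain-text mask with one 0/1 value per CpG in the same row order as the table; `rowsub_mask` is a hypothetical helper name, and the exact semantics of yame rowsub -m should be checked against its help output.

```shell
# Sketch: keep only rows whose mask value is 1.
# rowsub_mask is a hypothetical helper working on text files, not .cx/.cg files.
rowsub_mask() {   # usage: rowsub_mask mask.txt mu_table.txt
  paste "$1" "$2" | awk '$1 == 1 { printf "%d\t%d\n", $2, $3 }'
}

tmp=$(mktemp -d)
printf '1\n0\n1\n'         > "$tmp/mask.txt"        # one 0/1 value per CpG
printf '12 3\n8 8\n0 15\n' > "$tmp/mu_counts.txt"   # M U per CpG
rowsub_mask "$tmp/mask.txt" "$tmp/mu_counts.txt"
```

Unlike downsampling, subsetting drops rows, so the result is no longer aligned to the full CpG reference and needs a matching subset of the coordinate file.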
5. When to Use Format 3 (vs Other Formats)
Choose Format 3 when:
- You have bisulfite sequencing data with per-CpG counts.
- You care about coverage and want to model uncertainty (beta at low coverage vs high coverage).
- You want the option to derive beta, binary calls, or more complex statistics later.
Consider other formats when:
- You only need beta values: use Format 4 for more compact storage.
- You want only 0/1 calls: use Format 0.
- You’re working with sparse single-cell binary data with an explicit universe: consider Format 6.
6. Minimal End-to-End Example
# 1. Prepare reference CpG coordinates (once per genome/build)
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz
# 2. Align M/U calls to reference CpGs
bedtools intersect -a cpg_ref.bed.gz -b mu_calls.bed -loj -sorted \
| awk '{ if ($8 == ".") print "0\t0"; else print $8"\t"$9 }' \
| yame pack -f3 - > sample.cg
# 3. QC and summary
yame info sample.cg
yame summary sample.cg
# 4. Export beta and coverage for external tools (coverage ≥ 10)
yame unpack -f 10 sample.cg > sample_cov10.txt
This is the typical pattern for ingesting WGBS/RRBS data into YAME using Format 3.