Format 7 – CpG Reference Coordinate Files (.cr)
Format 7 provides the reference coordinate system for all .cx files in YAME.
Every .cg / .cm file, regardless of format (0–6), must match the row count and row ordering of a Format 7 reference file.
This makes Format 7 the backbone of the entire YAME infrastructure.
A .cr file contains:
- Chromosome
- Start position
- End position
- CpG ID (e.g., chr1_10469)
These records are delta-encoded and BGZF-compressed for fast access.
1. Purpose of Format 7
Format 7 answers two fundamental questions:
- Where is CpG row i located in the genome?
- What row corresponds to genomic coordinate (chr, pos)?
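Both questions can be illustrated on an unpacked reference with plain awk. This is a minimal sketch, not how YAME resolves rows internally: the helper names row_to_coord, coord_to_row, and demo_ref are hypothetical, rows are counted from 1 (a choice of this sketch, not a documented YAME convention), and a linear scan stands in for any indexed lookup.

```shell
# Hypothetical helpers; both read an unpacked reference BED on stdin.

# Row number (1-based in this sketch) -> chromosome, start, end.
row_to_coord() {
  awk -v row="$1" 'NR == row { print $1 "\t" $2 "\t" $3 }'
}

# Chromosome and 1-based position (the End column) -> row number.
coord_to_row() {
  awk -v chr="$1" -v pos="$2" '$1 == chr && $3 == pos { print NR }'
}

# Demo input: the three example rows used in this document.
demo_ref() {
  printf 'chr1\t10468\t10469\tchr1_10469\n'
  printf 'chr1\t10470\t10471\tchr1_10471\n'
  printf 'chr1\t10483\t10484\tchr1_10484\n'
}

demo_ref | row_to_coord 2            # where is row 2?
demo_ref | coord_to_row chr1 10484   # which row is chr1:10484?
```

On real data the same helpers could read the output of yame unpack cpg_reference.cr instead of demo_ref.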
All downstream YAME operations assume that every sample is aligned to the same .cr file.
Typical uses:
- Establish genome-wide ordering of CpGs
- Provide coordinate context to summary operations
- Support indexing, subsetting, windowing, and masking
- Serve as the base for feature creation (Format 2)
- Ensure consistent row alignment across multi-sample .cx files
2. What Format 7 Contains
Each row represents one CpG:
chr1 10468 10469 chr1_10469
chr1 10470 10471 chr1_10471
chr1 10483 10484 chr1_10484
...
Columns:
- Chromosome
- Start (0-based BED convention)
- End (start+1)
- Name (or ID), usually chr_pos1 (chromosome, underscore, 1-based position; in the rows above this equals End)
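The naming convention can be reproduced from the first three columns. The sketch below assumes the pattern visible in the example rows (Name = chromosome, underscore, End); the helper name add_cpg_names is hypothetical and not part of YAME.

```shell
# Hypothetical helper: append a Name column to BED rows that lack one.
add_cpg_names() {
  awk 'BEGIN { OFS = "\t" }
       NF >= 4 { print $1, $2, $3, $4; next }   # keep an existing name
       { print $1, $2, $3, $1 "_" $3 }          # derive chr_pos1 from End
  ' "$@"
}

# One row without a name, one row that already has one.
printf 'chr1\t10468\t10469\nchr1\t10470\t10471\tchr1_10471\n' | add_cpg_names
```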
Internally, YAME stores:
- Delta-compressed positions
- RLE encoding of chromosome boundaries
- A name dictionary
This makes a .cr file much smaller than the equivalent plain-text BED file.
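To see why delta compression helps, consider the gaps between consecutive CpG starts: they are small numbers even when the absolute positions are large. The sketch below only illustrates that idea on BED text; it is not the actual .cr byte layout, and the helper name delta_starts is hypothetical.

```shell
# Illustration only: print each start as the gap from the previous CpG,
# restarting at every chromosome boundary.
delta_starts() {
  awk 'BEGIN { OFS = "\t" }
       $1 != chrom { chrom = $1; prev = 0 }    # new chromosome: reset
       { print $1, $2 - prev; prev = $2 }
  ' "$@"
}

printf 'chr1\t10468\t10469\nchr1\t10470\t10471\nchr1\t10483\t10484\n' | delta_starts
# chr1    10468
# chr1    2
# chr1    13
```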
3. Packing to Format 7
3.1 From a BED file of CpG sites
yame pack -f7 cpg_coords.bed > cpg_reference.cr
Input BED must contain at least the first 3 columns; the 4th is optional but recommended.
Example input (cpg_coords.bed):
chr1 10468 10469 chr1_10469
chr1 10470 10471 chr1_10471
chr1 10483 10484 chr1_10484
If no name is supplied, YAME generates row IDs automatically.
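Before packing, it can help to sanity-check the input BED. The following is a minimal sketch under stated assumptions: the helper name check_cpg_bed is hypothetical, and the requirement that starts increase strictly within a chromosome is an assumption of this sketch (it matches the sorted input that the bedtools commands in this document rely on), not a documented YAME rule.

```shell
# Hypothetical pre-pack check: at least 3 columns, End == Start + 1,
# and starts strictly increasing within each chromosome.
# Prints one message per problem; exit status is 1 if any were found.
check_cpg_bed() {
  awk '
    NF < 3 { printf "line %d: fewer than 3 columns\n", NR; bad = 1; next }
    $3 != $2 + 1 { printf "line %d: end is not start+1\n", NR; bad = 1 }
    $1 == chrom && $2 + 0 <= last { printf "line %d: start not increasing\n", NR; bad = 1 }
    { chrom = $1; last = $2 + 0 }
    END { exit bad + 0 }
  ' "$@"
}

printf 'chr1\t10468\t10469\nchr1\t10470\t10471\n' | check_cpg_bed \
  && echo "BED looks consistent"
```

On a real file the call would be check_cpg_bed cpg_coords.bed, run before the yame pack -f7 step.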
3.2 Converting an existing .cr file to BED
yame unpack cpg_reference.cr > cpg_ref.bed
This unpacking step is foundational for most workflows:
- Feature construction (Format 2)
- BED-based intersection
- Generating masks or windows
4. Using Format 7 as the Reference Genome for .cx Data
Every .cg / .cm file must:
- Contain one row per CpG
- Follow the same ordering as the .cr file
- Use the same genome build and coordinate convention
To verify alignment:
yame info sample.cg
This prints:
- Row count
- Format ID
- Validity checks
- Compatibility with .cr (if provided)
4.1 Aligning BED-like data to the reference CpGs
This is the standard workflow:
yame unpack cpg_reference.cr | gzip > cpg_ref.bed.gz
bedtools intersect -a cpg_ref.bed.gz -b input.bed -loj -sorted \
| cut -f4 \
| yame pack -fb - > binary.cg
This ensures:
- Same ordering as .cr
- One CpG per row
- Missing positions explicitly included (value 0)
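With -loj, bedtools reports every row of -a, fills the -b columns with "." (and -1 coordinates) when nothing overlaps, and repeats an -a row once for each overlapping -b interval. The sketch below shows one way such output could be collapsed to a single 0/1 value per reference row. It assumes a four-column reference BED, so that the first -b column lands in column 5; the helper name loj_to_binary is hypothetical.

```shell
# Hypothetical helper: collapse left-outer-join output (sorted by the
# reference rows) to one line per reference CpG: 1 = overlapped, 0 = not.
loj_to_binary() {
  awk -F '\t' '
    { key = $1 FS $2 FS $3 }
    key != prevkey { if (NR > 1) print hit; hit = 0; prevkey = key }
    $5 != "." { hit = 1 }            # column 5 is "." when -b had no match
    END { if (NR > 0) print hit }
  '
}

# Demo: first CpG overlaps two intervals, second none, third one.
loj_demo() {
  printf 'chr1\t10468\t10469\tchr1_10469\tchr1\t10400\t10469\n'
  printf 'chr1\t10468\t10469\tchr1_10469\tchr1\t10460\t10470\n'
  printf 'chr1\t10470\t10471\tchr1_10471\t.\t-1\t-1\n'
  printf 'chr1\t10483\t10484\tchr1_10484\tchr1\t10480\t10490\n'
}
loj_demo | loj_to_binary
```

The output has exactly one line per reference row, which is the property the packed file depends on. If the unpacked reference has a different number of columns, the column index in the sketch needs to change accordingly.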
4.2 Aligning M/U counts or fraction values
See examples in Format 3 / Format 4 documentation:
bedtools intersect -a cpg_ref.bed.gz -b mu_counts.bed -loj -sorted \
| awk '{print $8"\t"$9}' \
| yame pack -f3 - > sample.cg
5. Integration with YAME Commands
5.1 yame rowsub
Coordinate-based row selection requires .cr:
# Extract CpGs listed by genomic coordinate
yame rowsub -R cpg_reference.cr -L CpG_sites.txt sample.cg > subset.cg
5.2 yame summary
Format 7 itself is not summarized, but masks and queries rely on the order defined by .cr.
5.3 yame chunk / yame chunkchar
When chunking .cg files:
yame chunk sample.cg chunks/
the row boundaries respect .cr ordering.
5.4 Creating features (Format 2) requires .cr
Example:
yame unpack cpg_reference.cr | gzip > cpg_ref.bed.gz
bedtools intersect -a cpg_ref.bed.gz -b chromhmm.bed -loj -sorted \
| bedtools groupby -g 1-3 -c 7 -o first \
| cut -f4 \
| yame pack -f2 - > ChromHMM.cm
6. Choosing or Building a CpG Reference
Most users rely on prebuilt .cr files, e.g.:
- hg19: CpG_nocontig CR file
- hg38: CpG_nocontig CR file
- mm10, mm39: Available via KYCGKB repositories
Custom genomes:
# Build your own CpG reference
grep -E -o 'CG' -b genome.fa \
| awk '{pos=$1; ... construct BED rows ... }' \
> cpg_coords.bed
yame pack -f7 cpg_coords.bed > cpg_reference.cr
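A byte-offset search over a FASTA file has to discount header lines and line breaks, and it does not see a CG that is split across two sequence lines. One possible way to handle both is to track the position per sequence in awk, as in the sketch below. The function name fasta_to_cpg_bed is hypothetical; the sketch assumes an uncompressed FASTA file, treats soft-masked (lowercase) bases as CpGs, and names rows with the chr_pos1 pattern from the examples above.

```shell
# Hypothetical helper: emit one BED row per CpG (the C of each CG) in a FASTA.
fasta_to_cpg_bed() {
  awk '
    /^>/ {
      # New sequence: name is the header up to the first whitespace.
      chrom = substr($1, 2); offset = 0; prev = ""
      next
    }
    {
      line = toupper($0)
      n = length(line)
      # CpG split across a line break: C ends the previous line, G starts this one.
      if (prev == "C" && substr(line, 1, 1) == "G")
        printf "%s\t%d\t%d\t%s_%d\n", chrom, offset - 1, offset, chrom, offset
      for (i = 1; i < n; i++)
        if (substr(line, i, 2) == "CG")
          printf "%s\t%d\t%d\t%s_%d\n", chrom, offset + i - 1, offset + i, chrom, offset + i
      if (n > 0) prev = substr(line, n, 1)
      offset += n      # bases consumed so far in this sequence
    }
  ' "$@"
}

# Demo: sequence ACGTCGGCG wrapped so that one CpG straddles the line break.
printf '>chrT demo\nACGTC\nGGCG\n' | fasta_to_cpg_bed
# chrT    1    2    chrT_2
# chrT    4    5    chrT_5
# chrT    7    8    chrT_8
```

On a real genome this would be run as fasta_to_cpg_bed genome.fa > cpg_coords.bed before the pack step. Rows come out in FASTA order, so the file may still need sorting if another chromosome order is expected downstream.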
7. When NOT to Use Format 7
Do NOT use Format 7 for:
- Storing methylation data → use Formats 0,1,3,4,5,6
- Storing mask or feature annotations → use Format 2
- Storing per-sample or bulk CpG calls → use Formats 0/3/4
Format 7 is strictly a reference coordinate container.
8. Minimal End-to-End Example
# 1. Unpack the reference CpGs
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz
# 2. Align your BED data to reference CpGs
bedtools intersect -a cpg_ref.bed.gz -b peaks.bed -loj -sorted \
| cut -f4 \
| yame pack -fb - > peaks.cg
# 3. Create a feature file using ChromHMM
zcat ChromHMM.bed.gz \
| bedtools intersect -a cpg_ref.bed.gz -b - -loj -sorted \
| bedtools groupby -g 1-3 -c 7 -o first \
| cut -f4 \
| yame pack -f2 - > chromhmm.cm
# 4. Summarize enrichment
yame summary -m chromhmm.cm peaks.cg > enrich.txt
Format 7 is the foundation of all YAME workflows: it defines CpG identity, ordering, and genomic position, and it keeps all .cg and .cm files compatible with one another. A correct and consistent .cr file keeps the entire methylation analysis pipeline coherent, efficient, and reproducible.