1. Storage & Format

Convert between text and compressed CX binary formats for efficient storage and analysis.

Overview

YAME’s pack and unpack commands provide bidirectional conversion between human-readable text formats (BED/TSV/etc.) and the compressed CX binary formats.

CX formats:

Dramatically reduce storage requirements (often 10–100×)
Are optimized for different methylation / feature data types
Are interoperable across all YAME commands

Most workflows follow this pattern:

1. Align your data to a CpG reference coordinate file (.cr, format 7)
2. Pack into an appropriate CX format (0/2/3/4/5/6)
3. Use other YAME commands (summary, rowsub, rowop, subset, etc.)
4. Optionally unpack to text for external tools

Format Overview

YAME currently supports the following CX format family:

Format	Code	Typical Ext	Best for
Format 0	`0` / `b`	`.cg`	Binary presence/absence (DMR sites, ChIP-seq peaks, generic 0/1 tracks)
Format 1	`1`	`.cg`	Integer values with RLE (count tracks, QC metrics, per-CpG integer signals)
Format 2	`2` / `s`	`.cm`	Chromatin states, genomic annotations, gene features, windows/bins
Format 3	`3` / `m`	`.cg`	M/U read counts from bisulfite sequencing
Format 4	`4`	`.cg`	Continuous methylation values (beta/fraction), array/WGBS imputed values
Format 6	`6`	`.cx`	Set + universe representation for enrichment; sparse single-cell methylation
Format 7	`7`	`.cr`	CpG genomic reference coordinates (required for all `.cg` / `.cm` files)

All CX formats use BGZF compression and share a consistent internal structure (cdata_t blocks), enabling uniform handling across different data types.

Pack / Unpack Basics

The core commands are:

# Pack text → CX
yame pack -f<format> [options] <input.txt> [output.cx]
# Read from stdin:
cat data.txt | yame pack -fb - data.cg
# Unpack CX → text
yame unpack [options] <input.cx>

⚠️ Your input MUST match reference CpG coordinates exactly:

Same number of rows as reference CpGs
Same order as reference CpGs
One value per CpG in reference

How to ensure alignment:

# Step 1: Get reference coordinates
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz
# Step 2: Intersect your BED file with reference
bedtools intersect -a cpg_ref.bed.gz -b your_data.bed -sorted -c | \
  cut -f4 | \
  yame pack -fb - > aligned_output.cg
# The two outputs below should report the same dimensions
yame info cpg_nocontig.cr
yame info aligned_output.cg
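
One caveat with `-c`: it reports how many intervals overlap each CpG, so the value can exceed 1 when intervals in your BED file overlap each other. A minimal sketch, assuming format 0 input should be strictly 0/1 (the helper name `counts_to_binary` is hypothetical and not part of YAME), collapses counts to presence/absence before packing:

```shell
# Hypothetical helper: collapse per-CpG overlap counts to 0/1.
# Reads one count per line on stdin, writes one 0/1 value per line.
counts_to_binary() {
  awk '{ print ($1 > 0 ? 1 : 0) }'
}

# Example with made-up counts for four CpGs
printf '0\n2\n1\n0\n' | counts_to_binary
```

In the alignment pipeline above, it would sit between `cut -f4` and `yame pack -fb -`.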

A more complete workflow (including reference coordinate alignment) is described in each format’s page.

Practical Workflows

Workflow 1: Process Bisulfite Sequencing Data

Complete pipeline from bisulfite sequencing output to analysis:

# Assuming you have M and U counts from your pipeline
# Format: chr start end M U

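# 1. Extract M and U, align with reference
# Assuming a 3-column reference (chr start end): with -loj, M and U from the
# calls file land in columns 7 and 8, and "." in column 7 means no overlap
bedtools intersect -a cpg_ref.bed.gz -b methylation_calls.bed -loj -sorted | \
  awk '{if ($7==".") print "0\t0"; else print $7"\t"$8}' | \
  yame pack -f3 - > sample.cg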

# 2. Verify data quality
yame info sample.cg
yame summary sample.cg

# 3. Perform enrichment analysis
yame summary -m ChromHMM.cm sample.cg > chromatin_enrichment.txt
yame summary -m genes.cm sample.cg > gene_enrichment.txt

# 4. If needed, unpack for external tools
yame unpack -f 5 sample.cg > sample_cov5.txt
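
The `awk` column numbers in step 1 depend on how many columns the reference BED has. With `-loj`, bedtools prints all columns of the `-a` file followed by all columns of the `-b` file, and fills the `-b` side with `.` (and `-1` for coordinates) when nothing overlaps. A small sketch, assuming the calls file is `chr start end M U` and using a hypothetical helper `extract_mu` that takes the reference column count as its argument:

```shell
# Hypothetical helper: pull M and U out of `bedtools intersect -loj` output.
# $1 = number of columns in the reference BED (the -a file).
# The -b file is assumed to be: chr start end M U.
extract_mu() {
  awk -F'\t' -v nref="$1" '{
    m = $(nref + 4); u = $(nref + 5)
    if (m == ".") print "0\t0"     # CpG with no call: record as no coverage
    else          print m "\t" u
  }'
}

# Two made-up -loj rows for a 3-column reference: one covered CpG, one not
printf 'chr1\t100\t102\tchr1\t100\t102\t8\t2\nchr1\t200\t202\t.\t-1\t-1\t.\t.\n' |
  extract_mu 3
```

Running `zcat cpg_ref.bed.gz | head -1` shows how many columns your reference actually has.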

Workflow 2: Create Multi-Feature Database

Build a comprehensive feature database:

#!/bin/bash
# Create comprehensive feature database for hg38

# 1. Prepare reference
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz

# 2. List the feature files to process (assumed to be downloaded already)
declare -A features=(
  ["ChromHMM_15"]="ChromHMM_15state.bed.gz"
  ["CpG_Islands"]="cpgIslandExt.bed.gz"
  ["Promoters"]="promoters_2kb.bed.gz"
  ["Enhancers"]="enhancers_merged.bed.gz"
  ["TFBS"]="tfbs_combined.bed.gz"
)

# 3. Create individual feature files
for name in "${!features[@]}"; do
  file="${features[$name]}"
  echo "Processing $name..."
  
  zcat "$file" | bedtools sort | \
    bedtools intersect -a cpg_ref.bed.gz -b - -loj -sorted | \
    bedtools groupby -g 1-3 -c 7 -o first | \
    cut -f4 | \
    yame pack -f2 - > "${name}.cm"
done

echo "Feature files created:"
ls -lh *.cm

Workflow 3: Convert Array Data to CX Format

Convert Illumina array data:

#!/bin/bash
# Convert 450k/EPIC array data to CX format

# Assuming you have: array_data.txt with columns: ProbeID, Beta

# 1. Get array manifest with probe coordinates
# manifest.bed format: chr start end ProbeID

# 2. Intersect with reference CpGs
bedtools intersect -a cpg_ref.bed.gz -b manifest.bed -loj -sorted > probe_cpg_map.bed

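# 3. Map betas to CpG positions
# Assuming a 3-column reference: after -loj, columns 1-3 are the reference CpG,
# columns 4-6 the probe interval, and column 7 the ProbeID ("." if no probe).
# -a 1 keeps CpGs without a probe; -e NA (with -o) fills their missing beta.
# The final sort assumes cpg_ref.bed.gz is ordered by chr, then numeric start.
join -1 7 -2 1 -t$'\t' -a 1 -e NA -o 1.1,1.2,1.3,2.2 \
  <(sort -t$'\t' -k7,7 probe_cpg_map.bed) \
  <(sort -t$'\t' -k1,1 array_data.txt) | \
  sort -k1,1 -k2,2n | \
  cut -f4 | \
  yame pack -f4 - > array_sample.cg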

echo "Array data converted to CX format"
yame info array_sample.cg
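
By default `join` drops unmatched lines, which would remove every CpG without a probe and break row alignment with the reference. The sketch below illustrates, on made-up data, how `-a 1` keeps those CpGs and `-e NA` fills in the missing beta. The helper name `map_betas`, the probe IDs, and the 7-column map layout (3-column reference plus 4-column manifest) are assumptions for illustration.

```shell
# Hypothetical helper: print one beta (or NA) per row of the map, in chr/start order.
# $1 = map file from `bedtools intersect -loj` (ProbeID assumed in column 7, "." if none)
# $2 = array values file (ProbeID, Beta), tab-separated
map_betas() {
  tab=$(printf '\t')
  tmp=$(mktemp -d)
  # join needs both inputs sorted on the join field, in the same collation
  LC_ALL=C sort -t "$tab" -k7,7 "$1" > "$tmp/map.sorted"
  LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmp/betas.sorted"
  # -a 1 keeps unmatched map rows; -e NA fills the empty beta; -o picks output fields
  LC_ALL=C join -t "$tab" -1 7 -2 1 -a 1 -e NA -o 1.1,1.2,1.3,2.2 \
    "$tmp/map.sorted" "$tmp/betas.sorted" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |   # restore chr/start order
    cut -f4
  rm -rf "$tmp"
}

# Made-up example: three CpGs, the middle one not covered by any probe
demo=$(mktemp -d)
printf 'chr1\t100\t102\tchr1\t100\t102\tcg0002\nchr1\t200\t202\t.\t-1\t-1\t.\nchr1\t300\t302\tchr1\t300\t302\tcg0001\n' > "$demo/probe_cpg_map.bed"
printf 'cg0001\t0.91\ncg0002\t0.12\n' > "$demo/array_data.txt"
map_betas "$demo/probe_cpg_map.bed" "$demo/array_data.txt"
```

If several probes overlap the same CpG, that CpG appears more than once in the map and needs collapsing (for example with `bedtools groupby`, as in Workflow 2) before packing.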

Best Practices

Always align with reference first

bedtools intersect -a cpg_ref.bed.gz -b your_data.bed -sorted -c

Verify dimensions match

# Count reference CpGs
zcat cpg_ref.bed.gz | wc -l
   
# Count your data rows
wc -l your_data.txt
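
The two counts can also be compared automatically. A minimal sketch, assuming a gzipped reference BED and a one-value-per-line text file; `check_row_count` is a hypothetical helper, not a YAME command:

```shell
# Hypothetical helper: succeed only if the reference and the data file
# have the same number of lines.
# $1 = reference BED (.gz), $2 = one-value-per-line text file
check_row_count() {
  # $(( ... )) strips the padding some wc implementations add
  n_ref=$(( $(gzip -dc "$1" | wc -l) ))
  n_dat=$(( $(wc -l < "$2") ))
  if [ "$n_ref" -eq "$n_dat" ]; then
    echo "OK: $n_ref rows"
  else
    echo "MISMATCH: reference=$n_ref data=$n_dat" >&2
    return 1
  fi
}

# Usage (not run here): pack only when the counts agree
# check_row_count cpg_ref.bed.gz your_data.txt && yame pack -fb your_data.txt your_data.cg
```

Matching counts are necessary but not sufficient: the rows must also be in reference order, which the intersect step above is meant to guarantee.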

Handle missing values properly
- Format 3: Use M=0, U=0 for no coverage
- Format 4: Use `NA` for missing values
- Format 2: Use `.` for unassigned CpGs
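
A quick check before packing can catch stray missing-value markers, such as a `.` left in a format 4 column. The sketch below uses a hypothetical helper `check_beta_column` and assumes format 4 input is one value per line that is either `NA` or a number between 0 and 1. That range is inferred from the beta/fraction description above and is not a documented YAME constraint.

```shell
# Hypothetical helper: validate a one-value-per-line beta column on stdin.
# Accepts NA or a non-negative number <= 1; reports anything else and exits 1.
check_beta_column() {
  awk '
    $0 == "NA" { next }                                    # accepted missing marker
    ($0 ~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) && ($0 + 0 <= 1) { next }
    { printf "line %d: unexpected value \"%s\"\n", NR, $0; bad = 1 }
    END { exit bad }
  '
}

# Made-up example: the third line uses "." instead of NA
printf '0.25\nNA\n.\n' | check_beta_column || echo "fix the input before packing"
```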