Format 1 – ASCII Values with RLE (Run-Length Encoding)
Format 1 stores single-byte ASCII values for each CpG, compressed using run-length encoding (RLE).
It is the most flexible CX format for encoding arbitrary categorical or numeric symbols that can be represented as one ASCII character (0–255).
Examples of valid values:
- ASCII digits:
0,1,2, … - ASCII letters:
a,b,A,Z - Punctuation or symbolic codes:
.,*,+,? - Encoded or discretized values where each state is a single-byte token
This format is appropriate whenever the data is:
- Discrete, but not limited to binary (Format 0)
- Symbolic, but not needing full string labels (Format 2)
- Piecewise constant, allowing large compression via RLE
1. When to Use Format 1
Use Format 1 when your per-CpG track:
- Represents one ASCII symbol per CpG
- Uses small alphabets, leading to long repeated runs
(e.g., most values are'0','1','2', or'.') - Must remain symbolic rather than numeric
- Must be efficiently stored and accessed
- Does NOT require:
- floating-point values (use Format 4)
- multi-byte integers (Format 1 only stores 1 byte)
- arbitrary text labels (use Format 2)
- pairs of values (use Format 3)
Typical applications:
- Discretized methylation states (
0,1,2,3) - Coverage tiers (
A,B,C,D) - Peak strength encoded as ASCII values
- Custom one-character categorical encodings
- Low-memory symbolic genome segmentation
2. Input Requirements
Format 1 accepts any single ASCII character per line.
Valid examples:
0
1
2
a
b
A
.
*
?
Multi-character strings like "10" or "state1" are not allowed — use Format 2 if you need labels longer than one byte.
Missing values:
- There is no NA in Format 1
- You may encode missing as
'.'or'0', depending on your design
3. Packing to Format 1
3.1 Pack from a simple ASCII list
yame pack -f1 ascii_values.txt > ascii_track.cg
Where ascii_values.txt contains one character per line.
3.2 Convert numeric or discrete BED values to ASCII
Example: discretizing peak intensities into ASCII symbols:
bedtools intersect -a cpg_ref.bed.gz -b peaks.bed -loj -sorted \
| awk '{ if ($8==".") print "."; else if ($8<5) print "a"; else print "b"; }' \
| yame pack -f1 - > peak_states.cg
3.3 Overlap count → ASCII (integer → ASCII conversion)
bedtools intersect -a cpg_ref.bed.gz -b regions.bed -sorted -c \
| cut -f4 \
| awk '{ printf("%c\n", $1); }' \
| yame pack -f1 - > counted.cg
Any number 0–255 may be converted to ASCII using printf("%c").
4. Unpacking Format 1
To recover ASCII characters:
yame unpack ascii_track.cg | head
Output is exactly one character per line:
0
0
0
1
1
.
.
a
a
2
It is lossless, even for non-printable bytes.
5. Integration with Other YAME Commands
5.1 yame summary
yame summary -m feature.cm ascii_track.cg
Interprets ASCII data as integer values 0–255 internally when computing summary statistics:
Beta= mean of ASCII byte valuesN_query= number of CpGs where value ≠ 0Log2OddsRatiotreats nonzero values as “hits”
If your encoding uses characters where '0' is not the null value, you may want to preprocess.
5.2 yame rowsub
Supports all selection modes:
yame rowsub -m promoters.cm ascii_track.cg > promoters_ascii.cg
yame rowsub -R cpg.cr -L CpGs.txt ascii_track.cg > subset.cg
5.3 yame dsample
Downsampling uses the rule:
- Non-zero ASCII values = eligible
- Retain N
- Others become
'0'(ASCII zero)
Example:
yame dsample -N 50000 ascii_track.cg > ascii_50k.cg
5.4 yame mask
Masking replaces values with '0':
yame mask ascii_track.cg mask.cm -o masked.cg
6. When NOT to Use Format 1
Avoid Format 1 if:
| Requirement | Use Instead |
|---|---|
| Values longer than 1 byte | Format 2 |
| Floating-point, NA support | Format 4 |
| M/U count pairs | Format 3 |
| Sparse universe + binary | Format 6 |
| Only 0/1 values | Format 0 |
Format 1 is best for compact single-byte symbolic encodings.
7. Minimal End-to-End Example
# 1. Prepare reference CpGs
yame unpack cpg_nocontig.cr | gzip > cpg_ref.bed.gz
# 2. Encode peak intensity categories as ASCII
bedtools intersect -a cpg_ref.bed.gz -b peaks.bed -loj -sorted \
| awk '{ if ($8==".") print "."; else if ($8<5) print "a"; else print "b"; }' \
| yame pack -f1 - > peaks_ascii.cg
# 3. Summarize across annotations
yame summary -m ChromHMM.cm peaks_ascii.cg > peak_state_enrichment.txt
# 4. Extract promoter subset
yame rowsub -m promoters.cm peaks_ascii.cg > promoter_ascii.cg
# 5. Downsample symbolic data
yame dsample -N 10000 peaks_ascii.cg > peaks_10k.cg
Format 1 provides an efficient, flexible container for single-byte, symbolic per-CpG encodings, bridging the gap between simple binary formats and complex state-based annotations.