7. Downsampling & Masking Methylation Sites
YAME provides tools to control sparsity and apply masks to methylation data:
- yame dsample: randomly downsample sites to simulate lower coverage or sparsity.
- yame mask: apply a binary mask to zero out sites, or convert a binary format into the contextualized format 6.
These functions are especially useful for benchmarking methods at different sparsity levels, building controlled simulation datasets, or restricting analyses to a specific universe of CpGs.
7.1 Random Downsampling with yame dsample
yame dsample randomly keeps a fixed number of non-NA sites per sample and masks out the rest.
Supported input formats:
- Format 3 (M/U counts)
- Format 6 (binary with universe bit; often used for single-cell sparse data)
Basic usage:
yame dsample -N 10000 -s 1 input.cg > downsampled.cg
This keeps 10,000 covered sites per sample (or all if fewer are available), using seed 1 for reproducibility.
7.1.1 What dsample Does (Format 3 vs Format 6)
- Format 3 (.cg, M/U counts)
  - Eligible sites are those with M+U > 0.
  - dsample randomly selects N such sites.
  - Selected sites keep their original M and U.
  - Non-selected sites are masked by setting M=U=0 (treated as missing/NA).
- Format 6 (universe-bit binary)
  - Eligible sites are those in the universe (FMT6_IN_UNI).
  - dsample randomly selects N universe positions.
  - Selected sites remain unchanged.
  - Non-selected sites have their universe bit cleared (FMT6_SET_NA), effectively dropping them from the analyzable universe.
In both cases, if N is larger than the number of eligible sites, all eligible sites are kept (no error).
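The selection rules above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not YAME's implementation: the function names `dsample_fmt3` and `dsample_fmt6` are hypothetical, sites are held in plain Python lists, and Python's random generator will not reproduce the exact sites YAME picks for a given seed.

```python
import random

def dsample_fmt3(counts, n, seed):
    """Toy model of format 3 downsampling on a list of (M, U) pairs."""
    rng = random.Random(seed)
    # Eligible sites are those with any coverage (M + U > 0).
    eligible = [i for i, (m, u) in enumerate(counts) if m + u > 0]
    # If n exceeds the number of eligible sites, keep them all.
    keep = set(rng.sample(eligible, min(n, len(eligible))))
    # Kept sites retain their counts; every other site becomes M = U = 0.
    return [mu if i in keep else (0, 0) for i, mu in enumerate(counts)]

def dsample_fmt6(in_universe, n, seed):
    """Toy model of format 6 downsampling on a list of universe flags."""
    rng = random.Random(seed)
    eligible = [i for i, flag in enumerate(in_universe) if flag]
    keep = set(rng.sample(eligible, min(n, len(eligible))))
    # Non-selected positions leave the universe; methylation values are not modeled.
    return [i in keep for i in range(len(in_universe))]

counts = [(3, 1), (0, 0), (0, 2), (5, 5), (1, 0)]
sampled = dsample_fmt3(counts, n=2, seed=1)
print(sampled)
```

With a fixed seed the result is the same on every run, which mirrors the purpose of -s; asking for more sites than are eligible returns the input unchanged.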
7.1.2 Key Options
yame dsample [options] <in.cx> [out.cx]
Options:
- -N [int]: number of eligible sites to keep per sample (default: 100).
- -s [int]: random seed (default: current time). Use a fixed seed for reproducible downsampling.
- -r [int]: number of independent replicates per sample (default: 1).
  - Each replicate is downsampled separately from the same input sample.
- -h: show help.
Output destination:
- If out.cx (positional) or -o is provided: write to that file and also write an index.
- If no output is given: write to stdout (no index).
7.1.3 Replicates and Index Naming
When -r is greater than 1, dsample creates multiple downsampled versions of each input sample.
- If the input .cx has an index:
  - The original sample names are used as a base.
  - Replicates are suffixed: SampleA-0, SampleA-1, …, SampleA-(r-1).
- If no input index exists:
  - Samples are named 0, 1, 2, … internally.
  - Replicates follow the same base-rep naming pattern.
Example: create 5 replicates with 50k sites each:
yame dsample -N 50000 -r 5 input.cg downsampled.cg
The resulting index in downsampled.cg.idx will contain entries like:
Sample1-0
Sample1-1
...
Sample1-4
Sample2-0
...
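The naming pattern can be modeled with a small helper, which may be handy when predicting index entries before running a large job. The function `replicate_names` is a hypothetical illustration and does not read or write YAME index files.

```python
def replicate_names(sample_names, n_samples, r):
    """Return index entries following the base-rep pattern.

    sample_names: list of names from the input index, or None if there is no index.
    n_samples:    number of samples, used only when sample_names is None.
    r:            number of replicates per sample (the value given to -r).
    """
    # Without an input index, samples fall back to the names 0, 1, 2, ...
    if sample_names is None:
        bases = [str(i) for i in range(n_samples)]
    else:
        bases = list(sample_names)
    # Replicates of one sample are listed together, numbered from 0 to r-1.
    return [f"{base}-{rep}" for base in bases for rep in range(r)]

for name in replicate_names(["Sample1", "Sample2"], 2, 5):
    print(name)
```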
7.1.4 Typical Use Cases
- Benchmarking methods at different sparsity levels: run the same pipeline on N = 1e5, N = 5e4, N = 1e4 to see robustness to coverage.
- Generating multiple randomized sparsity replicates: for each sample, simulate multiple downsampling replicates with different seeds or with -r.
- Single-cell simulations with format 6: use dsample to progressively restrict the universe of accessible CpGs and observe performance changes.
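For the benchmarking case, one option is to generate the commands for every sparsity level from a script. The sketch below only builds and prints command lines using the documented -N and -s options and the positional input and output arguments; it runs nothing. The helper name `dsample_commands` and the output file naming scheme are assumptions made for illustration.

```python
import shlex

def dsample_commands(input_path, levels, seed=1):
    """Build one `yame dsample` argument list per sparsity level (nothing is executed)."""
    commands = []
    for n in levels:
        # Hypothetical naming scheme: one output file per value of N.
        out_path = f"downsampled_N{n}.cg"
        commands.append(
            ["yame", "dsample", "-N", str(n), "-s", str(seed), input_path, out_path]
        )
    return commands

for argv in dsample_commands("input.cg", [100000, 50000, 10000]):
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    print(shlex.join(argv))
```

Each argument list could be passed to `subprocess.run` once the printed commands look right.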
For more help with dsample, run:
yame dsample -h
or see the dsample help page.
7.2 Masking and Contextualization with yame mask
yame mask applies a row-wise mask to a .cx file and optionally converts binary data into format 6 for contextualized single-cell usage.
Basic usage:
yame mask input.cg mask.cx -o masked.cg
Here:
- input.cg: query methylation file (format 0, 1, or 3).
- mask.cx: mask file (format 0, 1, or 3; internally converted to a binary format 0).
- Output: masked.cg, in which only the unmasked positions retain their values.
7.2.1 Supported Inputs and Mask Semantics
- Mask file (mask.cx):
  - Can be format 0, 1, or 3.
  - Format 1 and 3 masks are converted to format 0:
    - For format 1: 1 is treated as masked, 0 unmasked.
    - For format 3: sites with M+U > 0 become 1 (masked), zeros become 0.
- Query file (input.cg):
  - Format 3: M/U counts.
  - Format 0/1: binary.
The mask must have the same row length as the query; otherwise the command will abort with an error.
By default, bits that are 1 in the mask are masked out (removed).
If you set -v, the mask is inverted, so bits that are 0 in the original mask become masked.
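The mask preparation steps (conversion of a format 3 mask to bits, the row-length check, and -v inversion) can be illustrated as follows. This is a minimal sketch on Python lists; `mask_bits_from_fmt3` and `effective_mask` are hypothetical names, and the error raised here only stands in for the abort described above.

```python
def mask_bits_from_fmt3(counts):
    """Format 3 mask -> binary mask: covered sites (M + U > 0) become 1."""
    return [1 if m + u > 0 else 0 for m, u in counts]

def effective_mask(mask_bits, n_rows, invert=False):
    """Check the row length and optionally invert the mask, as with -v."""
    if len(mask_bits) != n_rows:
        raise ValueError("mask and query must have the same number of rows")
    # Inversion swaps masked (1) and unmasked (0) positions.
    return [1 - b for b in mask_bits] if invert else list(mask_bits)

bits = mask_bits_from_fmt3([(2, 0), (0, 0), (0, 4)])
print(bits)                                   # mask bits derived from coverage
print(effective_mask(bits, 3, invert=True))   # the same mask after inversion
```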
7.2.2 Operations Without Contextualization (default)
yame mask input.cg mask.cx -o masked.cg
- If the query is format 3:
  - For every site where the mask bit is 1, set M=U=0.
  - Effect: those sites are treated as missing.
- If the query is format 0:
  - Perform a binary AND with the complement of the mask (c &= ~mask).
  - Effect: all 1s in the mask are forced to 0 in the query.
This is handy for:
- Removing blacklist CpGs from an existing .cg.
- Removing low-quality or low-coverage sites.
- Restricting analysis to a curated panel of CpGs.
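The two default masking operations can be modeled per site as shown below. This is a sketch of the described behavior rather than YAME's code: `mask_fmt3` and `mask_fmt0` are hypothetical names, and data are stored one value per list element instead of packed bits.

```python
def mask_fmt3(counts, mask_bits):
    """Format 3 query: set M = U = 0 wherever the mask bit is 1."""
    return [(0, 0) if b else mu for mu, b in zip(counts, mask_bits)]

def mask_fmt0(query_bits, mask_bits):
    """Format 0 query: per-site equivalent of c &= ~mask on 0/1 values."""
    # (1 - b) is the complement of a single 0/1 mask bit.
    return [q & (1 - b) for q, b in zip(query_bits, mask_bits)]

mask = [1, 0, 0, 1]
print(mask_fmt3([(4, 1), (0, 3), (2, 2), (7, 0)], mask))
print(mask_fmt0([1, 1, 0, 1], mask))
```

In both cases the output has the same number of rows as the input; masked sites are zeroed in place.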
7.2.3 Contextualizing to Format 6 (-c)
With -c, yame mask turns a binary query plus a mask into a format 6 object:
yame mask -c input_binary.cx mask.cx -o contextualized.cx
Behavior:
- The mask defines the universe (sites that exist in the cell).
- The binary values in the query define whether each universe site is methylated (1) or unmethylated (0):
  - If mask[i] = 1:
    - If input[i] = 1 ➜ set format 6 as methylated (FMT6_SET1).
    - If input[i] = 0 ➜ set format 6 as unmethylated (FMT6_SET0).
  - If mask[i] = 0:
    - Site is outside the universe (no entry in the resulting format 6 vector).
- With -v, the universe is effectively the complement of the mask (invert mask before contextualizing).
This is useful for:
- Defining cell- or experiment-specific universes while retaining 0/1 methylation calls.
- Converting feature masks into sparse, contextualized single-cell objects.
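The contextualization rule can be written as a small per-site function. This is an illustrative model only: `contextualize` is a hypothetical name, and `None` stands in for a site outside the universe rather than reproducing the actual format 6 bit encoding.

```python
def contextualize(query_bits, mask_bits, invert=False):
    """Return per-site states: 1 (methylated), 0 (unmethylated), None (outside universe)."""
    if len(query_bits) != len(mask_bits):
        raise ValueError("mask and query must have the same number of rows")
    states = []
    for q, b in zip(query_bits, mask_bits):
        # By default the universe is where the mask is 1; with invert (-v), where it is 0.
        in_universe = (b == 0) if invert else (b == 1)
        states.append(q if in_universe else None)
    return states

query = [1, 0, 1, 0]
universe = [1, 1, 0, 0]
print(contextualize(query, universe))
print(contextualize(query, universe, invert=True))
```

The key point is that an unmethylated call (0) inside the universe stays distinguishable from a site that was never in the universe.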
7.2.4 Command Summary
yame mask [options] <in.cx> <mask.cx>
Options:
- -o [PATH]: output .cx file name. If missing, write to stdout (no index).
- -c: contextualize to format 6 using 1s in mask as the universe.
- -v: invert the mask (mask 0s instead of 1s).
- -h: help.
Example workflows:
Mask out low-quality sites:
yame mask highcov.cg lowqual_mask.cx -o highcov_masked.cg
Restrict to a predefined universe and create fmt6:
yame mask -c raw_binary.cx universe_mask.cx -o cell_fmt6.cx
For more help with mask, run:
yame mask -h
or see the mask help page.