7. Downsampling, Masking & Noise Injection
YAME provides tools to control sparsity, apply masks, and inject noise into methylation data:
- yame dsample: randomly downsample sites to simulate lower coverage or sparsity.
- yame mask: apply a binary mask to zero out sites, or convert binary data into the contextualized format 6.
- yame perturb: randomly flip methylation bits to inject noise for benchmarking and sensitivity testing.
These functions are especially useful for benchmarking methods at different sparsity levels, building controlled simulation datasets, or restricting analyses to a specific universe of CpGs.
7.1 Random Downsampling with yame dsample
yame dsample randomly keeps a fixed number of non-NA sites per sample and masks out the rest.
Supported input formats:
- Format 3 (M/U counts)
- Format 6 (binary with universe bit; often used for single-cell sparse data)
Basic usage:
yame dsample -N 10000 -s 1 input.cg > downsampled.cg
This keeps 10,000 covered sites per sample (or all if fewer are available), using seed 1 for reproducibility.
7.1.1 What dsample Does (Format 3 vs Format 6)
- Format 3 (.cg, M/U counts)
  - Eligible sites are those with M+U > 0.
  - dsample randomly selects N such sites.
  - Selected sites keep their original M and U.
  - Non-selected sites are masked by setting M=U=0 (treated as missing/NA).
- Format 6 (universe-bit binary)
  - Eligible sites are those in the universe (FMT6_IN_UNI).
  - dsample randomly selects N universe positions.
  - Selected sites remain unchanged.
  - Non-selected sites have their universe bit cleared (FMT6_SET_NA), effectively dropping them from the analyzable universe.
In both cases, if N is larger than the number of eligible sites, all eligible sites are kept (no error).
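The selection rule described above can be sketched in a few lines of Python. This is an illustration of the documented semantics only, not yame's actual implementation: the function names, the list-based data layout, and the use of Python's random module are all assumptions made for the example.

```python
import random

def dsample_fmt3(m, u, n, seed):
    """Keep at most n covered sites (M+U > 0); set M=U=0 elsewhere."""
    rng = random.Random(seed)
    eligible = [i for i in range(len(m)) if m[i] + u[i] > 0]
    # If n exceeds the number of eligible sites, keep them all (no error).
    keep = set(eligible if n >= len(eligible) else rng.sample(eligible, n))
    m_out = [m[i] if i in keep else 0 for i in range(len(m))]
    u_out = [u[i] if i in keep else 0 for i in range(len(u))]
    return m_out, u_out

def dsample_fmt6(in_universe, values, n, seed):
    """Keep at most n in-universe sites; clear the universe flag elsewhere."""
    rng = random.Random(seed)
    eligible = [i for i, flag in enumerate(in_universe) if flag]
    keep = set(eligible if n >= len(eligible) else rng.sample(eligible, n))
    # Methylation values are untouched; only the universe flag changes.
    return [i in keep for i in range(len(in_universe))], list(values)

m = [3, 0, 1, 0, 2, 5]
u = [1, 0, 0, 0, 4, 0]
m2, u2 = dsample_fmt3(m, u, n=2, seed=1)
covered = sum(1 for a, b in zip(m2, u2) if a + b > 0)
print(covered)  # 2 of the 4 covered sites remain
```

Which two sites survive depends on the seed and on the random number generator, so the sketch will not reproduce yame's exact selections for the same seed.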
7.1.2 Key Options
yame dsample [options] <in.cx> [out.cx]
Options:
- -N [int]: number of eligible sites to keep per sample (default: 100).
- -s [int]: random seed (default: current time). Use a fixed seed for reproducible downsampling.
- -r [int]: number of independent replicates per sample (default: 1).
  - Each replicate is downsampled separately from the same input sample.
- -h: show help.
Output destination:
- If out.cx (positional) or -o is provided: write to that file and also write an index.
- If no output is given: write to stdout (no index).
7.1.3 Replicates and Index Naming
When -r is greater than 1, dsample creates multiple downsampled versions of each input sample.
- If the input .cx has an index:
  - The original sample names are used as a base.
  - Replicates are suffixed: SampleA-0, SampleA-1, …, SampleA-(r-1).
- If no input index exists:
  - Samples are named 0, 1, 2, … internally.
  - Replicates follow the same base-rep naming pattern.
Example: create 5 replicates with 50k sites each:
yame dsample -N 50000 -r 5 input.cg downsampled.cg
The resulting index in downsampled.cg.idx will contain entries like:
Sample1-0
Sample1-1
...
Sample1-4
Sample2-0
...
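As a small illustration of the naming pattern, the following Python sketch generates the replicate names in the order shown in the index listing. The helper name replicate_names and its arguments are hypothetical; it only mirrors the base-rep convention described above and does not read or write real index files.

```python
def replicate_names(sample_names, n_samples, reps):
    """Return replicate names, sample by sample, as base-0 ... base-(reps-1)."""
    # Without an input index, samples fall back to "0", "1", "2", ...
    bases = sample_names if sample_names else [str(i) for i in range(n_samples)]
    return [f"{base}-{rep}" for base in bases for rep in range(reps)]

print(replicate_names(["Sample1", "Sample2"], 2, 3))
# ['Sample1-0', 'Sample1-1', 'Sample1-2', 'Sample2-0', 'Sample2-1', 'Sample2-2']
print(replicate_names(None, 2, 2))
# ['0-0', '0-1', '1-0', '1-1']
```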
7.1.4 Typical Use Cases
- Benchmarking methods at different sparsity levels. Run the same pipeline on N = 1e5, N = 5e4, N = 1e4 to see robustness to coverage.
- Generating multiple randomized sparsity replicates. For each sample, simulate multiple downsampling replicates with different seeds or with -r.
- Single-cell simulations with format 6. Use dsample to progressively restrict the universe of accessible CpGs and observe performance changes.
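A sparsity sweep like the first use case can be scripted by generating one dsample command per target N. The Python sketch below only builds and prints the command lines, using the -N and -s options and the positional output argument documented above. The input name input.cg and the output naming scheme are hypothetical placeholders.

```python
import shlex

def dsample_commands(in_path, n_values, seed):
    """Build one 'yame dsample' command (as an argv list) per sparsity level."""
    commands = []
    for n in n_values:
        out_path = f"downsampled_N{n}.cg"  # hypothetical output naming scheme
        commands.append(
            ["yame", "dsample", "-N", str(n), "-s", str(seed), in_path, out_path]
        )
    return commands

for argv in dsample_commands("input.cg", [100000, 50000, 10000], seed=1):
    print(shlex.join(argv))
```

Each printed line can be run in a shell, or each argv list passed to subprocess.run, provided yame is installed and on the PATH.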
For more help with dsample, run:
yame dsample -h
7.2 Masking and Contextualization with yame mask
yame mask applies a row-wise mask to a .cx file and optionally converts binary data into format 6 for contextualized single-cell usage.
Basic usage:
yame mask input.cg mask.cx -o masked.cg
Here:
- input.cg: query methylation file (format 0, 1, or 3).
- mask.cx: mask file (format 0, 1, or 3; internally converted to binary format 0).
- Output masked.cg has the same sites as the input, with masked positions zeroed out (see 7.2.2).
7.2.1 Supported Inputs and Mask Semantics
- Mask file (mask.cx):
  - Can be format 0, 1, or 3.
  - Format 1 and 3 masks are converted to format 0:
    - For format 1: 1 is treated as masked, 0 unmasked.
    - For format 3: sites with M+U > 0 become 1 (masked), zeros become 0.
- Query file (input.cg):
  - Format 3: M/U counts.
  - Format 0/1: binary.
The mask must have the same row length as the query; otherwise the command will abort with an error.
By default, sites where the mask bit is 1 are masked out (zeroed in the query).
If you set -v, the mask is inverted, so bits that are 0 in the original mask become masked.
7.2.2 Operations Without Contextualization (default)
yame mask input.cg mask.cx -o masked.cg
- If the query is format 3:
  - For every site where the mask bit is 1, set M=U=0.
  - Effect: those sites are treated as missing.
- If the query is format 0:
  - Perform a binary AND with the complement of the mask (c &= ~mask).
  - Effect: all 1s in the mask are forced to 0 in the query.
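The two masking rules, along with the format 3 mask conversion and the -v inversion from 7.2.1, can be sketched in Python as follows. The function names and the one-value-per-site lists are assumptions made for illustration; yame itself operates on its own packed formats.

```python
def binarize_fmt3_mask(m, u):
    """A format 3 mask marks a site as masked (1) when M+U > 0."""
    return [1 if a + b > 0 else 0 for a, b in zip(m, u)]

def _mask_bits(mask, length, invert):
    if len(mask) != length:
        raise ValueError("mask and query must have the same length")
    # Inverting (as with -v) masks the sites where the original mask is 0.
    return [1 - b for b in mask] if invert else list(mask)

def apply_mask_fmt3(m, u, mask, invert=False):
    """Set M=U=0 wherever the (possibly inverted) mask bit is 1."""
    bits = _mask_bits(mask, len(m), invert)
    m_out = [0 if b else a for a, b in zip(m, bits)]
    u_out = [0 if b else a for a, b in zip(u, bits)]
    return m_out, u_out

def apply_mask_fmt0(query, mask, invert=False):
    """Per-site equivalent of c &= ~mask."""
    bits = _mask_bits(mask, len(query), invert)
    return [q & (1 - b) for q, b in zip(query, bits)]

print(apply_mask_fmt3([4, 2, 0, 7], [1, 0, 3, 2], [0, 1, 0, 1]))
# ([4, 0, 0, 0], [1, 0, 3, 0])
print(apply_mask_fmt0([1, 1, 0, 1], [0, 1, 0, 1]))
# [1, 0, 0, 0]
```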
This is handy for:
- Removing blacklist CpGs from an existing
.cg. - Removing low-quality or low-coverage sites.
- Restricting analysis to a curated panel of CpGs.
7.2.3 Contextualizing to Format 6 (-c)
With -c, yame mask turns a binary query plus a mask into a format 6 object:
yame mask -c input_binary.cx mask.cx -o contextualized.cx
Behavior:
- The mask defines the universe (sites that exist in the cell).
- The binary values in the query define whether each universe site is methylated (1) or unmethylated (0):
  - If mask[i] = 1:
    - If input[i] = 1 ➜ set format 6 as methylated (FMT6_SET1).
    - If input[i] = 0 ➜ set format 6 as unmethylated (FMT6_SET0).
  - If mask[i] = 0:
    - Site is outside the universe (no entry in the resulting format 6 vector).
- With -v, the universe is effectively the complement of the mask (invert mask before contextualizing).
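The mapping above can be written as a short Python sketch. It represents each format 6 site as 1 (methylated), 0 (unmethylated), or None (outside the universe). That representation and the function name contextualize are assumptions for illustration, not yame's internal encoding.

```python
def contextualize(query, mask, invert=False):
    """Combine a binary query and a mask into per-site states: 1, 0, or None."""
    if len(query) != len(mask):
        raise ValueError("mask and query must have the same length")
    # With inversion (as with -v), the universe is the complement of the mask.
    universe = [1 - b for b in mask] if invert else list(mask)
    return [q if in_uni else None for q, in_uni in zip(query, universe)]

print(contextualize([1, 0, 1, 0], [1, 1, 0, 0]))
# [1, 0, None, None]
print(contextualize([1, 0, 1, 0], [1, 1, 0, 0], invert=True))
# [None, None, 1, 0]
```

Note that a 0 in the query is only meaningful inside the universe: outside it, the site is missing rather than unmethylated.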
This is useful for:
- Defining cell- or experiment-specific universes while retaining 0/1 methylation calls.
- Converting feature masks into sparse, contextualized single-cell objects.
7.2.4 Command Summary
yame mask [options] <in.cx> <mask.cx>
Options:
- -o [PATH]: output .cx file name. If missing, write to stdout (no index).
- -c: contextualize to format 6 using 1s in mask as the universe.
- -v: invert the mask (mask 0s instead of 1s).
- -h: help.
Example workflows:
Mask out low-quality sites:
yame mask highcov.cg lowqual_mask.cx -o highcov_masked.cg
Restrict to a predefined universe and create fmt6:
yame mask -c raw_binary.cx universe_mask.cx -o cell_fmt6.cx
For more help with mask, run:
yame mask -h
7.3 Noise Injection with yame perturb
yame perturb randomly flips methylation bits in format 0 or format 6 data with a specified probability. It is designed for benchmarking and sensitivity testing — for example, measuring how much label noise a method can tolerate before its accuracy degrades.
7.3.1 What perturb Does (Format 0 vs Format 6)
- Format 0 (binary bit vector)
  - Every bit (0 or 1) is independently flipped with probability p.
- Format 6 (universe-bit binary)
  - Only in-universe sites are eligible for flipping.
  - NA sites (universe bit = 0) are left unchanged.
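The flipping rule can be illustrated with a minimal Python sketch. It assumes one value per site in plain lists and uses Python's random module; the function names are hypothetical, and the sketch will not reproduce yame's output for a given seed.

```python
import random

def perturb_fmt0(bits, p, seed):
    """Flip every bit independently with probability p."""
    rng = random.Random(seed)
    return [b ^ 1 if rng.random() < p else b for b in bits]

def perturb_fmt6(values, in_universe, p, seed):
    """Flip only in-universe sites; NA sites pass through unchanged."""
    rng = random.Random(seed)
    out = []
    for v, in_uni in zip(values, in_universe):
        if in_uni and rng.random() < p:
            out.append(v ^ 1)
        else:
            out.append(v)
    return out

bits = [0, 1] * 5000
noisy = perturb_fmt0(bits, p=0.10, seed=42)
flipped = sum(a != b for a, b in zip(bits, noisy))
print(flipped / len(bits))  # close to 0.10, not exactly 0.10
```

Because each site is flipped independently, the realized fraction of flipped sites fluctuates around p rather than matching it exactly.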
7.3.2 Key Options
yame perturb [options] <in.cx>
Options:
- -p [float]: fraction of eligible sites to flip, applied as a per-site flip probability p, in [0, 1] (default: 0.05).
- -s [int]: random seed (default: current time). Use a fixed seed for reproducibility.
- -o [PATH]: output .cx file (default: stdout).
7.3.3 Typical Use Cases
- Method robustness benchmarking. Apply increasing levels of noise (-p 0.01, 0.05, 0.10, …) to a known ground truth and measure how quickly downstream results degrade.
- Sensitivity analysis. Test whether a classifier or enrichment result is driven by a small number of highly informative sites, or is robust to random perturbation.
- Controlled simulation. Combine dsample (for sparsity) with perturb (for noise) to build fully parameterized simulation datasets.
Example: inject 10% noise into a format 6 single-cell file with a fixed seed:
yame perturb -p 0.10 -s 42 input.cg > noisy.cg
For more help with perturb, run:
yame perturb -h