library(MethScope)

Overview

This tutorial demonstrates a standard MethScope workflow:

  • Generate MRMP embeddings from .cg input
  • Perform cell-type annotation with a pre-trained model
  • Train a custom model
  • Visualize prediction quality
  • Run deconvolution and unsupervised clustering


1) Generate cell/pixel/sample MRMP embeddings

Use GenerateInput() to transform a query .cg file into a sample-by-MRMP matrix.

# Paths to your .cg and .cm files
example_file <- "/home/fuh1/cell_type/Tool/MethScope/inst/extdata/example.cg"
reference_pattern <- "/home/fuh1/cell_type/Tool/MethScope/inst/extdata/Liu2021_MouseBrain.cm"
input_pattern <- GenerateInput(example_file, reference_pattern)

Example output (head(input_pattern[, 1:9])):

Sample P1 P2 P3 P4 P5 P6 P7 P8 P9
GSM4012028 0.935 0.016 0.902 0.899 0.881 0.898 0.900 0.883 0.850
GSM4014036 0.941 0.021 0.911 0.904 0.888 0.911 0.920 0.901 0.867
GSM4014043 0.955 0.019 0.931 0.926 0.916 0.930 0.918 0.932 0.897
GSM4014056 0.954 0.023 0.940 0.934 0.928 0.940 0.912 0.932 0.918
GSM4014298 0.947 0.016 0.926 0.922 0.909 0.923 0.863 0.918 0.894
GSM4016157 0.929 0.016 0.887 0.881 0.859 0.880 0.904 0.882 0.825

For very large datasets, consider splitting the .cg file and processing the pieces in parallel: https://zhou-lab.github.io/YAME/docs/subcommands/YAME_subset.html

2) Perform cell-type annotation

Use a built-in pre-trained model for fast annotation. Available built-in examples include:

  • MethScope:::Liu2021_MouseBrain_P1000
  • MethScope:::Zhou2025_HumanAtlas_P1000

prediction_result <- PredictCellType(
  MethScope:::Liu2021_MouseBrain_P1000,
  input_pattern
)

3) Train a custom classification model

You can train your own XGBoost model from known labels. The cell_type_label vector must align with the rows of input_pattern.

trained_model <- Input_training(
  input_pattern,
  cell_type_label
)

You can also tune hyperparameters automatically or provide your own list of xgb_parameters:

trained_model_cv <- Input_training(
  input_pattern,
  cell_type_label,
  cross_validation = TRUE
)

4) Visualize prediction results

umap_plot <- PlotUMAP(input_pattern, prediction_result)
# cell_type_label is the true label vector
PlotConfusion(prediction_result, cell_type_label)
PlotF1(prediction_result, cell_type_label)

(Figures: UMAP visualization from PlotUMAP; confusion matrix from PlotConfusion; F1-score plot from PlotF1.)

5) Cell-type deconvolution

Estimate cell-type proportions with nnls_deconv(). Example reference matrices are available at: https://github.com/zhou-lab/methscope_data

# Load a precomputed reference matrix (see the methscope_data repository)
reference_input <- readRDS("2021Liu_reference_pattern.rds")
cell_proportion <- nnls_deconv(reference_input, input_pattern)
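Conceptually, nnls_deconv() models each query profile as a non-negative mixture of the reference cell-type profiles and solves for the mixing weights by non-negative least squares. The idea can be sketched in Python with scipy.optimize.nnls; this is an illustration on synthetic toy matrices, not MethScope's own implementation:

```python
import numpy as np
from scipy.optimize import nnls

# synthetic toy data: reference is features x cell_types, query is one mixed profile
rng = np.random.default_rng(0)
reference = rng.uniform(0, 1, size=(50, 3))   # 50 MRMP features, 3 cell types
true_props = np.array([0.6, 0.3, 0.1])        # ground-truth mixture
query = reference @ true_props                 # noiseless mixed sample

# solve min ||reference @ w - query|| subject to w >= 0, then renormalize to proportions
w, _ = nnls(reference, query)
proportions = w / w.sum()
print(np.round(proportions, 2))                # recovers approximately [0.6, 0.3, 0.1]
```

With noiseless input the true proportions are recovered exactly; real data adds noise, which is why the estimated weights are typically renormalized to sum to one.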

6) Unsupervised clustering (Seurat example)

After generating input_pattern, you can cluster with standard single-cell pipelines.

library(Seurat)
Pattern.obj <- CreateSeuratObject(counts = t(input_pattern), assay = "DNAm")
VariableFeatures(Pattern.obj) <- rownames(Pattern.obj[['DNAm']])
DefaultAssay(Pattern.obj) <- "DNAm"
Pattern.obj <- NormalizeData(Pattern.obj, assay = "DNAm", verbose = FALSE)
Pattern.obj <- ScaleData(Pattern.obj, assay = "DNAm", verbose = FALSE)
# Alternatively, skip scaling and use the raw counts matrix directly:
Pattern.obj@assays$DNAm@layers$scale.data <- as.matrix(Pattern.obj@assays$DNAm@layers$counts)
Pattern.obj <- RunPCA(Pattern.obj, assay = "DNAm", reduction.name = "mpca", verbose = FALSE)
Pattern.obj <- FindNeighbors(Pattern.obj, reduction = "mpca", dims = 1:30)
Pattern.obj <- FindClusters(Pattern.obj, verbose = FALSE, resolution = 0.7)
Pattern.obj <- RunUMAP(Pattern.obj, reduction = "mpca", reduction.name = "meth.umap", dims = 1:30)

7) Upscaling Network (10k-CpG block example)

This section describes a compact workflow to upscale MRMP embeddings to CpG-level methylation predictions within a genomic block.

7.1 Prepare model input (x) and target (y)

x is the MRMP embedding matrix (samples x features).
y is the CpG-level ground truth matrix for the same samples (samples x CpGs).

# create masked replicates (masked CpGs encoded as 2)
# this generates 100 replicates per original sample: each of the 10 output files contains 10 replicates per sample
parallel -j 24 'yame dsample -p {} -s {} -r 10 -N 29000 -o example_rep{}.cg example.cg' ::: {1..10}

# summarize MRMP features per replicate
parallel -j 24 'yame summary -m patterns.cm example_rep{}.cg | cut -f2,4,10 | doawk | sort -k1,1 -k2,2n | bedtools groupby -g 1 -c 3 -o collapse > example_rep{}_mrmp.txt' ::: {1..10}

cat example_rep*_mrmp.txt > input_pattern.txt

# divide the whole-genome .cg file into 10k-CpG blocks and extract CpG-level truth for each block
parallel -j 72 'yame rowsub -I {}_10000 example.cg | yame unpack -a - >truth/example_{}.txt' ::: {0..2940}

# In R: load the MRMP input (x) and one block's CpG-level truth (y)
x <- read.table("input_pattern.txt", header = FALSE)

y <- read.table("example_0.txt", header = FALSE)
y <- t(y)  # samples x CpGs

7.2 Train decoder (Python)

import torch
import torch.nn as nn

class BottleneckDecoder(nn.Module):
    def __init__(self, input_dim=num_of_patterns, hidden_dim=512, output_dim=10000):  # input_dim = number of MRMP features
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),
            nn.Linear(hidden_dim, output_dim)
        )
    def forward(self, x):
        return self.net(x)

Recommended training setup:

  • Loss: masked BCEWithLogitsLoss (ignore entries where label is 2)
  • Optimizer: Adam (lr = 1e-3, weight_decay = 1e-5)
  • Scheduler: ReduceLROnPlateau
  • Model selection: run 3 epochs and save the checkpoint with the best validation performance
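The masked-loss idea above can be sketched in plain numpy; in actual training you would use torch.nn.BCEWithLogitsLoss(reduction='none') and average over the mask, but this small function shows the arithmetic on toy arrays:

```python
import numpy as np

def masked_bce(logits, labels, mask_value=2):
    """BCE-with-logits averaged only over positions whose label is not mask_value."""
    mask = labels != mask_value
    z, y = logits[mask], labels[mask].astype(float)
    # numerically stable BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|))
    loss = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return loss.mean()

logits = np.array([[2.0, -1.0, 0.0], [0.5, 3.0, -2.0]])
labels = np.array([[1, 0, 2], [2, 1, 0]])  # 2 = masked CpG, ignored by the loss
print(masked_bce(logits, labels))
```

Because masked entries never enter the sum, a prediction at a masked CpG cannot influence the gradient, which is exactly what the label-2 encoding is for.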

7.3 Evaluate

After training:
1. Run inference on the test set.
2. Apply sigmoid to logits to obtain probabilities.
3. Compute metrics on valid positions only (Accuracy, MAE, AUPRC).
4. Save predicted probabilities to predictions.csv for downstream R visualization.
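Steps 2 and 3 can be sketched with numpy on toy arrays (names and values here are placeholders, not MethScope output; AUPRC would additionally need e.g. sklearn.metrics.average_precision_score):

```python
import numpy as np

# toy logits and ground truth; label 2 marks masked positions to skip
logits = np.array([[1.5, -0.5, 0.2], [-2.0, 0.8, 1.0]])
truth  = np.array([[1,   0,    2  ], [0,    2,   1  ]])

probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: logits -> probabilities
valid = truth != 2                      # evaluate on valid positions only

acc = np.mean((probs[valid] >= 0.5) == (truth[valid] == 1))
mae = np.mean(np.abs(probs[valid] - truth[valid]))
print(round(float(acc), 3), round(float(mae), 3))
```

The same masking convention as in training keeps the metrics comparable: only CpGs with observed ground truth contribute.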

Notes
- For whole-genome analysis, perform block-wise upscaling in parallel to improve computational efficiency.