This vignette book documents the SeSAMe package in full to supplement the vignettes hosted on the Bioconductor.
SeSAMe design includes a light-weight fullly-exposed infrastructure of the internal signal intensities. Central to this infrastructure is the SigDF
data structure, which is a data.frame
subclass. One can treat it like a regulator data.frame
with 7 specific columns, i.e., Probe_ID, MG, MR, UG, UR, col and mask. The col
column specifies the color channel and takes G
, R
and 2
. The Infinium-I probes carry G
and R
in col to indicate the designed color. This infrastructure is nimble to allow change of color channel, and mask (the scope of usable probes) depending on the use of the array on different species, strain, population etc. For example, the following data.frame operation let you easily peek into the signal intensities.
One can summarize resulting SigDF
using the ‘sesameQC_calcStats’ function (more QC can be found in the quality control vignette).
##
## =====================
## | Number of Probes
## =====================
## N. Probes : 866553 (num_probes)
## N. Inf.-II Probes : 724429 (num_probes_II)
## N. Inf.-I (Red) : 92192 (num_probes_IR)
## N. Inf.-I (Grn) : 49932 (num_probes_IG)
## N. Probes (CG) : 862927 (num_probes_cg)
## N. Probes (CH) : 2932 (num_probes_ch)
## N. Probes (RS) : 59 (num_probes_rs)
Sometimes, particularly with older arrays, there might exist a controls
attributes to contain the control probe information. In the new manifest, the control probes will be parsed and included as regulator probes (except with a ctl
prefix in the probe ID). The control probe annotation can be found using the following function:
SigDF
can be written as and read from plain text file (e.g, tab-delimited files and comma-delimited files) with the compliant column names (see above).
tsv_file_path = sprintf("%s/sigdf.tsv", tempdir())
sdf_write_table(sdf, file=tsv_file_path, sep="\t", quote=FALSE) # save as tsv
sdf2 = sdf_read_table(tsv_file_path) # read back
csv_file_path = sprintf("%s/sigdf.csv", tempdir())
sdf_write_table(sdf, file=csv_file_path, sep=",") # save as csv
sdf2 = sdf_read_table(csv_file_path, sep=",") # read back
The raw Infinium BeadChip data are stored in IDAT files. Each sample has two IDAT files and they correspond to the red and green signal respectively. Green and red files for the same samples should always share the same sample name prefix. For example, 204529320035_R06C01_Red.idat
and 204529320035_R06C01_Grn.idat
correspond to the red and green signal of one sample. In SeSAMe, we will use the common prefix, i.e. 204529320035_R06C01
, to refer to that sample. SeSAMe recognizes both the raw IDAT as well as gzipped IDATs which are common for data stored in GEO. For example, in addition to the example above, SeSAMe also recognizes GSM2178224_Red.idat.gz
and GSM2178224_Grn.idat.gz
.
The function readIDATpair
function reads in the signal intensity data from the IDAT pairs. The function takes the common prefix as input and outputs a SigDF
object. The SigDF
object is simply an R data.frame with rows representing probes and columns representing different signal intensity and probe annotations. The SigDF
class will be discussed more in a separate section below. Using the two examples above, one would run the following commands.
sdf = readIDAT("idat_folder/204529320035_R06C01") # Example 1
sdf = readIDAT("idat_folder/GSM2178224") # Example 2
Note that SeSAMe automatically detects and matches up the green and red signal files for the same sample.
If you are dealing with a custom-made array instead of the standard array (MM285, EPIC, HM450 etc) supported natively by SeSAMe, you would need to provide a manifest that describes the probe information. You should be able to obtain the probe information manifest from the Illumina support website. The manifest should be formated as a data frame with four columns minimally: Probe_ID
, M
, U
and col
. A optional mask
column may also be included as a default mask for the platform. The easiest way to format a SeSAMe-compatible manifest is by following internal manifests for a SeSAMe-supported platform. They can be retrieved with the sesameDataGet
function:
The col
is either G
(which stands for Green) or R
(which stands for Red) or 2
(which stands for Infinium II designs). For Infinium-II probes, the M
column and col
column is left as NA
. For example, one can check that both M
and col
columns are filled with the Infinium-I probes (in mouse array this can be indicated by a _[TBN][CON]1
suffix):
The last column mask
is a logical vector that defines the default masking behavior of SeSAMe for the platform (see below for discussion of NA-masking). With the manifest, your data can be processed using the manifest=
option in openSesame
or readIDATpair
(one sample).
In most cases, we would be working with a folder that contains many IDATs. Here is where the searchIDATprefixes
function comes in useful. It lets us search all the IDATs in a folder and its subfolders recursively. Combine this with the R looping functions lets you process many IDATs without having to specify all IDAT names. searchIDATprefixes
returns a named vector of prefixes with associated Red
and Grn
files, which can be given to readIDATpair
:
which returns a list of “SigDF”s. This is how the openSesame
pipeline is handling your data internally.
DNA methylation level (aka the β values) are defined as
β = M/(M + U)
M represents the signal from methylated allele and U represents the unmethylated allele. It can be retrieved calling the getBetas
function with the SigDF
as input. The output is a named vector with probe ID as names. For example, the following commands read in one sample and convert it to β values.
## cg00000029 cg00000103 cg00000109 cg00000155 cg00000158 cg00000165
## 0.8237945 0.2100515 0.8125637 0.9152265 0.9105163 0.8196466
CRITICAL: getBetas
takes a single SigDF
object as input instead of a list of SigDF
s. A common mistake is to c
-merge multiple SigDF
s. To combine multiple SigDF
s, one can use list()
instead. To process many SigDF
s, we should combine that with looping functions lapply
or mclapply
s, or using the openSesame
pipeline (see below).
β values for Infinium-I probes can also be obtained by summing up the two in-band channel and out-of-band channels. This rescues probes with SNP hitting the extension base and hence switching color channel. More details can be found in Zhou et al 2017.
As you notice the probe ID has the suffix which distinguishes replicate probes. These replicate probes can be collapsed to the cg prefix by adding the collapseToPfx=TRUE
option in openSesame
and getBetas
function (if that function is used explicitly).
## cg00101675_BC21 cg00116289_BC21 cg00211372_TC21 cg00531009_BC21 cg00747726_TC21
## 0.8140124 0.7519231 0.8431105 0.7911640 0.7225773
## cg00896209_TC21
## 0.8377515
## cg00101675 cg00116289 cg00211372 cg00531009 cg00747726 cg00896209
## 0.8140124 0.7519231 0.8431105 0.7911640 0.7225773 0.8377515
The NAs in beta values are controlled by the mask
column in the SigDF
object. The mask
column is a boolean vector that specifies which probe should be NA-masked when getBetas
function is called. For example, the following illustrate the relationship between the mask
column in SigDF
and NAs in the beta values.
sdf = sesameDataGet('EPIC.1.SigDF')
sdf_withNA = pOOBAH(sdf)
sum(sdf_withNA$mask) # number of probes to be NA-masked
## [1] 31631
## [1] 31631
## [1] 0
## [1] 0
Note that mask
in SigDF
does not actually remove the probe reading but only specifies how SeSAMe currently views the measurement (as unreliable). This is how SeSAMe gives finer control over different preprocessing functions, e.g., to make them only use reliable probes or species-specific. One can add more probes to the mask with the addMask
function. Other functions such as the detection p-value calculation (e.g., pOOBAH
), also modifies mask
. NA-masking influences other normalization and preprocessing functions. Therefore one should have them set for the preprocessing methods mentioned below. No mask is set when a new SigDF
is returned from the readIDATpair
function.
The qualityMask
function does experiment-independent masking using pre-defined probe sets. For example, probes with cross hybridization or are influenced by common polymorphisms can be masked using this function. For example, the following function add probes with low mapping quality to the mask:
For a list of options of such pre-defined sets one can use the listAvailableMasks()
function.
Note there is a “recommended” mask, it’s the merge of several mask sets specified in the mask_recommended
column. Details of the recommended mask can be found in Zhou et al. 2017.
Most of the masking comes from two major sources:
Experiment-independent Probe Masking due to design issues (see Zhou et al. 2017). This masking can be added by the qualityMask
function. Currently we support EPIC, MM285, HM450 and HM27 and should be used in openSesame
by default.
Experiment-dependent Probe Masking based on signal detection p-value (Zhou et al. 2018). Probes with p-value higher than a threshold (default: 0.05, but can also be raised to allow more lenient masking) are masked (see following for detection p-value calculation using the pOOBAH
method).
As mentioned above, experiment-dependent masking based on signal detection p-values is effective in excluding artifactual methylation level reading and probes with too much influence from signal background. We recommend the pOOBAH
algorithm that was based on Infinium-I probe out-of-band signal for calibrating the distribution of the signal background:
## [1] 31631
## [1] 132545
## [1] 132545
Sometimes one would want to calculation detection p-value without modifying the mask. For example, one may want to upload the p-values to GEO separately. In those cases one can use the return.pval
option and add pvalue-based mask later.
SeSAMe implements the background subtraction based on normal-exponential deconvolution using out-of-band probes noob
(Triche et al. 2013) and optionally with more aggressive subtraction (scrub
). One can use following β value distribution plot to see the effect of background subtraction. For example, the two (M and U) modes are further polarized.
Dye bias refers to the difference in signal intensity between the two color channel. SeSAMe offers two flavors of dye bias correction: linear scaling (dyeBiasCorr
) and nonlinear scaling (dyeBiasCorrTypeINorm
). Linear scaling equalize the mean of all probes from the two color channel.
par(mfrow=c(1,2))
sesameQC_plotRedGrnQQ(dyeBiasCorr(sdf), main="Before") # linear correction
sesameQC_plotRedGrnQQ(dyeBiasNL(sdf), main="After") # nonlinear correction
Residual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes. Under this correction, Infinium-I Red probes and Infinium-I Grn probes have the same distribution of signal. Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes.
Sometimes Infinium-I channel spec is inaccurate in the manifest. We can infer the channel from data.
## [2022-05-17 21:51:47] Infinium-I color channel reset: R>R:92127;G>G:49028;R>G:65;G>R:904
As one can see, most probes remain with the designated channel. A small fraction of the probes is considered “channel-switching”.
As we may’ve noticed, even with proper dye bias correction, there is still remaining differences in β value distribution between Infinium-I and Infinium-II probes due to the signal tail inflation. One solution is to perform ad-hoc quantile normalization. We provide a function similar to the BMIQ algorithm with modifications (one-, two- and three-state mixture and Infinium-I to II matching since Infinium-I is more often influenced by signal inflation based on our experience) to match Infinium-I and Infinium-II beta value distribution. We do not recommend the use of this (or any such methods) for all data unless your data is known to be relatively well-behaving in methylation distribution, for protection of real biological signal. This function assumes Infinium-I/II probes are similar in beta value distribution in the unmethylated and methylated mode.
SeSAMe provides functions to create QC plots. Some functions takes sesameQC as input while others directly plots the SigDF objects. For example, the sesameQC_plotBar
function takes a list of sesameQC objects and creates bar plot for each metric calculated.
The fraction of detection failures are signs of masking due to variety of reasons including failed detection, high background, putative low quality probes etc. To compare samples in terms of detection success rate, one can use the sesameQC_plotBar
function in the following way:
Dye bias is shown by an off-diagonal q-q plot of the red (x-axis) and green signal (y-axis).
Beta value is more influenced by signal background for probes with low signal intensities. The following plot shows this dependency and the extent of probes with low signal intensity.
## Loading required namespace: pals
Extra SNP allele frequencies can be obtained by summing up methylated and umethylated alleles of color-channel-switching probes. These allele frequencies can be combined with explicit SNP probes:
## rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
## 0.4731873 0.9265673 0.4825240 0.4711608 0.9561655 0.9326652
## cg00038584 cg00408315 cg00413617 cg00488829 cg00519463 cg00523683
## 0.10148976 0.17358858 0.09358459 0.69068952 0.10698911 0.05029466
SeSAMe can extract explicit and Infinium-I-derived SNPs to identify potential sample swaps.
One can also output the allele frequencies and output a VCF file with genotypes. This requires additional SNP information (ref and alt alleles), which can be downloaded using the following code:
## Retrieving annotation from https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/EPIC/EPIC.hg19.snp_overlap_b151.rds... Done.
## Retrieving annotation from https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/EPIC/EPIC.hg19.typeI_overlap_b151.rds... Done.
One can output to actual VCF file with a header by formatVCF(sdf, vcf=path_to_vcf)
.
Infinium platforms are intrinsically robust to incomplete bisulfite conversion as non-converted probes would fail to hybridize to the target. Residual incomplete bisulfite conversion can be quantified using GCT score based on C/T-extension probes. Details of this method can be found in Zhou et al. 2017. The closer the score to 1.0, the more complete the bisulfite conversion.
## [1] 1.067769
SeSAMe provide utilities to view methylation reading in a track view. Next, we will demonstrate how to create track view with transcript position marked focusing on a genomic region, a gene and specific probes. Let’s first load example HM450 data
To visualize probes from arbitrary region, we will call visualizeRegion
:
Zero to full methylation are displayed using the jet color scheme with blue representing no methylation and red full methylation.
To visualize all probes from a gene, we will call visualizeGene
SeSAMe performs copy number variation in three steps: 1) normalizes the signal intensity using a copy-number-normal data set; 2) groups adjacent probes into bins; 3) runs DNAcopy internally to group bins into segments.
To visualize segmentation in SeSAMe,
Let’s load beta values from SeSAMeData
Compare mouse array data with mouse tissue references. This will return a grid object that contrasts the traget sample with pre-built mouse tissue reference.
One can also visualize the actual SNP data for strain inference by the following. The target sample is plotted as a vertical bar on the right of the signature heatmap.
sdf = sesameDataGet("MMB.1.SigDF") # an example dataset
inferStrain(sdf, return.strain = TRUE) # return strain as a string
## [1] "NOD_ShiLtJ"
As expected, the variant configuration is consistent with the NOD strain.
There are many creative ways one can visualize enrichment results. Here we demonstrate a few popular ones supported in sesame. Dot plot:
library(SummarizedExperiment)
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS")
or a bar plot:
or a volcano plot. Here we need a two-tailed test to show both enrichment and depletion (the default is one-tailed):
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]
results_2tailed <- testEnrichment(query, "TFBS", alternative = "two.sided")
KYCG_plotVolcano(results_2tailed)
and a Waterfall plot:
If you have a list of cg groups and they were tested against the same set of databases, you can make a point range plot to summarize the overall trend:
## pick some big TFBS-overlapping CpG groups
cg_lists <- KYCG_getDBs("MM285.TFBS", silent=TRUE)
queries <- cg_lists[(sapply(cg_lists, length) > 40000)]
res <- lapply(queries, function(q) {
testEnrichment(q, "MM285.chromHMM", silent=TRUE)})
## note the input of the function is a list of testEnrichment outputs.
KYCG_plotPointRange(res)
plot enrichment over metagene:
KnowYourCG builds in several metagene/meta-feature coordinates. One can test enrichment over meta genes or simply plot the mean over metagenes:
## Platform set to: EPIC
## Selected the following database groups:
## 1. KYCG.EPIC.metagene.20220126
Here we picked some transcription factor binding sites-overlapping CpGs and tested against chromHMM states. As expected, most of these CGs are enriched at promoters and enhancers but depleted at heterchromatic regions.
Manhattan plots are an intuitive way to visualize the results from large scale “omics” studies that investigate genetic or epigenetic features on genome wide scales across large groups of samples. Here we demonstrate how the KYCG_plotManhattan()
function can be used to visualize the chromosomal distribution and significance level of CpG probes from an EWAS study. KYCG_plotManhattan()
takes a named numeric vector with CpG probeIDs as names and numeric values (P,Q,logFC,etc) obtained from analysis as values. To plot EWAS results, simply load the necessary libraries and pass the named numeric vector to KYCG_plotManhattan()
By default, KYCG_plotManhattan()
will infer the array platform from the probeIDs and retrieve the corresponding GRanges object to obtain probe coordinates. However, the platform and GRanges objects can be pre - specified if offline. Color, title, and the threshold to label significant probes on the plot can also be specified:
library(ggrepel)
smry_pvals = sesameAnno_get("Test/20220413_testManhattan.rds")
KYCG_plotManhattan(-log10(smry_pvals), platform="HM450",
title="Differentially Methylated CpGs - EWAS Aging",
col=c("navy","skyblue"), ylabel = bquote(-log[10](P~value)), label_min=30) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
In addition to hypothesis testing, knowYourCG also uses the curated database sets for feature engineering. We have a pre-curated summarized experiment containing a samplesheet and beta value matrix corresponding to about 467 MM285 samples with 20k probes. The samplesheet includes UIDs pertaining to the sample and several categorical/numerical features. To use this data for a linear model, we will extract the most relevant prevalent features.
We have found that it is computationally expensive to perform a linear model/generalized linear model on a feature set of individual CpGs. Additionally, interpreting the mechanism the significantly contributing CpGs is non-trivial due to their complex interactions. We hope to leverage these pre-curated database sets by using their beta value summary statistics as features instead. We will calculate the summary statistics for the betas matrix using a list of database sets. The default is to calculate the mean.
## Selected the following database groups:
## 1. KYCG.MM285.chromHMM.20210210
## Enh EnhG EnhLo EnhPois EnhPr
## [1,] 0.4747438 0.6719724 0.6432090 0.5546328 0.3883567
## [2,] 0.4629993 0.6726350 0.5992930 0.5677431 0.3881347
## [3,] 0.4857179 0.6724716 0.6359722 0.5672481 0.3889514
## [4,] 0.4985426 0.6888043 0.6279642 0.5803560 0.3987007
## [5,] 0.4720005 0.6698668 0.6420811 0.5530925 0.3851314
## [6,] 0.4978402 0.6910251 0.6072940 0.5807078 0.4133535
Just from the few database set means above, we can see that TSS are consistently hypomethylated, which is consistent with known biology.
Previously, the signal was implemented an S4 implementation in SigSet
complies with Bioconductor guidelines, and for backwards compatibility, SigSet
can be transformed to a SigDF
using the SigSetToSigDF
function sesame:::SigSetToSigDF(sset)
.
SigSet
/SigDF
can be converted back and forth from minfi::RGChannelSet
in multiple ways. One can sesamize a minfi RGChannelSet
which returns a GenomicRatioSet
. See sesamize for more detail.
## R Under development (unstable) (2021-11-09 r81170)
## Platform: x86_64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Monterey 12.3.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] wheatmap_0.2.0 ggrepel_0.9.1
## [3] ggplot2_3.3.6 SummarizedExperiment_1.26.1
## [5] Biobase_2.56.0 GenomicRanges_1.48.0
## [7] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [9] S4Vectors_0.34.0 MatrixGenerics_1.8.0
## [11] matrixStats_0.62.0 dplyr_1.0.9
## [13] sesame_1.14.2 sesameData_1.14.0
## [15] ExperimentHub_2.4.0 AnnotationHub_3.4.0
## [17] BiocFileCache_2.4.0 dbplyr_2.1.1
## [19] BiocGenerics_0.42.0 rmarkdown_2.14
## [21] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3 ellipsis_0.3.2
## [3] DNAcopy_1.69.0 XVector_0.36.0
## [5] dichromat_2.0-0 base64_2.0
## [7] farver_2.1.0 bit64_4.0.5
## [9] interactiveDisplayBase_1.34.0 AnnotationDbi_1.58.0
## [11] fansi_1.0.3 cachem_1.0.6
## [13] knitr_1.39 jsonlite_1.8.0
## [15] cluster_2.1.2 png_0.1-7
## [17] shiny_1.7.1 BiocManager_1.30.17
## [19] mapproj_1.2.8 readr_2.1.2
## [21] compiler_4.2.0 httr_1.4.3
## [23] assertthat_0.2.1 Matrix_1.4-0
## [25] fastmap_1.1.0 RPMM_1.25
## [27] cli_3.3.0 later_1.3.0
## [29] htmltools_0.5.2 tools_4.2.0
## [31] gtable_0.3.0 glue_1.6.2
## [33] GenomeInfoDbData_1.2.8 reshape2_1.4.4
## [35] maps_3.4.0 rappdirs_0.3.3
## [37] Rcpp_1.0.8.3 jquerylib_0.1.4
## [39] vctrs_0.4.1 Biostrings_2.64.0
## [41] preprocessCore_1.58.0 xfun_0.31
## [43] stringr_1.4.0 mime_0.12
## [45] lifecycle_1.0.1 zlibbioc_1.42.0
## [47] MASS_7.3-55 scales_1.2.0
## [49] BiocStyle_2.23.1 hms_1.1.1
## [51] promises_1.2.0.1 parallel_4.2.0
## [53] RColorBrewer_1.1-3 yaml_2.3.5
## [55] curl_4.3.2 memoise_2.0.1
## [57] sass_0.4.1 stringi_1.7.6
## [59] RSQLite_2.2.14 BiocVersion_3.15.2
## [61] highr_0.9 filelock_1.0.2
## [63] BiocParallel_1.30.2 pals_1.7
## [65] rlang_1.0.2 pkgconfig_2.0.3
## [67] bitops_1.0-7 evaluate_0.15
## [69] lattice_0.20-45 purrr_0.3.4
## [71] labeling_0.4.2 bit_4.0.4
## [73] tidyselect_1.1.2 plyr_1.8.7
## [75] magrittr_2.0.3 generics_0.1.2
## [77] DelayedArray_0.22.0 DBI_1.1.2
## [79] pillar_1.7.0 withr_2.5.0
## [81] KEGGREST_1.36.0 RCurl_1.98-1.6
## [83] tibble_3.1.7 crayon_1.5.1
## [85] KernSmooth_2.23-20 utf8_1.2.2
## [87] tzdb_0.3.0 grid_4.2.0
## [89] blob_1.2.3 digest_0.6.29
## [91] xtable_1.8-4 httpuv_1.6.5
## [93] illuminaio_0.38.0 openssl_2.0.1
## [95] munsell_0.5.0 bslib_0.3.1
## [97] askpass_1.1