As sesame and sesameData are under active development, this documentation is specific to the following version of R, sesame, sesameData and ExperimentHub:
library(sesame)
sesame_checkVersion()
## SeSAMe requires matched versions of R, sesame, sesameData and ExperimentHub.
## Here is the current versions installed:
## R: 4.3.0
## Bioconductor: 3.17
## sesame: 1.18.4
## sesameData: 1.18.0
## ExperimentHub: 2.8.0
We recommend updating your R, ExperimentHub, sesameData and sesame to use this documentation consistently. If you have installed directly from github, please make sure the compatible ExperimentHub is installed.
If you use a previous version, please checkout the vignette that corresponds to the right version here https://zhou-lab.github.io/sesame/dev/supplemental.html#Versions
CRITICAL: After a new installation, one must cache the associated annotation data using the following command. This needs to be done only once per SeSAMe installation/update. Caching data to local guarantees proper data retrieval and saves internet traffic.
sesameDataCache()
This function caches the needed SeSAMe annotations. SeSAMe annotation data is managed by the sesameData package which uses the ExperimentHub infrastructure. You can find the location of the cached annotation data on your local computer using:
::R_user_dir("ExperimentHub", which="cache") tools
## [1] "/Users/zhouw3/Library/Caches/org.R-project.R/R/ExperimentHub"
The openSesame
function provides end-to-end processing
that converts IDATs to DNA methylation level (aka β value) matrices in R. The function
can take either one of the following input:
The following code uses a directory that contains built-in two HM27
IDAT pairs to demonstrates the use of openSesame
:
## The following use the idat_dir in system.file,
## Replace it with your actual IDAT path
= system.file("extdata/", package = "sesameData")
idat_dir = openSesame(idat_dir, BPPARAM = BiocParallel::MulticoreParam(2)) betas
The BPPARAM=
option is from the
BiocParallel
package and controls parallel processing (in
this case, we are using two cores). Under the hood, the function
performs a series of tasks including: searching IDAT files from the
directory (the searchIDATprefixes
function), reading IDAT
data in as SigDF
objects (the readIDATpair
function), preprocessing the signals (the prepSesame
function), and finally converting them to DNA methylation levels (β values, the getBetas
function). Alternatively, one can run the following command to get the
same results, while gaining more refined control:
## The above openSesame call is equivalent to:
= do.call(cbind, BiocParallel::bplapply(
betas searchIDATprefixes(idat_dir), function(pfx) {
getBetas(prepSesame(readIDATpair(pfx), "QCDPB"))
BPPARAM = BiocParallel::MulticoreParam(2)))
},
## or even more explicitly (if one needs to control argument passed
## to a specific preprocessing function)
= do.call(cbind, BiocParallel::bplapply(
betas searchIDATprefixes(idat_dir), function(pfx) {
getBetas(noob(pOOBAH(dyeBiasNL(inferInfiniumIChannel(qualityMask(
readIDATpair(pfx)))))))
BPPARAM = BiocParallel::MulticoreParam(2))) },
The openSesame
function is highly customizable. The
prep=
argument is the same argument one gives to the
prepSesame
function (see Data
Preprocessing for detail) which openSesame
calls
internally. The argument uniquely specifies a preprocessing procedure.
The func=
option specifies the signal extraction function.
It can be either be getBetas
(DNA methylation) or
getAFs
(allele frequencies of SNP probes) or NULL (returns
SigDF
). The manifest=
option allows one to
provide an array manifest when handling data from platform not supported
natively. Finally, the BPPARAM=
argument is the same
argument taken by BiocParallel::bplapply
to allow parallel
processing. See Supplemental
Vignette for details of these component functions of openSesame.
The output of openSesame
can also be customized. It can
either be beta values, which are the end DNA methylation readings, as
shown above. It can also be a list of SigDF
s which stores
the signal intensities and can be further put back to openSesame for
more processing. The openSesame(func=)
argument specifies
whether the output is a SigDF list or beta values. The following shows
some usage:
= openSesame(idat_dir, func = getBetas) # getBetas is the default
betas = openSesame(idat_dir, func = NULL) # return SigDF list
sdfs = openSesame(idat_dir, func = getAFs) # SNP allele frequencies
allele_freqs = openSesame(sdfs, prep = "Q", func = NULL) # take and return SigDFs sdfs
The prep=
argument instructs the openSesame
function to call the prepSesame
function to preprocess
signal intensity under the hood. This can be skipped by using
prep=""
. The prepSesame
function takes a
single SigDF
as input and returns a processed
SigDF
. When prep=
is non-empty, it selects the
preprocessing functions (see Preprocessing Function
Code) and specifies the order of their execution. For example,
= sesameDataGet('EPIC.1.SigDF')
sdf = openSesame(sdf, prep="DB", func=NULL) sdf_preped
performs dye bias correction (D
) followed by background
subtraction (B
). In other words,
prepSesame(sdf, "DB")
is equivalent to
noob(dyeBiasNL(sdf))
. All the preprocessing functions take
a SigDF
as input and return an updated SigDF
.
Therefore, these functions can be chained together. The choice of
preprocessing functions and the order of their chaining is important
(see Supplemental
Vignette) for detailed discussions of these functions). The
following table lists the best preprocessing strategy based on our
experience.
Platform | Sample Organism | Prep Code |
---|---|---|
EPICv2/EPIC/HM450 | human | QCDPB |
EPICv2/EPIC/HM450 | non-human organism | SQCDPB |
MM285 | mouse | TQCDPB |
MM285 | non-mouse organism | SQCDPB |
Mammal40 | human | HCDPB |
Mammal40 | non-human organism | SHCDPB |
The optimal strategy of preprocessing depends on:
The array platform. For example, certain array platforms (e.g., the Mammal40) do not have enough Infinium-I probes for background estimation and dye bias correction, therefore background subtraction (where the out-of-band signals are from) might not work most optimally;
The expected sample property. For example, some
samples have the signature bimodal distribution of methylation of most
mammalian cells. Others may undergo global loss of methylation (germ
cells, tumors etc). Other important factors include high-input vs
low-input, tumor vs normal, somatic vs germ cells, human vs model
organisms, mouse strains etc. Some platforms (e.g., Mammal40 and MM285)
are designed for multiple species and strains. Therefore S
and T
would be important when those arrays are used on
non-reference organisms (see Working with
Nonhuman Arrays).
The prepSesameList
function lists all the available
codes and the associated preprocessing functions.
prepSesameList()
Here are some consideration when determining the preprocessing order.
Species (S)
and strain (T)
inference resets
the mask and color channels based on probe alignment and presence of
genetic variants. Therefore when they are used, they need to be called
first. Q
masks non-uniquely mapped probes which may inflate
the out-of-band signal for background estimation. Therefore
Q
should be used before detection p-value calculation
(P
) and background subtraction (B
) when
necessary. Channel inference (C
) and dye bias correction
(D
) should take place early since dye bias effect is
global. C
should be placed before D
because
dye bias correction uses in-band signal the identification of which
relies on correct channel designation. Detection p-value
(P
) should happen before background subtraction
(B
) since background subtraction modifies signal and may
affect out-of-band signal assumption used in P
. Lastly,
functions that explicitly normalizes β value distribution
(M
) should happen last if they even need to be used.
See Supplemental Vignette for details of preprocessing functions.
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin22.3.0 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.42 sesame_1.18.4 sesameData_1.18.0
## [4] ExperimentHub_2.8.0 AnnotationHub_3.8.0 BiocFileCache_2.8.0
## [7] dbplyr_2.3.2 BiocGenerics_0.46.0 rmarkdown_2.21
## [10] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.2
## [3] blob_1.2.4 filelock_1.0.2
## [5] Biostrings_2.68.0 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.12
## [9] promises_1.2.0.1 digest_0.6.31
## [11] mime_0.12 lifecycle_1.0.3
## [13] ellipsis_0.3.2 KEGGREST_1.40.0
## [15] interactiveDisplayBase_1.38.0 RSQLite_2.3.1
## [17] magrittr_2.0.3 compiler_4.3.0
## [19] rlang_1.1.1 sass_0.4.6
## [21] tools_4.3.0 utf8_1.2.3
## [23] yaml_2.3.7 S4Arrays_1.0.1
## [25] bit_4.0.5 curl_5.0.0
## [27] DelayedArray_0.26.1 plyr_1.8.8
## [29] RColorBrewer_1.1-3 BiocParallel_1.34.0
## [31] withr_2.5.0 purrr_1.0.1
## [33] grid_4.3.0 stats4_4.3.0
## [35] preprocessCore_1.62.1 fansi_1.0.4
## [37] wheatmap_0.2.0 colorspace_2.1-0
## [39] xtable_1.8-4 ggplot2_3.4.2
## [41] MASS_7.3-58.4 scales_1.2.1
## [43] SummarizedExperiment_1.30.1 cli_3.6.1
## [45] crayon_1.5.2 generics_0.1.3
## [47] reshape2_1.4.4 httr_1.4.5
## [49] tzdb_0.3.0 DBI_1.1.3
## [51] cachem_1.0.8 stringr_1.5.0
## [53] zlibbioc_1.46.0 parallel_4.3.0
## [55] AnnotationDbi_1.62.1 BiocManager_1.30.20
## [57] XVector_0.40.0 matrixStats_0.63.0
## [59] vctrs_0.6.2 Matrix_1.5-4
## [61] jsonlite_1.8.4 IRanges_2.34.0
## [63] hms_1.1.3 S4Vectors_0.38.1
## [65] bit64_4.0.5 jquerylib_0.1.4
## [67] glue_1.6.2 codetools_0.2-19
## [69] stringi_1.7.12 gtable_0.3.3
## [71] BiocVersion_3.17.1 later_1.3.1
## [73] GenomeInfoDb_1.36.0 GenomicRanges_1.52.0
## [75] munsell_0.5.0 tibble_3.2.1
## [77] pillar_1.9.0 rappdirs_0.3.3
## [79] htmltools_0.5.5 GenomeInfoDbData_1.2.10
## [81] lattice_0.21-8 evaluate_0.20
## [83] shiny_1.7.4 Biobase_2.60.0
## [85] readr_2.1.4 png_0.1-8
## [87] memoise_2.0.1 BiocStyle_2.27.2
## [89] httpuv_1.6.9 bslib_0.4.2
## [91] Rcpp_1.0.10 xfun_0.39
## [93] MatrixGenerics_1.12.0 pkgconfig_2.0.3