SeSAMe provides extensive native support for the Illumina Mouse Methylation BeadChip array (referred to as the MM285 or MMB array) and the Horvath Mammal40 array (refered to as the Mammal40 array). The MM285 array contains ~285,000 probes covering over 20 design categories including gene promoters, enhancers, CpGs in synteny to human EPIC array as well as other biology. In SeSAMe, MM285 is used as the product abbreviation to distinguish future platforms including MM320. This documents describe the procedure to process the MM285 and the Mammal40 array.
We first load required library and perform sesame data caching (only needed at new SeSAMe installation).
library(sesame)
sesameDataCache()
SeSAMe supports automatic inference of the profiled organism. This is
achieved by the inferSpecies
function. Usually, users need
not call this function explicitly and only need to specify the code
S
as part of the 2nd argument of the
openSesame
function. See the
Basics Usage vignette for more detail.
The following example downloads an example Mammal40 array IDAT pair
and call openSesame
function with species inference (note
the S
in the prep=
argument).
= tempdir()
tmp sesameAnno_download("Test/GSM4411982_Grn.idat.gz", dest_dir=tmp)
sesameAnno_download("Test/GSM4411982_Red.idat.gz", dest_dir=tmp)
= openSesame(sprintf("%s/Test/GSM4411982", tmp), prep="SHCDPM") betas
The above code is equivalent to
## equivalent to the above openSesame call
= getBetas(matchDesign(pOOBAH(dyeBiasNL(inferInfiniumIChannel(
betas prefixMaskButC(inferSpecies(readIDATpair(
sprintf("%s/Test/GSM4411982", tmp)))))))))
As can be seen, inferSpecies
takes a SigDF as input and
outputs an updated SigDF which contains a species-specific masking and
color channel designation. This information is key to proper
preprocessing since knowledge of the color channel designation and probe
hybridization success is the foundational assumption of many
preprocessing algorithms. One may instruct the function to return the
inferred species information by using the
return.species = TRUE
argument. The following example shows
this usage:
= sesameDataGet("Mammal40.1.SigDF") # an example SigDF
sdf inferSpecies(sdf, return.species = TRUE)
## $scientificName
## [1] "xenopus_tropicalis"
##
## $taxonID
## [1] 8364
##
## $commonName
## [1] "Xenopus tropicalis"
##
## $assembly
## [1] "Xenopus_tropicalis_v9.1"
Internally, we used a nonparametric scoring method to infer the most
likely species from a pool of 310 candidate species from Ensembl (v101).
The return.auc = TRUE
argument allows one to peek into the
AUC (Area Under the Curve) score generated in this inference. The higher
the score, the more likely the data is from the corresponding species.
Knowing the scores can help diagnose misclassifications such as when
several candidate species are closely related and hard to discriminate
from data.
## showing the candidate species with top scores
head(sort(inferSpecies(sdf, return.auc = TRUE), decreasing=TRUE))
## xenopus_tropicalis leptobrachium_leishanense latimeria_chalumnae
## 0.9643305 0.9165120 0.8332575
## pseudonaja_textilis notechis_scutatus salvator_merianae
## 0.8292410 0.8268660 0.8255875
If the user believes that automatic inference gives wrong (most often
still close-related) species, one can force species inference to a
target species by using the updateSigDF
function. For
example, the following code forces the SigDF
to be treated
as a mus_musculus
sample. Note this doesn’t alter signal
intensity but only change the probe masking and color channel spec (the
view of the data, not the data reading itself).
<- updateSigDF(sdf, species="mus_musculus") sdf_mouse
CRITICAL: Since updateSigDF
function
resets the whole mask and col column of SigDF. One should use this
function (and inferSpecies
) before other preprocessing
functions that sets mask and col.
Like species inference, strain-specific masking and preprocessing is
important for mouse array samples. This is achieved by the
inferStrain
function. The function is represented by the
T
code in openSesame
/prepSesame
.
The following example shows how to use inferStrain
in
openSesame
. Note the use of T
in the prep
code.
= tempdir()
tmp sesameAnno_download("Test/204637490002_R05C01_Grn.idat", dest_dir=tmp)
sesameAnno_download("Test/204637490002_R05C01_Red.idat", dest_dir=tmp)
= openSesame(sprintf("%s/Test/204637490002_R05C01", tmp), prep="TQCDPB") betas
Like inferSpecies
, one need to call the
inferStrain
function before calling the standard
noob
, dyeBiasNL
, etc (by having T
before QCDPB
when calling openSesame
). Also
like inferSpecies()
, inferStrain()
will return
a new SigDF
with col and mask updated to reflect the change
of strain. Optionally, one can also specify
return.strain=TRUE
, return.pval=TRUE
or
return.probability=TRUE
to return the inferred strain, the
p-value or the probabilities of all 37 strain candidates. Internally,
the function converts the beta values to variant allele frequencies. It
should be noted that since variant allele frequency is not always
measured by the M-allele of the probe. SeSAMe flips the β values for some probes to
calculate variant allele frequency. The following example shows what
inferStrain
does to a SigDF
:
= sesameDataGet("MM285.1.SigDF") # an example dataset
sdf inferStrain(sdf, return.strain = TRUE) # return strain as a string
## [1] "NOD_ShiLtJ"
= inferStrain(sdf) # update mask and col by strain inference
sdf_after sum(sdf$mask) # before strain inference
## [1] 0
sum(sdf_after$mask) # after strain inference
## [1] 14599
Let’s visualize the probabilities of all candidate strains using the
return.probabilities
option:
library(ggplot2)
= inferStrain(sdf, return.probability = TRUE)
p = data.frame(strain=names(p), probs=p)
df ggplot(data = df, aes(x = strain, y = probs)) +
geom_bar(stat = "identity", color="gray") +
ggtitle("Strain Probabilities") +
ylab("Probability") + xlab("") +
scale_x_discrete(position = "top") +
theme(axis.text.x = element_text(angle = 90, vjust=0.5, hjust=0),
legend.position = "none")
See also The Supplemental Vignette for heatmap validation of strain inference.
In the MM285 array, we designed multimapping probes to allow querying
transposable elements and other biology. We also exposed probes with
potentially design flaws. These suboptimally designed probes take a new
probe ID prefix (“uk”) in addition to the “cg”/“ch”/“rs” typically seen.
By default the repeat and suboptimally designed probes are masked by
NA
. These masking is done by the qualityMask
function (or Q
in prep codes). To override masking these
probes, one can use the resetMask
function (the
0
code in openSesame
) to remove the default
masking. We recommend using it after the preprocessing function that
depends on reliable/uniquely-mapped probes, but before detection p-value
based masking (e.g. pOOBAH). This way, probes that fail detection can
still be flagged (they should be).
= sesameDataGet('MM285.1.SigDF')
sdf sum(is.na(openSesame(sdf, prep="TQCDPB")))
## [1] 20548
sum(is.na(openSesame(sdf, prep="TQCD0PB")))
## [1] 7735
UNDER CONSTRUCTION
There are other inferences one can do on the nonhuman arrays, e.g., sex, age (epigenetic clocks), tissue, copy number alteration etc. These will be elaborated in The Inference Vignette.
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin22.3.0 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.2 sesame_1.18.4 sesameData_1.18.0
## [4] ExperimentHub_2.8.0 AnnotationHub_3.8.0 BiocFileCache_2.8.0
## [7] dbplyr_2.3.2 BiocGenerics_0.46.0 dplyr_1.1.2
## [10] wheatmap_0.2.0 rmarkdown_2.21 R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] rlang_1.1.1 magrittr_2.0.3
## [5] matrixStats_0.63.0 compiler_4.3.0
## [7] RSQLite_2.3.1 png_0.1-8
## [9] vctrs_0.6.2 reshape2_1.4.4
## [11] stringr_1.5.0 pkgconfig_2.0.3
## [13] crayon_1.5.2 fastmap_1.1.1
## [15] XVector_0.40.0 ellipsis_0.3.2
## [17] labeling_0.4.2 utf8_1.2.3
## [19] promises_1.2.0.1 tzdb_0.3.0
## [21] preprocessCore_1.62.1 purrr_1.0.1
## [23] bit_4.0.5 xfun_0.39
## [25] zlibbioc_1.46.0 cachem_1.0.8
## [27] GenomeInfoDb_1.36.0 jsonlite_1.8.4
## [29] blob_1.2.4 highr_0.10
## [31] later_1.3.1 DelayedArray_0.26.1
## [33] BiocParallel_1.34.0 interactiveDisplayBase_1.38.0
## [35] parallel_4.3.0 bslib_0.4.2
## [37] stringi_1.7.12 RColorBrewer_1.1-3
## [39] GenomicRanges_1.52.0 jquerylib_0.1.4
## [41] Rcpp_1.0.10 SummarizedExperiment_1.30.1
## [43] knitr_1.42 readr_2.1.4
## [45] IRanges_2.34.0 httpuv_1.6.9
## [47] Matrix_1.5-4 tidyselect_1.2.0
## [49] yaml_2.3.7 codetools_0.2-19
## [51] curl_5.0.0 lattice_0.21-8
## [53] tibble_3.2.1 plyr_1.8.8
## [55] Biobase_2.60.0 shiny_1.7.4
## [57] withr_2.5.0 KEGGREST_1.40.0
## [59] evaluate_0.20 Biostrings_2.68.0
## [61] pillar_1.9.0 BiocManager_1.30.20
## [63] filelock_1.0.2 MatrixGenerics_1.12.0
## [65] stats4_4.3.0 generics_0.1.3
## [67] RCurl_1.98-1.12 BiocVersion_3.17.1
## [69] S4Vectors_0.38.1 hms_1.1.3
## [71] munsell_0.5.0 scales_1.2.1
## [73] BiocStyle_2.27.2 xtable_1.8-4
## [75] glue_1.6.2 tools_4.3.0
## [77] grid_4.3.0 AnnotationDbi_1.62.1
## [79] colorspace_2.1-0 GenomeInfoDbData_1.2.10
## [81] cli_3.6.1 rappdirs_0.3.3
## [83] fansi_1.0.4 S4Arrays_1.0.1
## [85] gtable_0.3.3 sass_0.4.6
## [87] digest_0.6.31 farver_2.1.1
## [89] memoise_2.0.1 htmltools_0.5.5
## [91] lifecycle_1.0.3 httr_1.4.5
## [93] mime_0.12 bit64_4.0.5
## [95] MASS_7.3-58.4