SeSAMe implements inference of sex, age, ethnicity. These are valuable information for checking the integrity of the experiment and detecting sample swaps.
Sex is inferred based on our curated X-linked probes and Y chromosome probes excluding pseudo-autosomal regions and XCI escapes.
Human:
= sesameDataGet('EPIC.1.SigDF')
sdf inferSex(sdf)
## [1] "MALE"
inferSexKaryotypes(sdf)
## [1] "XaY"
Mouse:
= sesameDataGet("MM285.1.SigDF")
sdf inferSex(sdf)
## [1] Male
## Levels: Female Male
Ethnicity is inferred using a random forest model trained based on
both the built-in SNPs (rs
probes) and channel-switching
Type-I probes.
= sesameDataGet('EPIC.1.SigDF')
sdf inferEthnicity(sdf)
## [1] "WHITE"
SeSAMe provides age regression through multiple previously established models, e.g., the well-known Horvath 353 model (Horvath 2013) which returns the chronological age in the number of years. Here is an example:
<- sesameDataGet('HM450.1.TCGA.PAAD')$betas
betas <- sesameAnno_get("Anno/HM450/Clock_Horvath353.rds")
model predictAge(betas, model)
And MM285 mouse array data using a set of 347 CpGs (see Zhou et al. 2022) The function returns the age in the number of months. We recommend using SeSAMe preprocessed data as input to the function. Here’s an example:
library(SummarizedExperiment)
<- assay(sesameDataGet("MM285.10.SE.tissue"))[,1]
betas <- sesameAnno_get("Anno/MM285/Clock_Zhou347.rds")
model predictAge(betas, model)
This indicates that this mouse is approximately 1.41 months old. The function looks for overlapping probes and estimates age using the corresponding clock models. Other available epigenetic clocks are
RDS Key | Platform | N | Reference (PMID) |
---|---|---|---|
Anno/HM450/Clock_Horvath353.rds | 353 | HM450/EPIC | Horvath 2013 (24138928) |
Anno/HM450/Clock_Hannum.rds | 71 | HM450 | Hannum 2013 (23177740) |
Anno/HM450/Clock_SkinBlood.rds | 391 | HM450/EPIC | Horath 2018 (30048243) |
Anno/EPIC/Clock_PhenoAge.rds | 514 | HM450/EPIC | Levine 2018 (29676998) |
Anno/MM285/Clock_Zhou347.rds | 347 | MM285 | Zhou 2022 |
SeSAMe estimates leukocyte fraction using a two-component model.This function works for samples whose targeted cell-of-origin is not related to white blood cells.
<- sesameDataGet('HM450.1.TCGA.PAAD')$betas
betas.tissue estimateLeukocyte(betas.tissue)
## [1] 0.2007592
The goal of data sanitization is to modifiy IDAT files in place, so they can be released to public domain without privacy leak. This will be achieved by deIdentification.
Let’s take DNA methylation data from the HM450 platform for example.
= tempdir()
tmp = sesameAnno_download("Test/3999492009_R01C01_Grn.idat", dest_dir=tmp)
res_grn = sesameAnno_download("Test/3999492009_R01C01_Red.idat", dest_dir=tmp) res_red
This first method of deIdentification masks SNP probe intensity mean by zero. As a consequence, the allele frequency will be 0.5.
deIdentify(res_grn$dest_file, sprintf("%s/deidentified_Grn.idat", tmp))
deIdentify(res_red$dest_file, sprintf("%s/deidentified_Red.idat", tmp))
= getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas1 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))
betas2
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))])
Note that before deIdentify, the rs values will all be different. After deIdentify, the rs values will all be masked at an intensity of 0.5.
This second method of deIdentification will scramble the intensities using a secret key to help formalize a random number. Therefore, randomize needs to be set to TRUE.
<- 13412084
my_secret set.seed(my_secret)
deIdentify(res_grn$dest_file,
sprintf("%s/deidentified_Grn.idat", tmp), randomize=TRUE)
<- 13412084
my_secret set.seed(my_secret)
deIdentify(res_red$dest_file,
sprintf("%s/deidentified_Red.idat", tmp), randomize=TRUE)
= getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas1 = getBetas(readIDATpair(sprintf("%s/deidentified", tmp)))
betas2
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))])
Note that the rs values are scrambled after deIdentify.
To restore order of the deIdentified intensities, one can re-identify IDATs. The reIdentify function can thus restore the scrambled SNP intensities.
<- 13412084
my_secret set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Grn.idat", tmp),
sprintf("%s/reidentified_Grn.idat", tmp))
<- 13412084
my_secret set.seed(my_secret)
reIdentify(sprintf("%s/deidentified_Red.idat", tmp),
sprintf("%s/reidentified_Red.idat", tmp))
= getBetas(readIDATpair(sprintf("%s/Test/3999492009_R01C01", tmp)))
betas1 = getBetas(readIDATpair(sprintf("%s/reidentified", tmp)))
betas2
head(betas1[grep('rs',names(betas1))])
head(betas2[grep('rs',names(betas2))])
Note that reIdentify restored the values. Subsequently, they are the same as betas1.
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin22.3.0 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.3.0/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.42 sesame_1.18.4 sesameData_1.18.0
## [4] ExperimentHub_2.8.0 AnnotationHub_3.8.0 BiocFileCache_2.8.0
## [7] dbplyr_2.3.2 BiocGenerics_0.46.0 rmarkdown_2.21
## [10] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] rlang_1.1.1 magrittr_2.0.3
## [5] matrixStats_0.63.0 e1071_1.7-13
## [7] compiler_4.3.0 RSQLite_2.3.1
## [9] png_0.1-8 vctrs_0.6.2
## [11] reshape2_1.4.4 stringr_1.5.0
## [13] pkgconfig_2.0.3 crayon_1.5.2
## [15] fastmap_1.1.1 XVector_0.40.0
## [17] ellipsis_0.3.2 utf8_1.2.3
## [19] promises_1.2.0.1 tzdb_0.3.0
## [21] preprocessCore_1.62.1 purrr_1.0.1
## [23] bit_4.0.5 xfun_0.39
## [25] randomForest_4.7-1.1 zlibbioc_1.46.0
## [27] cachem_1.0.8 GenomeInfoDb_1.36.0
## [29] jsonlite_1.8.4 blob_1.2.4
## [31] later_1.3.1 DelayedArray_0.26.1
## [33] BiocParallel_1.34.0 interactiveDisplayBase_1.38.0
## [35] parallel_4.3.0 bslib_0.4.2
## [37] stringi_1.7.12 RColorBrewer_1.1-3
## [39] GenomicRanges_1.52.0 jquerylib_0.1.4
## [41] Rcpp_1.0.10 SummarizedExperiment_1.30.1
## [43] wheatmap_0.2.0 readr_2.1.4
## [45] IRanges_2.34.0 httpuv_1.6.9
## [47] Matrix_1.5-4 tidyselect_1.2.0
## [49] yaml_2.3.7 codetools_0.2-19
## [51] curl_5.0.0 lattice_0.21-8
## [53] tibble_3.2.1 plyr_1.8.8
## [55] Biobase_2.60.0 shiny_1.7.4
## [57] withr_2.5.0 KEGGREST_1.40.0
## [59] evaluate_0.20 proxy_0.4-27
## [61] Biostrings_2.68.0 pillar_1.9.0
## [63] BiocManager_1.30.20 filelock_1.0.2
## [65] MatrixGenerics_1.12.0 stats4_4.3.0
## [67] generics_0.1.3 RCurl_1.98-1.12
## [69] BiocVersion_3.17.1 S4Vectors_0.38.1
## [71] hms_1.1.3 ggplot2_3.4.2
## [73] munsell_0.5.0 scales_1.2.1
## [75] BiocStyle_2.27.2 xtable_1.8-4
## [77] class_7.3-21 glue_1.6.2
## [79] tools_4.3.0 grid_4.3.0
## [81] AnnotationDbi_1.62.1 colorspace_2.1-0
## [83] GenomeInfoDbData_1.2.10 cli_3.6.1
## [85] rappdirs_0.3.3 fansi_1.0.4
## [87] S4Arrays_1.0.1 dplyr_1.1.2
## [89] gtable_0.3.3 sass_0.4.6
## [91] digest_0.6.31 memoise_2.0.1
## [93] htmltools_0.5.5 lifecycle_1.0.3
## [95] httr_1.4.5 mime_0.12
## [97] MASS_7.3-58.4 bit64_4.0.5