This vignette book documents the SeSAMe package in full to supplement the vignettes hosted on the Bioconductor.
library(sesame)
library(dplyr)
sesameDataCache()SeSAMe is being actively developed. Please find the current version by
sesame_checkVersion()## SeSAMe requires matched versions of R, sesame, sesameData and ExperimentHub.
## Here is the current versions installed:
## R: 4.3.0
## Bioconductor: 3.18
## sesame: 1.19.10
## sesameData: 1.19.0
## ExperimentHub: 2.9.1
See the following for documentations of the previous versions (partial list):
| 1.18 | https://zhou-lab.github.io/sesame/v1.18/sesame.html |
| 1.16 | https://zhou-lab.github.io/sesame/v1.16/sesame.html |
| 1.14 | https://zhou-lab.github.io/sesame/v1.14/sesame.html |
SeSAMe design includes a light-weight fullly-exposed infrastructure
of the internal signal intensities. Central to this infrastructure is
the SigDF data structure, which is a
data.frame subclass. One can treat it like a regulator
data.frame with 7 specific columns, i.e., Probe_ID, MG, MR,
UG, UR, col and mask.
The col column specifies the color channel and takes
G, R and 2. The Infinium-I probes
carry G and R in col to indicate the designed
color. This infrastructure is nimble to allow change of color channel,
and mask (the scope of usable probes) depending on the use of the array
on different species, strain, population etc. For example, the following
data.frame operation let you easily peek into the signal
intensities.
sdf = sesameDataGet('EPIC.1.SigDF') # an example SigDF
head(sdf) # peek into the internalshead(sdf[sdf$col != "2",]) # Infinium-I probeshead(sdf[sdf$col == "2",]) # Infinium-II probesOne can summarize resulting SigDF using the
‘sesameQC_calcStats’ function (more QC can be found in the quality control vignette).
sesameQC_calcStats(sdf, "numProbes")##
## =====================
## | Number of Probes
## =====================
## N. Probes : 866553 (num_probes)
## N. Inf.-II Probes : 724429 (num_probes_II)
## N. Inf.-I (Red) : 92192 (num_probes_IR)
## N. Inf.-I (Grn) : 49932 (num_probes_IG)
## N. Probes (CG) : 862927 (num_probes_cg)
## N. Probes (CH) : 2932 (num_probes_ch)
## N. Probes (RS) : 59 (num_probes_rs)
Sometimes, particularly with older arrays, there might exist a
controls attributes to contain the control probe
information. In the new manifest, the control probes will be parsed and
included as regulator probes (except with a ctl prefix in
the probe ID). The control probe annotation can be found using the
following function:
head(controls(sdf))SigDF can be written as and read from plain text file
(e.g, tab-delimited files and comma-delimited files) with the compliant
column names (see above).
tsv_file_path = sprintf("%s/sigdf.tsv", tempdir())
sdf_write_table(sdf, file=tsv_file_path, sep="\t", quote=FALSE) # save as tsv
sdf2 = sdf_read_table(tsv_file_path) # read back
csv_file_path = sprintf("%s/sigdf.csv", tempdir())
sdf_write_table(sdf, file=csv_file_path, sep=",") # save as csv
sdf2 = sdf_read_table(csv_file_path, sep=",") # read backManifest files describe the probe mapping and chip address. The address files are subset of manifest files that only contains the chip addresses. The manifest files can be downloaded from http://zwdzwd.github.io/InfiniumAnnotation
After downloading, one can build the address file or GRanges object, both from the downloaded manifest files. The following shows an example,
tsv_path = sesameAnno_download("MM285.mm10.manifest.tsv.gz")
## read manifest to GRanges
manifest = sesameAnno_buildManifestGRanges(tsv_path)
## read manifest to tibble
manifest = sesameAnno_readManifestTSV(tsv_path)
## read manifest to address file (for openSesame or readIDATpair)
addr = sesameAnno_buildAddressFile(tsv_path)If you are dealing with a custom-made array instead of the standard
array (MM285, EPIC, HM450 etc) supported natively by SeSAMe, you would
need to provide a address file that describes the probe information. You
should be able to obtain the probe information manifest from the
Illumina support website. The address file should be formated as a data
frame with four columns minimally: Probe_ID,
M, U and col. A optional
mask column may also be included as a default mask for the
platform. The easiest way to format a SeSAMe-compatible address file is
by following internal address table for a SeSAMe-supported platform or
using the sesameAnno_buildAddressFile from a manifest file.
The internal address table can be retrieved with the
sesameDataGet function:
addr = sesameDataGet("MM285.address")$orderingThe col is either G (which stands for
Green) or R (which stands for Red) or 2 (which
stands for Infinium II designs). For Infinium-II probes, the
M column and col column is left as
NA. For example, one can check that both M and
col columns are filled with the Infinium-I probes (in mouse
array this can be indicated by a _[TBN][CON]1 suffix):
head(addr[!is.na(addr$col),])The last column mask is a logical vector that defines
the default masking behavior of SeSAMe for the platform (see below for
discussion of NA-masking). With the manifest, your data can be processed
using the manifest= option in openSesame or
readIDATpair (one sample).
betas = openSesame("IDAT_dir", manifest = addr)
sdf = readIDATpair("your_sample_prefix", manifest = addr) # or one sampleThe raw Infinium BeadChip data are stored in IDAT files. Each sample
has two IDAT files and they correspond to the red and green signal
respectively. Green and red files for the same samples should always
share the same sample name prefix. For example,
204529320035_R06C01_Red.idat and
204529320035_R06C01_Grn.idat correspond to the red and
green signal of one sample. In SeSAMe, we will use the common
prefix, i.e. 204529320035_R06C01, to refer to that
sample. SeSAMe recognizes both the raw IDAT as well as gzipped IDATs
which are common for data stored in GEO. For example, in
addition to the example above, SeSAMe also recognizes
GSM2178224_Red.idat.gz and
GSM2178224_Grn.idat.gz.
The function readIDATpair function reads in the signal
intensity data from the IDAT pairs. The function takes the common prefix
as input and outputs a SigDF object. The SigDF
object is simply an R data.frame with rows representing probes and
columns representing different signal intensity and probe annotations.
The SigDF class will be discussed more in a separate
section below. Using the two examples above, one would run the following
commands.
sdf = readIDAT("idat_folder/204529320035_R06C01") # Example 1
sdf = readIDAT("idat_folder/GSM2178224") # Example 2Note that SeSAMe automatically detects and matches up the green and red signal files for the same sample.
In most cases, we would be working with a folder that contains many
IDATs. Here is where the searchIDATprefixes function comes
in useful. It lets us search all the IDATs in a folder and its
subfolders recursively. Combine this with the R looping functions lets
you process many IDATs without having to specify all IDAT names.
searchIDATprefixes returns a named vector of prefixes with
associated Red and Grn files, which can be
given to readIDATpair:
sdfs = lapply(searchIDATprefixes(system.file(
"extdata/", package = "sesameData")), readIDATpair)which returns a list of “SigDF”s. This is how the
openSesame pipeline is handling your data internally.
DNA methylation level (aka the β values) are defined as β = M/(M+U)
M represents the signal
from methylated allele and U
represents the unmethylated allele. It can be retrieved calling the
getBetas function with the SigDF as input. The
output is a named vector with probe ID as names. For example, the
following commands read in one sample and convert it to β values.
betas = getBetas(sdf) # retrieve beta values
head(betas)## cg00000029 cg00000103 cg00000109 cg00000155 cg00000158 cg00000165
## 0.8237945 0.2100515 0.8125637 0.9152265 0.9105163 0.8196466
CRITICAL: getBetas takes a single
SigDF object as input instead of a list of
SigDFs. A common mistake is to c-merge
multiple SigDFs. To combine multiple SigDFs,
one can use list() instead. To process many
SigDFs, we should combine that with looping functions
lapply or mclapplys, or using the
openSesame pipeline (see below).
β values for Infinium-I probes can also be obtained by summing up the two in-band channel and out-of-band channels. This rescues probes with SNP hitting the extension base and hence switching color channel. More details can be found in Zhou et al 2017.
As you notice the probe ID has the suffix which distinguishes
replicate probes. These replicate probes can be collapsed to the cg
prefix by adding the collapseToPfx=TRUE option in
openSesame and getBetas function (if that
function is used explicitly).
sdf = sesameDataGet('MMB.1.SigDF')
head(openSesame(sdf, prep="")) # before collapsing to prefix## cg00101675_BC21 cg00116289_BC21 cg00211372_TC21 cg00531009_BC21 cg00747726_TC21
## 0.8140124 0.7519231 0.8431105 0.7911640 0.7225773
## cg00896209_TC21
## 0.8377515
head(openSesame(sdf, prep="", collapseToPfx=TRUE)) # collapsed to prefix## cg00101675 cg00116289 cg00211372 cg00531009 cg00747726 cg00896209
## 0.8140124 0.7519231 0.8431105 0.7911640 0.7225773 0.8377515
The NAs in beta values are controlled by the mask column
in the SigDF object. The mask column is a
boolean vector that specifies which probe should be NA-masked when
getBetas function is called. For example, the following
illustrate the relationship between the mask column in
SigDF and NAs in the beta values.
sdf = sesameDataGet('EPIC.1.SigDF')
sdf_withNA = pOOBAH(sdf)
sum(sdf_withNA$mask) # number of probes to be NA-masked## [1] 31631
sum(is.na(getBetas(sdf_withNA))) # should be the same as above## [1] 31631
sum(is.na(getBetas(sdf_withNA, mask=FALSE))) # turn off NA-masking## [1] 0
sum(is.na(getBetas(resetMask(sdf_withNA)))) # reset mask, also expect 0## [1] 0
Note that mask in SigDF does not actually
remove the probe reading but only specifies how SeSAMe currently views
the measurement (as unreliable). This is how SeSAMe gives finer control
over different preprocessing functions, e.g., to make them only use
reliable probes or species-specific. One can add more probes to the mask
with the addMask function. Other functions such as the
detection p-value calculation (e.g., pOOBAH), also modifies
mask. NA-masking influences other normalization and
preprocessing functions. Therefore one should have them set for the
preprocessing methods mentioned below. No mask is set when a new
SigDF is returned from the readIDATpair
function.
The qualityMask function does experiment-independent
masking using pre-defined probe sets. For example, probes with cross
hybridization or are influenced by common polymorphisms can be masked
using this function. For example, the following function add probes with
low mapping quality to the mask:
sdf_masked = qualityMask(sdf, "low_mapq")For a list of options of such pre-defined sets one can use the
listAvailableMasks() function.
listAvailableMasks("EPIC")## [1] "ref_issue" "unmapped" "mapping" "missing_target"
## [5] "channel_switch" "snp5_common" "snp5_GMAF1p" "extension"
## [9] "nonunique" "low_mapq" "sub40_copy" "sub35_copy"
## [13] "sub30_copy" "sub25_copy" "rmsk15" "control"
## [17] "TCGA" "recommended"
sdf_masked = qualityMask(sdf) # using the recommended maskNote there is a “recommended” mask, it’s the merge of several mask
sets specified in the mask_recommended column. Details of
the recommended mask can be found in Zhou et
al. 2017.
Most of the masking comes from two major sources:
Experiment-independent Probe Masking due to
design issues (see Zhou et
al. 2017). This masking can be added by the qualityMask
function. Currently we support EPIC, MM285, HM450 and HM27 and should be
used in openSesame by default.
Experiment-dependent Probe Masking based on
signal detection p-value (Zhou et
al. 2018). Probes with p-value higher than a threshold (default:
0.05, but can also be raised to allow more lenient masking) are masked
(see following for detection p-value calculation using the
pOOBAH method).
As mentioned above, experiment-dependent masking based on signal
detection p-values is effective in excluding artifactual methylation
level reading and probes with too much influence from signal background.
We recommend the pOOBAH algorithm that was based on
Infinium-I probe out-of-band signal for calibrating the distribution of
the signal background:
sdf = sesameDataGet('EPIC.1.SigDF')
sum(pOOBAH(sdf)$mask) # after pOOBAH, expect some probes## [1] 31631
sum(pOOBAH(qualityMask(sdf))$mask) # combined with qualityMask## [1] 132545
sum(is.na(openSesame(sdf, prep="QP"))) # same as above## [1] 132545
prepSesameList() # Q:qualityMask; P:pOOBAHSometimes one would want to calculation detection p-value without
modifying the mask. For example, one may want to upload the p-values to
GEO separately. In those cases one can use the return.pval
option and add pvalue-based mask later.
pvals = pOOBAH(sdf, return.pval=TRUE)
sdf_masked = addMask(sdf, pvals > 0.05) # recommend between 0.05 and 0.2SeSAMe implements the background subtraction based on
normal-exponential deconvolution using out-of-band probes
noob (Triche et
al. 2013) and optionally with more aggressive subtraction
(scrub). One can use following β value distribution plot to see the
effect of background subtraction. For example, the two (M and U) modes
are further polarized.
par(mfrow=c(2,1), mar=c(3,3,2,1))
sesameQC_plotBetaByDesign(sdf, main="Before", xlab="\beta")
sesameQC_plotBetaByDesign(noob(sdf), main="After", xlab="\beta")Dye bias refers to the difference in signal intensity between the two
color channel. SeSAMe offers two flavors of dye bias correction: linear
scaling (dyeBiasCorr) and nonlinear scaling
(dyeBiasCorrTypeINorm). Linear scaling equalize the mean of
all probes from the two color channel.
par(mfrow=c(1,2))
sesameQC_plotRedGrnQQ(dyeBiasCorr(sdf), main="Before") # linear correction
sesameQC_plotRedGrnQQ(dyeBiasNL(sdf), main="After") # nonlinear correctionResidual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes. Under this correction, Infinium-I Red probes and Infinium-I Grn probes have the same distribution of signal. Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes.
Sometimes Infinium-I channel spec is inaccurate in the manifest. We can infer the channel from data.
sdf.InfICorrected = inferInfiniumIChannel(sdf, verbose=TRUE)## [2023-10-14 22:01:43.482913] Infinium-I color channel reset: R>R:92127;G>G:49028;R>G:65;G>R:904
As one can see, most probes remain with the designated channel. A small fraction of the probes is considered “channel-switching”.
As we may’ve noticed, even with proper dye bias correction, there is still remaining differences in β value distribution between Infinium-I and Infinium-II probes due to the signal tail inflation. One solution is to perform ad-hoc quantile normalization. We provide a function similar to the BMIQ algorithm with modifications (one-, two- and three-state mixture and Infinium-I to II matching since Infinium-I is more often influenced by signal inflation based on our experience) to match Infinium-I and Infinium-II beta value distribution. We do not recommend the use of this (or any such methods) for all data unless your data is known to be relatively well-behaving in methylation distribution, for protection of real biological signal. This function assumes Infinium-I/II probes are similar in beta value distribution in the unmethylated and methylated mode.
par(mfrow=c(2,1), mar=c(3,3,2,1))
sesameQC_plotBetaByDesign(sdf, main="Before", xlab="\beta")
sesameQC_plotBetaByDesign(matchDesign(sdf), main="After", xlab="\beta")SeSAMe provides functions to create QC plots. Some functions takes
sesameQC as input while others directly plots the SigDF objects. For
example, the sesameQC_plotBar function takes a list of
sesameQC objects and creates bar plot for each metric calculated.
The fraction of detection failures are signs of masking due to
variety of reasons including failed detection, high background, putative
low quality probes etc. To compare samples in terms of detection success
rate, one can use the sesameQC_plotBar function in the
following way:
sesameQC_plotBar(lapply(sdfs, sesameQC_calcStats, "detection"))Dye bias is shown by an off-diagonal q-q plot of the red (x-axis) and green signal (y-axis).
sesameQC_plotRedGrnQQ(sdf)Beta value is more influenced by signal background for probes with low signal intensities. The following plot shows this dependency and the extent of probes with low signal intensity.
sesameQC_plotIntensVsBetas(sdf)## Loading required namespace: KernSmooth
## Loading required namespace: pals
One can output the allele frequencies and output a VCF file with genotypes. This requires additional SNP information (ref and alt alleles), which can be downloaded using the following code:
library(tidyverse)
anno = read_tsv(sesameAnno_download("EPIC.hg38.snp.tsv.gz"))
vcf = formatVCF(sdf, anno)
head(vcf)Note that the VCF combines explicity SNPs and implicit SNPs that
alters color channels of Infinium-I probes. For the REF and ALT slots,
we used H and D to denote A/C/T and A/G/T
respectively.
Some records match common mutations with rs IDs:
head(vcf[grepl("rs",vcf$INFO),"INFO",drop=FALSE])One can output to actual VCF file with a header by
formatVCF(sdf, vcf=path_to_vcf). See https://github.com/zhou-lab/InfiniumAnnotationV1/tree/main/Anno/EPIC
for more annotation files, including EPICv2.
anno = read_tsv(sesameAnno_download("EPICv2.hg38.snp.tsv.gz"))
head(formatVCF(sdf, anno, vcf="~/Downloads/test_output.vcf"))Extra SNP allele frequencies can be obtained by summing up methylated and umethylated alleles of color-channel-switching probes. These allele frequencies can be combined with explicit SNP probes:
head(getAFs(sdf)) # combined## rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
## 0.4731873 0.9265673 0.4825240 0.4711608 0.9561655 0.9326652
head(getAFTypeIbySumAlleles(sdf)) # AFs from only color-channel-switching## cg00038584 cg00408315 cg00413617 cg00488829 cg00519463 cg00523683
## 0.10148976 0.17358858 0.09358459 0.69068952 0.10698911 0.05029466
One can plot explicit and Infinium-I-derived SNPs to identify potential sample swaps.
sdfs <- sesameDataGet("EPIC.5.SigDF.normal")
sesameQC_plotHeatSNPs(sdfs)Infinium platforms are intrinsically robust to incomplete bisulfite conversion as non-converted probes would fail to hybridize to the target. Residual incomplete bisulfite conversion can be quantified using GCT score based on C/T-extension probes. Details of this method can be found in Zhou et al. 2017. The closer the score to 1.0, the more complete the bisulfite conversion.
sdf <- sesameDataGet('EPIC.1.SigDF')
bisConversionControl(sdf)## [1] 1.067769
For newer arrays, one need to extract extR and extA from the manifest. For example,
mft = sesameAnno_buildManifestGRanges(
sesameAnno_download("EPICv2.hg38.manifest.tsv.gz"),
columns = "nextBase")
extR = names(mft)[!is.na(mft$nextBase) & mft$nextBase=="R"]
extA = names(mft)[!is.na(mft$nextBase) & mft$nextBase=="A"]
bisConversionControl(sdf, extR, extA)SeSAMe provide utilities to view methylation reading in a track view. Next, we will demonstrate how to create track view with transcript position marked focusing on a genomic region, a gene and specific probes. Let’s first load example HM450 data
betas <- sesameDataGet('HM450.10.TCGA.PAAD.normal')To visualize probes from arbitrary region, we will call
visualizeRegion:
visualizeRegion(
'chr19',10260000,10380000, betas, platform='HM450',
show.probeNames = FALSE)Zero to full methylation are displayed using the jet color scheme with blue representing no methylation and red full methylation.
To visualize all probes from a gene, we will call
visualizeGene
visualizeGene('DNMT1', betas, platform='HM450')To visualize genome neighborhood using probe names, we will call
visualizeProbes:
visualizeProbes(c("cg02382400", "cg03738669"), betas, platform='HM450')SeSAMe performs copy number variation in three steps: 1) normalizes the signal intensity using a copy-number-normal data set; 2) groups adjacent probes into bins; 3) runs DNAcopy internally to group bins into segments.
sdf = sesameDataGet('EPIC.1.SigDF')
segs <- cnSegmentation(sdf)To visualize segmentation in SeSAMe,
visualizeSegments(segs)The sesame copy number inference was inspired by the excellent conumee package and wraps around the DNAcopy package that implements circular binary segmentation algorithm.
Let’s load beta values from SeSAMeData
library(SummarizedExperiment)
betas = assay(sesameDataGet("MMB.10.SE.tissue"))[,1:2]Compare mouse array data with mouse tissue references. This will return a grid object that contrasts the traget sample with pre-built mouse tissue reference.
compareReference(sesameDataGet("MM285.tissueSignature"), betas)One can also visualize the actual SNP data for strain inference by the following. The target sample is plotted as a vertical bar on the right of the signature heatmap.
sdf = sesameDataGet("MMB.1.SigDF") # an example dataset
inferStrain(sdf, return.strain = TRUE) # return strain as a string## [1] "NOD_ShiLtJ"
compareMouseStrainReference(getBetas(sdf))As expected, the variant configuration is consistent with the NOD strain.
There are many creative ways one can visualize enrichment results. Here we demonstrate a few popular ones supported in sesame. Dot plot:
library(SummarizedExperiment)
df <- rowData(sesameDataGet('MM285.tissueSignature'))
query <- df$Probe_ID[df$branch == "fetal_liver" & df$type == "Hypo"]
results <- testEnrichment(query, "TFBS")KYCG_plotDot(results)or a bar plot:
library(ggplot2)
library(wheatmap)
p1 <- KYCG_plotBar(results, label=TRUE)
p2 <- KYCG_plotBar(results, y="estimate") + ylab("log2(Odds Ratio)") +
xlab("") + theme(axis.text.y = element_blank())
WGG(p1) + WGG(p2, RightOf(width=0.5, pad=0))or a volcano plot. Here we need a two-tailed test to show both enrichment and depletion (the default is one-tailed):
query <- df$Probe_ID[df$branch == "fetal_brain" & df$type == "Hypo"]
results_2tailed <- testEnrichment(query, "TFBS", alternative = "two.sided")
KYCG_plotVolcano(results_2tailed)and a Waterfall plot:
KYCG_plotWaterfall(results)## 316 extremes are capped.
If you have a list of cg groups and they were tested against the same set of databases, you can make a point range plot to summarize the overall trend:
## pick some big TFBS-overlapping CpG groups
cg_lists <- KYCG_getDBs("MM285.TFBS", silent=TRUE)
queries <- cg_lists[(sapply(cg_lists, length) > 40000)]
res <- lapply(queries, function(q) {
testEnrichment(q, "MM285.chromHMM", silent=TRUE)})
## note the input of the function is a list of testEnrichment outputs.
KYCG_plotPointRange(res)plot enrichment over metagene:
KnowYourCG builds in several metagene/meta-feature coordinates. One can test enrichment over meta genes or simply plot the mean over metagenes:
sdf <- sesameDataGet("EPIC.1.SigDF")
KYCG_plotMeta(getBetas(sdf))Here we picked some transcription factor binding sites-overlapping CpGs and tested against chromHMM states. As expected, most of these CGs are enriched at promoters and enhancers but depleted at heterchromatic regions.
Manhattan plots are an intuitive way to visualize the results from
large scale “omics” studies that investigate genetic or epigenetic
features on genome wide scales across large groups of samples. Here we
demonstrate how the KYCG_plotManhattan() function can be
used to visualize the chromosomal distribution and significance level of
CpG probes from an EWAS study. KYCG_plotManhattan() takes a
named numeric vector with CpG probeIDs as names and numeric values
(P,Q,logFC,etc) obtained from analysis as values. To plot EWAS results,
simply load the necessary libraries and pass the named numeric vector to
KYCG_plotManhattan()
By default, KYCG_plotManhattan() will infer the array
platform from the probeIDs and retrieve the corresponding GRanges object
to obtain probe coordinates. However, the platform and GRanges objects
can be pre - specified if offline. Color, title, and the threshold to
label significant probes on the plot can also be specified:
library(ggrepel)
smry_pvals = readRDS(url("https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Test/20220413_testManhattan.rds"))
KYCG_plotManhattan(-log10(smry_pvals), platform="HM450",
title="Differentially Methylated CpGs - EWAS Aging",
col=c("navy","skyblue"), ylabel = bquote(-log[10](P~value)), label_min=30) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))In addition to hypothesis testing, knowYourCG also uses the curated database sets for feature engineering. We have a pre-curated summarized experiment containing a samplesheet and beta value matrix corresponding to about 467 MM285 samples with 20k probes. The samplesheet includes UIDs pertaining to the sample and several categorical/numerical features. To use this data for a linear model, we will extract the most relevant prevalent features.
library(SummarizedExperiment)
betas = assay(sesameDataGet('MM285.467.SE.tissue20Kprobes'))We have found that it is computationally expensive to perform a linear model/generalized linear model on a feature set of individual CpGs. Additionally, interpreting the mechanism the significantly contributing CpGs is non-trivial due to their complex interactions. We hope to leverage these pre-curated database sets by using their beta value summary statistics as features instead. We will calculate the summary statistics for the betas matrix using a list of database sets. The default is to calculate the mean.
stats <- dbStats(betas, 'MM285.chromHMM')## Selected the following database groups:
## 1. KYCG.MM285.chromHMM.20210210
head(stats[, 1:5])## Enh EnhG EnhLo EnhPois EnhPr
## [1,] 0.4747438 0.6719724 0.6432090 0.5546328 0.3883567
## [2,] 0.4629993 0.6726350 0.5992930 0.5677431 0.3881347
## [3,] 0.4857179 0.6724716 0.6359722 0.5672481 0.3889514
## [4,] 0.4985426 0.6888043 0.6279642 0.5803560 0.3987007
## [5,] 0.4720005 0.6698668 0.6420811 0.5530925 0.3851314
## [6,] 0.4978402 0.6910251 0.6072940 0.5807078 0.4133535
Just from the few database set means above, we can see that TSS are consistently hypomethylated, which is consistent with known biology.
library(wheatmap)
WHeatmap(both.cluster(stats)$mat, xticklabels=TRUE,
cmp=CMPar(stop.points=c("blue","yellow")))We can also use KYCG database to annotate an arbitrary probe set.
This can be done using the KYCG_annoProbes function:
query <- names(sesameData_getManifestGRanges("MM285"))
anno <- KYCG_annoProbes(query, "designGroup", silent = TRUE)Previously, the signal was implemented an S4 implementation in
SigSet complies with Bioconductor guidelines, and for
backwards compatibility, SigSet can be transformed to a
SigDF using the SigSetToSigDF function
sesame:::SigSetToSigDF(sset).
SigSet/SigDF can be converted back and
forth from minfi::RGChannelSet in multiple ways. One can
sesamize a minfi RGChannelSet which returns a
GenomicRatioSet. See sesamize for more
detail.
sessionInfo()## R version 4.3.0 (2023-04-21)
## Platform: aarch64-apple-darwin22.3.0 (64-bit)
## Running under: macOS Ventura 13.5.2
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.3.0.devel/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.3.0.devel/lib/R/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggrepel_0.9.4 wheatmap_0.2.0
## [3] SummarizedExperiment_1.31.1 Biobase_2.61.0
## [5] GenomicRanges_1.53.2 GenomeInfoDb_1.37.6
## [7] IRanges_2.35.3 S4Vectors_0.39.3
## [9] MatrixGenerics_1.13.1 matrixStats_1.0.0
## [11] lubridate_1.9.3 forcats_1.0.0
## [13] stringr_1.5.0 purrr_1.0.2
## [15] readr_2.1.4 tidyr_1.3.0
## [17] tibble_3.2.1 ggplot2_3.4.4
## [19] tidyverse_2.0.0 knitr_1.44
## [21] dplyr_1.1.3 sesame_1.19.10
## [23] sesameData_1.19.0 ExperimentHub_2.9.1
## [25] AnnotationHub_3.9.2 BiocFileCache_2.9.1
## [27] dbplyr_2.3.4 BiocGenerics_0.47.0
## [29] rmarkdown_2.25 R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 bitops_1.0-7
## [3] rlang_1.1.1 magrittr_2.0.3
## [5] compiler_4.3.0 RSQLite_2.3.2.9004
## [7] maps_3.4.1 png_0.1-8
## [9] vctrs_0.6.4 reshape2_1.4.4
## [11] pkgconfig_2.0.3 crayon_1.5.2
## [13] fastmap_1.1.1 XVector_0.41.1
## [15] ellipsis_0.3.2 labeling_0.4.3
## [17] utf8_1.2.3 promises_1.2.1
## [19] RPMM_1.25 tzdb_0.4.0
## [21] preprocessCore_1.63.1 bit_4.0.5
## [23] xfun_0.40 pals_1.8
## [25] zlibbioc_1.47.0 cachem_1.0.8
## [27] jsonlite_1.8.7 blob_1.2.4
## [29] later_1.3.1 DelayedArray_0.27.10
## [31] BiocParallel_1.35.4 interactiveDisplayBase_1.39.0
## [33] cluster_2.1.4 parallel_4.3.0
## [35] bslib_0.5.1 stringi_1.7.12
## [37] RColorBrewer_1.1-3 DNAcopy_1.73.0
## [39] jquerylib_0.1.4 Rcpp_1.0.11
## [41] timechange_0.2.0 httpuv_1.6.11
## [43] illuminaio_0.43.0 Matrix_1.6-1.1
## [45] tidyselect_1.2.0 dichromat_2.0-0.1
## [47] abind_1.4-5 yaml_2.3.7
## [49] codetools_0.2-19 curl_5.1.0
## [51] lattice_0.21-9 plyr_1.8.9
## [53] shiny_1.7.5.1 withr_2.5.1
## [55] KEGGREST_1.41.4 askpass_1.2.0
## [57] evaluate_0.22 Biostrings_2.69.2
## [59] pillar_1.9.0 BiocManager_1.30.22
## [61] filelock_1.0.2 KernSmooth_2.23-20
## [63] generics_0.1.3 vroom_1.6.4
## [65] RCurl_1.98-1.12 BiocVersion_3.18.0
## [67] hms_1.1.3 munsell_0.5.0
## [69] scales_1.2.1 BiocStyle_2.29.2
## [71] xtable_1.8-4 glue_1.6.2
## [73] mapproj_1.2.11 tools_4.3.0
## [75] grid_4.3.0 base64_2.0.1
## [77] AnnotationDbi_1.63.2 colorspace_2.1-0
## [79] GenomeInfoDbData_1.2.10 cli_3.6.1
## [81] rappdirs_0.3.3 fansi_1.0.5
## [83] S4Arrays_1.1.6 gtable_0.3.4
## [85] sass_0.4.7 digest_0.6.33
## [87] SparseArray_1.1.12 farver_2.1.1
## [89] memoise_2.0.1 htmltools_0.5.6.1
## [91] lifecycle_1.0.3 httr_1.4.7
## [93] mime_0.12 MASS_7.3-60
## [95] openssl_2.1.1 bit64_4.0.5