注:笔记内容参考官方教程用法摸索而成。因涉及步骤较繁琐,如有错误,欢迎指出!
R包MARVEL是由牛津大学MRC Weatherall分子医学研究所团队开发的,用于分析单细胞水平的可变剪切事件。相关文章于2023年1月在Nucleic Acids Research期刊发表,在其github页面分享了MARVEL工具的分析流程,在此学习、记录如下。
Pape:https://doi.org/10.1093/nar/gkac1260
Github:https://github.com/wenweixiong/MARVEL
Plate pipeline:https://wenweixiong.github.io/MARVEL_Plate.html
Droplet pipeline:https://wenweixiong.github.io/MARVEL_Droplet.html
可变剪切
- 关于可变剪切(Alternative splicing,AS)的基础,可以参看一篇中文教程:169-跟着刘小泽一起探索可变剪切分析 | BIOINFOPLANET;其中关于基因结构组成,可以参考之前的笔记:人类基因组基础知识与下载查询 | Li's Bioinfo-Blog。
- 过去,针对Bulk RNA-seq数据的可变剪切分析工具已经有很多,例如之前学习的IsoformSwitchAnalyzeR包。MARVEL是个人目前了解到的第一个针对单细胞转录数据AS分析工具。
单细胞数据类型
- AS分析的核心是通过检测比对到splice junction位点的read(例如,跨越两个外显子)进行分析。
- 目前主流的单细胞测序技术主要分为Droplet-based 10X方法与Plate-based Smart-seq方法。
- 10X技术由于仅对cDNA的3'端部分进行测序,只能对该区域涉及的splice junction进行分析,且无法具体分析事件类型
- Smart-seq技术虽然通量低,但是全长转录本的单细胞测序原理,可以进行更为深入的AS分析。
一、Plate-based pipeline
1、上游分析
1.1 工作环境
两个同级的目录(1)basic:参考数据以及代码脚本 (2)proj1:具体的分析项目文件
- 后续分析均在 proj1/ 路径下操作
tree basic/ -L 2
# basic/
# ├── gtf
# │ ├── gencode.v31.annotation.gtf
# │ ├── GRCh38.primary_assembly.genome.fa
# │ ├── Log.out
# │ ├── rsem_index
# │ └── star_index
# ├── rscript_Anno_SJ_rMATs.R
# ├── rscript_bedtools_input.R
# └── single_sample_upstream.sh
tree proj1/ -L 2
# proj1/
# ├── SJ_phenoData.txt
# └── work
# ├── fq
# ├── intron
# ├── merge
# ├── rmats
# ├── rsem
# └── star
1.2 conda环境
# 名为marvel_plate的分析环境
conda create -n marvel_plate python=3.7
conda activate marvel_plate
conda install -c hcc aspera-cli
conda install -c bioconda star=2.7.1a rmats=4.1.0 -y
conda install -c bioconda bedtools samtools rsem -y
conda install -c r r-base r-essentials r-tidyverse -y
1.2 准备数据
(1)测序数据
- 官方手册分析了如下数据集的 192个单细胞数据,以其中一个演示流程
# https://www.ebi.ac.uk/ena/browser/view/ERR1562083
id=ERR1562083
# id=$1
mkdir -p ./work/fq/${id}
ascp -QT -l 300m -P33001 \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_1.fastq.gz ./work/fq/${id}
ascp -QT -l 300m -P33001 \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_2.fastq.gz ./work/fq/${id}
(2)基因组数据
在 ./basic/gtf 目录下
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/
## STAR索引
STAR --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v31.annotation.gtf \
--runThreadN 40
## RSEM索引
rsem-prepare-reference \
--gtf gencode.v31.annotation.gtf \
--star -p 20 \
GRCh38.primary_assembly.genome.fa \
rsem_index/hg38
(3)教程数据
- 官方教程提供了各个分析步骤的示范文件,可供之后分析作为参考。
- http://datashare.molbiol.ox.ac.uk/public/wwen/MARVEL/Tutorial/Plate/Data.zip
ls Data
# GTF MARVEL rMATS R Object RSEM SJ
1.3 分析步骤
1.3.1 单样本4步分析
(1)单样本STAR比对
将每个样本的Fastq文件使用STAR软件进行比对,产生每个splice junction count matrix
# Reference
less -SN ./Data/SJ/SJ.txt
mkdir -p ./work/star/${id}
ref_idx_star=../basic/gtf/star_index
ref_gtf=../basic/gtf/gencode.v31.annotation.gtf
echo ${id} is runing......
#1st pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--outSAMtype None
#2nd pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--sjdbFileChrStartEnd ./work/star/${id}/*SJ.out.tab \
--sjdbGTFfile ${ref_gtf} \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM XS \
--quantMode TranscriptomeSAM
(2)单样本rMATs注释
根据上述比对BAM结果,注释出每个基因(转录本)涉及的可变剪切事件类型。使用rMATs软件,可注释出5种类型:SE、RI、MXE、A5SS、A3SS。
find ./Data/rMATS/ -name *featureData.txt
# ./Data/rMATS/SE/SE_featureData.txt
# ./Data/rMATS/A3SS/A3SS_featureData.txt
# ./Data/rMATS/MXE/MXE_featureData.txt
# ./Data/rMATS/RI/RI_featureData.txt
# ./Data/rMATS/A5SS/A5SS_featureData.txt
less -SN ./Data/rMATS/SE/SE_featureData.txt
mkdir -p ./work/rmats/${id}
echo ${id} is running......
echo "./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam" > ./work/star/${id}/BAM_fls.txt
rmats.py \
--b1 ./work/star/${id}//BAM_fls.txt \
--gtf ${ref_gtf} \
--od ./work/rmats/${id} \
--tmp ./work/rmats/${id}/tmp \
-t paired \
--readLength 125 \
--variable-read-length \
--nthread 16 \
--statoff
# readLength参考如下
# zcat ./work/fq/${id}_1.fastq.gz | awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' | less
(3)单样本Intron矩阵
统计Reads对于每个内含子的比对情况,逐碱基累加作为比对结果。
# reference
less -SN Data/MARVEL/PSI/RI/Counts_by_Region.txt
mkdir -p ./work/intron/${id}
echo ${id} is running......
samtools view -H ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam | \
grep SQ | cut -f 2 | awk '{ sub(/^SN:/, ""); print;}' > ./work/star/${id}/sorted_chr_in_bam.txt
# 预定义R脚本,生成bedtools计算所需的两个文件:染色体大小以及内含子坐标
# https://github.com/lishensuo/utils/blob/main/marvel/rscript_bedtools_input.R
Rscript ../basic/rscript_bedtools_input.R ${id}
bedtools coverage \
-g ./work/intron/${id}/hg38.chrom.sizes.txt \
-split \
-sorted \
-a ./work/intron/${id}/RI_Coordinates_sorted.bed \
-b ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam > \
./work/intron/${id}/intron_count.txt \
-d
(4)单样本gene矩阵
最后需要对基因表达水平进行定量。教程推荐使用RSEM软件,并进行TPM类似的标准化处理
# reference
less -SN ./Data/RSEM/TPM.txt
mkdir -p ./work/rsem/${id}
rsem-calculate-expression \
--bam \
--paired-end \
-p 16 \
./work/star/${id}/${id}.Aligned.toTranscriptome.out.bam \
../basic/gtf/rsem_index/hg38 \
./work/rsem/${id}/${id}
(1)如上,由于4步的分析过程中产生的中间数据可能超过数百兆;且SMART-seq往往涉及成百上千的细胞。为节约空间,可在每个样本分析完成后,删除不必要的数据。
rm ./work/fq/${id}/${id}*
rm ./work/star/${id}/*bam
rm -rf ./work/rmats/${id}/tmp/*
rm ./work/rsem/${id}/${id}.transcript.bam
rm ./work/rsem/${id}/${id}.isoforms.results(2)多样本批量后台运行方式
cat ../basic/single_sample_upstream.sh
chmod u+x ../basic/single_sample_upstream.sh
# 单样本测试
# ../basic/single_sample_upstream.sh ERR1562273
# parallel多线程批量运行
nohup parallel -j 5 ../basic/single_sample_upstream.sh {} ::: \
$(grep -v "sample" ./SJ_phenoData.txt | awk '{print $1}') \
1> ./work/single_sample_upstream.log 2>&1 &
1.3.2 整合多样本结果
即分步整合上游4个步骤中的每个步骤的所有样本结果;均使用R语言完成
library(tidyverse)
library(parallel)
sample_meta = data.table::fread("./SJ_phenoData.txt",data.table=F)
samples=sample_meta$sample.id
- step1: 汇总剪切位点矩阵
sj.list = mclapply(samples, function(sample){
sj_sp = data.table::fread(paste0("./work/star/",sample,"/",sample,".SJ.out.tab"), data.table=F)
if(nrow(sj_sp)==0) return(NULL)
sj_sp = sj_sp %>%
dplyr::mutate(`coord.intron`=paste0(V1,":",V2,":",V3)) %>%
dplyr::select(`coord.intron`, V7)
colnames(sj_sp)[2] = sample
return(sj_sp)
},mc.cores=10)
names(sj.list) = samples
sj.list = sj.list[!unlist(lapply(sj.list, is.null))]
sj_res = sj.list[[1]]
for(x in sj.list[-1]){
sj_res = dplyr::full_join(sj_res, x)
}
for(sample in setdiff(samples,names(sj.list))){
sj_res[,sample]=NA
}
dim(sj_res)
write.table(sj_res, file=paste0("./work/merge/SJ.txt"),
quote=F, row.names=F, sep="\t")
- step2:整理AS注释结果 这一步最为复杂,需要理解每一种剪切事件的定义以及正负链的差异参考官方示意:https://wenweixiong.github.io/Splicing_Nomenclature
# https://github.com/lishensuo/utils/blob/main/marvel/rscript_Anno_SJ_rMATs.R
source(paste0("../basic/rscript_Anno_SJ_rMATs.R"))
AS_types = c("SE", "RI", "MXE", "A3SS", "A5SS")
for(AS_type in AS_types){
# AS_type = AS_types[1]
print(AS_type)
fls_AS = list.files(paste0("./work/rmats"), pattern=paste0("fromGTF.",AS_type,".txt"),
recursive=T, full.name=T)
if (AS_type=="SE"){
func = se_func
} else if (AS_type=="RI"){
func = ri_func
} else if (AS_type=="MXE"){
func = mxe_func
} else if (AS_type=="A3SS"){
func = a3ss_func
} else if (AS_type=="A5SS"){
func = a5ss_func
}
df_AS = mclapply(fls_AS, function(fls){
rmats_AS = data.table::fread(fls, data.table=F)
rmats_AS = rmats_AS %>%
dplyr::mutate(tran_id = func(.)) %>%
dplyr::rename(gene_id=GeneID,gene_short_name=geneSymbol) %>%
dplyr::select(tran_id, gene_id, gene_short_name)
return(rmats_AS)
},mc.cores=10) %>% do.call(rbind, .) %>%
dplyr::distinct()
write.table(df_AS, row.names=F, sep="\t",quote=F,
file=paste0("./work/merge/as_",AS_type,"_featureData.txt"))
}
- step3:汇总intron矩阵
int_mt.list = mclapply(samples, function(sample){
# sample = samples[1]
int_mt = data.table::fread(paste0("./work/intron/",sample,"/intron_count.txt"))
int_mt = int_mt %>%
dplyr::mutate(`coord.intron`=paste(V1,V2,V3, sep=":")) %>%
dplyr::group_by(coord.intron) %>%
dplyr::summarize(count=sum(V5)) %>%
dplyr::select(coord.intron, count) %>% as.data.frame()
colnames(int_mt)[2] = sample
return(int_mt)
},mc.cores=10)
intron_res = int_mt.list[[1]]
for(x in int_mt.list[-1]){
intron_res = dplyr::full_join(intron_res, x)
}
dim(intron_res)
write.table(intron_res, file=paste0("./work/merge/intron_count_by_region.txt"),
quote=F, row.names=F, sep="\t")
- 汇总gene矩阵
gene_mt.list = mclapply(samples, function(sample){
fls=paste0("./work/rsem/",sample,"/",sample,".genes.results")
gene_mt = data.table::fread(fls, data.table=F) %>%
dplyr::select(gene_id, TPM)
colnames(gene_mt)[2] = sample
return(gene_mt)
},mc.cores=10)
gene_res = gene_mt.list[[1]]
for(x in gene_mt.list[-1]){
gene_res = dplyr::full_join(gene_res, x)
}
dim(gene_res)
write.table(gene_res, file=paste0("./work/merge/rsem_tpm.txt"),,
quote=F, row.names=F, sep="\t")
2、下游分析
后续根据教程提供的数据学习,需要安装的主要的MARVEL包以及其它辅助分析包,具体可参考教程。
# 建议再单独建立一个conda R环境
install.packages("MARVEL")
2.1 创建marvel对象
需要准备七类文件,读入为R语言对象
样本表型data.frame:第一列名为sample.id
剪切位点计数矩阵:第一列名为coord.intron
剪切事件类型注释列表:由数据框包装而出的list,每个数据库至少包含tran_id与gene_id
内含子计数矩阵:第一列名为coord.intron
基因表达矩阵:经标准化处理(RPKM/FPKM/TPM),暂时不要log转换
基因类型data.frame:至少包含3列,gene_id, gene_short_name, gene_type
上游分析所用到的GTF数据
library(tidyverse)
library(MARVEL)
# (1) 细胞表型分组
df.pheno = data.table::fread("./SJ_phenoData.txt",data.table=F)
# (2) Gene文件
df.tpm <- read.table("./work/merge/rsem_tpm.txt", sep="\t", header=TRUE)
dim(df.tpm)
# (3) GTF文件
gtf <- data.table::fread("../basic/gtf/gencode.v31.annotation.gtf", sep="\t", header=FALSE, na.strings="NA",quote="\"") %>%
as.data.frame()
# (4) Gene类型
gene.feature = subset(gtf, V3=="gene")
df.tpm.feature <- str_split(gene.feature$V9, ";", simplify=T) %>%
as.data.frame() %>%
dplyr::mutate(gene_id=str_match(V1, '"(.*)"')[,2]) %>%
dplyr::mutate(gene_short_name=str_match(V3, '"(.*)"')[,2]) %>%
dplyr::mutate(gene_type=str_match(V2, '"(.*)"')[,2]) %>%
dplyr::select(gene_id, gene_short_name, gene_type) %>%
dplyr::filter(gene_id %in% df.tpm$gene_id)
df.tpm.feature = df.tpm.feature[match(df.tpm$gene_id, df.tpm.feature$gene_id),]
rownames(df.tpm.feature) = seq(nrow(df.tpm.feature))
# (5) SJ文件
sj = data.table::fread("./work/merge/SJ.txt", sep="\t", header=TRUE, na.strings="NA") %>%
as.data.frame()
dim(sj)
# (6) AS文件
df.feature.list = lapply(c("SE", "MXE", "RI", "A5SS", "A3SS"), function(x){
# x = "SE"
df.feature = read.table(paste0("./work/merge/as_",x,"_featureData.txt"),
sep="\t", header=TRUE, na.strings="NA") %>%
dplyr::distinct(tran_id, .keep_all=TRUE) %>%
dplyr::left_join(df.tpm.feature)
return(df.feature)
})
names(df.feature.list) <- c("SE", "MXE", "RI", "A5SS", "A3SS")
# (7) Intron文件
df.intron.counts <- data.table::fread("./work/merge/intron_count_by_region.txt",
sep="\t", header=TRUE, na.strings="NA") %>%
as.data.frame()
dim(df.intron.counts)
## MARVEL object
marvel <- CreateMarvelObject(SpliceJunction=sj,
SplicePheno=df.pheno,
SpliceFeature=df.feature.list,
IntronCounts=df.intron.counts,
GeneFeature=df.tpm.feature,
Exp=df.tpm,
GTF=gtf
)
2.2 预处理步骤
(1)基因表达log转换
marvel <- TransformExpValues(MarvelObject=marvel,
offset=1,
transformation="log2",
threshold.lower=1
)
(2)检测额外AS事件
# 如果基因表达水平整体较低,则不考虑该基因的可变剪切事件
# min.expr:判定单个细胞是否表达基因的阈值
# min.cells: 基于上述条件,判定细胞群表达该基因的阈值
marvel <- DetectEvents(MarvelObject=marvel,
min.cells=50, #细胞群的表达百分比
min.expr=1, #认为细胞表达该基因的阈值
track.progress=FALSE,
EventType="AFE"
)
marvel <- DetectEvents(MarvelObject=marvel,
min.cells=50,
min.expr=1,
track.progress=FALSE,
EventType="ALE"
)
length(marvel$SpliceFeature)
# 7
(3)计算PSI值
计算一个基因对于特定剪切事件的PSI值,这是后续分析的基础
# CoverageThreshold: 支持该剪切事件的最小reads数,小于该阈值标记为NA
# UnevenCoverageMultiplier: 针对SE与MXE
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
UnevenCoverageMultiplier=10,
EventType="SE"
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
UnevenCoverageMultiplier=10,
EventType="MXE"
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
EventType="RI",
thread=4 # only support RI
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
EventType="A5SS"
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
EventType="A3SS"
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
EventType="AFE"
)
marvel <- ComputePSI(MarvelObject=marvel,
CoverageThreshold=10,
EventType="ALE"
)
(4)保存中间结果
- 首先检查不同矩阵与对应的meta行列名信息一致
marvel <- CheckAlignment(MarvelObject=marvel, level="SJ")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing")
marvel <- CheckAlignment(MarvelObject=marvel, level="gene")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing and gene")
- 可进一步选取感兴趣的细胞群
index.1 <- which(df.pheno$cell.type %in% c("iPSC", "Endoderm"))
index.2 <- which(df.pheno$qc.seq=="pass")
index <- intersect(index.1, index.2)
sample.ids <- df.pheno[index, "sample.id"]
marvel <- SubsetSamples(MarvelObject=marvel,
sample.ids=sample.ids
)
- 保存中间结果,方便后续分析
save(marvel, file="./MARVEL2.RData")
# 后续以官方提供的示例数据学习
# load("../Data/MARVEL.RData")
2.3 细胞群AS特征分析
以其中84个iPSC细胞群进行演示
sample.ids = marvel$SplicePheno %>%
dplyr::filter(cell.type=="iPSC") %>%
dplyr::pull(sample.id)
(1)AS类型占比
marvel <- CountEvents(MarvelObject=marvel,
sample.ids=sample.ids,
min.cells=25
)
# 对于一个基因至少有25个细胞的AS PSI值为非缺失值,则判定发生了可变剪切事件
marvel$N.Events$Table
# event_type freq pct
# 1 SE 5270 40.1523810
# 3 RI 2030 15.4666667
# 6 AFE 1789 13.6304762
# 5 A3SS 1746 13.3028571
# 4 A5SS 1494 11.3828571
# 7 ALE 724 5.5161905
# 2 MXE 72 0.5485714
marvel$N.Events$Plot
# ggplot绘图体系,下同
(2)AS modality
判断细胞群中每个基因的每种可变剪切的PSI分布类型,如下可分为7类
marvel <- AssignModality(MarvelObject=marvel,
sample.ids=sample.ids,
min.cells=25,
seed=1
)
marvel$Modality$Results[1:5, c("tran_id", "event_type", "gene_id", "gene_short_name", "modality.bimodal.adj")]
table(marvel$Modality$Results$modality.bimodal.adj, marvel$Modality$Results$event_type)
# A3SS A5SS AFE ALE MXE RI SE
# Bimodal 20 7 4 6 0 4 17
# Excluded.Dispersed 386 484 538 189 18 783 1260
# Excluded.Primary 289 282 477 101 22 712 1049
# Included.Dispersed 563 385 372 237 12 352 1685
# Included.Primary 439 291 346 176 17 47 1184
# Middle 9 10 9 3 2 51 15
# Multimodal 40 35 43 12 1 81 60
- 后续有两种统计方式:仅按modality分,同时按modality与AS type分。以后者作为演示
marvel <- PropModality(MarvelObject=marvel,
modality.column="modality.bimodal.adj",
modality.type="extended",
event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "AFE", "ALE"),
across.event.type=TRUE,
prop.test="chisq",
prop.adj="fdr",
xlabels.size=8
)
marvel$Modality$Prop$BarChart$Stats
head(marvel$Modality$Prop$BarChart$Table)
marvel$Modality$Prop$BarChart$Plot
2.4 细胞群差异分析
(1)差异基因分析
cell.group.g1 = marvel$SplicePheno %>%
dplyr::filter(cell.type=="iPSC") %>%
dplyr::pull(sample.id) # (reference)
cell.group.g2 = marvel$SplicePheno %>%
dplyr::filter(cell.type=="Endoderm") %>%
dplyr::pull(sample.id)
marvel <- CompareValues(MarvelObject=marvel,
cell.group.g1=cell.group.g1,
cell.group.g2=cell.group.g2,
min.cells=3,
method="wilcox",
method.adjust="fdr",
level="gene",
show.progress=FALSE
)
marvel$DE$Exp$Table[1:5, ]
(2)差异AS分析
# time consuming
marvel <- CompareValues(MarvelObject=marvel,
cell.group.g1=cell.group.g1,
cell.group.g2=cell.group.g2,
min.cells=25,
method=c("ad", "dts"), #同时使用两种的推荐差异方法
method.adjust="fdr",
level="splicing",
event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "ALE", "AFE"),
show.progress=FALSE
)
names(marvel$DE$PSI$Table)
head(marvel$DE$PSI$Table[["ad"]])
head(marvel$DE$PSI$Table[["dts"]])
(3)差异基因&AS分析
- 对差异可变剪切的基因执行基因表达差异分析
marvel <- CompareValues(MarvelObject=marvel,
cell.group.g1=cell.group.g1,
cell.group.g2=cell.group.g2,
psi.method=c("ad", "dts"),
psi.pval=c(0.10, 0.10),
psi.delta=0,
method.de.gene="wilcox",
method.adjust.de.gene="fdr",
downsample=FALSE,
show.progress=FALSE,
level="gene.spliced"
)
head(marvel$DE$Exp.Spliced$Table)
对于上述差异分析结果均可快速火山图,标记marker基因等
2.5 差异AS进阶分析
(1)差异PSI分类
- Explicit:明显的变化;Implicit:有限的变化;Restricted:微小的变化
marvel <- ModalityChange(MarvelObject=marvel,
method=c("ad", "dts"),
psi.pval=c(0.10, 0.10)
)
table(marvel$DE$Modality$Table$modality.change)
# Explicit Implicit Restricted
# 161 280 1113
head(marvel$DE$Modality$Table)
marvel$DE$Modality$Plot
cell.group.list <- list("iPSC"=cell.group.g1,
"Endoderm"=cell.group.g2
)
tran_id <- "chr3:129171771:129171634|129171655:-@chr3:129171446:129171538"
# tran_id <- "chr6:21596763:21598248:+@chr6:21598321:21599378"
# tran_id <- "chr3:109337464:109337652:-@chr3:109333920|109333993:109333870"
marvel <- PlotValues(MarvelObject=marvel,
cell.group.list=cell.group.list,
feature=tran_id,
xlabels.size=5,
level="splicing",
min.cells=25
)
marvel$adhocPlot$PSI
(2)差异基因与AS
- Coordinated:基因高表达,相应的PSI也增高;
- Opposing:基因表达与PSI变化方向相反;
- Isoform-switching :基因表达不变,PSI发生变化
- Complex:其它复杂情况
marvel <- IsoSwitch(MarvelObject=marvel,
method=c("ad", "dts"),
psi.pval=c(0.10, 0.10),
psi.delta=0,
gene.pval=0.10,
gene.log2fc=0.5
)
marvel$DE$Cor$Plot
head(marvel$DE$Cor$Table_Raw)
table(marvel$DE$Cor$Table$cor)
# Coordinated Opposing Iso-Switch Complex
# 192 158 331 113
## 以Coordinated的DHX9与相应的AFE剪切事件为例
# gene
df.feature <- marvel$GeneFeature
gene_id <- df.feature[which(df.feature$gene_short_name=="DHX9"), "gene_id"]
marvel <- PlotValues(MarvelObject=marvel,
cell.group.list=cell.group.list,
feature=gene_id,
maintitle="gene_short_name",
xlabels.size=7,
level="gene"
)
plot.1_gene <- marvel$adhocPlot$Exp
# Splicing
tran_id <- "chr1:182839347:182839456|182839734:182839935:+@chr1:182842545:182842677"
marvel <- PlotValues(MarvelObject=marvel,
cell.group.list=cell.group.list,
feature=tran_id,
xlabels.size=7,
level="splicing",
min.cells=25
)
plot.1_splicing <- marvel$adhocPlot$PSI
library(patchwork)
plot.1_gene + plot.1_splicing
(3)NMD预测
- nonsense-mediated decay (NMD)无义介导的mRNA降解预测
- 相比对照组,观测组所剪切的外显子可能由于包含premature stop codons (PTCs),导致编码提前结束,从而进一步使得表达量降低。
- 本质上是对可产生PTC的可变剪切与下调基因取交集。
marvel <- ParseGTF(MarvelObject=marvel)
marvel <- FindPTC(MarvelObject=marvel,
method=c("ad", "dts"),
pval=c(0.10, 0.10),
delta=5
)
marvel <- PropPTC(MarvelObject=marvel,
xlabels.size=8,
show.NovelSJ.NoCDS=FALSE, #仅展示Non PTC与PTC
prop.test="chisq"
)
marvel$NMD$PTC.Prop$Table[1:5, ]
marvel <- CompareExpr(MarvelObject=marvel,
xlabels.size=8
)
marvel$NMD$NMD.Expr$Plot
marvel <- AnnoVolcanoPlot(MarvelObject=marvel,
anno=FALSE,
xlabel.size=7
)
marvel$NMD$AnnoVolcanoPlot$Plot
marvel$NMD$AnnoVolcanoPlot$Table
未完待续......