注：笔记内容参考官方教程用法摸索而成。因涉及步骤较繁琐，如有错误，欢迎指出！

R包MARVEL是由牛津大学MRC Weatherall分子医学研究所团队开发的，用于分析单细胞水平的可变剪切事件。相关文章于2023年1月在Nucleic Acids Research期刊发表，在其github页面分享了MARVEL工具的分析流程，在此学习、记录如下。

Pape：https://doi.org/10.1093/nar/gkac1260

Github：https://github.com/wenweixiong/MARVEL

Plate pipeline：https://wenweixiong.github.io/MARVEL_Plate.html

Droplet pipeline：https://wenweixiong.github.io/MARVEL_Droplet.html

可变剪切

关于可变剪切(Alternative splicing，AS)的基础，可以参看一篇中文教程：169-跟着刘小泽一起探索可变剪切分析 | BIOINFOPLANET；其中关于基因结构组成，可以参考之前的笔记：人类基因组基础知识与下载查询 | Li's Bioinfo-Blog。
过去，针对Bulk RNA-seq数据的可变剪切分析工具已经有很多，例如之前学习的IsoformSwitchAnalyzeR包。MARVEL是个人目前了解到的第一个针对单细胞转录数据AS分析工具。

单细胞数据类型

AS分析的核心是通过检测比对到splice junction位点的read（例如，跨越两个外显子）进行分析。
目前主流的单细胞测序技术主要分为Droplet-based 10X方法与Plate-based Smart-seq方法。
10X技术由于仅对cDNA的3'端部分进行测序，只能对该区域涉及的splice junction进行分析，且无法具体分析事件类型
Smart-seq技术虽然通量低，但是全长转录本的单细胞测序原理，可以进行更为深入的AS分析。

一、Plate-based pipeline

1、上游分析

1.1 工作环境

两个同级的目录（1）basic：参考数据以及代码脚本（2）proj1：具体的分析项目文件

后续分析均在 proj1/ 路径下操作

tree basic/ -L 2
# basic/
# ├── gtf
# │   ├── gencode.v31.annotation.gtf
# │   ├── GRCh38.primary_assembly.genome.fa
# │   ├── Log.out
# │   ├── rsem_index
# │   └── star_index
# ├── rscript_Anno_SJ_rMATs.R
# ├── rscript_bedtools_input.R
# └── single_sample_upstream.sh
tree proj1/ -L 2
# proj1/
# ├── SJ_phenoData.txt
# └── work
#     ├── fq
#     ├── intron
#     ├── merge
#     ├── rmats
#     ├── rsem
#     └── star

1.2 conda环境

# 名为marvel_plate的分析环境
conda create -n marvel_plate python=3.7
conda activate marvel_plate
conda install -c hcc aspera-cli
conda install -c bioconda star=2.7.1a rmats=4.1.0 -y 
conda install -c bioconda bedtools samtools rsem -y 
conda install -c r r-base r-essentials r-tidyverse -y

1.2 准备数据

（1）测序数据

官方手册分析了如下数据集的 192个单细胞数据，以其中一个演示流程

# https://www.ebi.ac.uk/ena/browser/view/ERR1562083
id=ERR1562083
# id=$1
mkdir -p ./work/fq/${id}
ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_1.fastq.gz  ./work/fq/${id}

ascp -QT -l 300m -P33001  \
-i ~/miniconda3/envs/marvel_plate/etc/asperaweb_id_dsa.openssh   \
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/${id:0:6}/00${id:0-1}/${id}/${id}_2.fastq.gz  ./work/fq/${id}

（2）基因组数据

在 ./basic/gtf 目录下
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/

## STAR索引
STAR  --runMode genomeGenerate \
--genomeDir star_index \
--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
--sjdbGTFfile gencode.v31.annotation.gtf \
--runThreadN 40

## RSEM索引
rsem-prepare-reference \
--gtf gencode.v31.annotation.gtf \
--star  -p 20 \
GRCh38.primary_assembly.genome.fa \
rsem_index/hg38

（3）教程数据

官方教程提供了各个分析步骤的示范文件，可供之后分析作为参考。
http://datashare.molbiol.ox.ac.uk/public/wwen/MARVEL/Tutorial/Plate/Data.zip

ls Data
# GTF  MARVEL  rMATS  R Object  RSEM  SJ

1.3 分析步骤

1.3.1 单样本4步分析

（1）单样本STAR比对

将每个样本的Fastq文件使用STAR软件进行比对，产生每个splice junction count matrix
# Reference
less -SN ./Data/SJ/SJ.txt

mkdir -p ./work/star/${id}
ref_idx_star=../basic/gtf/star_index
ref_gtf=../basic/gtf/gencode.v31.annotation.gtf
echo ${id} is runing......
#1st pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--outSAMtype None
#2nd pass mode
STAR --runThreadN 16 \
--genomeDir ${ref_idx_star} \
--readFilesCommand zcat \
--readFilesIn ./work/fq/${id}/${id}_1.fastq.gz ./work/fq/${id}/${id}_2.fastq.gz \
--outFileNamePrefix ./work/star/${id}/${id}. \
--sjdbFileChrStartEnd ./work/star/${id}/*SJ.out.tab \
--sjdbGTFfile ${ref_gtf} \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM XS \
--quantMode TranscriptomeSAM

（2）单样本rMATs注释

根据上述比对BAM结果，注释出每个基因(转录本)涉及的可变剪切事件类型。使用rMATs软件，可注释出5种类型：SE、RI、MXE、A5SS、A3SS。
find ./Data/rMATS/ -name *featureData.txt
# ./Data/rMATS/SE/SE_featureData.txt
# ./Data/rMATS/A3SS/A3SS_featureData.txt
# ./Data/rMATS/MXE/MXE_featureData.txt
# ./Data/rMATS/RI/RI_featureData.txt
# ./Data/rMATS/A5SS/A5SS_featureData.txt
less -SN ./Data/rMATS/SE/SE_featureData.txt

mkdir -p ./work/rmats/${id}

echo ${id} is running......
echo "./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam" > ./work/star/${id}/BAM_fls.txt

rmats.py \
--b1 ./work/star/${id}//BAM_fls.txt \
--gtf ${ref_gtf} \
--od ./work/rmats/${id} \
--tmp ./work/rmats/${id}/tmp \
-t paired \
--readLength 125 \
--variable-read-length \
--nthread 16 \
--statoff
# readLength参考如下
# zcat ./work/fq/${id}_1.fastq.gz | awk '{if(NR%4==2) print NR"\t"$0"\t"length($0)}' | less

（3）单样本Intron矩阵

统计Reads对于每个内含子的比对情况，逐碱基累加作为比对结果。
# reference
less -SN Data/MARVEL/PSI/RI/Counts_by_Region.txt

mkdir -p ./work/intron/${id}
echo ${id} is running......
samtools view -H ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam | \
  grep SQ | cut -f 2 | awk '{ sub(/^SN:/, ""); print;}' > ./work/star/${id}/sorted_chr_in_bam.txt
# 预定义R脚本，生成bedtools计算所需的两个文件：染色体大小以及内含子坐标
# https://github.com/lishensuo/utils/blob/main/marvel/rscript_bedtools_input.R
Rscript ../basic/rscript_bedtools_input.R ${id}

bedtools coverage \
-g ./work/intron/${id}/hg38.chrom.sizes.txt \
-split \
-sorted \
-a ./work/intron/${id}/RI_Coordinates_sorted.bed \
-b ./work/star/${id}/${id}.Aligned.sortedByCoord.out.bam > \
./work/intron/${id}/intron_count.txt \
-d

（4）单样本gene矩阵

最后需要对基因表达水平进行定量。教程推荐使用RSEM软件，并进行TPM类似的标准化处理
# reference
less -SN ./Data/RSEM/TPM.txt

mkdir -p ./work/rsem/${id}
rsem-calculate-expression \
--bam \
--paired-end \
-p 16 \
./work/star/${id}/${id}.Aligned.toTranscriptome.out.bam \
../basic/gtf/rsem_index/hg38 \
./work/rsem/${id}/${id}

（1）如上，由于4步的分析过程中产生的中间数据可能超过数百兆；且SMART-seq往往涉及成百上千的细胞。为节约空间，可在每个样本分析完成后，删除不必要的数据。
rm ./work/fq/${id}/${id}*
rm ./work/star/${id}/*bam
rm -rf ./work/rmats/${id}/tmp/*
rm ./work/rsem/${id}/${id}.transcript.bam
rm ./work/rsem/${id}/${id}.isoforms.results
（2）多样本批量后台运行方式
cat ../basic/single_sample_upstream.sh
chmod u+x ../basic/single_sample_upstream.sh
# 单样本测试
# ../basic/single_sample_upstream.sh ERR1562273

# parallel多线程批量运行
nohup parallel -j 5 ../basic/single_sample_upstream.sh {} ::: \
$(grep -v "sample" ./SJ_phenoData.txt | awk '{print $1}') \
1> ./work/single_sample_upstream.log 2>&1 &

1.3.2 整合多样本结果

即分步整合上游4个步骤中的每个步骤的所有样本结果；均使用R语言完成

library(tidyverse)
library(parallel)
sample_meta = data.table::fread("./SJ_phenoData.txt",data.table=F)
samples=sample_meta$sample.id

step1: 汇总剪切位点矩阵

sj.list = mclapply(samples, function(sample){   
     sj_sp = data.table::fread(paste0("./work/star/",sample,"/",sample,".SJ.out.tab"), data.table=F)
     if(nrow(sj_sp)==0) return(NULL)
     sj_sp = sj_sp %>% 
       dplyr::mutate(`coord.intron`=paste0(V1,":",V2,":",V3)) %>% 
       dplyr::select(`coord.intron`, V7)
     colnames(sj_sp)[2] = sample
     return(sj_sp)
},mc.cores=10)
names(sj.list) = samples
sj.list = sj.list[!unlist(lapply(sj.list, is.null))]

sj_res = sj.list[[1]]
for(x in sj.list[-1]){
     sj_res = dplyr::full_join(sj_res, x)
}

for(sample in setdiff(samples,names(sj.list))){
     sj_res[,sample]=NA
}
dim(sj_res)

write.table(sj_res, file=paste0("./work/merge/SJ.txt"),
     quote=F, row.names=F, sep="\t")

step2:整理AS注释结果这一步最为复杂，需要理解每一种剪切事件的定义以及正负链的差异参考官方示意：https://wenweixiong.github.io/Splicing_Nomenclature

# https://github.com/lishensuo/utils/blob/main/marvel/rscript_Anno_SJ_rMATs.R
source(paste0("../basic/rscript_Anno_SJ_rMATs.R"))

AS_types = c("SE", "RI", "MXE", "A3SS", "A5SS")

for(AS_type in AS_types){
     # AS_type = AS_types[1]
     print(AS_type)
     fls_AS = list.files(paste0("./work/rmats"), pattern=paste0("fromGTF.",AS_type,".txt"),
          recursive=T, full.name=T)

     if (AS_type=="SE"){
          func = se_func
     } else if (AS_type=="RI"){
          func = ri_func 
     } else if (AS_type=="MXE"){
          func = mxe_func 
     } else if (AS_type=="A3SS"){
          func = a3ss_func 
     } else if (AS_type=="A5SS"){
          func = a5ss_func
     }

     df_AS = mclapply(fls_AS, function(fls){
          rmats_AS = data.table::fread(fls, data.table=F)
          rmats_AS = rmats_AS %>%
               dplyr::mutate(tran_id = func(.)) %>% 
               dplyr::rename(gene_id=GeneID,gene_short_name=geneSymbol) %>% 
               dplyr::select(tran_id, gene_id, gene_short_name)
          return(rmats_AS)
     },mc.cores=10) %>% do.call(rbind, .) %>% 
          dplyr::distinct()
     write.table(df_AS, row.names=F, sep="\t",quote=F,
          file=paste0("./work/merge/as_",AS_type,"_featureData.txt"))
}

step3：汇总intron矩阵

int_mt.list = mclapply(samples, function(sample){
     # sample = samples[1]
     int_mt = data.table::fread(paste0("./work/intron/",sample,"/intron_count.txt")) 
     int_mt = int_mt %>% 
          dplyr::mutate(`coord.intron`=paste(V1,V2,V3, sep=":")) %>% 
          dplyr::group_by(coord.intron) %>% 
          dplyr::summarize(count=sum(V5)) %>% 
          dplyr::select(coord.intron, count) %>% as.data.frame()
     colnames(int_mt)[2] = sample
     return(int_mt)
},mc.cores=10)

intron_res = int_mt.list[[1]]
for(x in int_mt.list[-1]){
     intron_res = dplyr::full_join(intron_res, x)
}
dim(intron_res)

write.table(intron_res, file=paste0("./work/merge/intron_count_by_region.txt"),
     quote=F, row.names=F, sep="\t")

汇总gene矩阵

gene_mt.list = mclapply(samples, function(sample){
     fls=paste0("./work/rsem/",sample,"/",sample,".genes.results")
     gene_mt = data.table::fread(fls, data.table=F) %>% 
          dplyr::select(gene_id, TPM)
     colnames(gene_mt)[2] = sample
     return(gene_mt)
},mc.cores=10)
gene_res = gene_mt.list[[1]]
for(x in gene_mt.list[-1]){
     gene_res = dplyr::full_join(gene_res, x)
}
dim(gene_res)
write.table(gene_res, file=paste0("./work/merge/rsem_tpm.txt"),,
     quote=F, row.names=F, sep="\t")

2、下游分析

后续根据教程提供的数据学习，需要安装的主要的MARVEL包以及其它辅助分析包，具体可参考教程。

# 建议再单独建立一个conda R环境
install.packages("MARVEL")

2.1 创建marvel对象

需要准备七类文件，读入为R语言对象
样本表型data.frame：第一列名为sample.id
剪切位点计数矩阵：第一列名为coord.intron
剪切事件类型注释列表：由数据框包装而出的list，每个数据库至少包含tran_id与gene_id
内含子计数矩阵：第一列名为coord.intron
基因表达矩阵：经标准化处理(RPKM/FPKM/TPM)，暂时不要log转换
基因类型data.frame：至少包含3列，gene_id, gene_short_name, gene_type
上游分析所用到的GTF数据

library(tidyverse)
library(MARVEL)

# (1) 细胞表型分组
df.pheno = data.table::fread("./SJ_phenoData.txt",data.table=F)

# (2) Gene文件
df.tpm <- read.table("./work/merge/rsem_tpm.txt", sep="\t", header=TRUE)
dim(df.tpm)


# (3) GTF文件
gtf <- data.table::fread("../basic/gtf/gencode.v31.annotation.gtf", sep="\t", header=FALSE, na.strings="NA",quote="\"") %>% 
     as.data.frame()

# (4) Gene类型
gene.feature = subset(gtf, V3=="gene") 
df.tpm.feature <- str_split(gene.feature$V9, ";", simplify=T) %>% 
     as.data.frame() %>%
     dplyr::mutate(gene_id=str_match(V1, '"(.*)"')[,2]) %>% 
     dplyr::mutate(gene_short_name=str_match(V3, '"(.*)"')[,2]) %>%      
     dplyr::mutate(gene_type=str_match(V2, '"(.*)"')[,2]) %>% 
     dplyr::select(gene_id, gene_short_name, gene_type) %>% 
     dplyr::filter(gene_id %in% df.tpm$gene_id)
df.tpm.feature = df.tpm.feature[match(df.tpm$gene_id, df.tpm.feature$gene_id),]
rownames(df.tpm.feature) = seq(nrow(df.tpm.feature))

# (5) SJ文件
sj = data.table::fread("./work/merge/SJ.txt", sep="\t", header=TRUE, na.strings="NA") %>% 
     as.data.frame()
dim(sj)

# (6) AS文件
df.feature.list = lapply(c("SE", "MXE", "RI", "A5SS", "A3SS"), function(x){
     # x = "SE"
     df.feature = read.table(paste0("./work/merge/as_",x,"_featureData.txt"), 
          sep="\t", header=TRUE, na.strings="NA") %>% 
          dplyr::distinct(tran_id, .keep_all=TRUE) %>% 
          dplyr::left_join(df.tpm.feature)
     return(df.feature)
})
names(df.feature.list) <- c("SE", "MXE", "RI", "A5SS", "A3SS")

# (7) Intron文件
df.intron.counts <- data.table::fread("./work/merge/intron_count_by_region.txt", 
     sep="\t", header=TRUE, na.strings="NA") %>% 
     as.data.frame()
dim(df.intron.counts)
## MARVEL object
marvel <- CreateMarvelObject(SpliceJunction=sj,
                             SplicePheno=df.pheno,
                             SpliceFeature=df.feature.list,
                             IntronCounts=df.intron.counts,
                             GeneFeature=df.tpm.feature,
                             Exp=df.tpm,
                             GTF=gtf
                             )

2.2 预处理步骤

（1）基因表达log转换

marvel <- TransformExpValues(MarvelObject=marvel,
                             offset=1,
                             transformation="log2",
                             threshold.lower=1
                            )

（2）检测额外AS事件

# 如果基因表达水平整体较低，则不考虑该基因的可变剪切事件
# min.expr:判定单个细胞是否表达基因的阈值
# min.cells: 基于上述条件，判定细胞群表达该基因的阈值
marvel <- DetectEvents(MarvelObject=marvel,
                       min.cells=50, #细胞群的表达百分比
                       min.expr=1,   #认为细胞表达该基因的阈值
                       track.progress=FALSE,
                       EventType="AFE"
                       )
marvel <- DetectEvents(MarvelObject=marvel,
                       min.cells=50,
                       min.expr=1,
                       track.progress=FALSE,
                       EventType="ALE"
                       )
length(marvel$SpliceFeature)
# 7

（3）计算PSI值

计算一个基因对于特定剪切事件的PSI值，这是后续分析的基础

# CoverageThreshold: 支持该剪切事件的最小reads数，小于该阈值标记为NA
# UnevenCoverageMultiplier: 针对SE与MXE
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     UnevenCoverageMultiplier=10,
                     EventType="SE"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     UnevenCoverageMultiplier=10,
                     EventType="MXE"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="RI",
                     thread=4  # only support RI
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="A5SS"
                     )

marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="A3SS"
                     )
 
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="AFE"
                     )
    
marvel <- ComputePSI(MarvelObject=marvel,
                     CoverageThreshold=10,
                     EventType="ALE"
                     )

（4）保存中间结果

首先检查不同矩阵与对应的meta行列名信息一致

marvel <- CheckAlignment(MarvelObject=marvel, level="SJ")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing")
marvel <- CheckAlignment(MarvelObject=marvel, level="gene")
marvel <- CheckAlignment(MarvelObject=marvel, level="splicing and gene")

可进一步选取感兴趣的细胞群

index.1 <- which(df.pheno$cell.type %in% c("iPSC", "Endoderm"))
index.2 <- which(df.pheno$qc.seq=="pass")
index <- intersect(index.1, index.2)
sample.ids <- df.pheno[index, "sample.id"]
marvel <- SubsetSamples(MarvelObject=marvel,
                        sample.ids=sample.ids
                        )

保存中间结果，方便后续分析

save(marvel, file="./MARVEL2.RData")

# 后续以官方提供的示例数据学习
# load("../Data/MARVEL.RData")

2.3 细胞群AS特征分析

以其中84个iPSC细胞群进行演示

sample.ids = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="iPSC") %>% 
    dplyr::pull(sample.id)

（1）AS类型占比

marvel <- CountEvents(MarvelObject=marvel,
                      sample.ids=sample.ids,
                      min.cells=25
                      )
# 对于一个基因至少有25个细胞的AS PSI值为非缺失值，则判定发生了可变剪切事件
marvel$N.Events$Table
#   event_type freq        pct
# 1         SE 5270 40.1523810
# 3         RI 2030 15.4666667
# 6        AFE 1789 13.6304762
# 5       A3SS 1746 13.3028571
# 4       A5SS 1494 11.3828571
# 7        ALE  724  5.5161905
# 2        MXE   72  0.5485714
marvel$N.Events$Plot 
# ggplot绘图体系，下同

（2）AS modality

判断细胞群中每个基因的每种可变剪切的PSI分布类型，如下可分为7类

marvel <- AssignModality(MarvelObject=marvel,
                         sample.ids=sample.ids,
                         min.cells=25,
                         seed=1
                         )
marvel$Modality$Results[1:5, c("tran_id", "event_type", "gene_id", "gene_short_name", "modality.bimodal.adj")]
table(marvel$Modality$Results$modality.bimodal.adj, marvel$Modality$Results$event_type)
#                    A3SS A5SS  AFE  ALE  MXE   RI   SE
# Bimodal              20    7    4    6    0    4   17
# Excluded.Dispersed  386  484  538  189   18  783 1260
# Excluded.Primary    289  282  477  101   22  712 1049
# Included.Dispersed  563  385  372  237   12  352 1685
# Included.Primary    439  291  346  176   17   47 1184
# Middle                9   10    9    3    2   51   15
# Multimodal           40   35   43   12    1   81   60

后续有两种统计方式：仅按modality分，同时按modality与AS type分。以后者作为演示

marvel <- PropModality(MarvelObject=marvel,
                       modality.column="modality.bimodal.adj",
                       modality.type="extended",
                       event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "AFE", "ALE"),
                       across.event.type=TRUE,
                       prop.test="chisq",
                       prop.adj="fdr",
                       xlabels.size=8
                       )
marvel$Modality$Prop$BarChart$Stats
head(marvel$Modality$Prop$BarChart$Table)
marvel$Modality$Prop$BarChart$Plot

2.4 细胞群差异分析

（1）差异基因分析

cell.group.g1 = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="iPSC") %>% 
    dplyr::pull(sample.id)  # (reference)
cell.group.g2 = marvel$SplicePheno  %>% 
    dplyr::filter(cell.type=="Endoderm") %>% 
    dplyr::pull(sample.id)

marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        min.cells=3,
                        method="wilcox",
                        method.adjust="fdr",
                        level="gene",
                        show.progress=FALSE
                        )
marvel$DE$Exp$Table[1:5, ]

（2）差异AS分析

# time consuming
marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        min.cells=25,
                        method=c("ad", "dts"),  #同时使用两种的推荐差异方法
                        method.adjust="fdr",
                        level="splicing",
                        event.type=c("SE", "MXE", "RI", "A5SS", "A3SS", "ALE", "AFE"),
                        show.progress=FALSE
                        )
names(marvel$DE$PSI$Table)
head(marvel$DE$PSI$Table[["ad"]])
head(marvel$DE$PSI$Table[["dts"]])

（3）差异基因&AS分析

对差异可变剪切的基因执行基因表达差异分析

marvel <- CompareValues(MarvelObject=marvel,
                        cell.group.g1=cell.group.g1,
                        cell.group.g2=cell.group.g2,
                        psi.method=c("ad", "dts"),
                        psi.pval=c(0.10, 0.10),
                        psi.delta=0,
                        method.de.gene="wilcox",
                        method.adjust.de.gene="fdr",
                        downsample=FALSE,
                        show.progress=FALSE,
                        level="gene.spliced"
                        )
head(marvel$DE$Exp.Spliced$Table)

对于上述差异分析结果均可快速火山图，标记marker基因等

2.5 差异AS进阶分析

（1）差异PSI分类

Explicit：明显的变化；Implicit：有限的变化；Restricted：微小的变化

marvel <- ModalityChange(MarvelObject=marvel,
                     method=c("ad", "dts"),
                     psi.pval=c(0.10, 0.10)
                     )
table(marvel$DE$Modality$Table$modality.change)
# Explicit   Implicit Restricted 
#    161        280       1113
head(marvel$DE$Modality$Table)
marvel$DE$Modality$Plot

cell.group.list <- list("iPSC"=cell.group.g1,
                        "Endoderm"=cell.group.g2
                        )
tran_id <- "chr3:129171771:129171634|129171655:-@chr3:129171446:129171538"
# tran_id <- "chr6:21596763:21598248:+@chr6:21598321:21599378"
# tran_id <- "chr3:109337464:109337652:-@chr3:109333920|109333993:109333870"
marvel <- PlotValues(MarvelObject=marvel,
                     cell.group.list=cell.group.list,
                     feature=tran_id,
                     xlabels.size=5,
                     level="splicing",
                     min.cells=25
                     )
marvel$adhocPlot$PSI

（2）差异基因与AS

Coordinated：基因高表达，相应的PSI也增高；
Opposing：基因表达与PSI变化方向相反；
Isoform-switching ：基因表达不变，PSI发生变化
Complex：其它复杂情况

marvel <- IsoSwitch(MarvelObject=marvel,
                    method=c("ad", "dts"),
                    psi.pval=c(0.10, 0.10),
                    psi.delta=0,
                    gene.pval=0.10,
                    gene.log2fc=0.5
                    )

marvel$DE$Cor$Plot
head(marvel$DE$Cor$Table_Raw)
table(marvel$DE$Cor$Table$cor)
# Coordinated    Opposing  Iso-Switch     Complex 
#         192         158         331         113

## 以Coordinated的DHX9与相应的AFE剪切事件为例
# gene
df.feature <- marvel$GeneFeature
gene_id <- df.feature[which(df.feature$gene_short_name=="DHX9"), "gene_id"]
marvel <- PlotValues(MarvelObject=marvel,
                   cell.group.list=cell.group.list,
                   feature=gene_id,
                   maintitle="gene_short_name",
                   xlabels.size=7,
                   level="gene"
                   )
plot.1_gene <- marvel$adhocPlot$Exp
# Splicing
tran_id <- "chr1:182839347:182839456|182839734:182839935:+@chr1:182842545:182842677"
marvel <- PlotValues(MarvelObject=marvel,
                   cell.group.list=cell.group.list,
                   feature=tran_id,
                   xlabels.size=7,
                   level="splicing",
                   min.cells=25
                   )
plot.1_splicing <- marvel$adhocPlot$PSI
library(patchwork)
plot.1_gene + plot.1_splicing

（3）NMD预测

nonsense-mediated decay (NMD)无义介导的mRNA降解预测
相比对照组，观测组所剪切的外显子可能由于包含premature stop codons (PTCs)，导致编码提前结束，从而进一步使得表达量降低。
本质上是对可产生PTC的可变剪切与下调基因取交集。

marvel <- ParseGTF(MarvelObject=marvel)
marvel <- FindPTC(MarvelObject=marvel,
                  method=c("ad", "dts"),
                  pval=c(0.10, 0.10),
                  delta=5
                  )

marvel <- PropPTC(MarvelObject=marvel,
                  xlabels.size=8,
                  show.NovelSJ.NoCDS=FALSE, #仅展示Non PTC与PTC
                  prop.test="chisq"
                  )
marvel$NMD$PTC.Prop$Table[1:5, ]

marvel <- CompareExpr(MarvelObject=marvel,
                      xlabels.size=8
                      )
marvel$NMD$NMD.Expr$Plot

marvel <- AnnoVolcanoPlot(MarvelObject=marvel,
                          anno=FALSE,
                          xlabel.size=7
                          )

marvel$NMD$AnnoVolcanoPlot$Plot
marvel$NMD$AnnoVolcanoPlot$Table

未完待续......

单细胞工具MARVEL—单细胞可变剪切分析(一)

一、Plate-based pipeline

1、上游分析

1.1 工作环境

1.2 conda环境

1.2 准备数据

1.3 分析步骤

1.3.1 单样本4步分析

1.3.2 整合多样本结果

2、下游分析

2.1 创建marvel对象

2.2 预处理步骤

2.3 细胞群AS特征分析

2.4 细胞群差异分析

2.5 差异AS进阶分析

相关推荐

idea本地配置连接远程hadoop集群的一些网络问题解决汇总

无缓存不行?例行升级的入门级阿斯加特AN2 SSD装机点评

Ceph运维手册(基于P版本)

大数据开发前要做什么准备?8台Hadoop服务器进行集群规划前配置

Tensorflow分类loss函数总结 tensorflow绘制loss曲线

R语言学习笔记(七) -离散型数据的模型预测2

iOS Runtime详解

7 个对 Java 意义重大的性能指标，你知道几个?

PHP 远程调试最佳实践

Laravel框架使用图片处理简单教程