百度360必应搜狗淘宝本站头条
当前位置:网站首页 > 热门文章 > 正文

开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CNV+SV

bigegpt 2024-08-01 11:59 6 浏览

开箱即用版本 满分室间质评之GATK Somatic SNV+Indel+CNV+SV

最近准备为sliverworkspace图形化生信平台开发报告设计器,需要一个较为复杂的pipeline作为测试数据,就想起来把之前的 [[满分室间质评之GATK Somatic SNV+Indel+CNV+SV(下)性能优化]]翻出来用一下。跑了一遍发现还是各种问题,于是想把pipeline改造成免部署、首次运行初始化环境的版本,以便需要时候能够直接运行起来,于是有了本文。

一句话描述就是:开箱即用的pipeline,能够自动初始化环境、安装所需软件、下载ref文件和数据库的版本

为了让pipeline能够直接运行,无需部署,这里使用docker容器保证运行环境的一致性:见文章:[[基于docker的生信基础环境镜像构建]],这里采用的方案是带ssh服务的docker+conda环境,整个pipeline在一个通用容器中运行。

本文代码较长,可能略复杂,想直接运行的可以下载 workflow文件,导入sliverworkspace图形化生信平台直接运行。

相关代码和workflow文件可以访问笔者github项目地址或这gitee项目地址

导入操作

分析流程整体概览

docker 镜像拉取及部署配置

 # 拉取docker镜像
 docker     pull     doujiangbaozi/sliverworkspace:1.10

docker-compose.yml配置文件

 version: "3"
 services:
   GATK:
     image: doujiangbaozi/sliverworkspace:1.10
     container_name: GATK
     volumes:
       - /home/sliver/Data/data:/opt/data:rw                                #挂载输入数据目录
       - /home/sliver/Manufacture/gatk/envs:/root/mambaforge-pypy3/envs     #挂载envs目录
       - /home/sliver/Manufacture/sliver/ref:/opt/ref:rw                    #挂载reference目录
       - /home/sliver/Manufacture/gatk/result:/opt/result:rw                #挂载中间文件和结果目录
     environment:
       - TZ=Asia/Shanghai                                                   #设置时区
       - PS=20191124                                                        #设置ssh密码
       - PT=9024                                                            #设置ssh连接端口

docker 镜像部署运行

 # 在docker-compose.yml文件同级目录下运行
 docker-compose up -d
 
 # 或者docker-compose -f docker-compose.yml所在目录
 docker-compose -f somedir/docker-compose.yml up -d
 
 # 可以通过ssh连接到docker 运行pipeline命令了,连接端口和密码见docker-compose.yml配置文件相关字段
 ssh root@127.0.0.1 -p9024

测试数据

测试数据来自2017年卫计委室间质评提供的bed文件(pipeline会自动下载)和测试数据,修改命名以匹配pipeline输入端,也可以替换为自己的数据文件,因为室间质评目前参考基因组还停留在hg19版本,所以本流程仍然使用hg19(GRCH37),有需要的可以自行替换。后期会提供hg38(GRCH38)版本

文件名(按照需要有调整)

文件大小

MD5

B1701_R1.fq.gz

4.85G

07d3cdccee41dbb3adf5d2e04ab28e5b

B1701_R2.fq.gz

4.77G

c2aa4a8ab784c77423e821b9f7fb00a7

B1701NC_R1.fq.gz

3.04G

4fc21ad05f9ca8dc93d2749b8369891b

B1701NC_R2.fq.gz

3.11G

bc64784f2591a27ceede1727136888b9

变量名称

 # 变量初始化赋值
   sn=1701                               #样本编号
   pn=GS03                               #pipeline 代号
   version_openjdk=8.0.332               #java openjdk 版本                         
   version_cnvkit=0.9.10                 #cnvkit 版本
   version_manta=1.6.0                   #manta 版本
   version_gatk=4.3.0.0                  #gatk 版本                                 
   version_sambamba=1.0.1                #sambamba 版本                             
   version_samtools=1.17                 #samtools 版本                             
   version_bwa=0.7.17                    #bwa 版本                                  
   version_fastp=0.23.2                  #fastp 版本                                
   version_vep=108                       #vep 版本                                  
   envs=/root/mambaforge-pypy3/envs      #mamba envs 目录                           
   threads=32                            #最大线程数                                   
   memory=32G                            #内存占用                                    
   scatter=8                             #Mutect2 分拆并行运行interval list 份数          
   event=2                               #gatk FilterMutectCalls --max-events-in-region 数值
   snv_tlod=16.00                        #snv 过滤参数 tload 值                        
   snv_vaf=0.01                          #snv 过滤参数 丰度/突变频率                        
   snv_depth=500                         #snv 过滤参数 支持reads数/depth 测序深度            
   cnv_dep=1000                          #cnv 过滤参数 支持reads数/depth 测序深度            
   cnv_min=-0.5                          #cnv 过滤参数 log2最小值                        
   cnv_max=0.5                           #cnv 过滤参数 log2 最大值                       
   sv_score=200                          #sv  过滤参数 score                           
 
 # 以上变量个可以具体根据需求调整

表格:

变量名

变量值

备注

sn

1701

样本编号

pn

GS03

pipeline 代号 GATK Somatic 03版本

version_openjdk

8.0.332

java openjdk 版本

version_cnvkit

0.9.10

cnvkit 版本

version_manta

1.6.0

manta版本

version_gatk

4.3.0.0

gatk 版本

version_sambamba

1.0.1

sambamba 版本

version_samtools

1.17

samtools 版本

version_bwa

0.7.17

bwa 版本

version_fastp

0.23.2

fastp 版本

version_vep

108

vep 版本

envs

/root/mambaforge-pypy3/envs

mamba envs 目录

threads

32

最大线程数

memory

32G

内存占用

scatter

8

Mutect2 分拆并行运行interval list 份数

event

2

gatk FilterMutectCalls --max-events-in-region 数值

snv_tlod

16.00

snv 过滤参数 tload 值

snv_vaf

0.01

snv 过滤参数 丰度/突变频率

snv_depth

500

snv 过滤参数 支持reads数/depth 测序深度

cnv_dep

1000

cnv 过滤参数 支持reads数/depth 测序深度

cnv_min

-0.5

cnv 过滤参数 log2最小值

cnv_max

0.5

cnv 过滤参数 log2 最大值

sv_score

200

sv 过滤参数 score

Pipeline/workflow 具体步骤:

  1. fastp 默认参数过滤,看下原始数据质量,clean data
 #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.fastp" ]; then
  echo "Creating the environment ${pn}.fastp"
  mamba create -n ${pn}.fastp -y fastp=${version_fastp}
fi

mamba	activate ${pn}.fastp

mkdir -p ${result}/${sn}/trimmed

fastp -w 16 \
	-i ${data}/GS03/${sn}_tumor_R1.fq.gz  \
    -I ${data}/GS03/${sn}_tumor_R2.fq.gz  \
    -o ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz \
    -O ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
    -h ${result}/${sn}/trimmed/${sn}_tumor_fastp.html \
    -j ${result}/${sn}/trimmed/${sn}_tumor_fastp.json &

fastp -w 16 \
	-i ${data}/GS03/${sn}_normal_R1.fq.gz  \
    -I ${data}/GS03/${sn}_normal_R2.fq.gz  \
    -o ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz \
    -O ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
    -h ${result}/${sn}/trimmed/${sn}_normal_fastp.html \
    -j ${result}/${sn}/trimmed/${sn}_normal_fastp.json &
wait

mamba	deactivate

2. normal文件fastq比对到参考基因组,sort 排序,mark duplicate 得到 marked.bam

#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.align" ]; then
  mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools} 
fi

#从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.这是个很诡异的地方
if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
	aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
	gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz  >  ${envs}/${pn}.align/bin/sambamba 
	chmod a+x ${envs}/${pn}.align/bin/sambamba
fi

mamba	activate ${pn}.align

mkdir	-p /opt/ref/hg19
#如果没有检测到参考基因组序列,则下载序列并使用bwa创建索引
if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
    	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
    	bwa index /opt/ref/hg19/hg19.fasta
	fi
elif [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
		cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
    	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
    	bwa index /opt/ref/hg19/hg19.fasta
	fi
fi
#检测samtools索引是否存在,如不存在则使用samtools创建参考基因组索引
if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
	samtools faidx /opt/ref/hg19/hg19.fasta
fi


mkdir -p ${result}/${sn}/aligned
#比对基因组管道输出成bam文件,管道输出排序
bwa mem \
    -t ${threads} -M \
    -R "@RG\\tID:${sn}_normal\\tLB:${sn}_normal\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_normal" \
    /opt/ref/hg19/hg19.fasta  ${result}/${sn}/trimmed/${sn}_normal_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_normal_R2_trimmed.fq.gz \
    | sambamba view -S -f bam -l 0 /dev/stdin \
    | sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_normal_sorted.bam /dev/stdin

#防止linux打开文件句柄数超过限制,报错退出
ulimit -n 10240

#使用sambamba对sorted bam文件标记重复
sambamba markdup \
	--tmpdir ${result}/${sn}/aligned \
	-t ${threads} ${result}/${sn}/aligned/${sn}_normal_sorted.bam ${result}/${sn}/aligned/${sn}_normal_marked.bam 

#修改marked bam文件索引名,gatk和sambamba索引文件名需要保持一致
mv  ${result}/${sn}/aligned/${sn}_normal_marked.bam.bai ${result}/${sn}/aligned/${sn}_normal_marked.bai
#删除sorted bam文件
rm -f ${result}/${sn}/aligned/${sn}_normal_sorted.bam*

mamba	deactivate
  1. tumor文件fastq比对到参考基因组,sort 排序,mark duplicate 得到 marked.bam,与第2步类似
if [ ! -d "${envs}/${pn}.align" ]; then
  mamba create -n ${pn}.align -y bwa=${version_bwa} samtools=${version_samtools} 
fi

#从github下载sambamba static 比 mamba 安装的版本速度快1倍以上.
if [ ! -f "${envs}/${pn}.align/bin/sambamba" ]; then
	aria2c https://github.com/biod/sambamba/releases/download/v${version_sambamba}/sambamba-${version_sambamba}-linux-amd64-static.gz -d ${envs}/${pn}.align/bin
	gzip -cdf ${envs}/${pn}.align/bin/sambamba-${version_sambamba}-linux-amd64-static.gz  >  ${envs}/${pn}.align/bin/sambamba 
	chmod a+x ${envs}/${pn}.align/bin/sambamba
fi

mamba	activate ${pn}.align

mkdir	-p /opt/ref/hg19

if [ ! -f "/opt/ref/hg19/hg19.fasta" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -d /opt/ref/hg19 -o hg19.fasta.gz
	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
    	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
    	bwa index /opt/ref/hg19/hg19.fasta
	fi
elif [ -f "/opt/ref/hg19/ucsc.hg19.fasta.gz.aria2" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz -c -d /opt/ref/hg19 -o hg19.fasta.gz
	cd /opt/ref/hg19 && gzip -d /opt/ref/hg19/hg19.fasta.gz
	if  [ ! -f /opt/ref/hg19.fasta.amb ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.ann ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.bwt ] ||
		[ ! -f /opt/ref/hg19/hg19.fasta.pac ] ||
    	[ ! -f /opt/ref/hg19/hg19.fasta.sa ]; then
    	bwa index /opt/ref/hg19/hg19.fasta
	fi
fi

if [ ! -f "/opt/ref/hg19/hg19.fasta.fai" ]; then
	samtools faidx /opt/ref/hg19/hg19.fasta
fi


mkdir	-p ${result}/${sn}/aligned

bwa	mem \
    -t ${threads} -M \
    -R "@RG\\tID:${sn}_tumor\\tLB:${sn}_tumor\\tPL:Illumina\\tPU:Miseq\\tSM:${sn}_tumor" \
    /opt/ref/hg19/hg19.fasta  ${result}/${sn}/trimmed/${sn}_tumor_R1_trimmed.fq.gz ${result}/${sn}/trimmed/${sn}_tumor_R2_trimmed.fq.gz \
    | sambamba view -S -f bam -l 0 /dev/stdin \
    | sambamba sort -t ${threads} -m 2G --tmpdir=${result}/${sn}/aligned -o ${result}/${sn}/aligned/${sn}_tumor_sorted.bam /dev/stdin
ulimit -n 10240
sambamba  markdup \
	--tmpdir ${result}/${sn}/aligned \
	-t ${threads} ${result}/${sn}/aligned/${sn}_tumor_sorted.bam ${result}/${sn}/aligned/${sn}_tumor_marked.bam
mv  ${result}/${sn}/aligned/${sn}_tumor_marked.bam.bai ${result}/${sn}/aligned/${sn}_tumor_marked.bai
rm  -f ${result}/${sn}/aligned/${sn}_tumor_sorted.bam*

mamba	deactivate
  1. 对上述bam文件生成重新校准表,为后续BQSR使用;Generates recalibration table for Base Quality Score Recalibration (BQSR)
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -f "${envs}/gatk/bin/gatk" ]; then
	mkdir -p ${envs}/gatk/bin
	#替代下载地址
	#https://github.com/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip
	if [ -f ${envs}/gatk/bin/gatk.zip.aria2 ]; then
		aria2c -x 16 -s 32 https://download.yzuu.cf/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip -c -d ${envs}/gatk/bin -o gatk.zip
	else 
		aria2c -x 16 -s 32 https://download.yzuu.cf/broadinstitute/gatk/releases/download/${version_gatk}/gatk-${version_gatk}.zip -d ${envs}/gatk/bin -o gatk.zip
	fi
	apt-get install -y unzip
	cd ${envs}/gatk/bin 
	unzip -o gatk.zip 
	mv ${envs}/gatk/bin/gatk-${version_gatk}/* ${envs}/gatk/bin/
	rm -rf ${envs}/gatk/bin/gatk-${version_gatk}
	#chmod +x ${envs}/bin/gatk
	cd ${result}
fi

if [ ! -x "$(command -v python)" ]; then
	mamba env create -f ${envs}/gatk/bin/gatkcondaenv.yml
fi

if [ ! -x "$(command -v java)" ]; then
	mamba install -y openjdk=${version_openjdk}
fi

if [ ! -x "$(command -v tabix)" ]; then
	mamba install -y tabix
fi

mamba activate gatk

#这里有个巨坑,broadinstitute ftp站点bundle目录提供的hg19版本参考文件,默认格式运行会报错,提示没有索引,使用gatk创建索引仍然报错,其实是gz格式需要使用bgzip重新压缩并且使用tabix创建索引才行
if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -d /opt/ref/hg19
	gzip -k -f -d /opt/ref/hg19/dbsnp_138.hg19.vcf.gz > /opt/ref/hg19/dbsnp_138.hg19.vcf
	bgzip -f --threads ${threads} /opt/ref/hg19/dbsnp_138.hg19.vcf > /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
	tabix -f /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
else 
	if [ -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.aria2" ]; then
		echo 'download continue...'
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "/opt/ref/hg19/dbsnp_138.hg19.vcf.gz.tbi" ]; then
		gzip -k -f -d /opt/ref/hg19/dbsnp_138.hg19.vcf.gz > /opt/ref/hg19/dbsnp_138.hg19.vcf
		bgzip -f --threads ${threads} /opt/ref/hg19/dbsnp_138.hg19.vcf > /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
		tabix -f /opt/ref/hg19/dbsnp_138.hg19.vcf.gz
	fi
fi

if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -d /opt/ref/hg19
	gzip -k -f -d /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
	bgzip -f --threads ${threads} /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
	tabix -f /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
else 
	if [ -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.aria2" ]; then
		echo 'download continue...'
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "/opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.tbi" ]; then
		gzip -k -f -d /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
		bgzip -f --threads ${threads} /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf > /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
		tabix -f /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
	fi
fi

if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -d /opt/ref/hg19
	gzip -k -f -d /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf
	bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
	tabix -f /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
else 
	if [ -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.aria2" ]; then
		echo 'download continue...'
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "/opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.tbi" ]; then
		gzip -k -f -d /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf
		bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
		tabix -f /opt/ref/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
	fi
fi

if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz" ]; then
	aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -d /opt/ref/hg19
	gzip -k -f -d /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf
	bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
	tabix -f /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
else 
	if [ -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.aria2" ]; then
		echo 'download continue...'
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "/opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz.tbi" ]; then
		gzip -k -f -d /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf
		bgzip -f --threads ${threads} /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf > /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
		tabix -f /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
	fi
fi
#创建参考序列hg19的dict字典文件
if [ ! -f "/opt/ref/hg19/hg19.dict" ]; then
	gatk CreateSequenceDictionary -R /opt/ref/hg19/hg19.fasta -O /opt/ref/hg19/hg19.dict
fi
#根据下载的Illumina_pt2.bed 文件创建interval list文件,坐标转换,其实坐标0修改为1
if [ ! -f "/opt/ref/hg19/Illumina_pt2.interval_list" ]; then
	#sed 's/chr//; s/\t/ /g' /opt/ref/hg19/Illumina_pt2.bed > /opt/ref/hg19/Illumina_pt2.processed.bed
	mkdir  -p /opt/ref/hg19
	rm     -f /opt/ref/hg19/Illumina_pt2.bed
	
	aria2c  https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/projects/Illumina_pt2.bed -d /opt/ref/hg19
	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/projects/Illumina_pt2.bed -d /opt/ref/hg19
	gatk BedToIntervalList \
		-I	/opt/ref/hg19/Illumina_pt2.bed \
		-SD	/opt/ref/hg19/hg19.dict \
		-O	/opt/ref/hg19/Illumina_pt2.interval_list
fi

mkdir -p ${result}/${sn}/recal

gatk BaseRecalibrator \
        --known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
        --known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
		--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
        -L /opt/ref/hg19/Illumina_pt2.interval_list \
        -R /opt/ref/hg19/hg19.fasta \
        -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
        -O ${result}/${sn}/recal/${sn}_tumor_recal.table &
        
gatk BaseRecalibrator \
        --known-sites /opt/ref/hg19/dbsnp_138.hg19.vcf.gz \
        --known-sites /opt/ref/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz \
		--known-sites /opt/ref/hg19/1000G_phase1.indels.hg19.sites.vcf.gz \
        -L /opt/ref/hg19/Illumina_pt2.interval_list \
        -R /opt/ref/hg19/hg19.fasta \
        -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
        -O ${result}/${sn}/recal/${sn}_normal_recal.table &

wait

mamba deactivate
  1. 使用校准表对bam碱基质量校准,因为这一步gatk效率感人,所以同时计算insertsize,拆分interval list(后续mutect2并行运行需要),运行cnvkit batch,运行samtools depth计算测序深度,samtools flagstat 统计mapping比例及质量
mkdir -p ${result}/${sn}/bqsr
mkdir -p ${result}/${sn}/stat
mkdir -p ${result}/${sn}/cnv
mkdir -p ${result}/${sn}/interval

mamba activate gatk

gatk ApplyBQSR \
	--bqsr-recal-file ${result}/${sn}/recal/${sn}_tumor_recal.table \
	-L /opt/ref/hg19/Illumina_pt2.interval_list \
	-R /opt/ref/hg19/hg19.fasta \
    -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
    -O ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam &


gatk ApplyBQSR \
    --bqsr-recal-file ${result}/${sn}/recal/${sn}_normal_recal.table \
	-L /opt/ref/hg19/Illumina_pt2.interval_list \
	-R /opt/ref/hg19/hg19.fasta \
    -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
    -O ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam &


gatk CollectInsertSizeMetrics \
  -I ${result}/${sn}/aligned/${sn}_tumor_marked.bam \
  -O ${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
  -H ${result}/${sn}/stat/${sn}_tumor_insertsize_histogram.pdf &


gatk CollectInsertSizeMetrics \
  -I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
  -O ${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt \
  -H ${result}/${sn}/stat/${sn}_normal_insertsize_histogram.pdf &

rm -f ${result}/${sn}/interval/*.interval_list
gatk SplitIntervals \
	-L /opt/ref/hg19/Illumina_pt2.interval_list \
	-R /opt/ref/hg19/hg19.fasta \
    -O ${result}/${sn}/interval \
    --scatter-count ${scatter} &

mamba deactivate

if [ ! -d "${envs}/${pn}.cnvkit" ]; then
  	mamba create -n ${pn}.cnvkit -y cnvkit=${version_cnvkit}
fi

if [ ! -f "/opt/ref/hg19/refFlat.txt" ]; then
	aria2c -x 16 -s 16 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz -d /opt/ref/hg19
	cd /opt/ref/hg19 && gzip -d refFlat.txt.gz
fi

mamba activate ${pn}.cnvkit

cnvkit.py batch \
	${result}/${sn}/aligned/${sn}_tumor_marked.bam \
    --normal ${result}/${sn}/aligned/${sn}_normal_marked.bam \
    --method hybrid \
    --targets /opt/ref/hg19/Illumina_pt2.bed \
    --annotate /opt/ref/hg19/refFlat.txt \
    --output-reference ${result}/${sn}/cnv/${sn}_reference.cnn \
    --output-dir ${result}/${sn}/cnv/ \
    --diagram \
    -p ${threads} &

mamba deactivate

mamba activate ${pn}.align

samtools depth -a -b /opt/ref/hg19/Illumina_pt2.bed  --threads ${threads} \
	${result}/${sn}/aligned/${sn}_tumor_marked.bam > \
    ${result}/${sn}/stat/${sn}_tumor_marked.depth  &

samtools depth -a -b /opt/ref/hg19/Illumina_pt2.bed  --threads ${threads} \
	${result}/${sn}/aligned/${sn}_normal_marked.bam > \
	${result}/${sn}/stat/${sn}_normal_marked.depth &

samtools flagstat --threads ${threads} \
	${result}/${sn}/aligned/${sn}_tumor_marked.bam  > \
    ${result}/${sn}/stat/${sn}_tumor_marked.flagstat   &

samtools flagstat --threads ${threads} \
	${result}/${sn}/aligned/${sn}_normal_marked.bam > \
    ${result}/${sn}/stat/${sn}_normal_marked.flagstat &
	
mamba deactivate

wait
  1. 计算堆叠数据( pileup metrics )以便后续评估污染,也可以根据拆分的interval list并行处理,处理之后合并。
#官方巨坑,默认提供的small_exac_common_3_b37.vcf.gz默认染色体坐标不是以chr开头而是数字
mamba activate gatk
#这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理
if [ ! -f "/opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz" ]; then
	if [ ! -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz" ]; then
		aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -d /opt/ref/hg19
	elif [ -f "/opt/ref/hg19/small_exac_common_3_b37.vcf.gz.aria2" ]; then
		aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
		aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
		#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
		chmod a+x ${envs}/VcfProcessUtil.py
	fi
	${envs}/VcfProcessUtil.py \
		-f /opt/ref/hg19/small_exac_common_3_b37.vcf.gz \
		-o /opt/ref/hg19/small_exac_common_3_b37.processed.vcf
	cd /opt/ref/hg19
	bgzip -f --threads ${threads} small_exac_common_3_b37.processed.vcf
	tabix -f small_exac_common_3_b37.processed.vcf.gz
fi


for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
    echo $i
    gatk GetPileupSummaries \
        -R /opt/ref/hg19/hg19.fasta \
        -I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
        -O ${i%.*}-tumor-pileups.table \
        -V /opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz \
        -L $i \
        --interval-set-rule INTERSECTION &
    
    gatk GetPileupSummaries \
        -R /opt/ref/hg19/hg19.fasta \
        -I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
        -O ${i%.*}-normal-pileups.table \
        -V /opt/ref/hg19/small_exac_common_3_b37.processed.vcf.gz \
        -L $i \
        --interval-set-rule INTERSECTION &
    
done
wait



tables=
for i in `ls ${result}/${sn}/interval/*-tumor-pileups.table`;
do
    tables="$tables -I $i"
done

gatk GatherPileupSummaries \
    --sequence-dictionary /opt/ref/hg19/hg19.dict \
    $tables \
    -O ${result}/${sn}/stat/${sn}_tumor_pileups.table

nctables=
for i in `ls ${result}/${sn}/interval/*-normal-pileups.table`;
do
    nctables="$nctables -I $i"
done

gatk GatherPileupSummaries \
    --sequence-dictionary /opt/ref/hg19/hg19.dict \
    $nctables \
    -O ${result}/${sn}/stat/${sn}_normal_pileups.table
	
mamba deactivate
  1. 使用GetPileupSummaries计算结果评估跨样本污染,结果用于后面 FilterMutectCall 过滤Mutect2输出结果
mamba activate gatk

gatk CalculateContamination \
    -matched ${result}/${sn}/stat/${sn}_normal_pileups.table \
    -I ${result}/${sn}/stat/${sn}_tumor_pileups.table \
    -O ${result}/${sn}/stat/${sn}_contamination.table
	
mamba deactivate
  1. Mutect2 call 突变,使用拆分的interval list,结束后将结果合并;同时并行运行manta call sv突变
mkdir -p ${result}/${sn}/sv
mkdir -p ${result}/${sn}/snv

mamba activate gatk
#这里有个巨坑,从broadinstitute ftp 站点bundle Mutect2目录下载的参考文件,与同样下载的参考序列基因组坐标系不一致,参考基因组参考序列是chr1这种格式,这个af-only-gnomad是1,2,3这种格式,需要编写脚本处理;hg38貌似没有这个问题,hg19的数据都不维护了么?
if [ ! -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf.gz" ]; then
	if [ ! -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz" ]; then
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -d /opt/ref/hg19
	elif [ -f "/opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz.aria2" ]; then
		aria2c -x 16 -s 32 ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/Mutect2/af-only-gnomad.raw.sites.b37.vcf.gz -c -d /opt/ref/hg19
	fi
	if [ ! -f "${envs}/VcfProcessUtil.py" ]; then
		aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
		#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/VcfProcessUtil.py -d ${envs}/
		chmod a+x ${envs}/VcfProcessUtil.py
	fi
	${envs}/VcfProcessUtil.py \
		-f /opt/ref/hg19/af-only-gnomad.raw.sites.b37.vcf.gz \
		-o /opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf
	cd /opt/ref/hg19
	bgzip -f --threads ${threads} af-only-gnomad.raw.sites.b37.processed.vcf
	tabix -f af-only-gnomad.raw.sites.b37.processed.vcf.gz
fi


if [ ! -f "/opt/ref/hg19/Illumina_pt2.bed.gz" ]; then
	bgzip -f -c /opt/ref/hg19/Illumina_pt2.bed > /opt/ref/hg19/Illumina_pt2.bed.gz
	tabix -f -p bed /opt/ref/hg19/Illumina_pt2.bed.gz
else
	if [ ! -f "/opt/ref/hg19/Illumina_pt2.bed.gz.tbi" ]; then
		tabix -f -p bed /opt/ref/hg19/Illumina_pt2.bed.gz
	fi
fi
mamba deactivate

if [ ! -d "${envs}/${pn}.manta" ]; then
  mamba create -n ${pn}.manta -y manta=1.6.0
fi

mamba activate ${pn}.manta

rm -f ${result}/${sn}/sv/runWorkflow.py*
configManta.py  \
    --normalBam ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam \
    --tumorBam  ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam \
    --referenceFasta /opt/ref/hg19/hg19.fasta \
    --exome \
    --callRegions /opt/ref/hg19/Illumina_pt2.bed.gz \
    --runDir ${result}/${sn}/sv

rm -rf ${result}/${sn}/sv/workspace
python ${result}/${sn}/sv/runWorkflow.py -m local -j ${threads} &

mamba deactivate

mamba activate gatk

rm -f ${result}/${sn}/snv/vcf-file.list
touch ${result}/${sn}/snv/vcf-file.list
for i in `ls ${result}/${sn}/interval/*.interval_list`;
do
   rm -f ${i%.*}_bqsr.vcf.gz
   gatk Mutect2 \
        -R /opt/ref/hg19/hg19.fasta \
        -I ${result}/${sn}/bqsr/${sn}_tumor_bqsr.bam  -tumor  ${sn}_tumor  \
        -I ${result}/${sn}/bqsr/${sn}_normal_bqsr.bam -normal ${sn}_normal \
        -L $i \
        -O ${i%.*}_bqsr.vcf.gz \
        --max-mnp-distance 10 \
        --germline-resource /opt/ref/hg19/af-only-gnomad.raw.sites.b37.processed.vcf.gz \
        --native-pair-hmm-threads ${threads} &
    echo ${i%.*}_bqsr.vcf.gz >> ${result}/${sn}/snv/vcf-file.list
done
wait

rm -f ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats
stats=
for z in `ls ${result}/${sn}/interval/*_bqsr.vcf.gz.stats`;
do
    stats="$stats -stats $z"
done

gatk MergeMutectStats $stats \
	-O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz.stats

gatk MergeVcfs \
    -I ${result}/${sn}/snv/vcf-file.list \
    -O ${result}/${sn}/snv/${sn}_bqsr.vcf.gz
	
mamba deactivate
  1. FilterMutectCalls 对Mutect结果突变过滤
mamba activate gatk

gatk FilterMutectCalls \
    --max-events-in-region ${event} \
    --contamination-table ${result}/${sn}/stat/${sn}_contamination.table \
    -R /opt/ref/hg19/hg19.fasta \
    -V ${result}/${sn}/snv/${sn}_bqsr.vcf.gz \
    -O ${result}/${sn}/snv/${sn}_filtered.vcf.gz
	
mamba deactivate
  1. 使用Vep注释过滤结果
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/${pn}.vep" ]; then
  echo "Creating the environment ${pn}.vep"
  mamba create -n ${pn}.vep -y ensembl-vep=${version_vep}
fi

mkdir -p /opt/result/${sn}/vcf
#检测vep注释数据库是否存在如果不存在则先下载
if [ ! -d "/opt/ref/vep-cache/homo_sapiens/${version_vep}_GRCh37" ]; then
	aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
	tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
	aria2c -x 16 -s 48 https://ftp.ensembl.org/pub/release-${version_vep}/variation/indexed_vep_cache/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
	tar -zxvf /opt/ref/homo_sapiens_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
fi

if [ ! -d "/opt/ref/vep-cache/homo_sapiens_refseq/${version_vep}_GRCh37" ]; then
	aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -d /opt/ref/
	tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
elif [ -f "/opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz.aria2" ]; then
	aria2c -x 16 -s 48 http://ftp.ensembl.org/pub/release-${version_vep}/variation/vep/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -c -d /opt/ref/
	tar -zxvf /opt/ref/homo_sapiens_refseq_vep_${version_vep}_GRCh37.tar.gz -C /opt/ref/vep-cache/
fi

mamba activate ${pn}.vep

mkdir -p ${result}/${sn}/annotation
vep \
    -i ${result}/${sn}/snv/${sn}_filtered.vcf.gz  \
    -o ${result}/${sn}/annotation/${sn}_filtered_vep.tsv \
    --offline \
    --cache \
    --cache_version ${version_vep} \
    --everything \
    --dir_cache /opt/ref/vep-cache/ \
    --dir_plugins /opt/ref/vep-cache/Plugins \
    --species homo_sapiens \
    --assembly GRCh37 \
    --fasta /opt/ref/hg19/hg19.fasta \
    --refseq \
    --force_overwrite \
    --format vcf \
    --tab \
    --shift_3prime 1  \
    --offline
	
mamba deactivate
  1. 使用脚本处理注释结果和过滤vcf结果,输出和室间质评要求格式的数据表格
mamba activate ${pn}.cnvkit

if [ ! -f "${envs}/MatchedSnvVepAnnotationFilter.py" ]; then
	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedSnvVepAnnotationFilter.py -d ${envs}/
	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedSnvVepAnnotationFilter.py -d ${envs}/
	chmod a+x ${envs}/MatchedSnvVepAnnotationFilter.py
fi

${envs}/MatchedSnvVepAnnotationFilter.py \
	-e normal_artifact   \
    -e germline   \
    -i strand_bias   \
    -i clustered_events   \
    --min-vaf=${snv_vaf} \
    --min-tlod=${snv_tlod} \
    --min-depth=${snv_depth} \
    -v ${result}/${sn}/snv/${sn}_filtered.vcf.gz   \
    -a ${result}/${sn}/annotation/${sn}_filtered_vep.tsv   \
    -o ${result}/${sn}/annotation/${sn}.result.SNV.tsv
	
mamba deactivate
  1. 使用cnvkit提供工具输出分布图和热图
mamba activate ${pn}.cnvkit

cnvkit.py scatter ${result}/${sn}/cnv/${sn}_tumor_marked.cnr \
	-s ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
	-i ' ' \
	-n ${sn}_normal \
    -o ${result}/${sn}/cnv/${sn}_cnv_scatter.png -t  &&
 
cnvkit.py heatmap ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
	-o ${result}/${sn}/cnv/${sn}_cnv_heatmap.png

mamba deactivate
  1. 使用cnvkit call 根据cnvkit batch输出结果推算拷贝数
mamba activate ${pn}.cnvkit

cnvkit.py call ${result}/${sn}/cnv/${sn}_tumor_marked.cns \
	-o ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns
	
mamba deactivate
  1. 编写脚本处理cnvkit输出,计算cnv基因,exon位置,gain/lost,cn数
mamba activate ${pn}.cnvkit

if [ ! -f "${envs}/CnvAnnotationFilter.py" ]; then
	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/CnvAnnotationFilter.py -d ${envs}/
	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/CnvAnnotationFilter.py -d ${envs}/
	chmod a+x ${envs}/CnvAnnotationFilter.py
fi

if [ ! -f "/opt/ref/hg19/hg19_refGene.txt" ]; then
	aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz -d /opt/ref/hg19 -o hg19_refGene.txt.gz
	cd /opt/ref/hg19 && gzip -d hg19_refGene.txt.gz
fi

python ${envs}/CnvAnnotationFilter.py  \
  -r /opt/ref/hg19/hg19_refGene.txt \
  -i ${cnv_min} \
  -x ${cnv_max} \
  -D ${cnv_dep} \
  -f ${result}/${sn}/cnv/${sn}_tumor_marked.call.cns \
  -o ${result}/${sn}/cnv/${sn}.result.CNV.tsv
  
mamba deactivate
  1. 编写脚本处理manta的输出,获取最终sv输出结果,起始位置,基因、频率等
mamba activate ${pn}.cnvkit

if [ ! -f "${envs}/SvAnnotationFilter.py" ]; then
	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/SvAnnotationFilter.py -d ${envs}/
	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/SvAnnotationFilter.py -d ${envs}/
	chmod a+x ${envs}/SvAnnotationFilter.py
fi

if [ ! -f "/opt/ref/hg19/hg19_refGene.txt" ]; then
	aria2c -x 16 -s 16 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz -d /opt/ref/hg19 -o hg19_refGene.txt.gz
	cd /opt/ref/hg19 && gzip -d hg19_refGene.txt.gz
fi

${envs}/SvAnnotationFilter.py \
  -r /opt/ref/hg19/hg19_refGene.txt \
  -s ${sv_score} \
  -f ${result}/${sn}/sv/results/variants/somaticSV.vcf.gz \
  -o ${result}/${sn}/sv/${sn}.result.SV.tsv
  
mamba deactivate
  1. 根据之前fastp,samtools depth,samtools flagstat,gatk CollectInsertSizeMetrics等输出,给出综合 QC数据
mamba activate ${pn}.cnvkit

if [ ! -f "${envs}/MatchedQcProcessor.py" ]; then
	aria2c https://raw.fgit.cf/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedQcProcessor.py -d ${envs}/
	#aria2c https://raw.githubusercontent.com/doujiangbaozi/sliverworkspace-util/main/somatic/MatchedQcProcessor.py -d ${envs}/
	chmod a+x ${envs}/MatchedQcProcessor.py
fi

${envs}/MatchedQcProcessor.py  --bed /opt/ref/hg19/Illumina_pt2.bed \
    --out ${result}/${sn}/stat/${sn}.result.QC.tsv \
    --sample-fastp=${result}/${sn}/trimmed/${sn}_tumor_fastp.json \
    --sample-depth=${result}/${sn}/stat/${sn}_tumor_marked.depth \
    --sample-flagstat=${result}/${sn}/stat/${sn}_tumor_marked.flagstat \
    --sample-insertsize=${result}/${sn}/stat/${sn}_tumor_insertsize_metrics.txt \
    --normal-fastp=${result}/${sn}/trimmed/${sn}_normal_fastp.json \
    --normal-depth=${result}/${sn}/stat/${sn}_normal_marked.depth \
    --normal-flagstat=${result}/${sn}/stat/${sn}_normal_marked.flagstat  \
    --normal-insertsize=${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt
	
mamba deactivate

最终输出

文件名

备注

1701/1701.result.SNV.tsv

SNV最终突变结果

1701/1701/cnv/1701_cnv_heatmap.png

CNV结果热图

1701/cnv/1701_cnv_scatter.png

CNV结果分布图

1701/cnv/1701.result.CNV.tsv

CNV最终结果

1701.result.SV.tsv

SV最终结果

1701.result.QC.tsv

最终质控结果

相关推荐

得物可观测平台架构升级:基于GreptimeDB的全新监控体系实践

一、摘要在前端可观测分析场景中,需要实时观测并处理多地、多环境的运行情况,以保障Web应用和移动端的可用性与性能。传统方案往往依赖代理Agent→消息队列→流计算引擎→OLAP存储...

warm-flow新春版:网关直连和流程图重构

本期主要解决了网关直连和流程图重构,可以自此之后可支持各种复杂的网关混合、多网关直连使用。-新增Ruoyi-Vue-Plus优秀开源集成案例更新日志[feat]导入、导出和保存等新增json格式支持...

扣子空间体验报告

在数字化时代,智能工具的应用正不断拓展到我们工作和生活的各个角落。从任务规划到项目执行,再到任务管理,作者深入探讨了这款工具在不同场景下的表现和潜力。通过具体的应用实例,文章展示了扣子空间如何帮助用户...

spider-flow:开源的可视化方式定义爬虫方案

spider-flow简介spider-flow是一个爬虫平台,以可视化推拽方式定义爬取流程,无需代码即可实现一个爬虫服务。spider-flow特性支持css选择器、正则提取支持JSON/XML格式...

solon-flow 你好世界!

solon-flow是一个基础级的流处理引擎(可用于业务规则、决策处理、计算编排、流程审批等......)。提供有“开放式”驱动定制支持,像jdbc有mysql或pgsql等驱动,可...

新一代开源爬虫平台:SpiderFlow

SpiderFlow:新一代爬虫平台,以图形化方式定义爬虫流程,不写代码即可完成爬虫。-精选真开源,释放新价值。概览Spider-Flow是一个开源的、面向所有用户的Web端爬虫构建平台,它使用Ja...

通过 SQL 训练机器学习模型的引擎

关注薪资待遇的同学应该知道,机器学习相关的岗位工资普遍偏高啊。同时随着各种通用机器学习框架的出现,机器学习的门槛也在逐渐降低,训练一个简单的机器学习模型变得不那么难。但是不得不承认对于一些数据相关的工...

鼠须管输入法rime for Mac

鼠须管输入法forMac是一款十分新颖的跨平台输入法软件,全名是中州韵输入法引擎,鼠须管输入法mac版不仅仅是一个输入法,而是一个输入法算法框架。Rime的基础架构十分精良,一套算法支持了拼音、...

Go语言 1.20 版本正式发布:新版详细介绍

Go1.20简介最新的Go版本1.20在Go1.19发布六个月后发布。它的大部分更改都在工具链、运行时和库的实现中。一如既往,该版本保持了Go1的兼容性承诺。我们期望几乎所...

iOS 10平台SpriteKit新特性之Tile Maps(上)

简介苹果公司在WWDC2016大会上向人们展示了一大批新的好东西。其中之一就是SpriteKitTileEditor。这款工具易于上手,而且看起来速度特别快。在本教程中,你将了解关于TileE...

程序员简历例句—范例Java、Python、C++模板

个人简介通用简介:有良好的代码风格,通过添加注释提高代码可读性,注重代码质量,研读过XXX,XXX等多个开源项目源码从而学习增强代码的健壮性与扩展性。具备良好的代码编程习惯及文档编写能力,参与多个高...

Telerik UI for iOS Q3 2015正式发布

近日,TelerikUIforiOS正式发布了Q32015。新版本新增对XCode7、Swift2.0和iOS9的支持,同时还新增了对数轴、不连续的日期时间轴等;改进TKDataPoin...

ios使用ijkplayer+nginx进行视频直播

上两节,我们讲到使用nginx和ngixn的rtmp模块搭建直播的服务器,接着我们讲解了在Android使用ijkplayer来作为我们的视频直播播放器,整个过程中,需要注意的就是ijlplayer编...

IOS技术分享|iOS快速生成开发文档(一)

前言对于开发人员而言,文档的作用不言而喻。文档不仅可以提高软件开发效率,还能便于以后的软件开发、使用和维护。本文主要讲述Objective-C快速生成开发文档工具appledoc。简介apple...

macOS下配置VS Code C++开发环境

本文介绍在苹果macOS操作系统下,配置VisualStudioCode的C/C++开发环境的过程,本环境使用Clang/LLVM编译器和调试器。一、前置条件本文默认前置条件是,您的开发设备已...