"genomes/hg38/" # reference genomes (i.e. genome sequence and annotation)
"shared_scripts/" # shared scripts by Lu Lab
~/github # I sync my own scripts to github
~/proj_exRNA/
|-- RNA_index # rerference transcriptomes (fasta and index)
|-- sample*_name # ...
|-- sample2_name # different samples
`-- sample1_name
|-- fastq # raw data: fastq files
|-- fastqc # QC of fastq
|-- trim # trimmed fastq (e.g. 3' adaptor cutted)
`-- mapped # mapped data: SAM/BAM files
Inputs
Outputs
Running Scripts
Software/Tools
RSEM: index and many utility package for transcriptome analysis
fastqc: QC for fastq file
cutadapt: trim and cut adaptor
bowtie2: mapping
Example of single case
# genome seuqneces and annotaionsln-s/BioII/lulab_b/shared/genomes $HOME/genomesexport hg38=$HOME/genomes/human_hg38/sequence/GRCh38.p10.genome.faexport gtf=$HOME/genomes/human_hg38/anno/gtfexport index=$HOME/genomes/human_hg38/index/transcriptome_rsem_bowtie2# raw data (fastq files)cd $HOME/proj_exRNA/example/fastqln-s/BioII/lulab_b/shared/projects/exRNA/example/100K/*.fastq.# working spacecd $HOME/proj_exRNA
# Produce indexed transcriptome, using miRNA as an example# Usually, you do not need to do this because we have already index one# in "$index"# We need RSEM to index transcriptome (RNAs) because we'll convert # the relative coordinates to genomic coordinats after mapping to RNAs############################################################## 0: build bowtie2 index using RSEM for RNA transcriptome#############################################################rsem-prepare-reference--gtf $gtf/miRNA.gencode27.gtf--bowtie2 $hg38 \RNA_index/miRNA.gencode27>log.02>err.0&# RSEM will take exons only # only if you prepare gtf files in the right format
# check raw reads' quality# trim low quality ends + keep first 50nt + remove 3' adapter############################################################## 1.1: fastaqc - repeat fastqc before and after each trim step#############################################################fastqc-q-oexample/fastqcNC_1.fastq \>log.1.12>err.1.1&#output the QC files to a dir examples/fastqc############################################################## 1.2: trim - cut adaptor + trim long read#############################################################cutadapt-u-100-q30,30--trim-n-m15-a \AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-oNC_1.fastq.trimNC_1.fastq# -u -100 remove last 100nt so that the first 50nt is kept# -q 30,30 read quality need to be above 30# -m 15 reads less than 15nt are removed############################################################## repeat 1.3 QC after 1.2 Trim# make sure the low quality reads have been removed and/or trimmed.#############################################################fastqc-q-oexample/fastqcNC_1.fastq.trim \>log.1.32>err.1.3&
############################################################# # 1.4: clean rRNA reads #############################################################cd $proj/fastqfor i in`ls*.fastq` ; do sample="${i/.fastq/}" sam=$sample.rRNA.sam bam=$sample.rRNA.bam fqin=$sample.fastq.trim fqout=$sample.no_rRNA.fq RNA=$index/rRNAecho"start $sample :"bowtie2-p4--sensitive-local--norc--no-unal--un../mapped/$fqout \-x $RNA $fqin -S../mapped/$sam #map to rRNAs (exons only, sense strand only)# (default) look for multiple alignments, report best, with MAPQ# --sensitive-local(default): allow no mismatch, etc# --norc: do not align reverse-complement version of read# --no-unal: suppress SAM records for unaligned reads# --un: store unmapped reads# -x: indexed genome/transcriptome # -S: output file foramt as samsamtoolsview-b../mapped/$sam >../mapped/$bam#convert sam to bam for mapped reads on rRNAs (sense strand)echo"$sample finished."done
# map to different RNAs (transcriptome) step by step############################################################## 1.5 align cleaned reads to transcriptome (different RNAs) # first align to miRNAs#############################################################cd $proj/fastqfor i in`ls*.fastq` ; do sample="${i/.fastq/}" fqin=$sample.no_rRNA.fq fqout=$sample.no_miRNA.fq sam=$sample.miRNA.sam bam=$sample.miRNA.bam RNA=$index/miRNAecho"start $sample :"bowtie2-p4--sensitive-local--norc--no-unal--un../mapped/$fqout \-x $RNA ../mapped/$fqin -S../mapped/$sam #map to miRNAs (pre-miRNA or pri-miRNA, sense strand only)# (default) look for multiple alignments, report best, with MAPQ# --sensitive-local(default): allow no mismatch, etc# --norc: do not align reverse-complement version of read# --no-unal: suppress SAM records for unaligned reads# --un: store unmapped reads# -x: indexed genome/transcriptome # -S: output file foramt as samsamtoolsview-b../mapped/$sam >../mapped/$bam#convert sam to bam for mapped reads on miRNAs (sense strand)echo"$sample finished."done############################################################### then align to next RNA (e.g. piRNA)# repeat mapping ...
less-S*.gff# chop long lines rather than wrap them
Convert gff to gtf
gffreadtRNA.gff-T-otRNA.gtf
Lift over gff/bed/bam version
# convert bedliftOverold.bedmap.chainnew.bedunmapped# convert gtf## option 1: simple method (not recommended)liftOver-gffold.gtfmap.chainnew.gtfunmapped### -gff File is in gff/gtf format. Note that the gff lines are converted### separately. It would be good to have a separate check after this### that the lines that make up a gene model still make a plausible gene### after liftOver## option 2: conver to genePred format firstgtfToGenePred-genePredExtold.gtfold.gpliftOver-genePredold.gpmap.chainnew.gpunmappedgenePredToGtffilenew.gpnew.gtf
Collapse duplicated reads
Not needed for sRNA-seq.
#option 1: collapse before mappingfastx_collapser-Q33-iinput.fastq-ooutput.collapsed.fastq#option 2: Collapse after mappingsamtoolsrmdup-ssorted.bamoutput.dedup.bamjava-jarpicard.jarMarkDuplicates \I=input.bam \O=output.markDuplicates \M=output.metric \REMOVE_DUPLICATES=true
Homework and more
1 Map on example data
Map the example fastq files and summarize the distribution of 1) reads' length and 2) reads' percentage for each type of RNA (e.g. miRNA, piRNA, Y RNA, srp RNA, etc) **. (You may find some script examples for this statistics calculation from this link.)
2 Sample QC
Use the following criteria to QC the mapped samples for cfRNAs (exRNAs):
3 A fast multi mapping methods to test different mapping orders
3.1Why use this method: We want to determine a satisfying mapping order by examine the mapping ratio and length distribution. It takes hours to map and collect statistics from a specific order. We use a special method, using python (numpy) to manipulate matrix. By using this method, we can test hundreds of mapping order within few minutes.
3.2Scripts
At first we do mapping for all RNAs at one time, we use this snakefile and config file to do mapping at one time. (One map snakefile. One map config.)
Secondly we do some matrix manipulation tricks. Using numpy to do matrix indexing and sorting is ultra fast. By using some tricks, we can have the desired statistics within few minutes for hundreds of different mapping orders: (Mapping order for ratio. Mapping order for length distribution)(You can find some tries and tests of the mapping order methods and statistics and plot jupyter notebook here: Statistics and plot)
At last we also extend our algorithm, we can create a big dictionary to store all the possible mapping assignment in every order and every one map condition. We encode the binary condition sequence into 10-bit digits. And store the whole dictionary together. It can test up to 100,000 kinds of mapping order within few hours. The jupyter-notebook presents the main idea of this method. (Ultra-fast mapping methods)