# 1.Mapping, Annotation and QC

## Pipeline

![](https://2588312385-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Lf2oKhsQtPU8HhQ7FMe%2F-Lf2p7Rgz7QJMvl3HLTb%2F-Lf2p9lGfNDORGtf4hnC%2Fmapping.png?generation=1558063670333738\&alt=media)

## Data Structure

```
"genomes/hg38/"         # reference genomes (i.e. genome sequence and annotation)
"shared_scripts/"       # shared scripts by Lu Lab
~/github                # I sync my own scripts to github

~/proj_exRNA/
|-- RNA_index           # rerference transcriptomes (fasta and index) 
|-- sample*_name        # ...
|-- sample2_name        # different samples     
`-- sample1_name        
    |-- fastq           # raw data: fastq files
    |-- fastqc          # QC of fastq
    |-- trim            # trimmed fastq (e.g. 3' adaptor cutted)
    `-- mapped          # mapped data: SAM/BAM files
```

### **Inputs**

| **File format** | **Information contained in file**  | **File description**                                                  | **Notes**                                                                                                                                         |
| --------------- | ---------------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| fastq           | **reads**                          | small RNA-seq reads from single-end libraries.                        | This will be processed from long reads we got. Raw reads are trimmed as single-ended, 50nt clean reads (see details in the protocol and example). |
| fasta           | **genome/transcriptome sequences** | sequences of the whole human genome or different RNAs (transcriptome) |                                                                                                                                                   |
| gtf / gff       | **genome annotation**              | Default genome annotation file is from GENCODE, mirBase, etc.         | Be careful about exon and intron for mRNA/lncRNA,  precursor and mature product for miRNA.                                                        |

### **Outputs**

| **File format** | **Information contained in file**                                 | **File description**                       | **Notes**                                                                               |
| --------------- | ----------------------------------------------------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------- |
| sam/bam         | **alignments, i.e. mapped reads (bam is a binary format of sam)** | Produced by mapping reads to transcriptome | We map to transcriptome for a better sensitivity (see details in protocol and example). |
| bt2             | **indexed sequences**                                             | Indexed genome/ transcriptome              | Different ncRNA types are indexed separately and mapped sequentially.                   |

## Running Scripts

### Software/Tools

* **RSEM**: index and many utility package for transcriptome analysis
* **fastqc**: QC for fastq file&#x20;
* **cutadapt**: trim and cut adaptor
* **bowtie2**: mapping

### Example of single case

{% tabs %}
{% tab title="Set Parameters" %}

```bash
# genome seuqneces and annotaions
ln -s /BioII/lulab_b/shared/genomes $HOME/genomes
export hg38=$HOME/genomes/human_hg38/sequence/GRCh38.p10.genome.fa
export gtf=$HOME/genomes/human_hg38/anno/gtf
export index=$HOME/genomes/human_hg38/index/transcriptome_rsem_bowtie2

# raw data (fastq files)
cd $HOME/proj_exRNA/example/fastq
ln -s /BioII/lulab_b/shared/projects/exRNA/example/100K/*.fastq .

# working space
cd $HOME/proj_exRNA
```

{% endtab %}

{% tab title="0.Index" %}

```bash
# Produce indexed transcriptome, using miRNA as an example
# Usually, you do not need to do this because we have already index one
# in "$index"
# We need RSEM to index transcriptome (RNAs) because we'll convert 
# the relative coordinates to genomic coordinats after mapping to RNAs

#############################################################
# 0: build bowtie2 index using RSEM for RNA transcriptome
#############################################################

rsem-prepare-reference --gtf $gtf/miRNA.gencode27.gtf --bowtie2 $hg38 \
  RNA_index/miRNA.gencode27 > log.0 2>err.0 &
  # RSEM will take exons only 
  # only if you prepare gtf files in the right format
```

{% endtab %}

{% tab title="1.1QC-1.2 Trim-1.3QC" %}

```bash
# check raw reads' quality
# trim low quality ends + keep first 50nt + remove 3' adapter

#############################################################
# 1.1: fastaqc - repeat fastqc before and after each trim step
#############################################################

fastqc -q -o example/fastqc NC_1.fastq \
  > log.1.1 2>err.1.1 &
    #output the QC files to a dir examples/fastqc

#############################################################
# 1.2: trim - cut adaptor + trim long read
#############################################################
cutadapt -u -100 -q 30,30 --trim-n -m 15 -a \
 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -o NC_1.fastq.trim NC_1.fastq  
    # -u -100    remove last 100nt so that the first 50nt is kept
    # -q 30,30   read quality need to be above 30
    # -m 15      reads less than 15nt are removed


#############################################################
# repeat 1.3 QC after 1.2 Trim
# make sure the low quality reads have been removed and/or trimmed.
#############################################################

fastqc -q -o example/fastqc NC_1.fastq.trim \
  > log.1.3 2>err.1.3 &
```

{% endtab %}

{% tab title="1.4 Clean rRNA reads" %}

```bash
############################################################# 
# 1.4: clean rRNA reads 
#############################################################

cd $proj/fastq 
for i in `ls *.fastq` ; do
    sample="${i/.fastq/}"
    sam=$sample.rRNA.sam
    bam=$sample.rRNA.bam
    fqin=$sample.fastq.trim
    fqout=$sample.no_rRNA.fq
    RNA=$index/rRNA
    echo "start $sample :"

    bowtie2 -p 4 --sensitive-local --norc --no-unal --un ../mapped/$fqout \
     -x $RNA $fqin -S ../mapped/$sam 
    #map to rRNAs (exons only, sense strand only)
    # (default) look for multiple alignments, report best, with MAPQ
    # --sensitive-local(default): allow no mismatch, etc
    # --norc:  do not align reverse-complement version of read
    # --no-unal: suppress SAM records for unaligned reads
    # --un: store unmapped reads
    # -x: indexed genome/transcriptome 
    # -S: output file foramt as sam

    samtools view -b ../mapped/$sam > ../mapped/$bam
    #convert sam to bam for mapped reads on rRNAs (sense strand)

    echo "$sample finished."
done
```

{% endtab %}

{% tab title="1.5 Mapping" %}

```bash
# map to different RNAs (transcriptome) step by step
#############################################################
# 1.5 align cleaned reads to transcriptome (different RNAs) 
# first align to miRNAs
#############################################################
cd $proj/fastq 
for i in `ls *.fastq`  ; do
    sample="${i/.fastq/}"
    fqin=$sample.no_rRNA.fq
    fqout=$sample.no_miRNA.fq
    sam=$sample.miRNA.sam
    bam=$sample.miRNA.bam
    RNA=$index/miRNA
    echo "start $sample :"

    bowtie2 -p 4 --sensitive-local --norc --no-unal --un ../mapped/$fqout \
         -x $RNA ../mapped/$fqin -S ../mapped/$sam 
    #map to miRNAs (pre-miRNA or pri-miRNA, sense strand only)
    # (default) look for multiple alignments, report best, with MAPQ
    # --sensitive-local(default): allow no mismatch, etc
    # --norc:  do not align reverse-complement version of read
    # --no-unal: suppress SAM records for unaligned reads
    # --un: store unmapped reads
    # -x: indexed genome/transcriptome 
    # -S: output file foramt as sam

    samtools view -b ../mapped/$sam > ../mapped/$bam
    #convert sam to bam for mapped reads on miRNAs (sense strand)

    echo "$sample finished."
done


##############################################################
# then align to next RNA (e.g. piRNA)
# repeat mapping ...
```

{% endtab %}
{% endtabs %}

### [Example of batch job](https://github.com/urluzhi/scripts/blob/master/projects/exRNA/run_example.sh)

## Tips/Utilities

#### A better view of gff/gtf file

```bash
less -S *.gff  # chop long lines rather than wrap them
```

#### Convert gff to gtf

```bash
gffread tRNA.gff -T -o tRNA.gtf
```

#### Lift over gff/bed/bam version

```bash
# convert bed
liftOver old.bed map.chain new.bed unmapped

# convert gtf
## option 1: simple method (not recommended)
liftOver -gff old.gtf map.chain new.gtf unmapped
###   -gff  File is in gff/gtf format.  Note that the gff lines are converted
###         separately.  It would be good to have a separate check after this
###         that the lines that make up a gene model still make a plausible gene
###         after liftOver

## option 2: conver to genePred format first
gtfToGenePred -genePredExt old.gtf old.gp
liftOver -genePred old.gp map.chain new.gp unmapped
genePredToGtf file new.gp new.gtf
```

#### Collapse duplicated reads

> Not needed for sRNA-seq.

```bash
#option 1: collapse before mapping
fastx_collapser -Q 33 -i input.fastq -o output.collapsed.fastq

#option 2: Collapse after mapping
samtools rmdup -s sorted.bam output.dedup.bam

java -jar picard.jar MarkDuplicates \
I=input.bam \
O=output.markDuplicates \
M=output.metric \
REMOVE_DUPLICATES=true
```

## Homework and more

#### 1 Map on example data

Map the example fastq files and summarize the distribution of 1) reads' length and 2) reads' percentage for **each type of RNA** (e.g. miRNA, piRNA, Y RNA, srp RNA, etc) *\*\**. (You may find some script examples for this statistics calculation from [this link](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/01.QC_mapping.sh).)

#### 2 Sample QC

Use the following criteria to QC the mapped samples for **cfRNAs** (exRNAs):

| **Check point**    | **Threshold**                                      | **Notes**                                                                                                                                        |
| ------------------ | -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| Raw reads quality  | reads quality >28 (median lines in green area)     | Check fastqc results(\*.html)                                                                                                                    |
| Clean reads number | **> 10 million**                                   | Adaptors and too-short sequences removed reads                                                                                                   |
| rRNAs%             | **< 10%**                                          | Reads mapped to rRNAs (all following % are divided by the **total number of clean reads**)                                                       |
| HG%                | > 60% (optional)                                   | Reads mapped to Human Genome **except rRNAs**                                                                                                    |
| Transcriptome%     | **> 50%**                                          | Reads mapped to Human **Transcriptome** (including rRNA, miRNA, piRNA, Y RNA, srpRNA, snRNA, snoRNA, tRNA, mRNA exons, lncRNA exons, TUCP exons) |
| Y RNA%             | **10%\~65%**                                       | Reads mapped to Y RNA                                                                                                                            |
| miRNA%             | **10%\~65% (**&#x75;p to 80% for exoRNA&#x73;**)** | Reads mapped to miRNA                                                                                                                            |

#### 3 A fast multi mapping methods to test different mapping orders

**3.1** **Why use this method**: We want to determine a satisfying mapping order by examine the mapping ratio and length distribution. It takes hours to map and collect statistics from a specific order. We use a special method, using python (numpy) to manipulate matrix. By using this method, we can test hundreds of mapping order within few minutes.

**3.2** **Scripts**

**At first** we do mapping for all RNAs at one time, we use this snakefile and config file to do mapping at one time. ([One map snakefile](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/onemap). [One map config](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/config).)

**Secondly** we do some matrix manipulation tricks. Using numpy to do matrix indexing and sorting is ultra fast. By using some tricks, we can have the desired statistics within few minutes for hundreds of different mapping orders: ([Mapping order for ratio](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/mappingmx.py). [Mapping order for length distribution](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/mappingmxlength.py))(You can find some tries and tests of the mapping order methods and statistics and plot jupyter notebook here: [Statistics and plot](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/mapping_test_and_statistics.ipynb))

**At last** we also extend our algorithm, we can create a big dictionary to store all the possible mapping assignment in every order and every one map condition. We encode the binary condition sequence into 10-bit digits. And store the whole dictionary together. It can test up to 100,000 kinds of mapping order within few hours. The jupyter-notebook presents the main idea of this method. ([Ultra-fast mapping methods](https://github.com/lulab/training/blob/master/proj_exRNA/example_small/fast_mapping_order/ultra_fast_dict_mapping.ipynb))

#### More Reading and Practice[ ](https://youngleebbs.gitbooks.io/bioinformatics-training-program/content/exrna-seq-analysis/1preprocessing-mapping-and-qc.html)

* [Additional Tutorial](https://lu-lab.gitbook.io/training/getting-startted#learning-materials): 1.preprocessing\_mapping\_QC
* [Teaching PPT](https://lu-lab.gitbook.io/training/getting-startted#learning-materials):&#x20;
  \*
  1. Introduction to exRNA-seq;&#x20;
  * 1) Mapping, Annotation and QC
* [Teaching Video](https://lu-lab.gitbook.io/training/getting-startted#learning-materials): &#x20;
  * Week V - PARTII.1.Mapping etc - Wang
  * Week VI - PARTII. 2. run your job - example 1 bash
* Bioinformatics Data Skills
  * 10\) Working with Sequence Data

## Video

### a) Mapping

[@Youtube](https://youtu.be/5v2fbx93TF4)

[@Bilibili](https://player.bilibili.com/player.html?aid=30591535\&cid=53394590\&page=1)

### b) Run your job example in bash

[@Youtube](https://youtu.be/BqxHRxxBVeQ)

[@Bilibili](https://player.bilibili.com/player.html?aid=30592322\&cid=53396126\&page=1)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://lu-lab.gitbook.io/training/part-ii.-basic-bioinfo-analyses/1.mapping-annotation-and-qc.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
