# 2.Expression Matrix

## Pipeline

![](https://2588312385-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Lf2oKhsQtPU8HhQ7FMe%2F-Lf2p7Rgz7QJMvl3HLTb%2F-Lf2pBH-n3hPheTggqUG%2Fexpression.png?generation=1558063671020410\&alt=media)

## Data Structure

```
"genomes/hg38/"         # reference genomes (i.e. genome sequence and annotation)
"shared_scripts/"       # shared scripts by Lu Lab
~/github                # I sync my own scripts to github

~/proj_exRNA/
|-- RNA_index           # rerference transcriptomes (fasta and index) 
|-- sample*_name        # ...
|-- sample2_name        # different samples     
`-- sample1_name        
    |-- fastq           # raw data: fastq files
    |-- fastqc          # QC of fastq
    |-- trim            # trimmed fastq (e.g. 3' adaptor cutted)
    |-- mapped          # mapped data: SAM/BAM files
    `-- exp             # expression of each gene/ncRNA
```

### **Inputs**

| **File format** | **Information contained in file** | **File description**                            | **Notes**                                                                                                                                          |
| --------------- | --------------------------------- | ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| bam             | **alignments**                    | Produced by mapping reads to the transcriptome. | Reads are trimmed using a proprietary version of cutAdapt. We map to transcriptome for a better sensitivity (see details in protocol and example). |

### **Outputs**

| **File format** | **Information contained in file** | **File description**      | **Notes**                                                                                                                   |
| --------------- | --------------------------------- | ------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| bigWig          | **signal**                        | Normalized RNA-seq signal | Signals are generated for transcriptome both the plus and minus strands and for unique reads and unique+multimapping reads. |
| tsv             | **gene (ncRNA) quantifications**  | Non-normalized counts.    |                                                                                                                             |

## Running Scripts

### Software/Tools

* RSEM
* homer
* HTseq
* FeatureCounts

### Example of single case

{% tabs %}
{% tab title="Set Parameters" %}

```bash
# genome seuqneces and annotaions
ln -s /BioII/lulab_b/shared/genomes ~/genomes
export hg38=~/genomes/human_hg38/sequence/GRCh38.p12.genome.fa
export gtf=~/genomes/human_hg38/gtf

# working space
cd ~/proj_exRNA
```

{% endtab %}

{% tab title="0.Convert & View" %}

```bash
# first convert relative positions to genomic coordinates
# then convert bam to bigWig, so that we can view it in genome browser

#####################################################################
#0.1 convert the relative positions on RNAs to genomic coordinates 
#####################################################################

rsem-tbam2gbam miRNA.indexDir/ NC_1.miRNA.sam NC_1.miRNA.rsem.bam

#####################################################################
#0.2 two ways to convert bam/sam to bedGraph/bigwig for visulaization 
#####################################################################
#0.2a Using homer
#Make tag directories for each sample (e.g. NC_1)
#works with sam or bam (samtools must be installed for bam) 
makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam

## Add "-strand separate" for strand-specific sequencing 
makeUCSCfile NC_1.miRNA.tagsDir/ -fragLength given -o auto
# (repeat for other tag directories)
```

{% endtab %}

{% tab title="a. homer" %}

```
makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam analyzeRepeats.pl analyzeRepeats.pl /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gtf hg38 -count exons -d NC_1.miRNA.tagsDir/ -gid -noadj > NC_1.miRNA.homer.rpm
```

{% endtab %}

{% tab title="b. HTseq" %}

```
htseq-count -f bam -m intersection-strict miRNA_primary_transcript NC_1.miRNA_hg38_mapped.sam /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff > NC_1.miRNA.htseq.counts
```

{% endtab %}

{% tab title="c.FeatureCounts" %}

```
featureCounts -s 1 -a /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff -o NC_1.miRNA.featureCounts.counts NC_1.miRNA.sorted.bam
```

{% endtab %}
{% endtabs %}

[Example of batch job](https://github.com/lulab/training/tree/master/proj_exRNA/example_small)

## Tips/Utilities

#### Merge multiple bams to one

```
# mutiple bams to one bam file, in Snakemake form

rule mergeBam:
    input:
        bam = "02.mapping/{library}/sequentialMap/{library}.hg38other.bam"
    output:
        bam = "02.mapping/{library}/sequentialMap/{library}.sequentialMap.bam"
    params:
        jobname = "{library}.mergeBam",
        dir = "02.mapping/{library}/sequentialMap/"
    shell:
        """
        samtools merge {output.bam} {input.bam} `find {params.dir} -name "*.rsem.clean.bam"`
```

## Homework and more

1. Visualize your mapped reads with IGV (locally) and/or UCSC Genome Browser (on line).
2. Learn how to construct the expression matrix using HTSeq, featureCounts and homer; then compare the difference among these three methods.&#x20;

#### More Reading and Practice[ ](https://youngleebbs.gitbooks.io/bioinformatics-training-program/content/exrna-seq-analysis/1preprocessing-mapping-and-qc.html)

* [Additional Tutorial ](https://lu-lab.gitbook.io/training/getting-startted#learning-materials): 2. Construction of expression matrix
* Bioinformatics Data Skills
  * 11\) Working with Alignment Data

## Video

### a) Expression matrix

[@Youtube](https://youtu.be/DPGpPuzVg_o)

[@Bilibili](https://player.bilibili.com/player.html?aid=30592382\&cid=53396383\&page=1)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://lu-lab.gitbook.io/training/part-ii.-basic-bioinfo-analyses/2.expression-matrix.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
