2018 - Bioinformatics Tutorial - Advanced (2018)
  • Bioinformatics Tutorial - Advanced (2018)
  • Getting Startted
  • PART I Basic Skills
    • Introduction of PART I
    • 1.Setup
    • 2.Linux
    • 3.Bash and Github
    • 4.R
    • 5.Python
    • 6.Perl
    • Conclusion of PART I
  • PART II. Basic Bioinfo Analyses
    • Introduction of PART II
    • 1.Mapping, Annotation and QC
    • 2.Expression Matrix
    • 3.Differential Expression
    • Midterm Conclusion
    • 4.Normalization
    • 5.Control Data
    • 6.Motif and Structure
  • PART III. Advanced Bioinfo Analyses
    • Introduction of PART III
    • 1.Machine Learning
    • 2.Feature Selection
    • 3.Deep Learning
  • Appendix
    • Appendix I. Keep Learning
    • Appendix II. Docker Manual
    • Appendix III. Mapping Protocol
Powered by GitBook
On this page
  • Pipeline
  • Data Structure
  • Inputs
  • Outputs
  • Running Scripts
  • Software/Tools
  • Example of single case
  • Tips/Utilities
  • Homework and more
  • Video
  • a) Expression matrix

Was this helpful?

  1. PART II. Basic Bioinfo Analyses

2.Expression Matrix

Previous1.Mapping, Annotation and QCNext3.Differential Expression

Last updated 5 years ago

Was this helpful?

Pipeline

Data Structure

"genomes/hg38/"         # reference genomes (i.e. genome sequence and annotation)
"shared_scripts/"       # shared scripts by Lu Lab
~/github                # I sync my own scripts to github

~/proj_exRNA/
|-- RNA_index           # rerference transcriptomes (fasta and index) 
|-- sample*_name        # ...
|-- sample2_name        # different samples     
`-- sample1_name        
    |-- fastq           # raw data: fastq files
    |-- fastqc          # QC of fastq
    |-- trim            # trimmed fastq (e.g. 3' adaptor cutted)
    |-- mapped          # mapped data: SAM/BAM files
    `-- exp             # expression of each gene/ncRNA

Inputs

File format

Information contained in file

File description

Notes

bam

alignments

Produced by mapping reads to the transcriptome.

Reads are trimmed using a proprietary version of cutAdapt. We map to transcriptome for a better sensitivity (see details in protocol and example).

Outputs

File format

Information contained in file

File description

Notes

bigWig

signal

Normalized RNA-seq signal

Signals are generated for transcriptome both the plus and minus strands and for unique reads and unique+multimapping reads.

tsv

gene (ncRNA) quantifications

Non-normalized counts.

Running Scripts

Software/Tools

  • RSEM

  • homer

  • HTseq

  • FeatureCounts

Example of single case

# genome seuqneces and annotaions
ln -s /BioII/lulab_b/shared/genomes ~/genomes
export hg38=~/genomes/human_hg38/sequence/GRCh38.p12.genome.fa
export gtf=~/genomes/human_hg38/gtf

# working space
cd ~/proj_exRNA
# first convert relative positions to genomic coordinates
# then convert bam to bigWig, so that we can view it in genome browser

#####################################################################
#0.1 convert the relative positions on RNAs to genomic coordinates 
#####################################################################

rsem-tbam2gbam miRNA.indexDir/ NC_1.miRNA.sam NC_1.miRNA.rsem.bam

#####################################################################
#0.2 two ways to convert bam/sam to bedGraph/bigwig for visulaization 
#####################################################################
#0.2a Using homer
#Make tag directories for each sample (e.g. NC_1)
#works with sam or bam (samtools must be installed for bam) 
makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam

## Add "-strand separate" for strand-specific sequencing 
makeUCSCfile NC_1.miRNA.tagsDir/ -fragLength given -o auto
# (repeat for other tag directories)
makeTagDirectory NC_1.miRNA.tagsDir/ NC_1.miRNA.sorted.bam analyzeRepeats.pl analyzeRepeats.pl /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gtf hg38 -count exons -d NC_1.miRNA.tagsDir/ -gid -noadj > NC_1.miRNA.homer.rpm
htseq-count -f bam -m intersection-strict miRNA_primary_transcript NC_1.miRNA_hg38_mapped.sam /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff > NC_1.miRNA.htseq.counts
featureCounts -s 1 -a /BioII/lulab_b/shared/genomes/human_hg38/gtf/miRNA.gff -o NC_1.miRNA.featureCounts.counts NC_1.miRNA.sorted.bam

Tips/Utilities

Merge multiple bams to one

# mutiple bams to one bam file, in Snakemake form

rule mergeBam:
    input:
        bam = "02.mapping/{library}/sequentialMap/{library}.hg38other.bam"
    output:
        bam = "02.mapping/{library}/sequentialMap/{library}.sequentialMap.bam"
    params:
        jobname = "{library}.mergeBam",
        dir = "02.mapping/{library}/sequentialMap/"
    shell:
        """
        samtools merge {output.bam} {input.bam} `find {params.dir} -name "*.rsem.clean.bam"`

Homework and more

  1. Visualize your mapped reads with IGV (locally) and/or UCSC Genome Browser (on line).

  2. Learn how to construct the expression matrix using HTSeq, featureCounts and homer; then compare the difference among these three methods.

  • Bioinformatics Data Skills

    • 11) Working with Alignment Data

Video

a) Expression matrix

More Reading and Practice

: 2. Construction of expression matrix

Example of batch job
@Youtube
@Bilibili
Additional Tutorial