TD L3 NGS-2

Introduction

Goal
The aim is to : get familiar with datasets produced by Next Generation Sequencing (NGS).

Topics:

mapping the reads contained in a FASTQ file
undertand the concept of genomic coordinates
becoming familiar with the SAM and BAM file formats
visualisation in a genome browser

Datasets description

For this training, we will use various ChIP-seq datasets. ChIP-seq is a functional genomics assay to find all the regions in a genome bound by a particular protein (e.g. a transcription factor).

Bacteria: ChIP-seq dataset produced by Myers et al [PDF] involved in the regulation of gene expression under anaerobic conditions in bacteria. We will focus on one factor: FNR.

Mapping the reads with Bowtie

Goal: Obtain the coordinates of each read on the reference genome.

1 - Choosing a mapping program

There are multiple programs to perform the mapping step. For reads produced by an Illumina machine for ChIP-seq, the currently "standard" programs are BWA and Bowtie (versions 1 and 2). We will use Bowtie version 1.0.0 for this exercice.

2 - Prepare the index file

Open the terminal and go to the Desktop.
Try out bowtie

bowtie

manual on the website (if it's up-to-date !)

bowtie needs the reference genome to align each read on it. This genome needs to be in a specific format (=index) for bowtie to be able to use it. Several pre-built indexes are available to download on the bowtie webpage, but our genome is not there. You will need to make this index file.
Download the genome from the NCBI . Note that we will not take the lastest version (NC_000913.3) but the previous one (NC_000913.2), because the available tools for visualization have not been updated yet to the latest version. This will not affect our results.

Rename the downloaded file as Escherichia_coli_K_12_MG1655.fasta
Build the index for bowtie

bowtie-build Escherichia_coli_K_12_MG1655.fasta Escherichia_coli_K_12_MG1655

3 - Mapping the experiment

Let's see the parameters of bowtie before launching the mapping:

Escherichia_coli_K_12_MG1655 is the name of our genome index file
Number of mismatches for SOAP-like alignment policy (-v): to 2, which will allow two mismatches anywhere in the read, when aligning the read to the genome sequence.
Suppress all alignments for a read if more than n reportable alignments exist (-m): to 1, which will exclude the reads that do not map uniquely to the genome.
-q indicates the input file is in FASTQ format. SRR576933.fastq is the name of our FASTQ file.
-3 will trim x base from the end of the read. As our last position is of low quality, we'll trim 1 base.
-S will output the result in SAM format

bowtie Escherichia_coli_K_12_MG1655 -q SRR576933.fastq  -v 2 -m 1 -3 1 -S  > SRR576933.sam

This should take few minutes as we work with a small genome. For the human genome, we would need either more time, or a dedicated server.

Analyze the result of the mapped reads:
How many reads were mapped ?

# reads processed: 3603544
# reads with at least one reported alignment: 2242431 (62.23%)
# reads that failed to align: 1258533 (34.92%)
# reads with alignments suppressed due to -m: 102580 (2.85%)
Reported 2242431 alignments to 1 output stream(s)
Type q to go back.
62% of mapped reads is good, especially as we know there were adapter sequences in 29% of the reads.

4 - Mapping the control experiment

In ChIP-seq, a control experiment is performed to assess whether the signal observed is relevant or due to experimental biaises. In this experiment, the control is called anaerobic INPUT DNA.

Retrieve the SRA identifier for this experiment, download the FASTQ file, analyse its quality and perform the read mapping.

Analyze the result of the mapped reads:
How many reads were mapped ?

# reads processed: 6717074
# reads with at least one reported alignment: 6393027 (95.18%)
# reads that failed to align: 107750 (1.60%)
# reads with alignments suppressed due to -m: 216297 (3.22%)
Reported 6393027 alignments to 1 output stream(s)

95% of mapped reads is very good.

At this point, you should have two SAM files, one for the experiment, one for the control. Check the size of your files, how large are they ?

Visualizing the mapped reads in IGV

Goal: View the reads in their genomic context, to help the biological interpretation of the results

1 - Choosing a genome browser

There are several options for genome browsers, divided between the local browsers (need to install the program, eg. IGV) and the online web browsers (eg. UCSC genome browser, Ensembl). We often use both types, depending on the aim and the localisation of the data.
If the data are on your computer, to prevent data transfer, it's easier to visualize the data locally (IGV). Note that if you're working on a non-model organism, the local viewer will be the only choice. If the aim is to share the results with your collaborators, view many tracks in the context of many existing annotations, then the online genome browsers are more suitable.

2 - Convert the SAM file for viewing in IGV

We will now convert the SAM file into BAM format, which is a zipped (binary) format that has become a standard for exchanging/working with mapped reads.

samtools view -bS -o SRR576933_experiment.bam SRR576933.sam

check the size of the files, you will notice the BAM is smaller than the SAM

ls -lh

We will now sort the BAM file and create its index .bai file (warning, this is not the same index as above !). This index is a technical way to allow fast access to the .bam file.

samtools sort SRR576933_experiment.bam SRR576933_experiment_sorted
samtools index SRR576933_experiment_sorted.bam

3 - Viewing the reads in IGV

Open IGV by typing igv in the terminal.
Load the desired genome:

Load the Escherichia coli K12 MG1655 genome as reference
(from the fasta file Escherichia_coli_K_12_MG1655.fasta, downloaded to build the bowtie index file)

The loaded genome appears in the top left panel:

View the experiment mapped reads by loading the file SRR576933_experiment_sorted.bam

Add the mapped reads for the control experiment.

To see the genes, save the file Escherichia_coli_K_12_MG1655.annotation.fixed.bed on your computer and load it into IGV.

At this point, you should have viewed your mapped reads in the genome browser IGV.

TD L3 NGS (2)

Introduction

Mapping the reads with Bowtie

1 - Choosing a mapping program

2 - Prepare the index file

3 - Mapping the experiment

4 - Mapping the control experiment

Visualizing the mapped reads in IGV

1 - Choosing a genome browser

2 - Convert the SAM file for viewing in IGV

3 - Viewing the reads in IGV

Linux guide:

Formats:

Links: