Introduction
Goal
The aim is to : get familiar with datasets produced by Next Generation Sequencing (NGS).
Topics:
- mapping the reads contained in a FASTQ file
- undertand the concept of genomic coordinates
- becoming familiar with the SAM and BAM file formats
- visualisation in a genome browser
Datasets description
For this training, we will use various ChIP-seq datasets. ChIP-seq is a functional genomics assay to find all the regions in a genome bound by a particular protein (e.g. a transcription factor).
- Bacteria: ChIP-seq dataset produced by Myers et al [PDF] involved in the regulation of gene expression under anaerobic conditions in bacteria. We will focus on one factor: FNR.
Mapping the reads with Bowtie
Goal: Obtain the coordinates of each read on the reference genome.
1 - Choosing a mapping program
There are multiple programs to perform the mapping step. For reads produced by an Illumina machine for ChIP-seq, the currently "standard" programs are BWA and Bowtie (versions 1 and 2). We will use Bowtie version 1.0.0 for this exercice.
2 - Prepare the index file
- Open the terminal and go to the Desktop.
- Try out bowtie
bowtie
This prints the help of the program. However, this is a bit difficult to read ! If you need to know more about the program, it's easier to directly check the manual on the website (if it's up-to-date !)
- bowtie needs the reference genome to align each read on it. This genome needs to be in a specific format (=index) for bowtie to be able to use it. Several pre-built indexes are available to download on the bowtie webpage, but our genome is not there. You will need to make this index file.
- Download the genome from the NCBI . Note that we will not take the lastest version (NC_000913.3) but the previous one (NC_000913.2), because the available tools for visualization have not been updated yet to the latest version. This will not affect our results.
- Rename the downloaded file as Escherichia_coli_K_12_MG1655.fasta
- Build the index for bowtie
bowtie-build Escherichia_coli_K_12_MG1655.fasta Escherichia_coli_K_12_MG1655
3 - Mapping the experiment
- Let's see the parameters of bowtie before launching the mapping:
- Escherichia_coli_K_12_MG1655 is the name of our genome index file
- Number of mismatches for SOAP-like alignment policy (-v): to 2, which will allow two mismatches anywhere in the read, when aligning the read to the genome sequence.
- Suppress all alignments for a read if more than n reportable alignments exist (-m): to 1, which will exclude the reads that do not map uniquely to the genome.
- -q indicates the input file is in FASTQ format. SRR576933.fastq is the name of our FASTQ file.
- -3 will trim x base from the end of the read. As our last position is of low quality, we'll trim 1 base.
- -S will output the result in SAM format
bowtie Escherichia_coli_K_12_MG1655 -q SRR576933.fastq -v 2 -m 1 -3 1 -S > SRR576933.sam
- This should take few minutes as we work with a small genome. For the human genome, we would need either more time, or a dedicated server.
Analyze the result of the mapped reads:
How many reads were mapped ?
# reads processed: 3603544
# reads with at least one reported alignment: 2242431 (62.23%)
# reads that failed to align: 1258533 (34.92%)
# reads with alignments suppressed due to -m: 102580 (2.85%)
Reported 2242431 alignments to 1 output stream(s)
Type q to go back.
62% of mapped reads is good, especially as we know there were adapter sequences in 29% of the reads.
4 - Mapping the control experiment
In ChIP-seq, a control experiment is performed to assess whether the signal observed is relevant or due to experimental biaises. In this experiment, the control is called
anaerobic INPUT DNA.
- Retrieve the SRA identifier for this experiment, download the FASTQ file, analyse its quality and perform the read mapping.
Analyze the result of the mapped reads:
How many reads were mapped ?
# reads processed: 6717074
# reads with at least one reported alignment: 6393027 (95.18%)
# reads that failed to align: 107750 (1.60%)
# reads with alignments suppressed due to -m: 216297 (3.22%)
Reported 6393027 alignments to 1 output stream(s)
95% of mapped reads is very good.
At this point, you should have two SAM files, one for the experiment, one for the control. Check the size of your files, how large are they ?
Visualizing the mapped reads in IGV
Goal: View the reads in their genomic context, to help the biological interpretation of the results
1 - Choosing a genome browser
There are several options for genome browsers, divided between the local browsers (need to install the program, eg. IGV) and the online web browsers (eg. UCSC genome browser, Ensembl). We often use both types, depending on the aim and the localisation of the data.
If the data are on your computer, to prevent data transfer, it's easier to visualize the data locally (IGV). Note that if you're working on a non-model organism, the local viewer will be the only choice. If the aim is to share the results with your collaborators, view many tracks in the context of many existing annotations, then the online genome browsers are more suitable.
2 - Convert the SAM file for viewing in IGV
- We will now convert the SAM file into BAM format, which is a zipped (binary) format that has become a standard for exchanging/working with mapped reads.
samtools view -bS -o SRR576933_experiment.bam SRR576933.sam
- check the size of the files, you will notice the BAM is smaller than the SAM
ls -lh
- We will now sort the BAM file and create its index .bai file (warning, this is not the same index as above !). This index is a technical way to allow fast access to the .bam file.
samtools sort SRR576933_experiment.bam SRR576933_experiment_sorted
samtools index SRR576933_experiment_sorted.bam
3 - Viewing the reads in IGV
- Open IGV by typing igv in the terminal.
- Load the desired genome:
Load the Escherichia coli K12 MG1655 genome as reference
(from the fasta file Escherichia_coli_K_12_MG1655.fasta, downloaded to build the bowtie index file)
The loaded genome appears in the top left panel:
View the experiment mapped reads by loading the file SRR576933_experiment_sorted.bam
Add the mapped reads for the control experiment.
- To see the genes, save the file Escherichia_coli_K_12_MG1655.annotation.fixed.bed on your computer and load it into IGV.
At this point, you should have viewed your mapped reads in the genome browser IGV.