
SU Course: Methods in Molecular Life Sciences Part – 3

2. Lab Practicals

2. Data analysis – Dry Lab

The data from Illumina sequencing was then transferred to the Uppmax server. Each of us created an Uppmax account and used it to access the data on the server.

Uppmax account accessed through the terminal to reach our sequencing data on the server

To get an idea of the data analysis pipeline, on the first day of the exercise we were given sample data to practise with the various tools and get used to the command line.

On the second day, we got our own sample data from the Illumina sequencing run. The Illumina software converts the image files from the run into sequence reads and provides them as FASTQ files. We then followed the general flow of the data analysis from the practice exercise and the tutorial provided.

This is an example of what the sequencing data looks like – FASTQ file
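To make the FASTQ layout concrete, here is a minimal Python sketch of how one four-line record could be read and its quality string decoded. The record shown is made up for illustration and is not from our run, and the function names `parse_fastq_record` and `phred_scores` are hypothetical helpers. The sketch also assumes the Phred+33 quality encoding.

```python
from statistics import mean


def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into name, sequence and quality."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


# Made-up record for illustration only.
record = ["@read_1\n", "GATTACA\n", "+\n", "IIIII#!\n"]
name, seq, qual = parse_fastq_record(record)
scores = phred_scores(qual)
print(name, seq, scores, round(mean(scores), 1))
```

Each character of the fourth line encodes the confidence of the base at the same position in the second line, and this is the information a tool such as FastQC summarises across all reads.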

We started by checking the quality of the data using the FastQC tool available on Uppmax. We then used the BWA tool to map the reads to the reference genome. This step generates output in the human-readable ‘SAM’ format.

Alignment file – SAM format. An example of what the mapped file looks like. Each column represents a different value.
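As a rough illustration of what the columns mean, the sketch below splits one SAM alignment line into its eleven mandatory fields and reads two bits of the FLAG value. The alignment line, the chromosome name and the helper name `parse_sam_line` are assumptions made up for this example. In practice a library or Samtools would normally be used instead of hand parsing.

```python
# The eleven mandatory, tab-separated columns of a SAM alignment line.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def parse_sam_line(line):
    """Turn one SAM alignment line into a dictionary keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]                          # optional fields
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)     # bit 0x4: read unmapped
    record["is_reverse"] = bool(record["FLAG"] & 0x10)     # bit 0x10: reverse strand
    return record


# Made-up alignment line for illustration only.
example = "read_1\t16\tchrVII\t1200\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["is_reverse"])
```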

This was then converted to ‘BAM’, i.e. binary, format using Picard tools. This reduces the file size and increases processing efficiency. Further, most of the tools used downstream are compatible with the binary format of input files.

An example of a BAM file. Various tools can be used to extract information from this binary file. Samtools can be used to convert it into a human-readable format.

BAM files were then used for variant calling with the GATK tool. GATK looks for differences between the nucleotides present in the mapped reads and those in the reference. These were stored in a file format called ‘VCF’, the Variant Call Format. GATK was instructed to look for SNPs as well as Indels (insertions and deletions) using different commands.

This is a VCF file. It shows the position of the variant in the reference, the reference nucleotide and the nucleotide it is mutated to.
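The difference between a SNP and an Indel can be read directly from the REF and ALT columns of a VCF line. The following is a simplified sketch, not what GATK does internally: it only compares allele lengths and ignores symbolic alleles such as `<DEL>`. The VCF lines and the function names `classify_variant` and `parse_vcf_line` are invented for illustration.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair by comparing allele lengths (simplified)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # e.g. a multi-nucleotide substitution


def parse_vcf_line(line):
    """Return (chrom, pos, ref, alt, kind) for each ALT allele on a VCF data line."""
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return [(chrom, int(pos), ref, a, classify_variant(ref, a)) for a in alt.split(",")]


# Made-up VCF lines for illustration only; header lines start with '#'.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrVII\t1200\t.\tT\tC\t50\tPASS\t.",
    "chrVII\t1350\t.\tA\tAGG\t42\tPASS\t.",
    "chrVII\t1500\t.\tCTT\tC\t37\tPASS\t.",
]
variants = [v for line in vcf_lines if not line.startswith("#") for v in parse_vcf_line(line)]
for variant in variants:
    print(variant)
```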

Once this was done, the reads were recalibrated and another round of variant calling was performed. The recalibration improves the alignment near regions with Indels, so that variants are detected more efficiently and accurately.

Finally, this file was used to visualise the regions containing Indels graphically, using the IGV tool. This tool enabled us to navigate to a specific gene (the LYS2, LYS5 and LYS14 genes in our case) and look for mutations, i.e. variants.

IGV visualisation of mutations in the entire yeast genome

We then used the loci information from the reference to look for information in the SNP databases for yeast and other similar resources.

One such mutation, observed in the majority of the reads, was a T to C mutation in the LYS5 gene.
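What "observed in the majority of the reads" means can be sketched with a small calculation: count the bases that the reads show at one position and compute the fraction supporting each allele. The base string below is invented for illustration and is not our actual read data, and `allele_fractions` is a hypothetical helper name.

```python
from collections import Counter


def allele_fractions(bases):
    """Return the fraction of reads supporting each base at one position."""
    counts = Counter(base.upper() for base in bases)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Made-up bases seen at one reference 'T' position: nine reads show C, one shows T.
observed = "CCCCCCCTCC"
fractions = allele_fractions(observed)
majority_is_c = fractions.get("C", 0.0) > 0.5
print(fractions, majority_is_c)
```

A fraction well above one half for the alternative base, as in this toy case, is the kind of pattern that shows up in IGV as a column of mismatching reads at the variant position.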

Finally, on comparing the information obtained from the IGV analysis with the complementation test, we were able to confirm that the mutation was in the LYS5 gene (for our group, named Cocolocos).