My summer internship at the MRC Stem Cell Institute – Cambridge University was a dream come true. I was able to spend 2 months of my summer vacation, and a bit more, in Dr. Srinjan Basu’s lab working on chromatin dynamics and developing methods to study it.
My journey and the decision to actually go to Cambridge is a story in itself, as the discussions on Brexit were going on and nothing was finalized apart from the fact that my PI Srinjan and I were looking forward to me working there over the summer. Fortunately, everything turned out well (a tad late, maybe): I finally got my UK visa on the afternoon of the 10th and literally booked the next flight out of Stockholm. That means I was in Cambridge within 18 hrs of having any confirmation at all that I would actually be able to go.
The MRC Stem Cell Institute is located on the Addenbrooke’s campus, which includes the NHS’s Addenbrooke’s Hospital and is a hub for biology research. The MRC-LMB (Laboratory of Molecular Biology), Cancer Research UK and big companies like AstraZeneca are also situated there.
This enables a highly collaborative environment among the leaders in various fields, which is evident from the type of research conducted and the number of Nobel laureates at these institutes.
The final course of the Stockholm University semester is on Comparative Genomics. The course is bioinformatics oriented and focuses on the use of computational tools and writing small scripts to perform computational analysis of genetic information available on various resources. The course is conducted by Erik Sonnhammer.
The course has 2 components:
1) Theory classes
The theory classes focus on basic concepts of genome organisation, gene prediction, phylogenetics, orthology methods, etc. The lectures are scheduled on Monday afternoons, and each one covers one main concept. Before each lecture there are assigned readings and a quiz based on them. The lecture starts with a discussion of the quiz, followed by in-depth discussions during the rest of the lecture. The lectures act as an introduction to the corresponding lab assignments for that week.
2) Bioinformatics lab
The lab is the main component of the course. All the students are divided into groups of 4-5 and each group is given a starting data set, a set of genomes. The first lab focuses on exploring the databases and tools to find which organisms the genomes belong to. Each week’s lab sessions build on the previous week’s lab. For example, a part of the 2nd week’s lab focuses on predicting Open Reading Frames (ORFs) and the amino acid sequence for each ORF using various tools.
As you may know, the genomes of Eukaryotes and Prokaryotes differ significantly in terms of genome organisation, so the same algorithm cannot be used to predict ORFs in both. This means that, to know which tool to apply to which genome, we needed to know the type of organism from the previous week’s lab.
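To give a feel for what ORF prediction means in its simplest form, here is a minimal sketch in Python. It is not one of the tools used in the course: the function name `find_orfs` and the `min_codons` threshold are made up for illustration, and the sketch only scans the forward strand for an ATG followed by an in-frame stop codon. That is roughly a prokaryote-style assumption; it ignores introns entirely, which is one reason the same approach does not carry over to eukaryotic genomes.

```python
# Naive ORF finder: forward strand only, three reading frames.
# Assumption: an ORF is an ATG followed by the first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the start and stop codons as well.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence codon by codon in this frame.
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` finds a single ORF in frame 2 that spans ATG AAA TTT TAG.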
We had 7 such weeks of lab work. The labs were very stimulating for self-learning, as we had to write many scripts in Python to get the correct input format for different tools, to perform certain types of custom analysis, etc. Interestingly, we discovered in the lab that many of the tools used in the field were developed in Erik Sonnhammer’s research lab itself, which inspired us as we were learning from a leader in the field himself! It also meant that the course was intensive, and we ended up spending whole days in the lab trying to figure out the code. Of course, we had help from the two TAs, but it was limited to general guidance and did not extend to writing the scripts themselves. However, I believe that was necessary for us to gain some confidence in writing Python scripts independently!
The last week of the course was focused on a small project. All the groups were given 3 tasks: to analyse the genome for basic characteristics such as GC content, to perform a phylogenetic analysis and, the most challenging one, to write our own ORF predictor tool!!
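GC content, the first of these tasks, is simple enough to show in a few lines. The following is only a sketch with a made-up function name (`gc_content`); it assumes the genome is available as a plain string and counts only A, C, G and T, so ambiguous bases such as N are left out of the total.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the A, C, G and T bases."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    total = gc + sequence.count("A") + sequence.count("T")
    # An empty or fully ambiguous sequence has no defined GC content;
    # returning 0.0 here is a choice made for this sketch.
    return gc / total if total else 0.0
```

For instance, `gc_content("ATGC")` gives 0.5.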
As the last step we presented our findings about our genome in a talk and compared the sensitivity and specificity of our ORF predictor with the professional tools. Surprisingly, many of the tools the groups designed had a sensitivity or specificity close to that of the available tools in the field, even though we had considered only the basic criteria.
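How such a comparison can be computed is sketched below. This assumes the textbook definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), evaluated per nucleotide; the exact definitions used in the course may have differed (in the gene prediction literature "specificity" is sometimes used for TP / (TP + FP) instead). The function name and its inputs are hypothetical.

```python
def sensitivity_specificity(predicted, reference, genome_length):
    """Compare predicted and reference coding positions.

    predicted and reference are sets of 0-based nucleotide positions
    labelled as coding; genome_length is the total number of positions.
    """
    tp = len(predicted & reference)        # coding in both
    fp = len(predicted - reference)        # predicted only
    fn = len(reference - predicted)        # missed by the prediction
    tn = genome_length - tp - fp - fn      # non-coding in both
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

With a toy genome of 12 positions, a prediction covering positions 0 to 5 and a reference covering positions 3 to 8 share three positions, so both values come out as 0.5.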
In general, the course is a very elegant way to learn python scripting while applying it to relevant biological questions.
Stay tuned as next we go to our third semester at KTH!
Stockholm University follows a pattern of single full-time courses, as opposed to Karolinska and KTH. However, this course in our master’s programme is an exception.
Applied Programming for Life Sciences is the continuation of the Python programming course from the Karolinska Institute. The course instructor, Lars Arvestad, conducts the classes, which are distributed throughout the semester.
It is a very small course of 1.5 credits, with a focus on three main topics: structuring and modularising Python scripts, classes and objects, and writing a command line tool.
Each concept is covered in an individual class at the Albanova campus of SU. Each class is 2 hours long and is followed by a build-up exercise in which the new concepts are implemented on top of the previous exercise.
Another important aspect of this class was the instructor’s emphasis on writing code in a standard format. The reason is that if we return to the code after a year, we should still be able to understand what it does and how it does it. This practice has helped me a lot in other courses, where I wrote code with comments and docstrings, which helped not only me understand my code later but also my teammates who were doing the project with me.
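The three topics fit together naturally in one small script. The sketch below is not an exercise from the course; the class `SequenceRecord`, the functions and the two command line arguments are all invented for illustration. It shows a class, functions with docstrings, and a command line interface built with the standard `argparse` module.

```python
import argparse


class SequenceRecord:
    """A named DNA sequence (hypothetical example class)."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence.upper()

    def length(self):
        """Return the number of bases in the sequence."""
        return len(self.sequence)


def build_parser():
    """Create the command line parser for this example tool."""
    parser = argparse.ArgumentParser(
        description="Report the length of a DNA sequence.")
    parser.add_argument("name", help="label for the sequence")
    parser.add_argument("sequence", help="DNA sequence as plain text")
    return parser


def main(argv=None):
    """Parse the arguments, build a record and print a report line."""
    args = build_parser().parse_args(argv)
    record = SequenceRecord(args.name, args.sequence)
    report = f"{record.name}\t{record.length()}"
    print(report)
    return report

# In a real script the file would end with the usual guard:
#     if __name__ == "__main__":
#         main()
```

Keeping the parsing in `build_parser` and the logic in the class means each part can be read, tested and reused separately, for example by calling `main(["seq1", "atgc"])` from another script.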
The data from Illumina sequencing was then transferred to the Uppmax server. Each of us made an Uppmax account and used it to access this data on the server.
In order to get an idea of the data analysis pipeline, on the first day of the exercise, we were given sample data to practice the use of various tools and get used to the command line usage.
On the second day, we got our own sample data from the Illumina sequencing run. The Illumina pipeline converts the image files into base calls and provides them in the form of FASTQ files. We then followed the general flow of the data analysis from the practice exercise and the tutorial provided.
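To make the FASTQ format a bit more concrete: each read takes four lines (an identifier starting with @, the bases, a + separator line, and one quality character per base). The sketch below reads such records and computes a mean quality per read. It assumes the common Phred+33 encoding and uses made-up function names; it is not what FastQC does internally, only an illustration of the kind of information a quality check looks at.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from FASTQ text lines.

    Assumes exactly four lines per read, with no wrapped sequences.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality  # drop the leading "@"


def mean_phred(quality, offset=33):
    """Return the mean Phred score of a quality string (Phred+33 assumed)."""
    scores = [ord(char) - offset for char in quality]
    return sum(scores) / len(scores) if scores else 0.0
```

For a read with the quality string `IIII`, every base has a Phred score of 40, so `mean_phred("IIII")` is 40.0.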
We started by checking the quality of the data using the FastQC tool available on Uppmax. We then used the BWA tool to map the reads to the reference genome. This step generates output in the human-readable ‘SAM’ format.
The SAM file was then converted to ‘BAM’, i.e. the binary format, using Picard tools. This reduces the file size and increases the processing efficiency. Further, most of the tools used downstream are compatible with the binary format of input files.
The BAM files were then used for variant calling with the GATK tool. GATK looks for differences between the nucleotides present in the mapped reads and the reference. These were stored in a file format called ‘VCF’ (Variant Call Format). GATK was instructed to look for SNPs as well as Indels (Insertions and Deletions) using different commands.
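The difference between a SNP and an Indel is easy to see in the VCF file itself, because each data line carries the reference allele (REF) and the alternative allele(s) (ALT) in its fourth and fifth tab-separated columns. The following sketch, with hypothetical function names, classifies a variant from those two columns. It is a simplification: header lines starting with # have to be skipped before calling it, and symbolic alleles are just reported as "other".

```python
def parse_vcf_line(line):
    """Return (chrom, pos, ref, alt_list) from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    return chrom, int(pos), ref, alt.split(",")


def classify_variant(ref, alt):
    """Classify one REF/ALT pair as SNP, insertion, deletion or other."""
    if alt.startswith("<") or alt == "*":
        return "other"          # symbolic or spanning-deletion allele
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"              # e.g. multi-base substitutions
```

The coordinates and alleles in any example passed to these functions are of course made up; a real line would come from the VCF file produced by the variant caller.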
Once the first round of variant calling was done, the reads were recalibrated and another round of variant calling was performed. The recalibration improves the alignment near the regions of Indels, so that variants are detected more efficiently and accurately.
Finally, this file was used to visualise the regions having Indels in a graphical manner, using the IGV tool. This tool enabled us to navigate to a specific gene (LYS2, LYS5 and LYS14 in our case) and look for the mutations, i.e. variants.
We then used the loci information from the reference to look for information in the SNP databases for yeast and other similar resources.
Finally, on comparing the information obtained from the IGV analysis and the complementation test, we were able to confirm that the mutation was in the LYS5 gene (for our group, named Cocolocos).
Today I will talk about the wet lab aspect of the Practicals of the methods course.
2. Lab Practicals
The main aim of the lab exercise was to generate mutant strains of Saccharomyces cerevisiae and look for the mutations in genes involved in the lysine production pathway using Illumina sequencing.
1. Experimentation and Data generation – Wet lab
a. Yeast Culturing
We started by culturing yeast on plates from the culture flasks, using different ODs to get good growth that could be used for further experimentation. Saccharomyces cerevisiae, commonly known as yeast, takes about 2 days to grow well. Thus, after streaking on YAPD (rich media), we let the cells grow for about 2 days.
b. Clonal Purification
The yeast cells were then clonally purified on nitrogen-poor media that contained high levels of α-Aminoadipate and lysine. α-Aminoadipate is an intermediate of the lysine production pathway. However, a downstream product synthesised from this intermediate has been shown to inhibit growth on nitrogen-poor media. Thus, if cells do not have functional versions of the 3 important lysine genes (LYS2, LYS5, LYS14), the conversion of α-Aminoadipate to the toxic intermediate does not take place, allowing the cells to grow.
Thus, we clonally select cells that have mutations in these lysine genes, i.e. cells that are auxotrophic for lysine.
c. Validation of auxotrophy
To verify that the selected cells actually have mutations in the lysine genes, they are replica plated on lysine-free media, where no growth is expected, as opposed to lysine-rich media.
d. Complementation test
The plate containing the auxotrophic mutants was crossed with the tester strains (strains known to have mutations in the LYS2, LYS5 and LYS14 genes) by replica plating. This test is used to determine experimentally the specific gene in which each mutant carries its mutation. This observation gives preliminary information on whether it is worth sequencing the sample and where to expect the mutation while analysing the sequencing data. Good samples were selected and used to perform Illumina sequencing.
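The reasoning behind reading a complementation test can be written down as a tiny piece of logic. The sketch below assumes the usual interpretation: the mutations are recessive, the crossed cells are scored for growth on lysine-free media, and a cross that still fails to grow means the mutant and the tester strain are defective in the same gene. The function name and the growth values in the usage note are hypothetical, not our actual plate readings.

```python
def infer_mutated_genes(growth_by_tester):
    """Return the tester genes that the mutant fails to complement.

    growth_by_tester maps a tester gene name to True if the cross
    grows without lysine (complementation) and False if it does not.
    """
    return sorted(gene for gene, grows in growth_by_tester.items()
                  if not grows)
```

For example, a mutant whose crosses grow with the LYS2 and LYS14 testers but not with the LYS5 tester would be called as `["LYS5"]`, which tells you where to look first in the sequencing data.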