
Unravel motifs in UTRs and introns

The image displays a sequence logo, a visualization of a motif, for the region around the translation start site of genes in E. coli. As the logo hints, the positions just before the start codon have an interesting bias towards G and A. In this project, you will study what this pattern looks like in the human genome.

Some students have found this project harder than it sounds because there is a lot of data to work with.

Questions

What are the sequence logos for the regions before and after

  • the translation start site?
  • the beginning and end of the first intron (for those genes that have at least one intron)?

Note that this project involves gathering and preparing the data, which in this case means (at least) aligning the sequences before trying to produce a sequence logo.
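As a rough illustration of what a logo encodes, the sketch below computes the information content (in bits) of each column in a stack of equal-length sequences, assuming the sequences have already been aligned on the site of interest. The function name `information_content` is hypothetical, and the calculation is the plain Shannon form without any small-sample correction that a logo tool may apply, so the numbers can differ somewhat from what a tool reports.

```python
import math
from collections import Counter


def information_content(seqs, alphabet="ACGT"):
    """Return the information content in bits for each column.

    Assumes all sequences have the same length and are aligned so that
    column i refers to the same position relative to the site.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sequences must all have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    result = []
    for col in range(width):
        counts = Counter(s[col].upper() for s in seqs)
        # Only count symbols from the alphabet (N and others are ignored).
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            result.append(0.0)
            continue
        entropy = 0.0
        for b in alphabet:
            p = counts[b] / total
            if p > 0:
                entropy -= p * math.log2(p)
        result.append(max_bits - entropy)
    return result
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four bases equally frequent gives 0 bits.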

Data

You will have to extract the data you need from, e.g., Ensembl's BioMart.
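One thing to watch when extracting coordinates is the strand. The sketch below assumes that coordinates are 1-based, given on the forward strand of the chromosome, and accompanied by a strand value of 1 or -1, so that for a gene on the negative strand the window has to be reverse-complemented to read in the gene's own direction. The helper names `reverse_complement` and `site_window` and the parameter layout are hypothetical; check the assumptions against the data you actually export.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def site_window(chrom_seq, site, strand, upstream, downstream):
    """Extract a window around a site, oriented in the gene's direction.

    chrom_seq  -- forward-strand sequence of the chromosome (or region)
    site       -- 1-based forward-strand coordinate of the first base of
                  the feature (for example the A of the start codon)
    strand     -- 1 for the forward strand, -1 for the reverse strand
    upstream   -- number of bases to include before the site
    downstream -- number of bases to include from the site onwards
    """
    if strand == 1:
        lo = site - upstream            # 1-based, inclusive
        hi = site + downstream - 1
    elif strand == -1:
        # On the reverse strand, "upstream" lies at higher coordinates.
        lo = site - downstream + 1
        hi = site + upstream
    else:
        raise ValueError("strand must be 1 or -1")
    if lo < 1 or hi > len(chrom_seq):
        raise ValueError("window extends beyond the sequence")

    window = chrom_seq[lo - 1:hi]
    return window if strand == 1 else reverse_complement(window)
```

With this orientation step, windows from both strands can be stacked in the same alignment, because position i always refers to the same offset relative to the site.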

Tools

I suggest you create the sequence logos using the online WebLogo system, but there might be other easy ways of making them.
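Logo tools such as WebLogo accept a multiple alignment in FASTA format. As a minimal sketch, assuming the site windows are already available as equal-length strings, they could be formatted like this (the function name `to_fasta` and the record naming are made up for illustration):

```python
def to_fasta(seqs, prefix="site"):
    """Format equal-length sequences as FASTA text, one record per sequence."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same length")
    records = []
    for i, s in enumerate(seqs, start=1):
        records.append(f">{prefix}_{i}\n{s.upper()}\n")
    return "".join(records)
```

The length check is there because a logo only makes sense when every sequence covers the same positions relative to the site.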

Lars Arvestad created the page on 2 November 2015

Teacher Lars Arvestad changed the permissions on 30 November 2015

The page can therefore be read by everyone and edited by teachers.
commented on 18 December 2015

We were wondering some things regarding this project. 

When we analysed our data, we got a really low bit score (~0.5) for the logo sequences we want to find (e.g. "GT"). After investigating, we found that on the positive strand all of our target sequences get a bit score of 2, whilst the negative strand seems to be random, indicating faulty retrieval of the sequences on the negative strand.

By now we have been stuck for a really long time trying to isolate the site sequences for the negative strand, but it is not working. Our main idea so far has been to isolate coordinates by taking the 3' UTR end position minus the chromosome start position of the first exon. It seems we don't fully understand how the sequences and sequence positions are provided by Ensembl. Could you give us some indication, or point us to somewhere we can look it up?

Furthermore, is it necessary to use the negative strand as well? We can see no reason why investigating only the positive strand would introduce a bias in the results. On the other hand, we guess it is bad practice to exclude data if it is available.

commented on 19 December 2015

Never mind! We finally solved it!

Teacher commented on 21 December 2015

Good!