Scalable Analysis of Large Datasets in Life Sciences
Time: Thu 2018-10-18, 11.15 - 12.15
Lecturer: Laeeq Ahmed, CST/EECS/KTH
Location: Room 4423, Lindstedtsvägen 5, KTH main campus
We are seeing a deluge of data in all fields of science and business, including the life sciences, due to better instrumentation and rapid advances in information technology. At the same time, cloud computing has made it cheaper to manipulate this data. Major challenges with such data are managing its size, understanding it, processing it rapidly, finding meaningful results in near real time, handling outliers, and visualization.
In this thesis, I present parallel methods to efficiently manage, process, analyse and visualize massive datasets at a rapid rate in two fields of life science, chemo-informatics and neuro-informatics, while building and utilizing various machine learning techniques in a novel way.
First, I evaluate the suitability of Spark, a parallel framework for large datasets, for large-scale parallel virtual screening, and I provide an architecture for it. As a case study, I classify molecules in the ZINC library using prebuilt SVM-based classification models. Virtual screening using SVM not only involves huge datasets, it is also computationally expensive, with a complexity that can grow to at least O(n^2). I find that Spark scales well and opens up the possibility of performing large-scale virtual screening on public cloud infrastructures.
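The data-parallel pattern behind this kind of screening can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it uses Python's standard thread pool as a stand-in for Spark, a linear decision function as a stand-in for the prebuilt SVM models, and a hypothetical (identifier, feature vector) representation of a molecule. None of these names or choices come from the thesis itself.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

# Hypothetical representation: (molecule identifier, numeric feature vector).
Molecule = Tuple[str, Sequence[float]]


def decision_value(weights: Sequence[float], bias: float,
                   features: Sequence[float]) -> float:
    """Linear decision function w.x + b (stand-in for a prebuilt SVM model)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def screen_partition(partition: List[Molecule], weights: Sequence[float],
                     bias: float) -> List[str]:
    """Return identifiers of molecules in one partition classified as hits."""
    return [mol_id for mol_id, feats in partition
            if decision_value(weights, bias, feats) > 0.0]


def partition_library(library: List[Molecule], n_parts: int) -> List[List[Molecule]]:
    """Split the library into n_parts roughly equal partitions."""
    return [library[i::n_parts] for i in range(n_parts)]


def parallel_screen(library: List[Molecule], weights: Sequence[float],
                    bias: float, n_parts: int = 4) -> List[str]:
    """Screen every partition independently, then merge the hits."""
    parts = partition_library(library, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = pool.map(lambda p: screen_partition(p, weights, bias), parts)
        hits = [hit for part in results for hit in part]
    return sorted(hits)


if __name__ == "__main__":
    library = [("m1", [1.0, 0.0]), ("m2", [0.0, 1.0]),
               ("m3", [2.0, 2.0]), ("m4", [-1.0, 0.5])]
    print(parallel_screen(library, weights=[1.0, -1.0], bias=0.5, n_parts=2))
```

Because each partition is scored independently of the others, the same structure maps naturally onto a cluster framework, where partitions would live on different machines rather than in different threads.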
Secondly, I present a method to predict seizures in long-term EEG in real time, treating the data as streams. In this work, I tackle the challenges of real-time decision-making, storing huge datasets in memory, and updating the prediction model with newly produced data at a rapid rate. The algorithm not only classifies seizures in real time, it also learns the threshold in real time. I additionally present a new feature, the "top-k amplitude measure", for distinguishing seizure from non-seizure EEG data, which also helps reduce the size of the data. This work demonstrates that EEG can be processed and analyzed as streams in real time using big data analytics technologies, providing a new way to process EEG data on huge cloud computing platforms.
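As a rough illustration of how a per-window feature and a threshold learned on the fly could fit together, consider the sketch below. The abstract does not define the top-k amplitude measure, so the definition used here (mean of the k largest absolute amplitudes in a window) is an assumption, as are the class name RunningThreshold, the mean-plus-z-standard-deviations rule, and the warm-up length. It is not the thesis algorithm.

```python
import heapq
from typing import Iterable


def top_k_amplitude(window: Iterable[float], k: int) -> float:
    """Assumed definition: mean of the k largest absolute amplitudes in a window.

    Reduces a whole window of samples to a single number.
    """
    largest = heapq.nlargest(k, (abs(x) for x in window))
    if not largest:
        raise ValueError("window is empty or k < 1")
    return sum(largest) / len(largest)


class RunningThreshold:
    """Hypothetical online threshold: flag values above mean + z * std.

    Mean and variance are updated incrementally (Welford's method), so no
    past windows need to be kept in memory.
    """

    def __init__(self, z: float = 3.0, warmup: int = 3) -> None:
        self.z = z
        self.warmup = warmup
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> bool:
        flagged = False
        if self.n >= self.warmup:
            std = (self.m2 / (self.n - 1)) ** 0.5
            flagged = value > self.mean + self.z * std
        if not flagged:
            # Learn the background level only from unflagged windows.
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        return flagged


if __name__ == "__main__":
    windows = [[1, -2, 3, -1], [2, -1, 2, 1], [1, 3, -2, 2],
               [2, -2, 1, 3], [40, -50, 45, 2]]
    detector = RunningThreshold(z=3.0, warmup=3)
    for w in windows:
        feature = top_k_amplitude(w, k=2)
        print(feature, detector.update(feature))
```

Each window is consumed once and then discarded, which is what makes the approach suitable for streams: only the running count, mean and squared-deviation sum are stored.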
Thirdly, I present a strategy that lets scientists create cloud-ready pipelines for structure-based high-throughput virtual screening (SBVS) with little coding effort. The main advantages of this technique are productivity and high throughput. In this work, I demonstrate that Spark is applicable to SBVS and is in general an appropriate solution for massively parallel pipelining. Moreover, I show that Big Data analytics is valuable for life science datasets too.
In the fourth study, I further improve our SBVS by using a model based on SVM and conformal prediction. Docking is an expensive step in virtual screening, so a novel strategy is needed to further reduce the overall screening time. I use an SVM model with conformal prediction to classify molecules and dock only those that have a better chance of being an inhibitor and thus of becoming a drug. I was able to remove 62.61% of the molecules, which the model predicted as "low-scoring", and thus obtained a speedup of 3.7 while keeping the model's results 94% correct on average.
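The filtering idea can be sketched with a small class-conditional (Mondrian) inductive conformal predictor. This is a minimal sketch under assumptions, not the thesis implementation: the nonconformity scores are taken as given (for an SVM they might be derived from the distance to the decision boundary), the labels "high" and "low", the helper names, and the significance level are all hypothetical.

```python
from typing import Dict, List, Set


def p_value(calibration_scores: List[float], score: float) -> float:
    """Conformal p-value: share of calibration nonconformity scores that are
    at least as large as the candidate's score, with the usual +1 correction."""
    at_least = sum(1 for s in calibration_scores if s >= score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(calibration: Dict[str, List[float]],
                   scores: Dict[str, float], epsilon: float) -> Set[str]:
    """Labels that cannot be rejected at significance level epsilon.

    calibration maps each label to the nonconformity scores of calibration
    molecules with that label; scores maps each label to the nonconformity
    of the new molecule when that label is assumed.
    """
    return {label for label, s in scores.items()
            if p_value(calibration[label], s) > epsilon}


def should_dock(labels: Set[str]) -> bool:
    """Skip docking only when the molecule is confidently 'low' and nothing else."""
    return labels != {"low"}


if __name__ == "__main__":
    calibration = {"high": [0.1, 0.2, 0.3, 0.4, 0.9],
                   "low": [0.1, 0.2, 0.3, 0.5, 0.8]}
    molecules = {
        "A": {"high": 0.95, "low": 0.15},
        "B": {"high": 0.15, "low": 0.90},
        "C": {"high": 0.35, "low": 0.40},
    }
    for name, scores in molecules.items():
        labels = prediction_set(calibration, scores, epsilon=0.2)
        print(name, sorted(labels), should_dock(labels))
```

Keeping molecules with an uncertain prediction set (both labels, or none) in the docking queue is a conservative design choice: only molecules the model confidently places in the low-scoring class are dropped, which trades some docking time for a lower risk of discarding a promising compound.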
In the fifth study, I build a web service, Predicting Target Profile as a Service (PTPAAS), which allows users to predict the target profile of multiple compounds against ready-made models for a list of targets for which a 3D structure is available. These target predictions can be used to predict off-target effects, for example in the early stages of drug discovery projects. The models are based on our docking strategy from the fourth study, which enabled us to build accurate models quickly. The service also enables users to dock compounds of interest once target profiles are predicted. The service was implemented in the Play 2.0 framework with Scala and deployed using OpenShift Origin.