Computational methods for analysis of spatial transcriptomics data
An exploration of the spatial gene expression landscape
Time: Fri 2022-03-18 10.00
Location: Air&Fire, Tomtebodavägen 23A, Solna
Video link: https://kth-se.zoom.us/j/61241436735
Subject area: Biotechnology
Doctoral student: Alma Andersson, Genteknologi, Lundeberg Lab
Opponent: Dr Omer Bayraktar
Supervisor: Professor Joakim Lundeberg, Genteknologi, Science for Life Laboratory, SciLifeLab
Transcriptomics techniques, whether bulk, single cell/nuclei, or spatial, have fueled a substantial expansion of our knowledge about the biological systems within and around us. In addition, the rate of innovation has accelerated over the last decade, resulting in a multitude of technological advances and new methods for generating transcriptomics data. In 2009, isolating and characterizing the transcriptome of a single cell was seen as a major achievement; ten years later, in 2019, studies surveying a hundred thousand cells were commonplace. The field of spatial transcriptomics went through an equally transformative phase: from struggling with simultaneous characterization of a few targets to seamlessly providing spatially resolved maps of the full transcriptome. Inevitably, we’re approaching an inflection point where the bottleneck is no longer the generation of data, but rather its analysis. Indeed, with standardized commercial products, high-quality spatial transcriptomics data can now be generated en masse. Hence, questions about data analysis have started to replace those of data generation. The work in this thesis seeks to address some of these emerging questions; the five articles it encompasses present new methods for analysis of spatial transcriptomics data and examples of their application. Furthermore, it contains an introduction to current experimental and computational spatial transcriptomics techniques, as well as a section about data modeling.
In Article I, a probabilistic model for integration of single cell/nuclei and spatial transcriptomics data is presented. In short, the method allows mixed signals, which are present in certain spatial transcriptomics platforms, to be decomposed into contributions from biologically relevant cell types or states derived from single cell/nuclei data. The model was implemented as a software package, stereoscope, which is open source and publicly available. The same policy of open source and high transparency holds true for all software and code associated with this thesis. The stereoscope method has been used in several studies, one example being Article II, where we examined the spatial transcriptomics landscape of HER2-positive breast cancer patients. By integrating single cell and spatial transcriptomics data, several intriguing co-localization signals emerged. These signals allowed us to identify a signature for tertiary lymphoid structures and evidence of a trifold interaction involving type I interferon signals, a T-cell subset, and a macrophage subset. The work also included other forms of explorative data analysis, such as unsupervised expression-based clustering. The clusters from this analysis, once annotated, exhibited high concordance with the tissue morphology and with annotations provided by a pathologist. Taken together, this makes a compelling case for the use of spatial transcriptomics in the age of “digital pathology.” Finally, we also derived “core signatures” from the expression-based clusters, representing common expression profiles shared across the patients.
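The decomposition idea in Article I can be illustrated with a deliberately simplified sketch. It is not the stereoscope model, which is probabilistic; here non-negative least squares is swapped in as a stand-in to show what "decomposing a mixed signal into cell type contributions" means. The function name, the toy signature matrix, and the proportions are all hypothetical.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_proportions(signatures, spot_counts):
    """Estimate cell type proportions for a single spatial observation.

    signatures  : (n_genes, n_types) array of mean expression per cell type
                  (assumed to be derived from single cell/nuclei data).
    spot_counts : (n_genes,) array of expression observed in one spot.
    """
    # Solve min ||signatures @ w - spot_counts|| subject to w >= 0.
    weights, _residual = nnls(signatures, spot_counts)
    total = weights.sum()
    if total == 0:
        return np.zeros_like(weights)
    # Rescale the non-negative weights so that they sum to one.
    return weights / total


# Toy example: four genes, two cell types, one noiseless mixed spot.
signatures = np.array([[10.0, 0.0],
                       [0.0, 8.0],
                       [5.0, 5.0],
                       [1.0, 3.0]])
true_props = np.array([0.7, 0.3])
spot = 50.0 * signatures @ true_props
print(estimate_proportions(signatures, spot))
```

A least-squares fit ignores the count nature of the data, which is one reason a probabilistic formulation is attractive in practice.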
In Article III, we present a computational method, sepal, designed to identify genes with distinct spatial patterns, often referred to as “spatially variable genes.” The method uses Fick’s second law to simulate diffusion of transcripts in the tissue, measuring the time until convergence (a spatially uniform and homogeneous state). It then ranks the genes by their “diffusion time.” The underlying assumption is that genes exhibiting strong spatial patterns will take longer to converge than genes with no pattern, which relates the diffusion time to the degree of spatial structure.
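The diffusion-time idea can be sketched on a regular grid. This is a minimal illustration, not the sepal implementation: it assumes a square grid with unit spacing, a diffusion coefficient of one, reflecting boundaries, and a simple convergence rule (maximum deviation from the mean below a threshold). The function name and all parameters are hypothetical.

```python
import numpy as np


def diffusion_time(grid, dt=0.2, eps=1e-3, max_steps=100_000):
    """Simulated time until `grid` is (nearly) spatially uniform.

    Uses an explicit finite-difference scheme for Fick's second law,
    du/dt = laplacian(u). With unit grid spacing the scheme is stable
    for dt <= 0.25.
    """
    u = np.asarray(grid, dtype=float).copy()
    target = u.mean()  # reflecting boundaries conserve the total amount
    for step in range(max_steps):
        if np.abs(u - target).max() < eps:
            return step * dt
        # Edge padding mimics a zero-flux (reflecting) boundary.
        padded = np.pad(u, 1, mode="edge")
        laplacian = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                     + padded[1:-1, :-2] + padded[1:-1, 2:] - 4.0 * u)
        u = u + dt * laplacian
    return max_steps * dt


# A gene expressed in one half of the tissue versus the same values shuffled.
rng = np.random.default_rng(0)
structured = np.zeros((20, 20))
structured[:, :10] = 1.0
shuffled = rng.permutation(structured.ravel()).reshape(20, 20)
print(diffusion_time(structured), diffusion_time(shuffled))
```

In this toy setting the structured pattern takes longer to flatten out than its shuffled counterpart, which is the property used for ranking.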
Article IV constitutes a study of the mouse liver using spatial transcriptomics. As before, we employed stereoscope for single cell integration, but realized that computational tools tailored to the specific tissue were required to address certain questions. Thus, we developed two computational methods: one devoted to vein type identity prediction, the other enabling a change of data representation. In essence, to predict the vein identities, we first assembled spatially weighted composite expression profiles from the observations neighboring each vein. Then, a logistic classifier was trained using the composite profiles. Once trained, the model could be used to assign vein type identities to ambiguous or unannotated veins. In the second method, the two-dimensional spatial data was recast into a more informative one-dimensional representation by treating gene expression as a function of an observation’s distance to its nearest vein structure.
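The change of representation described for the second method can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from Article IV: veins are represented as a set of points (for example, pixels from an annotated mask), distances are Euclidean, and expression is summarized by averaging within fixed-width distance bins. All function names and inputs are hypothetical.

```python
import numpy as np


def distance_to_nearest_vein(spot_xy, vein_xy):
    """Euclidean distance from each observation to its closest vein point.

    spot_xy : (n_spots, 2) array of observation coordinates.
    vein_xy : (n_vein_points, 2) array of vein coordinates.
    """
    diff = spot_xy[:, None, :] - vein_xy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def expression_by_distance(expression, distances, bin_width):
    """Mean expression of one gene within each occupied distance bin.

    Returns the bin centers and the mean expression per bin.
    """
    bin_index = np.floor(distances / bin_width).astype(int)
    occupied = np.unique(bin_index)
    centers = (occupied + 0.5) * bin_width
    means = np.array([expression[bin_index == b].mean() for b in occupied])
    return centers, means


# Toy example: three observations, two vein points, one gene.
spots = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
veins = np.array([[0.0, 0.0], [10.0, 1.0]])
gene = np.array([1.0, 2.0, 3.0])

dist = distance_to_nearest_vein(spots, veins)
centers, means = expression_by_distance(gene, dist, bin_width=2.0)
print(dist, centers, means)
```

Plotting the binned means against the bin centers gives a one-dimensional view of how expression changes with distance from the nearest vein.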
The final work, Article V, expands on the idea of recasting data into a more informative or helpful representation. More precisely, we present a method, eggplant, that allows the user to transfer spatial transcriptomics data from multiple sources to a common coordinate framework (CCF). Transfer of information to a CCF means spatial signals can be compared across conditions and time points, unlocking a plethora of valuable downstream analyses. For example, we perform spatiotemporal modeling of a synthetic system, and introduce the concept of “spatial arithmetics” to study local expression differences. With a growing corpus of spatial transcriptomics data and ambitious international efforts like the Human Cell Atlas, we deem this sort of method essential to leverage the data’s full potential.