Projektförslag

Projekt idéer

Title: Sentiment Classification in Social Media

Theme: Natural language processing, Classification

Subject: The proliferation of social media has generated much interest amongst scientists and within industry - people's opinions are freely expressed and this carries value for many people. However, social media is a challenge to study as despite the availability of data, extracting interesting information is difficult due to many reasons including: short form communication, abbreviations, slang words, lack of capitals, sarcastic humour, and poor spelling, punctuation and grammar. One type of information of value is sentiment - the emotional payload of a unit of social media. Two prevailing approaches attempt to solve the challenge of sentiment classification: machine-learning and lexicon-based methods. This project will investigate these approaches with a view to comparing their performance using datasets that contain gold standard human judgements of sentiment.

Supervisor: Richard Glassey

Title: Detecting Visual Plagiarism

Theme: Computer vision, Classification

Subject: Textual plagiarism is routinely used within universities to provide evidence that students have committed plagiarism within their reports. However, visual artefacts within a report are often discarded during analysis. Clearly, building a visual corpus demands more resources than a textual corpus, but this topic raises the concern: if a picture paints 1000 words, should we not also be equally concerned about the plagiarism potential? As reports may combine several types of picture (photograph, diagram, graph) as well as deliberate attempts to obfuscate the true source of a picture (rotation, scaling, cropping), several complementary methods may be required to arrive at a solution that balances accuracy with reasonable performance.

Supervisor: Richard Glassey

Title: Are They Paying Attention?

Theme: Computer vision

Subject: A concern for a lecturer is whether students are paying attention during a lecture; various context switches can recapture attention, and knowing when to deploy them would be a benefit. However, this can be difficult to gauge for various reasons - large class sizes, low lighting conditions and a dependence upon technology to capture data in for analysis in the first place. One potential solution is to apply face detection in combination with image capture (stills or video). Now familiar in social media and photo editing applications, this project considers if and how face detection could be used to measure 'group attention' - that is, given a crowd of size n, what proportion of n is paying attention, and how stable this is across the duration of a lecture. Ideally, this project will investigate the limits of state of the art face-detection algorithms, whilst also considering the engineering challenges to deliver such a project.

Supervisor: Richard Glassey

Title: Brain signal pattern recognition

Theme: Algorithms, Machine learning, Classification, Signals

Subject: Pattern recognition and machine learning have significantly advanced the field of biological data analysis leading in consequence to the development of effective diagnostic tools and supporting research efforts. The contribution of novel pattern recognition methods has been particularly appreciated in brain data mining as this new approach allows for exploratory search for spatio-temporal patterns in large quantities of high-dimensional nonstationary recordings of brain activity. The emerging trend is to combine machine learning techniques with brain-inspired computing algorithms to address increasingly demanding objectives of brain signal analysis in novel applications.

Below you can find a set of alternative projects (they can be treated individually or in combination).

Possible essay projects:

Develop your own approach or build upon the existing approaches to a specific brain signal pattern recognition problem, e.g.
- electroencephalographic (EEG) signal classification for a brain-computer interface (BCI),
- EEG-based epileptic seizure prediction (identifying precursors in high-dimensional brain signal recordings)
- automated sleep scoring based on physiological signals including EEG
Alternatively, select and compare a few existing state-of-the-art methods. Focus on selected aspects of a brain signal pattern recognition problem of your choice (handling signal, extracting patterns, classifying and interpreting brain signal correlates)
Discuss key challenges, emerging trends and propose future applications for brain signal recognition methodology.

Supervisor: Pawel Herman

Title: Computer-aided medical diagnostics

Theme: Artificial intelligence, Classification, Machine learning, Algorithms

Subject: Computer-aided diagnosis has been extensively validated in various medical domains, ranging from biomedical image or signal analysis to expert systems facilitating the process of decision making in clinical settings. Although the usefulness of computational approaches to medical diagnostics is beyond any doubt, there is still a lot of room for improvement to enhance the sensitivity and specificity of algorithms. The diagnostic problems are particularly challenging given the complexity as well as diversity of disease symptoms and pathological manifestations. In the computational domain, a diagnostic problem can often be formulated as a classification or inference task in the presence of multiple sources of uncertain or noisy information. This pattern recognition framework lies at the heart of medical diagnostics projects proposed here.

Below you can find a set of alternative projects (they can be treated individually or in combination).

Possible essay projects:

Define a diagnostic problem within the medical domain and examine the suitability of machine learning, connectionist (artificial network-based), statistical or soft computing methods to your problem.
Survey the state-of-the-art in computational tools supporting classification of disease symptoms and comparatively examine the diagnostic performance of some of them on a wide range of available benchmark data sets. Define a measure for diagnostic performance.
Discuss most recent trends in the field and address some of the urgent challenges for computer-assisted diagnostics in medicine.

Supervisor: Pawel Herman

Title: Brain-inspired (biomimetic) computing

Theme: Algorithms, Connectionist systems, Networks

Subject: Current computational methods, algorithms and available software turn computers into machines particularly effective in bookkeeping, solving complex but well-defined problems, searching for specific patterns etc. However, today’s computers perform rather poorly in tasks where multi-modal perception allowing to identify complex undefined patterns in large data streams is needed, or common sense reasoning and handling ambiguity are required among others at the cost of precision or speed to effectively solve a range of real-world problems and meet growing demands for robustness, flexibility, adaptation as well as computational scalability (for example, big data challenge). Brain- or neural-inspired computing is an emerging field of research that aims to design such efficient algorithms based on the principles used by the nervous system with the brain in the first place to process generic information. In the family of neural network based (connectionist) systems the focus is on the biomimetic nature of the network architectures and learning mechanisms. Some of these connectionist methods devised in the realm of brain-inspired computing achieve human-competitive performance in recognition tasks.

In this project, students are first obliged to familarise themselves with the state-of-the-art methods in the emerging field of computational methods inspired by brain’s neural networks (connectionist approach) or more general cognitive architectures. Using this background information, students will be in a position to devise their own method or build upon the existing techniques to address a selected computational problem. A range of potential applications is wide as it involves among others problems with high-dimensional data (multivariate) having complex relationships, requiring explorative search for interesting multi-dimensional patterns with potentially hierarchical structure (low-level features that serve as building blocks for high-level data representations) and a possibility to perform a classification task or make inference. Some examples of broad applications are image analysis (or generally pattern recognition), speech recognition, data mining (e.g. medical, financial, industrial), high-dimensional time-series prediction etc. In the course of the project, special attention should be paid to the scale of the developed computational algorithm, implementation challenges, modularisation and, most importantly, functionality (robustness to noisy conditions, flexibility, effective learning from environment., capability to handle unsupervised or semi-supervised learning scenarios etc.)

Supervisor: Pawel Herman

Title: Automated scheduling, e.g. university timetabling

Theme: Artificial intelligence, Machine learning, Algorithms, Optimisation

Subject: Planning is one of the key aspects of our private and professional life. Whereas planning our own daily activities is manageable, scheduling in large multi-agent systems with considerable amounts of resources to be allocated in time and space subject to multitude of constraints is a truly daunting task. In consequence, scheduling or timetabling as prime representatives of hard combinatorial problems have increasingly become addressed algorithmically with the use of computational power of today's computers. This computer-assisted practice in setting up timetables for courses, students and lecturers has also gained a lot of interest at universities around the world and still constitutes an active research topic.

In this project, students can address a scheduling problem of their own choice or they can use available university timetabling benchmark data and tailor it to the project's needs. An important aspect of such project would be to select or compare different algorithms for combinatorial optimisation, and define a multi-criterion optimisation objective. It could be an opportunity to test computational intelligence and machine learning methodology.

Supervisor: Pawel Herman

Title: Stock forecasting, financial data mining

Theme: Machine learning, Algorithms, Time-series prediction

Subject: Stock trading is one of the most common economic activities in the world. Stock prices are very dynamic and commonly undergo quick changes due to the intrinsic nature of the financial domain. From a computational perspective, intelligent trading can be formulated as a data (or more specifically, time-series) prediction problem that involves both known parameters and unknown factors. The overarching idea is to design an algorithm that provides accurate prediction thus allowing for making optimal trading decisions. In this domain, machine learning and oft computing methods have recently proven great potential.

Possible essay projects:

Compare some of the most recent approaches to financial time series prediction and validate their performance on available benchmark data sets (e.g., http://www.stockhistoricaldata.com/download).
Propose a method with your own novel component and verify its suitability for the problem of financial time series analysis on different benchmark data sets.
Concentrate on one or two of the emerging challenges in computer-based stock forecasting and address them using selected methodology. Reflect on the statistical nature of the data that reflects complex characteristics of financial markets.

Supervisor: Pawel Herman

Title: Intelligent control systems

Theme: Machine learning, Artificial intelligence, Algorithms, Control, Soft computing

Subject: There is a clear trend for smarter machines that are able to collect data, learn, recognize objects, draw conclusions and perform behaviors to emerge in our daily life. Advanced intelligent control systems affect many aspects of human activities and can be found in a wide range of industries, e.g. healthcare, automotive, rail, energy, finance, urbanization and consumer electronics among others. By adapting and emulating certain aspects of biological intelligence this new generation of control approaches makes it possible for us to address newly emerging challenges and needs, build large-scale applications and integrate systems, implement complex solutions and meet growing demand for safety, security and energy efficiency.

Possible essay projects:

Select a real-world control problem (traffic control, energy management, helicopter or ship steering, industrial plant control, financial decision support and many others) and propose a new approach using machine learning and soft computing methodology (computational intelligence) that enhances functionality, automatisation and robustness when compared to classical solutions.
Demonstrate functional (and other) benefits of “computationally intelligent” control approaches in relation to the classical methodology in a range of low-scale control problems (benchmarks). Discuss a suitable framework of comparison and potential criteria.
Consider a control robotic application with all constraints associated with autonomous agents and real-world environments (which can be emulated in software). Propose “computationally intelligent” methods to enable your robot agent prototype to robustly perform complex tasks (learn from the environment, evolve over time, find solutions to new emerging problems and adapt to new conditions among others).

Supervisor: Pawel Herman

Title: Web document classification

Theme: Pattern recognition, Machine learning, Algorithms, Statistics, Computational linguistics

Subject:

Possible essay projects:

Supervisor: Pawel Herman

Title: Visualisation of neural data

Theme: Visualisation / Simulations

Subject: Visualisation is one of the most neglected aspect of a rapidly developing field of computational biology. Only recently can we observe an emerging trend for combining neural simulation frameworks with visualisation software. Still there are a plethora of challenging problems that need to be urgently addressed (high-dimensional data, pre-processing, integration with a simulation software, demands for purely visual aspects, interactive environment) to render visualisation a practical tool in computational studies. This is envisaged to facilitate computational modelling and assist in demonstrating scientific findings.

Possible essay projects:

Visualisation of existing data produced by models (different types of high-dimensional spatiotemporal data are available).
Conceptual integration with simulating environment to help with data pre-processing (or post-procesing) and facilitate iteractive mode with the user.
Review of the state-of-the-art methodology and a motivated choice of a tool for the computational problem at hand.

Supervisor: Pawel Herman

Title: Optimisation and parameter search in computational modelling

Theme: Algorithms

Subject: Model's parameters have a decisive effect on its behaviour and dynamics. Search for parameters is at the same time the most tedious component of computational modelling. Neural simulations are no exception. On the contrary since they account for nonlinear and stochastic effects in brain data, parameters need to be carefully tuned to obtain a desirable functional and/or dynamical outcome. This optimisation procedure is commonly carried out manually on a trial-and-error basis. It is thus desirable to automatise this tedious process by providing an effective parameter search and optimisation scheme. One of key challenges to address is computational efficiency of the implemented method and the definition of a cost function based on the existing "manual" evaluation criteria. Tests in the project will be perfomed with the use of existing neural models or a low-scale simulation demo will be developed.

Possible essay projects:

Define the cost function that reflects the fundamental model evaluation criteria and propose an effective way of its calculation.
Propose a computationally effective way of evaluating the cost function (p. 1)
Review and propose a parameter search method (from the existing approaches) that match the specificity of computational modelling.

Supervisor: Pawel Herman

Title: Multi-scale brain simulations

Subject: Simulations of neural systems and the brain can be performed on different scales, e.g. we could try to simulate every single neuron as detailed as possible. We could also assume populations of neurons to be the basic computational units and neglect the dynamics of single neurons. Libraries exist to simulate neural systems at several of these scales (e.g. GENESIS, NEURON, Nest, Nexa). These can be called from C++ or Python and be run on desktop machines as well as on supercomputers.

Possible essay projects:

Implement some basic models in available simulators.
Compare the type of simulations they can perform – what is the “correct” scale to perform brain simulations?
As these simulations get closer to mimic real brains, what are the implications for medicine and computing? Ethical concerns?

Supervisor: Pawel Herman

Title: Spatial analysis of neuron populations

Theme: Neuroscience, Neural networks, Computer science

Subject: Neurons in the brain make networks by connecting to each other at the points of close appositions (touches) of their neurites forming chemical or electrical synapses. In the large-scale simulations of realistic neural networks the touch points can be directly computed from the morphologies of individual neurons. The task is to suggest efficient and scalable algorithms of touch detection in populations of morphologically detailed neurons. Digitally reconstructed neurons are available from the public repositories [1, 2].

References:

Neuromorpho.org
Computational Neurobiology and Imaging Center, http://research.mssm.edu/cnic/repository.html

Supervisor: Alexander Kozlov

Title: Smoothing sparse neuronal data

Theme: Neuroscience, Statistics, Data processing

Subject: A problem comes from the neuroscience studies. The brain tissue is not homogeneous: there are trends in locations of neurons with specific functions and, correspondingly, specific properties. The neuronal properties (e.g. background spike firing frequency, or responses to a particular type of stimulation) can be measured experimentally in individual neurons. To reveal the trends in their distribution, heatmaps can be drawn for the measured parameters [1]. Besides looking for trends with just one heatmap obtained in one situation, it is interesting to compare several heatmaps to reveal differences between different situations (e.g. a healthy subject vs. an injured subject).

The problem. There are two sets of data {f₁(x_i,y_i)}, i=1,M, {f₂(x_j,y_j)}, j=M+1,M+N (M,N≈200-500), x and y are space coordinates, f is the measured parameter. The task is

1) to draw heatmaps for f₁ and f₂; different types of spatial smoothening/filtering with different precision can be applied, it would be good to estimate how to smoothen optimally and why;

2) to compare the heatmaps estimating the statistical significance of the local differences.

The main difficulty is a low number of measurements. So an additional task would be to estimate the minimal number of measurements necessary to provide a particular level of significance (e.g., p<0.05, 0.01, 0.001) and, maybe, to recommend a number of additional measurements for the locations where the differences are suspected but not significant yet.

References:

Zelenin P. V., Hsu L.-J, Lyalka V. F., Orlovsky G.N., Deliagina T.G. Putative spinal interneurons mediating postural limb reflexes provide basis for postural control in different planes. Eur. J. Neurosci., 41: 168-181, 2015.

Supervisor: Alexander Kozlov (KTH), Pavel Zelenin (Karolinska Institute)

Title: Efficient data structure for encoding neuronal morphology

Theme: Neuroscience, Cell biology, Computer science

Subject: Neurons have complex shape determined by a cell body (soma) and neurites (dendrites and axon). Computer simulations of nervous system often use realistic models of neurons which preserve realistic geometry of individual cells. Neuron morphology has a tree-like structure and can be fully specified in a very simple format [1, 2]. The task is to find efficient ways of representing morphological data in computer memory which would allow fast random and sequential access to neurites.

References:

Cannon RC et al. (1998) An on-line archive of reconstructed hippocampal neurons. J Neurosci Methods 84(1-2):49-54.
Neuromorpho.org

Supervisor: Alexander Kozlov

Title: Automatic classification of neurons by their morphology

Theme: Neuroscience, Cell biology, Machine learning

Subject: Neurons have tree-like shape (morphology) and differ by size, topology and other geometric properties depending on their role and place in the nervous system. Each morphological type exhibits characteristic set of geometric features. The task is to divide neurons to different morphological types using automatic procedure. Test candidate algorithms for accuracy and speed. Digitally reconstructed neurons are available from the public repositories [1, 2].

References:

Neuromorpho.org
Computational Neurobiology and Imaging Center, http://research.mssm.edu/cnic/repository.html

Supervisor: Alexander Kozlov

Title: Creating the brain tissue

Theme: Neuroscience, Cell biology, Neural networks, Computer science

Subject: Neurons are tightly packed in the brain, with neuron density sometimes exceeding 100,000 neurons per cubic millimeter. This makes cell placement a non-trivial problem in the large-scale realistic simulations of neural networks. The task is to suggest and try algorithms of filling the structured volume with cells which somata (ie cell bodies only, ignoring thin multiple neurites) do not overlap. All cells have different sizes; neurons should be placed randomly with possible gradients for different cell types.

Supervisor: Alexander Kozlov

Title: Multi-objective optimization

Theme: Applied Mathematics, Decision Making, Optimization

Subject: In multi-objective optimization several goal functions are optimized simultaneously. Due to possible inherent conflicts the optimal solution does not exist and optimality is considered as a trade-off between objectives. This gives rise to interesting semi-heuristic approaches
and opens opportunities for experimentation.

Possible essay projects:

Inverse parameter estimation using multiobjective approach
Mathematical synthesis of artificial neurons
Evolutionary algorithms for solving multi-objective problems

Supervisor: Alex Kozlov

-------------------------------------------------------------------------------------------------------------------------------

Title: Structured prediction

Subject: An exciting new direction in supervised Machine Learning is to learn functions to structures, not just single values. For example we might learn mappings from sequences (or bags) of words to trees. Structured learning is particularly promising in domains such as computational linguistics and computational biology. In this thesis you will explore the techniques in this area and evaluate results of various approaches over a domain of your choosing.

Reference:

Search-based structured prediction, Hal Daumé III, John Langford and Daniel Marcu. Machine Learning (2009) 75: 297–325

Supervisor: Michael Minock

Title: Grounding Language in Perception

Quoting Jeffery Siskind:

Suppose that you were a child. And suppose that you heard the utterance 'John walked to school'. And suppose that when hearing this utterance, you saw John walk to school. And suppose, following Jackendoff (1983), that upon seeing John walk to school, your perceptual faculty could produce the expression GO(John, TO(school)) to represent that event. And further suppose that you would entertain this expression as the meaning of the utterance that you just heard. At birth, you could not have known the meanings of the words 'John', 'walked', 'to', and 'school', for such information is specific to English. Yet, in the process of learning English, you come to possess a mental lexicon that maps the words 'John', 'walked', 'to', and 'school' to representations like 'John', 'GO(x,y)', 'TO(x)', and 'school', respectively. This paper explores one way that this might be done.

In this thesis, you are to replicate (some variant of) Siskind's experiment and comment on its suitability for practical application.

Reference:

A Computational Study of Cross-Situational Techniques for Learning Word-to-Meaning Mappings,' Cognition, 61(1-2):39-91, October/November 1996.

Supervisor: Michael Minock

Title: Replication of Lambda-WASP

Subject: Baby hears word sequences coupled with baby's pre-linguistic understandings of objects and events in the world. From this baby learns knowledge that lets baby map from word sequences to meaning expressions (natural language understanding) and map from meaning expressions to word sequences (natural language generation). Raymond Mooney's group looked at this problem for natural language understanding in their 2007 paper 'Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus'. Attempt to replicate the results of this work and evaluate their claims. Time permitting, review additional work that has looked at this problem more recently. Critically discuss.

Supervisor: Michael Minock

Title: Query Containment

Subject: In a variety of settings (mobile caches, query optimizations, etc.) it is useful to be able to decide whether one query contains another. That is, over any possible state of the database, are the results of one query always a subset of the other? In this thesis you will survey query containment results for a given query language (e.g. fragments of SQL, or SPARQL, or XPATH, etc... ). You will implement containment algorithms and evaluate their effectiveness over benchmarks.

Supervisor: Michael Minock

Title: Gamification of NLI Acquisition

Subject: Several researchers have explored the possibility of using computer games to acquire knowledge to build Natural Language Interfaces (NLIs). In this thesis, review earlier attempts, build your own gamification prototype and evaluate.

Supervisor: Michael Minock

Title: Natural Language Interface technology in Computer Games

Subject: Early in the history of computer gaming there were many delightful text-based adventure games. In such games the user made their way through the game by typing in natural language commands. Have such interfaces been integrated into more modern games and, if so, how? After you survey developments, implement a small game using NLI techniques and evaluate it, or variations of it with a group of users. Are NLIs ‘fun’? What are future prospects or opportunities for NLIs in computer games?

Supervisor: Michael Minock

Title: Replication of Precise

Subject: The paper 'Towards a theory of natural language interfaces to databases' by Popescu, Etzioni and Kautz, Proc of IUI, 2003, developed a novel method to support natural language interfaces to databases. Over the graph of tokens representing the database, max-flow is computed over the tokens represented in the user's natural language query. From the resulting sub-graph an SQL expression is trivially generated. Crucially, the approach claims that it can identify 'semantically tractable' queries that can be answered with 100% confidence. The paper presented results over the GEO corpus. In this work, attempt to replicate their results and then critically evaluate their claims.

Supervisor: Michael Minock

Title: Random Query Generation
Subject: In many applications ranging from data mining to testing there is a need to generate random query expressions. That is, given a database schema and a representative sample of queries, how does one generate arbitrarily large samples of queries that are similar, yet novel? The key issue is what ‘similar’ and ‘novelty’ mean and how one can evaluate them in a meaningful way. In this work, you will survey earlier attempts at solving this problem and you will implement and compare various approaches.

Supervisor: Michael Minock

Title: Design for web-based Natural Language Interfaces

Subject: It is often claimed that natural language interfaces, if they actually worked, would be a very effective method for people to access information on the web. Still, what are some design tricks that help improve the ergonomics of natural language questions/commands typed through browsers? You will perform WoZ experiments over various implemented approaches on a domain example of your choice (e.g. sports databases, travel databases, etc).

Supervisor: Michael Minock

Title: NLIs over APIs (idea courtesy of Puru Govind)

Subject: Let us consider calling a web-based API via natural language. For example the API from expedia: http://developer.ean.com/docs/hotel-list uses the request parameters arrival date, departure date, number of adults, number of children, destination, etc.

Can we easily build (or adapt existing tools to build) an interface that can be called via natural language? (e.g. “Show me hotels in Berlin between 25 Dec and 1 Jan”,“Hotels in Berlin for five adults”, etc.) In this thesis you will explore the feasibility of this idea through implementing and evaluating an NLI over expedia or another API of your choice.

Supervisor: Michael Minock

Title: Theorem Provers and Practical First-Order Decidability

Subject: Let’s evaluate how well a modern first-order theorem prover such as SPASS does with respect to terminating (within t seconds) on various problems. In this thesis you will build a system that takes as input sigma (a fixed pure relational vocabulary) and n (the number of variables). Your program will generate (or sample) the semantically distinct CNF expressions over n variables and test for satisfiability using SPASS over all different quantifier prefixes. The output of your program will print what proportion of problems are sat, unsat, or non-terminating (after t seconds) for various quantifier prefixes. Clearly n and sigma will be small to be able to practically run these experiments. Your program will generate interesting results worthy of discussion.

Supervisor: Michael Minock

Title: Ant Colonies and Bio-inspired computing

Theme: Algorithms

Subject: An ant colony is an example of highly distributed and dynamic system where agents (ants) have to collaboratively achieve some goals such as searching for food. This project will study algorithmic problems inspired by ant colonies. See [1, 2] for an example. More generally, one can formulate and study any problem inspired by a biological system. See [3, 4] for an example of research in this direction.

Possible essay projects: Formulate a new problem inspired by ant colonies or a biological system. Experimentally or theoretically study this problem, e.g. compare different heuristics for solving this problem by running them on some data, or show that the problem can or cannot be solved in polynomial time.

References:

[1] Task Allocation in Ant Colonies, paper by Cornejo et al.

[2] Towards More Realistic ANTS, paper by Emek et al.

[3] 2nd Workshop on Biological Distributed Algorithms

[4] Bioinspired computation in combinatorial optimization: algorithms and their computational complexity, paper by Neumann and Witt

Supervisor:

Title: Numerical Algorithms for technical and scientific Computation

Keywords: Numerical mathematics, simulations, algorithms

Subject: Numerical algorithms gain an ever increasing importance together with the wider use of simulations and numerical calculations in many fields. This starts with basic research in physics and chemistry, covers life sciences or material sciences, and does not end in the product development of many engineering sciences.

The efficient implementation of numerical algorithms on modern computer systems is at the heart of software development for scientific and technical applications and an interesting and challenging task for computer scientists. Knowledge about the implementation of basic algorithms is a prerequisite for the efficient solution of large and complex calculation projects later on.

Possible essay projects

The idea of the proposed projects is to dive into the implementation of numerical algorithms. This will include the exploration of the algorithm's behaviour with convenient software tools that have been tailored for mathematical problems like Matlab or some Python-based software packages. Finally, the algorithm will be implemented, analysed and optimised for modern processors using a compiled programming language, for example C or C++. Specific topics can be selected from different application fields and can be chosen according to the interests and the previous knowledge of the student:

linear algebra, the solution of linear systems,
calculus, the evaluation of functions, numerical differentiation and integration,
the solution of differential equations,
applications of the Fast Fourier Transformation (FFT),
and many more

Supervisor: Michael Schliephake

Title: Software application for discovering ’patterns’ of events in parallel processes

Theme: simulations, statistics, machine learning, signal processing

Subject: In various fields parallel and time dependent processes consisting of the presence or absence of particular events are important. Such parallel processes can for instance be the spiking activity of individual neurons in the brain or certain activity in various internet forums, etc. Clustering of or quantification of such parallel events may be necessary to do to get crucial information from such data. If examplifying with brain data, an altered occurrences of events (spikes from single neurons) can be linked to cognitive processing or the presence of disease. If instead using the internet as an example a change in events over time could alert us that a dangerous epidemy is spreading in a part of the world.

Possible work sequences during the project:

Read up on some basic background (if relating to brain data of spiking neurons see e.g. http://www.scholarpedia.org/article/Measures_of_spike_train_synchrony; Agmon (2012) at http://www.neuralsystemsandcircuits.com/content/2/1/5
Generate or receive trains of ’events’ (i.e. data to be used as a starting point)
Choose a couple of quantitative ’measures’ on e.g. synchrony of events in parallel processes and visualize the result

Supervisor: Jeanette Hellgren-Kotaleski

Title: A software application for visualizing activity and/or changes in a biochemical network

Theme: Simulations, visualization, data analysis

Subject: Within living cells dynamic and complex interactions between proteins and other molecules take place, and the changes over time of the amounts of specific molecular species typically depend on changes in other parts of the network. To get a quantitative understanding of such systems, simulations are becoming an indispensable tool.

Possible work sequences during the project:

Create or receive a list of coupled differential equations, e.g. in the format below

dA/dt=f_a(A,B,C,....)

dB/dt=f_b(A,B,C,....)

dC/dt=f_c(A,B,C,....)

...

Read up on relevant background, e.g. biochemical reactions occurring in living cells and how to map those on the above description; see e.g. http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0420-0;

http://www.nature.com/articles/srep12569;

http://delivery.acm.org/10.1145/1360000/1358033/p802-hardy.pdf?ip=130.229.49.87&id=1358033&acc=ACTIVE%20SERVICE&key=74F7687761D7AE37%2E35CB82741CCF0CF9%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&CFID=575045198&CFTOKEN=47177207&__acm__=1452702958_c0488a3e726f622270c42f0bf3289bfe;

http://bioinformatics.oxfordjournals.org/content/24/2/209.full.pdf;

PLoS Comput Biol. 2008 Feb 29;4(2):e1000005. doi: 10.1371/journal.pcbi.1000005

3) Visualize during runtime the variables A,B,C, etc as nodes in a network;

4) Implement an interactive control of a few of the network nodes

5) use your program to look at protein amount in the simulated network during run time

Supervisor: Jeanette Hellgren-Kotaleski

Title: Cancer tumor heterogeneity and clustering.

Since a couple of years, biology labs have been generating genomic data from cancer tumors. A genome of a cancer cell typically contains many mutations, some consist of a changed nucleotide, some of a deletion or duplication of a longer region, and some of other structural changes of the genome. The mutation are often partitioned into driver mutations that affect the phenotype (the functional characteristics and behavior) and passengers mutations not affecting it.

Some of this data is bulk data in the sense that there actually are subpopulations of cells in the tumor with radically different genomes. Recently several machine learning methods have been suggested for deconvoluting such data, that is, separating the bulk data into clusters representing subpopulations. Based on existing literature and your one ideas implement one such method as well as a generator for synthetic data (similar to the real bulk data). Apply your algorithm to synthetic as well as real data.

Supervisor: Jens Lagergren

Title: Somatic evolution in cancer

Today, biology labs have been generating single cell genomic data from cancer tumors. A genome of a cancer cell typically contains many mutations, some consist of a changed nucleotide, some of a deletion or duplication of a longer region, and some of other structural changes of the genome. Based on these genomes the a phylogenetic tree representing the evolutionary history of the cells be inferred. Based on existing literature and your one ideas implement one such method as well as a generator for synthetic data (similar to the real bulk data). Apply your algorithm to synthetic as well as real data.

Supervisor: Jens Lagergren

Title: Sub-population dynamics in tumors.

There are several strongly simplified stochastic processes that capture various aspect of a a tumor grows. Two particular aspects are especially interesting. First, how different subpopulations aries and become the dominant type due to higher fitness (basically efficiency of proliferation). Second, how different models of the cell division, and their parameters, affect the tumor growth and how the obtained model behavior fits available biological data. Implement variations of the models and study their behavior and how it fits data.

Supervisor: Jens Lagergren

Title: Mining of Variability Models for Software Product Lines

Theme: Software Technology / Software Engineering

Subject description:

Software product lines (SPL) engineering is a software technology to decrease development and maintenance costs of software products that are customized for a large number of customers by managed reuse of the common software artefacts. To capture the commonalities and variabilities among the products, various variability models are used.

In practice one has often to deal with legacy software that has been customized for many customers and thus form a product line, even if not developed by following SPL engineering methods. Recovering the SPL in terms of variability models can greatly help the creation of new customized products.

The project investigates the utility of the hierarchical variability model presented in [1]. More specifically, the project should develop:

(1) a translation of sets of software products to software families represented as relations, and (2) an implementation of the algorithm from [1] to extract models from the relations. Visualization of the models would be an asset.

References:

[1] Dilian Gurov, Bjarte M. Østvold, Ina Schaefer: A Hierarchical Variability Model for Software Product Lines. ISoLA Workshops 2011: 181-199.

Supervisor: Dilian Gurov

Title: Extraction of the evolving landscape of neuroscience

Subject description:

In 2008 Lin et al. (Plos One 2008) presented a detailed analysis of the abstracts presented at the Society of Neuroscience (SFN) meetings between 2001-2006. More than 30000 posters are presented at the SFN annually. Lin et al. (2008) provided a detailed demographic and geographical distribution of neuroscientists and their research foci. Moreover, they investigated the dynamics of different research themes over time and created an authorship network of neuroscientists.

Since 2006, neuroscience has changed dramatically by the advent new experimental tools such as the optogenetics and in vivo calcium imaging and growing appreciation of mathematical models and simulations. Give this massive change it would be interesting to analyse the SFN abstracts from 2008-2015 to see how new collaborations and research themes have emerged in neuroscience. This kind of meta-analysis not only will reveal the current landscape but it can also be used to infer the emerging new collaborations that will shape the future of neuroscience.

The student will get to use methods from graph theory and natural language processing to complete the project.

References:

Lin JM, Bohland JW, Andrews P, Burns GAPC, Allen CB, Mitra PP (2008) An Analysis of the Abstracts Presented at the Annual Meetings of the Society for Neuroscience from 2001 to 2006. PLoS ONE 3(4): e2052. doi:10.1371/journal.pone.0002052[http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002052]

Supervisor: Arvind Kumar

Title: Collaboration network of KTH researchers: Testing the scope of department boundaries

Subject description:

In academic institutes traditionally researchers are pooled together into departments. Usually this distribution is based on research profile of the researchers. However, modern science is highly interdisciplinary and researcher keep changing their research focus as their questions evolve. In such an evolving landscape of science how good are the department boundaries? This can be tested by studying at the collaboration networks by estimating the within and across department collaboration.

The easiest way to find out collaborations is by looking at the co-authored papers. Fortunately at KTH all researcher submit their publications in DIVA repository. And DIVA data can be used to extract the authorship network. In this project the students will use graph theory to reveal the within and across department collaborations. Next, they will estimate how often researchers from a department publish papers in a journal associated with another department. Thus, this project will create a better landscape of scientific research at KTH which will go beyond the current department and school boundaries.

The student will get to use methods from graph theory and natural language processing to complete the project

Supervisor: Arvind Kumar

Title: Graph properties of the KTH web (1 thesis)

Theme: Algorithms/Graphs

Subject: Graphs have emerged as a powerful tool to study complex systems. The graph of a system describes an interaction between a pair of nodes as an edge, which may have a weight, and a direction. The ‘adjacency matrix’ described all the possible pair-wise integrations between the nodes. Such graphical representations can give deep insights about the robustness and dynamics properties of a system (Newman 2003, ). A typical example of a system that can be studied using graph theory is the world-wide-web. Previously, several studies (e.g. Barabasi 2007, Barabasi et al. 2003) have described the scale-free nature of the internet and its fault tolerance.

Possible essay projects:

Extract the adjacency matrix of the pages in the KTH web
Study the degree distribution and eigenvalue spectrum of the adjacency matrix
Evaluate graph descriptors such as eigenvalue centrality, k-shell index, clustering coefficients, network diameter, driver nodes, etc.

References:

Barabasi A-L (2007) The architecture of complexity. IEEE Control Systems Magazine. 27(4):33-42
Barabasi A-L, Albert R, Jeong H (2000) Scale-free characteristics of random networks: the topology of the world wide web. Physica A 281, 69-77.
Newman M (2003) The structure and function of complex networks. SIAM Reviews 55: 167-257.

Supervisor: Arvind Kumar

Title: Improving the scientific peer-review process

Theme: Algorithms / Simulations

Subject: The principle of peer review is central to the evaluation of research, by ensuring that only high-quality scientific works are funded or published. But peer review has also received criticism, as the reviewers’ time and attention are limited (and unpaid) resources. Naturally, human bias also affects the review process. In 2014, the organizers of the premier machine learning conference “Neural Information Processing Systems” conducted an experiment in which 10% of submitted manuscripts (166 items) went through the review process twice. Arbitrariness was measured as the conditional probability for an accepted submission to get rejected if examined by the second committee. This number was equal to 60%, indicating

Possible essay project:

Learn what goes into a typical peer-review, and develop a method to simulate peer-review for a scientific conference (model the submissions & reviewers, include noise)
Think about ways you can improve the accuracy peer-review process without creating (too much) extra work for the reviewers. Hint: think about the ELO rating system used in chess and other pairwise comparison methods.

References:

ELO rating system, Pairwise comparison
Slides regarding the Neural Information Processing Systems Experiment
Blog posts regarding the experiment (#1, #2, #3)

Supervisor: Kevin Smith

Title: User behavior prediction

Theme: Algorithms

Subject: How can we model users’ preferences? How can we use this data to predict their actions? By modeling user behavior using data such as their browsing history or their ratings of products/media, we can help users obtain needed information more directly. Netflix held an open competition from 2006 to 2009 for the best algorithm to predict user ratings for films based on previous ratings, and offered a $1,000,000 prize to the winners in 2009. Many other high-tech companies face similar problems, and databases are publicly available for many such problems.

Possible essay project:

Predict whether a user will like a movie based on previous ratings
Predict where a user will spend their next holiday based on their browsing history

References:

The Airbnb New User Bookings dataset
The MovieLens dataset

Supervisor: Kevin Smith

Title: Credibility analysis in Twitter feeds using domain knowledge

Theme: Algorithms

Subject: Society, particularly the younger generation has become increasingly reliant on internet-based sources such as Twitter or Facebook to receive news and updates. Often, the information passed on is unvetted from non-traditional news sources, often individuals. Traditional journalists verify the information and sources before publishing, but online platforms often allow various false information and rumor to spread. Numerous examples can be cited following recent disasters. One possible way to cope with this problem is to develop automatic methods to assess the credibility of tweets by analyzing the language and the source identity. One potential bottleneck is the availability of annotated Twitter data to train and evaluate credibility models.

Possible essay project:

Develop a method to predict the trustworthiness of sources and the credibility of their tweets. Familiarize yourself with existing research on the topic, and consider methods in natural language processing
Think about alternative methods to collect the data necessary to build an automatic credibility assessment algorithm

References:

The Apollo Project, Twitter datasets
Defining ground truth in information credibility on Twitter
Tweets 2011, a corpus of tweet identifiers

Supervisor: Kevin Smith

Title: Character recognition in natural images

Theme: Algorithms

Subject: Character recognition is a classic pattern recognition problem. For the Latin script, character recognition is largely a solved problem given certain constraints (eg. images of a scanned document). However, character recognition in natural images (photographs) pose a much more difficult problem, where characters can be much more difficult to recognized (eg. characters in neon script outside of a restaurant).

Possible essay project:

Develop a method to recognize characters found in natural images

References:

The Chars 74 K dataset
A paper on Character recognition in natural images
The Julia language
Tensorflow convolutional neural networks

Supervisor: Kevin Smith

Title: Information Theoretic Similarity Measures for Robust Image Matching.

Theme: Image Analysis, Similarity Measures, SCV, SCVD

Subject: A robust similarity measure between different regions of images plays a fundamental role in several image analysis applications such as stereo matching, motion estimation, registration and tracking. Similarity measures commonly used in these tasks, SSD or NCC for example, can at most cope with linear variations of intensity, such as global changes in gain and bias. Matching and registration techniques in general need to be robust to a wider range of transformations that can arise from non-linear illumination changes caused by anisotropic radiance distribution functions, occlusions or different acquisition processes (e.g. visible light and infrared, those employed in medical imaging). The main focus of this project is on these more challenging contexts.

Most of the existing methods for computing similarity measures across multi-modal images are based on information theoretic approaches and make use of the probability of the intensity co-occurrence, including the seminal works on mutual information (MI) which introduced the use of joint intensity distributions, recognising the statistical dependence between intensities of corresponding pixels. The increased resilience to non-linear intensity transformations however comes at the cost of a higher computational complexity than conventional sum-comparing metrics.

Below you can find a set of alternative projects (they can be treated individually or in combination).

Possible essay projects:

- Develop your own approach to robust image matching whose complexity is (close to) linear with respect to the number of elements.

- Alternativey, build your approach based on the recently propoesd approaches such as SCV(D): the sum of conditional variances (of diffrences). Those use the joint distribution of image intensities, but generates it directly as a histogram. Intuitively, SCV exploits a statistical property assuming that a group of pixels clustered by neighbouring intensities in the first image should be similarly clustered in the second, even if their mapped ranges are very different.

- Select and compare a few existing state-of-the-art methods.

Supervisor: Atsuto Maki

Title: Troll Detection

Theme: Natural language processing, clustering

Subject: Social media is today widely used for propaganda purposes. False information is intentionally spread as a means to confuse and create distrust for politicians, as well as for media. Lies can be produced faster, than the speed with which they can be debunked. A lie repeated often enough may soon be considered an established true, even after been shown to be completely false. Disinformation is often spread on channels like Twitter by so called ‘trolls’ organized in ‘farms’. Since trolls typically concentrate on one talking point at the time, to make a message as strong as possible, these trolls can quite easily be detected by the similarity in the tweets they produce. This is true for an experienced user or for a machine, but hardly for an average user that might get overwhelmed by false news without realizing that they all come from the same original source.

Possible essay project: Create a similarity measure between different Twitter profiles by measuring the similarity and frequency between tweets. Use this measure to group profiles into clusters that all seem to convey the same original message. In the long run this might lead to a service in which a user can receive a cluster of highly correlated Twitter accounts, given the name of an account from same cluster, an account they believe belongs to a troll spreading disinformation.

References:

Supervisor: Mårten Björkman

Title: Algorithms that are hard to parallelize

Theme: Algorithms, GPGPU

Subject: By exploiting the massive levels of parallelism that they allow, GPUs have in recent years been used to speed up a whole range of different algorithms and applications. Reports claim speed ups of as much as 100 times, compared to CPU implementations. However, some classes of problems, in particular those that are P-complete, are inherently difficult to implement on parallel architectures like GPUs. State-of-the-art solutions may for example rely on intricate graph methods that are sequential in nature, with few opportunities to exploit parallelism. The question is whether more brute force methods, that in terms of complexity should be less effective, are in fact preferable in practical problems, if they allow the full potential of GPUs to be exploited. Since the speed of a complete system is often determined by its slowest component, also less significant speed ups can be of great important in practice.

Possible essay project: Identify a problem (such as connected components, graph cuts or nearest neighbour search) for which classical solutions are very hard to implement on a massively parallel architecture. Suggest an alternative solution that is better suited for parallel implementation and test it on a GPU using CUDA.

References:

CUDA Zone
Vibhav Vineet and P. J. Narayanan, “CudaCuts: Fast Graph Cuts on the GPU”, 2008.
Shengren Li et al., “kANN on the GPU with Shifted Sorting”, 2012.

Supervisor: Mårten Björkman

Title: Modern image features

Theme: Computer vision

Subject: For more than a decade, SIFT (Scale Invariant Feature Transform) features have been the dominating image features used in computer vision applications that involve object recognition, tracking or stereo matching. Other features have been proposed, many of which are faster to compute, but in terms of accuracy few have been able to really compete. KAZE features seem to be an exception though, at least based on observations in recent offline benchmarks that are typically used for assessing the performance of image features. The question is whether KAZE features also work in practice, when you use them on real-world image data for which people so far have used SIFT features.

Possible essay project: Create a brute force image matching system using the OpenCV implementations of SIFT and KAZE, and test the matching performance on a realistic set of images. The dataset may be targeted for either object recognition, tracking or stereo matching. Analyze cases in which one method is preferable from the other and try to explain why they differ.

References:

OpenCV API Reference
David Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, 2003.
Pablo Alcantarilla et al., “Accelerated-KAZE Features”, 2013.

Supervisor: Mårten Björkman