Week 33
Tue 15 Aug 13:00-15:00 Large Scale ETL Design, Optimization and Implementation Based On Spark and AWS Platform
Location: Ada room

Student:  Di Zhu
Date and Time:  13:00, Tuesday, 15th August, 2017
Place: Ada room
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Vladimir Vlassov
Title: Large Scale ETL Design, Optimization and Implementation Based On Spark and AWS Platform
Opponent: Xiaoxu Gao


Abstract

The amount of data generated by users of Internet products is increasing exponentially. Such products can produce billions of events every day, and extracting insights from them is essential for applications such as monitoring systems, fraud detection, user behavior analysis and feature verification. Technical issues emerge accordingly: the heterogeneity and sheer volume of the data, together with the varied requirements for using it along different dimensions, make it much harder to design data pipelines and to transform and persist data in a data warehouse. There are traditional ways to build ETL pipelines, from mainframes and RDBMSs to MapReduce and Hive. Yet with the emergence and popularization of the Spark framework and Amazon Web Services (AWS), this procedure can evolve into a more robust, efficient, less costly and easier-to-implement architecture for collecting data, building dimensional models and running analytics on massive data. Drawing on the setting of a car transportation company, where billions of user behavior events arrive every day, this work explores a way of building and optimizing ETL pipelines based on AWS and Spark, and compares it with the current main data pipelines in terms of efficiency, robustness, ease of maintenance and other aspects.

Tue 15 Aug 15:00-17:00 Predicting the risk of accidents for downhill skiers
Location: Ada room

Student:  Marco Dallagiacoma
Date and Time:  Tuesday August 15th, 15:00
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Amira Soliman El Hosary

Title: Predicting the risk of accidents for downhill skiers

Opponents: Marc Höffl (mhoffl@kth.se), Di Zhu (dzhu@kth.se)


Abstract

In recent years, the need for insurance coverage for downhill skiers has become increasingly important. The goal of this thesis work is to enable the development of innovative insurance services for skiers. Specifically, this project addresses the problem of estimating the probability that a skier suffers injuries while skiing.

This problem is addressed by developing and evaluating a number of machine-learning models. The models are trained on data that is commonly available to ski-resorts, namely the history of accesses to ski-lifts, reports of accidents collected by ski-patrols, and weather-related information retrieved from publicly accessible weather stations. Both personal information about skiers and environmental variables are considered to estimate the risk. Additionally, an auxiliary model is developed to estimate the condition of the snow in a ski-resort from past weather data. A number of techniques to deal with the problems related to this task, such as class imbalance and the calibration of probabilities, are evaluated and compared.

The main contribution of this project is the implementation of machine learning models to predict the probability of accidents for downhill skiers. The obtained models achieve a satisfactory performance at estimating the risk of accidents for skiers, provided that the needed historical data for the target ski-resorts is available. The biggest limitation encountered by this study is related to the relatively low volume and quality of available data, which suggests that there are opportunities for further enhancements if additional (and especially better) data is collected. 

Week 35
Mon 28 Aug 09:00-10:30 Master Thesis Defense on "Dataset versioning in Hops Filesystem" (Monday, Aug 28, 9:00am)
Location: Ada Room

Student:  Braulio Grana Gutiérrez
Date and Time:  09:00am, Monday, 28th August 2017
Place: Ada room, 4th floor
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Jim Dowling
Title: Dataset versioning in Hops Filesystem
Opponent: Adrián Ramírez del Río


Abstract
As awareness of the potential of Big Data grows, more and more companies
are creating their own Data Science divisions, and their projects are
becoming large and complex, handled by big multidisciplinary teams. Furthermore,
with the expansion of fields such as Deep Learning, Data Science is becoming a
very popular research field in both companies and universities.

In this context it becomes crucial for Data Scientists to be able to reproduce
their experiments in a reliable way. This Master Thesis project presents the
design of a snapshotting system for the distributed file system HopsFS, which is
based on Apache HDFS and developed at the Swedish Institute of Computer Science
(SICS), along with comments and discussion on the implementation of said system.

The contributions of this project are not only building the mentioned
snapshotting system for HopsFS but also improving on previous solutions designed
for both HopsFS and HDFS by solving problems such as the incomplete block
problem, as well as adding new uses to the system, such as automatic
snapshots that allow users to undo the last few changes to a file.

Mon 28 Aug 10:30-12:00 Master Thesis Defense on "Churn Analysis in a Music Streaming Service: Predicting and understanding retention" (Monday, Aug 28, 10:30am)
Location: Ada Room

Student:  Guilherme Dinis Chaliane Junior
Date and Time:  10:30am, Monday, 28th August 2017
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Vladimir Vlassov
Title: "Churn Analysis in a Music Streaming Service: Predicting and understanding retention"
Opponents: Philipp Eisen, Ignacio Amaya


Abstract
Churn analysis can be understood as the problem of predicting and understanding the abandonment of a product or service. Industries ranging from entertainment to financial investment and cloud providers use digital platforms through which users access their product offerings. Usage often leaves behavioural trails behind. These trails can then be mined to understand users better, to improve the product or service, and to predict churn. In this thesis, we perform churn analysis on a real-life data set from a music streaming service, Spotify AB, with different signals, ranging from activity to financial, temporal, and performance indicators. We compare logistic regression, random forests, and neural networks for the task of churn prediction, and additionally propose and evaluate a fourth approach that combines random forests with neural networks. Then, to come up with rules that are understandable to decision makers, a meta-heuristic technique is applied to the data set to extract Association Rules that describe quantified relationships between predictors and churn. We relate these findings to patterns observed in aggregate-level data, finding probable explanations for how specific product features and user behaviours lead to churn or activation. For churn prediction, we found that all three non-linear methods performed better than logistic regression, suggesting the limitation of linear models for our use case, and our proposed enhanced random forest model performed slightly better than the conventional random forest.

Mon 28 Aug 13:00-14:30 Master Thesis Defense on "Fraud detection in online payments using Spark ML" (Monday, Aug 28, 1pm)
Location: Ada

Student:  Ignacio Amaya
Date and Time:  1:00pm, Monday, 28th August 2017
Place: Ada room, 4th floor
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Vladimir Vlassov
Title: Fraud detection in online payments using Spark ML
Opponent: Adrián Ramírez

Abstract:
Fraudulent online payments cause large losses, so companies build fraud detection systems to prevent them.
In this thesis we study how machine learning can improve those systems.
Previous academic work has failed to address fraud detection in real-world datasets using distributed computing frameworks, which are needed because of the large data volumes involved.
To fill this gap, we have used real-world payment data to build a fraud detection classifier on Spark ML. Class imbalance and non-stationarity reduced the performance of our models, so experiments to tackle those problems have been performed.
Our best results are obtained by combining undersampling and oversampling on the training data. Keeping only the newest data and ensembling several models with different majority class instances also improve the predictions.
A final model has been deployed at Qliro, an important online payments provider in the Nordics, enhancing their fraud detection system and helping investigators catch frauds that were being missed before.
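
The combination of undersampling and oversampling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis implementation (which runs on Spark ML): the function name `rebalance`, the parameter `target_per_class`, and the toy payment records are hypothetical, and plain random resampling is assumed because the abstract does not name the exact technique.

```python
import random


def rebalance(majority, minority, target_per_class, seed=0):
    """Move both classes toward target_per_class examples.

    Assumption: plain random resampling. The majority class is cut down
    to at most target_per_class; the minority class is grown up to
    target_per_class if it is smaller than that.
    """
    rng = random.Random(seed)
    # Undersampling: draw without replacement from the majority class.
    kept_majority = rng.sample(majority, min(target_per_class, len(majority)))
    # Oversampling: keep every minority example, then add random
    # duplicates (drawn with replacement) until the target is reached.
    extra = max(0, target_per_class - len(minority))
    grown_minority = list(minority) + [rng.choice(minority) for _ in range(extra)]
    return kept_majority, grown_minority


# Toy data (hypothetical): 1000 legitimate payments, 20 fraudulent ones.
legit = [("legit", i) for i in range(1000)]
fraud = [("fraud", i) for i in range(20)]

kept_legit, grown_fraud = rebalance(legit, fraud, target_per_class=100)
print(len(kept_legit), len(grown_fraud))  # 100 100
```

Calling `rebalance` several times with different seeds yields training sets that share the minority examples but contain different majority class instances, which is one way to read the ensembling idea in the abstract.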

Mon 28 Aug 14:30-16:00 Master Thesis Defense on "Decentralized Diffusion-Controlled Algorithm for Community Detection" (Monday, Aug 28, 2:30pm)
Location: Ada Room

Student: Adrian Ramirez

Date and Time: 14:30, Monday, 28th August 2017

Place: Ada Room, 4th floor

Examiner: Šarūnas Girdzijauskas

Supervisor: Amira Soliman

Title: Decentralized Diffusion-Controlled Algorithm for Community Detection

Opponents: Braulio Grana, Marco Dallagiacoma
Abstract
Community detection in graphs has been an important research topic in many fields. The aim of community detection is to extract from graphs those groups of nodes that have more connections among themselves than with the rest of the network. Extracting such groups at different scales can help in understanding the global behaviour of the system. However, recent studies have shown that real-world graphs follow power-law distributions for degree and community sizes. Specifically, these graphs present many small communities but just a few large ones. This unbalanced community size distribution poses a great challenge for community detection algorithms. Most of the existing methods are based on global approaches that require information about the network to be processed as a whole. Thus, those techniques cannot be applied when the graph is too big to fit into a single machine, or in a distributed setting where the graph is partitioned among multiple machines. To address these limitations, a completely decentralized community detection algorithm is presented. It is based on diffusion, following a vertex-centric approach that lets each node decide the diffusion rates based on local information. It also adds a mechanism for controlling the diffusion speed through a customizable function. We evaluate the algorithm with a variety of graphs with different levels of imbalance and community structures. Our algorithm is able to detect the communities (almost) perfectly when the imbalance between community sizes is not extreme. We also show how the sizes of the detected communities can be controlled by the diffusion strategy, allowing for better detection of finer or coarser resolutions in hierarchical graphs. The algorithm is also compared to two other well-known existing methods, achieving similar results in most cases, though with a higher computation time.

Week 36
Wed 6 Sep 09:00-10:30 Master thesis "Building Evolutionary Clustering Algorithms on Spark"
Location: Ada room

Student:  Xinye Fu
Date and Time:  9:00 am, Wednesday, 6th September 2017
Examiner:  Šarūnas Girdzijauskas
Supervisor:  Vladimir Vlassov
Title: "Building Evolutionary Clustering Algorithms on Spark"
Opponents: Jing Li, Pietro Cannalire


Abstract
Evolutionary clustering (EC) is a family of clustering algorithms designed to handle noise in time-evolving data. By taking history into account, it can track the true drift of the clustering over time. EC tries to make the clustering result fit both the current data and the historical data or model well, so each EC algorithm defines a snapshot cost (SC) and a temporal cost (TC) to reflect these two requirements. EC algorithms minimize SC and TC using different methods, and they differ in their ability to deal with a changing number of clusters, the addition and deletion of nodes, etc.

To date, more than 10 EC algorithms have been proposed, but no survey of them exists. Therefore, the thesis includes a survey of EC. The survey first introduces the application scenarios of EC, the definition of EC, and the history of EC algorithms. Then the two categories of EC algorithms, model-level algorithms and data-level algorithms, are introduced one by one, and the algorithms are compared with each other. Finally, a prediction of the algorithms' performance is given: algorithms that optimize the whole problem (i.e., optimize the change parameter or do not use a change parameter as a control) and accept a change in the number of clusters perform best in theory.

EC algorithms typically process large datasets and involve many iterative, data-intensive computations, so they are well suited to implementation on Spark. Until now, there has been no implementation of an EC algorithm on Spark. Hence, four EC algorithms are implemented on Spark in this project. The thesis presents three aspects of the implementation. First, algorithms that parallelize well and have wide application are selected for implementation. Second, the program design details of each algorithm are described. Finally, the implementations are verified through correctness and efficiency experiments.
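
The trade-off between snapshot cost (SC) and temporal cost (TC) described in the abstract above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not one of the four algorithms implemented in the thesis: it assumes a centroid-based model, measures SC as the squared distance of points to their nearest centroid and TC as the squared distance of each centroid to the nearest previous centroid, and combines them with a hypothetical weight `change_weight`.

```python
import math


def nearest_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def snapshot_cost(points, centroids):
    """SC (assumed form): how poorly the centroids fit the current data."""
    return sum(nearest_distance(p, centroids) ** 2 for p in points)


def temporal_cost(centroids, previous_centroids):
    """TC (assumed form): how far the model moved since the last time step."""
    return sum(nearest_distance(c, previous_centroids) ** 2 for c in centroids)


def total_cost(points, centroids, previous_centroids, change_weight):
    """Weighted sum of SC and TC; change_weight is a hypothetical knob."""
    return (snapshot_cost(points, centroids)
            + change_weight * temporal_cost(centroids, previous_centroids))


# Current data, the previous model, and a candidate model that fits the
# current data better but differs from the previous one.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
previous = [(0.0, 0.0), (10.0, 0.0)]
candidate = [(0.0, 1.0), (10.0, 1.0)]

print(total_cost(points, candidate, previous, change_weight=0.5))  # 5.0
print(total_cost(points, previous, previous, change_weight=0.5))   # 8.0
```

With a small weight the candidate model wins because it fits the current data better; raising `change_weight` to 3.0 makes the candidate cost 10.0, so keeping the previous model (cost 8.0) becomes preferable.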

Week 41
Thu 12 Oct 09:00-10:30 Master Thesis Defense on "Exploring consensus-mediating arguments in online debates" (Thu, Oct 12, 9:00am)
Location: Knuth room, SICS, Electrum building 6th floor

Student:  Andreas Kaas Johansen

Date and Time:  09:00am, Thursday, 12th October 2017

Place: Knuth room, SICS, Electrum building 6th floor

Examiner:  Šarūnas Girdzijauskas

Academic Supervisor: Vladimir Vlassov

Industrial Supervisor: Magnus Sahlgren

Title: Exploring consensus-mediating arguments in online debates

Abstract
This work presents a first venture into the search for features that define the rhetorical strategy known as Rogerian rhetoric. Rogerian rhetoric is a conflict-solving rhetorical strategy intended to find common ground instead of polarizing debates further by presenting strong arguments and counterarguments, as is often done in debates.
The goal of the thesis is to lay the groundwork, in the form of a feature exploration and an evaluation of machine learning in this domain, for others tempted to model consensus-mediating arguments.
In order to evaluate different sets of features, statistical testing is applied to test whether the distribution of certain features differs between consensus-mediating comments and non-consensus-mediating comments. Machine learning in this domain is evaluated using support vector machines and different feature sets.
The results show that on this data the consensus-mediating comments do have some characteristics that differ from other comments, some of which may generalize across debates. Next, as consensus-mediating arguments proved to be rare, these comments are a minority class, and overfitting needs to be addressed in order to classify them using machine learning techniques; the results suggest that the strategy applied to deal with overfitting is highly important. Finally, the feature “polarity” is suggested, and the evaluation shows that the hand-annotated comments, as well as false positives found by machine learning models in other debates, have significantly lower “polarity” than non-consensus-mediating comments.
Due to the bias inherent in the hand-annotated dataset, the results should be considered provisional; more studies using debates from more domains, with either expert or crowdsourced annotations, are necessary to take the research further and produce results that generalize well.
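
The abstract above does not say which statistical test is used to compare feature distributions between the two groups of comments. As one possible illustration, the sketch below uses a two-sided permutation test on the difference in means; the function name `permutation_p_value` and the toy "polarity" scores are hypothetical and not taken from the thesis.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, rounds=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if group labels were assigned at random.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(rounds):
        # Reassign labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


# Toy feature values (hypothetical "polarity" scores per comment).
consensus_mediating = [0.10, 0.20, 0.15, 0.10, 0.20, 0.12, 0.18, 0.11]
other_comments = [0.60, 0.70, 0.65, 0.80, 0.75, 0.62, 0.68, 0.71]

p_value = permutation_p_value(consensus_mediating, other_comments)
print(p_value < 0.05)
```

A permutation test makes no assumption about the shape of the feature distribution, which can be convenient when the minority class is small, as it is for consensus-mediating comments.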