Project #01

Title: Similarity Joins in MapReduce

Leader's Name: Benjamin Coors
Member2 Name: Alain Kaeslin
Member3 Name: Kristian Hunt

Related paper: Rares Vernica, Michael J. Carey, and Chen Li. 2010. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (SIGMOD '10). ACM, New York, NY, USA, 495-506. DOI=10.1145/1807167.1807222 http://doi.acm.org/10.1145/1807167.1807222

Presentation Day: May 20

Model: LE

Abstract:

A similarity join finds all pairs of records in a given data set whose similarity exceeds a given threshold. This problem occurs in a variety of applications, such as document clustering, plagiarism detection, recommender systems, and data integration. All of these applications must handle growing amounts of data, so distributed implementations are needed to scale similarity joins to large data sets. Recently, the MapReduce paradigm has received a lot of attention as a powerful framework for parallel data processing.

The aim of this project is to investigate how similarity joins can be implemented using the MapReduce framework and to provide a working prototype based on either Apache Hadoop or MongoDB. For this purpose, an example application domain will be chosen from the applications mentioned above (e.g. plagiarism detection).
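
To illustrate the idea, the following is a minimal single-process sketch of a set-similarity join written in map/reduce style. It is not the prototype and not the algorithm from the related paper, which additionally uses prefix filtering to reduce the number of candidate pairs. It assumes whitespace tokenization and Jaccard similarity; the function names (map_phase, reduce_phase, similarity_join) and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def map_phase(record_id, text):
    """Emit (token, (record_id, token_set)) once per distinct token.

    Assumption: records are tokenized by lowercasing and splitting on whitespace.
    """
    tokens = frozenset(text.lower().split())
    for token in tokens:
        yield token, (record_id, tokens)


def reduce_phase(values, threshold):
    """Compare all records that share one token; emit pairs at or above threshold."""
    ordered = sorted(values, key=lambda v: v[0])  # stable (smaller_id, larger_id) keys
    for (id_a, set_a), (id_b, set_b) in combinations(ordered, 2):
        sim = jaccard(set_a, set_b)
        if sim >= threshold:
            yield (id_a, id_b), sim


def similarity_join(records, threshold):
    """Simulate map, shuffle, and reduce in memory for a dict {record_id: text}."""
    # Shuffle step: group the mapper output by token.
    groups = defaultdict(list)
    for record_id, text in records.items():
        for token, value in map_phase(record_id, text):
            groups[token].append(value)

    # Reduce step: a pair sharing several tokens is found several times,
    # so results are collected in a dict to remove duplicates.
    results = {}
    for values in groups.values():
        for pair, sim in reduce_phase(values, threshold):
            results[pair] = sim
    return results


if __name__ == "__main__":
    sample = {
        1: "the quick brown fox",
        2: "the quick brown dog",
        3: "lorem ipsum dolor",
    }
    print(similarity_join(sample, threshold=0.5))
```

Grouping by token means that only records sharing at least one token are ever compared, but frequent tokens still produce large groups; this is the cost that filtering techniques such as the one in the related paper aim to reduce.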

sigmod10-vernica.pdf