Large Scale Topic Detection using Node-Cut Partitioning on Dense Weighted Graphs
Speaker: Kambiz Ghoorchian
I am a Ph.D. candidate at the Software and Computer Systems (SCS) laboratory at KTH Royal Institute of Technology. My research is funded by the Marie Curie Initial Training Network project called iSocial. I received an M.Sc. in Software Engineering of Distributed Systems (SEDS) from KTH, Sweden (2010) and a high-level M.Sc. in Computational Linguistics from KULeuven, Belgium (2011). My current research focuses on designing algorithms for large-scale data processing, with applications in Information Retrieval (IR) and text analysis.
Topic Detection in Text (TDT) is the problem of automatically identifying the most frequent topics in a given corpus of documents. Traditional solutions, based on word-space modeling and similarity comparison, soon proved inefficient and unscalable because of their excessively large representation models. The next generation of approaches improved efficiency using Dimensionality Reduction, Statistical Analysis, and Machine Learning techniques. However, scalability is still an issue, especially given the rapid growth of publications in web documents and Online Social Networks (OSNs).
We propose an innovative algorithm for TDT, based on dimensionality reduction and graph partitioning. The main idea is to create a dense, weighted graph in which topics appear as small, dense weighted subgraphs, and then to extract those topics using graph partitioning. The graph is created using a dimensionality reduction method called Random Indexing (RI). The topics are then extracted using a vertex-cut partitioning algorithm inspired by JaBeJa-VC. We show that the proposed approach outperforms state-of-the-art solutions in both efficiency and scalability.
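To make the graph-construction step more concrete, here is a minimal sketch of textbook Random Indexing followed by a similarity-weighted word graph. It is an illustration under stated assumptions, not the construction used in the talk: the function names (`index_vector`, `random_indexing`, `build_graph`), the parameters (`dim`, `nonzeros`, `threshold`), the use of whole documents as contexts, and cosine similarity as the edge weight are all hypothetical choices. The vertex-cut partitioning step is not shown.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: `nonzeros` random positions set to +1 or -1."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(docs, dim=64, nonzeros=4, seed=0):
    """Give each document a random index vector and sum, for every word,
    the index vectors of the documents in which that word occurs."""
    rng = random.Random(seed)
    context = defaultdict(lambda: [0] * dim)
    for doc in docs:
        doc_vec = index_vector(dim, nonzeros, rng)
        for word in set(doc.lower().split()):
            ctx = context[word]
            for i, value in enumerate(doc_vec):
                ctx[i] += value
    return dict(context)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_graph(context, threshold=0.0):
    """Weighted word graph (an assumed construction): one edge per word pair
    whose context vectors have cosine similarity above `threshold`."""
    edges = {}
    for u, v in combinations(sorted(context), 2):
        weight = cosine(context[u], context[v])
        if weight > threshold:
            edges[(u, v)] = weight
    return edges


# Toy corpus: "graph" and "partition" always co-occur, so they end up
# with identical context vectors and a strongly weighted edge.
docs = [
    "graph partition cut",
    "graph partition vertex",
    "topic model text",
]
context = random_indexing(docs)
edges = build_graph(context)
```

Words that share many documents accumulate similar context vectors, so they end up connected by heavy edges. Under these assumptions, a topic would correspond to a group of words that are densely and heavily connected to one another, which is the kind of subgraph a partitioning step could then try to isolate.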