Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines
Speaker: Asterios Katsifodimos, Assistant Professor and Delft Technology Fellow at the Web Information Systems group at the Delft University of Technology
Title: Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines
Advanced data analysis typically requires some form of preprocessing in order to extract and transform data before processing it with machine learning and statistical analysis techniques.
Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is naturally expressed in linear algebra with iterations, using programming abstractions such as R’s Dataframe or Python’s Pandas. Data scientists nowadays perform scalable end-to-end data analysis programs by either i) using parallel dataflow APIs for the complete program, which introduce impedance mismatch and hinder programmers’ productivity or by ii) using multiple programming paradigms (e.g., SQL and Dataframes) and systems (e.g., Hadoop and Python Pandas). Using multiple paradigms and systems prevents optimization opportunities such as parallelization, sharing of physical data layouts (e.g., partitioning) and data structures, among different parts of a data analysis program.
My talk will be split in two parts. In the first part I will briefly introduce a deeply embedded language in Scala, which enables authoring scalable programs using two abstract data types, namely DataBag and Matrix, enabling joint optimizations over both relational and linear algebra. In the second part of my talk I will discuss a concrete optimization which can be applied in the context of analysis programs comprising both linear and relational algebra operations. More specifically, I will present BlockJoin, a distributed join algorithm which emits block-partitioned results to subsequent linear algebra operations such as matrix multiplications. BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines, in order to minimize very expensive shuffling costs.
Bio Asterios is an Assistant Professor and Delft Technology Fellow at the Web Information Systems group at the Delft University of Technology. Before joining TU Delft, Asterios worked at the SAP Innovation Center in Berlin, designing and implementing scale-out data management architectures for SAP's Leonardo ML foundation. Before joining SAP, Asterios spent three years as a senior researcher with the database systems group in TU Berlin, working on language models and systems for scalable analytics. Asterios received his PhD from INRIA Saclay & Universite Paris-Sud in 2013 for his work on distributed database architectures based on materialized views. During his MSc studies, Asterios was a full-time research member of the High Performance Computing systems Lab (HPCL, now LINC), at the University of Cyprus.