Seminar 2015-11-09 W.Z.
Analytics on Graphs with a Trillion Edges
Speaker: Willy Zwaenepoel, EPFL
Willy Zwaenepoel received his B.S. from the University of Gent, Belgium in 1979, and his M.S. and Ph.D. from Stanford University in 1980 and 1984, respectively. In September 2002, he joined EPFL. He was Dean of the School of Computer and Communications Sciences at EPFL from 2002 to 2011. Before joining EPFL, Willy Zwaenepoel was on the faculty at Rice University, where he was the Karl F. Hasselmann Professor of Computer Science and Electrical and Computer Engineering.
He was elected Fellow of the IEEE in 1998, and Fellow of the ACM in 2000. In 2000 he received the Rice University Graduate Student Association Teaching and Mentoring Award. In 2007 he received the IEEE Tsutomu Kanai Award. He was elected to the European Academy in 2009. He won best paper awards at SIGCOMM 1984, OSDI 1999, USENIX 2000, USENIX 2006 and EuroSys 2007. He was program chair of OSDI in 1996 and EuroSys in 2006, and general chair of MobiSys in 2004. He was also an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems from 1998 to 2002.
Willy Zwaenepoel has worked on a variety of aspects of operating and distributed systems, including microkernels, fault tolerance, parallel scientific computing on clusters of workstations, clusters for web services, mobile computing, database replication and virtualization. He is best known for his work on the TreadMarks distributed shared memory system, which was licensed to Intel and became the basis for Intel’s OpenMP cluster product. His work on high-performance software for network I/O led to the creation of iMimic Networking, Inc., which he led from 2000 to 2005. His current interests include large-scale data stores and software testing. Most recently, his work in software testing led to the creation of BugBuster, a startup based in Lausanne.
Big graphs occur naturally in many applications, most obviously in social networks, but also in many other areas such as biology and forensics. Current approaches to processing large graphs use either supercomputers or very large clusters. In both cases the entire graph must reside in memory before it can be processed. We are pursuing an alternative approach, processing graphs from secondary storage. While this comes with some performance penalty, it makes analytics on very large graphs feasible on a small number of commodity machines. It also has the pleasing property that "if you can store a graph, you can compute on it".
We have developed two systems, one for a single machine and one for a cluster of machines. X-Stream, the single-machine solution, aims to make all secondary storage access sequential. It uses two techniques to achieve this goal: edge-centric processing and streaming partitions. X-Stream outperforms the state-of-the-art GraphChi system because it achieves better sequentiality and requires less preprocessing.
Slipstream, the cluster solution, starts from the observation that there is little benefit to locality when accessing secondary storage over a high-speed network. As a result, we use lightweight dynamic partitioning, focusing on achieving load balance and sequential access to secondary storage. The resulting system achieves good scaling and outperforms other systems. With Slipstream, we have also been able to process a trillion-edge graph, a new milestone for graph size on a small cluster.
I will describe both systems and their performance on a number of benchmarks and in comparison to the state-of-the-art alternatives.
This is joint work with Laurent Bindschaedler, Jasmina Malicevic and Amitabha Roy.
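For readers unfamiliar with the two X-Stream techniques named in the abstract, the following is a minimal illustrative sketch, not X-Stream's actual code or API. It assumes that vertices are split into partitions by contiguous ID range, that each partition's edge list is read front to back, and that in-memory Python lists stand in for the files that would live on secondary storage. The function names (partition_of, bfs_edge_centric) and the choice of breadth-first search as the example computation are hypothetical.

```python
from collections import defaultdict


def partition_of(vertex, num_vertices, num_partitions):
    """Map a vertex ID to a partition by contiguous ID range (an assumption)."""
    size = -(-num_vertices // num_partitions)  # ceiling division
    return vertex // size


def bfs_edge_centric(num_vertices, edges, source, num_partitions=2):
    """Compute BFS levels by repeatedly streaming over edges, not vertices."""
    inf = float("inf")
    level = [inf] * num_vertices
    level[source] = 0

    # One sequential pass buckets edges by the partition of their source
    # vertex. No sorting is needed; each bucket stands in for an on-disk file.
    edge_lists = defaultdict(list)
    for src, dst in edges:
        edge_lists[partition_of(src, num_vertices, num_partitions)].append((src, dst))

    changed = True
    while changed:
        changed = False
        update_lists = defaultdict(list)

        # Scatter: stream each partition's edges in order and append an
        # update to the list of the partition that owns the destination.
        for p in range(num_partitions):
            for src, dst in edge_lists[p]:
                if level[src] != inf:
                    target = partition_of(dst, num_vertices, num_partitions)
                    update_lists[target].append((dst, level[src] + 1))

        # Gather: stream each partition's updates in order and apply them
        # to the vertex state owned by that partition.
        for p in range(num_partitions):
            for dst, candidate in update_lists[p]:
                if candidate < level[dst]:
                    level[dst] = candidate
                    changed = True

    return level


if __name__ == "__main__":
    example_edges = [(0, 1), (1, 2), (0, 3), (3, 2), (2, 4)]
    print(bfs_edge_centric(6, example_edges, source=0))
```

In this sketch the only random access is to the vertex state of the partition currently being processed; the edge and update lists are only ever appended to or read in order, which is the property the abstract describes as making secondary storage access sequential.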