Explaining the outputs of modern data analytics
Speaker: Frank McSherry
Frank McSherry is an independent researcher formerly affiliated with Microsoft Research, Silicon Valley. While there he led the Naiad project, which introduced both differential and timely dataflow, and remains one of the top-performing big data platforms. He also works with differential privacy, due in part to its interesting relationship to data-parallel computation. Frank currently enjoys spending his time in places other than Silicon Valley.
We have made substantial progress with modern data analytics, moving well beyond the realm of simply counting words. We can determine interesting graph properties---connectivity, reachability, matchings---and maintain these properties in real time. We can produce a tremendous amount of output, but it isn't clear that we understand it all yet.
In this talk, I'll explain a framework for interactively determining and tracking *explanations* for outputs of arbitrary differential dataflow computations: subsets of the actual input which reproduce specified outputs. In the relational setting, this would be "provenance" or "lineage", but in the big data space, including iteration and non-monotonic reducers, existing techniques do not work: they return either (i) too much input data or (ii) insufficient input data to reproduce the output. We'll fix all of that.
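To make the notion of an explanation concrete, here is a minimal sketch for a single graph query: reachability from a source node. It is not the framework presented in the talk and does not use differential dataflow. It is a brute-force illustration of the definition, namely a subset of the input that still reproduces a chosen output. The function names `reachable` and `smallest_explanation` and the example edge list are assumptions made up for this illustration.

```python
from itertools import combinations


def reachable(edges, source):
    """Return the set of nodes reachable from `source` via directed `edges`."""
    seen = {source}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


def smallest_explanation(edges, source, target):
    """Brute-force search for a smallest edge subset that still reaches `target`.

    Hypothetical helper for illustration only: it tries subsets in order of
    increasing size, so its cost is exponential in the number of edges.
    """
    for size in range(len(edges) + 1):
        for subset in combinations(edges, size):
            if target in reachable(subset, source):
                return list(subset)
    return None  # `target` is not part of the output, so nothing to explain


# Made-up input: two different paths lead from "a" to "e".
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("c", "e")]

# The output "e is reachable from a" is reproduced by two of the five edges.
explanation = smallest_explanation(edges, "a", "e")
print(explanation)
```

Here the full input is "too much" as an explanation, while any single edge is "insufficient". An exhaustive search like this is only workable for toy inputs, and it sidesteps the harder cases the abstract mentions, such as iteration and non-monotonic reducers.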
This talk reflects joint work with Zaheer Chothia, John Liagouris, and Mothy Roscoe of the Systems Group at ETH Zurich.
Slides: F.McSherry.pdf (PDF, 1.5 MB)