Foundations of Trustworthy AI-Native Data Systems
Time: Mon 2026-06-15 14.00
Location: F3 Flodis, Lindstedtvägen 26
Video link: https://kth-se.zoom.us/j/65502477126
Language: English
Doctoral student: Sonia-Florina Horchidan, Datatekniska och lärande system
Opponent: Doctor Konstantinos Karanasos, Meta Research, Menlo Park, CA, USA
Supervisor: Paris Carbone, Datatekniska och lärande system
QC 20260522
Abstract
In traditional data management systems, queries have well-defined semantics and produce exact results. Integrating Machine Learning inference into data processing pipelines disrupts both properties by introducing operators whose outputs are approximate rather than exact. This thesis establishes two foundations for trustworthy AI-native data systems: empirical characterization of ML operator execution cost, and formal, declarative correctness guarantees that the system enforces on behalf of the user. We develop these foundations across three levels of abstraction, from single-operator cost, to single-operator correctness, to their joint optimization at the pipeline level, and establish Conformal Prediction as a practical statistical foundation for this approach. We introduce Crayfish, a benchmarking framework for ML inference within dataflow engines that reveals how interactions between serving tools, stream processors, and pipeline configurations shape inference costs in ways that are difficult to anticipate from component-level behavior alone. We propose ConANN, the first framework to provide distribution-free recall guarantees for Inverted File-based Approximate Nearest Neighbor search, using conformal methods to replace heuristic index tuning with formal statistical guarantees. At the pipeline level, we study joint cost and correctness optimization in the context of Neural Graph Databases, where multi-hop queries over Knowledge Graphs interleave retrieval and neural execution. We formalize a hybrid query optimization architecture for this setting, then introduce ConRAD, which enforces end-to-end recall guarantees for multi-hop queries while dynamically bypassing expensive neural inference when recall targets can be met with local graph evidence alone. Taken together, these contributions show that the rigor users expect from traditional data systems need not be abandoned as those systems become increasingly driven by Machine Learning.
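For readers unfamiliar with Conformal Prediction, the following is a minimal sketch of generic split conformal calibration, the textbook procedure behind distribution-free guarantees of this kind. It is not the method used in ConANN or ConRAD, whose details the abstract does not give. The function names, the nonconformity scores, and the candidate labels are all hypothetical and chosen only for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    calibration_scores: nonconformity scores from a held-out calibration set
    (hypothetical values here; higher means "less conforming").
    Assuming calibration and test scores are exchangeable, a new score falls
    at or below the returned threshold with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("calibration set must not be empty")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration score rather than the plain empirical quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # threshold (accept everything) carries the guarantee.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]


# Illustrative usage with made-up scores.
scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.4]
tau = conformal_threshold(scores, alpha=0.25)
kept = prediction_set({"a": 0.2, "b": 0.7, "c": 0.95}, tau)
print(tau, kept)
```

The guarantee is marginal over the calibration and test draws and relies only on exchangeability, not on any assumption about the model that produced the scores. This is what "distribution-free" refers to in this generic setting.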