Foundations of Trustworthy AI-Native Data Systems

Time: Mon 2026-06-15 14:00

Location: F3 Flodis, Lindstedtvägen 26

Video link: https://kth-se.zoom.us/j/65502477126

Language: English

Doctoral student: Sonia-Florina Horchidan, Datatekniska och lärande system

Opponent: Doctor Konstantinos Karanasos, Meta Research, Menlo Park, CA, USA

Supervisor: Paris Carbone, Datatekniska och lärande system

QC 20260522

Abstract

In traditional data management systems, queries have well-defined semantics and produce exact results. Integrating Machine Learning inference into data processing pipelines disrupts both properties by introducing operators whose outputs are approximate rather than exact. This thesis establishes two foundations for trustworthy AI-native data systems: empirical characterization of ML operator execution cost, and formal, declarative correctness guarantees that the system enforces on behalf of the user. We develop these foundations across three levels of abstraction, from single-operator cost, to single-operator correctness, to their joint optimization at the pipeline level, and establish Conformal Prediction as a practical statistical foundation for this approach. We introduce Crayfish, a benchmarking framework for ML inference within dataflow engines that reveals how interactions between serving tools, stream processors, and pipeline configurations shape inference costs in ways that are difficult to anticipate from component-level behavior alone. We propose ConANN, the first framework to provide distribution-free recall guarantees for Inverted File-based Approximate Nearest Neighbor search, using conformal methods to replace heuristic index tuning with formal statistical guarantees. At the pipeline level, we study joint cost and correctness optimization in the context of Neural Graph Databases, where multi-hop queries over Knowledge Graphs interleave retrieval and neural execution. We formalize a hybrid query optimization architecture for this setting, then introduce ConRAD, which enforces end-to-end recall guarantees for multi-hop queries while dynamically bypassing expensive neural inference when recall targets can be met with local graph evidence alone. Taken together, these contributions show that the rigor users expect from traditional data systems need not be abandoned as those systems become increasingly driven by Machine Learning.
