Ricardo Caldas Santana
Doctoral student
About me
Building technology that adapts to people, not the other way around

Academic Background
I’m a Ph.D. candidate in Computer Science at KTH (Department of Intelligent Systems) and the Media & Language Research Arena (WASP), supervised by Dr. Pereira. My research focuses on integrating representation learning and large language models (LLMs) to enable seamless, socially aware collaboration between humans and machines.
I earned my M.Sc. in Aerospace Engineering from TU Delft, where I applied deep reinforcement learning to autonomous mobility under the supervision of Dr. Sharpanskykh and Dr. Wei.
Professional Experience
Before academia, I spent two years as a Digital Consultant at McKinsey & Company, driving AI projects and launching new startups in logistics, insurance, and energy. During my master's, I did a half-year research stint at Sensei, working on real-time 3D human pose estimation.
Current Research Opportunities
Interested in collaboration or MSc thesis supervision opportunities? Check out my ongoing research projects below and reach out to discuss potential research synergies.
TutorLM - Multimodal AI Companion for Offline STEM Tutoring (Watch demo video)

Recent advances in Large Language Models (LLMs) position them as transformative tools for scaling personalized tutoring. However, current systems remain predominantly text-centric and require stable connectivity, which limits both their pedagogical effectiveness and their accessibility. TutorLM addresses these constraints through multimodal interaction design and efficient model architectures that enable rich educational experiences even offline.
This research explores fundamental questions at the intersection of AI and human-computer interaction for educational contexts:
- Real-time visual grounding: How can models efficiently process free-form visual content to determine where on the canvas to ground explanations spatially? This involves developing architectures that maintain a coherent understanding across sketches, diagrams, and text (a toy illustration of canvas grounding follows this list).
- Adaptive interaction: How can models synthesize multimodal behavioral context (e.g., gaze, speech, interaction traces) to model the user's state? This involves collecting a multimodal dataset of tutor-student interactions, or working with an existing one, and modeling collaboration dynamics.
- Comprehensive evaluation: How does canvas-based tutoring compare to traditional text-centric interfaces in terms of learning outcomes and interaction quality? This involves developing evaluation frameworks and conducting a user study to measure objective learning gains (e.g., knowledge retention) and subjective student experience (e.g., engagement, enjoyment).
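
To make the first question concrete, here is a minimal sketch of coarse canvas grounding: it splits a canvas screenshot into tiles, embeds the tiles and the explanation text with an off-the-shelf CLIP model, and anchors the explanation at the best-matching tile. The model name, tiling scheme, and `ground_on_canvas` helper are illustrative assumptions, not TutorLM's actual architecture.

```python
# A toy sketch of coarse visual grounding on a canvas (not TutorLM's method):
# embed canvas tiles and the explanation text with CLIP, then pick the tile
# whose image embedding best matches the text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative choice of model; any image-text contrastive model would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground_on_canvas(canvas: Image.Image, explanation: str, grid: int = 4):
    """Return the (row, col) of the canvas tile most relevant to the text."""
    w, h = canvas.size
    # Cut the canvas into a grid x grid set of tiles, row-major order.
    tiles = [
        canvas.crop((c * w // grid, r * h // grid,
                     (c + 1) * w // grid, (r + 1) * h // grid))
        for r in range(grid) for c in range(grid)
    ]
    inputs = processor(text=[explanation], images=tiles,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_tiles): text-to-tile similarities.
    best = out.logits_per_text.argmax().item()
    return divmod(best, grid)

# Example: anchor a hint near the most relevant region of a student's sketch.
# canvas = Image.open("student_canvas.png").convert("RGB")
# row, col = ground_on_canvas(canvas, "the slope of the tangent line")
```

A real system would of course need finer-grained localization than tiles (e.g., region proposals or pointing tokens) and streaming, on-device inference to meet the real-time and offline constraints above.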