Improving Sample-efficiency of Reinforcement Learning from Human Feedback

Time: Tue 2025-04-01 14.00

Location: F3 (Flodis), Lindstedtsvägen 26

Video link: https://kth-se.zoom.us/j/62755931085

Language: English

Subject area: Computer Science

Doctoral student: Simon Holk, Robotics, Perception and Learning (RPL)

Opponent: Research Associate Professor Brad Knox, The University of Texas at Austin, Austin, TX, USA

Supervisor: Associate Professor Iolanda Leite, Robotics, Perception and Learning (RPL)



Abstract

With the rapid advancement of AI, the technology has moved out of industrial and laboratory settings and into the hands of everyday people. Once AI and robot agents are placed in everyday households, they need to be able to take human needs into account. With methods like Reinforcement Learning from Human Feedback (RLHF), an agent can learn desirable behavior by learning a reward function from human feedback or by optimizing a policy on it directly. Unlike vision models and large language models (LLMs), which benefit from internet-scale data, RLHF is limited by the amount of feedback provided, since collecting it requires additional human effort. In this thesis, we look into how to reduce the amount of feedback humans provide, and thus their burden, when estimating a reward function, without degrading the estimate. We investigate the fundamental trade-off between the informativeness and the efficiency of feedback from a preference-based learning perspective. To this end, we introduce several methods that fall into two groups: implicit methods, which increase the quality of the feedback without additional human effort, and explicit methods, which aim to drastically increase the information content by using additional feedback types. To implicitly improve the efficiency of preference feedback, we use Active Learning (AL) to improve the diversity of samples by strategically picking from different clusters in a representation learned by a Variational Autoencoder (VAE). Furthermore, we make use of the relationship between the segments of a preference pair to synthesize data by interpolating in the latent space of the VAE. While the implicit methods have the benefit of requiring no extra effort, they still suffer from the limited amount of information that preferences alone can provide. One limitation of preferences over trajectories is that there is no discounting: if a trajectory is preferred, the assumption is that the whole trajectory is preferred, which leads to causal confusion. Therefore, we introduce a new form of feedback, called highlights, that lets the user mark which parts of a trajectory were good and which were bad. Furthermore, leveraging LLMs, we create a method that lets humans explain their preferences in natural language so that the preferred parts can be deduced. Overall, this thesis takes a step away from the assumption of internet-scale data and shows how alignment can be achieved with less human feedback.
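
To make the preference-based setup described above concrete, the following is a minimal PyTorch sketch of how a reward function is commonly fit to pairwise preferences over trajectory segments with a Bradley-Terry model. The network architecture, tensor shapes, and names are illustrative assumptions, not the implementation used in the thesis.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Small MLP mapping a state-action feature vector to a scalar reward.
    Architecture and sizes are illustrative placeholders."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    seg_a: torch.Tensor,
                    seg_b: torch.Tensor,
                    prefers_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss on a batch of labelled segment pairs.
    seg_a, seg_b: (batch, steps, obs_dim) segments; prefers_a: (batch,) in {0, 1}.
    The return of a segment is the undiscounted sum of predicted per-step rewards,
    which is exactly the 'whole trajectory is preferred' assumption the abstract
    identifies as a source of causal confusion."""
    ret_a = model(seg_a).sum(dim=-1)
    ret_b = model(seg_b).sum(dim=-1)
    # Probability that segment a is preferred under the Bradley-Terry model
    p_a = torch.sigmoid(ret_a - ret_b)
    return nn.functional.binary_cross_entropy(p_a, prefers_a.float())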
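
The implicit strategy of picking diverse queries from clusters in a VAE's latent representation could look roughly like the sketch below. The vae.encode helper, the use of k-means, and the choice of one query per cluster are assumptions made for illustration, not the thesis's query-selection rule.

import numpy as np
from sklearn.cluster import KMeans

def select_diverse_queries(vae, segments, n_queries: int, seed: int = 0) -> list[int]:
    """Pick indices of candidate trajectory segments to query for preferences.

    segments: a sequence of candidate segments; vae.encode is an assumed helper
    returning a 1-D latent vector per segment. We cluster the latent codes and
    take the segment closest to each cluster centre, so queries cover different
    regions of the learned representation rather than near-duplicates.
    """
    latents = np.stack([vae.encode(s) for s in segments])
    km = KMeans(n_clusters=n_queries, random_state=seed).fit(latents)
    picked = []
    for c in range(n_queries):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(latents[members] - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return picked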
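
The data-synthesis step, interpolating between the two segments of a labelled preference pair in the VAE's latent space, might be sketched as follows. The encode/decode helpers, the mixing coefficient, and the rule of reusing the original label are assumptions for illustration and not necessarily the scheme proposed in the thesis.

import torch

@torch.no_grad()
def synthesize_pair(vae, seg_a: torch.Tensor, seg_b: torch.Tensor,
                    alpha: float = 0.3):
    """Hypothetical latent-space interpolation for one labelled preference pair.

    vae.encode / vae.decode are assumed helpers returning a latent code and a
    reconstructed segment; alpha controls how far each segment is pulled toward
    the other. The synthetic pair inherits the original preference label on the
    assumption that a mild interpolation preserves the ordering, which is only
    one possible labelling rule.
    """
    z_a = vae.encode(seg_a)            # latent code of the preferred segment
    z_b = vae.encode(seg_b)            # latent code of the dispreferred segment
    z_a_new = (1 - alpha) * z_a + alpha * z_b
    z_b_new = (1 - alpha) * z_b + alpha * z_a
    return vae.decode(z_a_new), vae.decode(z_b_new)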

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-360983