
Multilingual Language Models

Studies of Pre-Training Approaches and Hallucination Detection

Time: Mon 2024-12-16 14.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Video link: https://kth-se.zoom.us/s/3719008936

Language: English

Doctoral student: Evangelia Gogoulou, Software and Computer Systems (SCS), RISE Research Institutes of Sweden

Opponent: Professor Barbara Plank, Ludwig-Maximilians-Universität München, München, Germany

Supervisors: Professor Magnus Boman, Software and Computer Systems (SCS); Professor Joakim Nivre, RISE Research Institutes of Sweden, Uppsala University; Professor Hedvig Kjellström, Robotics, Perception and Learning (RPL)



Abstract

The performance of large language models has been improving steadily but varies considerably across languages. One strategy for addressing this disparity is to train multilingual models that enable cross-lingual transfer, so that knowledge from high-resource languages can be leveraged to improve performance on low-resource languages, although there are limits to the number of languages a model can effectively support. Understanding the factors that influence cross-lingual transfer is therefore crucial for building models that perform consistently across languages. This thesis investigates how the interaction between languages during pre-training affects model performance under different training schemes, model architectures, and evaluation criteria. We first investigate the scalability of multilingual joint pre-training in the generative setting. We pre-train the first large-scale autoregressive language model for English and Swedish and find that its performance improves with increasing data volume and number of parameters. We then study forward cross-lingual transfer effects in the case of incremental language pre-training. Our experiments, in which monolingual encoder language models are transferred from each of four source languages to English, demonstrate that forward transfer effects, measured in terms of downstream performance, are consistently positive. Building on this, we analyze both forward and backward effects of incrementally pre-training autoregressive language models on sequences of languages presented in varying orders. While forward transfer effects are again always positive, we observe that backward transfer effects depend on the order and characteristics of the languages. Our analysis of possible explanatory factors for backward transfer reveals the potentially important roles of language contamination and syntactic similarity. Lastly, we conduct a comparative study of autoregressive models with varying language coverage on the task of detecting intrinsic hallucinations in paraphrase generation and machine translation scenarios, across different languages. Our results show that the models perform consistently across languages, and also suggest that model-specific factors, such as model size and instruction tuning, have a large impact on performance. These findings advance the understanding of cross-lingual transfer and provide foundations for multilingual models with enhanced learning capacity and consistent performance across previously learned languages. Additionally, our work contributes to the evaluation of autoregressive multilingual language models by providing resources and methods for studying the hallucination phenomenon in machine-generated text.
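As a concrete illustration of how transfer effects of this kind are commonly quantified, forward and backward transfer can be expressed as the change in downstream score on a language before and after continued pre-training on a new language. The sketch below is not code from the thesis; the languages and scores are invented for illustration only.

# Hypothetical illustration: transfer effects as score differences.
# Downstream scores (e.g., accuracy) on each language, recorded before and
# after incremental pre-training on English. All numbers are made up.
scores_before = {"sv": 0.71, "en": 0.55}   # model initially pre-trained on Swedish
scores_after  = {"sv": 0.66, "en": 0.78}   # after continued pre-training on English

def transfer_effect(lang: str) -> float:
    """Change in downstream score on `lang` caused by the incremental pre-training step."""
    return scores_after[lang] - scores_before[lang]

forward = transfer_effect("en")    # effect on the newly added language
backward = transfer_effect("sv")   # effect on the previously learned language

print(f"forward transfer:  {forward:+.2f}")   # positive: the new language benefits
print(f"backward transfer: {backward:+.2f}")  # negative: forgetting of the earlier language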

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-356567