
Our upcoming ICLR paper: “KVComm: Enabling Efficient LLM Communication through Selective KV Sharing”

We are very happy to announce that our paper will appear in the Proceedings of the Fourteenth International Conference on Learning Representations (ICLR). The paper title is “KVComm: Enabling Efficient LLM Communication through Selective KV Sharing”, and this is joint work with Xiangyu Shi, Marco Chiesa, Gerald Q. Maguire Jr., and Dejan Kostic (all from KTH). We have already released the source code.

The full abstract is below:

Large Language Models (LLMs) are increasingly deployed in multi-agent systems, where effective inter-model communication is crucial. Existing communication protocols either rely on natural language, incurring high inference costs and information loss, or on hidden states, which suffer from information concentration bias and inefficiency. To address these limitations, we propose KVComm, a novel communication framework that enables efficient communication between LLMs through selective sharing of KV pairs. KVComm leverages the rich information encoded in the KV pairs while avoiding the pitfalls of hidden states. We introduce a KV layer-wise selection strategy based on attention importance scores with a Gaussian prior to identify the most informative KV pairs for communication. Extensive experiments across diverse tasks and model pairs demonstrate that KVComm achieves comparable performance to the upper-bound method, which directly merges inputs to one model without any communication, while transmitting as few as 30% of layers’ KV pairs. Our study highlights the potential of KV pairs as an effective medium for inter-LLM communication, paving the way for scalable and efficient multi-agent systems.
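To give a rough sense of the layer-selection idea described in the abstract, here is a minimal conceptual sketch (not the paper's implementation): each layer's KV cache gets an attention-importance score, that score is weighted by a Gaussian prior over layer depth, and only the top-scoring fraction of layers is shared. The function name, the exact scoring formula, and the prior parameters below are illustrative assumptions.

```python
import numpy as np

def select_kv_layers(attention_scores, keep_ratio=0.3, prior_mean=None, prior_std=None):
    """attention_scores: one importance value per layer (e.g., the attention mass
    a layer assigns to the shared context). Returns indices of the layers whose
    KV pairs would be transmitted."""
    n_layers = len(attention_scores)
    depths = np.arange(n_layers)
    mu = prior_mean if prior_mean is not None else n_layers / 2
    sigma = prior_std if prior_std is not None else n_layers / 4
    # Gaussian prior over layer depth, favouring layers near depth `mu` (assumed form).
    prior = np.exp(-0.5 * ((depths - mu) / sigma) ** 2)
    combined = np.asarray(attention_scores) * prior
    k = max(1, int(round(keep_ratio * n_layers)))
    # Keep the top-k layers by combined score, highest first.
    return np.argsort(combined)[-k:][::-1]

# Example: 32 hypothetical layers, sharing KV pairs from roughly 30% of them.
scores = np.random.rand(32)
print(select_kv_layers(scores, keep_ratio=0.3))
```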

Our first paper: “Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning”

We are happy to announce that our first paper will appear in the Proceedings of the Second International Workshop on Large Language Models for Code (LLM4Code). The paper title is “Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning”, and this is joint work with Laura Puccioni (now at Spotify), Alireza Farshin (now at NVIDIA), Mariano Scazzariello (now at RISE), Changjie Wang, Marco Chiesa, and Dejan Kostic (KTH).

The full abstract is below; Laura’s video presentation is also available.

Large Language Models (LLMs) have demonstrated their exceptional performance in various complex code generation tasks. However, their broader adoption is limited by significant computational demands and high resource requirements, particularly memory and processing power. To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters. However, current approaches do not focus on the efficient extraction of programming-language-specific sub-models. In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning (i.e., Wanda). We investigate the impact of different domain-specific calibration datasets on pruning outcomes across three distinct domains and extend our analysis to extracting four language-specific sub-models: Python, Java, C++, and JavaScript. We are the first to efficiently extract programming-language-specific sub-models using appropriate calibration datasets while maintaining acceptable accuracy w.r.t. full models. We are also the first to provide analytical evidence that domain-specific tasks activate distinct regions within LLMs, supporting the creation of specialized sub-models through unstructured pruning. We believe that this work has significant potential to enhance LLM accessibility for coding by reducing computational requirements to enable local execution on consumer-grade hardware, and supporting faster inference times critical for real-time development feedback.
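For readers unfamiliar with Wanda, the abstract's mention of unstructured pruning with domain-specific calibration data boils down to the following minimal sketch (illustrative only, not the paper's code): each weight is scored by |W_ij| · ||x_j||_2, where ||x_j||_2 is the norm of input feature j over a calibration batch (here, hypothetically, code-only prompts), and the lowest-scoring fraction of weights in each output row is zeroed. Variable names and the sparsity level are assumptions.

```python
import numpy as np

def wanda_prune(weight, calib_inputs, sparsity=0.5):
    """weight: (out_features, in_features) matrix of one linear layer;
    calib_inputs: (n_samples, in_features) activations collected from a
    domain-specific calibration set (e.g., Python-only prompts)."""
    feat_norms = np.linalg.norm(calib_inputs, axis=0)   # ||x_j||_2 per input feature
    scores = np.abs(weight) * feat_norms[None, :]       # Wanda importance metric
    k = int(sparsity * weight.shape[1])                 # weights to drop per output row
    pruned = weight.copy()
    for row in range(weight.shape[0]):
        drop = np.argsort(scores[row])[:k]              # least important weights in this row
        pruned[row, drop] = 0.0
    return pruned

# Example: prune a toy layer at 50% sparsity with a tiny synthetic calibration batch.
W = np.random.randn(8, 16)
calib = np.random.randn(4, 16)
print(np.mean(wanda_prune(W, calib) == 0))  # ~0.5
```

Swapping the calibration batch (Python vs. Java vs. C++ vs. JavaScript prompts) is what steers which weights survive, which is the mechanism behind deriving language-specific sub-models in the paper.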