ICLR 2026 Names 'Transformers Are Inherently Succinct' and 'LLMs Get Lost in Multi-Turn Conversation' as Outstanding Papers (April 2026)
ICLR's Outstanding Paper Committee has named two winners and one Honorable Mention from the Rio de Janeiro conference: a theoretical proof that Transformers are exponentially more compact than RNNs, a Microsoft/Salesforce study showing LLMs lose an average of 39% of their performance in multi-turn chats, and an optimizer breakthrough that improves the trendy Muon algorithm.
The Fourteenth International Conference on Learning Representations (ICLR 2026) announced its three top papers in Rio de Janeiro: Transformers are Inherently Succinct, LLMs Get Lost In Multi-Turn Conversation, and Honorable Mention The Polar Express. The picks span pure theory, empirical evaluation, and practical optimizer engineering — an unusually broad sweep that highlights how the field is splitting between proving what models can do, measuring what they actually do, and squeezing more out of the GPUs underneath them.
What Happened
The Outstanding Paper Committee, chaired by Gautam Kamath of the University of Waterloo and including reviewers from Cornell, MIT, KAUST, UCLA, the Vector Institute and Stanford, selected the awards from a longlist of 36 papers flagged by Area Chairs or via top reviewer scores. Selection was conducted over five weeks in March and April 2026 and announced on the official ICLR Blog. The conference itself takes place at the Riocentro Convention Center.
The two Outstanding Paper winners are:
- Transformers are Inherently Succinct by Pascal Bergsträßer (RPTU Kaiserslautern), Ryan Cotterell (ETH Zürich) and Anthony Widjaja Lin (RPTU Kaiserslautern / Max Planck Institute). The paper proves that Transformers can encode certain languages exponentially more compactly than RNNs and modern State-Space Models, and doubly exponentially more compactly than finite automata — a formal explanation for why Transformers keep beating sequence models even with similar parameter counts. As a corollary, the authors prove that verifying properties of Transformers is EXPSPACE-complete, meaning provably intractable. (The succinctness claim is restated in notation below this list.)
- LLMs Get Lost In Multi-Turn Conversation by Philippe Laban and Hiroaki Hayashi (Microsoft Research), Yingbo Zhou (Salesforce) and Jennifer Neville (Microsoft Research / Purdue). Using a new "sharded simulation" benchmark that splits fully-specified instructions into fragments delivered turn-by-turn, the authors show that 15 frontier LLMs lose an average of 39% in performance when the same task is delivered across multiple under-specified turns rather than in a single prompt. The reliability gap is even larger than the absolute drop: the same model on the same task swings wildly between conversations.
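In notation, and as illustrative shorthand rather than the paper's own definitions, the succinctness gaps from Bergsträßer et al. say the following about the smallest model of each type recognizing a suitable family of languages:

```latex
% Illustrative shorthand for the paper's succinctness gaps. For a family
% of languages (L_n), let |M(L_n)| be the size (parameter count or state
% count) of the smallest model of type M recognizing L_n. The paper
% exhibits families where:
\[
  |\mathrm{Transformer}(L_n)| = \mathrm{poly}(n), \quad
  |\mathrm{RNN}(L_n)| = 2^{\Omega(n)}, \quad
  |\mathrm{DFA}(L_n)| = 2^{2^{\Omega(n)}}
\]
% i.e., RNNs (and SSMs) need exponentially more parameters, and finite
% automata doubly exponentially more states, than a Transformer
% recognizing the same languages.
```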
The Honorable Mention is The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm by Noah Amsel (NYU), David Persson (EPFL), Christopher Musco (NYU) and Robert M. Gower (Flatiron Institute). The paper derives an optimal polynomial approximation for the matrix polar decomposition that, when dropped into the popular Muon optimizer, delivers consistent validation-loss improvements training a GPT-2 architecture on 1–10 billion FineWeb tokens — and remains stable in bfloat16 on H100s. (A minimal sketch of the kind of iteration the paper optimizes follows.)
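For intuition, here is a minimal NumPy sketch of the classic Newton-Schulz iteration for the polar factor, the simplest member of the family of odd-polynomial iterations that The Polar Express tunes. The fixed coefficients (1.5, -0.5) are the textbook ones, not the optimal per-step coefficients the paper derives:

```python
import numpy as np

def newton_schulz_polar(G, steps=15):
    """Approximate the polar factor U of G (where G = U P) via the
    classic Newton-Schulz iteration  X <- 1.5 X - 0.5 X X^T X.

    The Polar Express paper replaces the fixed (1.5, -0.5) coefficients
    with per-step coefficients that are provably optimal; the textbook
    ones below are for illustration only.
    """
    # Convergence needs the singular values of X in (0, sqrt(3));
    # dividing by the Frobenius norm puts them in (0, 1].
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X  # singular values driven toward 1: X is near-orthogonal

# In Muon, each weight matrix's momentum-buffered update is passed
# through an iteration of this shape before being applied.
G = np.random.randn(64, 32)
U = newton_schulz_polar(G)
print(np.max(np.abs(U.T @ U - np.eye(32))))  # small: U is near-orthogonal
```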
Key Details
- 15 LLMs tested in the Microsoft/Salesforce study — including GPT-4o, Claude 3.5 Sonnet, Llama 3.1, Gemini 1.5 — all degraded significantly when conversations spanned multiple turns. The authors emphasize that re-runs against the latest frontier models still show the drop.
- 39% average performance drop across six generation tasks (code, summarization, multi-document QA, etc.) when prompts were delivered as multi-turn shards instead of single-turn instructions (the setup is sketched after this list).
- EXPSPACE-completeness — Bergsträßer et al. prove that Transformer verification is EXPSPACE-complete, the same complexity class as the hardest questions about regular expressions with squaring, meaning formal model-checking is provably intractable for production-scale Transformers.
- Polar Express in Muon — improvements were measured on 1B–10B FineWeb tokens with a GPT-2 model in bfloat16 precision; the authors release reference code under MIT license at noahamsel/polar-express.
- Selection methodology — five weeks of review, three phases, Area Chair flags + top reviewer scores, with explicit conflict-of-interest controls. The committee modelled the process on TMLR.
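To make the sharded setup concrete, here is a minimal sketch of the idea behind the benchmark. The shard texts, the `chat` helper, and the `score` grader are hypothetical stand-ins; the paper's actual shard construction and conversation simulator are more involved:

```python
# Minimal sketch of the sharded-simulation idea: the same fully specified
# task is issued either as one prompt (single-turn) or as fragments
# revealed one per turn (multi-turn). `chat` and `score` are hypothetical
# stand-ins for an LLM API call and a task-specific grader.

FULL_INSTRUCTION = (
    "Write a Python function that parses an ISO-8601 date string, "
    "validates it, and returns a (year, month, day) tuple; raise "
    "ValueError on malformed input."
)

SHARDS = [  # under-specified fragments of the same task
    "I need a function that handles dates.",
    "It should take an ISO-8601 string.",
    "Return a (year, month, day) tuple.",
    "Oh, and raise ValueError if the input is malformed.",
]

def single_turn(chat, score):
    answer = chat([{"role": "user", "content": FULL_INSTRUCTION}])
    return score(answer)

def multi_turn(chat, score):
    history, answer = [], None
    for shard in SHARDS:  # reveal the spec one fragment per turn
        history.append({"role": "user", "content": shard})
        answer = chat(history)
        history.append({"role": "assistant", "content": answer})
    return score(answer)  # the paper reports ~39% lower scores here
```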
What Researchers and Developers Are Saying
The picks have generated unusually wide engagement on Hacker News and X. Co-author Philippe Laban shared on X that they re-ran the multi-turn experiments against newer 2026 models and found "performance still drops, but with modest gains: mostly from improvements on the Python coding task," reinforcing the paper's relevance for current frontier systems. The Hacker News thread on the awards announcement noted that the multi-turn finding is the empirical foundation for why agentic systems like OpenHands and Claude Code now lean on aggressive context resets rather than long conversations.
On the theory side, the Transformers are Inherently Succinct award has reignited debate over whether State-Space Models like Mamba can ever fully replace Transformers — the paper's core result is essentially a no-go theorem for that thesis at fixed model size. ML researcher Sasha Rush summarized the result on X as "finally, a complexity-theoretic reason your RNN keeps losing." The Polar Express Honorable Mention is being widely shared in the optimizer community on r/MachineLearning, where Muon has become the most-discussed alternative to AdamW.
What This Means for Developers
The most actionable result for application developers is the multi-turn paper: if your product wraps an LLM in a chat interface, expect a roughly 40% capability drop versus the same model handed the full task in one prompt. The practical mitigation, already adopted by tools like Cursor 3, Cowork, and Devin, is to summarize-and-restart: consolidate a long conversation into a single fresh prompt before delegating any non-trivial multi-step task (a minimal sketch follows). The Polar Express result is more of a near-future signal: if you train your own models, expect Muon-with-Polar-Express to land in major training stacks (HuggingFace Accelerate, Mosaic Composer, NVIDIA NeMo) within the next quarter.
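A minimal sketch of that mitigation, assuming a generic `chat(messages) -> str` helper standing in for any chat-completions-style API; the consolidation prompt wording is illustrative, not taken from any of the tools named above:

```python
def summarize_and_restart(chat, history):
    """Collapse a long multi-turn history into one fresh, fully
    specified prompt, then hand the task to a clean context.

    `chat(messages) -> str` is a stand-in for any chat-completions API.
    """
    consolidation = chat(history + [{
        "role": "user",
        "content": (
            "Restate everything I have asked for in this conversation "
            "as a single, complete, self-contained task specification. "
            "Include every requirement and constraint; omit the "
            "back-and-forth."
        ),
    }])
    # Fresh context: the model now sees one fully specified prompt,
    # the regime where the paper measured the best performance.
    return chat([{"role": "user", "content": consolidation}])
```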
What's Next
All three papers are presented in full at the conference's Oral sessions; recordings will be uploaded to the ICLR 2026 virtual site. ICLR 2027 will be held in Singapore, with the Call for Papers expected to open in September 2026.
Sources
- ICLR Blog: Announcing the ICLR 2026 Outstanding Papers — the official primary source from the Outstanding Paper Committee.
- arXiv: Transformers are Inherently Succinct — full Bergsträßer/Cotterell/Lin paper.
- arXiv: LLMs Get Lost In Multi-Turn Conversation — Laban/Hayashi/Zhou/Neville paper and sharded-simulation benchmark.
- arXiv: The Polar Express — Optimal Matrix Sign Methods and Their Application to Muon — Amsel et al. on improving the Muon optimizer.
- ICLR 2026 Virtual Site — full list of accepted papers, posters and oral recordings.
- Microsoft Research — official publication page with author bios and dataset release.