<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llms on Left 4 More</title><link>https://left4more.com/tags/llms/</link><description>Recent content in Llms on Left 4 More</description><generator>Hugo</generator><language>en-au</language><lastBuildDate>Thu, 14 May 2026 01:36:50 +1000</lastBuildDate><atom:link href="https://left4more.com/tags/llms/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Benchmarks Are Lying to You (But Not in the Way You Think)</title><link>https://left4more.com/posts/ai-benchmarks-are-lying-to-you-but-not-in-the-way/</link><pubDate>Thu, 14 May 2026 01:36:50 +1000</pubDate><guid>https://left4more.com/posts/ai-benchmarks-are-lying-to-you-but-not-in-the-way/</guid><description>&lt;p>There&amp;rsquo;s a post doing the rounds this week about GPT-5.5 cracking something called ProgramBench for the first time. It&amp;rsquo;s a software engineering benchmark that&amp;rsquo;s been resistant to frontier models until now, and the result is genuinely interesting. But the discussion underneath it is, predictably, a mess.&lt;/p>
&lt;p>Some of it is the usual stuff: people declaring their preferred model the winner, others pointing out the charts are misleading, and a few useful technical observations buried under the noise. Normal internet discourse. What caught my attention wasn&amp;rsquo;t the headline result, though. It was a quieter observation someone made about the benchmark itself.&lt;/p></description></item></channel></rss>