(no title)
mattcollins|4 months ago
To explain the 60% a bit more...
With small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.
For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
padolsey|4 months ago
As you can see, recall is near 100% across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here are the leaderboards on formats, FWIW:
So, the biggest takeaway really is: use the best model you can reasonably afford, and then format will matter less. The cheapest 100% coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1.
And if you have no control over the model, then use CSV or Markdown Table.
ysleepy|4 months ago
rovr138|4 months ago
Interesting.
On your section "Limitations and Areas for Further Study",
What I'd be curious about for future work:
I'm curious to know whether what it fails on is the same each time, whether it changes depending on the location in the data, and whether there's a bias. Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y on certain types of questions?
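One rough way to check for a positional effect would be to bucket each question by where its target record sits in the input and compare error rates per bucket. This is only a sketch: the `(record_position, is_correct)` result shape and the `error_rate_by_bucket` helper are hypothetical, and the sample data is made up for illustration, not taken from the benchmark.

```python
from collections import defaultdict


def error_rate_by_bucket(results, num_records, num_buckets=5):
    """Return {bucket_index: error_rate} for buckets that contain results.

    results: iterable of (record_position, is_correct) pairs, where
    record_position is the 0-based index of the record the question asks about.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for position, is_correct in results:
        # Map the position onto one of num_buckets equal-width slices of the input.
        bucket = min(position * num_buckets // num_records, num_buckets - 1)
        totals[bucket] += 1
        if not is_correct:
            errors[bucket] += 1
    return {b: errors[b] / totals[b] for b in sorted(totals)}


# Made-up results, purely illustrative (not real benchmark output).
sample = [(0, True), (1, True), (450, False), (500, False), (520, True), (999, True)]
rates = error_rate_by_bucket(sample, num_records=1000)
for bucket, rate in rates.items():
    print(f"bucket {bucket}: {rate:.0%} errors")
```

If the error rate clusters in particular buckets across runs, that would point to a location effect; if the same question IDs fail regardless of position, it would point to the question or value instead.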
Good idea
CuriouslyC|4 months ago
Redster|4 months ago
It looks to me like the most concise way of representing each of these tables was CSV, followed by a standard Markdown table. The number of tokens appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano), but in larger models, or with more context besides the data table itself, my guess is that preserving context might be of higher value than the higher LLM legibility of Markdown-KV.
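To get a feel for the size difference, here is a small sketch that serializes the same records three ways and compares lengths. Assumptions: the records are invented, the Markdown-KV layout below is my guess at that format rather than the exact one used in the test, and character count is only a crude proxy for tokens (a real comparison would use the model's tokenizer).

```python
import csv
import io

# Invented sample records, for illustration only.
records = [
    {"id": 1, "name": "Alice", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "York"},
]


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown_table(rows):
    cols = list(rows[0])
    lines = [
        "| " + " | ".join(cols) + " |",
        "| " + " | ".join("---" for _ in cols) + " |",
    ]
    lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
    return "\n".join(lines) + "\n"


def to_markdown_kv(rows):
    # Assumed layout: one heading per record, then "key: value" lines.
    blocks = []
    for i, r in enumerate(rows, start=1):
        body = "\n".join(f"{k}: {v}" for k, v in r.items())
        blocks.append(f"## Record {i}\n\n{body}\n")
    return "\n".join(blocks)


formats = {
    "csv": to_csv,
    "markdown_table": to_markdown_table,
    "markdown_kv": to_markdown_kv,
}
sizes = {name: len(fn(records)) for name, fn in formats.items()}
for name, size in sizes.items():
    print(f"{name}: {size} characters")
```

The key-value layout repeats every column name in every record, so its overhead grows with the number of columns, whereas CSV and the Markdown table state the column names once.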