Evaluating LLMs for my personal use case
Summary
It's great that AI can win maths Olympiads, but that's not what I'm doing. I mostly ask basic Rust, Python, Linux and life questions. So I did my own e...
Similar Articles (10 found)
69.8% similar
Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult
24th November 2025
Anthropic released Claude Opus 4.5 this morning, which they ...
69.7% similar
Things we learned about LLMs in 2024
31st December 2024
A lot has happened in the world of Large Language Models over the course of 2024. Here's a rev...
68.5% similar
GPT-5: Key characteristics, pricing and model card
7th August 2025
I've had preview access to the new GPT-5 model family for the past two weeks (see r...
68.2% similar
2 Years of ML vs. 1 Month of Prompting
November 7, 2025
Recalls at major automakers cost hundreds of millions of dollars a year. It's a huge issue. To...
66.7% similar
GPT-5.2
11th December 2025
OpenAI reportedly declared a "code red" on the 1st of December in response to increasingly credible competition from the li...
65.5% similar
> the generation of 281,128 augmented examples, from which 1,000 were
held out as a benchmark test set.
This model is trained on a custom dataset of 2...
64.9% similar
When we launched Skald, we wanted it to not only be self-hostable, but also for one to be able to run it without sending any data to third-parties.
Wi...
64.9% similar
Olmo 3 is a fully open LLM
22nd November 2025
Olmo is the LLM series from Ai2, the Allen institute for AI. Unlike most open weight models these are not...
64.5% similar
Same AI, Different Answer: How Tiny Prompts Can Change Everything
Why Does ChatGPT Sometimes Feel Different?
If you've used AI chatbots like ChatGPT f...
64.3% similar
Vibe Coding as a Coding Veteran
From 8-bit Assembly to English-as-Code
By now, we've all heard about this "vibe coding" thing: you let an AI assistant...