Similar Articles

How we run GPT OSS 120B at 500+ tokens per second on NVIDIA GPUs | Baseten Blog

https://www.baseten.co/blog/sota-performance-for-gpt-oss-120b-on-nvidia-gpus/

Domain: www.baseten.co Added: 2025-08-06 Status: ✓ Success

www.baseten.co,model performance optimization,bug fixing,nvidia gpus,experimentation,benchmarking

Day zero model performance optimization work is a mix of experimentation, bug fixing, and benchmarking guided by intuition and experience. This writeup outlines the process we followed to achieve SOTA...

Similar Articles (10 found)

https://news.ycombinator.com/item?id=37484135

news.ycombinator.com 2025-07-13

open-source,gpt-4,tech,hackernews,llms,news,news.ycombinator.com,machine learning

There has been a lot of interest on HN in fine-tuning open-source LLMs recently (e.g., Anyscale's post at https://news.ycombinator.com/item?id=37090632)...

🔍 67.0% similar

GPT-5: Strategic Implications

https://nextword.substack.com/p/gpt-5-strategic-implications

nextword.substack.com 2025-08-28

nextword.substack.com

Each month, this newsletter is read by over 45K+ operators, investors, and tech / product leaders and executives. If you found value in this newslette...

https://news.ycombinator.com/item?id=44840728

news.ycombinator.com 2025-08-13

news.ycombinator.com

Sam said yesterday that chatgpt handles ~700M weekly users. Meanwhile, I can't even run a single GPT-4-class model locally without insane VRAM or pain...

🔍 66.5% similar

OpenAI's Open Source Strategy

https://nextword.substack.com/p/openai-open-source-strategy-gpt-oss

nextword.substack.com 2025-08-28

nextword.substack.com

OpenAI just released two open-weight models—gpt-oss-120b and gpt-oss-20b—after months of anticipation (you can try them here). That means anyone with ...

🔍 64.7% similar

LLM Engineer's Almanac - Workloads

https://modal.com/llm-almanac/workloads

modal.com 2026-02-03

modal.com

The three types of LLM workloads and how to serve them We hold this truth to be self-evident: not all workloads are created equal. But for large langu...

🔍 64.1% similar

GPT-5.2

https://simonwillison.net/2025/Dec/11/gpt-52/#atom-entries

simonwillison.net 2025-12-18

simonwillison.net

GPT-5.2 11th December 2025 OpenAI reportedly declared a “code red” on the 1st of December in response to increasingly credible competition from the li...

https://simonwillison.net/2025/Nov/24/claude-opus/#atom-entries

simonwillison.net 2025-12-18

simonwillison.net

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult 24th November 2025 Anthropic released Claude Opus 4.5 this morning, which they ...

https://neutree.ai/blog/nano-vllm-part-1

neutree.ai 2026-02-03

neutree.ai

Understanding LLM Inference Engines: Inside Nano-vLLM (Part 1) Architecture, Scheduling, and the Path from Prompt to Token When deploying large langua...

🔍 62.2% similar

So you wanna build a local RAG?

https://blog.yakkomajuri.com/blog/local-rag

blog.yakkomajuri.com 2025-11-28

blog.yakkomajuri.com

When we launched Skald, we wanted it to not only be self-hostable, but also for one to be able to run it without sending any data to third-parties. Wi...

https://www.seangoedecke.com/inference-batching-and-deepseek/

www.seangoedecke.com 2025-07-13

deepseek,ai models,throughput,latency,batch size,www.seangoedecke.com

Why DeepSeek is cheap at scale but expensive to run locally Why is DeepSeek-V3 supposedly fast and cheap to serve at scale, but too slow and expensive...
