botirk's comments

botirk | 1 month ago

We analyzed per-task results on SWE-Bench Verified and noticed a pattern that aggregate leaderboard scores hide: many tasks failed by the top-performing model are consistently solved by other models.

For example, Claude Opus 4.5 solves the most tasks overall, but a significant number of tasks it fails are solved by other models like Sonnet or Gemini. The reverse is also true. This suggests strong task-level specialization that a single-model baseline cannot exploit.

We built a simple routing system to test this idea. Instead of training a new foundation model, we embed each problem description, assign it to a semantic cluster learned from a separate general coding dataset, and route the task to the model with the highest historical success rate in that cluster.

Using this approach, the system exceeds single-model baselines on SWE-Bench Verified (75.6% versus ~74% for the best individual model).

A few clarifications up front: we did not train on SWE-Bench problems or patches. Clusters are derived from general coding data, not from SWE-Bench. SWE-Bench is used only to estimate per-cluster model success rates. At inference time, routing uses only the problem description and historical cluster statistics, with no repo execution or test-time search.

The main takeaway is not the absolute number, but the mechanism. Leaderboard aggregates hide complementary strengths between models, and even simple routing can capture a higher performance ceiling than any single model.

botirk | 1 month ago | on: Nordlys Hypernova: 75.6% on SWE-Bench

We propose a new architecture called Mixture of Models (MoM) to solve LLM routing for coding workflows. We use an embedding-plus-clustering approach on SWE data and then evaluate LLMs on each cluster to find out which performs best.

botirk | 5 months ago

We built our infra on Azure during a hackathon. It made sense at the time, so we stuck with it.

For a while, Container Apps worked fine. Then we launched our AI model router demo, and everything changed.

In just two days, we spent over $250 on GPU compute. Two uni students, a side project, and suddenly we were paying production-level bills.

Autoscaling was slow. Cold starts were bad. Costs were unpredictable.

Then I watched a talk from one of Modal’s founders about GPU infra. We gave Modal a try.

Now we’re running the same workloads for under $100, with fast autoscaling and no lag.

Azure was stable, but Modal gave us speed, control, and real cost efficiency.

Anyone else switch from Azure (or AWS/GCP) to Modal for AI workloads? What was your experience?

botirk | 5 months ago

I hit this while building Adaptive, a proxy layer for LLM APIs.

I needed to extend the OpenAI SDK / Anthropic SDK types with some extra fields.

In most languages, this is trivial.

In Go, it meant:

→ Embedding the original struct and hoping it wouldn’t break with the next SDK release.

→ Or recreating the types entirely, just to add fields.

That feels painful for something so basic.

But here’s the twist.

I also love how Go won’t let me build endless inheritance hierarchies or clever “extension” tricks that make a codebase unreadable.

The rigidity forces simplicity.

The problem is sometimes it becomes too simple.

When I want type-specific extensions, Go makes me fight the language instead of working with it.

That’s why I both hate and love Go’s type system.

It keeps my code clean — but makes it harder to grow.

botirk | 5 months ago | on: Lessons from building an intelligent LLM router

We have been experimenting with routing inference across LLMs, and the path has been full of wrong turns.

Our first attempt was to just use a large LLM itself to decide routing. It was too costly and the decisions were unreliable.

Next we tried training a small fine-tuned LLM as a router. It was cheaper, but the outputs were poor and not trustworthy.

Then we wrote heuristics to map prompt types to model IDs. That worked for a while, but it was brittle. Every API change or workload shift broke it.

Eventually we shifted to thinking in terms of model criteria instead of hardcoded model IDs. We benchmarked models across task types, domains, and complexity levels, and made routing decisions based on those profiles.

To estimate task type and complexity, we used NVIDIA’s Prompt Task and Complexity Classifier. It classifies prompts into categories like QA, summarization, code generation, and more. It also scores prompts along six dimensions such as creativity, reasoning, domain knowledge, contextual knowledge, constraints, and few-shots. From this it produces a weighted overall complexity score.

This gave us a structured way to decide when a prompt justified a premium model like Claude Opus 4.1 and when a smaller model like GPT-5-mini would perform just as well.

Now we are working on integrating this with Google’s UniRoute (https://arxiv.org/abs/2502.08773).

botirk | 5 months ago

I built FastKey, a Redis-compatible key-value store written from scratch in C.

It started as a project to understand Redis internals.

It turned into a complete implementation with real-world features.

Key features:

→ Full RESP protocol compatibility (works with redis-cli and Redis clients)

→ Master-slave replication with PSYNC

→ Streams support (XADD, XRANGE, XREAD)

→ Transactions (MULTI / EXEC / DISCARD)

→ Thread-safe concurrent handling with read-write locks

→ RDB persistence format

→ 256 tests with 100% pass rate

The focus was on memory safety, proper cleanup, and thread safety. The code is clean C with a modular architecture so you can actually follow how things work.

This could be useful as:

→ A learning resource for anyone curious about Redis internals

→ A lightweight alternative when you need Redis compatibility without the full Redis overhead

I would love feedback on the architecture, threading model, and implementation details.

botirk | 5 months ago

Claude Code has exploded in popularity as a developer tool.

The problem is cost: running everything directly through Anthropic gets expensive fast.

We built Adaptive, a model routing platform that integrates with Claude Code as a drop-in replacement for the Claude API.

You keep the exact same Claude Code workflow, but Adaptive routes requests intelligently across models to cut costs by 60–80% while maintaining performance.

Setup takes one script install. Docs: https://docs.llmadaptive.uk/developer-tools/claude-code

botirk | 5 months ago | on: Show HN: I wrote an OS in 1000 lines of Zig

I wanted to understand what the bare minimum of an operating system looks like.

So I built one in Zig, keeping the whole thing under 1000 lines of code.

It can:

→ Boot from GRUB

→ Manage memory

→ Schedule simple tasks

→ Output text to VGA

The point was not to make it feature-rich, but to show how much is possible with just a few hundred lines if you strip everything down to the essentials.
