zhangchen's comments

zhangchen | 3 hours ago | on: StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

this tracks with what i've seen too. gemini tends to 'overthink' tool calls - it'll reason about whether to use a tool instead of just using it. in my experience the models that are best at agentic tasks are the ones that commit to a tool call quickly and recover from failures, not the ones that deliberate forever and sometimes bail. would be interesting to see if the benchmark captures retry behavior, since that's where cost-effectiveness really diverges
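to make the retry point concrete, here's a back-of-envelope sketch (all prices and success rates made up, nothing from the benchmark) of how counting retries changes a cost comparison - the thing you actually want is cost per *solved* task, not cost per call:

```python
# Hypothetical sketch: a model that's cheap per call but retries often
# can end up costlier per solved task than a pricier model that commits
# early and succeeds. Numbers below are invented for illustration.

def cost_per_solved_task(price_per_call, success_rate, max_retries):
    """Expected spend to drive one task to success, retrying on failure."""
    expected_calls = 0.0
    p_still_failing = 1.0
    for _ in range(max_retries):
        expected_calls += p_still_failing  # we pay for this attempt
        p_still_failing *= (1.0 - success_rate)
    p_solved = 1.0 - p_still_failing
    return price_per_call * expected_calls / p_solved

# "cheap but flaky" vs "pricier but decisive" (made-up numbers)
flaky = cost_per_solved_task(price_per_call=0.01, success_rate=0.3, max_retries=5)
decisive = cost_per_solved_task(price_per_call=0.02, success_rate=0.9, max_retries=5)
```

with those invented numbers the flaky model works out to about $0.033 per solved task versus about $0.022 for the decisive one, even though it's half the price per call. that's the divergence a per-call leaderboard hides.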

zhangchen | 17 days ago | on: Looking for Partner to Build Agent Memory (Zig/Erlang)

the Squelch primitive for mathematical forgetting is really interesting. most memory systems I've worked with treat forgetting as an afterthought, just TTL-based eviction or manual deletion. having it built into the algebra itself is a much cleaner approach for agents that need to update beliefs over time without accumulating stale context.
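to illustrate the contrast i mean (this is just my own toy sketch, not Squelch's actual math, which i haven't seen): TTL eviction is a hard cutoff, while a decay weight makes forgetting part of the retrieval scoring itself, so a stale fact fades gradually unless it's reinforced:

```python
# Toy contrast: hard TTL eviction vs. decay-weighted forgetting.
# Not based on Squelch; purely illustrative.

def ttl_alive(age_seconds, ttl_seconds):
    # TTL-style: a memory is either fully present or gone.
    return age_seconds < ttl_seconds

def decay_weight(age_seconds, half_life_seconds):
    # Decay-style: relevance halves every half-life; reinforcement
    # (re-writing the fact) would reset its age.
    return 0.5 ** (age_seconds / half_life_seconds)
```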

zhangchen | 18 days ago | on: You Don't Need a Vector Database

this is way too broad. RAG works fine in the 10k-1M doc range if your chunking and retrieval pipeline are tuned properly - the failure mode is usually bad embeddings or naive chunking, not RAG itself.
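by naive chunking i mean fixed-size splits with no overlap, which cut sentences in half at chunk boundaries. even the dumbest overlap window avoids that (sketch only - real pipelines should split on document structure, not raw character offsets):

```python
# Minimal overlapping character-window chunker, for illustration only.
# The overlap means content near a boundary appears whole in at least
# one chunk instead of being split across two.

def chunk(text, size=200, overlap=50):
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```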

zhangchen | 19 days ago | on: Gemini Embedding 2: natively multimodal embedding model

the steerability point is interesting. have you tried using task-specific prompts for cross-modal retrieval though? like searching images with text queries. curious whether qwen's prompt-based steering actually helps there or if it mainly improves same-modality tasks. the 3072-dim space seems tight for encoding all those modalities well.
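what i mean by task-specific prompts, concretely: prepend an instruction to the query side only and leave the document/image side instruction-free, which is the usual pattern for instruction-tuned embedding models. `steer_query` and the prefix format here are my own stand-ins, not any particular model's API:

```python
# Hypothetical query-side steering for cross-modal retrieval.
# Only the query gets the instruction; the indexed side stays bare.

def steer_query(task_instruction, query):
    # instruction-tuned embedding models typically condition on a prefix
    return f"Instruct: {task_instruction}\nQuery: {query}"

q = steer_query(
    "Given a text query, retrieve images whose content matches it",
    "a red bicycle leaning against a brick wall",
)
# q would then be passed to the embedding call in place of the raw query
```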

zhangchen | 20 days ago | on: Are LLM merge rates not getting better?

fwiw the merge rate metric itself might be misleading. most real codebases have implicit conventions and architectural patterns that aren't captured in the issue description, so even if the model writes correct code it might not match what the maintainer actually wanted. imo the bigger signal is how much back-and-forth it takes before merging, not whether the first attempt lands cleanly.

zhangchen | 21 days ago | on: Many SWE-bench-Passing PRs would not be merged

Yeah this matches what we've seen too. The biggest gains we got weren't from switching models; they came from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this - passing tests is the easy part; fitting into an existing codebase's patterns and conventions is where it actually gets hard.

zhangchen | 22 days ago | on: Agents that run while I sleep

certainty scoring sounds useful but fwiw the harder problem is temporal - a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.
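the minimal version of what i mean: tag every write with a logical timestamp so the agent can at least tell which assertion is newer, instead of treating all stored facts as equally current. toy sketch, not a full causal-ordering scheme (a real multi-writer system would need vector clocks or similar):

```python
# Toy fact store with a Lamport-style logical clock per write.
# Latest timestamp wins on read, so yesterday's fact can't silently
# shadow today's.

class FactStore:
    def __init__(self):
        self.clock = 0
        self.facts = {}  # key -> (value, logical timestamp)

    def write(self, key, value):
        self.clock += 1
        self.facts[key] = (value, self.clock)

    def read(self, key):
        return self.facts[key]  # (value, timestamp)

store = FactStore()
store.write("office", "Building A")
store.write("office", "Building B")  # the fact changed since yesterday
```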

zhangchen | 22 days ago | on: Redox OS has adopted a Certificate of Origin policy and a strict no-LLM policy

that's already happening tbh. the real issue isn't hypocrisy though, it's that maintainers reviewing their own LLM output have full context on what they asked for and can verify it against their mental model of the codebase. a random contributor's LLM output is basically unverifiable: you don't know what prompt produced it or whether the person even understood the code they're submitting.

zhangchen | 22 days ago | on: Show HN: Run 500B+ Parameter LLMs Locally on a Mac Mini

The mmap layer streaming approach is smart for working around memory limits. In practice though, 1.58-bit ternary quantization tends to degrade quality noticeably on reasoning-heavy tasks compared to 4-bit - curious if you've measured perplexity deltas at the 140B scale.
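Back-of-envelope illustration of why ternary bites harder than 4-bit: every weight collapses to one of {-1, 0, +1} times a single shared scale, so fine-grained magnitudes are simply gone. This is a simplified absmean-style scheme for illustration, not any specific implementation:

```python
# Simplified ternary (1.58-bit-style) quantization: scale by the mean
# absolute weight, round, clamp to {-1, 0, 1}. Illustrative only.

def ternary_quantize(weights):
    scale = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    return [t * scale for t in quantized]

w = [0.8, -0.05, 0.3, -0.9]
q, s = ternary_quantize(w)
# 0.8 and 0.3 both land on +1 * scale: their magnitude difference is lost,
# which is exactly the kind of detail reasoning-heavy layers seem to miss.
```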