zhangchen
|
3 hours ago
|
on: StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)
this tracks with what i've seen too. gemini tends to 'overthink' tool calls - it'll reason about whether to use a tool instead of just using it. in my experience the models that are best at agentic tasks are the ones that commit to a tool call quickly and recover from failures, not the ones that deliberate forever and sometimes bail. would be interesting to see if the benchmark captures retry behavior, since that's where cost-effectiveness really diverges
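rough sketch of why retries dominate the comparison - the per-call prices and success rates below are made-up toy numbers, not from the benchmark:

```python
# expected spend to get one successful tool call when the agent
# retries on failure. toy numbers, purely illustrative.
def expected_cost_per_success(cost_per_call, p_success, max_retries):
    total_cost, p_alive = 0.0, 1.0
    for _ in range(max_retries + 1):
        total_cost += p_alive * cost_per_call  # we pay for every attempt we reach
        p_alive *= (1 - p_success)             # chance we still haven't succeeded
    p_any_success = 1 - p_alive
    return total_cost / p_any_success

# a cheap model that needs retries vs a pricier one that commits first try
cheap = expected_cost_per_success(cost_per_call=0.002, p_success=0.7, max_retries=3)
pricey = expected_cost_per_success(cost_per_call=0.008, p_success=0.95, max_retries=3)
```

the gap between the two shifts a lot once you account for retry probability, which single-shot benchmarks don't surface.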
zhangchen
|
14 days ago
|
on: Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training
this lines up with what pruning papers have been finding: the middle layers carry most of the reasoning weight, and you can often drop the outer ones without much loss. cool to see the inverse also works, just stacking them in for extra passes.
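the core trick is basically a list operation on the layer stack - a toy sketch, with indices and repeat count picked for illustration rather than taken from the post:

```python
# toy sketch of layer duplication: repeat a middle block of the layer
# stack so the same weights get extra forward passes, no training needed.
def duplicate_middle_layers(layers, start, end, repeats=2):
    """Return a new layer list with layers[start:end] repeated in place."""
    middle = layers[start:end]
    return layers[:start] + middle * repeats + layers[end:]

stack = [f"layer_{i}" for i in range(8)]
expanded = duplicate_middle_layers(stack, start=3, end=6, repeats=2)
```

in a real model you'd do the same thing to the transformer's module list and share the weights between the copies.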
zhangchen
|
15 days ago
|
on: Why AI systems don't learn – On autonomous learning from cognitive science
Has anyone tried implementing something like System M's meta-control switching in practice? Curious how you'd handle the reward signal for deciding when to switch between observation and active exploration without it collapsing into one mode.
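One way I could imagine avoiding the collapse is plain hysteresis on an uncertainty signal - two separate thresholds so noise near a single boundary can't pin the controller in one mode. Thresholds and the uncertainty estimate here are placeholders, not anything from the paper:

```python
# sketch: meta-control switching with hysteresis. the controller explores
# when uncertainty is high and observes when it's low, but the two
# thresholds differ so it can't thrash at a single boundary.
class ModeSwitcher:
    def __init__(self, explore_above=0.7, observe_below=0.3):
        self.explore_above = explore_above
        self.observe_below = observe_below
        self.mode = "observe"

    def step(self, uncertainty):
        if self.mode == "observe" and uncertainty > self.explore_above:
            self.mode = "explore"
        elif self.mode == "explore" and uncertainty < self.observe_below:
            self.mode = "observe"
        return self.mode
```

That still dodges the reward-signal question, though - you'd need something like information gain to set the thresholds in a principled way.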
zhangchen
|
17 days ago
|
on: Looking for Partner to Build Agent Memory (Zig/Erlang)
the Squelch primitive for mathematical forgetting is really interesting. most memory systems I've worked with treat forgetting as an afterthought, just TTL-based eviction or manual deletion. having it built into the algebra itself is a much cleaner approach for agents that need to update beliefs over time without accumulating stale context.
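to be clear this isn't the actual Squelch primitive, just a sketch of what "forgetting in the algebra" buys you: every read applies exponential decay, so stale beliefs fade with no eviction pass at all. the half-life is an arbitrary choice:

```python
import time

# sketch: forgetting as part of the read path rather than a TTL sweep.
# strength decays continuously; a rewrite resets it.
class DecayingMemory:
    def __init__(self, half_life=3600.0):
        self.half_life = half_life
        self.store = {}  # key -> (strength, last_written)

    def write(self, key, strength=1.0, now=None):
        self.store[key] = (strength, time.time() if now is None else now)

    def strength(self, key, now=None):
        if key not in self.store:
            return 0.0
        s, t = self.store[key]
        dt = (time.time() if now is None else now) - t
        return s * 0.5 ** (dt / self.half_life)
```

the nice property is that "how much do i trust this" and "when was it written" collapse into one number, which is exactly what belief updating wants.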
zhangchen
|
18 days ago
|
on: You Don't Need a Vector Database
this is way too broad. RAG works fine in the 10k-1M doc range if your chunking and retrieval pipeline are tuned properly; the failure mode is usually bad embeddings or naive chunking, not RAG itself.
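the "naive chunking" failure is usually fixed-size splits cutting sentences mid-thought; overlap is the cheap standard fix. sizes here are illustrative:

```python
# sketch: fixed-size chunking with overlap so context isn't lost at
# chunk boundaries. 200/50 are arbitrary example values.
def chunk(text, size=200, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

real pipelines go further (sentence-aware or structure-aware splitting), but even this alone closes a lot of the gap people blame on RAG.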
zhangchen
|
19 days ago
|
on: Gemini Embedding 2: natively multimodal embedding model
the steerability point is interesting. have you tried task-specific prompts for cross-modal retrieval though, like searching images with text queries? curious whether qwen's prompt-based steering actually helps there or if it mainly improves same-modality tasks. the 3072-dim space seems tight for encoding all those modalities well.
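by prompt-based steering i mean prepending a task instruction to the query text before embedding it, then scoring against the image embeddings as usual. embed() here would be whatever model is under test; the prompt wording is made up:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def steered_query(task_prompt, query):
    # steering = a task-specific instruction baked into the query text;
    # the image side is embedded without it
    return f"{task_prompt}: {query}"

q = steered_query("Retrieve images matching this description", "a red bicycle")
```

the experiment i'd want is the same text->image eval with and without the prompt prefix, to see if the steering survives crossing modalities.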
zhangchen
|
20 days ago
|
on: Are LLM merge rates not getting better?
fwiw the merge rate metric itself might be misleading. most real codebases have implicit conventions and architectural patterns that aren't captured in the issue description, so even if the model writes correct code it might not match what the maintainer actually wanted. imo the bigger signal is how much back-and-forth it takes before merging, not whether the first attempt lands cleanly.
zhangchen
|
21 days ago
|
on: Many SWE-bench-Passing PRs would not be merged
Yeah, this matches what we've seen too. The biggest gains we got weren't from switching models; they came from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this - passing tests is the easy part, fitting into an existing codebase's patterns and conventions is where it actually gets hard.
zhangchen
|
22 days ago
|
on: Agents that run while I sleep
certainty scoring sounds useful but fwiw the harder problem is temporal - a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.
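the minimal version of "causal ordering on writes" is just a monotonic version stamp so reads prefer the newest belief and stale remote writes get rejected. single-writer sketch; multiple writers would need vector clocks:

```python
# sketch: a fact store where every write is stamped with a Lamport-style
# counter, so "which version do i trust" has a deterministic answer.
class VersionedFacts:
    def __init__(self):
        self.clock = 0
        self.facts = {}  # key -> (version, value)

    def write(self, key, value):
        self.clock += 1
        self.facts[key] = (self.clock, value)

    def read(self, key):
        version, value = self.facts[key]
        return value

    def merge(self, key, version, value):
        # accept a write from elsewhere only if it's causally newer
        if key not in self.facts or version > self.facts[key][0]:
            self.facts[key] = (version, value)
```

certainty scoring could then sit on top: confidence decays with version age instead of wall-clock time.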
zhangchen
|
22 days ago
|
on: Redox OS has adopted a Certificate of Origin policy and a strict no-LLM policy
that's already happening tbh. the real issue isn't hypocrisy though, it's that maintainers reviewing their own LLM output have full context on what they asked for and can verify it against their mental model of the codebase. a random contributor's LLM output is basically unverifiable, you don't know what prompt produced it or whether the person even understood the code they're submitting.
zhangchen
|
22 days ago
|
on: Ask HN: How are you monitoring AI agents in production?
Langfuse + custom OTEL spans has been the most practical combo for us. The key insight was treating each agent step as a trace span with token counts and latency, then setting alerts on cost-per-task rather than raw token volume.
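The aggregation behind the alert is simple once spans carry token counts - this is just the math, with made-up per-token prices (not any provider's real pricing):

```python
# sketch: alert on cost-per-task rather than raw token volume.
# each (input_tokens, output_tokens) pair stands in for one agent-step span.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # illustrative pricing

def step_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K["output"]

def cost_per_task(steps):
    """steps: list of (input_tokens, output_tokens) spans for one task."""
    return sum(step_cost(i, o) for i, o in steps)

def over_budget(steps, budget=0.25):
    # the alert condition: one task's total spend, not token throughput
    return cost_per_task(steps) > budget
```

The reason this beats raw-token alerts: a task that loops through many cheap retries and a task with one huge context blowout look identical in token volume but very different in spend per outcome.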
zhangchen
|
22 days ago
|
on: Show HN: Run 500B+ Parameter LLMs Locally on a Mac Mini
The mmap layer streaming approach is smart for working around memory limits. In practice though, 1.58-bit ternary quantization tends to degrade quality noticeably on reasoning-heavy tasks compared to 4-bit — curious if you've measured perplexity deltas at the 140B scale.
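For reference, the delta I'd want is just exp(mean token NLL) on the same eval set under both quantizations - the NLL values below are placeholders, not measurements:

```python
import math

# sketch: perplexity from mean negative log-likelihood, for comparing
# quantization levels on an identical eval set. numbers are hypothetical.
def perplexity(mean_nll):
    return math.exp(mean_nll)

ppl_4bit = perplexity(2.10)     # hypothetical 4-bit run
ppl_ternary = perplexity(2.35)  # hypothetical 1.58-bit run
delta = ppl_ternary - ppl_4bit
```

Perplexity alone undersells reasoning degradation, though, so a task-level eval alongside it would be more convincing at 140B.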