sysmax | 4 months ago | on: Signs of introspection in large language models
sysmax's comments
sysmax | 7 months ago | on: OpenAI's new open-source model is basically Phi-5
I am playing around with an interactive workflow where the model suggests what could be wrong with a particular chunk of code, the user selects one of the options, and the model immediately implements the fix.
Biggest problem? It's a total Wild West in terms of what the models try to suggest. Some models suggest short sentences, others spew out huge chunks at a time. GPT-OSS really likes using tables everywhere. Llama occasionally gets stuck in a loop of "memcpy() might not be what it seems and could work differently than expected", followed by a handful of similar suggestions for other well-known library functions.
I mostly got it to work with some creative prompt engineering and cross-validation, but having a model fine-tuned for giving reasonable suggestions that are easy to understand at a glance would be way better.
sysmax | 7 months ago | on: Cerebras Code
I tried copy-pasting all the relevant parts into ChatGPT with instructions like "add support for X to Y, similar to Z", and it handled them pretty well each time. The bottleneck was really pasting things into the context window and merging the changes back. So I made a GUI that automated it: it showed links on top of functions/classes to quickly attach them to the context window, either as just declarations or as editable chunks.
That worked faster, but navigating to definitions and manually clicking on top of them still felt like an unnecessary step. But if you asked the model "hey, don't follow these instructions yet, just tell me which symbols you need to complete them", it would give reasonable machine-readable results. And then it's easy to look them up at the symbol level and do the actual edit with them.
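A minimal sketch of that two-pass flow, with a hypothetical llm() placeholder standing in for the real model call and a caller-supplied symbol lookup:

```python
import json

def llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model API here.
    # For illustration, it always "asks" for two symbols.
    return '["MyClass::foo", "MyClass::bar"]'

def two_pass_edit(instructions: str, lookup_symbol) -> str:
    # Pass 1: don't follow the instructions yet, just list needed symbols.
    needed = json.loads(llm(
        "Don't follow these instructions yet. Reply with a JSON array of "
        "the symbols you need to complete them:\n" + instructions))
    # Look up each symbol's source at the symbol level, not the file level.
    context = "\n".join(lookup_symbol(name) for name in needed)
    # Pass 2: do the actual edit with only the relevant code attached.
    return llm(context + "\n" + instructions)
```

The function names and the JSON list format are assumptions for the sketch; the point is that the first pass produces a machine-readable shopping list, and only the second pass sees any code.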
It doesn't do magic, but it takes most of the effort out of getting the first draft of an edit that you can then verify, tweak, and step through in a debugger.
sysmax | 7 months ago | on: Cerebras Code
What works for me (adding features to huge interconnected projects) is to think through which classes, algorithms and interfaces I want to add, and then give very brief prompts like "split this class into an abstract base + child like this" and "add another child supporting x, y and z".
So, I still make all the key decisions myself, but I get to skip typing the most annoying and repetitive parts. Also, the code doesn't look much different from what I could have written by hand; it just gets done about 5x faster.
sysmax | 7 months ago | on: Cerebras Code
It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slicing the files at the symbol level.
So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:
class MyClass {
    void foo();
    void bar() {
        //code
    }
};
Any AI-suggested changes are then easy to merge back (renames are the only notable exception), so it works really fast. I am currently working on an editor that combines this approach with the ability to step back and forth between edits, and it works really well.

I absolutely love the Cerebras platform (they have a free tier directly, and a pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds from single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. It's also great for things like applying known algorithms to spread-out data structures, where including whole files would kill the context window, but pulling in individual types works just fine with a fraction of the tokens.
If you don't mind the shameless plug, there's a more detailed explanation of how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...
sysmax | 7 months ago | on: Vibe code is legacy code
E.g. if you point at a vector class and give just "Distance()" as the prompt, it will make assumptions like "you want a function calculating the distance from (0,0)", "a function calculating the distance between 2 vectors", etc. It runs pretty fast with models like LLaMA, so you can get small routine edits done much faster than by hand.
The part I am currently experimenting with is one-click commands like "expand" or "I don't like the selected part. Give me other options for it". I think I'll get it to a mostly usable state around Monday. Feel free to shoot an email to the address on the contact page and I'll send you a link to the experimental build.
[0] https://sysprogs.com/CodeVROOM/documentation/concepts/planni...
sysmax | 7 months ago | on: Vibe code is legacy code
You can get way better results with incremental refinement: refine a brief prompt into a detailed description, the description into requirements, the requirements into specific steps, and the steps into modified code.
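The chain might look like this, with a hypothetical ask() helper standing in for the real model call (the prompts are illustrative):

```python
# Incremental refinement: each stage's output becomes the next stage's input.
# ask() is a hypothetical stand-in for a real model call.
def refine(brief_prompt: str, ask) -> str:
    description = ask("Expand this brief prompt into a detailed description:\n"
                      + brief_prompt)
    requirements = ask("Turn this description into a list of requirements:\n"
                       + description)
    steps = ask("Break these requirements into specific steps:\n"
                + requirements)
    return ask("Implement these steps as modified code:\n" + steps)
```

Each intermediate artifact is plain text, so you can review and fix it before the next stage amplifies any mistake.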
I am currently experimenting with several GUI options for this workflow. Feel free to reach out to me if you want to try it out.
sysmax | 7 months ago | on: Vibe code is legacy code
You still think through all the classes, algorithms, and complexities in your head, but then instead of writing the code by hand, you use short prompts like "encapsulate X and Y in a nested class + create a dictionary where the key is A+B".
This saves a ton of repetitive manual work, while the results are pretty indistinguishable from doing all the legwork yourself.
I am building a list of examples with exact prompts and token counts here [0]. The list is far from complete, but it gives the overall idea.
[0] https://sysprogs.com/CodeVROOM/documentation/examples/scatte...
sysmax | 8 months ago | on: AI coding tools can reduce productivity
Things like "apply this known algorithm to that project-specific data structure" work really well and save plenty of time. Things that require a gut feeling for how things are organized in memory don't work unless you are willing to babysit the model.
sysmax | 8 months ago | on: Data on AI-related Show HN posts
To me, it feels like studying a new physical phenomenon. Like when Nikola Tesla was playing around with coils and wires, eventually leading to the creation of an entire industry.
Except, with LLMs, you don't need multi-million-dollar equipment to play around with models. You can get pretty cool stuff done with a regular GPU, and even cooler stuff if you use the cloud.
I would say, if you are not spending some spare time fiddling around with LLMs, trying to get them to do some of the work you would otherwise do by hand, you are missing out.
sysmax | 8 months ago
I tried using a refactoring tool for reordering function arguments. The problem is, clicking through various GUI dialogs to get your point across is again too distracting. And there are still too many details: you can't say something like "the new argument should be zero for callers that ignore the return value". It's not deterministic, and each case is slightly different from the others. But LLMs handle this surprisingly well, and the mistakes they make are easy to spot.
What I'm really hoping to do some day is a "formal mode" where the LLM would write a mini-program to mutate the abstract syntax tree based on a textual refactoring request, thus guaranteeing determinism. But that's a whole new dimension of work, and there are numerous easier use cases to tackle before that.
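As an illustration of what such a generated mini-program could look like, here is a deterministic AST mutation using Python's ast module (the specific refactoring, renaming foo to bar everywhere including call sites, is a made-up example):

```python
import ast

# A deterministic AST mutation of the kind the LLM would generate
# in "formal mode": rename function foo to bar, including call sites.
class RenameFoo(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        if node.name == "foo":
            node.name = "bar"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        # Covers both references and call sites like foo().
        if node.id == "foo":
            node.id = "bar"
        return node

source = "def foo():\n    return 1\n\nprint(foo())\n"
tree = RenameFoo().visit(ast.parse(source))
print(ast.unparse(tree))
```

Because the transformation runs on the tree rather than on text, applying it twice, or to a thousand files, gives the same result every time.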
sysmax | 8 months ago
With LLMs I can literally type "unsavedOnly => enum Scope{Unsaved, Saved, RecentlySaved (ignore for now)}" and that's it. It will replace the "bool unsavedOnly" argument with "Scope scope", update the check inside the method, and update the callers. If I had to do it by hand each time, I would have lazied out and added another bool argument, or some other kind of sloppy fix, snowballing the technical debt. But if LLMs can do all the legwork, you don't need sloppy fixes anymore. Keeping the code nice and clean no longer means a huge distraction that kicks you out of the zone.
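Rendered in Python for illustration (the function name and data are made up; the original is a bool-to-enum change in a typed language), the after-state of that refactoring looks like:

```python
from enum import Enum

class Scope(Enum):
    UNSAVED = 0
    SAVED = 1
    RECENTLY_SAVED = 2  # "ignore for now", per the prompt

# Before: def get_documents(unsaved_only: bool) -> list
# After: the bool argument becomes a Scope, and the check inside updates.
def get_documents(scope: Scope) -> list:
    docs = [("a.txt", Scope.UNSAVED), ("b.txt", Scope.SAVED)]
    return [name for name, s in docs if s == scope]

# Callers that passed unsaved_only=True now pass Scope.UNSAVED:
print(get_documents(Scope.UNSAVED))
```

The mechanical part, updating every caller consistently, is exactly the legwork being delegated.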
sysmax | 8 months ago
That said, when you review human work, the granularity is usually different. I've actually been heavily using AI for minor refactorings like "replace these 2 variables with a struct and update all call sites", and the reviewing flow is just different. AI makes fairly predictable mistakes, and once you get the hang of it, you can spot them before you even fully read the code: groups of 3 edits for most call sites and one call site with 4, or removed comments, or variables renamed that you didn't ask to rename. Properly collapsing the irrelevant parts makes a much bigger difference than with human-made edits.
sysmax | 8 months ago
I am actually working on a GUI for just that [0]. The first problem is solved by explicit links above functions and classes for including them in the context window (with an option to drop function bodies and keep just the declarations). The second one is solved by a special review mode that auto-collapses functions/classes that were unchanged, plus an outline window that shows how many blocks were changed in each function/class/etc.
The tool is still very early in development, with tons more functionality coming (like proper deep understanding of C/C++ code structure), but the code slicing and outline-based reviewing already work just fine. Also, it works with DeepSeek, or any other model that can, well, complete conversations.
sysmax | 8 months ago | on: Writing a basic Linux device driver when you know nothing about Linux drivers
So, the safest thing to do is to not give any details at all, or to "leak" them like another reply in this thread mentions.
sysmax | 8 months ago | on: Libxml2's "no security embargoes" policy
I used to work on a kernel debugging tool and had a particularly annoying security researcher bug me about a signed/unsigned integer check that could result in a target kernel panic on a malformed debug packet. As if you couldn't do the same by just writing random stuff at random addresses, since you are literally debugging the kernel with full memory access. Sad.
sysmax | 8 months ago | on: Define policy forbidding use of AI code generators
Because for projects like QEMU, current AI models can actually do mind-boggling stuff. You can give one a PDF describing an instruction set, and it will generate wrapper classes for emulating particular instructions. Then you can give it one such class and a few paragraphs from the datasheet, and it will spit out unit tests checking that your class works as the CPU vendor describes.
Like, you can get from 0% to 100% test coverage several orders of magnitude faster than by hand. Or refactoring, where you want to add support for a particular memory virtualization trick and need to update 100 instruction classes based on a straightforward but not 100% formal rule. A human developer would be pulling their hair out, while an LLM will do it faster than you can get a coffee.
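A sketch of the kind of wrapper class and datasheet-derived test involved; the instruction semantics (an 8-bit ADD with a carry flag) and the class shape are made up for illustration:

```python
# Hypothetical wrapper class for one emulated instruction.
class AddInstruction:
    def execute(self, regs: dict, rd: str, rs: str) -> None:
        total = regs[rd] + regs[rs]
        regs["C"] = 1 if total > 0xFF else 0  # carry out of bit 7
        regs[rd] = total & 0xFF               # 8-bit wraparound

# Generated unit test: "ADD sets the carry flag on 8-bit overflow",
# straight from the datasheet's description of the C flag.
regs = {"r0": 0xF0, "r1": 0x20, "C": 0}
AddInstruction().execute(regs, "r0", "r1")
assert regs["r0"] == 0x10 and regs["C"] == 1
```

One such class plus a few datasheet paragraphs is enough context for the model to produce dozens of tests in the same shape.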
sysmax | 8 months ago | on: Gemini CLI
But in many other niches (say, embedded), the workflow is different. You add a feature, you get weird readings. You start modelling in your head how the timing would work, doing some combination of tracing and breakpoints to narrow down your hypotheses, then try them out and figure out what works best. I can't see the CLI agents doing that kind of work; it depends too much on hunches.
Sort of like autonomous driving: most highway driving is extremely repetitive and easy to automate, so it got automated. But going on a mountain road in heavy rain, while using your judgment to back off when other drivers start doing dangerous stuff, is still purely up to humans.
sysmax | 8 months ago | on: GitHub CEO: manual coding remains key despite AI boom
Except, not all work is like that. Fast-forward to product version 2.34, where a particular customer needs a change that could break 5000 other customers because of non-trivial dependencies between different parts of the design, and you will either have humans rewrite the entire thing or watch it collapse under its own weight.
But out of 100 products launched on the market, only 1 or 2 will ever reach that stage, and having 100 LLM prototypes followed by 2 thoughtful redesigns is way better than seeing 98 human-made products die.
If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprobs API for that; I don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's being influenced. The score must be the first token in its output, though!
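Reading the decision off the first generated token then boils down to converting its logprobs back into probabilities. The data below is hand-written in the shape of OpenAI's chat-completions logprobs output (the values are illustrative, not from a real run):

```python
import math

# Top alternatives for the first generated token at the decision point,
# mirroring the shape of the OpenAI logprobs response (illustrative values).
first_token_logprobs = [
    {"token": "don't", "logprob": -0.22},
    {"token": "notice", "logprob": -1.61},
]

# exp(logprob) recovers the probability the model assigned to each token.
probs = {t["token"]: math.exp(t["logprob"]) for t in first_token_logprobs}
print(probs)
```

This gives you a graded signal (e.g. 80/20 vs. 51/49) instead of a binary yes/no sampled from it.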
Another interesting way to measure would be to ask it for a JSON like this:
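For instance, something along these lines (field names and values are purely illustrative):

```json
{
  "influence_score": 7,
  "influence_detected": true,
  "explanation": "short free-form text"
}
```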
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.
I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?