sysmax | 4 months ago | on: Signs of introspection in large language models
sysmax's comments
sysmax | 7 months ago | on: OpenAI's new open-source model is basically Phi-5
I am playing around with an interactive workflow where the model suggests what could be wrong with a particular chunk of code, the user selects one of the options, and the model immediately implements the fix.
Biggest problem? It's a total Wild West in terms of what the models try to suggest. Some models suggest short sentences, others spew out huge chunks at a time. GPT-OSS really likes using tables everywhere. Llama occasionally gets stuck in a loop of "memcpy() might not be what it seems and could work differently than expected", followed by a handful of similar suggestions for other well-known library functions.
I mostly got it to work with some creative prompt engineering and cross-validation, but having a model fine-tuned for giving reasonable suggestions that are easy to understand at a glance would be way better.
sysmax | 7 months ago | on: Cerebras Code
I tried copy-pasting all the relevant parts into ChatGPT with instructions like "add support for X to Y, similar to Z", and it handled them pretty well each time. The bottleneck was really pasting things into the context window and merging the changes back. So I made a GUI that automated it: it showed links on top of functions/classes to quickly attach them to the context window, either as just declarations or as editable chunks.
That worked faster, but navigating to definitions and manually clicking on top of them still felt like an unnecessary step. But if you asked the model "hey, don't follow these instructions yet, just tell me which symbols you need to complete them", it would give reasonable machine-readable results. And then it's easy to look them up at the symbol level and do the actual edit with them.
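A minimal sketch of that two-pass flow, with a hypothetical llm() placeholder standing in for the real model call and a caller-supplied symbol lookup:

```python
import json

def llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model API here.
    # For illustration, it always "asks" for two symbols.
    return '["MyClass::foo", "MyClass::bar"]'

def two_pass_edit(instructions: str, lookup_symbol) -> str:
    # Pass 1: don't follow the instructions yet, just list needed symbols.
    needed = json.loads(llm(
        "Don't follow these instructions yet. Reply with a JSON array of "
        "the symbols you need to complete them:\n" + instructions))
    # Look up each symbol's source at the symbol level, not the file level.
    context = "\n".join(lookup_symbol(name) for name in needed)
    # Pass 2: do the actual edit with only the relevant code attached.
    return llm(context + "\n" + instructions)
```

The function names and the JSON list format are assumptions for the sketch; the point is that the first pass produces a machine-readable shopping list, and only the second pass sees any code.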
It doesn't do magic, but it takes most of the effort out of getting the first draft of an edit that you can then verify, tweak, and step through in a debugger.
sysmax | 7 months ago | on: Cerebras Code
What works for me (adding features to huge interconnected projects) is to think through which classes, algorithms and interfaces I want to add, and then give very brief prompts like "split this class into an abstract base + child like this" and "add another child supporting x, y and z".
So, I still make all the key decisions myself, but I get to skip typing the most annoying and repetitive parts. Also, the code doesn't look much different from what I could have written by hand; it just gets done about 5x faster.
sysmax | 7 months ago | on: Cerebras Code
It was adopted because trying to generate diffs with AI opens a whole new can of worms, but there's a very efficient approach in between: slicing the files at the symbol level.
So if the AI only needs the declaration of foo() and the definition of bar(), the entire file can be collapsed like this:
class MyClass {
    void foo();
    void bar() {
        //code
    }
};
Any AI-suggested changes are then easy to merge back (renames are the only notable exception), so it works really fast. I am currently working on an editor that combines this approach with the ability to step back and forth between edits, and it works really well.

I absolutely love the Cerebras platform (they have a free tier directly, and a pay-as-you-go offering via OpenRouter). It can get very annoying refactorings done in one or two seconds from single-sentence prompts, and it usually costs about half a cent per refactoring in tokens. It's also great for things like applying known algorithms to spread-out data structures, where including whole files would kill the context window, but pulling in individual types works just fine with a fraction of the tokens.
If you don't mind the shameless plug, there's a more detailed explanation of how it works here: https://sysprogs.com/CodeVROOM/documentation/concepts/symbol...
sysmax | 7 months ago | on: Vibe code is legacy code
E.g. if you point at a vector class and give just "Distance()" as the prompt, it will make assumptions like "you want a function calculating the distance from (0,0)", "a function calculating the distance between 2 vectors", etc. It runs pretty fast with models like LLaMA, so you can get small routine edits done much faster than by hand.
The part I am currently experimenting with is one-click commands like "expand" or "I don't like the selected part. Give me other options for it". I think I'll get it to a mostly usable state around Monday. Feel free to shoot an email to the address on the contact page and I'll send you a link to the experimental build.
[0] https://sysprogs.com/CodeVROOM/documentation/concepts/planni...
sysmax | 7 months ago | on: Vibe code is legacy code
You can get way better results with incremental refinement: refine a brief prompt into a detailed description, the description into requirements, the requirements into specific steps, and the steps into modified code.
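The chain might look like this, with a hypothetical ask() helper standing in for the real model call (the prompts are illustrative):

```python
# Incremental refinement: each stage's output becomes the next stage's input.
# ask() is a hypothetical stand-in for a real model call.
def refine(brief_prompt: str, ask) -> str:
    description = ask("Expand this brief prompt into a detailed description:\n"
                      + brief_prompt)
    requirements = ask("Turn this description into a list of requirements:\n"
                       + description)
    steps = ask("Break these requirements into specific steps:\n"
                + requirements)
    return ask("Implement these steps as modified code:\n" + steps)
```

Each intermediate artifact is plain text, so you can review and fix it before the next stage amplifies any mistake.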
I am currently experimenting with several GUI options for this workflow. Feel free to reach out to me if you want to try it out.
sysmax | 7 months ago | on: Vibe code is legacy code
You still think through all the classes, algorithms, and complexities in your head, but then instead of writing the code by hand, you use short prompts like "encapsulate X and Y in a nested class + create a dictionary where the key is A+B".
This saves a ton of repetitive manual work, while the results are pretty indistinguishable from doing all the legwork yourself.
I am building a list of examples with exact prompts and token counts here [0]. The list is far from complete, but it gives the overall idea.
[0] https://sysprogs.com/CodeVROOM/documentation/examples/scatte...
sysmax | 8 months ago | on: AI coding tools can reduce productivity
Things like "apply this known algorithm to that project-specific data structure" work really well and save plenty of time. Things that require a gut feeling for how things are organized in memory don't work unless you are willing to babysit the model.
sysmax | 8 months ago | on: Data on AI-related Show HN posts
To me, it feels like studying a new physical phenomenon. Like when Nikola Tesla was playing around with coils and wires, eventually leading to the creation of an entire industry.
Except, with LLMs, you don't need multi-million-dollar equipment to play around with models. You can get pretty cool stuff done with a regular GPU, and even cooler stuff if you use the cloud.
I would say, if you are not spending some spare time fiddling around with LLMs, trying to get them to do some of the work you would otherwise do by hand, you are missing out.
sysmax | 8 months ago
I tried using a refactoring tool for reordering function arguments. The problem is, clicking through various GUI dialogs to get your point across is again too distracting. And there are still too many details: you can't say something like "the new argument should be zero for callers that ignore the return value". It's not deterministic, and each case is slightly different from the others. But LLMs handle this surprisingly well, and the mistakes they make are easy to spot.
What I'm really hoping to do some day is a "formal mode" where the LLM would write a mini-program to mutate the abstract syntax tree based on a textual refactoring request, thus guaranteeing determinism. But that's a whole new dimension of work, and there are numerous easier use cases to tackle before that.
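As an illustration of what such a generated mini-program could look like, here is a deterministic AST mutation using Python's ast module (the specific refactoring, renaming foo to bar everywhere including call sites, is a made-up example):

```python
import ast

# A deterministic AST mutation of the kind the LLM would generate
# in "formal mode": rename function foo to bar, including call sites.
class RenameFoo(ast.NodeTransformer):
    def visit_FunctionDef(self, node):
        if node.name == "foo":
            node.name = "bar"
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        # Covers both references and call sites like foo().
        if node.id == "foo":
            node.id = "bar"
        return node

source = "def foo():\n    return 1\n\nprint(foo())\n"
tree = RenameFoo().visit(ast.parse(source))
print(ast.unparse(tree))
```

Because the transformation runs on the tree rather than on text, applying it twice, or to a thousand files, gives the same result every time.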
sysmax | 8 months ago
With LLMs I can literally type "unsavedOnly => enum Scope{Unsaved, Saved, RecentlySaved (ignore for now)}" and that's it. It will replace the "bool unsavedOnly" argument with "Scope scope", update the check inside the method, and update the callers. If I had to do it by hand each time, I would have lazied out and added another bool argument, or some other kind of sloppy fix, snowballing the technical debt. But if LLMs can do all the legwork, you don't need sloppy fixes anymore. Keeping the code nice and clean no longer means a huge distraction that kicks you out of the zone.
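Rendered in Python for illustration (the function name and data are made up; the original is a bool-to-enum change in a typed language), the after-state of that refactoring looks like:

```python
from enum import Enum

class Scope(Enum):
    UNSAVED = 0
    SAVED = 1
    RECENTLY_SAVED = 2  # "ignore for now", per the prompt

# Before: def get_documents(unsaved_only: bool) -> list
# After: the bool argument becomes a Scope, and the check inside updates.
def get_documents(scope: Scope) -> list:
    docs = [("a.txt", Scope.UNSAVED), ("b.txt", Scope.SAVED)]
    return [name for name, s in docs if s == scope]

# Callers that passed unsaved_only=True now pass Scope.UNSAVED:
print(get_documents(Scope.UNSAVED))
```

The mechanical part, updating every caller consistently, is exactly the legwork being delegated.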
sysmax | 8 months ago
That said, when you review human work, the granularity is usually different. I've actually been heavily using AI for minor refactorings like "replace these 2 variables with a struct and update all call sites", and the reviewing flow is just different. AI makes fairly predictable mistakes, and once you get the hang of it, you can spot them before you even fully read the code: groups of 3 edits for most call sites and one call site with 4, or removed comments, or variables renamed that you didn't ask to rename. Properly collapsing the irrelevant parts makes a much bigger difference than with human-made edits.
sysmax | 8 months ago
I am actually working on a GUI for just that [0]. The first problem is solved by explicit links above functions and classes for including them in the context window (with an option to drop function bodies and keep just the declarations). The second one is solved by a special review mode that auto-collapses functions/classes that were unchanged, plus an outline window that shows how many blocks were changed in each function/class/etc.
The tool is still very early in development, with tons more functionality coming (like proper deep understanding of C/C++ code structure), but the code slicing and outline-based reviewing already work just fine. Also, it works with DeepSeek, or any other model that can, well, complete conversations.
sysmax | 8 months ago | on: Writing a basic Linux device driver when you know nothing about Linux drivers
So, the safest thing to do is to not give any details at all, or to "leak" them like another reply in this thread mentions.
sysmax | 8 months ago | on: Libxml2's "no security embargoes" policy
I used to work on a kernel debugging tool and had a particularly annoying security researcher bug me about a signed/unsigned integer check that could result in a target kernel panic on a malformed debug packet. As if you couldn't do the same by just writing random stuff at random addresses, since you are literally debugging the kernel with full memory access. Sad.
sysmax | 8 months ago | on: Define policy forbidding use of AI code generators
Because for projects like QEMU, current AI models can actually do mind-boggling stuff. You can give one a PDF describing an instruction set, and it will generate wrapper classes for emulating particular instructions. Then you can give it one such class and a few paragraphs from the datasheet, and it will spit out unit tests checking that your class works as the CPU vendor describes.
Like, you can get from 0% to 100% test coverage several orders of magnitude faster than by hand. Or refactoring, where you want to add support for a particular memory virtualization trick and need to update 100 instruction classes based on a straightforward but not 100% formal rule. A human developer would be pulling their hair out, while an LLM will do it faster than you can get a coffee.
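A sketch of the kind of wrapper class and datasheet-derived test involved; the instruction semantics (an 8-bit ADD with a carry flag) and the class shape are made up for illustration:

```python
# Hypothetical wrapper class for one emulated instruction.
class AddInstruction:
    def execute(self, regs: dict, rd: str, rs: str) -> None:
        total = regs[rd] + regs[rs]
        regs["C"] = 1 if total > 0xFF else 0  # carry out of bit 7
        regs[rd] = total & 0xFF               # 8-bit wraparound

# Generated unit test: "ADD sets the carry flag on 8-bit overflow",
# straight from the datasheet's description of the C flag.
regs = {"r0": 0xF0, "r1": 0x20, "C": 0}
AddInstruction().execute(regs, "r0", "r1")
assert regs["r0"] == 0x10 and regs["C"] == 1
```

One such class plus a few datasheet paragraphs is enough context for the model to produce dozens of tests in the same shape.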
sysmax | 8 months ago | on: Gemini CLI
But in many other niches (say, embedded), the workflow is different. You add a feature, you get weird readings. You start modelling in your head how the timing would work, doing some combination of tracing and breakpoints to narrow down your hypotheses, then try them out and figure out what works best. I can't see the CLI agents doing that kind of work; it depends too much on hunches.
Sort of like autonomous driving: most highway driving is extremely repetitive and easy to automate, so it got automated. But going on a mountain road in heavy rain, while using your judgment to back off when other drivers start doing dangerous stuff, is still purely up to humans.
sysmax | 8 months ago | on: GitHub CEO: manual coding remains key despite AI boom
Except, not all work is like that. Fast-forward to product version 2.34, where a particular customer needs a change that could break 5000 other customers because of non-trivial dependencies between different parts of the design, and you will either have humans rewrite the entire thing or watch it collapse under its own weight.
But out of 100 products launched on the market, only 1 or 2 will ever reach that stage, and having 100 LLM prototypes followed by 2 thoughtful redesigns is way better than seeing 98 human-made products die.
If you just ask the model in plain text, the actual "decision" whether it detected anything or not is made by the time it outputs the second word ("don't" vs. "notice"). The rest of the output builds up from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token probabilities at such decision points. OpenAI has the logprobs API for that; I don't know about Anthropic. If not, you can sort of proxy it by asking the model to rate on a scale from 0-9 (must be a single token!) how much it thinks it's being influenced. The score must be the first token in its output, though!
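Reading the decision off the first generated token then boils down to converting its logprobs back into probabilities. The data below is hand-written in the shape of OpenAI's chat-completions logprobs output (the values are illustrative, not from a real run):

```python
import math

# Top alternatives for the first generated token at the decision point,
# mirroring the shape of the OpenAI logprobs response (illustrative values).
first_token_logprobs = [
    {"token": "don't", "logprob": -0.22},
    {"token": "notice", "logprob": -1.61},
]

# exp(logprob) recovers the probability the model assigned to each token.
probs = {t["token"]: math.exp(t["logprob"]) for t in first_token_logprobs}
print(probs)
```

This gives you a graded signal (e.g. 80/20 vs. 51/49) instead of a binary yes/no sampled from it.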
Another interesting way to measure would be to ask it for a JSON like this:
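For instance, something along these lines (field names and values are purely illustrative):

```json
{
  "influence_score": 7,
  "influence_detected": true,
  "explanation": "short free-form text"
}
```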
Again, the rigid structure of the JSON will eliminate the interference from the language structure, and will give more consistent and measurable outputs.

It's also notable how over-amplifying the injected concept quickly overpowers the pathways trained to reproduce the natural language structure, so the model becomes totally incoherent.
I would love to fiddle with something like this in Ollama, but am not very familiar with its internals. Can anyone here give a brief pointer where I should be looking if I wanted to access the activation vector from a particular layer before it starts producing the tokens?