top | item 46924426

Software factories and the agentic moment

304 points | mellosouls | 1 month ago | factory.strongdm.ai

See also https://simonwillison.net/2026/Feb/7/software-factory/

459 comments

[+] noosphr|1 month ago|reply
I was looking for some code, or a product they made, or anything really on their site.

The only github I could find is: https://github.com/strongdm/attractor

    Building Attractor

    Supply the following prompt to a modern coding agent
    (Claude Code, Codex, OpenCode, Amp, Cursor, etc):
  
    codeagent> Implement Attractor as described by
    https://factory.strongdm.ai/

Canadian girlfriend coding is now a business model.

Edit:

I did find some code. Commit history has been squashed unfortunately: https://github.com/strongdm/cxdb

There's a bunch more under the same org but it's years old.

[+] ares623|1 month ago|reply
I was about to say the same thing! Yet another blog post with heaps of navel gazing and zero to actually show for it.

The worst part is they got simonw (perhaps unwittingly, perhaps via social engineering) to vouch and stealth market for them.

And $1000/day/engineer in token costs at current market rates? It's a bold strategy, Cotton.

But we all know what they're going for here. They want to make themselves look amazing to convince the boards of the Great Houses to acquire them. Because why else would investors invest in them and not in the Great Houses directly.

[+] jessmartin|1 month ago|reply
They have a Products page where they list a database and an identity system in addition to attractors: https://factory.strongdm.ai/products

For those of us working on building factories, this is pretty obvious: you immediately need shared context across agents / sessions and an improved ID + permissions system to keep track of who is doing what.

[+] yomismoaqui|1 month ago|reply
I don't know if that is crazy or a glimpse of the future (could be both).

PS: TIL about "Canadian girlfriend", thanks!

[+] ebhn|1 month ago|reply
That's hilarious
[+] touristtam|1 month ago|reply
So paste that into a 'chat' and hope the link doesn't blow up in your face?
[+] itissid|1 month ago|reply
So I am on a webcast with people working on this. They are from https://docs.boundaryml.com/guide/introduction/what-is-baml and humanlayer.dev, mostly talking about spec driven development. Smart people. Here is what I understood from them about spec driven development, which is not far from this AFAIU.

Let's start with the `/research -> /plan -> /implement` (RPI) loop. When you are building a complex system for teams you _need_ humans in the loop, and you want them focused on design decisions. Having structured workflows around agents provides a better UX for those humans to make those design decisions. This is necessary for controlling drift, pollution of context, and general mayhem in the code base. _This_ is the starting thesis of spec driven development.

How many times have you, working as a newbie, copied a slash command, pressed /research then /plan then /implement, only to find after several iterations that the result is inconsistent, and gone back to fix it? Many people still go back and forth with ChatGPT, copying their Jira docs in and out and answering people's questions on PRD documents. This is _not_ a defence; it is the user experience when working with AI for many.

One very understandable path to solve this is to _surface_ to humans structured information extracted from your plan docs for example:

https://gist.github.com/itissid/cb0a68b3df72f2d46746f3ba2ee7...

In this very toy version of spec driven development, the idea is that each step in the RPI loop is broken down and made very deterministic, with humans in the loop. This is a system designed by humans (Chief AI Officer, no kidding) for teams that follow a fairly _customized_ process for working fast with AI, without it turning into a giant pile of slop. And the whole point of reading code or QA is this: you stop the clock on development and take a beat to look at the high-signal information. Testers want to read tests and QAers want to test behavior, because, well written, they can tell a lot about whether the software works. If you have ever written an integration test on brownfield code with poor test coverage, and made it dependable after several days in the dark, you know what it feels like... Taking that step out is what all the VCs say is the last game in town... the final game in town.
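The human-gated RPI loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: a real system would call an actual coding agent and surface artifacts in a proper review UI.

```python
# Minimal sketch of a /research -> /plan -> /implement (RPI) loop with a
# human approval gate between stages. `agent_call` and `human_gate` are
# hypothetical placeholders, not any real agent API.

def agent_call(stage: str, context: str) -> str:
    # A real system would send `context` to a coding agent here.
    return f"[{stage} output derived from: {context[:40]}]"

def human_gate(stage: str, artifact: str) -> bool:
    # A real system would surface `artifact` to a human for approval.
    print(f"--- review {stage} ---\n{artifact}")
    return True  # assume approval in this sketch

def rpi(task: str) -> str:
    context = task
    for stage in ("research", "plan", "implement"):
        artifact = agent_call(stage, context)
        if not human_gate(stage, artifact):
            raise RuntimeError(f"human rejected {stage} output")
        context = artifact  # the next stage builds only on approved output
    return context

result = rpi("add rate limiting to the API gateway")
```

The point of the gates is that each stage's output becomes the next stage's only context, so drift is caught before it compounds.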

This StrongDM stuff is a step beyond what I can understand: "no humans should write code", "no humans should read code", really..? But the thing that puzzles me even more is this: spec driven development as I understand it is, to use borrowed words, like raising a kid. Once you are a parent you want to raise your own kid, not someone else's, because it is just such a human-in-the-loop process. Every company, tech or not, wants to make its own process that its engineers like to work with. So I am not sure they even have a product here...

[+] codingdave|1 month ago|reply
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement

At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems to be a breakdown of the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.

[+] dixie_land|1 month ago|reply
This is an interesting point but if I may offer a different perspective:

Assuming 20 working days a month, that's $20k x 12 = $240k a year. So about a fresh grad's TC at FAANG.

Now, I've worked with many junior to mid-level SDEs and sadly 80% do not do a better job than Claude. (I've also worked with staff-level SDEs who write worse code than AI, but they usually offset that with domain knowledge and TL responsibilities.)

I do see AI transforming software engineering into even more of a pyramid, with very few humans on top.

[+] TheFellow|1 month ago|reply
It feels like folks are too focused on the number, and less on the implication. Pick any [$ amount] per [unit time] and you'd have the same discussion. What I think this really means is that if you're not burning tokens at [rate], then you should ask yourself what else you could be doing to maximize the efficacy of the tokens you already burned.

Was the prior output any good? Good question. You can burn tokens on a code review, or you could burn tokens building a QA system that itself burns tokens. What is the output of the QA system? Feedback to improve the state/quality of the original output. Then, moar tokens burn taking in that feedback and (hopefully) improving the original output. And now there is a QA system ready to review again, move the goalpost further, and of course burn more tokens.

The point being: you have tokens to burn. Use those tokens to build systems that will use tokens to further your goal. Make the leap from "I burned N tokens getting feedback on my code" to "I burned N + M tokens to build a system that improves itself" and get yourself out of the loop entirely.
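The generate/review/revise loop described above reduces to a small control structure. This is a toy sketch: the `build` and `review` callables stand in for what would really be token-burning LLM calls.

```python
# Sketch of the "burn tokens to build a system that burns tokens" loop:
# a builder produces output, a QA system produces feedback, the builder
# revises, and the QA system reviews again. `build` and `review` below
# are toy stand-ins for LLM calls.

def run_loop(build, review, task, max_rounds=3):
    output = build(task, feedback=None)
    for _ in range(max_rounds):
        feedback = review(output)
        if feedback is None:        # QA system is satisfied
            break
        output = build(task, feedback=feedback)
    return output

# Toy stand-ins: the reviewer demands one fix, then approves.
def build(task, feedback):
    return task + (" +fix" if feedback else "")

def review(output):
    return None if "+fix" in output else "needs a fix"

print(run_loop(build, review, "feature X"))  # feature X +fix
```

`max_rounds` is the budget knob: the loop stops either when the QA system approves or when you decide you've burned enough.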
[+] dewey|1 month ago|reply
It would depend on the speed of execution: if you can do the same amount of work in 5 days spending $5k, versus a human spending a month and $5k, the math makes more sense.
[+] kaffekaka|1 month ago|reply
If the output is (dis)proportionally larger, the cost trade off might be the right thing to do.

And it might be the tokens will become cheaper.

[+] philipp-gayret|1 month ago|reply
$1,000 is maybe $5 per workday. I measure my own usage and am on the way to $6,000 for a full year. I'm still at the stage where I like to look at the code I produce, but I do believe we'll head to a state of software development where one day we won't need to.
[+] CuriouslyC|1 month ago|reply
Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.

There are higher and lower leverage ways to do that, for instance reviewing tests and QA'ing software via use vs reading original code, but you can't get away from doing it entirely.

[+] kaicianflone|1 month ago|reply
I agree with this almost completely. The hard part isn't generation anymore, it's validation of intent vs outcome, especially once decisions are high-stakes or irreversible (think package updates or large-scale transactions).

What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.

Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.

Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.

[+] bluesnowmonkey|1 month ago|reply
But is that different from how we already work with humans? Typically we don't let people commit whatever code they want just because they're human. It's more than just code reviews: we have design reviews, sometimes people pair program, there are unit tests and end-to-end tests and all kinds of tests, then code review, continuous integration, QA. We have systems to watch prod for errors, user complaints, or cost/performance problems. We have this whole toolkit of process and techniques to try to get reliable programs out of what you must admit are unreliable programmers.

The question isn't whether agentic coders are perfect. Actually it isn't even whether they're better than humans. It's whether they're a net positive contribution. If you turn them loose in that kind of system, surrounded by checks and balances, does the system tend to accumulate bugs or remove them? Does it converge on high or low quality?

I think the answer as of Opus 4.5 or so is that they're a slight net positive and it converges on quality. You can set up the system and kind of supervise from a distance and they keep things under control. They tend to do the right thing. I think that's what they're saying in this article.

[+] varispeed|1 month ago|reply
AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. Worst thing is that if you let it, it will just grow tech debt on top of tech debt.
[+] cronin101|1 month ago|reply
This obviously depends on what you are trying to achieve but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I have suspicions we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).

And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
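Short of a real proof assistant, even an exhaustive property check over a small domain illustrates what "validate against a spec" means. This is a toy example, not formal verification proper, which would prove the property for all inputs.

```python
# Toy "check the implementation against the spec" example: the spec is
# a property, checked exhaustively over a small input domain.
from itertools import product

def spec_sorted(xs, ys):
    # Spec: output is the sorted permutation of the input,
    # arranged in non-decreasing order.
    return sorted(xs) == ys and all(a <= b for a, b in zip(ys, ys[1:]))

def my_sort(xs):  # implementation under check
    return sorted(xs)

domain = [list(t) for t in product(range(3), repeat=3)]  # all 27 inputs
assert all(spec_sorted(xs, my_sort(xs)) for xs in domain)
print("property holds on the whole domain")
```

Languages like TLA+, Lean, or Dafny take this further by making the spec machine-checkable for unbounded inputs, which is exactly the "not very fun to write" part that token generation could absorb.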

[+] dimitri-vs|1 month ago|reply
It's simple: you just offload the validation and security testing to the end user.
[+] stitched2gethr|1 month ago|reply
This is what we're working on at Speedscale. Our methods use traffic capture and replay to validate what worked before still works today.
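The capture-and-replay idea is easy to sketch (illustrative only, not Speedscale's actual mechanism): record responses from a known-good run, then check that today's code still produces them.

```python
# Capture and replay: record a handler's responses from a known-good
# run, then assert a later version still produces identical responses.
# Real traffic replay works at the network layer; this is a toy model.

recorded = {}  # request -> response captured from the known-good run

def capture(handler, request):
    recorded[request] = handler(request)

def replay(handler, request):
    # Does today's handler still match what worked before?
    return handler(request) == recorded[request]

handler_v1 = lambda req: req.upper()   # known-good version
capture(handler_v1, "get /users")

handler_v2 = lambda req: req.upper()   # refactored version under test
print(replay(handler_v2, "get /users"))  # True
```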
[+] simianwords|1 month ago|reply
did you read the article?

>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).

[+] losvedir|1 month ago|reply
I think the "software factory" terminology is very interesting, and I would imagine quite intentional.

It calls to mind the early days of the industrial revolution, when I believe the idea was that mass produced items were not better quality, just dramatically cheaper. So you still had the artisans that the rich people paid for but now poorer people had access to something they couldn't before.

Then, as technology progressed, factories started producing things that humans are incapable of. And part of this is because those factories were built on output of earlier factories.

It makes me wonder if this is where we're headed. Right now the code quality of agents isn't better than hand written code, and so arguably the products aren't either. But will there come a time when it surpasses what we can do? You can't handcraft a microchip, for example. But I guess the takeaway is maybe there's a time for both agentic lower quality but cheaper output and human software engineer higher quality output, at least for a time.

[+] idiotsecant|1 month ago|reply
Which is exactly why agentic coding tools are effectively running at a loss right now. We are training that next generation of factories. We're paying in human cognition.
[+] japhyr|1 month ago|reply
> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.

This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.

The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
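The holdout idea can be sketched concretely: scenarios live in an evaluation harness the coding agents never see, and only the finished artifact is scored against them. Everything below is hypothetical, not StrongDM's actual harness.

```python
# Holdout scenarios: kept outside the repo the coding agents work in,
# so they can't be gamed with `assert True`. Only the finished artifact
# is scored. `toy_system` stands in for the software under test.

HOLDOUT_SCENARIOS = [
    {"input": {"user": "alice", "action": "login"}, "expect": "ok"},
    {"input": {"user": "",      "action": "login"}, "expect": "error"},
]

def evaluate(software_under_test):
    passed = 0
    for scenario in HOLDOUT_SCENARIOS:
        try:
            result = software_under_test(**scenario["input"])
        except Exception:
            result = "error"
        passed += (result == scenario["expect"])
    return passed / len(HOLDOUT_SCENARIOS)  # fraction satisfied

def toy_system(user, action):
    if not user:
        raise ValueError("missing user")
    return "ok"

print(evaluate(toy_system))  # 1.0
```

The separation is the whole trick: the building agents optimize against their own tests, while the score that matters comes from scenarios they were never shown.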

I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.

Question for people who are already doing this: How much are you spending on tokens?

That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.

[+] simonw|1 month ago|reply
This is the stealth team I hinted at in a comment on here last week about the "Dark Factory" pattern of AI-assisted software engineering: https://news.ycombinator.com/item?id=46739117#46801848

I wrote a bunch more about that this morning: https://simonwillison.net/2026/Feb/7/software-factory/

This one is worth paying attention to. They're the most ambitious team I've seen exploring the limits of what you can do with this stuff. It's eye-opening.

[+] enderforth|1 month ago|reply
This right here is where I feel most concerned

> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement

Seems to me like if this is true I'm screwed no matter if I want to "embrace" the "AI revolution" or not. No way my manager's going to approve me to blow $1000 a day on tokens, they budgeted $40,000 for our team to explore AI for the entire year.

Let alone from a personal perspective I'm screwed because I don't have $1000 a month in the budget to blow on tokens because of pesky things that also demand financial resources like a mortgage and food.

At this point it seems like damned if I do, damned if I don't. Feels bad man.

[+] amarant|1 month ago|reply
"If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement"

Apart from being an absolutely ridiculous metric, this is a bad approach, at least with current-generation models. In my experience, the less you inspect what the model does, the more spaghetti-like the code becomes. And the flying spaghetti monster eats tokens faster than you can blink! Put more clearly: implementing a feature costs a lot more tokens in a messy code base than in a clean one. It's not (yet) enough to just tell the agent to refactor and make it clean; you have to give it hints on how to organise the code.

I'd go so far as to say that if you're burning a thousand dollars a day per engineer, you're getting very little bang for your tokens.

And your engineers probably look like this: https://share.google/H5BFJ6guF4UhvXMQ7

[+] Garlef|1 month ago|reply
Maybe Management will finally get behind refactoring
[+] kakugawa|1 month ago|reply
It's short-term vs long-term optimization. Short-term optimization is making the system effective right now. Long-term optimization is exploring ways to improve the system as a whole.
[+] rileymichael|1 month ago|reply
> In rule form: - Code must not be written by humans - Code must not be reviewed by humans

As a previous StrongDM customer, I will never recommend their offering again. For a core security product, this is not the flex they think it is.

Also, mimicking another product's behavior and staying in sync is a fool's errand. You certainly won't be able to do it just from the API documentation. You may get close, but never perfect, and you're going to experience constant breakage.

[+] simonw|1 month ago|reply
Important to note that this is the approach taken by their AI research lab over the past six months, it's not (yet) reflective of how they build the core product.
[+] andersmurphy|1 month ago|reply
Right but how many unsuspecting customers like you do they need to have before they can exit?
[+] galoisscobi|1 month ago|reply
What has strongdm actually built? Are their users finding value from their supposed productivity gains?

If their focus is to only show their productivity/ai system but not having built anything meaningful with it, it feels like one of those scammy life coaches/productivity gurus that talk about how they got rich by selling their courses.

[+] politelemon|1 month ago|reply
> we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?

Oh, to have the luxury of redefining success and handwaving away hard learned lessons in the software industry.
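For what it's worth, the metric in the quoted passage reduces to a simple ratio. This is a toy sketch; the `judge` callable stands in for however satisfaction is actually decided.

```python
# The quoted "satisfaction" metric as a ratio: of all observed
# trajectories through all scenarios, what fraction is judged to
# satisfy the user? `judge` is a hypothetical stand-in.

def satisfaction(trajectories_by_scenario, judge):
    total = satisfied = 0
    for scenario, trajectories in trajectories_by_scenario.items():
        for t in trajectories:
            total += 1
            satisfied += judge(scenario, t)
    return satisfied / total if total else 0.0

runs = {
    "checkout": ["ok", "ok", "crash"],
    "signup":   ["ok", "timeout"],
}
print(satisfaction(runs, lambda s, t: t == "ok"))  # 0.6
```

Whether 0.6 is acceptable is exactly the redefinition being objected to: a green test suite is binary, but this number is a judgment call.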

[+] geraneum|1 month ago|reply
> with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.

What does it mean to compound correctness? Like negative acceleration in rate of errors? How does that compound? Unseriously!

[+] navanchauhan|1 month ago|reply
The model could start building on top of things it had successfully built before instead of just straight up exponential error propagation
[+] danshapiro|1 month ago|reply
If you'd like to try this yourself, you can build an "attractor" by just pointing Claude Code at their llms.txt. Or, if you'd like to save some tokens, you can clone my Go version: https://github.com/danshapiro/kilroy This version includes a Claude Code skill to help. Tell it to use its skill to create a dotfile from your requirements, then tell it to run that dotfile with kilroy.
[+] softwaredoug|1 month ago|reply
A lot of the examples of creating clones of existing products don't resonate with the new products we build.

For example, most development work involves discovering correctness, not writing to a foolproof spec (like cloning Slack).

Usually work goes like:

* Team decides some vague requirement

* Developer must implement requirement into executable decisions

Now I use Claude Code to do step 2, and it's great. But I'm checking whether the implementation's little decisions actually do what the business would want. Or, more accurately, I'm making decisions to the level of specificity that matters to the problem at hand.

I have to try, backtrack, and rebuild all the time when my assumptions get broken.

In some cases decisions have low specificity: I could one-shot a complex feature (or entire app if trying to test PMF or something). In other cases, the tradeoffs in 10 lines of code become crucially important.

[+] bluesnowmonkey|1 month ago|reply
> The Digital Twin Universe is our answer: behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors.

Came to the same conclusion. I have an integration heavy codebase and it could hardly test anything if tests weren't allowed to call external services. So there are fake implementations of every API it touches: Anthropic, Gemini, Sprites, Brave, Slack, AgentMail, Notion, on and on and on. 22 fakes and climbing. Why not? They're essentially free to generate, it's just tokens.

I didn't go as far as recreating the UI of these services, though, as the article seems to be implying based on those screenshots. Just the APIs.
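A fake of this kind can be as small as an in-memory object mimicking the subset of the API the code actually calls. This is a sketch: the method name and response shape are invented for illustration, not the real Slack SDK interface.

```python
# Minimal in-memory fake of a chat-posting API, in the spirit of the
# fakes described above. Method name and response shape are invented
# for illustration; this is not the real Slack SDK.

class FakeSlack:
    def __init__(self):
        self.messages = []  # everything "posted", recorded for assertions

    def post_message(self, channel, text):
        if not channel.startswith("#"):
            return {"ok": False, "error": "invalid_channel"}
        self.messages.append({"channel": channel, "text": text})
        return {"ok": True, "ts": str(len(self.messages))}

# In a test, inject the fake wherever the real client would be used.
slack = FakeSlack()
resp = slack.post_message("#alerts", "build failed")
assert resp["ok"] and slack.messages[0]["text"] == "build failed"
```

The payoff is that tests can assert on both the return value and the recorded side effects, with no network and no credentials.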

[+] Dumblydorr|1 month ago|reply
What would happen if these agents were given a token lifespan and told to continually spend tokens creating agentic children, passing on their genetic and data makeup (such as it is) to children created with other agents, potentially sexually, while tokens are limited and agents without certain traits cannot get enough?

Wouldn’t they start to evolve to be able to reproduce more and eat more tokens? And then they’d be mature agents to take further human prompts to gain more tokens?

Would you see certain evolutionary strategies reemerge like carnivores eating weaker agents for tokens, eating of detritus of old code, or would it be more like evolution of roles in a company?

I assume the hurdles would be agents reproducing? How is that implemented?

[+] stego-tech|1 month ago|reply
IT perspective here. Simon hits the nail on the head as to what I'm genuinely looking forward to:

> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!

This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites to handle configuration and integrations, while not gone, is certainly under threat from LLMs that can ingest user requirements and produce functional code to do a similar thing at a fraction of the price.

What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.

This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?

For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.

[+] groundtruthdev|1 month ago|reply
In this hypothetical world where AI reliably generates software, large and small software providers alike are out of luck. Companies will go straight to LLMs or open-source models, fine-tune them for their needs, and run them on in-house hardware as costs fall, spreading expenses across departments. Even LLM providers won’t be safe. Brand, lock-in, and incumbent status won’t save anyone. The advantage goes to whoever can integrate, customize, and scale internally. Hypothetically is the keyword.
[+] hnthrow0287345|1 month ago|reply
Yep, you definitely want to be in the business of selling shovels for the gold rush.
[+] mccoyb|1 month ago|reply
Effectively everyone is building the same tools with zero quantitative benchmarks or evidence behind the why / ideas … this entire space is a nightmare to navigate because of this. Who cares without proper science, seriously? I look through this website and it looks like a preview for a course I’m supposed to buy … when someone builds something with these sorts of claims attached, I assume that there is going to be some “real graphs” (“these are the number of times this model deviated from the spec before we added error correction …”)

What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.

I may be alone in this, but it drives me nuts.

Okay, so with that in mind, it amounts to hearsay ("these guys are doing something cool"). Why not put up or shut up with either (a) an evaluation of the ideas in a rigorous, quantitative way, or (b) applying the ideas to produce a "hard" artifact (analogous, e.g., to the Anthropic C compiler or the Cursor browser) with a reproducible pathway to generation?

The answer seems to be that (b) is impossible (as long as we're on the teat of the frontier labs, which disallow the kind of access that would make (b) possible), and the answer for (a) is "we can't wait, we have to get our names out there first".

I’m disappointed to see these types of posts on HN. Where is the science?

[+] tezza|1 month ago|reply
Not sure a “Digital Twin Universe” is required here. They seem rather to have rediscovered simulators in integration tests from first principles? “DTU” comes off like “XML Databases” or “Information Superhighway”.

Still… a really good application of agent hands-off replication.

Seems like creating a quality negative mould, where that single negative mould then produces multiple positive objects en masse.