Zakodiac|22 days ago
Their answer of keeping scenarios external to the codebase, like a holdout set, is smart. And building full behavioral clones of services like Okta, Jira, and Slack, so you can run thousands of end-to-end scenarios without hitting rate limits or production, is where the actual hard engineering work is: not the code generation, but the validation infrastructure.
Most teams trying this will skip that part because it's expensive and unglamorous. They'll let agents write code and tests together and wonder why things break in production. The "factory" part isn't the agents writing code. It's having robust enough external proof that the code does what it's supposed to.
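To make the holdout idea concrete, here is a minimal sketch of what "scenarios external to the codebase" could look like. Everything in it is an assumption for illustration: the scenario format, the `provision_user` function standing in for agent-written code, and the `run_holdout` runner are all hypothetical, not anything the original team described.

```python
import json

# Stand-in for a scenario file kept OUTSIDE the repository the agents can edit.
# The format (name / input / expect) is a hypothetical one chosen for this sketch.
EXTERNAL_SCENARIOS = """
[
  {"name": "new user gets default role",
   "input": {"email": "a@example.com"},
   "expect": {"role": "member"}},
  {"name": "admin domain gets admin role",
   "input": {"email": "root@admin.example.com"},
   "expect": {"role": "admin"}}
]
"""


def provision_user(email):
    """Hypothetical agent-written code under test."""
    domain = email.split("@", 1)[1]
    return {"role": "admin" if domain.startswith("admin.") else "member"}


def run_holdout(scenarios_json, system):
    """Run every external scenario against `system`; map scenario name to pass/fail."""
    results = {}
    for scenario in json.loads(scenarios_json):
        actual = system(**scenario["input"])
        # A scenario passes only if every expected field matches the actual output.
        results[scenario["name"]] = all(
            actual.get(key) == value for key, value in scenario["expect"].items()
        )
    return results


if __name__ == "__main__":
    for name, passed in run_holdout(EXTERNAL_SCENARIOS, provision_user).items():
        print("PASS" if passed else "FAIL", name)
```

The point of the design is that the code generator never sees or edits the scenarios, so passing them is evidence about behavior rather than about tests written to match the code.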
jaytaylor|22 days ago
I did have an initial key insight, which led to a repeatable strategy for ensuring a high level of fidelity between the DTU and the official SaaS services:
Use the most popular publicly available reference SDK client libraries as compatibility targets, with the goal always being 100% compatibility.
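A rough sketch of that compatibility-target idea, using only the Python standard library. This is not the DTU's actual code: the `/api/post_message` endpoint, the `FakeChatHandler` clone, and `reference_client_post` (a stand-in for a call made through a real vendor SDK) are all hypothetical names invented for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class FakeChatHandler(BaseHTTPRequestHandler):
    """Hypothetical clone of a single endpoint of some chat service."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/api/post_message" and "channel" in body and "text" in body:
            status, reply = 200, {"ok": True, "channel": body["channel"], "text": body["text"]}
        else:
            status, reply = 400, {"ok": False, "error": "invalid_request"}
        payload = json.dumps(reply).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the sketch quiet; a real clone would log requests


def reference_client_post(base_url, channel, text):
    """Stand-in for the reference SDK: in practice this would be the unmodified
    vendor client library pointed at the clone's base URL."""
    request = urllib.request.Request(
        base_url + "/api/post_message",
        data=json.dumps({"channel": channel, "text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured HTTP proxy so the call reaches the local clone.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())


def run_compat_check():
    """Start the clone on a free local port, call it via the client, return the reply."""
    server = HTTPServer(("127.0.0.1", 0), FakeChatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        base_url = f"http://127.0.0.1:{server.server_port}"
        return reference_client_post(base_url, "C123", "hello")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_compat_check())
```

The compatibility target is whatever the client library accepts: if an unmodified SDK cannot tell the clone from the real service across its whole surface, the clone is faithful enough for scenario runs.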
You've also zeroed in on how challenging this was: I started this back in August 2025 (as one of many projects; at any time we're each juggling 3-8 projects) with only Sonnet 3.5. Much of the work was still very unglamorous, but feasible. Slack especially: in some ways it was more challenging to get right than all of G-Suite (!).
Now I'm partway through reimplementing the entire DTU in Rust (v1 was in Go), and with gpt-5.2 for planning and gpt-5.3-codex for execution it takes significantly less human effort.
IMO the most novel part of this story is Navan's Attractor and the corresponding NLSpec. Feed in a good Definition-of-Done and it'll bounce around between nodes until it gets it right. Several working implementations already exist less than 24 hours after its release, one of which is even open source [0].
[0] https://github.com/danshapiro/kilroy
ukuina|22 days ago
Why the switch from Go to Rust?
knuckleheads|22 days ago
Zakodiac|22 days ago
[deleted]
ares623|22 days ago
But after thinking about it more, I think it must be the lowest of low-hanging fruit for LLMs. You're building something with well-defined specs, most of which are readily available from the original creators, with a UI that only does the bare minimum, and it doesn't need any long-term qualities like reliability since it's all for internal, short-lived use. On top of that, it looks super impressive in a demo, because the applications being mocked are very complicated pieces of software, so recreating even a thin facade of them can look very impressive. And calling it a "Digital Twin Universe" is just icing on the cake.
intended|21 days ago
But at some point you get back to tests, because they are simpler to write.
This is a child of the “no handwritten code” rule. Since they can’t steer tests, they have to do something else to ensure quality.
This is only worth it if the added cost and overhead come to less than the cost of writing the code.
This seems like it will pull towards building a simulation of your firm, for the simulation to work? Or simulations of your process?
VenturingVole|21 days ago
I recognised this was grounded in an entirely different world of software engineering and organisation size though. I followed a path of thinking about what went wrong historically and how those problems might be solved: better structure, discipline, resources, all of the things which agentic AI facilitates.
You are right that most will skip this part, but I view it as being like a sewerage and sanitation system: largely invisible and not thought about, but critical for long-term health.
Also this ties in very nicely with Netflix's approach to Chaos Engineering and enabling it at broader scale.
throwup238|21 days ago
And like sewage and sanitation, the infrastructure is a lot more complicated than people think.
I’m curious what happens when they need to make a DTU of Stripe or another payment processor.
az226|22 days ago
This approach balances out and maximizes accuracy.
otabdeveloper4|22 days ago
Can't help but chuckle at that.
dboreham|22 days ago
smithclay|21 days ago
For customers, it makes migrations much easier and less-risky between vendors.
For the vendors themselves, it means you can cheaply and reliably port features your competitors have that you don’t have.
szundi|22 days ago
[deleted]