Launch HN: GPT Driver (YC S21) – End-to-end app testing in natural language
You can watch a brief product walkthrough here: https://www.youtube.com/watch?v=5-Ge2fqdlxc
In terms of trying the product out: since the service is resource-intensive (we provide hosted virtual/real phone instances), we don't currently have a playground available. However, you can see some examples here https://mobileboost.io/showcases and book a demo of GPT Driver testing your app through our website.
Why we built this: at previous startups and scaleups, we saw that as app teams grew, QA teams struggled to keep up and verify that everything still worked. This caused tension between teams and let bugs slip into production.
You’d expect automated tests to help, but they were a huge effort: only engineers could create the tests, and the apps themselves kept changing, breaking the tests regularly and leading to high maintenance overhead. Functional tests often failed not because of actual app errors, but due to changes like copy updates or modifications to element IDs. This was already a challenge even before considering the added complexity of multiple platforms, different environments, multilingual UIs, marketing popups, A/B tests, or minor UI changes from third-party authentication or payment providers.
We realized that combining computer vision with LLM reasoning could solve the common flakiness issues in E2E testing. So we launched GPT Driver: a no-code editor paired with a hosted emulator/simulator service that lets teams set up test automation efficiently. Our visual + LLM reasoning test execution reduces false alarms, enabling teams to integrate their E2E tests into their CI/CD pipelines without getting blocked.

Some interesting technical challenges we faced along the way:

(1) UI object detection from vision input: we had to train object detection models (YOLO- and Faster R-CNN-based) on a subset of the RICO dataset as well as our own dataset to interact accurately with the UI.

(2) Reasoning with current LLMs: we have to shorten instructions, action history, and screen content at runtime for better results, since handling large amounts of input tokens remains a challenge. We also use reasoning templates to achieve robust decision-making.

(3) Performance optimization: we optimized our agentic loop to make decisions in under 4 seconds. To reduce this further, we implemented caching mechanisms and offer a command-first approach, where our AI agent only takes over when the command fails.
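The command-first fallback with caching described in (3) can be sketched roughly as follows. This is a minimal illustration of the pattern, not GPT Driver's actual code; all names (`execute_step`, `CommandFailed`, the stub functions) are invented for the example.

```python
# Sketch of a "command-first" step executor: replay a cached action if one
# exists, otherwise try the deterministic scripted command, and only hand
# control to the (slower, costlier) AI agent when the command fails.

class CommandFailed(Exception):
    """Raised when a scripted command can't find or act on its target."""

def execute_step(command_fn, agent_fn, cache, key):
    """Run one test step; prefer cache, then command, then agent."""
    if key in cache:                  # replay a previously working action
        return cache[key]()
    try:
        result = command_fn()         # deterministic path (fast, cheap)
        cache[key] = command_fn
        return result
    except CommandFailed:             # e.g. element ID changed, popup shown
        result = agent_fn()           # vision + LLM fallback
        cache[key] = agent_fn
        return result

# Usage: the scripted command fails (simulating a changed element ID),
# so the agent takes over and its action is cached for the next run.
cache = {}

def tap_login_by_id():
    raise CommandFailed("no element with id 'login_btn'")

def agent_finds_login():
    return "tapped login via vision model"

outcome = execute_step(tap_login_by_id, agent_finds_login, cache, "login")
```

On subsequent runs the cached callable is replayed first, which is what keeps the loop under the latency budget when nothing on the screen has changed.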
Since launching GPT Driver, we’ve seen adoption by technical teams both with and without dedicated QA roles. Compared to code-based tests, the core benefit is the reduction in both the manual work and the time required to maintain effective E2E tests. The approach is particularly powerful for apps with a lot of dynamic screens and content, such as Duolingo, which we have been working with for the past couple of months. Additionally, the tests can now also be managed by non-engineers.
We’d love to hear about your experiences with E2E test automation—what approaches have or haven’t worked for you? What features would you find valuable?
[+] [-] msoad|1 year ago|reply
It used to be that the frontend was very fragile. XVFB, Selenium, ChromeDriver, etc., used to be the cause of pain, but recently the frontend frameworks and browser automation have been solid. Headless Chrome hardly lets us down.
The biggest pain in e2e testing is that tests fail for reasons that are hard to understand and debug. This is a very, very difficult thing to automate; it would take AGI-level intelligence to build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test failed.

When an e2e test flakes, in a lot of cases we ignore it. I have been in other orgs where this is the case too. I wish there were a system that would follow up and generate a report that says, “This e2e test failed because service XYZ had a null pointer exception at this line,” but that doesn’t exist today. Most of the companies I’ve been at had infra complex enough that the error message never makes it to the frontend, so we can only find it in the logs. OpenTelemetry and other tools are promising, but again, I’ve never seen infra good enough to put it all together.
Writing tests is not a pain point worth buying a solution for, in my case.
My 2c. Hopefully it’s helpful and not too cynical.
[+] [-] hn_throwaway_99|1 year ago|reply
That is, I don't think a framework focused on front end testing should really be where the solution for your problem is implemented. You say "This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails." - I would argue what you really need is better log aggregation and system tracing. And I'm not saying this to be snarky (at scale with a bunch of different teams managing different components I've seen that it can be difficult to get everyone on the same aggregation/tracing framework and practices), but that's where I'd focus, as you'll get the dividends not only in testing but in runtime observability as well.
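The correlation this comment argues for can be shown with a toy sketch: if every service logs a shared trace ID (the sort of context OpenTelemetry propagates), a failed E2E test can be mapped back to the backend error that caused it. The log shapes and values below are invented for illustration.

```python
# Toy log-aggregation lookup: given the trace ID carried by a failed E2E
# test, pull out the backend errors that share it.

def explain_failure(trace_id, logs):
    """Return backend ERROR entries that share the failed test's trace ID."""
    return [
        f"{entry['service']}: {entry['message']}"
        for entry in logs
        if entry["trace_id"] == trace_id and entry["level"] == "ERROR"
    ]

# Aggregated logs from several services in the mesh (invented data).
logs = [
    {"trace_id": "t-42", "service": "checkout", "level": "INFO",
     "message": "request received"},
    {"trace_id": "t-42", "service": "payments", "level": "ERROR",
     "message": "NullPointerException at PaymentProcessor.java:88"},
    {"trace_id": "t-99", "service": "search", "level": "ERROR",
     "message": "upstream timeout"},
]

# Suppose the failed E2E test's request carried trace ID "t-42":
report = explain_failure("t-42", logs)
```

The hard part in practice is not this lookup but getting every team to emit and propagate the trace ID consistently, which is the point about aggregation/tracing discipline above.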
[+] [-] ec109685|1 year ago|reply
Those types of transient issues aren’t something you’d want to fail a test for, given that they would still let a human get the job done if they happened in the field.
This seems like the most useful part of adding AI to e2e tests. The world is not deterministic, which AI handles well.
Uber takes this approach here: https://www.uber.com/blog/generative-ai-for-high-quality-mob...
[+] [-] cschiller|1 year ago|reply
Regarding writing robust e2e tests, I think it really depends on the team's experience and the organization’s setup. We’ve found that in some organizations—particularly those with large, fast-moving engineering teams—test creation and maintenance can still be a bottleneck due to the flakiness of their e2e tests.
For example, we’ve seen an e-commerce team with 150+ mobile engineers struggle to keep their functional tests up-to-date while the company was running copy and marketing experiments. Another team in the food delivery space faced issues where unrelated changes in webviews caused their e2e tests to fail, making it impossible to run tests in a production-like system.
Our goal is to help free up that time so that teams can focus on solving bigger challenges, like the debugging problems you’ve mentioned.
[+] [-] fullstackchris|1 year ago|reply
Maybe someday the tooling for mobile will be as good as headless chrome is for web :)
Agreed though that the followup debugging of a failed test could be hard to automate in some cases.
[+] [-] tomatohs|1 year ago|reply
Debugging a failed test is a "first world problem"
[+] [-] batikha|1 year ago|reply
Over nearly 10 years in startups (big and small), I've been consistently surprised by how often I hear that "testing has been solved", yet I see very little automation in place, and PMs/QAs/devs and sometimes CEOs and VPs doing lots of manual QA. And not only on new features (which is a good thing), but also on happy-path / core features (arguably a waste of time to test over and over again).
More than once I worked for a company that was against having a manual QA team out of principle, for more or less valid reasons ("we use a typed language so fewer bugs", "engineers are empowered", etc.), but ended up hiring external consultants to handle QA after a big quality incident.
The amount of mismatch between theory and practice in this field is impressive.
[+] [-] epolanski|1 year ago|reply
Because software is a clownish mimicking of engineering that lacks any real solid and widespread engineering practices.
It's cultural.
Crowds boast their engineering degrees but have little to show beyond leetcode and system design black belts, even though their day-to-day job rarely requires them to architect systems or reimplement the Levenshtein distance, and would benefit a lot more from thoroughly investigating functional and non-functional requirements and encoding and maintaining those through automation.
There's very little engineering in software, people really care about the borderline fun parts and discard the rest.
[+] [-] ec109685|1 year ago|reply
Have you considered an approach like what Anthropic is doing for their computer control where an agent runs on your own computer and controls a device simulator?
[+] [-] tomatohs|1 year ago|reply
Another example, imagine an error box shows up. Was that correct or incorrect?
So you need to build a "meta" layer, which includes UI, to start marking up the video and end up in the same state.
Our approach has been to let the AI explore the app and come up with ideas. Less interaction from the user.
[+] [-] rvz|1 year ago|reply
That has around 95% of what GPT Driver does and has the potential to do Web E2E testing.
[0] https://maestro.mobile.dev
[+] [-] cschiller|1 year ago|reply
To be more concrete, their words were:

- “What you define, you can tweak, touch the detail, and customize, saving you time.”
- “You don’t entirely rely on AI. You stay involved, avoiding misinterpretations by AI.”
- “Flexibility to refine, by using templates and triggering partial tests, features that come from real-world experience. This speeds up the process significantly.”
Our understanding is that because we launched the first version of GPT Driver in April 2023, we’ve built it in an “AI-native” way, while other tools are simply adding AI-based features on top. We worked closely with leading mobile teams, including Duolingo, to ensure we stay as aligned as possible with real-world challenges.
While our focus is on mobile, GPT Driver also works effectively on web platforms.
[+] [-] mmaunder|1 year ago|reply
PS: If you had this for desktop, we'd immediately become a customer.
[+] [-] cschiller|1 year ago|reply
To address these issues, we enhance the models with our own custom logic and specialized models, which helps us achieve more reliable results.
Looking forward, we expect our QA Studio to become even more powerful as we integrate tools like test management, reporting, and infrastructure, especially as models improve. We're excited about the possibilities ahead!
[+] [-] drothlis|1 year ago|reply
Does your model consistently get the positions right? (above, below, etc). Every time I play with ChatGPT, even GPT-4o, it can't do basic spatial reasoning. For example, here's a typical output (emphasis mine):
> If YouTube is to the upper *left* of ESPN, press "Up" once, then *"Right"* to move the focus.
(I test TV apps where the input is a remote control, rather than tapping directly on the UI elements.)
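The failure mode described here (the model saying "Right" when the target is to the upper left) is exactly the kind of thing that can be computed deterministically once element positions are known. A hedged sketch, with illustrative grid coordinates, not any real framework's API:

```python
# Derive D-pad key presses from grid positions of the focused tile and the
# target tile on a TV home screen, instead of asking an LLM to reason
# spatially. Positions are (col, row) with row 0 at the top, so a smaller
# row number means the target is *above* the current focus.

def presses_to(focus, target):
    """Return the remote-control presses that move focus -> target."""
    presses = []
    dc, dr = target[0] - focus[0], target[1] - focus[1]
    presses += ["Up"] * max(-dr, 0) + ["Down"] * max(dr, 0)
    presses += ["Left"] * max(-dc, 0) + ["Right"] * max(dc, 0)
    return presses

# ESPN is focused at (1, 1); YouTube sits to its upper *left* at (0, 0),
# so the correct sequence is "Up" then "Left" -- not "Right".
keys = presses_to(focus=(1, 1), target=(0, 0))
```

This only works when an object detector (or accessibility tree) supplies the coordinates; the LLM then only has to pick the target, not do the geometry.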
[+] [-] xyst|1 year ago|reply
This was many years ago though (2018-2019?) before the genAI craze. Wonder if it has improved or not; or if this product is any better than its competitors.
[+] [-] cschiller|1 year ago|reply
From our tests, even the latest model snapshots aren't yet reliable enough in positional accuracy. That's why we still rely on augmenting them with specialized object detection models. As foundational models continue to improve, we believe our QA suite - covering test case management, reporting, agent orchestration, and infrastructure - will become more relevant for the end user. Exciting times ahead!
[+] [-] tauntz|1 year ago|reply
> Individuals with the last name "Bach" or "Bolton" are prohibited from using, referencing, or commenting on this website or any of its content.
..and now I'm curious to know the backstory for this :)
[+] [-] chrtng|1 year ago|reply
https://www.theverge.com/2024/2/16/24075304/trademark-pto-op...
[+] [-] chairhairair|1 year ago|reply
I do not want additional uncertainty deep in the development cycle.
I can tolerate the uncertainty while I'm writing. That's where there is a good fit for these fuzzy LLMs. Anything past the cutting room floor and you are injecting uncertainty where it isn't tolerable.
I definitely do not want additional uncertainty in production. That's where the "large action model" and "computer use" and "autonomous agent" cases totally fall apart.
It's a mindless extension, something like: "this product is good for writing... let's let it write to prod!"
[+] [-] aksophist|1 year ago|reply
And then there are truly dynamic apps like games or simulators. There may be no accessibility info to deterministically code to.
[+] [-] cschiller|1 year ago|reply
Take, for example, scenarios involving social logins or payments where external webviews are opened. These often trigger cookie consent forms or other unexpected elements, which the app developer has limited control over. The complexity increases when these elements have unstable identifiers or frequently changing attributes. In such cases, even though the core functionality (e.g., logging in) works as expected, traditional test automation often fails, requiring constant maintenance.
The key, as other comments have noted, is ensuring the solution is good at distinguishing between meaningful test issues and non-issues.
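That triage step can be sketched as a simple classifier over what changed on screen. The categories and rules below are invented for illustration (not GPT Driver's actual logic); in practice the "ignorable" judgment would come from the vision + LLM layer rather than a fixed set.

```python
# Before failing a test, split observed screen differences into real
# defects vs. expected noise (copy tweaks, cookie banners, A/B variants).

IGNORABLE = {"cookie_consent", "marketing_popup", "copy_change", "ab_variant"}

def triage(observations):
    """Return (real failures, ignorable noise) from screen observations."""
    failures = [o for o in observations if o["kind"] not in IGNORABLE]
    noise = [o for o in observations if o["kind"] in IGNORABLE]
    return failures, noise

# Invented example: a login flow through an external webview.
observations = [
    {"kind": "cookie_consent", "detail": "GDPR banner over login webview"},
    {"kind": "copy_change", "detail": "button says 'Sign in' not 'Log in'"},
    {"kind": "broken_flow", "detail": "login button does nothing"},
]

failures, noise = triage(observations)
```

Only the `broken_flow` observation should fail the test; the banner and the copy tweak are exactly the kinds of changes that make traditional ID-based tests flaky.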
[+] [-] devjab|1 year ago|reply
In many cases you’re correct though. We have a few libraries where we won’t use Typescript because even though it might transpile 99% correctly, the fact that we have to check is too much work for it to be worth our time in those cases. I think LLMs are similar: once in a while you’re not going to want them because checking their work takes too many resources, but for a lot of stuff you can use them. Especially if your e2e testing is really just pseudo-jobbing because some middle manager wanted it, which it unfortunately is far too often. If you work in such a place, you’re going to recommend the path of least resistance, and if that’s LLM-powered then it’s LLM-powered.
On the less bleak and pessimistic side, if the LLM e2e output is good enough to be less resource consuming, even if you have to go over it, then it’s still a good business case.
[+] [-] batikha|1 year ago|reply
So being non-deterministic is actually an advantage, in practice.
[+] [-] chrtng|1 year ago|reply
Our key metrics include the time and cost per agentic loop, as well as the false positive rate for a full end-to-end test. If you have any specific benchmarks or evaluation metrics you'd suggest, we'd be happy to hear them!