The same day, there was a post on Reddit titled "We built 3B and 8B models that rival GPT-5 at HTML extraction while costing 40-80x less - fully open source" [1].
This is exactly the direction I am seeing agents go. They should be able to write their own tools, and we are soon launching something around that.
> For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance.
also training data quality. they are horrifyingly bad at concurrent code in general in my experience, and looking at most concurrent code in existence.... yeah I can see why.
With the upcoming release of Gemini 3.0 Pro, we might see a breakthrough for that particular issue. (Those are the rumors, at least.) I'm sure not fully solved, but possibly greatly improved.
I feel like this is how normal work is. When I have to figure out how to use a new app/api etc, I go through an initial period where I am just clicking around, shouting in the ether etc until I get the hang of it.
Off topic, but because the article mentioned improper usage of the DOM, I want to point to the UK government's design system and its accessibility guidance. It's well documented, and I hope all governments hold themselves to the same standard. I guess they paid a huge amount of money to consultants and vendors.
We had a similar realization here at Thoughtful and pivoted towards code generation approaches as well.
I wonder why the focus on replaying UI interactions, rather than just skipping one step ahead to the underlying network/API calls? I've been playing around with similar ideas a lot recently, and I indeed started out in a similar approach as what is described in the article - but then I realized that you can get much more robust (and faster-executing) automation scripts by having the agents figure out the exact network calls to replay, rather than clicking around in a headless browser.
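To make the "replay the network call" idea concrete, here is a minimal sketch using only Python's standard library. The captured call, its URL, and the `build_replay_request` helper are all made up for illustration; a real capture would come from the browser's network tab and would usually also need cookies or auth tokens.

```python
import json
import urllib.request


def build_replay_request(captured):
    """Turn one captured network call into a urllib Request that can be re-sent."""
    body = captured.get("json")
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = dict(captured.get("headers", {}))
    if data is not None:
        # Only set a content type when there is a JSON body to describe.
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(
        captured["url"], data=data, headers=headers, method=captured["method"]
    )


# Hypothetical call, as it might be copied from the browser's network tab.
CAPTURED_CALL = {
    "method": "POST",
    "url": "https://example.com/api/search",
    "headers": {"Accept": "application/json"},
    "json": {"query": "plumbers", "page": 1},
}

request = build_replay_request(CAPTURED_CALL)

# Sending is left commented out because the endpoint above is made up:
# with urllib.request.urlopen(request, timeout=10) as response:
#     results = json.load(response)
```

The point of the sketch is that once the call is known, no browser is needed at all, which is where the robustness and speed would come from.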
In my AI First workshops, I now tell people for the last exercise: "no scrapers". The lesson is to separate reasoning (the AI) from data (which you have to bring). AI-coded scrapers seem like a logical choice, but they always fail. Scraping is a scaling issue, not a reasoning challenge. Also, the most interesting websites are not keen on new scrapers.
A point orthogonal to this: consider whether you need browser automation at all.
I tried skyvern like 6 mo ago and it didn’t work for scraping a site that sounds like welp. Ended up doing it myself. Was trying to scrape data across Bay Area.
You gain experience getting interactions with other agencies optimised by dealing with them yourself. If the AI you rely on fails, you are dead in the water. And I'm speaking as a fairly resilient 50 year old with plenty of hands-on experience, but concerned for the next generation. I know generational concern has existed since the invention of writing, and the world hasn't fallen apart, so what do I know? :)
Over the past few days I've spent a lot of time dealing with terribly designed UIs. Some legitimate and desired use cases are impossible because poor logic excludes them.
Your example use case is automatically filling out an IRS form, operated by the sort of IRS department that makes a webform that's only up during business hours? Do you realize how legally risky that is to create, and how legally risky that will be to operate?
nithril|4 months ago
Not fully equivalent to what Skyvern is doing, but still an interesting approach.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...
suchintan|4 months ago
Thanks for sharing!
_pdp_|4 months ago
That being said...
LLMs are amazing for some coding tasks and fail miserably at others. My hypothesis is that, given the current model architectures, there is some sort of practical limit to how many concepts an LLM can take into account, no matter the context window.
For a long time I wanted to find some sort of litmus test to measure this and I think I found one that is an easy to understand programming problem, can be done in a single file, yet complex enough. I have not found a single LLM to be able to build a solution without careful guidance.
I wrote more about this here if you are interested: https://chatbotkit.com/reflections/where-ai-coding-agents-go...
JimDabell|4 months ago
Plan for solving this problem:
- Build a comprehensive design system with AI models
- Catalogue the components it fails on (like yours)
- These components are the perfect test cases for hiring challenges (immune to “cheating” with AI)
- The answers to these hiring challenges can be used as training data for models
- Newer models can now solve these problems
- You can vary this by framework (web component / React / Vue / Svelte / etc.) or by version (React v18 vs React v19, etc.)
What you’re doing with this is finding the exact contours of the edge of AI capability, then building a focused training dataset to push past those boundaries. Also a Rosetta Stone for translating between different frameworks.
I put a brain dump about the bigger picture this fits into here:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
Groxx|4 months ago
meowface|4 months ago
whinvik|4 months ago
And then the third or fourth time it's automatic. It's weird, but sometimes I feel like the best way to make agents work is to metathink about how I myself work.
suchintan|4 months ago
pennaMan|4 months ago
What used to be a constant almost daily chore with them breaking all the time at random intervals is now a self-healing system that rarely ever fails.
silver_sun|4 months ago
ACCount37|4 months ago
TheTaytay|4 months ago
suchintan|4 months ago
hamasho|4 months ago
[1] https://design-system.service.gov.uk/components/radios/
philipbjorge|4 months ago
I know the authors of Skyvern are around here sometimes -- How do you think about code generation with vision based approaches to agentic browser use like OpenAI's Operator, Claude Computer Use and Magnitude?
From my POV, I think the vision based approaches are superior, but they are less amenable to codegen IMO.
suchintan|4 months ago
suchintan|4 months ago
We can ask the vision based models to output why they are doing what they are doing, and fall back to code-based approaches for subsequent runs
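One way to picture that flow, as a rough sketch rather than how Skyvern actually implements it: the first run goes through the slow agent, the steps it produced are stored, and later runs replay the stored steps, going back to the agent only when the replay breaks. Every name here (`CachedWorkflow`, `fake_agent`, `fake_script`) is hypothetical, and the agent and script runner are stubs.

```python
class CachedWorkflow:
    """Replay a stored script when one exists; otherwise ask the agent."""

    def __init__(self, run_agent, run_script):
        self.run_agent = run_agent    # slow path: returns (result, script)
        self.run_script = run_script  # fast path: replays a stored script
        self.scripts = {}             # task name -> script from the last agent run

    def run(self, task):
        script = self.scripts.get(task)
        if script is not None:
            try:
                return self.run_script(script)
            except Exception:
                # The page probably changed: drop the stale script and regenerate.
                del self.scripts[task]
        result, script = self.run_agent(task)
        self.scripts[task] = script
        return result


# Stubs standing in for a vision-based agent and a script runner.
calls = []


def fake_agent(task):
    calls.append("agent")
    # Each step keeps the agent's stated reason next to the action.
    script = [("click #login", "the login form is behind this button")]
    return "done", script


def fake_script(script):
    calls.append("script")
    return "done"


workflow = CachedWorkflow(fake_agent, fake_script)
first = workflow.run("log in")   # no script yet, so the agent runs
second = workflow.run("log in")  # the stored script is replayed
```

Keeping the model's stated reason with each step is an assumption about what would make regeneration easier when a replay fails.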
Ldorigo|4 months ago
franze|4 months ago
showerst|4 months ago
If a website isn't using Cloudflare or a JS-only design, it's generally better to skip playwright. All the major AIs understand beautifulsoup pretty well, and they're likely to write you a faster, less brittle scraper.
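For a sense of how little code a static page needs, here is a minimal sketch. To stay dependency-free it uses the standard library's `html.parser` in place of beautifulsoup, and it parses an inline sample instead of fetching a page; the `LinkExtractor` class and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a static HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_HTML = """
<ul>
  <li><a href="/listing/1">First <b>listing</b></a></li>
  <li><a href="/listing/2">Second listing</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML)
```

With beautifulsoup the same extraction would be shorter still, and in both cases there is no browser to launch or keep alive.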
Etheryte|4 months ago
suchintan|4 months ago
They aren't enough for anything that's login-protected, or requires interacting with wizards (eg JS, downloading files, etc)
pavel_lishin|4 months ago
moomoo11|4 months ago
That said I’d try it again but I don’t want to spend money again.
randunel|4 months ago
fsckboy|4 months ago
AI, build me a scraper
what do you want to scrape
[lists sites to scrape]
oh, I've already scraped those relentlessly, here ya go
pu_pu|4 months ago
pcblues|4 months ago
pyuser583|4 months ago
Is AI capable of saying, "This website sucks, and doesn't work - file a complaint with the webmaster?"
I once had similar problems with the CIA's World Factbook. I shudder to think what an AI would do there.
suchintan|4 months ago
Skyvern kept suggesting improvements unrelated to the issue they were testing for
unknown|4 months ago
[deleted]
guluarte|4 months ago
fragmede|4 months ago
suchintan|4 months ago
herpdyderp|4 months ago
suchintan|4 months ago
ahstilde|4 months ago
bigiain|4 months ago
While I can see _some_ good uses for it, there are clearly abusive uses for it, including in their examples.
I mean jesus fuck, who wants cheap/free automation out there to "Skyvern can be instructed to navigate to job application websites like Lever.co and automatically generate answers, fill out and submit the job application."?
I already have to deal with enough totally unsuitable scattergun job applications every time we advertise an open position.
This is just asking to be used for abuse.
jimrandomh|4 months ago
suchintan|4 months ago
claysmithr|4 months ago