Launch HN: Extend (YC W23) – Turn your messiest documents into data
We built Extend to handle the hardest documents that break most pipelines. You can see some examples here in our demo (no signup required): https://dashboard.extend.ai/demo
I know you're probably thinking “not another document API startup”. Unfortunately, the problem just isn’t solved yet!
I’ve personally spent months struggling to build reliable document pipelines at a previous job. The long tail of edge cases is endless — massive tables split across pages, 100+ page files, messy handwriting, scribbled signatures, checkboxes represented in 10 different formats, multiple file types… the list just keeps going. After seeing countless other teams during our time in YC run into these same issues, we started building Extend.
We initially launched with a set of APIs for engineers to parse, classify, split, and extract documents. That started to take off, and soon we were deployed in production at companies building everything from medical agents, to real-time bank account onboarding, to mortgage automation. Over time, we’ve worked closely with these teams and seen first-hand how large the gap is between raw OCR/model outputs and a production-ready pipeline (LLMs and VLMs aren’t magic).
Unlike other solutions in the space, we're specifically focused on three core areas: (1) the computer vision layer, (2) LLM context engineering, and (3) the surrounding product tooling. The combination of all three is what we think it takes to hit 99% accuracy and maintain it at scale.
For instance, to parse messy handwriting, we built an agentic OCR correction layer which uses a VLM to review and make edits to low confidence OCR errors. To tackle multi-page tabular data, we built a semantic chunking engine which can detect the optimal boundaries within a document so models can excel with smaller context inputs.
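For readers curious what that correction pass might look like, here's a minimal sketch of the routing idea (the data shapes and `review_fn` are illustrative; this is not Extend's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class OcrSpan:
    text: str
    confidence: float  # 0.0-1.0 score reported by the OCR engine
    bbox: tuple        # (x0, y0, x1, y1) region on the page image

def correct_low_confidence(spans, review_fn, threshold=0.85):
    """Route only low-confidence spans to a slower, costlier VLM reviewer;
    high-confidence OCR text passes through untouched."""
    corrected = []
    for span in spans:
        if span.confidence < threshold:
            # A real pipeline would crop the page image at span.bbox and ask
            # a VLM to transcribe or confirm the text for that region.
            span = OcrSpan(review_fn(span), 1.0, span.bbox)
        corrected.append(span)
    return corrected
```

The key idea is the routing: the VLM only ever sees the handful of spans the OCR engine itself flagged as uncertain, which keeps cost and latency bounded.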
We also shipped a prompt optimization agent to automate the endless prompt engineering whack-a-mole teams spend time on. It’s built as a background agent to replicate the best prompter on your team, and runs in a loop with access to a set of tools (view files, run evals, analyze results, and update schemas).
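As a rough sketch of that loop (the function names and scoring setup are hypothetical, not Extend's API):

```python
def optimize_prompt(prompt, run_evals, propose_edit, max_iters=5, target=0.99):
    """Hill-climbing loop for a prompt-optimization agent: evaluate the
    current prompt, stop once accuracy clears the target, otherwise ask a
    model to propose a revision informed by the failures and re-evaluate."""
    best_prompt, best_score = prompt, run_evals(prompt)
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = propose_edit(best_prompt)  # e.g. an LLM call with failing examples in context
        score = run_evals(candidate)           # re-run the eval set on the revised prompt
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

In practice the eval run is the expensive step, which is why this kind of thing works better as a background agent than as an interactive tool.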
The most surprising part of this whole experience has been seeing how many crazy PDF formats are out there! We've run into everything from supermarket inventory magazines and pesticide labels to construction blueprints and satellite manufacturing plans.
Everything described above is live today. You can see it in action here (no signup): https://dashboard.extend.ai/demo. To upload your own files, log in (we’re adding free usage credits to all accounts that sign up today).
We’re excited to be sharing with HN! We’d love to hear about your experiences building document pipelines. Please try it out, and share any and all feedback with us (e.g. hard documents that didn’t work, feature requests).
airstrike|4 months ago
> Unlike other solutions in the space, we're specifically focused on three core areas: (1) the computer vision layer, (2) LLM context engineering, and (3) the surrounding product tooling.
I assume the goal is to continue to serve this via an API? That would be immensely helpful to teams building other products around these capabilities.
kbyatnal|4 months ago
We've seen customers integrate these in a few interesting ways so far:
1. Agents (exposing these APIs as tools in certain cases, or feeding extracted outputs into a vector DB for RAG)
2. Real-time experiences in their product (e.g. we power all of Brex's user-facing document upload flows)
3. Embedded in internal tooling for back-office automation
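For pattern (1), here's a sketch of what exposing an extraction endpoint as an agent tool might look like; the tool name, fields, and schema are illustrative, not Extend's actual API:

```python
# Hypothetical JSON-schema tool definition, in the shape most agent
# frameworks (OpenAI tools, Anthropic tool use, etc.) expect.
extract_document_tool = {
    "name": "extract_document",
    "description": "Parse a document and return structured fields.",
    "parameters": {
        "type": "object",
        "properties": {
            "file_url": {"type": "string", "description": "URL of the PDF or image"},
            "processor_id": {"type": "string", "description": "Which extraction config to run"},
        },
        "required": ["file_url", "processor_id"],
    },
}
```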
Our customers are already requesting new APIs and capabilities for all the other problems they run into with documents (e.g. fintech customers want fraud detection, healthcare users need form filling). Some of these we'll be rolling out soon!
constantinum|4 months ago
1. Trellis (YC W24)
2. Roe AI (YC W24)
3. Omni AI (YC W24)
4. Reducto (YC W24)
Other players (extended list):
1. Unstract: Open-source ETL for documents (https://github.com/Zipstack/unstract)
2. Datalab: Makers of Surya/Marker
3. Unstructured.io
arvind_k|4 months ago
One persistent challenge was generalizing across “wild” PDFs, especially multi-page tables.
Your mention of agentic OCR correction and semantic chunking really caught my attention. I’m curious — how did you architect those to stay consistent across diverse layouts without relying on massive rule sets?
kbyatnal|4 months ago
A lot of customers choose us for our handwriting, checkbox, and table performance. To handle complex handwriting, we've built an agentic OCR correction layer which uses a VLM to review and make edits to low confidence OCR errors.
Tables are a tricky beast, and the long tail of edge cases here is immense. A few things we've found to be really impactful are (1) semantic chunking that detects table boundaries (so a table that spans multiple pages doesn't get chopped in half) and (2) table-to-HTML conversion (in addition to markdown). Markdown is great at representing most simple tables, but can't represent cases where you have e.g. nested cells.
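A concrete example of the nested-cell limitation: markdown has no syntax for a merged cell, so the HTML form is the only lossless one. Both tables below are illustrative:

```python
# The same table with "Q1" merged across two rows. Markdown can only
# repeat (or blank) the value, losing the fact that the cell was merged:
markdown_table = """\
| Quarter | Item    | Amount |
|---------|---------|--------|
| Q1      | Widgets | $40    |
| Q1      | Gadgets | $60    |
"""

# HTML preserves the merge exactly via rowspan:
html_table = """\
<table>
  <tr><th>Quarter</th><th>Item</th><th>Amount</th></tr>
  <tr><td rowspan="2">Q1</td><td>Widgets</td><td>$40</td></tr>
  <tr><td>Gadgets</td><td>$60</td></tr>
</table>
"""
```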
You can see examples of both in our demo! https://dashboard.extend.ai/demo
Accuracy and data verification are challenging. We have a set of internal benchmarks we use, which gets us pretty far, but that's not always representative of specific customer situations. That's why one of the earliest things we built was an evaluation product, so that customers can easily measure performance on their exact docs and use cases. We recently added support for LLM-as-a-judge and semantic similarity checks, which have been really impactful for measuring accuracy before going live.
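For a sense of why exact-match scoring alone isn't enough, here's a toy version of field-level scoring. Stdlib `difflib` stands in for the semantic-similarity or judge check; a real system would use embeddings or a judge model:

```python
from difflib import SequenceMatcher

def field_score(expected, actual, mode="exact"):
    """Score one extracted field against ground truth. 'exact' is strict
    string equality; 'fuzzy' stands in for a semantic-similarity check."""
    if mode == "exact":
        return 1.0 if expected == actual else 0.0
    return SequenceMatcher(None, expected, actual).ratio()

def doc_accuracy(expected_fields, actual_fields, mode="exact", threshold=0.9):
    """Fraction of expected fields whose score clears the threshold."""
    hits = sum(
        field_score(v, actual_fields.get(k, ""), mode) >= threshold
        for k, v in expected_fields.items()
    )
    return hits / len(expected_fields)
```

An exact-match metric punishes "ACME Corp." vs "Acme Corp" as a total miss even though the extraction is arguably fine; that's the gap a similarity or judge check closes.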
aaa29292|4 months ago
https://docs.extend.ai/2025-04-21/product/general/how-credit...
Are those just different SLAs or different APIs or what?
kbyatnal|4 months ago
Our goal is to provide customers with as much flexibility as possible. For certain use cases, you might be willing to take a slight hit to accuracy in exchange for better costs and latency. To support this, we offer a "light" processing mode (with significantly lower prices) that uses smaller models, fewer VLMs, and more heuristics under the hood.
For other use cases, you simply want the highest accuracy possible. Our "performance" processing mode is a great fit for that, which enables layout models, signature detection, handwriting VLMs, and the most performant foundation models.
We back this up with a native evals experience in the product, so you can directly measure the % accuracy difference between the two modes for your exact use case.
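Conceptually, measuring that difference is just running both modes over the same labeled eval set, something like this (illustrative, not the product's API):

```python
def compare_modes(labels, light_preds, perf_preds):
    """Per-mode accuracy over one labeled eval set, so the cost/accuracy
    trade-off becomes an explicit number instead of a guess."""
    def acc(preds):
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return {"light": acc(light_preds), "performance": acc(perf_preds)}
```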
nibab|4 months ago
ng3n is more of a grid-like workflow solution on top of documents. It's a user-facing application geared towards non-technical users that have processing needs.
If there are all these new problems that became solvable, what exactly are they?
I'd be interested in replacing Datalab with Extend, but I'm not sure what avenues that opens for ng3n. Would be very curious to learn!
kbyatnal|4 months ago
The world today is quite different though. In the last 24 months, the "TAM" for document processing has expanded by multiple orders of magnitude. In the next 10 years, trillions of pages of documents will be ingested across all verticals.
Previous generations of tools were always limited to the same set of structured/semi-structured documents (e.g. tax forms). Today, engineering teams are ingesting truly the wild west of documents, from 500-page mortgage packages to extremely messy healthcare forms. All of those legacy providers fall apart when tackling these types of actual unstructured docs.
We work with hundreds of customers now, and I'd estimate 90% of the use cases we tackle weren't technically solvable until ~12 months ago. So it's nearly all greenfield work, and very rarely replacing an existing vendor or solution already in place.
All that to say, the market is absolutely huge. I do suspect we'll see a plateau in new entrants though (and probably some consolidation of current ones). With how fast the AI space moves, it's nearly impossible to compete if you enter a market just a few months too late.