Launch HN: Onedoc (YC W24) – A better way to create PDFs

293 points| AugusteLef | 2 years ago |github.com

Hey HN, we’re the co-founders of Onedoc (https://www.onedoclabs.com/ ), and the original contributors to the open-source library react-print-pdf (https://github.com/OnedocLabs/react-print-pdf ) which lets developers design and generate PDF documents automatically. Here’s a demo video: https://www.youtube.com/watch?v=MgfCyOyckQU&t=3s

Billions of PDFs are generated daily: invoices, contracts, receipts, reports, you name it. Developer time gets wasted producing these basic documents because there are no good-enough tools to design and generate PDFs.

We previously worked at giant firms, where documents (especially PDFs) were central to most workflows. We got asked to generate automated trade confirmations for our customer’s counterparties. We could not find any tool other than outdated libraries offering poor control over layout and the generation process. In the end, we just created our own—basically bringing web technologies to PDFs. That was the genesis of Onedoc.

PDF creation has two phases: design (specifying content and layout) and generation (producing the actual PDF file). Onedoc lets you do both simply and automatically.

Design: we have an open-source library called "react-print-pdf" (https://github.com/OnedocLabs/react-print-pdf ) that allows you to design a document the same way you would design a website. It supports Tailwind CSS components and Chakra UI components, and we recently also built LaTeX and Markdown components. The latter let you write text in Markdown style, and include formulas using LaTeX syntax, directly within a React component.

Generation: we have an API (https://docs.onedoclabs.com/api-reference/introduction ) and Node.js SDK (https://docs.onedoclabs.com/quickstart/nodejs ) that render your designs into PDFs.

The choice of renderer significantly affects the accuracy of the resulting PDF. For example, exporting a webpage into PDF will often result in a layout that differs from the original webpage. We ensure that what you designed is what you get, and therefore you have 100% control over the entire layout of your document including margin, style, etc. We can do that because we built the react-print-pdf library to match the HTML/CSS to PDF rendering tool we have.

Once you have generated your document, you can either store it on your local system or, if you want, use our platform (https://app.onedoclabs.com/ ) to host your document online. If you use us, you’ll also get analytics over your documents.

Our main product is an API, but you can try it on our website directly (https://www.onedoclabs.com/) using our playground without any installation or sign-up. Our pricing is usage-based: per document generated. The pricing is degressive: the more documents you generate, the less you pay per document. If you don’t want to pay for PDF generation, you can still generate as many documents as you want, but with a watermark on the margin.

It’s been fun to see what our users are building with our open-source library (components, templates, etc.) and our API. We have a website (https://react-print.onedoclabs.com/) dedicated to the open-source library where we post the templates submitted by the community. Some early power users built simple web apps (CV/Resume generator, NDA and Invoice generator). We are excited to show our product to the HN community and look forward to your feedback!

185 comments

[+] egnehots|2 years ago|reply
The main issue is conflating templating and pdf generation.

Using HTML-to-PDF solutions allows you to do the templating in HTML, where it is pretty much a solved issue.

And as many said, headless Chrome is a robust HTML-to-PDF solution, even though it feels like a hack.

But, yeah, there seems to be a lack of awareness about these options within corporations. So, kudos to you for addressing a genuine problem!

[+] pedro120|2 years ago|reply
Indeed, we aim at bundling this in a way that makes it easy and obvious for enterprises to build their PDFs that way.
[+] yencabulator|2 years ago|reply
Typst is a typesetting language that makes programmatic layout and processing JSON input pretty darn simple. I make invoices by having a Typst template read line items from a JSON file.

https://github.com/typst/typst

[+] gzapp|2 years ago|reply
There are also a few good options in a lot of languages for streamlining chromium use.

In C# I'd look to use the Playwright library or perhaps even embed Chromium via CefSharp if I were trying to avoid extra processes.

[+] midenginedcoupe|2 years ago|reply
I've also spent much longer than I'd like on this same problem. Having a lightweight-enough service to convert html->pdf on the fly, with good fidelity, and that can create an accessible pdf seems to be impossible.

If you can nail accessible PDFs then you'd open up a very big government market.

[+] AugusteLef|2 years ago|reply
We felt the same, and that's precisely why we built this tool! The key, as you mentioned, is fidelity, especially for designing complex layouts. We hope to bring something new and valuable to the table. And yes, documents are central to many industries including government, legal, banking etc.
[+] matteason|2 years ago|reply
Really interesting product. I do agree that the pricing seems steep ($0.25/document on Pro on the most generous tier) but I don't know enough about pricing B2B products to know if that would be a blocker.

I agree that HTML -> PDF can be a really powerful tool. I worked on the UK government's tool to generate energy efficiency labels for consumer goods [0] and we ended up doing PDF generation with SVG templates, using Open HTML to PDF [1] for the conversion. That ended up working very well, though as you allude to there can be some gotchas (e.g. unsupported CSS features) that you need to work around.

A few questions:

- Do the rendered documents support PDF's various accessibility features?

- How suitable is this for print PDF generation? For example, what version of the PDF spec do you target? What's your colour profile support like? Do you support the different PDF page boxes (MediaBox, CropBox, BleedBox, TrimBox, ArtBox)?

[0] https://github.com/UKGovernmentBEIS/energy-label-service

[1] https://github.com/danfickle/openhtmltopdf

[+] ak217|2 years ago|reply
FYI: the open source state of the art in this area is Playwright (the successor to Puppeteer) with Paged.js (https://pagedjs.org/). I highly recommend that everyone check out and donate to paged.js, it's a fantastic project with lots to like. It certainly blows commercial alternatives like Prince XML out of the water.

That forms a solid foundation that I find it hard to imagine paying for. The things where you might still command a premium are basically safety mechanisms/CI checks/library components that ensure the PDF renders correctly in the presence of variable-length content, etc. as well as maybe PDF-specific features like metadata and fillable forms. Naive ways to format headers, footers, tables/grids/flexboxes etc. often fail in PDFs because of unexpected layout complications. So having a methodology, process, and validation system for ensuring that a mission critical piece of information appears on a PDF in the presence of these constraints could be attractive.

[+] caesil|2 years ago|reply
I think https://github.com/diegomura/react-pdf is closer to what this company is doing.

In fact their open source library, https://github.com/OnedocLabs/react-print-pdf, seems like a higher-level library that sits above react-pdf. Reminds me a lot of the set of react-pdf based components I built for a corporate job where letting users create PDFs was a huge part of the value proposition.

They're solving a really cool problem, actually, because building out into certain difficult use cases like SVG support was a huge pain.

[+] Titou325|2 years ago|reply
We are currently experimenting with this approach. A good thing about paged.js is that we would be able to provide hot-reload and live preview of files without actually converting to PDF.

Your second point is very interesting, seems like some kind of .assert('text').isVisible() API. We may want to dig into that further!

[+] timvdalen|2 years ago|reply
(How) does it handle CMYK and print PDFs? I see images of printed books created by Paged.js, were these post-processed, or printed using a printer that does a best-effort RGB conversion?
[+] Mick-Jogger|2 years ago|reply
Isn't Playwright a testing framework? I am not sure how this solves the use case that Onedoc is aiming for. I would be highly interested in some more background as we are evaluating alternative solutions to PrinceXML right now.
[+] somberi|2 years ago|reply
Useful service and a large problem space. Congrats and all the best. As someone who is a target customer, my 2 cents:

a. If this is a strategic value for my pipeline (and it is), we are going to code it ourselves, only because we can host it inside our fences. Critical customer data and hence.

b. The pricing is way off and is not reflective of the cost or value (for us). Even if it was 1/10th of the prices you charge, it will still be a no-go. At the volumes we have, it makes sense to build this ourselves.

c. SOC2 / ISO27001 - You might want to obtain them asap if you are looking to sell to outsourcing companies or FSG.

[+] AugusteLef|2 years ago|reply
certifications (SOC2 / ISO27001) and offer an on-premise solution! I see there's already a discussion about pricing, so I'll leave that be. However, would an unlimited volume at a fixed cost (and self-host) be an attractive solution? It could be interesting for very high volumes.
[+] HatchedLake721|2 years ago|reply
Curious, with ~$0.005 per document, what volumes do you do that pricing becomes a no-go for you?
[+] Brajeshwar|2 years ago|reply
Maybe this is just me, but this looks extremely costly to me! It will cost $2,500 to generate 50,000 PDFs. Are edits/corrections an additional cost?
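For reference, the arithmetic behind that figure can be checked in a few lines. This is only a sketch: the $2,500 and 50,000 figures come from the comment above, the $0.005 rate is the one quoted elsewhere in the thread, and the function names are made up for illustration.

```python
def per_document(total_usd: float, documents: int) -> float:
    """Average price per document for a given total spend."""
    return total_usd / documents


def total_cost(rate_usd: float, documents: int) -> float:
    """Total spend at a flat per-document rate."""
    return rate_usd * documents


# Figures quoted in the comment: $2,500 for 50,000 PDFs.
print(f"${per_document(2500, 50_000):.3f} per document")

# The same volume at the $0.005 per-document rate quoted elsewhere in the thread.
print(f"${total_cost(0.005, 50_000):.2f} total")
```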
[+] jot|2 years ago|reply
It sounds like this is as advanced as DocRaptor[1]. They have what I consider to be the best PDF generation API, giving complete control over the documents you need to create. The pricing is similar.

If you'd rather do it for free weasyprint[2] is the best open source alternative.

Another more affordable option you might want to consider is Urlbox[3]. (Disclosure: I work on this)

Urlbox's rendering engine is based on Chrome. It's been refined over the last 11 years to render pages as images or PDFs[4] that look great. I was a customer for 5 years before I joined the team. Everything we'd tried before Urlbox was a disappointment.

Urlbox probably can't match the power of either Onedoc or DocRaptor, but pricing starts at less than $0.01 per document and drops significantly with scale. If your PDF looks great when saving as PDF in Chrome it should look identically brilliant with Urlbox.

[1]: https://docraptor.com [2]: https://weasyprint.org [3]: https://urlbox.com [4]: https://urlbox.com/html-to-pdf

[+] Titou325|2 years ago|reply
This is a good point, and we are still trying to figure out how to price things fairly. Depending on the type of PDF, whether it is a simple receipt or a large multi-page report, associated costs are very different on our side. At this time, we rely on other proprietary software that we are aiming to replace but that incurs high costs on our side as well.

Edits and corrections on generated PDFs are not provided as the PDFs are signed as-is; however, you can attach the metadata to the PDF and rerender with the modifications.

[+] snadal|2 years ago|reply
I second this. Maybe I'm missing something in the value proposition, but we already generate PDFs from .docx/.html templates using open source libraries and Docker microservices.

Do not misunderstand. A Stripe for generating PDFs can be great, but for a small team, $0.50/PDF is way more than I can afford (after all, you can create a small number of PDFs without too much fuss). Maybe you are oriented towards large companies?

[+] adnans|2 years ago|reply
We use https://www.api2pdf.com/pricing/ and it's priced per bandwidth and usage - ($.001 per mb bandwidth and $0.00019551 per second of computation)

You can choose which API to use: Headless Chrome, Wkhtmltopdf, Libreoffice, etc.
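A rough way to compare this usage-based model with per-document pricing is to estimate the cost of one render. The sketch below uses the two rates quoted above; the job size (2 MB, 3 seconds) and the function name are assumptions chosen purely for illustration, not measurements.

```python
# Rates as quoted in the comment above.
USD_PER_MB = 0.001
USD_PER_SECOND = 0.00019551


def estimate_cost(megabytes: float, seconds: float) -> float:
    """Estimated cost of one render: bandwidth charge plus compute charge."""
    return megabytes * USD_PER_MB + seconds * USD_PER_SECOND


# Hypothetical job: a 2 MB PDF that takes 3 seconds to render.
cost = estimate_cost(2.0, 3.0)
print(f"${cost:.6f} per document")
```

Under these assumed inputs the bandwidth term dominates, so document size matters more than render time for the total.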

[+] Leoko|2 years ago|reply
I had to deal a lot with PDF generation over the past few years and I was very unhappy with the eco-system that was available:

1. HTML-to-PDF: The web has a great layout system that works well for dynamic content. So using that seems like a good idea. BUT it is not very efficient as a lot of these libraries simply spin up a headless browser or deal with virtual doms.

2. PDF Libraries (like jsPDF): They mostly just have methods like ".text(x, y, string)", which is an absolute pain to work with when building dynamic content or creating complex layouts.

This was such a pain point in various projects I worked on that I built my own library that has a component system to build dynamic layouts (like tables over multiple pages) and then computes that down to simple jsPDF commands. Giving you the best of both worlds.

Hope this makes somebody's life a bit easier: https://github.com/DevLeoko/painless-pdf
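To illustrate the general idea of computing a layout down to absolute-position text commands, here is a minimal Python sketch. It is not how painless-pdf is implemented: the `TextCommand` type, the `layout_table` function, and the page dimensions are all hypothetical, and a real library would also handle fonts, wrapping, and cell heights.

```python
from dataclasses import dataclass


@dataclass
class TextCommand:
    """One low-level drawing call, analogous to text(x, y, string)."""
    page: int
    x: float
    y: float
    text: str


def layout_table(rows, col_x, page_height=100.0, margin=10.0, line_height=10.0):
    """Turn table rows into text commands, starting a new page when a row won't fit."""
    commands = []
    page = 1
    y = margin
    for row in rows:
        # Break to the next page if this row would cross the bottom margin.
        if y + line_height > page_height - margin:
            page += 1
            y = margin
        # Each cell is placed at its column's fixed x offset.
        for x, cell in zip(col_x, row):
            commands.append(TextCommand(page, x, y, str(cell)))
        y += line_height
    return commands


# Ten two-column rows; with the default sizes only eight rows fit per page.
rows = [(f"Item {i}", i * 10) for i in range(1, 11)]
for cmd in layout_table(rows, col_x=[10.0, 60.0]):
    print(cmd)
```

The caller describes rows and columns, and the page-break bookkeeping stays in one place instead of being repeated at every drawing call.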

[+] chrisfinazzo|2 years ago|reply
Is there a reason you didn't consider something like Weasyprint?

https://weasyprint.org

Going all the way down to raw HTML is a bit verbose, but with almost anything I've thrown at it - CV's, business cards, you name it - it hasn't let me down yet.

[+] Crowberry|2 years ago|reply
I'm with you..

We ended up writing a similar wrapper around the https://github.com/jung-kurt/gofpdf library. We haven't open sourced it yet. But it's made it a lot easier to deal with rendering a PDF, especially over page breaks etc.

[+] Gualdrapo|2 years ago|reply
It seems TeX/LaTeX is a major inspiration in this, though there is some room for improvement in details like hyphenation, expansion/protrusion and microtypography. Not sure if/how a web engine can reach those points, but still it seems this has a potential niche and market outcome, so congrats.

Though personally I wish stuff like ConTeXt was more popular and approachable - to my humble knowledge their Lua backend seems to have huge potential, I am doing my invoices with ConTeXt/Lua.

[+] Titou325|2 years ago|reply
It definitely is! Typesetting quality was the main reason we chose not to go down the Puppeteer/headless browser route but rather use a completely separate engine where typography is a first-class citizen.

We like LaTeX, but even for advanced users laying things out can be a difficult thing. Given that documents are a frontend, we wanted to bring the same tools frontend developers already use.

[+] kornhucker|2 years ago|reply
Super interesting and potentially a fit for a project I'm working on right now. What are the benefits of going this route vs styling your page for print (ex. tailwind print modifier) and relying on the browser's print dialogue?
[+] petern81|2 years ago|reply
This is a good problem to tackle. The hours I've sunk...
[+] Sytten|2 years ago|reply
Definitely a problem I experienced. Big fan of browserless.io. Though I didn't see any comment on the biggest problem in this space: SSRF.

Most HTML-to-PDF tools are deeply insecure and I am more than happy to pay someone else to deal with isolation and security. Report generators are often used to leak cloud secrets via the metadata API.
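As a rough illustration of one mitigation, a renderer can refuse to fetch any resource whose address is internal, which covers the link-local address that cloud metadata services typically use (169.254.169.254). This is a minimal sketch using only the Python standard library, not a complete defence: the function names are hypothetical, and `resolved_ip` is assumed to come from the caller's own DNS lookup.

```python
import ipaddress
from urllib.parse import urlsplit


def is_public_ip(ip: str) -> bool:
    """Return True only for addresses that are not internal or special-purpose."""
    addr = ipaddress.ip_address(ip)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:169.254.169.254) before checking.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def is_allowed_fetch(url: str, resolved_ip: str) -> bool:
    """Allow only http(s) URLs whose resolved address is public."""
    if urlsplit(url).scheme not in ("http", "https"):
        return False
    return is_public_ip(resolved_ip)


print(is_allowed_fetch("http://169.254.169.254/latest/meta-data/", "169.254.169.254"))
print(is_allowed_fetch("https://example.com/logo.png", "8.8.8.8"))
```

A check like this only helps if it is repeated on every redirect and if the renderer connects to the exact address that was checked; otherwise a second DNS lookup can return a different, internal address. Network-level isolation of the rendering process is still worth having on top.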

[+] AugusteLef|2 years ago|reply
True. Security is a significant concern, and in our discussions with businesses, we realised that most of them do not want any kind of data leaving their own systems. This is especially true in the biotech/healthcare industry, but also in legal and banking. That's why we're considering an on-premises solution for the future (as we're focusing on B2B). However, I assume most people were talking about personal use cases or non-sensitive documents, hence the fact that no one mentioned SSRF (yet ;)).
[+] throw03172019|2 years ago|reply
Also big fan of browserless. Do you run it yourself? We run the browserless docker containers on prem.
[+] marceldegraaf|2 years ago|reply
We're using Gotenberg[1] to convert a rendered web page (with Elixir/Phoenix, in our case) to PDF. Works like a charm and we can use our existing frontend code/styling (including SVG graph generators) which is a huge bonus.

1: https://gotenberg.dev/

[+] Titou325|2 years ago|reply
We actually experimented with Gotenberg! Ultimately it is a layer on top of Chromium for conversion and we were dissatisfied with the results. I am curious as to how you are handling assets and other static media / attachments: do you embed everything in a single HTML file or do you use some kind of bucketing system to resolve URLs?
[+] ffpip|2 years ago|reply
Love the demo on the homepage with the render button. Really helps explain the product!
[+] bbryanj23|2 years ago|reply
Congrats on the launch. I was a user of htmldocs back in the day, good to see more products in the space.

One of the features I wish I had with htmldocs was the ability to automatically store generated documents in my own S3. I'd rather not introduce another cloud to my data stack just to host PDFs.

[+] AugusteLef|2 years ago|reply
Thanks! We are looking to extend our set of features and integrations; offering self-storing on S3 could definitely be one of them! Good call.
[+] staffors|2 years ago|reply
I see that you support page breaks and headers and footers and stuff which is very cool. Is there some form of widow/orphan control when text wraps from one page to the next? How do you handle things like a large table that is longer than the length of a page?
[+] staffors|2 years ago|reply
Also, do you support different paper sizes (A4 and Letter)?
[+] fasteddie31003|2 years ago|reply
Is this just a wrapper around Puppeteer that renders a pdf? I do this currently with an AWS lambda that has a chrome-aws-lambda layer.
[+] Titou325|2 years ago|reply
We use a dedicated HTML to PDF engine (such as PrinceXML) rather than building on top of a browser. Main issue with browser-backed implementations is that PDFs are often of subpar quality. However, the main good thing is you can rely on the latest CSS features.

In the end, the decisive factor was support for the PrintCSS and Paged Media specifications, which have been completely discarded by major vendors and only implemented by specific engines.

[+] winter-day|2 years ago|reply
Congrats! My career has also revolved around PDF generation (once for federal compliance at large companies, second for scrubbing data from PDFs for HIPAA compliance and then generating a new PDF based on the scrubbed data). I think I've seen your tool around. For the first, I ended up creating a workflow that generated LaTeX scripts and then converted them to PDFs, and for the second, a Python library. The most difficult aspect for our tools was formatting: the PDFs were generally 60-100 pages and tables could show up anywhere and break the page/formatting. Quite curious to see how your company will grow, good luck!
[+] DutchHugo|2 years ago|reply
Curious, which python library did you use to convert to PDFs? currently looking into a couple options myself
[+] cxr|2 years ago|reply
Why does your dev-local repo[1] README have a link that's described as being the Adobe PDF viewer extension for VS Code but actually links to an extension that uses pdf.js by a company called Mathematic[2]?

1. <https://github.com/OnedocLabs/dev-local>

2. <https://marketplace.visualstudio.com/items?itemName=mathemat...>

[+] johnsonjo|2 years ago|reply
Well, I'm not entirely sure why they did that. Adobe is the original creator of the PDF format [1], according to the Wikipedia article on PDFs, which might mean they meant something more like a viewer for *Adobe's PDF format* rather than *Adobe's viewer* for PDFs.

[1]: https://en.wikipedia.org/wiki/PDF

[+] AugusteLef|2 years ago|reply
Sorry for the misunderstanding, we indeed "meant something more like a viewer for Adobe's PDF format rather than Adobe's viewer for PDFs.". We will make sure to change the wording. Thank you!
[+] Crowberry|2 years ago|reply
This looks really interesting! One of the main reasons we've opted to write more complex rendering code is speed. We're getting around 500ms for a single document, which is (last I tested) quicker than any headless Chrome setup.

How long does it take to render using your API? :)

[+] pedro120|2 years ago|reply
Rendering time scales with the length / complexity of the document. At the moment, our self-serve API renders slower than a headless Chrome setup. We are working on speeding this up as it is currently on the order of seconds.
[+] baggy_trough|2 years ago|reply
This is definitely a somewhat painful process. I have done it with puppeteer / chromium on Debian, and it works very well after the headache of figuring it out. Having to pay 50 cents per PDF and deal with a 3rd party vendor would not provide value for our needs.
[+] AugusteLef|2 years ago|reply
We've updated our pricing, and it can go as low as $0.005 per document. True, you'll still need to work with a third-party vendor, but isn't it worth considering if the features are competitive and the interface is user-friendly? It would be interesting to know what might convince you to switch from Puppeteer to another solution—or if you're completely satisfied and wouldn't switch regardless of the offerings, which is perfectly fine.