Pdf2htmlEX – Convert PDF to HTML without losing text or format

[+] chill1|13 years ago|reply

I've actually been using this to convert large PDF files to HTML to be displayed in-browser. It's for my work, so I don't feel comfortable posting a link to the demo instance here.

It is definitely the best solution I've found so far. The outputted HTML / CSS / images look almost identical to the source PDF. That being said, there are a few issues still:

* One gigantic (600 KB) CSS file from a single PDF

* Hundreds of individual fonts

* HTML semantics are non-existent

These are all relatively easy to fix, I believe. I have found my own solutions to most of the issues in post-processing.

Kudos to you, coolwanglu. Also, I'd like to get in touch with you about lending a hand to fix some of the issues I've encountered.

Thanks for a cool piece of software!

[+] roel_v|13 years ago|reply

" * HTML semantics are non-existent

These are all relatively easy to fix, I believe. "

How? For example, how would you classify the <span>s (or whatever this converter uses) as headers, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, for which even the 'simple' approximation (statistics) is very hard due to the wide variety of inputs (a model trained on a corpus of multi-column journal articles will most likely not work at all for books, although I haven't tried and would love to be proven wrong).
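To make the 'simple' statistical approach concrete, here is a naive sketch, not anything the converter does: treat the font size that covers the most characters as body text and flag anything noticeably larger as a heading. The `(text, font_size)` input shape, the function name and the `ratio` threshold are all assumptions for illustration, and it says nothing about page headers/footers or a ToC.

```python
from collections import Counter


def guess_headings(spans, ratio=1.2):
    """Return the texts whose font size is at least `ratio` times the body size.

    `spans` is a list of (text, font_size) tuples (a hypothetical input shape).
    The body size is the font size that covers the most characters.
    """
    if not spans:
        return []
    # Weight each font size by character count so running text dominates.
    weights = Counter()
    for text, size in spans:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]
    return [text for text, size in spans if size >= ratio * body_size]
```

A threshold tuned on one kind of document (say, journal articles) will likely misfire on another, which is exactly the difficulty described above.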

Use case: a working (i.e., preserving semantics) pdf-to-epub converter. This would, imho, be a killer product / service.

[+] coolwanglu|13 years ago|reply

Hey thanks for the info!

The 2nd and 3rd are in the future plans, as I'm still working on accuracy and speed. Issue #115 (https://github.com/coolwanglu/pdf2htmlEX/issues/115) is about the 2nd one.

As for the first one, I don't have an elegant solution yet. Maybe a CSS file per page?

Please file new issues at GitHub if you think it's necessary :)

[+] ComputerGuru|13 years ago|reply

Can anyone recommend an equally good opposite (HTML to PDF)?

wkhtmltopdf [0] is probably the most popular, but it's also ridiculously buggy.

0: https://code.google.com/p/wkhtmltopdf/

[+] SigmundA|13 years ago|reply

http://phantomjs.org/ is the best so far in my experience since it handles all the client side javascript properly.

The PDFs it outputs are full vector, not just rasters. From my understanding, it's the same engine Chrome uses to view PDFs and print web pages.

[+] syaramak|13 years ago|reply

Not open source, but you might want to check out PrinceXML. It is really good.

[+] ars|13 years ago|reply

Print the HTML document to a postscript printer, but have it print to file.

Then use ps2pdf from ghostscript.

You can automate this with a small amount of work.
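A rough sketch of that automation, assuming Ghostscript's `ps2pdf` is on the PATH and the print-to-file step has already left `.ps` files in a directory. The function names and directory layout are made up for the example.

```python
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path):
    """Build the ps2pdf command line; the PDF is written next to the input."""
    ps_path = Path(ps_path)
    return ["ps2pdf", str(ps_path), str(ps_path.with_suffix(".pdf"))]


def convert_all(directory):
    """Run ps2pdf on every .ps file in `directory` (needs Ghostscript installed)."""
    for ps_file in sorted(Path(directory).glob("*.ps")):
        # check=True raises CalledProcessError if ps2pdf exits with an error.
        subprocess.run(ps2pdf_command(ps_file), check=True)
```

Calling `convert_all("printed")` would then turn each `printed/*.ps` into a PDF with the same base name.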

[+] columbo|13 years ago|reply

Flying Saucer worked great for me: http://code.google.com/p/flying-saucer/

[+] pchivers|13 years ago|reply

I've had good results with htmldoc (http://www.msweet.org/projects.php?Z1).

[+] Samuel_Michon|13 years ago|reply

In OS X, you can print to PDF from every application.

[+] AndreasFrom|13 years ago|reply

This works and displays correctly, but it is unbearably slow on an iPad 2, whereas the PDF loads instantly. What is the point, then? Or does it work a lot better in desktop browsers?

[+] SigmundA|13 years ago|reply

The OS X Quartz graphics layer (also used in iOS) uses PDF internally as its graphics object model.

It is no surprise that iOS renders PDFs so quickly and so well, without the need for a third-party app; it always has, since the release of the first iPhone. This is also why print to PDF is built into OS X.

[+] coolwanglu|13 years ago|reply

I've heard that careful optimization on the server side and some clever JS may solve this. So far the default UI just demonstrates the ability to read while downloading.

The idea is that the document now becomes more controllable and accessible: say, you can put Google Analytics in your resume written in LaTeX, or build a social reading service where you can comment, annotate and share.

Unlike PDF viewers, web browsers were never optimized for this kind of messy input. The next version of pdf2htmlEX will focus on optimizations, e.g. smaller background images; hopefully that will help.

[+] tuananh|13 years ago|reply

Scrolling is laggy (rMBP 15 default spec) but usable.

[+] pjmlp|13 years ago|reply

Slow as well on a dual-core P8700 with 8 GB RAM.

[+] crazygringo|13 years ago|reply

Interesting. So it converts all vector graphics to a background image per page, but keeps all text as browser-rendered on top of it.

I guess I don't really see much practical purpose for it -- most browsers these days seem perfectly fine opening PDF files natively, after all. But it's a very cool technological demonstration.

Maybe this could be some kind of bridge tool for generating sites with fancy typographical layout? You could use Adobe Illustrator etc. to do fancy column work, drop caps, hyphenation, all that jazz -- and then "render" into HTML. It would certainly be as anti-"responsive" as you can get, but it would certainly have the ability to generate more advanced typography much faster than you can produce with HTML/CSS by hand.

[+] train_robber|13 years ago|reply

It definitely has a few practical purposes. I have used this for a website for a small magazine. Their issue was that they didn't have the resources to design for the web. This was a good solution: they just needed to upload a PDF once an issue was out. And it provides a bit more flexibility than other PDF viewers: organize by articles, add social sharing, commenting per page/article, etc.

[+] altrego99|13 years ago|reply

As a practical purpose, how about being able to edit a PDF document? I understand that it can be done through some other tools, but this is one more - and would be free and easy.

Convert to HTML -> Edit -> Print back to PDF (if needed)

[+] coolwanglu|13 years ago|reply

It's for embedding, when you want to control the document or access the content.

Say you have a resume written in LaTeX and you want to insert Google Analytics inside?
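A minimal sketch of that kind of post-processing, assuming the converted output is a single HTML file with a closing `</head>` tag. The function name and the snippet are placeholders, not part of pdf2htmlEX.

```python
import re


def inject_before_head_close(html, snippet):
    """Insert `snippet` immediately before the first closing </head> tag."""
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> tag found")
    # Splice the snippet in at the position where </head> starts.
    return html[:match.start()] + snippet + html[match.start():]
```

For example, `inject_before_head_close(page, "<script>/* tracking code */</script>")` returns the page with the script placed at the end of the head; the actual tracking code would come from your analytics provider.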

[+] dannyrough|13 years ago|reply

I do this almost daily. I use a PDF converter driver found on the internet. Install it and it becomes a selectable converter option. Then you can convert PDFs to many formats from any program, including Adobe Acrobat. Just open a PDF, select convert, and choose the format you want; the task finishes in several seconds. If you haven't found a good option, give it a try. Best wishes. http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-...

[+] _DiskError|13 years ago|reply

Question: does your public folder periodically delete files? I accidentally uploaded something confidential and it seems to be gone. I was wondering whether this was a manual deletion or it just expired, since files that were uploaded around the same time are still there.

[+] alcuadrado|13 years ago|reply

Can't Mozilla's pdf.js be used to get the same result? Great results anyway!

[+] coolwanglu|13 years ago|reply

You don't want to rely on the computing power at the client side, do you? :)

[+] chucknelson|13 years ago|reply

Promising start. Hopefully performance improves with each release.

[+] coolwanglu|13 years ago|reply

Right, that is on the schedule. I've heard enough complaints, in a good way.

[+] Dnguyen|13 years ago|reply

I didn't see any mention of tables in the doc. Does this mean it's outside of the "good enough" range? Table extraction would be a great feature.

[+] coolwanglu|13 years ago|reply

It's still an early-stage project, so currently it's focused on accurate rendering and on speed (which has not been achieved yet).

Recognition features are planned for the future. PDF viewers usually don't recognize many things either, do they? :)

[+] chedar|13 years ago|reply

Have you tried using tabula? https://github.com/jazzido/tabula

[+] rcfox|13 years ago|reply

How did you manage to get Mediafire to host your demo?

[+] coolwanglu|13 years ago|reply

MF uses pdf2htmlEX :) And it also provides a public folder and a public dropbox <- I really like that.
