I've actually been using this to convert large PDF files to HTML to be displayed in-browser. It's for my work, so I don't feel comfortable posting a link to the demo instance here.
This works and displays correctly, but it is unbearably slow on an iPad 2, whereas the PDF loads instantly. What is the point, then? Or does it work a lot better in desktop browsers?
The OS X Quartz graphics layer (also used in iOS) uses PDF internally as its graphics object model.
I have heard that careful optimization on the server side plus some clever JS may solve this. So far the default UI just demonstrates the ability to read while downloading.
Interesting. So it converts all vector graphics to a background image per page, but keeps all text as browser-rendered on top of it.
It definitely has a few practical purposes. I have used this on a website for a small magazine. Their issue was that they didn't have the resources to design for the web. This was a good solution: they just needed to upload a PDF once an issue was out. It also provides a bit more flexibility than other PDF viewers: organize by article, add social sharing, commenting per page/article, etc.
As a practical purpose, how about being able to edit a PDF document? I understand that it can be done through some other tools, but this is one more - and would be free and easy.
I do this almost daily. I use a PDF converter driver found on the internet. Install it and it becomes a selectable converter option. Then you can convert PDFs to many formats from any program at all, including Adobe Acrobat. Just open a PDF, select convert, and choose the format you want; the task finishes in several seconds. If you haven't found a good option yet, you can give it a try. Best wishes. http://www.rasteredge.com/how-to/csharp-imaging/pdf-convert-...
Question, does your public folder periodically delete files?
I accidentally uploaded something confidential and it seems to be gone. I was wondering if this was a manual deletion or just expired since I still see files that were uploaded around the same time still there.
chill1|13 years ago
It is definitely the best solution I've found so far. The output HTML / CSS / images look almost identical to the source PDF. That being said, there are still a few issues:
* One gigantic (600 KB) CSS file from a single PDF
* Hundreds of individual fonts
* HTML semantics are non-existent
These are all relatively easy to fix, I believe. I have found my own solutions to most of the issues in post-processing.
Kudos to you, coolwanglu. Also, I'd like to get in touch with you about lending a hand to fix some of the issues I've encountered.
Thanks for a cool piece of software!
roel_v|13 years ago
"These are all relatively easy to fix, I believe."
How? For example, how would you identify which <span>s (or whatever this converter uses) are headers, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, for which even the 'simple' approximation (statistics) is very hard due to the wide variety in inputs (a model trained on multi-column journal articles will most likely not work at all for books, although I haven't tried and would love to be proven wrong).
Use case: a working (i.e., preserving semantics) pdf-to-epub converter. This would, imho, be a killer product / service.
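To make the 'simple' statistical approximation concrete, here is a minimal Python sketch. It assumes you have already extracted `(text, font_size)` pairs from the converted HTML; that input format, the function name `guess_headings`, and the 1.2 ratio are all made up for illustration and are not anything the converter provides. It treats the font size that covers the most characters as body text and flags anything noticeably larger as a heading candidate.

```python
from collections import Counter

def guess_headings(spans, ratio=1.2):
    """Return the texts whose font size is at least `ratio` times the body size.

    `spans` is a list of (text, font_size) tuples (a hypothetical input format).
    The body size is the font size that covers the most characters.
    """
    if not spans:
        return []
    # Weight each font size by character count so body text dominates.
    weights = Counter()
    for text, size in spans:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]
    return [text for text, size in spans if size >= ratio * body_size]

spans = [
    ("1. Introduction", 18.0),
    ("PDF is a page description format, not a semantic one.", 10.0),
    ("Page 3", 8.0),
    ("2. Method", 18.0),
    ("We render each page and keep the text selectable.", 10.0),
]
print(guess_headings(spans))  # ['1. Introduction', '2. Method']
```

A fixed ratio like this is exactly the kind of rule that breaks on a different class of document (pull quotes, title pages, multi-column layouts), and it does nothing for page headers/footers or a ToC, which is the point above.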
coolwanglu|13 years ago
The 2nd and 3rd are planned for the future, as I'm still working on accuracy and speed. Issue #115 (https://github.com/coolwanglu/pdf2htmlEX/issues/115) is about the 2nd one.
As for the first one, I don't have an elegant solution yet. Maybe a CSS file per page?
Please file new issues at GitHub if you think it's necessary :)
ComputerGuru|13 years ago
wkhtmltopdf [0] is probably the most popular, but it's also ridiculously buggy.
0: https://code.google.com/p/wkhtmltopdf/
SigmundA|13 years ago
The PDFs it outputs are full vector, not just rasters. From my understanding, it is the same engine used in Chrome to view PDFs and print web pages.
syaramak|13 years ago
ars|13 years ago
Then use ps2pdf from Ghostscript.
You can automate this with a small amount of work.
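A minimal sketch of that automation in Python, assuming the PostScript files already exist in one directory and that Ghostscript's `ps2pdf` is on your PATH. The helper names `ps2pdf_command` and `convert_all` are made up for this example.

```python
import subprocess
from pathlib import Path

def ps2pdf_command(ps_path):
    """Build the ps2pdf argument list for one PostScript file."""
    ps_path = Path(ps_path)
    # ps2pdf takes the input file followed by the output file.
    return ["ps2pdf", str(ps_path), str(ps_path.with_suffix(".pdf"))]

def convert_all(directory):
    """Run ps2pdf on every .ps file in `directory` and return the PDF paths."""
    outputs = []
    for ps_file in sorted(Path(directory).glob("*.ps")):
        cmd = ps2pdf_command(ps_file)
        subprocess.run(cmd, check=True)  # raises if ps2pdf fails or is missing
        outputs.append(Path(cmd[2]))
    return outputs
```

Calling `convert_all("some/folder")` converts each `.ps` file in place and writes the PDF next to it.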
columbo|13 years ago
pchivers|13 years ago
Samuel_Michon|13 years ago
AndreasFrom|13 years ago
SigmundA|13 years ago
It is no surprise that iOS renders PDFs so quickly and so well, without the need for a third-party app; it always has, ever since the release of the first iPhone. This is also why print-to-PDF is built into OS X.
coolwanglu|13 years ago
The idea is that the document now becomes more controllable and accessible. Say, you can put Google Analytics in your resume written in LaTeX, or build a social reading service where you can comment, annotate and share.
Unlike PDF viewers, web browsers were never optimized for this kind of messy input. The next version of pdf2htmlEX will focus on optimizations, e.g. smaller background images; hopefully that will help.
tuananh|13 years ago
pjmlp|13 years ago
crazygringo|13 years ago
I guess I don't really see much practical purpose for it -- most browsers these days seem perfectly fine opening PDF files natively, after all. But it's a very cool technological demonstration.
Maybe this could be some kind of bridge tool for generating sites with fancy typographical layout? You could use Adobe Illustrator etc. to do fancy column work, drop caps, hyphenation, all that jazz -- and then "render" into HTML. It would certainly be as anti-"responsive" as you can get, but it would certainly have the ability to generate more advanced typography much faster than you can produce with HTML/CSS by hand.
train_robber|13 years ago
altrego99|13 years ago
Convert to HTML -> Edit -> Print back to PDF (if needed)
coolwanglu|13 years ago
Say you have a resume written in LaTeX and you want to insert Google Analytics inside?
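As a rough illustration of that use case, here is a small Python sketch that inserts a tracking snippet into the converted HTML just before the closing `</head>` tag. The function name `inject_into_head` and the example.com script URL are placeholders invented for this example; you would paste in the real snippet from your analytics provider.

```python
import re

def inject_into_head(html, snippet):
    """Insert `snippet` just before the first closing </head> tag.

    The match is case-insensitive. Raises ValueError if there is no </head>.
    """
    match = re.search(r"</head\s*>", html, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> tag found")
    return html[:match.start()] + snippet + "\n" + html[match.start():]

# Hypothetical placeholder, not a real analytics snippet.
TRACKING_SNIPPET = '<script async src="https://example.com/analytics.js"></script>'

page = "<html><head><title>Resume</title></head><body>...</body></html>"
print(inject_into_head(page, TRACKING_SNIPPET))
```

Since the output is plain HTML, this kind of edit is simple string post-processing, which is not possible with the PDF itself.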
dannyrough|13 years ago
_DiskError|13 years ago
alcuadrado|13 years ago
coolwanglu|13 years ago
chucknelson|13 years ago
coolwanglu|13 years ago
Dnguyen|13 years ago
coolwanglu|13 years ago
Recognition features are planned for the future. PDF viewers usually do not recognize many things either, do they? :)
chedar|13 years ago
rcfox|13 years ago
coolwanglu|13 years ago
This means that you can create one of your own.
est|13 years ago
v-yadli|13 years ago