Have you thought about the reverse, i.e., a tool that could convert pdfs to html faithfully?
I would be willing to pay money for a reliable tool that didn't need much manual editing after processing.
Unfortunately, the pdftohtml project (http://pdftohtml.sourceforge.net/) has been inactive, and the current version has trouble with even moderately complex layouts.
That's a non-trivial task. PDF has no objects such as tables, styles, lists, or paragraphs, so you would need to reconstruct this kind of information. Also, text and vector graphics are positioned absolutely. Tagged PDFs contain some meta-information about the document structure, which could help, but it is still a lot of work.
The fundamental problem is that PDF stores the document's presentation, while HTML defines the document and leaves the presentation to the browser. And obviously, restoring a document definition from its presentation is hard, as a lot of information is missing.
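To make the reconstruction problem concrete, here is a minimal sketch of just the first step: turning absolutely positioned text fragments back into lines. The `Fragment` type, the `group_into_lines` helper, and the `y_tolerance` value are all hypothetical names and numbers chosen for illustration; a real converter would also have to deal with fonts, columns, rotation, and much more.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text placed at an absolute position (hypothetical model)."""
    x: float
    y: float
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines into lines of text.

    Assumes the PDF convention that y grows upward, so a larger y is
    higher on the page. y_tolerance is an arbitrary illustrative value.
    """
    lines = []  # each entry: (baseline_y of first fragment, [fragments])
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    # Within each line, order fragments left to right and join them.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]
```

Even this only recovers lines. Deciding which lines form a paragraph, a list item, or a table cell is where the missing information really hurts.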
I can easily see a use for this. I'm doing a pro bono project for a small nonprofit, and part of the project requires generating simple PDF reports. They don't have any money, so we need to keep it low cost.
One of the ways of doing this is to host it on a simple shared server (it's not a heavily used app).
The downside is that it's unlikely we'll be able to use any of the PDF tools I've used in the past (since they need to be installed). This service should work fine for our purposes.
Thanks, I was wondering how I'd get around this.
To all those who were dissing this because they couldn't immediately see a use for it, try to have a more open mind.
I'm also developing an HTML->PDF feature and jumped when I saw this! I tried smashingmag.com, which is funny because I meant smashingmagazine.com, but smashingmag.com is actually some Japanese site. Either way, I got back a totally blank PDF. Maybe the Japanese character set is to blame?
One other caveat: having the ability to view Flash would be awesome as well. The main function of PDF, as I understand it, is to create a document that PRINTS completely identically on every setup, so people are frequently going to want to print Flash, which is already a huge pain in the ass. Unfortunately, it looks like it blanks out completely if there is Flash on the page (2advanced.net).
If you could solve that, I would start paying tomorrow.
The quality of http://www.princexml.com is amazing. It's not open source and there is a cost to use it commercially (<1K, if I recall). I used it to convert my HTML documentation for Sleep into a camera-ready PDF.
Nice execution - as per the comment below, something like this would've saved me lots of manual fiddling back when I was doing lots of PDF stuff.
Given the focus on APIs I guess you're aiming it at those wanting to programmatically generate PDFs using a familiar markup, rather than conversion of existing (static) content into PDF? If so, maybe investigate the ability to overlay rendering onto an existing PDF template at some point - in my experience it's been a common requirement (think form letters, account statements, etc).
Interesting that it appears to execute Javascript; guess it's a sign of the times that you need to in order to render many sites correctly nowadays. I haven't poked it too hard, but suspect there might be one or two security challenges there...
Well, your default HTML generates one screwy PDF. When viewed in Mac OS X Preview, I get the text "T pe our HTML here..." Then, when I select the text, certain letters get partially removed or overwritten and I end up with gibberish.
I've just spent weeks working on HTML -> PDF conversion code, so I know it's not just my viewer. I've put all kinds of crazy stuff through there.
There is no doubt that many developers will use wkhtmltopdf.
I think Pdfcrowd's selling points could be: 1) wide availability: only HTTP is needed, so in theory it can be used on any platform; 2) no need to install any third-party software, which makes applications more portable; 3) API bindings.
We used an HTML->PDF conversion service (I believe it was http://www.htm2pdf.co.uk/ but I'm not positive) for a while to do billing, and our biggest problem was that it went down all the time. We ended up purchasing a (pretty cheap) license to a Java library that does PDF generation for us and is pretty easy to use. This is definitely a service that people will pay money to use - best of luck to you!
I don't know the exact status of how WebKit handles these properties. I know that at least "page-break-after: always" works since that is what I use when the user clicks the 'Insert Page Break' button in the editor (http://pdfcrowd.com/editor/).
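For anyone generating the HTML programmatically, here is a minimal sketch of how that property could be used to force breaks between sections. The `join_with_page_breaks` helper is a made-up name for illustration, and whether a given renderer honors the property is something you would need to check; the only claim from above is that "page-break-after: always" works here.

```python
# An empty block whose only job is to end the current page.
PAGE_BREAK = '<div style="page-break-after: always"></div>'


def join_with_page_breaks(sections):
    """Join HTML fragments so that each one starts on a new page.

    `sections` is a list of HTML strings. No break is emitted after the
    last section, to avoid a trailing blank page.
    """
    return PAGE_BREAK.join(sections)


html = join_with_page_breaks(["<h1>Invoice</h1>", "<h1>Terms</h1>"])
```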
NICE!
You have beaten me (and, I am sure, a dozen other hackers) to the realization of this idea...
Here is an idea for an extra feature: make a print bookmarklet -- clicking on it you get a nice PDF version of the page you are viewing right now. I can't stand Firefox's print renditions of some pages... terrible...
(Also, you might want to set the page size to Letter or A4 depending on the geolocation of your visitor's IP address.)
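A rough sketch of that page-size idea, assuming the IP lookup has already been done by something else and gives you an ISO country code. The `page_size_for_country` function and the `LETTER_COUNTRIES` set are hypothetical, and the set is deliberately short and illustrative, not a complete list of Letter-using countries.

```python
# Illustrative, incomplete set of countries where Letter is the usual size.
LETTER_COUNTRIES = frozenset({"US", "CA", "MX"})


def page_size_for_country(country_code):
    """Pick a default page size from an ISO 3166-1 alpha-2 country code.

    Falls back to A4 when the code is missing or not in the set.
    """
    if country_code and country_code.upper() in LETTER_COUNTRIES:
        return "Letter"
    return "A4"
```

Treat the result as a default only; the user should still be able to override it.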
I notice there are some questions about how to make money. One may be to position yourself as a way to get PDF reports generated from phone apps, in which case you may want to do per app licensing and provide facilities for email delivery of PDFs.
I could see this being useful for porting apps from iPhone (which can easily generate PDFs) to Android (which does not appear to support PDF output).
I used this for a major company's site-edit auditing system. (No, they didn't want HTML snapshots of each revision. It had to be a screenshot of the browser...)
It works really well. The only quirk is that it needs a fake X server (for font loading), but Xvfb works just fine for that.
The PDF conversion is awesome! I just tried printing http://times.com/ to a PDF in Firefox and it ended up putting the main content of the site on page 2, whereas yours seemed to render it perfectly.
Looks good. I'm keen to use (and pay for) a service like this, if it's reliable and quick. With a Ruby gem it's particularly attractive, as all other Rails-to-PDF solutions are incomplete, require a PDF-specific DSL, or are very expensive.
This is awesome. I'm at once excited about using this in the future, and dismayed thinking of the time I've spent manually generating PDFs because none of the HTML -> PDF options worked.
I fed it my homepage, and it nailed it. I'm impressed.
Haven't tested this, but great idea. I've used a couple of the PDF creation tools and it seems so tedious to build out even a simple table view on a PDF. Good luck with this!
That's a known problem on my todo list. The colors are dulled only in Acrobat; other PDF readers render them correctly. Could you please post the link to that page if possible? Thanks.
[+] [-] dpapathanasiou|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] petesalty|16 years ago|reply
[+] [-] wdewind|16 years ago|reply
[+] [-] sjs382|16 years ago|reply
[+] [-] raffi|16 years ago|reply
http://sleep.dashnine.org/manual/ - original docs
http://sleep.dashnine.org/download/sleep21manual.pdf - result
[+] [-] jonallanharper|16 years ago|reply
If PDFCrowd can effectively handle images, I'll brand their logo into my bicep.
[+] [-] thepsi|16 years ago|reply
[+] [-] DanHulton|16 years ago|reply
[+] [-] thepsi|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] karanbhangui|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] latortuga|16 years ago|reply
[+] [-] sjs382|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] ivansavz|16 years ago|reply
[+] [-] watmough|16 years ago|reply
[+] [-] ricmo|16 years ago|reply
[+] [-] jrockway|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] deutronium|16 years ago|reply
[+] [-] juliancox|16 years ago|reply
[+] [-] qeorge|16 years ago|reply
[+] [-] pstinnett|16 years ago|reply
[+] [-] carbocation|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] oskee80|16 years ago|reply
[+] [-] washingtondc|16 years ago|reply
[+] [-] jgresula|16 years ago|reply
[+] [-] va_coder|16 years ago|reply
[+] [-] mleonhard|16 years ago|reply
[+] [-] jgresula|16 years ago|reply