The examples lack hyphenation, which partly explains the too-variable interword spacing. Is this because Chrome still fails to support hyphenation, unlike, for example, Firefox?
I would say that TeX's ability to place page breaks optimally is similar to hyphenation. I don't believe any web technology can solve this problem at the moment.
I was wondering the same as it’s a common use-case for the project I run (browserless.io). Seems to be a big demand for sane PDF rendering and generation.
How would that compare to, say, an HTML template + wkhtmltopdf?
> Also I feel like the biggest gripe with generating (long) PDFs from HTML are things such as page numbering, orphans and widows, semantically correct word-wrapping, page margins, etc...
Oh, how much I would love to have a good way to generate print-quality PDFs. The real problem is not hyphenation but how lines are composed. If you want even lines in type set as a justified block, there is probably only Adobe InDesign and LaTeX; anything else uses a "single-line composer". I don't know the algorithm, but LaTeX and InDesign are the only ones which take multiple lines into consideration. LaTeX is sort of okay, but the algorithm in InDesign is still highly superior. I suspect that is some Adobe secret sauce. Pity, because you can't run InDesign on a server; you have to have it open and use "ExtendScript", their version of old ECMAScript 3 :(
Adobe's secret sauce is largely implemented in the microtype package in the LaTeX world (character protrusion for optical margin alignment and font expansion for more even interword spacing and less hyphenation). Also, the technology didn't originate at Adobe; Adobe purchased it from URW, who developed the hz-program, the real pioneer of those micro-typographic adjustments.
Have you looked at Prince [1]? It's commercial, but highly regarded.
Seems kind of neat. But for my purposes I will still use Markdown to PDF using pandoc etc.
FOP is the only TeX alternative that can get close to it on basic typography in a FOSS implementation. I had a toolchain that ran ReST -> Docbook -> XSL-FOP -> PDF but the hard drive it was on bit the dust and I haven't gotten around to recreating it. Still much more pleasant than wrestling with LaTeX's rigid predetermined layouts. The result was nice and didn't have the crusty PDF LaTeX appearance.
Pandoc ultimately either has to move the html through another markup format such as LaTeX or use a plugin that attempts to convert html4 to pdf code.
This is neat, but perhaps switching the final typesetting engine from Chromium's PDF printer to LaTeX (via Pandoc maybe) would make it more useful. You'd get more control over things like page numbering and TOCs, plus good justification/microtypography, which is important to most publishers.
That will probably be difficult because Chrome just "prints" a PDF. Therefore headers, footers, footnotes, and page numbering are difficult issues to solve.
I am thinking about it and there may be a way to do it using Pug mixins (like LaTeX macros).
All I want is a system that gets the basics right and is version-controllable in git (plain text source code). LaTeX is just ridiculously complex and inconsistent. Even after years of using it, I have to google how to do most things every time. I would prefer a simple PDF generator that uses pug/HTML (which I know by heart) any day.
Looks like the perfect solution to my resume. The latest iteration is in HTML/CSS, because it allowed me to easily get the exact layout I wanted (so painful in LaTeX...), but getting a consistent PDF was a challenge.
I produce all my PDFs with pandoc's markdown and in-line html: letters, slides and papers with citations. Depending on whether I need mathjax I use wkhtmltopdf or chromium (JS-based hyphens with Hyphenopoly) or just http://weasyprint.org/ if no JS is involved.
I find Markdown most natural for writing because I do not have to worry about formatting or syntax.
Inline markdown and external markdown files are both supported. Have a look at the "Book" example. Every chapter is in its own Markdown file. Most of the other examples have parts where I simply switch to markdown.
Prince is $3800 software. Prince seems to encourage XML/HTML/CSS for writing documents, and I didn't like this. With ReLaXed I am trying to show that Pug/SCSS makes document writing much more natural.
For one, everything here appears to be free and open source. Prince is pretty costly to run as a small endeavor and last I knew their pricing model wasn't very kind to horizontal scaling.
Not sure if this is related to the format of the PDF somehow, but my computer completely froze when trying to open the Alice pdf in the GitHub viewer. This is on Safari, Chrome was fine.
Upon further inspection, the GitHub renderer works fine on much larger PDFs [1], and the native Safari PDF viewer opens these PDFs fine. I suspect there is something the GitHub renderer, your pdf generator, and Safari's js engine disagree on.
[+] [-] leephillips|8 years ago|reply
There are other subtle defects, which make these PDFs pretty good, but not high quality.
Here is a brief discussion of some of the shortcomings of web typography, and why we still need to use TeX if we want the most beautiful and easiest to read results:
https://lwn.net/Articles/662053/
All that aside, this is impressive and should be useful to many people.
[+] [-] drb91|8 years ago|reply
Just printing the <p> tag, with its constraints of text layout on all layers (word, line, paragraph, page) already has a lot of details you need to get right to get naturally readable text flow before adding on all the other complexities of html. For instance, if you have a single line creep onto the next page, but you could also just move the entire paragraph to the next page and subtly adjust spacing on the first page, then that is preferable so that each paragraph resides entirely on one page. This is obviously not always possible or desirable, so it turns into a search problem with many variables that can be dynamically altered in the middle of text flowing.
My understanding of modern CSS engines is both that a) CSS itself lacks the natural primitives to even express constraints you'd find in TeX, and also b) the concerns necessary to solve page layout to this degree fall into the type of search problem that browsers tend to try to avoid when rendering.
Of course, there's an argument to be made that if people don't realize it's missing, maybe it wasn't terribly valuable to begin with. I'd imagine for most home uses it's not very useful, but the fact that you can typeset decades-old documents at a de-facto professional level, for free, OR with heavily modified engines allowing more modern practices, is really quite amazing. I hope the effort that went into formalizing "readable text" doesn't get lost as people move on from TeX--it'd be great to get some of this capacity in a browser with competing implementations; TeX is a lot to learn for most people, and it's also Turing complete, which is IMHO mostly a bad sign for accessibility.
There are also projects which attempt to render HTML to TeX, but they were frankly mostly terrible the last time I looked. I honestly wonder if it's easier these days for javascript to attempt to render the DOM to TeX and just leverage the browser as much as possible, but I'm not familiar enough with the DOM to speculate on how much this is likely to work on unaltered pages. My guess is you only get so much for free before you have to specifically consider that output scenario, just like other types of responsive layout.
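To make the "search problem" framing above concrete, here is a toy sketch in Python. It is not TeX's page builder; the cost model (a squared penalty for unused lines on a page, a flat penalty for stranding a single line of a paragraph) and the names `paginate`, `capacity` and `penalty` are assumptions made up for illustration.

```python
from typing import List


def paginate(paragraph_lengths: List[int], capacity: int, penalty: int = 100) -> List[int]:
    """Return how many lines go on each page, minimizing a toy cost.

    paragraph_lengths: number of lines in each paragraph (hypothetical input).
    capacity: maximum number of lines per page.
    penalty: cost of stranding a single line of a paragraph at a page break.
    """
    # Mark, for every line, whether it is the first or last line of its paragraph.
    first, last = [], []
    for length in paragraph_lengths:
        for k in range(length):
            first.append(k == 0)
            last.append(k == length - 1)
    n = len(first)

    def break_penalty(k: int) -> int:
        """Cost of ending a page right before line k."""
        if k == n or last[k - 1]:
            return 0  # break falls between paragraphs or at the very end
        cost = 0
        if first[k - 1]:
            cost += penalty  # single line stranded at the bottom of the page
        if last[k]:
            cost += penalty  # single line stranded at the top of the next page
        return cost

    # best[end] = cheapest way to paginate lines[0:end] with a break at `end`.
    best = [float("inf")] * (n + 1)
    prev = [0] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - capacity), end):
            unused = capacity - (end - start)
            page_cost = 0 if end == n else unused ** 2  # last page may run short
            total = best[start] + page_cost + break_penalty(end)
            if total < best[end]:
                best[end] = total
                prev[end] = start

    # Walk the back-pointers to recover the page sizes.
    pages, end = [], n
    while end > 0:
        pages.append(end - prev[end])
        end = prev[end]
    return pages[::-1]


print(paginate([4, 3], capacity=5))
```

With a 4-line and a 3-line paragraph and 5 lines per page, simply filling the first page would strand the first line of the second paragraph at the bottom; the search instead leaves one line empty and starts the second paragraph on the next page.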
[+] [-] zulko|8 years ago|reply
https://www.w3schools.com/cssref/css3_pr_word-break.asp
From what I remember LaTeX has better algorithms, both in how to distribute words between lines, and in knowing where in a word it is ok to cut.
[+] [-] kaycebasques|8 years ago|reply
https://github.com/GoogleChrome/puppeteer
(I work for Chrome DevTools team, creators of Puppeteer)
[+] [-] mrskitch|8 years ago|reply
Been pretty interesting seeing webtech handle these kinds of problems
[+] [-] nightmunnas|8 years ago|reply
[+] [-] Ecco|8 years ago|reply
Also I feel like the biggest gripe with generating (long) PDFs from HTML are things such as page numbering, orphans and widows, semantically correct word-wrapping, page margins, etc...
Chrome does a decent job but is nowhere close to what LaTeX can do.
[+] [-] zulko|8 years ago|reply
https://github.com/RelaxedJS/ReLaXed/wiki/ReLaXed-vs-other-s...
It is open to contributions, so any thoughts are welcome. In a nutshell, all your points are valid. Chrome is one of the best browsers, but still behind LaTeX in some aspects. But which will evolve faster in the future?
[+] [-] jahewson|8 years ago|reply
https://developer.mozilla.org/en-US/docs/Web/CSS/Paged_Media
[+] [-] tingletech|8 years ago|reply
a blog about this issue: http://www.pagedmedia.org
[+] [-] omnimus|8 years ago|reply
[+] [-] kccqzy|8 years ago|reply
[+] [-] lobster_johnson|8 years ago|reply
The coolest project I've seen with it is OMA (Rem Koolhaas' architecture firm), which uses it to print internal, very professional-looking booklets automatically generated from data, text and photos stored in Sanity [2]. (The Sanity team also built the system to make the booklets.)
[1] https://www.princexml.com
[2] https://www.sanity.io/docs/introduction/what-the-headless
[+] [-] rayiner|8 years ago|reply
[+] [-] che371291|8 years ago|reply
What really upsets me... the typography still looks shit compared to LaTeX... MS Word / LibreOffice can do better. Would rather stick with plaintext again.
[+] [-] kevin_thibedeau|8 years ago|reply
[+] [-] thangalin|8 years ago|reply
https://i.imgur.com/tMkMjNV.png
In the image, ConTeXt generates PDFs. The EA box represents HTML documentation exported from Enterprise Architect, but could be any structured document that pandoc can parse. The source repository contains various themes for the final PDF.
Using ConTeXt offers several compelling features, such as citations, cross-references, and the ability to produce EPUBs.
[+] [-] vorpalhex|8 years ago|reply
This uses a full browser rendering engine that supports modern html5/css3/js by ultimately running a headless browser.
I suspect pandoc is still a great approach for a lot of cases. Running a headless browser isn't cheap, especially at scale. If your output is a simple book or an invoice, pandoc is probably the way to go. If you want to pdf websites or dump an html file with charts into a pdf, use this.
[+] [-] baby|8 years ago|reply
[+] [-] lahcim8|8 years ago|reply
[+] [-] deleterofworlds|8 years ago|reply
[+] [-] nateroling|8 years ago|reply
[+] [-] ghrifter|8 years ago|reply
[deleted]
[+] [-] killercup|8 years ago|reply
[+] [-] sebazzz|8 years ago|reply
[+] [-] zulko|8 years ago|reply
Also, ReLaXed supports Markdown-it, which in turn has plug-ins for footnotes and citations, for instance. Not sure what you mean by auto-reference, but that should be possible, like in any other HTML page, shouldn't it?
[+] [-] nmca|8 years ago|reply
So the beginnings of an alternative look great!
[+] [-] foobaw|8 years ago|reply
[+] [-] fmntf|8 years ago|reply
[+] [-] felixfbecker|8 years ago|reply
[+] [-] agussell|8 years ago|reply
This is a JavaScript implementation of the line breaking algorithm used in TeX. It would be nice to add it, to obtain better typographic results with justified text.
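For anyone wondering what "taking multiple lines into consideration" buys you, here is a minimal sketch in Python. It is not the TeX algorithm itself (no glue, hyphenation or penalties); it only contrasts a greedy first-fit breaker with a "total-fit" breaker that minimizes the sum of squared trailing space over the whole paragraph. The function names and the cost function are assumptions for illustration.

```python
from typing import List


def greedy_lines(words: List[str], width: int) -> List[str]:
    """First-fit: put as many words as possible on each line."""
    lines, current = [], ""
    for w in words:
        candidate = w if not current else current + " " + w
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = w
    if current:
        lines.append(current)
    return lines


def total_fit_lines(words: List[str], width: int) -> List[str]:
    """Pick breaks minimizing the sum of squared slack (last line is free)."""
    n = len(words)
    # best[i] = minimal cost of setting words[i:]; nxt[i] = where that line ends.
    best = [float("inf")] * (n + 1)
    nxt = [n] * (n + 1)
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        length = -1  # length of words[i:j] joined by single spaces
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > width and j > i + 1:
                break  # line overflows (a lone overlong word is still allowed)
            slack = width - length
            cost = 0.0 if j == n else float(slack ** 2)
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines


words = "aaa bb cc ddddd".split()
print(greedy_lines(words, 6))     # ['aaa bb', 'cc', 'ddddd']
print(total_fit_lines(words, 6))  # ['aaa', 'bb cc', 'ddddd']
```

The greedy version fills the first line completely and leaves "cc" alone on a very loose second line; the total-fit version accepts a slightly looser first line so that the paragraph as a whole is more even.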
[+] [-] Wehrdo|8 years ago|reply
[+] [-] Klasiaster|8 years ago|reply
This pug language seems to be a good alternative to intermixed markdown+html.
[+] [-] buildbuildbuild|8 years ago|reply
Currently I deliver ~2 PDF reports per week using Ulysses or MacDown for content creation (distraction-free writing), and then typesetting everything into InDesign.
Thank you for creating this tool, I will try it next week.
The ability to render Markdown to Pug as an "Import Markdown" feature would be key for many people to adopt this.
[+] [-] zulko|8 years ago|reply
I am also a big markdown user and I have found that for writing reports all day long markdown clearly wins over Pug, in particular with tools like
https://atom.io/packages/markdown-preview-enhanced
But the day you need to produce a super-nice report with a bit of custom layout, Pug/SCSS is awesome.
[+] [-] deedubaya|8 years ago|reply
I'm in the process of launching BreezyPDF.com, which can generate equally wonderful PDFs from the HTML/JS/CSS you're already using.
Here's a demo of turning a complex dashboard into a PDF: https://ruby.demo.breezypdf.com
[+] [-] williamscales|8 years ago|reply
[+] [-] zulko|8 years ago|reply
Where Prince wins is in its support for CSS @page extensions (having pages with different margins etc.); it looks much better suited to professional publishing. There are certainly many more advantages related to typography, but I don't know them.
Link to Prince:
https://www.princexml.com
[+] [-] SigmundA|8 years ago|reply
Real issue is Prince is the only browser that supports full print CSS, none of the major browsers seem to care about better print output anymore.
[+] [-] evilduck|8 years ago|reply
[+] [-] jakear|8 years ago|reply
[+] [-] jakear|8 years ago|reply
[1]: https://github.com/mynane/PDF/blob/master/Docker%20——%20从入门到...