The story of the PDF (2018)

[+] alister|5 years ago|reply

I'm thankful PDF won, because otherwise I think it would have been Microsoft Word. There was a time when papers, books, resumes, contracts, etc. almost always came as Word. Does anyone else remember getting a book as preface.doc, chap1.doc, chap1a.doc, chap2.doc, subchap2a2.doc, and so on, plus a mess of jpegs and gifs, then trying to figure out how it had to be assembled, and discovering something was missing, or that one chapter was newer than the others? That's one reason I really like PDF -- it's one file, self-contained, and linear.

On the other hand, I really wish it was more diff'able. If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of the document changes at the binary level. I know that PDF diff tools exist, but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure.

[+] marcan_42|5 years ago|reply

PDF objects within the file are usually compressed. That means if anything changes, the whole compressed binary blob changes.
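A rough way to see this effect, as a sketch rather than a real PDF experiment: the snippet below compresses two nearly identical byte strings with zlib (the same deflate family that PDF's FlateDecode filter uses) and counts how many bytes differ. The clause text and the helper name `count_differing_bytes` are made up for illustration, and the exact counts will vary with the zlib build.

```python
import zlib

def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two byte strings differ, plus any length difference."""
    shared = min(len(a), len(b))
    mismatches = sum(1 for i in range(shared) if a[i] != b[i])
    return mismatches + abs(len(a) - len(b))

# Hypothetical stand-in for a page's text content: 200 similar clauses.
original = "\n".join(
    f"Clause {i}: the cardholder agrees to pay the annual fee of {i * 3} dollars."
    for i in range(200)
).encode("ascii")

# Change one word, keeping the length the same so raw offsets stay aligned.
edited = original.replace(b"annual", b"yearly", 1)

raw_diff = count_differing_bytes(original, edited)
packed_original = zlib.compress(original)
packed_edited = zlib.compress(edited)
packed_diff = count_differing_bytes(packed_original, packed_edited)

print(f"uncompressed: {raw_diff} of {len(original)} bytes differ")
print(f"compressed:   {packed_diff} of {len(packed_edited)} bytes differ")
```

The uncompressed strings differ only in the replaced word, while the compressed streams usually stop lining up from the edit onward, which is what a byte-level diff of the PDF ends up seeing.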

Other than compression and such encodings, PDF files are actually text files, with a drawing model largely based on PostScript but without the programming. If you want to diff them, use `mutool clean -d -a` to first turn them into pure ASCII text.

That said, since it's a "baked" layout format, if one word pushes the rest of the text forward, everything after that will show up with changed coordinates. It's closer to a vector image format like SVG than a markup format like HTML or ODF.

There are also things like font subsetting, where removing a word that was the only use of a character, or adding a word that uses a new character, might change the font data to add/remove those characters.
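A toy model of the subsetting point, assuming a subset font simply carries one glyph per distinct non-space character (real subsetters work on glyph IDs and are more involved). The helper `glyph_subset` and the sample sentences are hypothetical.

```python
def glyph_subset(text: str) -> set:
    """Characters a subsetted font would need for this text (whitespace ignored)."""
    return {ch for ch in text if not ch.isspace()}

before = glyph_subset("Fees may change without notice.")
inserted = glyph_subset("Fees may change without prior notice.")
swapped = glyph_subset("Fees may vary without notice.")

# Adding "prior" pulls new glyphs into the embedded font.
print(sorted(inserted - before))
# Swapping "change" for "vary" both adds and drops glyphs.
print(sorted(swapped - before), sorted(before - swapped))
```

So a one-word edit can change the embedded font data as well as the page content, even though nothing about the typeface itself changed.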

[+] rahimnathwani|5 years ago|reply

"but there must be tremendous internal complexity in the PDF format for tiny changes to alter the whole structure."

Imagine a simple file format that doesn't support text wrapping, but allows you to specify elements as (x, y, s) where (x, y) specify a position, and s is a string that will be written left-to-right, truncated at the edge of the screen.

That's a simple file format, right?

But inserting a word somewhere early in that document would change the string within every element in the rest of the paragraph. And maybe move the y position of every element later in the document.

That would be a PITA to diff. Even more so if the document has more than one column.
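A minimal sketch of that imaginary format, assuming one (x, y, s) element per wrapped line and a fixed 40-character width. The `layout` helper and the sample paragraph are invented for illustration; wrapping is done with Python's `textwrap`.

```python
import textwrap

def layout(text: str, width: int = 40) -> list:
    """Bake text into (x, y, s) elements: one element per wrapped line."""
    return [(0, y, line) for y, line in enumerate(textwrap.wrap(text, width))]

paragraph = (
    "The issuer may change these terms at any time and will notify "
    "the cardholder in writing before the change takes effect. "
    "Continued use of the card means the cardholder accepts the new terms."
)

before = layout(paragraph)
after = layout(paragraph.replace("may change", "may unilaterally change", 1))

# Compare element by element, the way a naive structural diff would.
changed = sum(1 for old, new in zip(before, after) if old != new)
changed += abs(len(before) - len(after))
print(f"{changed} of {len(after)} elements differ after inserting one word")
```

Every element from the insertion point onward is a candidate for change, because each line's string depends on where the previous line broke.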

[+] pjmlp|5 years ago|reply

The irony is when I get told that people want an application to output PDF instead of Word, because it is read only.

I always get amused showing those people how to edit PDFs.

It is the same logic by which documents sent by fax are legally binding but the same document sent by email is not.

[+] ternaryoperator|5 years ago|reply

>I'm thankful PDF won, because otherwise I think it would have been Microsoft Word.

Well, probably Microsoft XPS, which was actually a fairly well designed format. But Microsoft didn't have the fight in them to really push it as a competitor to PDF. In part, I suspect b/c it's hard to justify investing a lot of money in your competing document standard as there is not much revenue you can derive from it. As of 2018, Microsoft no longer bundles XPS support in Windows 10.

[+] tonyedgecombe|5 years ago|reply

The main thing for me is I don't need the originating application or fonts. I have twenty year old PDF files created by some long gone software that I can still read.

[+] redman25|5 years ago|reply

PDFs act more like images than text. I made a tool for diffing PDFs at the visual level a little while ago (http://parepdf.com) because I needed a way to see the explicit differences between PDFs.

Diffing PDFs at the textual level is a much harder problem though since lines of text need to be reordered and concatenated with each other. Unfortunately there is nothing built into the format that allows you to know what line belongs with what other line beyond guesswork.
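To make the "guesswork" concrete, here is a small sketch of one common heuristic: treat fragments whose baselines are within a small tolerance as the same line, then order them left to right. The function name, the tolerance value, and the sample coordinates are all assumptions, and it assumes y grows down the page (PDF's native origin is bottom-left).

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Guess reading order from positioned text fragments.

    fragments: iterable of (x, y, text). A fragment joins the current line
    when its y is within y_tolerance of that line's first fragment.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments left to right and join them.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]

# Fragments in arbitrary drawing order, with slight baseline jitter.
fragments = [
    (120, 100.4, "conditions"),
    (10, 130.0, "Fees"),
    (10, 100.0, "Terms"),
    (60, 99.8, "and"),
    (50, 130.3, "apply."),
]
print(group_into_lines(fragments))
```

This works for a single column of clean text and falls apart quickly with multiple columns, superscripts, or rotated text, which is where the real difficulty lies.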

[+] yyyk|5 years ago|reply

We'd probably be using PostScript and maybe later XPS. Word never had a print-oriented format with exact layout.

[+] rstuart4133|5 years ago|reply

I don't know of a better alternative to PDF that was around at the time, but I can't say I'm a fan. It undeniably works well as a way of placing pixels precisely on a page but then so does PNG, and PNG is far simpler and compresses better for computer generated content.

Sadly some information I only get as PDF's, so I have to scrape them. Easy right? It can be, if the PDF is structured sanely. But PDF isn't some well defined data structure for laying out the page, it's a Turing complete stack based computer program that can do whatever it damned well pleases. The font tables don't necessarily have ' '=32, 'A'=65, 'a'=97. Why not optimise it and get rid of all those gaps, so now ' '=0, 'A'=30? And it doesn't have to be drawn in any sane order. It can be just a mess that makes even copy & paste near impossible, and some are.

Did we really need to invent a DSL that has to be executed every time we wanted to view page? I remember it being pushed as a cool solution at the time. It doesn't look so cool now. SVG would be an improvement.
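The font-table complaint above can be illustrated with a toy encoding, assuming codes are handed out in order of first use (an invented scheme, not how any particular PDF producer does it). The reverse map plays the role that a ToUnicode table plays in a PDF: without it, the stored codes are not recoverable as text.

```python
def build_subset_encoding(text: str) -> dict:
    """Assign codes 0, 1, 2, ... to characters in order of first use."""
    encoding = {}
    for ch in text:
        encoding.setdefault(ch, len(encoding))
    return encoding

text = "Annual fee"
encoding = build_subset_encoding(text)
codes = bytes(encoding[ch] for ch in text)

# A naive reader that assumes an ASCII-like layout sees control characters.
naive = codes.decode("latin-1")

# With the reverse map, the original text comes back.
reverse = {code: ch for ch, code in encoding.items()}
recovered = "".join(reverse[b] for b in codes)

print(list(codes))
print(repr(naive))
print(recovered)
```

A scraper facing a file that ships no such reverse map has to fall back on glyph shapes or OCR, which is why copy and paste from some PDFs yields garbage.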

[+] axiolite|5 years ago|reply

> If for example a credit card company changes one word in their terms & conditions PDF, it seems like 90% of document changes at the binary level.

Convert to text with: pdftotext -layout
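A sketch of how that could be wired into a diff, assuming poppler's `pdftotext` is on the PATH and using `-` as the output argument to write to stdout. The function names and the file names in the usage comment are hypothetical.

```python
import difflib
import subprocess

def pdf_to_text(path: str) -> str:
    """Run `pdftotext -layout PATH -` and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

def diff_text(old: str, new: str, old_name: str = "old", new_name: str = "new") -> str:
    """Unified diff of two extracted texts; empty string when they match."""
    lines = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    )
    return "\n".join(lines)

# Usage with hypothetical file names:
# print(diff_text(pdf_to_text("terms_2019.pdf"), pdf_to_text("terms_2020.pdf")))
```

The result is only as good as the text extraction, so a one-word insertion can still show up as several changed lines when it causes the paragraph to rewrap.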

[+] einpoklum|5 years ago|reply

> because otherwise I think it would have been Microsoft Word.

No, those are formats with completely different scopes. They don't compete and are essentially non-interchangeable.

> There was a time when papers, books, resumes, contracts, etc. almost always came as Word.

There was never such a time. I mean, sure, you could (and can) send people Word/LibreOffice documents, but things that needed some reproducibility and finality [1] were almost never distributed or published in MS Word format. PostScript used to be pretty popular, though.

[1] - Yes, PDFs can be edited too, I know.

[+] hyiltiz|5 years ago|reply

It is a pity that DjVu[0] wasn't even mentioned; it is an open format that was superior to PDF in many ways[1], including better optimization and more efficient storage.

[0] http://djvu.org/ [1] https://en.wikipedia.org/wiki/DjVu

[+] msla|5 years ago|reply

DjVu is a great format for scanned images, which is its primary use-case, but I'm not seeing where you can have actual, selectable text in a DjVu document, like you can with PDF and PostScript. It seems like it's all images.

[+] BelleOfTheBall|5 years ago|reply

Man, I haven't seen a DjVu file in years. It used to be somewhat common in scans of magazines and other media that relied on images. Pity that it didn't catch on, although I suppose it still could, if some of its benefits were refined. I find that larger PDFs tend to tank optimization, is that a problem for DjVu files at all?

[+] saagarjha|5 years ago|reply

I don't see how DjVu solves vector graphics, which is a pretty important usecase for PDF.

[+] adamnemecek|5 years ago|reply

It's crazy that Yann LeCun was involved in the creation.

[+] asperous|5 years ago|reply

One issue I have with the "archival" aspect of pdfs discussed in the article, is you are archiving a picture of something, not the blueprint.

So much pain and time will be spent on machine learning models extracting semantic meaning from PDFs that could have been saved if archivers were to also save source formats or machine-readable data. But for some reason publishers have an allergy to submitting those, so it's a lost cause.

[+] watersb|5 years ago|reply

In the early summer of 1995, the Mac community was fairly small. But it dominated the publishing industry.

At the conference for Macintosh network administrators, we were all super excited about this World Wide Web thing. It had the potential to be a whole new paradigm for information publishing, from creation to distribution, for in-house corporate operations or mass media companies: a new medium that would make paper obsolete.

The Adobe reps were visibly exasperated by all this. They had solved this problem, years ago. You could click on any element of a PDF, and go to a different place in the current document, or open any other file on your computer. Powerful tools for graphical interactive PDF creation and editing. Even the ability to trigger AppleScript actions in response to mouse or keyboard events...

The Web, by comparison, was primitive and naive. Why was it getting all the attention?

[+] qubex|5 years ago|reply

• Because one was proprietary, the other was not.

• Because one was top heavy, the other was not.

• Because one was a document format shared between an application that creates and one that displays, the other was a whole server/protocol/client stack.

• Because one would insist on rigidly paginating its content as output by the generating application, while the other defined content that would be streamed to your client and allow it to adapt the content to your display and reflow its (admittedly primitive-looking) text & cetera.

• Because one was designed for use within corporations to distribute documents, while the other was intended to allow collective authorship beyond corporate confines and that this consumer/researcher technology would later seep back into the corporate domain and possibly screw up their plans.

Another way to look at it: if the Adobe folks were angsty, irritable, annoyed, or otherwise flustered, it’s probably because they knew (some?) of the above (and perhaps more) and realised that they were going to have a fight on their hands.

[+] netfl0|5 years ago|reply

There was a period of time when I thought PDF’s days were numbered. That was over a decade ago.

There is now first class support in many applications. I don’t think it’s going anywhere.

[+] Finnucane|5 years ago|reply

It's pretty much essential to the publishing industry. Until we actually stop printing books, we'll be using pdfs.

[+] gogopuppygogo|5 years ago|reply

https://en.wikipedia.org/wiki/PDF

It became an open standard (ISO 32000-1) in 2008 so it’s definitely here to stay.

[+] qubex|5 years ago|reply

> there was a period of time when I thought PDF’s days were numbered

And indeed you are correct! A time shall arise, sooner or later in the future, a moment when the last PDF file is created, as well as a moment when a PDF is consulted for the last time. Depending on your definition of format obsolescence, this might be well beyond its expiry date, or it might actually mark the moment of death.

(Let’s forget that the Apple lineage of OSes derived from Display PostScript-using NeXTstep such as OS X [latterly macOS], iOS, iPadOS, watchOS and tvOS all use PDF as a mechanism for drawing primitive sources onto the screen.)

Anyway... after that long preamble, statements like these remind me very much of Goldfinger’s famous quote, and in honour of Sean Connery’s passing yesterday I will allow myself to elucidate:

Bond: “Do you expect me to talk?” Goldfinger: “I expect you to die!”

The latter being a very reliable expectation, but one that can sometimes take a lot longer to come true than the utterer might have in mind when they make the assertion.

[+] Sniffnoy|5 years ago|reply

Original article, without so many obnoxious ads: https://tedium.co/2018/02/27/pdf-file-format-history/

[+] ffpip|5 years ago|reply

https://ublockorigin.com

[+] gumby|5 years ago|reply

PDF has been bad news, as it embodies assumptions from an earlier age: how paper works.

I want to read flowable text that adapts to my screen and my size needs. I want to be able to reliably select and extract text. I don’t need something that apes an archaic IO system (printer+paper) with all its flaws and, when on screen, none of its advantages.

[+] gnicholas|5 years ago|reply

Agreed. Adobe's recently-announced [1] Liquid Mode for mobile devices is a step in the right direction.

1: https://techcrunch.com/2020/09/23/adobes-liquid-mode-uses-ai...

[+] Gibbon1|5 years ago|reply

I use PDFs for data sheets. Last thing I want is flowable text. Also, 25 years on, selecting and extracting text from HTML is hot garbage.

[+] pjmlp|5 years ago|reply

I still use lots of paper and PDF is the ideal format for it.

There are other formats for flowable text in screens.

[+] kuharich|5 years ago|reply

Past comments: https://news.ycombinator.com/item?id=19819789

[+] unnouinceput|5 years ago|reply

This has a somewhat misleading argument in it: that PDF is the "basis" of the document world. PDF is not the basis.

Lemme explain: for each format there is a basis and there is the most used format. For sound that's .WAV / .MP3; for pictures that's .BMP / .JPEG (or .PNG if you're a purist).

And for documents that's .RTF / .PDF. You see a PDF is not the absolute basis, it's just the most convenient trade between usability and fidelity. Nobody except snobs wants pure .WAV files for their preferred songs and everybody uses .MP3 instead. If you want the absolute purest form of a document, you use .RTF

My 2 cents.

[+] maxerickson|5 years ago|reply

The analogy is not particularly apt.

For one thing, PDF can encode information that rtf cannot.

There's also lots of approaches to document layout (the underlying descriptions of what should appear, not just different styles).

Pedantically, the analogy works somewhat better for postscript than rtf, but not really, except maybe the bmp->png part.

[+] 867-5309|5 years ago|reply

if only .PDFs could easily be converted back to a useful raw format. parsing them is a bloody minefield, irregularly stuffed with proprietary metadata galore

[+] Biganon|5 years ago|reply

Even a snob wouldn't want a WAV file; FLAC is lossless.

[+] asperous|5 years ago|reply

The only problems with PDFs are that they are misused.

They are amazing at exactly reproducing a printed document, and far superior to a jpg at doing that because it is vector, searchable, can contain links, etc

If you've ever tried to read a math textbook in ebook format on an iPad and then switched to PDF, you can see how PDF shines.

[+] varispeed|5 years ago|reply

When I first got a computer magazine in PDF format in the late 90s, I knew this was going to be the future. It looked so slick on my CRT monitor, and I kept looking at pages, zooming in, zooming out, just for the sake of it. Whenever I open a PDF file my mind goes back in time and relives these moments of joy.

[+] bastawhiz|5 years ago|reply

Man, this reminded me of XPS (https://en.m.wikipedia.org/wiki/Open_XML_Paper_Specification), which I haven't thought about in ten years. Glad it never won.

[+] gautamcgoel|5 years ago|reply

Anyone have a recommendation for a good FOSS PDF reader for Linux?

[+] SSLy|5 years ago|reply

zathura with the mupdf engine, evince (if you can stomach poppler's speed), sumatrapdf inside wine.

91 comments