History of the PDF

ralferoo|2 years ago

So, reading the article is a bit weird. It's clear there's an anti-PDF bias from the start, with the implicit assumption that everybody hates reading PDF files. Actually, I don't because I get to read a well formatted document. They even say that it should only be used as a format for things to be printed, never as a document for people to read on a computer... and yet this is clearly meant to be read once on a screen and not printed out. It also contains a hypertext link to their company that obviously wouldn't work if printed, and they embed it in an iframe, because they expect people to be reading it online.

But towards the end, you start to see the real objection to PDFs - that it's not always easy to extract text automatically from a document. It mentions a few of the issues - extra spaces, not enough spaces, hidden text that gets extracted because it's off-page, fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that with the exception of the spaces, these techniques are used deliberately to stop people extracting text or the copyrighted fonts from the document.

It's not at all obvious from the document itself, but if you click on the link to the company, all becomes clear. The reason this company is saying that all these things are problems with PDF is because their company is in the business of extracting raw text from PDF. Ignoring all the designers' efforts to place things in specific places to make things pleasing for a human to read, etc... They don't want any of that. They just want to extract the raw text so they can data mine it and sell that as a service.

ogurechny|2 years ago

You are not reading a PDF document, you are reading a visual representation constructed by a program which is made by people who tear their hair out.

PDF “specification” is not a specification, it only documents the happy path. It never states that behavior of Acrobat remains the holy truth, but in practice undocumented bug-for-bug compatibility is assumed. (We're talking about the most basic, universally supported features here.) If ISO were worth their salt, they would at least try to codify the de facto behavior instead of stamping their name on some Adobe-provided document; then it would be a horrible but fixed format. A collection of tests would be nice to have, too.

Of course, this “history” is just a promotional leaflet, which describes the “layman approach” they tried to construct. It's a fault not to mention that PDF was, and still is, a foundation of digital print industry, where big vendors solve compatibility problems for mere mortals, and therefore create unwritten rules of what should and shouldn't work.

It is also ironic that they praise the Web, but have to use Web Archive to link to the article from the ancient year of… 2020.

Tangurena2|2 years ago

Mostly, the page description language (inside PDFs) says "move to point (x, y) and write text foo" or "move to point (x, y) and draw glyph bar".

The reason that many PDFs produce garbage when extracting text is that the underlying document doesn't include fonts; every letter is a drawn glyph. This is most common in older (1990s) PDFs generated on UNIX systems.

Since the page description language is saying "write text foo", that text is broken up by the generating software, so there is not necessarily a whole line of text as a human would see it.
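To make that concrete, here is a minimal sketch of rebuilding lines from positioned fragments. It is not how any particular extractor works: the fragment tuples, the `rebuild_lines` name, and both thresholds are made up for illustration, and real content streams carry font metrics and transforms that this ignores.

```python
from collections import defaultdict

# Hypothetical fragments as a generating program might emit them:
# (x, y, text, width). Real content streams are more involved.
fragments = [
    (72.0, 700.0, "Hel", 15.0),
    (87.0, 700.0, "lo", 9.0),
    (101.0, 700.0, "world", 27.0),
    (72.0, 686.0, "next line", 40.0),
]


def rebuild_lines(frags, y_tolerance=2.0, space_gap=3.0):
    """Group fragments into lines by y, then join them left to right.

    A space is inserted only when the horizontal gap between two
    fragments exceeds space_gap; both thresholds are guesses.
    """
    rows = defaultdict(list)
    for x, y, text, width in frags:
        # Snap y to a bucket so fragments on the same baseline group together.
        rows[round(y / y_tolerance)].append((x, text, width))

    lines = []
    # Higher y is higher on the page in PDF's default coordinate system.
    for key in sorted(rows, reverse=True):
        parts = []
        prev_end = None
        for x, text, width in sorted(rows[key]):
            if prev_end is not None and x - prev_end > space_gap:
                parts.append(" ")
            parts.append(text)
            prev_end = x + width
        lines.append("".join(parts))
    return lines


print(rebuild_lines(fragments))
```

With the sample data the first line comes back as `Hello world`; raising `space_gap` to 10 glues it into `Helloworld`. A threshold set too low has the opposite effect and splits words wherever fragments are kerned apart, which is roughly where the extra and missing spaces come from.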

And some PDFs are impossible to extract text from because they've flattened the page into an image. Law firms are notorious for doing this - to provide the documents exactly as required/specified during discovery, but make them impossible for the text to be extracted. Basically it is a fax - every page is a TIFF image (because it is harder to OCR than a JPEG, although JBIG2 has its own flaws [0]).

I've been working with PDF projects off and on since the late 90s. The standard tries to be everything for everybody and that makes it a Charlie Foxtrot from top to bottom (if you've ever written an object viewer/editor to dig inside PDFs, you know exactly what I'm referring to). It is a great spec for making a document that appears the same no matter where it is viewed or printed. But I always treat it as a sausage: you can turn the cow into a sausage, but you can't turn that sausage back into a cow.

0 - https://www.theverge.com/2013/8/6/4594482/xerox-copiers-rand...

Someone|2 years ago

> fonts that are designed to obfuscate the internal text, e.g. re-arranging characters or splitting glyphs up in strange ways, etc. It's worth noting that with the exception of the spaces, these techniques are used deliberately to stop people extracting text or the copyrighted fonts from the document.

Maybe (and, for the fonts, likely), but I don’t think it’s the only reason. Subsetting embedded fonts makes PDFs smaller, often a lot smaller (why embed an entire font because the document uses a single glyph of it as a bullet point? Why would one include Chinese, Japanese, etc glyphs if the document doesn’t use them?)

Even if it’s possible to do that without changing the code point to glyph mapping (is it? I don’t know enough of fonts to answer that), implementing it may be simpler or result in smaller files if one makes the embedded font dense in code points (I tried finding an answer, but soon remembered how complex fonts are, and gave up)
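A toy sketch of the "dense" idea, under the assumption that a subsetter simply numbers glyphs in order of first use (real tools may do something quite different). The function names are hypothetical; the reverse map plays the role that a ToUnicode CMap plays in an actual PDF.

```python
def subset_font(text):
    """Assign dense glyph IDs to the characters actually used.

    Returns (encoded, to_unicode): the glyph IDs that would be written
    into the page, and the reverse map a viewer needs for extraction.
    """
    glyph_id = {}
    for ch in text:
        if ch not in glyph_id:
            glyph_id[ch] = len(glyph_id) + 1  # 0 is conventionally .notdef
    encoded = [glyph_id[ch] for ch in text]
    to_unicode = {gid: ch for ch, gid in glyph_id.items()}
    return encoded, to_unicode


def extract(encoded, to_unicode=None):
    """Recover text; without a reverse map, fall back to a naive guess."""
    if to_unicode is None:
        # Assumption for illustration: treat glyph IDs as offsets from '@'.
        return "".join(chr(0x40 + gid) for gid in encoded)
    return "".join(to_unicode[gid] for gid in encoded)


encoded, to_unicode = subset_font("hello")
print(encoded)                       # glyph IDs in order of first use
print(extract(encoded, to_unicode))  # reverse map present
print(extract(encoded))              # reverse map missing
```

The page renders identically either way, since the viewer only needs glyph IDs to draw. Extraction depends entirely on whether the reverse map was written and is correct, so a space-saving subset and a deliberate obfuscation can look the same from the outside.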

And of course, modern tools _should_ output accessible PDF documents, which means text extraction _should_ work. I wouldn’t know how well that works in reality, but have my doubts.

merb|2 years ago

Actually most PDFs are formatted in a good way and it’s easy to extract text. The stupid stuff is just copy encryption, which is just a stupid feature (because PDF viewers can ignore it). I have no idea why somebody hates PDFs for extracting data, when stuff like doc and xls (the old formats) is clearly way worse. PDF sometimes has its quirks, but the 2.0 version clearly cleans up a lot of the messes.

eviks|2 years ago

Except it's a poorly formatted document because it's not formatted to fit screens of different width, which is huge (phones are a thing)

Also you haven't solved another huge fail of the most basic digital workflow - copy&paste - by pointing at the motivation of the author since "except spaces" ruin it for everyone, not just professional data extractors

herodotus|2 years ago

After working on PDF document reconstruction for more than a decade, I often fantasized about inventing a cleaner and simpler alternative. After all, there are only three kinds of objects in PDF: shapes, images and glyphs. But it is all those little details that will get you in the end. A line - all you need is a coordinate and a length, right? No: is it solid, what is its width? Is its end point anchored on the leftmost part of the visible line, or does the thickness spread out from the anchor point? And is the end square or curved? If curved, what are the parameters of the curve? Are both ends the same? On and on it goes. And don't even get me started on glyphs...
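For the line example, a rough sketch of how few of those details it takes to change what gets painted. The cap style numbers follow PDF's line cap parameter (0 butt, 1 round, 2 projecting square); the `Stroke` class and `painted_bounds` helper are hypothetical and only handle a horizontal line with no dashes or joins.

```python
from dataclasses import dataclass
from enum import IntEnum


class LineCap(IntEnum):
    BUTT = 0               # stroke stops exactly at the endpoint
    ROUND = 1              # semicircle of diameter line_width past the endpoint
    PROJECTING_SQUARE = 2  # square extending half the line width past it


@dataclass
class Stroke:
    x0: float
    x1: float
    y: float
    line_width: float
    cap: LineCap = LineCap.BUTT


def painted_bounds(s):
    """Bounding box (left, bottom, right, top) of a horizontal stroke."""
    half = s.line_width / 2
    # The width is centered on the path, so it spreads equally to both sides.
    extend = 0.0 if s.cap == LineCap.BUTT else half
    left, right = min(s.x0, s.x1), max(s.x0, s.x1)
    return (left - extend, s.y - half, right + extend, s.y + half)


print(painted_bounds(Stroke(10, 110, 50, 4)))
print(painted_bounds(Stroke(10, 110, 50, 4, LineCap.ROUND)))
```

The same two endpoints give a box from x=10 to x=110 with butt caps and from x=8 to x=112 with round caps, and that is before dash patterns, joins, and transforms enter the picture.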

PDF is a remarkable creation. It has some notable weaknesses, such as the fact that its color channel for images does not include alpha, and thus needs masks, but the fact that it covers so much visual complexity in a relatively compact form is just amazing. (BTW: Its graphics model is strictly from Adobe PostScript, but PDF content streams are not programs.)
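As a small illustration of the mask point: since the image samples carry no alpha of their own, transparency has to come from a second, separate set of values (in PDF, a soft-mask image). The per-pixel blend then looks roughly like this; the `composite` name and the normalized 0 to 1 ranges are assumptions for the sketch.

```python
def composite(image_rgb, mask, backdrop_rgb):
    """Blend one pixel over a backdrop using a separate mask value in [0, 1]."""
    return tuple(
        mask * src + (1.0 - mask) * dst
        for src, dst in zip(image_rgb, backdrop_rgb)
    )


# Half-transparent red over blue.
print(composite((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0)))
```

A mask of 1.0 shows only the image and 0.0 shows only the backdrop, so anything reconstructing the page has to find and pair up two objects to recover what other formats store as one RGBA image.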

One thing that bugged me while reading this article was the use of the definite article ("the PDF"). Since PDF is an acronym for "Portable Document Format" there may be a grammatical case to be made for the "the", but no one says "the HTML" or "the NASA" and so on.

tonyedgecombe|2 years ago

>In 2020, Nielsen made the case again, writing, “After 20 years of watching users perform similar tasks on a variety of sites that use either PDFs or regular web pages, one thing remains certain: PDFs degrade the user experience.”

Good luck saving a HTML version of any modern web page and being able to read it in twenty or thirty years time. HTML just wasn't designed for that.

kps|2 years ago

HTML was; Javascript wasn't.

wslh|2 years ago

I used to click "Print" (macOS) and save to file in (ironically for this article) PDF... does it work for you?

eviks|2 years ago

Good luck doing the same in PDF with embedded javascript, which is the stuff that's not designed, not HTML

contrarian1234|2 years ago

Are there any real alternatives?

I tried to make a conference poster with SVG - using Inkscape - and it was a minor disaster that rendered differently in different programs/browsers, with some features entirely broken

but I don't know of a third option..

epc|2 years ago

IBM tried to push a competitor in the 1990s…BookManager was an initially mainframe (VM/CMS, MVS, etc) combination of viewer program and proprietary format. It came about in response to both IBM customers and product documentation groups demanding some sort of online “hypertext” version of the thousands of publications available.

IIRC it came out around the same time as the initial Acrobat format but not necessarily in response to it. Eventually there were viewers for Windows, OS/2. It wasn't particularly bad, but it was very literal in display and Acrobat/PDF rapidly left it in the dust.

When the web boomed in 1995–1996 the product group behind BookManager tried to ban distribution of PDFs by other IBM groups but failed. One of the problems with BookManager formatted files is you had to recreate the appropriate record format if you transferred it back to a mainframe, and I vaguely recall EBCDIC vs ASCII issues (where PDF is, I think, UTF native?).

https://en.wikipedia.org/wiki/SCRIPT_(markup)#BookManager

signaru|2 years ago

Microsoft tried with XPS which is a zipped XML format, pretty much like MS Office 2007+ files. To Adobe's credit, they made PDF an open standard around the time XPS came out. Maybe it's a combination of being there first, many files already in PDF, and finally making the format open which made PDF win.

martin_a|2 years ago

Well, this is exactly why PDF was invented and is doing its job so well. To preserve a desired layout and very specific information on how something has to be outputted.

That comes with downsides, yes, but at its core it's just working fine.

edit: Third option would be to render your content as an image, but that comes with its own downsides.

FuriouslyAdrift|2 years ago

There's DVI (device independent file format) from TeX

https://en.wikipedia.org/wiki/Device_independent_file_format

anthk|2 years ago

DJVU, but it's raster I think.

jimjimjim|2 years ago

PDF is the worst document format, apart from all the other formats. When developing software to read or process PDFs the PDF spec can always deliver a jump scare like no other spec. But to give it credit, it broke Microsoft's stranglehold on documents, not completely, but back in the mid 2000s organizations no longer required you to submit things as Word documents.

jansan|2 years ago

Adobe file formats never had the reputation of being easy to work with. I spent some time with the CFF font format, and I can say it was not a pleasure.

balder1991|2 years ago

But is the Word XML format really bad nowadays?

HenryBemis|2 years ago

That's the thing. Every PC, Mac, Android, iPhone can display PDF files. You can capture all elements of a website on one.

It just works.

dave8088|2 years ago

Here’s a link that works: https://www.sensible.so/blog/history-of-the-pdf

dang|2 years ago

Thanks! I've changed to that from https://www.sensible.so/history-of-the-pdf-pdf above.

dblitt|2 years ago

For some reason, this page embeds the PDF as an iframe. The actual PDF is at https://19971168.fs1.hubspotusercontent-na1.net/hubfs/199711...

ralferoo|2 years ago

It's so they can add their help / advertising at the bottom right of the screen.

ggm|2 years ago

"read this PDF online" goes to a 404. Perhaps un-ironically? If you have the PDF you can read it!

martin_a|2 years ago

For everybody complaining about the non-transformative character of PDF: There are several PDF standards out in the wild.

In the graphic industry we mainly use PDF/X files. These are very solid and precise in defining the layout and how objects are rendered.

For archiving purposes there's another standard, it's called PDF/A. Part of PDF/A is that you must be able to transform its text content back to Unicode.

So, if you're looking into being able to convert PDFs back and forth, you should probably use PDF/A. PDF/X files will drop that support to maintain the desired appearance as close as possible.

https://en.wikipedia.org/wiki/PDF/A

FinnKuhn|2 years ago

I would also add that .pdfs are often not meant to be transformed. They are the digital equivalent of a book, which no one complains about not being able to edit. If you want a document you can edit, you don't use .pdf, but something else, before you export it as a .pdf. The same is true of images. No one complains about .jpg not being editable, as any sane person would use a Photoshop or similar file and only export the final product.

ogurechny|2 years ago

PDF/A is a joke of a “standard” that does almost nothing that is promised on the cover. It is just a subset of PDF with limits on variable options like color representation, frozen at some arbitrary point in time, probably because people working with digital archives realized that they couldn't reach the moving goal, and implement the ever growing list of features. We may only expect programs producing PDF/A files to be less “creative”, and produce straightforward markup, but it's not guaranteed at all, because PDF/A doesn't address any of the real core format issues.

divbzero|2 years ago

From page 12 of the PDF:

> Comments on places like HackerNews refer to it as “one of the worst file formats ever produced” [1], “soul-crushing” [2], and something that “should really be destroyed with fire” [3].

I found the source for [1] and [3] https://news.ycombinator.com/item?id=22474460 but couldn’t identify the source of [2].

DeathArrow|2 years ago

PDF as a format is bad when you want to edit or extract text. But do we have an alternative to it that is as portable?

FuriouslyAdrift|2 years ago

That's because its intent is output rendering, not editing. The source document(s) are what is used for editing.

korp|2 years ago

This doesn't make the best coffee table book.

mobilio|2 years ago

Fun fact: the first version for PC was a DOS version:

https://winworldpc.com/product/acrobat-reader/1

HenryBemis|2 years ago

I remember seeing this software on the university's computers, an acrobat.. and I was thinking.. WHAT is that software that I see in EVERY PC? I didn't know what PDF was at the time. I grew up with an Amstrad PC1512 (oh, and a family too) but never had to use Acrobat. Only GW-Basic, Zaxxon, Bubble Bobble, Defender of the Crown, and other super useful software ;)

The most 'complicated' software I used was Volkswriter!

phkx|2 years ago

Statisticians beware: It‘s about the Portable Document Format.

anticensor|2 years ago

As opposed to probability density functions, I guess?

jtvjan|2 years ago

I'm so impressed by the design of this PDF file. It's amazing that they put in so much effort to design what comes down to just an informational article.

dwynings|2 years ago

Yeah, we tried to have some fun with it.

Alifatisk|2 years ago

I like PDF. It does one thing and it does it well. I just wish there were free alternatives to work with PDFs. I can't believe something so well adopted by everyone is still closely controlled by Adobe.

You can't find a free tool that offers features close to Adobe Acrobat; there is none. You have to download multiple tools that each offer their own feature close to Acrobat.

davidthewatson|2 years ago

I thought being emancipated from word docs was freedom until I realized the suffering brought on by a left field version of wkhtmltopdf no one uses but someone built and distributed via pamac AUR. In software, Hybridization is Postmodernism.

davidthewatson|2 years ago

PDF shares a common property with computational complexity and robotics: the magical world of software is no longer free from physics, as if it ever was. Software has reciprocity in creating these illusions and destroying them.

vlark|2 years ago

You can tell the writers and designers hate the PDF format because they made the damn thing so difficult to read, layout-wise. They went 1990s PageMaker/QuarkXPress crazy here.

57 comments