aidos | 1 month ago
mutool clean -d in.pdf out.pdf
If you look below you can see a Pages list (1 0 obj) that references (2 0 R) a Page (2 0 obj).
1 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 2 0 R ]
>>
endobj
2 0 obj
<<
/Type /Page
/Contents 5 0 R
...
>>
endobj
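If you want to poke at that structure yourself, here's a rough sketch of pulling the objects and their references out of the decompressed text. The regexes and the `parse_objects` name are my own, and the sample is a trimmed copy of the dump above. It assumes the simple textual layout you get after decompressing, so don't mistake it for a real PDF parser.

```python
import re

# Matches an indirect object definition: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Matches an indirect reference: "<num> <gen> R".
REF_RE = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def parse_objects(text):
    """Map (object number, generation) -> list of (number, generation) it references."""
    objects = {}
    for num, gen, body in OBJ_RE.findall(text):
        refs = [(int(n), int(g)) for n, g in REF_RE.findall(body)]
        objects[(int(num), int(gen))] = refs
    return objects


# Trimmed copy of the two objects shown above.
sample = """
1 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 2 0 R ]
>>
endobj
2 0 obj
<<
/Type /Page
/Contents 5 0 R
>>
endobj
"""

graph = parse_objects(sample)
print(graph)
```

The result is just a little graph: the Pages object points at the Page, and the Page points at its Contents.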
Rather than editing a PDF in place, you can overwrite these objects by appending a new "generation" of an object. Notice the 0 has been incremented to a 1 here. This leaves the original PDF intact while making edits.
1 1 obj
<<
/Type /Pages
/Count 2
/Kids [ 2 0 R 200 0 R ]
>>
endobj
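As a toy model of "the appended copy wins", here's a sketch that scans the text in file order and keeps the last definition it sees for each object number. The `latest_objects` helper is made up for illustration. A real reader resolves objects through the cross-reference table at the end of the file rather than by scanning, so this only shows the shadowing idea, using the two Pages objects from this comment.

```python
import re

# Matches an indirect object definition: "<num> <gen> obj ... endobj".
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def latest_objects(text):
    """Map object number -> (generation, body), keeping the last definition in file order."""
    latest = {}
    for num, gen, body in OBJ_RE.findall(text):
        # A later definition simply replaces whatever was stored before.
        latest[int(num)] = (int(gen), body.strip())
    return latest


# The original Pages object followed by the appended one from the example.
sample = """
1 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 2 0 R ]
>>
endobj
1 1 obj
<<
/Type /Pages
/Count 2
/Kids [ 2 0 R 200 0 R ]
>>
endobj
"""

pages_gen, pages_body = latest_objects(sample)[1]
print(pages_gen)
print(pages_body)
```

The earlier bytes are still sitting in the file, they're just no longer the definition anyone looks at.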
You can have anything inside a PDF that you want really and it could be orphaned so a PDF reader never picks up on it. There's nothing to say an object needs to be referenced (oh, there's a "trailer" at the end of the PDF that says where the Root node is, so they know where to start).
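To make the orphan idea concrete, here's a minimal sketch that walks references starting from the Root and reports whatever it never reaches. The graph is hand-written and the object numbers 10 and 99 are invented for the example, so it's an illustration of the idea and not output from a real file.

```python
def reachable(graph, root):
    """Return the set of object numbers reachable from root by following references."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against reference cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


# Hypothetical object graph: object number -> object numbers it references.
graph = {
    10: [1],    # Catalog (the Root named by the trailer) -> Pages
    1: [2],     # Pages -> Page
    2: [1, 5],  # Page -> its parent Pages (a cycle) and its Contents
    5: [],      # Contents stream
    99: [5],    # nothing points at 99, so a reader starting at Root never sees it
}

orphans = set(graph) - reachable(graph, 10)
print(sorted(orphans))
```

Object 99 can reference whatever it likes; since nothing on the path from the Root references it, it's effectively invisible.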
pfisherman|1 month ago
So it works kind of like a soft delete — dereference instead of scrubbing the bits.
Is this behavior explicitly defined in PDF editors (i.e. an intended feature)? Is it defined in some standard or set of best practices? Or is it a hack (or half-baked feature) someone implemented years ago that has just kind of stuck around and propagated?
clord|1 month ago
SeriousM|1 month ago
aidos|1 month ago
But yeah. It's all just objects pointing at each other. It's mostly tree structured, but not entirely. You have a Catalog of Pages that have Resources, like Fonts (which are likely to be shared by multiple pages, hence not a tree). Each Page has Contents that are a stream of drawing instructions.
This gives you a sense of what it all looks like. The contents of a page is a stack-based vector drawing system. Squint a little (or stick it through an LLM) and you'll see Tf switches to Font F4 from the resources at size 14.66, Tj is placing a char at a position, etc.
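Here's a rough sketch of how that operand-then-operator style can be read: values pile up until an operator arrives and takes them. The stream below is one I made up (only the F4 font name and the 14.66 size come from the description above), and the tokenizer handles just names, numbers, simple strings and operators, so it's nowhere near a full content stream parser.

```python
import re

# Tokens: (string literal) | /Name | number | operator keyword
TOKEN_RE = re.compile(r"\((?:[^()\\]|\\.)*\)|/\w+|[-+]?\d*\.?\d+|[A-Za-z*'\"]+")


def read_operations(stream):
    """Return a list of (operator, operands) pairs from a simple content stream."""
    operands = []
    operations = []
    for tok in TOKEN_RE.findall(stream):
        if tok.startswith("("):
            operands.append(tok[1:-1])   # string literal, parentheses dropped
        elif tok.startswith("/"):
            operands.append(tok)         # name, e.g. a font resource
        elif tok[0].isalpha() or tok in ("'", '"'):
            # An operator consumes everything pushed since the previous operator.
            operations.append((tok, operands))
            operands = []
        else:
            operands.append(float(tok))  # number
    return operations


# Hypothetical stream: begin text, pick font F4 at 14.66, move, show a string, end text.
stream = "BT /F4 14.66 Tf 72 700 Td (Hello) Tj ET"

ops = read_operations(stream)
for op, args in ops:
    print(op, args)
```

Each operator shows up with the operands that preceded it, which is all "stack based" really means here.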
I'm going to hand-wave away the 100+ different types of objects. But at its core it's a simple model.