> It produces a somewhat-readable PDF (first page at least) with this text output
Any chance you could share a screenshot / re-export it as a (normalized) PDF? I’m curious about what’s in there, but all of my readers refuse to open it.
Letting Claude work a little longer produced this behemoth of a script (which is supposed to be somewhat universal in correcting similar OCR'd PDFs - not yet tested on any others though):
https://pastebin.com/PsaFhSP1
It decodes to a binary PDF, and there are only so many valid encodings. So this is how I would solve it:
1. Get an open source pdf decoder
2. Decode bytes up to first ambiguous char
3. See if the next bits are valid with a 1; if not, it's an l
4. Might need to backtrack if both 1 and l were valid
By being able to quickly try each character in the middle of the decoding process, you avoid restarting the decode from the beginning each time. This makes it feasible to test all permutations automatically and linearly.
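The steps above can be sketched as follows. This is a minimal illustration of my own, not the commenter's tool: it assumes the OCR'd text is plain base64 of a zlib stream, and it simply enumerates the ambiguous positions and keeps the variants that inflate cleanly (a real implementation would decode up to the first ambiguous character and backtrack, as described, instead of enumerating).

```python
import base64
import itertools
import zlib

def recover(b64_text: str) -> list[bytes]:
    """Try every 1/l assignment at the ambiguous positions and keep
    the variants whose decoded bytes inflate as a valid zlib stream.
    Exponential in the worst case; pruning mid-stream avoids that."""
    positions = [i for i, c in enumerate(b64_text) if c in "1l"]
    chars = list(b64_text)
    survivors = []
    for combo in itertools.product("1l", repeat=len(positions)):
        for i, c in zip(positions, combo):
            chars[i] = c
        raw = base64.b64decode("".join(chars))
        try:
            zlib.decompress(raw)  # the Adler-32 trailer rejects wrong guesses
            survivors.append(raw)
        except zlib.error:
            pass
    return survivors

# Demo: compress a payload, base64 it, then mangle it the way the OCR did.
payload = zlib.compress(b"BT (some pdf content stream) Tj ET " * 4)
mangled = base64.b64encode(payload).decode().replace("l", "1")
hits = recover(mangled)
```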
This is one of those things that seems like a nerd snipe but would be more easily accomplished through brute forcing it. Just get 76 people to manually type out one page each, you'd be done before the blog post was written.
Or one person types 76 pages. This is a thing people used to do, not all that infrequently. Or maybe you have one friend who will help–cool, you just cut the time in half.
The first week of my PhD was accurately copying DNA sequences from an old paper into a computer file, 10 pages in total. I used OCR to make an initial version, then text-to-speech to check it.
As TFA says, the hard part is that "1" and "l" look the same in the selected typeface. Whether your OCR is done by computers or humans, you still have to deal with that problem somehow. You still need to do the part sketched out e.g. by pyrolistical in [1] and implemented by dperfect in [2].
I consider myself fairly normal in this regard, but I don't have 76 friends to ask to do this, so I don't know how I'd go about doing this. Post an ad on craigslist? Fiverr? Seems like a lot to manage.
Given how much of a hot mess PDFs are in general, it seems like it would behoove the government to just develop a new, actually safe format to standardize around for government releases and make it open source.
Unlike every other PDF alternative that has been attempted, the federal government doesn't have to worry about adoption.
XPS [0] seems to meet these criteria. It supports most of the features of PDF, is an "official" standard, has decent software support (including lots of open source programs), and uses a standard file format (XML). But the tooling is quite a bit worse than it is for PDF, and the file format is still complex enough that redaction would probably be just as hard.
DjVu [1] would be another option. It has really good open source tooling available, but it supports substantially fewer features than PDF, making it not really suitable as a drop-in replacement. The format is relatively simple though, so redaction should be fairly doable.
TIFF [2] is already occasionally used for government documents, but it's arguably more complex than PDF, so probably not a good choice for this.
> Then my mom wrote the following: “be careful not to get sucked up in the slime-machine going on here! Since you don’t care that much about money, they can’t buy you at least.”
I'm lucky to have parents with strong values. My whole life they've given me advice, on the small stuff and the big decisions. I didn't always want to hear it when I was younger, but now in my late thirties, I'm really glad they kept sharing it. In hindsight I can see the life experience and wisdom in it, and how it's helped and shaped me.
Why not just try every permutation of (1,l)? Let’s see, 76 pages, approx 69 lines per page, say there’s one instance of [1l] per line, that’s only… uh… 2^5244 possibilities…
It should be much easier than that. You should be able to serially test whether each edit decodes to a sane PDF structure, reducing the cost much like you can crack passwords when the server doesn't use a constant-time memcmp. Are PDFs typically compressed by default? If so, that makes it even easier, given the built-in checksums. But it's just not something you can do by throwing data at existing tools. You'll need to build a testing harness with instrumentation deep in the bowels of the decoders. This kind of work is the polar opposite of what AI code generators or naive scripting can accomplish.
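As a rough sketch of that kind of harness (my illustration, not the commenter's): Python's `zlib.decompressobj` can be checkpointed with `.copy()`, so each candidate byte is probed from the current inflate state rather than re-inflating from byte 0. One caveat: a corrupt byte that still parses as valid Huffman data is only caught later, by the stream's Adler-32 trailer, so a passing probe is necessary but not sufficient.

```python
import zlib

def probe(checkpoint, chunk):
    """Clone the inflate state and test a candidate continuation.
    Returns False only on a hard decode error; a corruption that
    happens to parse is caught later by the Adler-32 trailer."""
    trial = checkpoint.copy()
    try:
        trial.decompress(chunk)
        return True
    except zlib.error:
        return False

# Demo: inflate a shared prefix once, then probe continuations from there.
stream = zlib.compress(b"A" * 2000 + b"B" * 2000)
d = zlib.decompressobj()
d.decompress(stream[:16])      # shared work, done once
ok = probe(d, stream[16:])     # the true continuation probes clean
```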
Easy, just start a crypto currency (Epsteincoin?) based on solving these base64 scans and you'll have all the compute you could ever want just lining up
pdftoppm and Ghostscript (invoked via ImageMagick) re-rasterize full pages to generate their output. That's why it was slow. It's even worse with a Q16 build of ImageMagick. Better to extract the scanned page images directly with pdfimages or mutool.
This. Not only is it faster, the images are likely to be of better quality. If you rasterize the pages then the images will be scaled, unless you get very lucky.
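For reference, a minimal sketch of the direct-extraction route (assuming poppler-utils' `pdfimages` is on PATH; the file names here are placeholders):

```python
import subprocess

def extract_cmd(pdf_path: str, out_prefix: str) -> list[str]:
    """Build a pdfimages invocation; -all writes each embedded image
    in its native format, with no re-rasterization or rescaling."""
    return ["pdfimages", "-all", pdf_path, out_prefix]

def extract_images(pdf_path: str, out_prefix: str) -> None:
    # Raises CalledProcessError if pdfimages reports a failure.
    subprocess.run(extract_cmd(pdf_path, out_prefix), check=True)
```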
There are a few messaging conversations between FB agents early on that are kind of interesting. It would be very interesting to see them about the releases. I sometimes wonder if some of it was malicious compliance... i.e., do a shitty job so the info gets out before it gets re-redacted... we can hope...
I am in no way a Republican apologist, but how many people were clamoring for the immediate release of these documents, saying it "should be easy" and all that? Laws were passed ordering their sudden speedy disclosure. How would you have handled this?
From the unredacted attachments you could figure out what the redacted content most likely contains. Just like the other sloppy redactions that hide sometimes one party of the conversation and sometimes the other, letting you easily reconstruct both sides.
I doubt the PDF would be very interesting. There are enough clues in the human-readable parts: it's an invite to a benefit event in New York (the filename calls it DBC12) that's scheduled for December 10, 2012, 8pm... Good old-fashioned searching could probably uncover what DBC12 was, although maybe not, as it probably wasn't a public event.
Gods, I had a flashback just from you mentioning that.
I had a reasonably simple problem to solve, slightly weird font and some 10 words in English (I actually only missed one or two blocks for missing letters to cover all I needed).
After a couple of days, having almost everything (?), I just surrendered. This seems intentionally hostile: all the docs scattered across several repositories, no comprehensive examples, etc.
Absolutely awful piece of software from this end (training the last gen).
On one hand, the DOJ gets shit because it was taking too long to produce the documents; on the other, they get shit because there are mistakes in the redacting across 3 million pages of documents.
What they are redacting is pretty questionable though. Entire pages being suspiciously redacted with no explanation (which they are supposed to provide). This is just my opinion, but I think it's pretty hard to defend them as making an honest and best effort here. Remember they all lied about and changed their story on the Epstein "files" several times now (by all I mean Bondi, Patel, Bongino, and Trump).
It's really really hard to give them the benefit of the doubt at this point.
The zeitgeist around the files started with MAGA and their QAnon conspiracy. All the right wing podcasters were pushing a narrative that Trump was secretly working to expose and take down a global child sex trafficking ring. Well, it turns out, unsurprisingly, that Trump was implicated too, and that's when they started to do a 180. You can't have your cake and eat it too.
> …but good luck getting that to work once you get to the flate-compressed sections of the PDF.
A dynamic programming type approach might still be helpful. One version or other of the character might produce invalid flate data while the other is valid, or might give an implausible result.
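One way to realize that idea (my sketch, not the commenter's): decompress each candidate and score the result, since invalid flate can be rejected outright and PDF content streams are mostly printable ASCII operators, so binary noise reads as implausible.

```python
import zlib

def plausibility(candidate: bytes) -> float:
    """Score one 1/l variant: -1 for invalid flate, otherwise the
    fraction of printable bytes (PDF content streams are ASCII-heavy)."""
    try:
        out = zlib.decompress(candidate)
    except zlib.error:
        return -1.0
    if not out:
        return 0.0
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in out)
    return printable / len(out)

good = plausibility(zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET"))
bad = plausibility(b"\x00\x01 not a zlib stream")
```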
Geezus, with the short CV in your profile, you couldn't tell an LLM to decode "filename=utf-8"CV%5F%5F%5FHanna%5FTr%C3%A4ff%5F.pdf"? That's not "Bouveng".
Anyway searching for the email sender's name, there's a screenshot of an email of hers in English offering him a girl as an assistant who is "in top physical shape" (probably not this Hanna girl). That's fucking creepy: https://www.expressen.se/nyheter/varlden/epsteins-lofte-till...
Honestly, this is something that should've been kept private, until each and every single one of the files is out in the open. Sure, mistakes are being made, but if you blast them onto the internet, they WILL eventually get fixed.
Won't that entire DOJ archive already be downloaded for backup by several people?
If I were a journalist working on those files, this is the very first thing I would do as soon as they were published, just to make sure you have the originals before the DOJ can start adding more redactions.
Are there archives of this? I have no doubt after this post goes viral some of these files might go “missing”
Having a large number of conspiracies validated has led me to firmly plant my aluminum hat.
DUBIN BREAST CENTER
SECOND ANNUAL BENEFIT
MONDAY, DECEMBER 10, 2012
HONORING ELISA PORT, MD, FACS
AND
THE RUTTENBERG FAMILY
HOST
CYNTHIA MCFADDEN
SPECIAL MUSICAL PERFORMANCES
CAROLINE JONES, K'NAAN,
HALEY REINHART, THALIA, EMILY WARREN
MANDARIN ORIENTAL
7:00PM COCKTAILS
LOBBY LOUNGE
8:00PM DINNER AND ENTERTAINMENT
MANDARIN BALLROOM
FESTIVE ATTIRE
My non-political take about this gift that keeps on giving: PDF might seem great for the end user who is just expected to read or print the file they are given, but the technology actually sucks.
PDF is basically a prettify layer on top of the older PostScript that brings along a lot of baggage. The moment you start trying to do what should be simple stuff, like editing lines, merging pages, or changing image resolution, it starts giving you a lot of headaches.
I used to have a few scripts around to fight some of its quirks from when I was writing my thesis and had to work daily with it. But well, it was still an improvement over Word.
It's meant as a printer replacement format, hence "print to PDF". It's a computer file format about equivalent to a printed document. Like a printed document, you can't just change its structure and recompile it.
dperfect|24 days ago
Claude Opus came up with this script:
https://pastebin.com/ntE50PkZ
It produces a somewhat-readable PDF (first page at least) with this text output:
https://pastebin.com/SADsJZHd
(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)
pests|24 days ago
https://www.mountsinai.org/about/newsroom/2012/dubin-breast-...
https://www.businessinsider.com/dubin-breast-center-benefit-...
Even the names match up, but oddly the date is different.
notpushkin|24 days ago
dperfect|24 days ago
which uses this Rust zlib stream fixer: https://pastebin.com/iy69HWXC
and gives the best output I've seen it produce: https://imgur.com/itYWblh
This is using the same OCR'd text posted by commenter Joe.
the_real_cher|24 days ago
bawolff|24 days ago
https://pretius.com/blog/ocr-tesseract-training-data
pyrolistical|24 days ago
pletnes|24 days ago
bawolff|24 days ago
percentcer|24 days ago
jjwiseman|24 days ago
sjducb|23 days ago
76 pages is a couple of months of work
quuxplusone|23 days ago
[1] - https://news.ycombinator.com/item?id=46906897
[2] - https://news.ycombinator.com/item?id=46916065
fragmede|24 days ago
WolfeReader|24 days ago
legitster|24 days ago
gucci-on-fleek|24 days ago
[0]: https://en.wikipedia.org/wiki/Open_XML_Paper_Specification
[1]: https://en.wikipedia.org/wiki/DjVu
[2]: https://en.wikipedia.org/wiki/TIFF
Spooky23|24 days ago
It’s not a tools problem, it’s a problem of malicious compliance and contempt for the law.
Ekaros|24 days ago
derwiki|24 days ago
unknown|24 days ago
[deleted]
ChocMontePy|24 days ago
The copy linked in the post:
https://www.justice.gov/epstein/files/DataSet%209/EFTA004004...
Three more copies:
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02153...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
https://www.justice.gov/epstein/files/DataSet%2010/EFTA02154...
Having several different versions might make it easier.
ChocMontePy|24 days ago
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
This doesn't solve the "1 & l" problem for the pdf you are looking at, but it could be useful anyway.
tcgv|24 days ago
pavel_lishin|23 days ago
pimlottc|24 days ago
Hmm. Anyone got some spare CPU time?
wahern|24 days ago
unknown|24 days ago
[deleted]
kalleboo|24 days ago
kevin_thibedeau|24 days ago
Followup: pdfimages is 13x faster than pdftoppm
masfuerte|24 days ago
chrisjj|24 days ago
Or worse. She did.
winddude|24 days ago
krupan|24 days ago
eek2121|24 days ago
bushbaba|24 days ago
Snoozus|24 days ago
phanimahesh|24 days ago
darig|24 days ago
[deleted]
velaia|24 days ago
unknown|24 days ago
[deleted]
nubg|24 days ago
ryanSrich|24 days ago
poyu|24 days ago
sznio|24 days ago
alhamdulillah23|23 days ago
Page 1: https://imgur.com/a/jwgu9uH
Page 2: https://imgur.com/a/4Zi3bkk
Use this: https://github.com/KoKuToru/extract_attachment_EFTA00400459
iwontberude|24 days ago
netsharc|24 days ago
The recipient is also named in there...
linuxguy2|24 days ago
Evidlo|24 days ago
subscribed|24 days ago
queenkjuul|24 days ago
FarmerPotato|24 days ago
zahlman|24 days ago
ks2048|24 days ago
I tried to find the message in this blog post, but couldn't. (I don't see how to search by date.)
blindriver|24 days ago
tclancy|24 days ago
rexpop|24 days ago
Incompetence is incompetence.
rapind|24 days ago
thereisnospork|24 days ago
subscribed|24 days ago
They wasted months erasing Trump from that instead. So it's on them.
krupan|24 days ago
hypeatei|24 days ago
zahlman|24 days ago
yunnpp|24 days ago
unknown|24 days ago
[deleted]
winddude|24 days ago
https://www.justice.gov/epstein/files/DataSet%2010/EFTA01804...
https://www.justice.gov/epstein/files/DataSet%209/EFTA007755...
https://www.justice.gov/epstein/files/DataSet%209/EFTA004349...
and then this one, judging by the name of the file (hanna something) and the content of the email:
"Here is my girl, sweet sparkling Hanna=E2=80=A6! I am sure she is on Skype "
maybe more sinister (so be careful; I have no idea what the laws are if you uncover you-know-what Trump and Epstein were into)...
https://www.justice.gov/epstein/files/DataSet%2011/EFTA02715...
[Above is probably a legit modeling CV for HANNA BOUVENG, based on, https://www.justice.gov/epstein/files/DataSet%209/EFTA011204..., but still creepy, and doesn't seem like there's evidence of her being a victim]
Enhaj12|23 days ago
I tried and got a lot of errors; can't seem to fix it, due to corruption.
https://www.docfly.com/editor/fa3bcb1fa9e8d2629b32/v9r21qsju...
Tried to get AI to guess the remaining text: https://pastebin.com/Z9X2d510
netsharc|24 days ago
Snoozus|24 days ago
eek2121|24 days ago
Cool article, however.
misja111|24 days ago
unknown|24 days ago
[deleted]
SomaticPirate|24 days ago
direwolf20|24 days ago
IshKebab|24 days ago
sorbus-25|24 days ago
sorbus-25|24 days ago
wtcactus|24 days ago
direwolf20|24 days ago
prettywoman|24 days ago
[deleted]
heraldgeezer|24 days ago
[deleted]
nullorempty|24 days ago
unknown|24 days ago
[deleted]