Did some similar work with similar visualizations ~2009, on ~5.7M research articles (PDFs, private corpus) from scientific publishers Elsevier, Springer:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I can imagine mining all of these articles was a ton of work. I’d be curious to know how quickly the computation could be done today vs. the 13 hour 2009 benchmark :)
Nowadays people would be slamming those data through UMAP!
One of the now-underdiscussed features of embeddings is that you can indeed use any existing statistical modeling techniques on them out of the box, and as a bonus avoid the common NLP preprocessing nuances and pitfalls (e.g. stemming) entirely.
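A minimal sketch of that point, assuming scikit-learn and using random vectors as stand-ins for real document embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for document embeddings: two classes with slightly shifted means.
X = np.vstack([rng.normal(0.0, 1.0, (200, 384)),
               rng.normal(0.3, 1.0, (200, 384))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any off-the-shelf model fits directly on the embedding vectors:
# no tokenization, stemming, or stop-word handling required.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```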
This post is a good example on why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
Hi snats, great article. You mention the accuracy of the various techniques you used, could you explain more about how you calculated the accuracy? Were the pdfs already categorized?
Interesting read with lots of good detail, thank you. A comment: if you are balancing the classes when you do one vs all binary training, and then use the max probability for inference, your probabilities might not be calibrated well, which could be a problem. Do you correct the probabilities before taking the argmax?
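One way to do that correction, as a hedged sketch (scikit-learn's `CalibratedClassifierCV` on synthetic data; not necessarily what the article does):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Balancing classes during one-vs-rest training skews each binary model's
# base rate, so recalibrate the probabilities before taking the argmax.
base = LogisticRegression(max_iter=1000, class_weight="balanced")
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X, y)

probs = calibrated.predict_proba(X)  # rows sum to 1 after calibration
preds = probs.argmax(axis=1)
```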
That was before hoarding and building questionable businesses around them became a thing. I remember it being really easy to find textbooks, solution manuals, and related PDFs as late as 2008, far easier than 6-8 years later.
The main difference was that sites like Chegg and many others started slurping them up to resell in some way.
I personally have about 350GB worth of old service manuals, data sheets, catalogs, and periodicals, mostly related to electronics and engineering. All from torrent sources from ~2 years ago (when I wanted to mess with GraphQL and some OSR resources).
Care to make it publicly available? Or is that not permitted for your dataset? Certainly, there are a lot more PDFs out there than 8TB. I bet there's a lot of redundancy in yours, but it doesn't dedup well because of all the images.
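For what it's worth, exact duplicates are still cheap to find even when near-duplicate detection is hard. A stdlib-only sketch (the function name is mine):

```python
import hashlib
from pathlib import Path

def dedup_exact(paths):
    """Group files by SHA-256 of their raw bytes. Only byte-identical
    files collapse, which is why image-heavy scans rarely dedup."""
    seen = {}
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        seen.setdefault(digest, []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```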
Interesting and fun article! I've been experimenting with various LLM/GenAI solutions to extract tabular data from PDFs, with underwhelming results. It seems like they are good at extracting strings of text and summarizing (e.g. what was the total price? when was this printed?), but extracting reliably into a CSV has a decent margin of error.
Very cool! At Airtrain we’ve also found embeddings can be very valuable for building classification models. If you’re looking to play around with a large amount of text and embeddings we actually recently deduped and embedded all of fineweb-edu (also mentioned in the article) and put the resulting dataset on Hugging Face: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fort...
This is a really cool idea, thanks for sharing. I don't have that much free time these days, but I was thinking of trying a similar-but-different project not too long ago.
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
Nice work! You've taken multiple approaches similar to what I sometimes do at the national library; I've used all kinds of embeddings -> classifiers / LDA.
Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?
There are a lot of webcrawlers whose chief feature is turning a website into Markdown. I don't quite understand what they're doing for me that's useful, since I can just do something like `markdownify(my_html)` or whatever. All this to say: I wouldn't find this useful, but clearly people do think it's a useful feature as part of an LLM pipeline.
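(`markdownify` is a real PyPI package; as a toy illustration of why this feels like a one-liner, here's a stdlib-only sketch that handles just a few tags:)

```python
from html.parser import HTMLParser

class TinyMarkdownify(HTMLParser):
    """Toy HTML-to-Markdown converter covering a few common tags."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "li":
            self.out.append("- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("h1", "p", "li"):
            self.out.append("\n")
    def handle_data(self, data):
        self.out.append(data)

def markdownify(html):
    parser = TinyMarkdownify()
    parser.feed(html)
    return "".join(parser.out)

print(markdownify("<h1>Title</h1><p>Some <b>bold</b> text</p>"))
```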
My first thought on seeing the PCA embeddings scatterplot was "I wonder what pdfs are at the centre of those two clusters?" The most typical pdfs on the internet.
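A hedged sketch of how you might find them, assuming scikit-learn and synthetic blobs in place of the real embeddings: cluster the 2-D projection, then take the point nearest each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)

# Stand-ins for PDF embeddings: two blobs, like the two clusters in the plot.
emb = np.vstack([rng.normal(-2, 1, (100, 50)), rng.normal(2, 1, (100, 50))])

coords = PCA(n_components=2).fit_transform(emb)  # the 2-D scatterplot
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# The "most typical" point in each cluster: the one nearest its centroid.
typical = pairwise_distances_argmin(km.cluster_centers_, coords)
print(typical)
```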
I've been playing with https://www.aryn.ai/ for partitioning. Curious if anyone has tried these tools for better data extraction from PDFs. Any other suggestions?
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.)
This seems like cool work but with a ton of "marketing hype speak" that immediately gets watered down by the first paragraph.
Ordering of statements.
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "The internet's PDFs".
Interesting read, I did not know about Common Crawl. I feel like RTBF is kind of a lost battle these days with more and more crawlers for AI and whatnot. Once on the internet there is no way back, for better or for worse. This tangent aside, 8TB is really not a lot of data, it's just 8 consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online", maybe the largest public one. Not sure how representative it is of "the whole internet".
RTBF was a ludicrous concept before AI and these new crawlers.
Only EU bureaucrats would have the hubris to believe you could actually, comprehensively remove information from the Internet. Once something is spread, it is there, forever.
Doesn't sound like a lot, but where I am now we routinely work on very large infrastructure projects, and the plans, documents and other material mostly come as PDF. We are talking thousands of documents, often with thousands of pages, per project, and even very big projects almost never break 20 GB.
If you like, you could say PDFs are information dense, but data sparse. After all, it is mostly white space ;)
Common Crawl only pulls documents smaller than a small size limit (1MiB, last I checked). Without special handling in this project, documents bigger than that would be missing.
So indeed, not representative of the whole Internet.
Tangentially related, I was once handed a single PDF between 2 and 5 GBs in size and asked to run inference on it. This was the result of a miscommunication with the data provider, but I think it's funny and almost impressive that this file even exists.
Yeah, 8TB is really tiny. Google Scholar was estimated to index 160,000,000 PDFs in 2015.[0] If we assume that a third of those are not behind paywalls, and an average PDF size of 1MB, it ends up as something above 50TB of documents. Almost ten years later, the number of available PDFs from just scholarly communication should be substantially higher.
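The back-of-envelope arithmetic, with the comment's own assumptions spelled out:

```python
total_pdfs = 160_000_000      # Google Scholar index size, 2015 estimate [0]
open_fraction = 1 / 3         # assumed share not behind paywalls
avg_size_mb = 1               # assumed average PDF size

total_tb = total_pdfs * open_fraction * avg_size_mb / 1_000_000
print(round(total_tb, 1))     # roughly 53.3 TB
```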
I don’t have 8TB laying around, but we can be a bit more clever.... In particular I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more about a website than what meets the eye.
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story than "classifying all the pdfs".
It’s just classifying the URLs if that’s the case.
The legwork to classify PDFs is already done, and the authorship of the article can go to anyone who can get a grant for a $400 NewEgg order for an 8TB drive.
gnewton77|1 year ago
I am the first author.
dangoodmanUT|1 year ago
I dug through the code and it seemed like a ton of things I'm not familiar with; probably a lot of techniques I don't know, rather than the Python itself, of course.
bprew|1 year ago
Thanks!
sporedro|1 year ago
I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.
abhi_p|1 year ago
Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
We recently released it and we have a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show you how to turn the tabular data from the PDF into a pandas DataFrame (which you can then turn into CSV).
sireat|1 year ago
Curious on your prompt: https://github.com/snat-s/m/blob/main/classify_metadata/prom...
Wouldn't this be basically prompting to classify by the type of URL?
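Quite possibly; a URL alone does carry a lot of signal. A toy rule-based version (hypothetical patterns and labels, not the prompt's actual categories) makes the point:

```python
import re

# Hypothetical URL-pattern rules; a real classifier would use far more signal.
RULES = [
    (r"arxiv\.org|doi\.org|/papers?/", "science"),
    (r"sec\.gov|10-k|annual[_-]report", "finance"),
    (r"\.gov/|legislation|statute", "legal"),
]

def classify_url(url):
    for pattern, label in RULES:
        if re.search(pattern, url, flags=re.IGNORECASE):
            return label
    return "unknown"

print(classify_url("https://arxiv.org/pdf/2106.01345.pdf"))  # science
```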
excalibur|1 year ago
Definitely as 'hot dog' or 'not a hot dog'.
dwynings|1 year ago
Full disclosure: I'm an employee
ziddoap|1 year ago
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.
tivert|1 year ago
Right to be forgotten, not the Belgian public service broadcaster (https://en.wikipedia.org/wiki/RTBF)?
tokai|1 year ago
[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
moralestapia|1 year ago
(Although you could argue libgen is not really "public" in the legal sense of the word, lol).
Disregarding that, the article is great!
(edit: why would someone downvote this, HN is becoming quite hostile lately)