Did some similar work with similar visualizations ~2009, on ~5.7M research articles (PDFs, private corpus) from scientific publishers Elsevier, Springer:
Newton, G., A. Callahan & M. Dumontier. 2009. Semantic Journal Mapping for Search Visualization in a Large Scale Article Digital Library. Second Workshop on Very Large Digital Libraries at the European Conference on Digital Libraries (ECDL) 2009. https://lekythos.library.ucy.ac.cy/bitstream/handle/10797/14...
I can imagine mining all of these articles was a ton of work. I’d be curious to know how quickly the computation could be done today vs. the 13 hour 2009 benchmark :)
Nowadays people would be slamming those data through UMAP!
One of the now-underdiscussed features of embeddings is that you can indeed use any existing statistical modeling techniques on them out of the box, and as a bonus avoid the common NLP preprocessing nuances and pitfalls (e.g. stemming) entirely.
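A minimal sketch of that point, assuming scikit-learn and using random vectors as stand-ins for real document embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for document embeddings: two classes with slightly shifted means.
X = np.vstack([rng.normal(0.0, 1.0, (200, 384)),
               rng.normal(0.3, 1.0, (200, 384))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any off-the-shelf model fits directly on the embedding vectors:
# no tokenization, stemming, or stop-word handling required.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```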
This post is a good example on why going straight to LLM embeddings for NLP is a pragmatic first step, especially for long documents.
Hi snats, great article. You mention the accuracy of the various techniques you used, could you explain more about how you calculated the accuracy? Were the pdfs already categorized?
Interesting read with lots of good detail, thank you. A comment: if you are balancing the classes when you do one vs all binary training, and then use the max probability for inference, your probabilities might not be calibrated well, which could be a problem. Do you correct the probabilities before taking the argmax?
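One way to do that correction, as a hedged sketch (scikit-learn's `CalibratedClassifierCV` on synthetic data; not necessarily what the article does):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Balancing classes during one-vs-rest training skews each binary model's
# base rate, so recalibrate the probabilities before taking the argmax.
base = LogisticRegression(max_iter=1000, class_weight="balanced")
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X, y)

probs = calibrated.predict_proba(X)  # rows sum to 1 after calibration
preds = probs.argmax(axis=1)
```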
That was before hoarding and building questionable businesses around them became a thing. I remember it being really easy to find textbooks, solution manuals, and related PDFs as late as 2008, far easier than 6-8 years later.
The main difference was that sites like Chegg and many others started slurping them up to resell in some way.
I personally have about 350GB worth of old service manuals, data sheets, catalogs, and periodicals, mostly related to electronics and engineering. All from torrent sources from ~2 years ago (when I wanted to mess with GraphQL and some OSR resources).
Care to make it publicly available? Or is that not permitted for your dataset? Certainly, there are a lot more PDFs out there than 8TB. I bet there's a lot of redundancy in yours, but it doesn't dedup well because of all the images.
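For what it's worth, exact duplicates are still cheap to find even when near-duplicate detection is hard. A stdlib-only sketch (the function name is mine):

```python
import hashlib
from pathlib import Path

def dedup_exact(paths):
    """Group files by SHA-256 of their raw bytes. Only byte-identical
    files collapse, which is why image-heavy scans rarely dedup."""
    seen = {}
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        seen.setdefault(digest, []).append(p)
    return [group for group in seen.values() if len(group) > 1]
```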
Interesting and fun article! I've been experimenting with various LLM/GenAI solutions to extract tabular data from PDFs, with underwhelming results. It seems like they are good at extracting strings of text and summarizing (e.g. what was the total price? when was this printed?), but extracting reliably into a CSV has a decent margin of error.
Very cool! At Airtrain we’ve also found embeddings can be very valuable for building classification models. If you’re looking to play around with a large amount of text and embeddings we actually recently deduped and embedded all of fineweb-edu (also mentioned in the article) and put the resulting dataset on Hugging Face: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fort...
This is a really cool idea, thanks for sharing. I don't have that much free time these days, but I was thinking of trying a similar-but-different project not too long ago.
I wanted to make a bit of an open source tool to pull down useful time series data for the social sciences (e.g. time series of social media comments about grocery prices). Seems like LLMs have unlocked all kinds of new research angles that people aren't using yet.
I may steal some of your good ideas if I ever get to work on that side project :)
Nice work! You've taken multiple approaches similar to what I sometimes do at the national library; I've used all kinds of embeddings -> classifiers / LDA.
Classification is just a start. Wondering if it's worth doing something more -- like turning all of the text into Markdown or HTML? Would anyone find that interesting?
There are a lot of webcrawlers whose chief feature is turning a website into Markdown. I don't quite understand what they're doing for me that's useful, since I can just do something like `markdownify(my_html)` or whatever. All this to say: I wouldn't find this useful, but clearly people do think it's a useful feature as part of an LLM pipeline.
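(`markdownify` is a real PyPI package; as a toy illustration of why this feels like a one-liner, here's a stdlib-only sketch that handles just a few tags:)

```python
from html.parser import HTMLParser

class TinyMarkdownify(HTMLParser):
    """Toy HTML-to-Markdown converter covering a few common tags."""
    def __init__(self):
        super().__init__()
        self.out = []
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "li":
            self.out.append("- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("h1", "p", "li"):
            self.out.append("\n")
    def handle_data(self, data):
        self.out.append(data)

def markdownify(html):
    parser = TinyMarkdownify()
    parser.feed(html)
    return "".join(parser.out)

print(markdownify("<h1>Title</h1><p>Some <b>bold</b> text</p>"))
```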
My first thought on seeing the PCA embeddings scatterplot was "I wonder what pdfs are at the centre of those two clusters?" The most typical pdfs on the internet.
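A hedged sketch of how you might find them, assuming scikit-learn and synthetic blobs in place of the real embeddings: cluster the 2-D projection, then take the point nearest each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)

# Stand-ins for PDF embeddings: two blobs, like the two clusters in the plot.
emb = np.vstack([rng.normal(-2, 1, (100, 50)), rng.normal(2, 1, (100, 50))])

coords = PCA(n_components=2).fit_transform(emb)  # the 2-D scatterplot
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# The "most typical" point in each cluster: the one nearest its centroid.
typical = pairwise_distances_argmin(km.cluster_centers_, coords)
print(typical)
```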
I've been playing with https://www.aryn.ai/ for partitioning. Curious if anyone has tried these tools for better data extraction from PDFs. Any other suggestions?
(I'm a bit disappointed that most of the discussion is about estimating the size of PDFs on the internet, I'd love to hear more about different approaches to extracting better data from the PDFs.)
This seems like cool work but with a ton of "marketing hype speak" that immediately gets watered down by the first paragraph.
Ordering of statements.
1. (Title) Classifying all of the pdfs on the internet
2. (First Paragraph) Well not all, but all the PDFs in Common Crawl
3. (First Image) Well not all of them, but 500k of them.
I am not knocking the project, but while categorizing 500k PDFs is something we couldn't necessarily do well a few years ago, this is far from "The internet's PDFs".
Interesting read, I did not know about Common Crawl. I feel like RTBF is kind of a lost battle these days with more and more crawlers for AI and whatnot. Once on the internet there is no way back, for better or for worse. This tangent aside, 8TB is really not a lot of data, it's just 8 consumer-grade 1TB hard drives. I find it hard to believe this is "the largest corpus of PDFs online", maybe the largest public one. Not sure how representative it is of "the whole internet".
RTBF was a ludicrous concept before AI and these new crawlers.
Only EU bureaucrats would have the hubris to believe you could actually, comprehensively remove information from the Internet. Once something is spread, it is there, forever.
Doesn't sound like a lot, but where I am now we routinely work on very large infrastructure projects, and the plans, documents and other material mostly come as PDF. We are talking thousands of documents, often with thousands of pages, per project, and even very big projects almost never break 20 GB.
If you like, you could say PDFs are information dense, but data sparse. After all, it is mostly white space ;)
Common Crawl only pulls documents smaller than a small size limit (1MiB, last I checked). Without special handling in this project, documents bigger than that would be missing.
So indeed, not representative of the whole Internet.
Tangentially related, I was once handed a single PDF between 2 and 5 GBs in size and asked to run inference on it. This was the result of a miscommunication with the data provider, but I think it's funny and almost impressive that this file even exists.
Yeah, 8TB is really tiny. Google Scholar was estimated to index 160,000,000 PDFs in 2015.[0] If we assume that a third of those are not behind paywalls, and an average PDF size of 1MB, it ends up as something above 50TB of documents. Almost ten years later, the number of available PDFs from just scholarly communication should be substantially higher.
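The back-of-envelope arithmetic, with the comment's own assumptions spelled out:

```python
total_pdfs = 160_000_000      # Google Scholar index size, 2015 estimate [0]
open_fraction = 1 / 3         # assumed share not behind paywalls
avg_size_mb = 1               # assumed average PDF size

total_tb = total_pdfs * open_fraction * avg_size_mb / 1_000_000
print(round(total_tb, 1))     # roughly 53.3 TB
```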
I don’t have 8TB laying around, but we can be a bit more clever.... In particular I cared about a specific column called url. I really care about the urls because they essentially tell us a lot more about a website than what meets the eye.
Am I correct that it is only using the URL of the PDF to do classification? Maybe still useful, but that's quite a different story than "classifying all the pdfs".
It’s just classifying the URLs if that’s the case.
The legwork to classify PDFs is already done, and the authorship of the article can go to anyone who can get a grant for a $400 NewEgg order for an 8TB drive.
gnewton77|1 year ago
I am the first author.
dangoodmanUT|1 year ago
I dug through the code and it seemed like a ton of things I'm not familiar with; probably a lot of techniques I don't know, rather than the Python itself, of course.
bprew|1 year ago
Thanks!
sporedro|1 year ago
I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.
abhi_p|1 year ago
Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...
We recently released it and we have a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show you how to turn the tabular data from the PDF into a pandas DataFrame (which you can then turn into CSV).
sireat|1 year ago
Curious on your prompt: https://github.com/snat-s/m/blob/main/classify_metadata/prom...
Wouldn't this be basically prompting to classify by the type of URL?
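Quite possibly; a URL alone does carry a lot of signal. A toy rule-based version (hypothetical patterns and labels, not the prompt's actual categories) makes the point:

```python
import re

# Hypothetical URL-pattern rules; a real classifier would use far more signal.
RULES = [
    (r"arxiv\.org|doi\.org|/papers?/", "science"),
    (r"sec\.gov|10-k|annual[_-]report", "finance"),
    (r"\.gov/|legislation|statute", "legal"),
]

def classify_url(url):
    for pattern, label in RULES:
        if re.search(pattern, url, flags=re.IGNORECASE):
            return label
    return "unknown"

print(classify_url("https://arxiv.org/pdf/2106.01345.pdf"))  # science
```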
excalibur|1 year ago
Definitely as 'hot dog' or 'not a hot dog'.
dwynings|1 year ago
Full disclosure: I'm an employee
ziddoap|1 year ago
For those of us who aren't familiar with this random acronym, I think RTBF = right to be forgotten.
tivert|1 year ago
Right to be forgotten, not the Belgian public service broadcaster (https://en.wikipedia.org/wiki/RTBF)?
tokai|1 year ago
[0] https://link.springer.com/article/10.1007/s11192-015-1614-6
moralestapia|1 year ago
(Although you could argue libgen is not really "public" in the legal sense of the word, lol).
Disregarding that, the article is great!
(edit: why would someone downvote this, HN is becoming quite hostile lately)