
1.3B Worldcat scrape and data science mini-competition

248 points | crtasm | 2 years ago | annas-blog.org

88 comments

[+] hedora|2 years ago|reply
I looked into using Dewey Decimal for a hobby project. OCLC has a de facto monopoly on it due to the Worldcat database. They're a non-profit, but they're supported by having libraries pay a subscription fee for Worldcat.

Back when OCLC was founded, the idea that people would want to have a copy of a card catalog for personal use was laughable, so I'm sympathetic to the people that set up their funding model. It's far cheaper for a library to subscribe to Worldcat than to hire a team to maintain such a database, so it created a win-win situation.

However, keeping the world's books' metadata a secret (and leaving control in the hands of a monopoly) is an anachronism.

It's well past the time when someone (such as an international coalition of Libraries of Congress) should figure out how to sustainably fund OCLC while also releasing their work into the public domain.

[+] mgr86|2 years ago|reply
I have been told that my organization developed a system, HABS, that pre-dated OCLC [0], and that OCLC used this system as an inspiration. However, I cannot confirm this. The closest I can come is finding a footnote that thanks Fred Kilgour, the founder of OCLC [1]. I should reach out to Koh, a friend of a friend, while she is still alive to confirm the story. Nevertheless, we have a collection of punch cards in a dusty attic room that was once the HABS system. I think it is a pretty fascinating legacy, and I wish it were better preserved.

[0] https://journals.sagepub.com/doi/pdf/10.1177/106939716900400... [1] https://journals.sagepub.com/doi/abs/10.1177/106939717300800...

[+] actuallyalys|2 years ago|reply
I suppose one way to do it would be to allow patrons of subscriber libraries to access the database dumps and API.

The downside is that this would still make the data harder than necessary to access and would leave some people out. The upside is that it's not that much of a change from their existing model. I'm sure there would also be concerns about database dumps being shared publicly, although Anna's Archive has already released their entire database, and I suspect most people who would pay for formal access wouldn't use an unauthorized copy. Ultimately, I suspect OCLC would still be resistant to this change, as it would feel like a huge shift, even if I'm not sure it would change much from their perspective.

[+] partytax|2 years ago|reply
The Melvil Decimal System popularized by https://www.librarything.com may be of interest.

Here's an explanation from their footer: "Although Dewey invented his system in 1876, recent editions of his system are in copyright. LibraryThing's Melvil Decimal System is based on the classification work of libraries around the world, whose assignments are not copyrightable. The "schedules" (the words that describe the numbers) come from a pre-copyright edition of his system, John Mark Ockerbloom's Free Decimal System, and member contributions."

[+] mannyv|2 years ago|reply
One strange thing about the Dewey Decimal System is that it's copyrighted and libraries pay a fee to use it.

This came to light when the Library Hotel in NYC used some of the notation for its room numbers and got sued. Everyone's reaction was: WTF?

[+] empthought|2 years ago|reply
OCLC is a nonprofit membership cooperative and would argue that it itself is that international coalition of national libraries and archives.
[+] freewizard|2 years ago|reply
ISBN is the default ID for book-related projects, and yes, it is convenient, but it's not without caveats. An often overlooked fact is that the ISBN was introduced in the late 1960s, so books published before then obviously don't have one; not every country adopted the ISBN from day one either (China, for example, used its own catalog systems until the 1980s); and because ISBNs are usually managed centrally by government or commercial agencies, censorship for political or commercial reasons is not uncommon: some books could not get published at all, or only saw the world without an ISBN.

For obvious reasons, older, non-English, or suppressed books may be exactly the ones that need the most care when it comes to preservation.

[+] wiml|2 years ago|reply
A second issue is that ISBNs identify a specific SKU (different formats will have different ISBNs, and different printings may even get different ISBNs), but book-related projects typically want some way to identify "the same book" across all those formats, printings, and sometimes even editions, translations, and collections. OCLC IDs identify a different space than ISBNs do.
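
To make the distinction concrete, here's a rough Python sketch of what projects fall back on when they only have edition-level records: clustering them into approximate "works" by a normalized title/author key. The field names and sample records below are made up, and real systems use curated work identifiers rather than string matching.

    from collections import defaultdict
    import re

    def work_key(record):
        # Crude "same book" key: normalized title plus first author's surname.
        title = re.sub(r'[^a-z0-9 ]', '', record['title'].lower()).strip()
        surname = record.get('author', '').split(',')[0].lower().strip()
        return (title, surname)

    def cluster_editions(records):
        # Group edition-level records (roughly one per ISBN) into work-level clusters.
        works = defaultdict(list)
        for rec in records:
            works[work_key(rec)].append(rec['isbn13'])
        return dict(works)

    # Made-up sample records: two formats of the "same book".
    editions = [
        {'title': 'An Example Novel', 'author': 'Author, Ann', 'isbn13': '9780000000002'},
        {'title': 'An Example Novel.', 'author': 'Author, Ann', 'isbn13': '9780000000019'},
    ]
    print(cluster_editions(editions))  # both ISBNs land under a single work key
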
[+] dredmorbius|2 years ago|reply
Along with other limitations, ISBNs are, well, book numbers. They're specific to books, and exclude many other forms of published materials.

OCLC spans books, articles, audio recordings, videos, and other catalogued artefacts and documents.

[+] neilv|2 years ago|reply
How does Anna's Archive keep all their lawyers from quitting?

> Even though OCLC is a non-profit, their business model requires protecting their database. Well, we’re sorry to say, friends at OCLC, we’re giving it all away. :-) [...]

> This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able scrape hundreds of millions (!) of records in mere days. [...]

> PS: We do want to give a genuine shout-out to the Worldcat team. Even though it was a small tragedy that your data was locked up, you did an amazing job at getting 30,000 libraries on board to share their metadata with you. As with many of our releases, we could not have done it without the decades of hard work you put into building the collections that we now liberate. Truly: thank you.

[+] RhodesianHunter|2 years ago|reply
For real. Openly bragging about exploiting security flaws to scrape data en masse, which undoubtedly put massive strain on back-end systems, is a far cry from what is generally considered legal (politely scraping public information).
[+] Sebguer|2 years ago|reply
You realize they're not a business, right?
[+] raybb|2 years ago|reply
From the end:

> We do want to give a genuine shout-out to the Worldcat team. Even though it was a small tragedy that your data was locked up, you did an amazing job at getting 30,000 libraries on board to share their metadata with you.

I wonder what the story is behind Worldcat getting so many libraries across the world on board. I don't know much about the software, but it must be pretty compelling.

[+] justaguitarist|2 years ago|reply
Disclaimer, I was a Linux admin at OCLC for a few years. The WorldCat database has been around since the early 70s, so I think that helps the numbers a bit. I don't have any insight into their marketing/sales/end user experience though.
[+] gmcharlt|2 years ago|reply
It's not the software per se, which is generally fit for purpose but not amazing, but the traditions and economics underpinning how libraries maintain their bibliographic metadata.

Libraries sharing metadata for their catalogs has a long history, dating back to at least 1902 when the Library of Congress started selling catalog cards for use by other libraries. In the 1960s, the Library of Congress embarked on various projects to computerize their catalog, leading to the creation of the MARC format as a common metadata format for exchanging bibliographic records. (And there is a straight line between how card catalogs were put together and much of how library metadata is conceptualized, although that's been (slowly) changing.)

One problem is that bibliographic metadata from the Library of Congress is mostly generated in-house, and LoC does not catalog everything; not even close. In the late 1960s, OCLC, the organization behind Worldcat, was started to operate a union catalog. The idea is that libraries could download bibliographic records needed for their own catalogs ("copy cataloging") and contribute new records for the unique stuff they cataloged ("original cataloging"). Under the aegis of OCLC as a non-profit organization, it was a pretty good deal for libraries, and over time led to additional services such as brokering interlibrary loan requests. After all, since Worldcat had a good idea of the holdings of libraries in North America (and over time, a good chunk of Europe and other areas), it was straightforward to set up an exchange for ILL requests.

Tie this to a general trend over the past couple of decades of libraries cutting the funding and staffing for maintaining their local catalogs, and the need to share in the creation and maintenance of library metadata has only become more important.

However, OCLC has had a long history of trying to control access to and use of the metadata in WorldCat, to the point of earning a general perception in many library quarters of trying to monopolize it. To give a taste, Aaron Swartz tangled with them back in the day. [1] One irony, among many, is that the majority of the metadata in Worldcat has its origins in the efforts of publicly funded libraries and as such shouldn't have been enclosed in the first place. OCLC also has a focus on growing itself, to the point where it does far more than run Worldcat. Its various ventures have earned it a reputation for charging high prices, to the point where participating in Worldcat can be too expensive for smaller libraries. (Fortunately for them, there are various alternative ways of getting MARC records for free or very cheap, but nobody has a database more comprehensive than Worldcat.)

That said, OCLC does do quite a bit itself to improve the overall quality of Worldcat and to try to push libraries past the 1960s-era MARC format. But one of the ironies of the scraping is that it's not going to be immediately helpful to the libraries that can't afford to participate in Worldcat. This is because the scrape didn't capture (and quite possibly never could have captured) the data in MARC format, which is what most library catalog software uses. While MARC records could be cross-walked from the JSON, they will undoubtedly omit some data elements found in the original MARC.

[1] http://www.aaronsw.com/weblog/oclcreply
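
As a rough illustration of that last point, a crosswalk from scraped JSON to MARC might look something like the Python sketch below. The JSON field names are guesses rather than Worldcat's actual schema, and the output here is just tag/subfield pairs; a real crosswalk would emit proper MARC records (e.g. with a library like pymarc) and would still lack the fixed fields and subfield detail of the original cataloging.

    def crosswalk_to_marc(rec):
        # Map a (hypothetical) scraped JSON record onto a handful of MARC tags,
        # represented here as {tag: [(subfield_code, value), ...]} for simplicity.
        marc = {}
        if rec.get('creator'):
            marc['100'] = [('a', rec['creator'])]                      # main entry: personal name
        if rec.get('title'):
            marc['245'] = [('a', rec['title'])]                        # title statement
        if rec.get('publisher') or rec.get('publicationDate'):
            marc['264'] = [('b', rec.get('publisher', '')),
                           ('c', rec.get('publicationDate', ''))]      # publication statement
        if rec.get('oclcNumber'):
            marc['035'] = [('a', '(OCoLC)' + str(rec['oclcNumber']))]  # OCLC control number
        for isbn in rec.get('isbns', []):
            marc.setdefault('020', []).append(('a', isbn))             # ISBN(s)
        return marc

    print(crosswalk_to_marc({'title': 'An Example Novel', 'creator': 'Author, Ann',
                             'publisher': 'Example Press', 'publicationDate': '2010',
                             'oclcNumber': 123456789, 'isbns': ['9780000000002']}))
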

[+] kinos|2 years ago|reply
they probably have a good interface for personal library tracking
[+] DoctorOetker|2 years ago|reply
> We scraped ISBNdb, and downloaded the Open Library dataset, but the results were unsatisfactory. The main problem was that there was not a ton of overlap of ISBNs.

What prevents ISBN collisions between authors? Is there a central authority assigning them, or is there, say, a national prefix, with each government assigning ISBNs for local publications (perhaps delegating this to another body in that nation)?

Surely such bodies would have the most complete view on all this data.

It's also bizarre that this simple metadata is not available from whatever authority assigns ISBNs.
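
One mundane thing that can depress overlap when matching datasets on raw ISBN strings is mixing ISBN-10s and ISBN-13s for the same books; normalizing everything to ISBN-13 before comparing helps. A minimal Python sketch (the dataset variables in the comment at the end are placeholders):

    def isbn10_to_13(isbn10):
        # Prefix with 978, drop the old check digit, recompute the ISBN-13 check
        # digit (weights alternate 1 and 3 over the first twelve digits).
        core = '978' + isbn10[:9]
        check = (10 - sum(int(d) * (1 if i % 2 == 0 else 3)
                          for i, d in enumerate(core)) % 10) % 10
        return core + str(check)

    def normalize(isbn):
        isbn = isbn.replace('-', '').replace(' ', '').upper()
        return isbn10_to_13(isbn) if len(isbn) == 10 else isbn

    print(normalize('0-306-40615-2'))   # -> 9780306406157

    # Overlap between two scraped sets (placeholder variable names), e.g.:
    # len({normalize(i) for i in openlibrary_isbns} &
    #     {normalize(i) for i in isbndb_isbns})
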

[+] sevenseventen|2 years ago|reply
There are regional ISBN agencies. The US agency, Bowker, assigns ISBN prefixes by publisher, and publishers assign within their prefix as they please. They're supposed to use one ISBN per edition and format, but many publishers use ISBN as a kind of SKU so you can't 100% count on that.

If that sounds sloppy...I went to publishing conferences fairly regularly from the late 90's into the teens, and I never saw a program that didn't have at least one session or panel titled something like "Publishers must improve their metadata practices."

[+] gmcharlt|2 years ago|reply
ISBNs are messy.

The International ISBN Agency coordinates assigning ISBN ranges to national agencies, who in turn assign subranges to publishers. The publishers in turn assign specific numbers to their own works. However, the international agency does not itself maintain a universal database of assigned ISBNs - the most it operates is a global database of publishers and their assigned ranges. And since it's the publishers who are assigning numbers from their allocations, various errors can crop up, including reusing ISBNs for different works and failing to issue distinct ISBNs for different formats. (For example, if you publish hardcover, paperback, and ebook versions of a book, you should assign three ISBNs. That rule is not always observed.)

Also, libraries hold many books that long predate ISBNs; it wasn't until 1965 that the immediate predecessor of the ISBN, the SBN, was a twinkle in a bookseller's eye.
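
To make the hierarchy concrete: the registrant (publisher) element inside an ISBN-13 is variable-length, so splitting one requires the range tables the International ISBN Agency publishes. A toy Python sketch, with made-up ranges (not the agency's real tables) and the simplifying assumption of a one-digit registration group:

    # Made-up example ranges for prefix 978, registration group "0":
    # each entry is ((low, high) for the leading digits, registrant length).
    SAMPLE_RANGES = {
        ('978', '0'): [((0, 19), 2), ((200, 699), 3), ((7000, 8499), 4)],
    }

    def split_isbn13(isbn):
        prefix, body, check = isbn[:3], isbn[3:12], isbn[12]
        group, rest = body[0], body[1:]          # real groups can be 1-5 digits long
        for (low, high), width in SAMPLE_RANGES.get((prefix, group), []):
            if low <= int(rest[:width]) <= high:
                return prefix, group, rest[:width], rest[width:], check
        return None                              # range not covered by the toy table

    print(split_isbn13('9780306406157'))
    # -> ('978', '0', '306', '40615', '7'), i.e. 978-0-306-40615-7
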

[+] crtasm|2 years ago|reply
There are lots of agencies; they hand out blocks of numbers from their allocations. It seems there's no central database for the metadata:

https://www.isbn-international.org/content/isbn-users-manual... ISBN_FAQs_to_7ed_Manual_Absolutely_final.docx

> Will people in other countries be able to search for my books in search engines in those countries?

> This does not happen automatically ... In order for your book to be listed in other countries you should contact the respective ISBN Agency and ask them for details of how to be entered into their national catalogue for books in circulation (books in print). Sometimes you will have to obtain a distributor from that country or have an address in that country before this is possible. In some circumstances in order to be listed, the book must be in the language of that country. As well as catalogues of books in circulation, you may also want to ensure that you are listed by internet retailers... . Again, you will need to contact each of these organisations directly (including each separate international branch) with details of your book.

[+] skuxxlife|2 years ago|reply
For US/UK/NZ/Aus/SA, ISBNs are granted through Bowker, which does maintain a "Books In Print" data set that, in theory, contains metadata for all of the ISBNs it has granted. In practice, though, it's a mess. It's expensive to access and relies on publishers to enter accurate and consistent metadata, which is...variable in quality, to say the least. Often publishers buy blocks of ISBNs to use later, so no metadata is entered up front and has to be pushed to Bowker at a later date. To be somewhat fair to Bowker, the history of ISBNs far predates modern data standards, and I can imagine wrangling publishers to get accurate data is a difficult task. But on the other hand, you'd think they'd have a lot to gain from doing it right. As someone who runs a book website, it is endlessly frustrating.
[+] yorwba|2 years ago|reply
Companies can get a range of ISBNs before deciding what to publish under each ISBN, or whether to publish something at all. So the authorities assigning ISBNs don't necessarily know what they're being used for.
[+] qingcharles|2 years ago|reply
I can't get the .torrent file to work in my client. Can anyone give me a magnet link for it?

I need the magazine ISSNs for my magazine encyclopedia.

edit: got the .torrent working in qTorrent
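
For anyone else who hits this: if you have the .torrent file, you can derive a magnet link yourself by SHA-1 hashing the bencoded "info" dictionary. A quick Python sketch (the filename is a placeholder, and it assumes the literal key b'4:info' only appears at the top level, which holds for ordinary torrent files):

    import hashlib

    def skip_value(data, i):
        # Return the index just past the bencoded value starting at offset i.
        c = data[i:i+1]
        if c == b'i':                            # integer: i<digits>e
            return data.index(b'e', i) + 1
        if c in (b'l', b'd'):                    # list/dict: members until 'e'
            i += 1
            while data[i:i+1] != b'e':
                i = skip_value(data, i)
            return i + 1
        colon = data.index(b':', i)              # byte string: <length>:<bytes>
        return colon + 1 + int(data[i:colon])

    def magnet_from_torrent(path):
        data = open(path, 'rb').read()
        start = data.index(b'4:info') + len(b'4:info')
        end = skip_value(data, start)
        infohash = hashlib.sha1(data[start:end]).hexdigest()
        return 'magnet:?xt=urn:btih:' + infohash

    print(magnet_from_torrent('worldcat_scrape.torrent'))   # placeholder filename
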

[+] mannyv|2 years ago|reply
It's unclear what exactly the competition is about. Just to poke around the dataset?
[+] mannyv|2 years ago|reply
As a note, I wish I had enough space to mirror their library. Looking at this brings out the collector in me...a tendency that I've successfully suppressed. You can only keep so many terabytes of archive around.
[+] wolverine876|2 years ago|reply
An earlier study that addressed the scope of all published works:

J-B Michel et al., "Quantitative Analysis of Culture Using Millions of Digitized Books," Science (16 Dec 2010). https://www.science.org/doi/10.1126/science.1199644

Their focus was on words but along the way they analyzed the number of published texts, a study that "includes serials and sets but excludes kits, mixed media, and periodicals such as newspapers".

They concluded that the world had published 129 million "editions" (one book may have multiple editions).

[+] itissid|2 years ago|reply
Noob question: Isn't this going to be a great source for training language models? Is it safe to assume that OpenAI/Google/Meta etc. already have these?

In any case great work!

[+] crtasm|2 years ago|reply
Worldcat is a database of books, not the books themselves. The summary and description text might be useful though?
[+] mannyv|2 years ago|reply
If you could somehow download the entire archive you could feed it into your LLM for training. This is a huge corpus and is sort of ill-gotten. That said, it would be pretty awesome.

Google has this sort of thing already, since they have that whole "let's digitize the world's books" project. It's interesting that Google never developed a ChatGPT, given that they literally have a large amount of the world's books digitized.

[+] m00dy|2 years ago|reply
yes, of course. AA has a special program for llm developers.
[+] dredmorbius|2 years ago|reply
Question for anyone from Anna's Archive or elsewhere: are catalogue metadata available from national library collections such as the British Library or US Library of Congress?

(I've ... worked a bit with LoC classification and subject headings data, for which the publicly available data come only in PDF or word-processing (MS Word or WordPerfect, if memory serves) formats. Which is ... somewhat unfortunate.)

[+] gmcharlt|2 years ago|reply
Depending on what you're looking for, a _lot_ more is being published by the Library of Congress as Linked Data nowadays, including LoC classification and subject headings. Check out https://id.loc.gov/.
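
For anyone who wants to poke at it programmatically, here's a minimal sketch, assuming the .json serialization is still offered on id.loc.gov; the identifier below is a placeholder to swap for whatever heading you're after.

    import json
    import urllib.request

    def fetch_lcsh(heading_id):
        # Fetch one LCSH authority record from id.loc.gov as JSON(-LD).
        url = f'https://id.loc.gov/authorities/subjects/{heading_id}.json'
        req = urllib.request.Request(url, headers={'User-Agent': 'metadata-experiment/0.1'})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    data = fetch_lcsh('sh00000000')            # placeholder identifier; substitute a real one
    print(json.dumps(data, indent=2)[:1500])   # inspect the returned MADS/SKOS graph
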
[+] greenie_beans|2 years ago|reply
this is an obvious prediction, but with the writers' class action lawsuit against openai for using their books, the internet will become more closed. it's gonna be so hard to scrape websites in the future. we were trending in this direction before gpt, but gpt exacerbated this and put the issue into the mainstream.
[+] sterlind|2 years ago|reply
It's infuriating seeing non-profits gatekeep datasets that were compiled with grant money. At least Elsevier doesn't present itself as a charity.

I was recently trying to get my hands on the Switchboard and Fisher conversational speech datasets. Both were funded by DARPA grants, and maintained by the non-profit LDC, which charges you thousands of dollars for access (and no discounts for individual researchers) - that is, if they'll even pay attention to you without a .edu email address. And both are standard corpora in the field of audio NLP, which makes replicating studies impossible.

Sadly, I couldn't find any way to pirate the datasets - they're too niche. So I applaud the authors for sticking it to Worldcat and scraping their data.

[+] 3abiton|2 years ago|reply
I hope one day there will be a Pirate Bay for datasets (the Pile) and AI models (Llama).
[+] harveywi|2 years ago|reply
>Over the past year, we’ve meticulously scraped all Worldcat records. At first, we hit a lucky break. Worldcat was just rolling out their complete website redesign (in Aug 2022). This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able scrape hundreds of millions (!) of records in mere days.

>After that, security flaws were slowly fixed one by one, until the final one we found was patched about a month ago. By that time we had pretty much all records, and were only going for slightly higher quality records.

OCLC carelessly fiddlefarted around with their moat and lost it. Poof!

[+] RhodesianHunter|2 years ago|reply
I don't think anyone is (legally) going to prop up a business or non-profit using data that was admittedly taken from them using their security holes.