Visualizing all books of the world in ISBN-Space

486 points| phiresky | 1 year ago |phiresky.github.io

91 comments

PaulDavisThe1st|1 year ago

Wow.

When we started Amazon, this was precisely what I wanted to do, but using Library of Congress triple classifications instead of ISBN.

It turned out to be impossible because the data provider (a mixture of Baker & Taylor (book distributors) and Books In Print) munged the triple classification into a single string, so you could not find the boundaries reliably.

Had to abandon the idea before I even really got started on it, and it would certainly have been challenging to do this sort of "flythrough" in the 1994-1995 version of "the web".

Kudos!

dredmorbius|1 year ago

What are you referring to as the LoC triple classification?

I've spent quite some time looking at both the LoC Classification and the LoC Subject Headings. Sadly the LoC doesn't make either freely available in a useful machine-readable form, though it's possible to play games with the PDF versions. I'd been impressed by a few aspects of this. One point that particularly sticks in my mind is that the state-law section of the Classification shows a very nonuniform density of classifications amongst states. If memory serves, NY and CA are by far the most complex, with PA a somewhat distant third, and many of the "flyover" states having almost absurdly simple classifications, often quite similar. I suspect that this reflects the underlying statutory, regulatory, and judicial / caselaw complexity.

Another interesting historical factoid is that the classification and its alphabetic top-level segmentation apparently spring directly from Thomas Jefferson's personal library, which formed the origin of the LoC itself.

For those interested, there's a lot of history of the development and enlargement of the Classification in the annual reports of the Librarian of Congress to Congress, which are available at Hathi Trust.

Classification: <https://www.loc.gov/catdir/cpso/lcco/>

Subject headings: <https://id.loc.gov/authorities/subjects.html>

Annual reports:

- Recent: <https://www.loc.gov/about/reports-and-budgets/annual-reports...>

- Historical archive to ~1866: <https://catalog.hathitrust.org/Record/000072049>

ilamont|1 year ago

> a mixture of Baker & Taylor (book distributors)

Having dealt with Baker & Taylor in the past, this doesn't surprise me in the least. It was one of the most technologically backwards companies I've ever dealt with. Purchase orders and reconciliations were still managed with paper, PDFs, and emails as of early 2020 (when I closed my account). I think at one point they even had me faxing documents in.

layer8|1 year ago

It’s not uncommon for an ISBN to have been assigned multiple times to different books [0]. Thus “all books in ISBN space” may be an overstatement.

There’s also the problem of books with invalid ISBNs, i.e. where the check digit doesn’t match the rest of the ISBN, but where correcting the check digit would match a different book. These books would be outside of the ISBN space assumed by the blog post.

[0] https://scis.edublogs.org/2017/09/28/the-dreaded-case-of-dup...

mormegil|1 year ago

And possibly not even assigned at all. I looked at the lowest known ISBNs for Czech publishers and a different color stood out: no, https://books.google.cz/books?vid=ISBN9788000000015&redir_es... is not a correct ISBN, I'd say :-) (But I don't know if the book includes such an obviously fake ISBN, or the error is just in Google Books data.)

rsecora|1 year ago

Impressive presentation.

Note: The presentation reflects the contents of Anna's Archive exclusively, rather than the entire ISBN catalog. There is a discernible bias towards a limited range of languages, because Anna's collection is biased towards those languages. The sections marked in black represent the missing entries in the archive.

phiresky|1 year ago

That's not entirely accurate since AA has separate databases for books they have as files, and one for books they only know the metadata of. The metadata database comes from various sources and as far as I know is pretty complete.

Black should mostly be sections that have no assigned books

keepamovin|1 year ago

Wow, that is really cool. What an amazing passion project and what an incredible resource!

Zooming in, you can see the titles and the barcode, and hovering gets you a book cover and details. Incredible, everything you could want!

Some improvement ideas: checkbox to hide the floating white panel at top left, and the thing at top right. Because I really like to "immerse" in these visualizations, those floaters lift you out of that experience to some extent, limiting fun and functionality for me a bit.

255|1 year ago

When you zoom in it's book shelves! That's so cool

MeteorMarc|1 year ago

Possible improvement: paperback and hardbound editions are shown next to each other, but look the same. Do not know about the e-books.

grues-dinner|1 year ago

Awesome. A real life Library of Babel: https://libraryofbabel.info/

Out of all the VR vapourware, a real life infinite library or infinite museum is the one thing that could conceivably get me dropping cash.

WillAdams|1 year ago

Unfortunately, the writers won't see any of that for this particular implementation.

It would be far more interesting as a project which tried to make all legitimately available downloadable texts accessible, say as an interface to:

https://onlinebooks.library.upenn.edu/

araes|1 year ago

Found the presentation a little overwhelming in the current format. Took a bit to realize the preset part in the upper left actually led to further dataviz vectors like AA (yes/no), rarity, and Google Books inclusion. However, offers a lot in terms of the visualization and data depth available. Also liked https://archive.anarchy.cool/blog/all-isbns.html#visualizing for the region clustering look.

The preset year part was neat though in and of itself just for looking at how active certain regions and areas have been in publishing. Poland's been really active lately. Norway looks very quiet by comparison. China looks like they ramped up around 2005 and published huge amounts in the last decade.

United States has got some weird stuff too. Never heard of them, yet Blackstone Audio, Blurb Inc., and Draft2Digital put out huge numbers of ISBNs.

phiresky|1 year ago

It is admittedly pretty noisy, which is somewhat intentional because the focus was on high data density. Here's a somewhat more minimalistic view (less color, only one text level at a time):

https://phiresky.github.io/isbn-visualization/?dataset=all&g...

It could probably be tweaked further to not show some of the texts (the N publishers part), less stuff on hover, etc.

pfedak|1 year ago

I think you can reasonably think about the flight path by modeling the movement on the hyperbolic upper half plane (x would be the position along the linear path between endpoints, y the side length of the viewport).

I considered two metrics that ended up being equivalent. First, minimizing loaded tiles assuming a hierarchical tiled map. The cost of moving x horizontally is just x/y tiles, using y as the side length of the viewport. Zooming from y_0 to y_1 loads abs(log_2(y_1/y_0)) tiles, which is consistent with ds = dy/y. Together this is just ds^2 = (dx^2 + dy^2)/y^2, exactly the upper-half-plane metric.

Alternatively, you could think of minimizing the "optical flow" of the viewport in some sense. This actually works out to the same metric up to scaling - panning by x without zooming, everything is just displaced by x/y (i.e. the shift as a fraction of the viewport). Zooming by a factor k moves a pixel at (u,v) to (k*u,k*v), a displacement of (u,v)*(k-1). If we go from a side length of y to y+dy, this is (u,v)*dy/y, so depending how exactly we average the displacements this is some constant times dy/y.

Then the geodesics you want are just the semicircles with centers at y=0, although you need to do a little work to compute the motion along the curve. Once you have the arc, from θ_0 to θ_1, the total time should come from integrating dθ/y = dθ/sin(θ) (taking unit radius), so to be exact you'd have to invert t = ln(csc(θ)-cot(θ)), so it's probably better to approximate. edit: mathematica is telling me this works out to θ = atan2(1-e^(2t), 2*e^t) (x-first argument order), i.e. θ = 2*atan(e^t), which is not so bad at all.

Comparing with the "blub space" logic, I think the effective metric there is ds^2 = dz^2 + (z+1)^2 dx^2, polar coordinates where z=1/y is the zoom level, which (using dz=dy/y^2) works out to ds^2 = dy^2/y^4 + dx^2*(1/y^2 + ...). I guess this means the existing implementation spends much more time panning at high zoom levels compared to the hyperbolic model, since zooming from 4x to 2x costs twice as much as 2x to 1x despite being visually the same.
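
A minimal Python sketch of the hyperbolic flight path described above, assuming the viewport state is a point (x, y) with y the viewport side length. The names `hyperbolic_distance` and `geodesic_point` are made up for illustration and are not taken from the visualization's code. The path is the semicircle centered on y=0 through both endpoints, parametrized by arc length via θ = 2·atan(e^s), the inverse of s = ln(csc θ − cot θ).

```python
import math


def hyperbolic_distance(p0, p1):
    """Upper-half-plane distance between viewport states (x, y), y > 0."""
    (x0, y0), (x1, y1) = p0, p1
    return math.acosh(1 + ((x1 - x0) ** 2 + (y1 - y0) ** 2) / (2 * y0 * y1))


def geodesic_point(p0, p1, t):
    """Point at fraction t in [0, 1] along the geodesic from p0 to p1,
    moving at constant hyperbolic speed."""
    (x0, y0), (x1, y1) = p0, p1
    if x0 == x1:
        # Pure zoom: the geodesic is a vertical line, and y changes geometrically.
        return (x0, y0 * (y1 / y0) ** t)
    # Center c and radius r of the semicircle through both points, centered on y = 0.
    c = (x1 ** 2 + y1 ** 2 - x0 ** 2 - y0 ** 2) / (2 * (x1 - x0))
    r = math.hypot(x0 - c, y0)
    # Arc-length coordinate s = ln(tan(theta / 2)) = ln(csc(theta) - cot(theta)).
    s0 = math.log(math.tan(math.atan2(y0, x0 - c) / 2))
    s1 = math.log(math.tan(math.atan2(y1, x1 - c) / 2))
    s = s0 + t * (s1 - s0)
    theta = 2 * math.atan(math.exp(s))
    return (c + r * math.cos(theta), r * math.sin(theta))
```

For example, flying from (0, 1) to (2, 1) passes through (1, √2) at t = 0.5, so the viewport zooms out to a side length of √2 at the midpoint, and the total path length is acosh(3) ≈ 1.76.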

pfedak|1 year ago

Actually playing around with it the behavior was very different from what I expected - there was much more zooming. Turns out I missed some parts of the zoom code:

Their zoom actually is my "y" rather than a scale factor, so the metric is ds^2 = dy^2 + (C-y)^2 dx^2 where C is a bit more than the maximal zoom level. There is some special handling for cases where their curve would want to zoom out further.

Normalizing to the same cost to pan all the way zoomed out (zoom=1), their cost for panning is basically flat once you are very zoomed in, and more than the hyperbolic model when relatively zoomed out. I think this contributes to short distances feeling like the viewport is moving very fast (very little advantage to zooming out) vs basically zooming out all the way over larger distances (intermediate zoom levels are penalized, so you might as well go almost all the way).

zellyn|1 year ago

This really drives home how scattershot organizing books by publisher is. Try searching for "Harry Potter and the Goblet of Fire" and clicking on each of the results in turn: they're nowhere near each other.

Or try "That Hideous Strength" by "C.S. Lewis" vs "Clive Staples Lewis", and suddenly you're arcing across a huge linear separation.

Still, given that that's what we use, this visualization is lovely. Imagine if you could open every book and read it…

Finnucane|1 year ago

Why would you expect otherwise? Titles are assigned ISBNs by publishers as they are being published. Books published simultaneously as a set might have sequential numbers, but otherwise not. Books separated by a year or more are not going to have related numbers. It's an inventory tracking mechanism, it has no other meaning.

bambax|1 year ago

I did find my micro-publishing house relatively easily... Very cool! ;-)

https://i.imgur.com/mhw6Mub.png

tomw1808|1 year ago

I know that isn't an AMA, but may I ask, how is running a publishing house working out for you? From the outside, having a small publishing house, sounds like an uphill battle on all fronts. What is the main driver to become a publisher - hobby turned into profession?

phiresky|1 year ago

Huh, that text (and barcodes) are very offset from where they should be. Would you mind sharing what OS and browser you are using and if this text weirdness was temporary or all the time?

casey2|1 year ago

Regarding ISBN: the first section is a 3-digit number issued by GS1, which has only issued 978 and 979. All other sections are issued by the International ISBN Agency.

The second section identifies a country, geographical region or language area. It consists of a 1-5 digit number. The third section, up to 7 digits, is given on request of a publisher to the ISBN agency. Larger publishers (publishers with a large expected output) are given shorter numbers (as they get more digits to play with in the 4th section). The fourth, up to 6 digits, is given to "identify a specific edition, of a publication by a specific publisher in a particular format". The last section is a single check digit, equal to (10 - s) mod 10, where s = 10|+´digits×⥊6/[1‿3] and digits are the first 12 digits.

From this visualization it's most apparent that the publishers "Create Space" (aka Great UNpublished, booksurge) and "Forgotten Books" should have been given a shorter number for the third section. (Though in my opinion self-published editions and low-value spam shouldn't get an ISBN, or rather it should be with the other independently published work @9798.)

They also gave Google tons of space but it appears none of it has been used as of yet.
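
A small Python sketch of the check-digit rule described above: weight the first 12 digits alternately by 1 and 3, sum them, and pick the digit that brings the total up to a multiple of 10. The function names are illustrative only.

```python
def isbn13_check_digit(first12: str) -> int:
    """Check digit for the first 12 digits of an ISBN-13."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn: str) -> bool:
    """True if the string (hyphens and spaces ignored) has a matching check digit."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])
```

For 978-0-306-40615-7 the weighted sum of the first 12 digits is 93, so the check digit is (10 - 3) mod 10 = 7.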

Jun8|1 year ago

Great description of the ISBN format, and a great visualization. TIL that the 978- prefix was “Bookland”, i.e. a fictional country prefix that may be thought of as “Earth”. It has expanded to 979-, which was originally “Musicland”.

This probably means that in the (hopefully near) future where we have extraterrestrial publishing (most likely on the Moon or Mars) we’ll need another prefix.

quink|1 year ago

Not really. Prepending the 978 prefix to what was previously the ISBN-10 namespace, and recalculating the checksum, puts most books into the EAN-13 namespace. EAN is meant for unique identifiers (“Numbers”) of “Articles” in “Europe”. Later that got changed to “International”, but most still prefer the acronym EAN.

So 978 really is Bookland, as it used to be, and Earth, but the EAN-13 namespace as a whole really does refer to Earth as well. That said, the extraterrestrials can get a prefix just the same?
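
A hedged Python sketch of that mapping: drop the old ISBN-10 check character, prepend 978, and recompute the check digit with the EAN-13 weights (alternating 1 and 3). The function name `isbn10_to_isbn13` is illustrative, and this sketch assumes the input is already a well-formed ISBN-10; it does not verify the old check character.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Map an ISBN-10 into the 978 ("Bookland") range of EAN-13."""
    core = isbn10.replace("-", "").replace(" ", "")
    if len(core) != 10:
        raise ValueError("expected 10 characters after removing separators")
    # The last ISBN-10 character (0-9 or X) is the old mod-11 check; discard it.
    body = "978" + core[:9]
    if not (body.isascii() and body.isdigit()):
        raise ValueError("the first 9 characters must be digits")
    # EAN-13 check digit: weights alternate 1, 3, 1, 3, ... over the 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)
```

For instance, the ISBN-10 0-306-40615-2 maps to 9780306406157.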

celltalk|1 year ago

This is the Library of Alexandria, but in digital format. Amazing work!

youssefabdelm|1 year ago

Does anyone know if there's an API where I could plug in ISBN and get all the libraries in the world that have that book?

I know WorldCat has something like this when you search for a book, but the API, I assume, is only for library institutions, and I'm neither a library nor an institution.

ofou|1 year ago

This is a wonderful submission to Anna's Archive [1]. I really love people pushing the boundaries of shadow source initiatives that benefit all of us, especially providing great code and design. I can't emphasize enough the net plus that open source, BitTorrent, and shadow libraries have had in the world. You can also make the case that LLMs wouldn't have been possible without shadow libraries; there's just no other way of getting enough data to learn.

Just thank you.

[1] https://software.annas-archive.li/AnnaArchivist/annas-archiv...

randomcatuser|1 year ago

Nice work!

Things I love:

- How every book has a title & link to Google Books

- Information density - You can see various publishers, sparse areas of the grid, and more

- Visualization of empty space -- great work at making it look like a big bookshelf!

Improvements?

- Instead of 2 floating panels, collapse to 1

- After clicking a book, the tooltip should disappear once you zoom out/move locations!

- Sort options (by year, by publisher name)

- Midlevel visualization - I feel like at the second zoom level (groupings of publishers), there's little information that it provides besides the names and relative sparsity (so we can remove the ISBN-related stuff on every shelf). Also, since the shelves have a fixed width, I can already tell there are 20 publishers, so no need! If we declutter, it'll make for a really nice physical experience!

artninja1988|1 year ago

Did they get the bounty?

vallode|1 year ago

I believe the bounty is closed but they haven't announced the winner(s).

IOUnix|1 year ago

This would be absolutely incredible to incorporate into VR. You could create such an intuitive organizational method adding a 3rd dimension for displaying.

pbronez|1 year ago

Super cool. Love that you can zoom all the way in and suddenly it looks like a bookshelf.

When I got down to the individual book level, I found several that didn’t have any metadata, not even a title. There are hyperlinks to look up the ISBN on Google Books or WorldCat, and in the cases I tried WorldCat had the data.

So… why not bring the WorldCat data into the dataset?

Hnrobert42|1 year ago

I didn't appreciate the difficulty of the Fly to Book path calculation until I read the description!

fnord77|1 year ago

There's a massive block under "German Language" that's almost entirely English

https://i.imgur.com/LKDuTJP.png

godber|1 year ago

Good find, I think those would be books written in English published by German publishers. The blog post discusses how ISBNs are allocated ... specifically the ones you picture are published by Springer, which is a German company that publishes in the English language.

Considering a specific example: "Forecasting Catastrophic Events in Technology, Nature and Medicine". The website's use of "Group 978-3: German language" is a bit of a misnomer; if they had said "Group 978-3: German issued" or "German publisher" it would be clearer to users.

jaakl|1 year ago

One more zoom LoD please: to the actual pages of the books!

Ekaros|1 year ago

No wonder search did not work for one book I tried. First it did not have the prefix and then the information on either the book or the database had different numbers...

destitude|1 year ago

Searched for some books and didn't find them. I assume this isn't a complete list of all USA-based books published with an ISBN?

phiresky|1 year ago

Books visible should be fairly complete as far as I know. But the search is pretty limited (and dependent on your location) because that uses the Google Books API. If you put in an ISBN13 directly, that should work more reliably.

sinuhe69|1 year ago

I wonder if one day we will have an AI that reads, summarizes and catalogues all the published books? A super librarian :) Imagine being able to ask questions like: "What have they written about AI in the 21st century?". Even better: "What did people not think of when they pursued AGI in the 21st century, which later led to their extinction?" ;)

pillefitz|1 year ago

Since most foundational models have been trained on illegally acquired books, this info should be already baked in.

godber|1 year ago

This is really exceptional work, and still works on my ten year old iPad. Great job!

maCDzP|1 year ago

I am guessing this is going to win the bounty by Anna’s Archive?

soheil|1 year ago

Needs a better minimap that moves as the map is zoomed in.

tekkk|1 year ago

Wow, what a cool little project, congratulations on shipping!

karunamurti|1 year ago

Nice, I found my book in the rack somewhere.

est|1 year ago

I am amazed to see how many STEM books China published.

compootr|1 year ago

Hey, is this that Anna's Archive bounty thing? Best of luck to you!

vivzkestrel|1 year ago

Imagine making a foundational LLM out of every book ever written on every subject ever conceived

ofou|1 year ago

It’s called DeepSeek. The founder just confirmed a few days ago that he got the data from Anna's to train on, I think for their latest vision model.

kowlo|1 year ago

books are missing