When we started Amazon, this was precisely what I wanted to do, but using Library of Congress triple classifications instead of ISBN.
It turned out to be impossible because the data provider (a mixture of Baker & Taylor (book distributors) and Books In Print) munged the triple classification into a single string, so you could not find the boundaries reliably.
Had to abandon the idea before I even really got started on it, and it would certainly have been challenging to do this sort of "flythrough" in the 1994-1995 version of "the web".
What are you referring to as the LoC triple classification?
I've spent quite some time looking at both the LoC Classification and the LoC Subject Headings. Sadly the LoC don't make either freely available in a useful machine-readable form, though it's possible to play games with the PDF versions. I'd been impressed by a few aspects of this; one point that particularly sticks in my mind is that the state-law section of the Classification shows a very nonuniform density of classifications amongst states. If memory serves, NY and CA are by far the most complex, with PA a somewhat distant third, and many of the "flyover" states having almost absurdly simple classifications, often quite similar. I suspect that this reflects the underlying statutory, regulatory, and judicial / caselaw complexity.
Another interesting historical factoid is that the classification and its alphabetic top-level segmentation apparently spring directly from Thomas Jefferson's personal library, which formed the origin of the LoC itself.
For those interested, there's a lot of history of the development and enlargement of the Classification in the annual reports of the Librarian of Congress to Congress, which are available at Hathi Trust.
Having dealt with Baker & Taylor in the past, this doesn't surprise me in the least. It was one of the most technologically backwards companies I've ever dealt with. Purchase orders and reconciliations were still managed with paper, PDFs, and emails as of early 2020 (when I closed my account). I think at one point they even had me faxing documents in.
It’s not uncommon for an ISBN to have been assigned multiple times to different books [0]. Thus “all books in ISBN space” may be an overstatement.
There’s also the problem of books with invalid ISBNs, i.e. where the check digit doesn’t match the rest of the ISBN, but where correcting the check digit would match a different book. These books would be outside of the ISBN space assumed by the blog post.
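To make the invalid-check-digit problem concrete, here is a minimal ISBN-13 validator (a sketch; the function name is mine): the sum of all 13 digits, weighted alternately by 1 and 3, must be a multiple of 10.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """True if the 13 digits satisfy the ISBN-13 checksum."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    # weights alternate 1, 3, 1, 3, ... across the 13 digits
    weighted = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return weighted % 10 == 0
```

An ISBN that fails this test is exactly the kind of entry that falls outside the ISBN space the blog post assumes.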
And possibly not even assigned at all. I looked at the lowest known ISBNs for Czech publishers and a different color stood out: no, https://books.google.cz/books?vid=ISBN9788000000015&redir_es... is not a correct ISBN, I'd say :-) (But I don't know if the book includes such obviously-fake ISBN, or the error is just in Google Books data.)
Note: The presentation reflects the contents of Anna's Archive exclusively, rather than the entire ISBN catalog. There is a discernible bias towards a limited range of languages, due to Anna's collection bias toward those languages. The sections marked in black represent the missing entries in the archive.
That's not entirely accurate, since AA has separate databases: one for books they have as files, and one for books they only know the metadata of. The metadata database comes from various sources and, as far as I know, is pretty complete.
Black should mostly be sections that have no assigned books
Wow, that is really cool. What an amazing passion project and what an incredible resource!
Zooming in, you can see the titles and the barcode, and hovering gets you a book cover and details. Incredible, everything you could want!
Some improvement ideas: checkbox to hide the floating white panel at top left, and the thing at top right. Because I really like to "immerse" in these visualizations, those floaters lift you out of that experience to some extent, limiting fun and functionality for me a bit.
Ah, this is a perfect application for Microsoft Silverlight PivotViewer, a terrific web interface we used for neuroimaging until Microsoft pulled the plug.
There is an awe-inspiring TED talk by Gary W. Flake demonstrating its use.
Found the presentation a little overwhelming in the current format. Took a bit to realize the preset part in the upper left actually led to further dataviz vectors like AA (yes/no), rarity, and Google Books inclusion. However, offers a lot in terms of the visualization and data depth available. Also liked https://archive.anarchy.cool/blog/all-isbns.html#visualizing for the region clustering look.
The preset year part was neat, though, in and of itself, just for looking at how active certain regions and areas have been in publishing. Poland's been really active lately. Norway looks very quiet by comparison. China looks like it ramped up around 2005, with huge volumes in the last decade.
The United States has some weird stuff too: Blackstone Audio, Blurb Inc., and Draft2Digital put out huge numbers of ISBNs, yet I'd never heard of them.
It is admittedly pretty noisy, which is somewhat intentional because the focus was on high data density. Here's a somewhat more minimalist view (less color, only one text level at a time):
I think you can reasonably think about the flight path by modeling the movement on the hyperbolic upper half plane (x would be the position along the linear path between endpoints, y the side length of the viewport).
I considered two metrics that ended up being equivalent. First, minimizing loaded tiles assuming a hierarchical tiled map. The cost of moving x horizontally is just x/y tiles, using y as the side length of the viewport. Zooming from y_0 to y_1 loads abs(log_2(y_1/y_0)) tiles, which is consistent with ds = dy/y. Together this is just ds^2 = (dx^2 + dy^2)/y^2, exactly the upper-half-plane metric.
Alternatively, you could think of minimizing the "optical flow" of the viewport in some sense. This actually works out to the same metric up to scaling - panning by x without zooming, everything is just displaced by x/y (i.e. the shift as a fraction of the viewport). Zooming by a factor k moves a pixel at (u,v) to (k*u,k*v), a displacement of (u,v)*(k-1). If we go from a side length of y to y+dy, this is (u,v)*dy/y, so depending how exactly we average the displacements this is some constant times dy/y.
Then the geodesics you want are just semicircles with centers on y=0, although you need to do a little work to compute the motion along the curve. Once you have the arc, from θ_0 to θ_1, the total time should come from integrating ds = r·dθ/y = dθ/sin(θ), so to be exact you'd have to invert t = ln(csc(θ)-cot(θ)), so it's probably better to approximate. edit: Mathematica is telling me this works out to θ = atan2(1-2*e^(2t), 2*e^t), which is not so bad at all.
Comparing with the "blub space" logic, I think the effective metric there is ds^2 = dz^2 + (z+1)^2 dx^2, polar coordinates where z=1/y is the zoom level, which (using dz=dy/y^2) works out to ds^2 = dy^2/y^4 + dx^2*(1/y^2 + ...). I guess this means the existing implementation spends much more time panning at high zoom levels compared to the hyperbolic model, since zooming from 4x to 2x costs twice as much as 2x to 1x despite being visually the same.
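The flight path described above can be sketched in code (all names are mine; this assumes x is the pan coordinate and y the viewport side length, per the upper-half-plane model, and uses the t = ln tan(θ/2) arc-length parameter to sample at constant hyperbolic speed):

```python
import math

def geodesic_flythrough(p0, p1, n=9):
    """Sample n viewport states (x, y) along the hyperbolic geodesic
    between two views; x = pan position, y = viewport side length."""
    (x0, y0), (x1, y1) = p0, p1
    if abs(x0 - x1) < 1e-12:
        # pure zoom: the geodesic is a vertical line, interpolate y geometrically
        return [(x0, y0 * (y1 / y0) ** (i / (n - 1))) for i in range(n)]
    # the geodesic is a semicircle centered on the x-axis through both points
    c = (x0**2 + y0**2 - x1**2 - y1**2) / (2 * (x0 - x1))
    r = math.hypot(x0 - c, y0)
    th0 = math.atan2(y0, x0 - c)
    th1 = math.atan2(y1, x1 - c)
    # arc length satisfies t = ln tan(theta/2), i.e. theta(t) = 2*atan(e^t);
    # interpolating t linearly gives constant hyperbolic speed
    t0, t1 = (math.log(math.tan(th / 2)) for th in (th0, th1))
    pts = []
    for i in range(n):
        t = t0 + (t1 - t0) * i / (n - 1)
        th = 2 * math.atan(math.exp(t))
        pts.append((c + r * math.cos(th), r * math.sin(th)))
    return pts
```

For a long pan between two zoomed-in views, the sampled path zooms out in the middle, which is exactly the "zoom out, fly over, zoom back in" behavior you'd want.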
Actually playing around with it the behavior was very different from what I expected - there was much more zooming. Turns out I missed some parts of the zoom code:
Their zoom actually is my "y" rather than a scale factor, so the metric is ds^2 = dy^2 + (C-y)^2 dx^2 where C is a bit more than the maximal zoom level. There is some special handling for cases where their curve would want to zoom out further.
Normalizing to the same cost to pan all the way zoomed out (zoom=1), their cost for panning is basically flat once you are very zoomed in, and more than the hyperbolic model when relatively zoomed out. I think this contributes to short distances feeling like the viewport is moving very fast (very little advantage to zooming out) vs basically zooming out all the way over larger distances (intermediate zoom levels are penalized, so you might as well go almost all the way).
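A tiny numeric sketch of that comparison (the constant C is made up for illustration, standing in for "a bit more than the maximal zoom level"; function names are mine):

```python
def pan_cost_hyperbolic(dx: float, y: float) -> float:
    # upper-half-plane model: panning dx at viewport size y costs dx / y
    return dx / y

def pan_cost_implementation(dx: float, y: float, C: float = 9.0) -> float:
    # reading of the implementation's metric ds^2 = dy^2 + (C - y)^2 dx^2
    return (C - y) * dx
```

In the hyperbolic model the pan cost keeps growing as you zoom in (halving y doubles it), while under the implementation's metric it flattens out near C·dx, which matches the "very little advantage to zooming out" feel over short distances.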
This really drives home how scattershot organizing books by publisher is. Try searching for "Harry Potter and the Goblet of Fire" and clicking on each of the results in turn: they're nowhere near each other.
Or try "That Hideous Strength" by "C.S. Lewis" vs "Clive Stables Lewis", and suddenly you're arcing across a huge linear separation.
Still, given that that's what we use, this visualization is lovely. Imagine if you could open every book and read it…
Why would you expect otherwise? Titles are assigned ISBNs by publishers as they are being published. Books published simultaneously as a set might have sequential numbers, but otherwise not. Books separated by a year or more are not going to have related numbers. It's an inventory tracking mechanism, it has no other meaning.
I know this isn't an AMA, but may I ask: how is running a publishing house working out for you? From the outside, running a small publishing house sounds like an uphill battle on all fronts. What was the main driver to become a publisher? A hobby turned into a profession?
Huh, that text (and the barcodes) are offset far from where they should be. Would you mind sharing which OS and browser you are using, and whether this text weirdness was temporary or constant?
Regarding ISBNs: the first section is a 3-digit number issued by GS1, which has only issued 978 and 979; all other sections are issued by the International ISBN Agency.
The second section identifies a country, geographical region or language area. It consists of a 1-5 digit number. The third section, up to 7 digits, is given on request of a publisher to the ISBN agency; larger publishers (publishers with a large expected output) are given smaller numbers (as they get more digits to play with in the fourth section). The fourth section, up to 6 digits, is used to "identify a specific edition, of a publication by a specific publisher in a particular format". The last section is a single check digit, chosen so that the sum of all 13 digits, weighted alternately by 1 and 3, is a multiple of 10.
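The check-digit rule can be written out in a few lines (a sketch; the function name is mine): weight the first 12 digits alternately by 1 and 3, then pick the digit that tops the sum up to a multiple of 10.

```python
def isbn13_check_digit(first12: str) -> int:
    """Check digit implied by the first 12 digits of an ISBN-13."""
    digits = [int(c) for c in first12 if c.isdigit()]
    assert len(digits) == 12
    # positions 0, 2, 4, ... weigh 1; positions 1, 3, 5, ... weigh 3
    s = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return (10 - s % 10) % 10
```

For example, for 978-3-16-148410 this yields 0, so the full ISBN is 978-3-16-148410-0.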
From this visualization it's most apparent that the publishers "CreateSpace" (aka Great UNpublished, BookSurge) and "Forgotten Books" should have been given a small number for the third section. (Though in my opinion, self-published editions and low-value spam shouldn't get an ISBN at all, or rather should sit with the other independently published work at 979-8.)
They also gave Google tons of space but it appears none of it has been used as of yet.
Great description of the ISBN format and visualization. TIL that the 978- prefix was "Bookland", i.e. a fictional country prefix that may be thought of as "Earth". It has expanded to 979-, which was originally "Musicland".
This probably means that in the (hopefully near) future where we have extraterrestrial publishing (most likely on the Moon or Mars) we'll need another prefix.
Not really. The 978 prefix moves the old ISBN-10 namespace, after a recalculation of the checksum, into the EAN-13 namespace. EAN is meant for unique identifiers ("Numbers") of "Articles" in "Europe". Later that got changed to "International", but most still prefer the acronym EAN.
So 978 really is Bookland, as it used to be, and Earth, but the EAN-13 namespace as a whole really does refer to Earth as well. That said, the extraterrestrials can get a prefix just the same?
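The remapping into the 978 namespace described above is mechanical (a sketch; the helper name is mine): drop the old mod-11 check digit, prepend 978, and recompute the EAN-13 checksum.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Map an ISBN-10 into the 978 ('Bookland') EAN-13 namespace."""
    body = isbn10.replace("-", "")
    core = "978" + body[:9]  # drop the old ISBN-10 (mod-11) check digit
    # recompute the EAN-13 check digit with alternating 1/3 weights
    s = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - s % 10) % 10)
```

For example, ISBN-10 0-306-40615-2 becomes 9780306406157.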
Does anyone know if there's an API where I could plug in ISBN and get all the libraries in the world that have that book?
I know WorldCat has something like this when you search for a book, but the API, I assume, is only for library institutions, and I'm neither a library nor an institution.
This is a wonderful submission to Anna's Archive [1]. I really love people pushing the boundaries of shadow-library initiatives that benefit all of us, especially with great code and design. I can't emphasize enough the net plus that open source, BitTorrent, and shadow libraries have been for the world. You can also make the case that LLMs wouldn't have been possible without shadow libraries; there's just no other way of getting enough data to learn from.
Things I love:
- How every book has a title & a link to Google Books
- Information density - You can see various publishers, sparse areas of the grid, and more
- Visualization of empty space -- great work at making it look like a big bookshelf!
Improvements?
- Instead of 2 floating panels, collapse to 1
- After clicking a book, the tooltip should disappear once you zoom out/move locations!
- Sort options (by year, by publisher name)
- Midlevel visualization - I feel like the second zoom level (groupings of publishers) provides little information besides the names and relative sparsity (so we could remove the ISBN-related stuff on every shelf). Also, since the shelves have a fixed width, I already know there are 20 publishers, so no need! If we declutter, it'll make for a really nice physical experience!
Does anyone see where the raw data is downloaded from? I see this [1], but it looks like it might just be the list of ISBNs and not the titles. I suppose following the build instructions for this page [2] would do it, but I'd rather not install these JS tools.
This would be absolutely incredible to incorporate into VR. You could create such an intuitive organizational method, adding a 3rd dimension for displaying.
Super cool. Love that you can zoom all the way in and suddenly it looks like a bookshelf.
When I got down to the individual book level, I found several that didn't have any metadata, not even a title. There are hyperlinks to look up the ISBN on Google Books or WorldCat, and in the cases I tried, WorldCat had the data.
So… why not bring the worldcat data into the dataset?
Good find, I think those would be books written in English published by German publishers. The blog post discusses how ISBNs are allocated ... specifically the ones you picture are published by Springer, which is a German company that publishes in the English language.
Considering a specific example: "Forecasting Catastrophic Events in Technology, Nature and Medicine". The website's use of "Group 978-3: German language" is a bit of a misnomer, if they had said "Group 978-3: German issued" or "German publisher" it would be clearer to users.
No wonder search did not work for one book I tried: first it did not have the prefix, and then the book and the database disagreed on the numbers...
Books visible should be fairly complete as far as I know. But the search is pretty limited (and dependent on your location) because that uses the Google Books API. If you put in an ISBN13 directly, that should work more reliably.
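For reference, a direct-ISBN lookup against the public Google Books volumes endpoint looks roughly like this (a sketch; function names are mine, and results can still vary by region):

```python
import json
import urllib.request

GOOGLE_BOOKS = "https://www.googleapis.com/books/v1/volumes"

def isbn_query_url(isbn13: str) -> str:
    # hyphens stripped so the q=isbn: filter matches reliably
    return f"{GOOGLE_BOOKS}?q=isbn:{isbn13.replace('-', '')}"

def lookup_isbn(isbn13: str) -> dict:
    with urllib.request.urlopen(isbn_query_url(isbn13)) as resp:
        return json.load(resp)
```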
I wonder if one day we will have an AI that reads, summarizes and catalogues all the published books? A super librarian :) Imagine being able to ask questions like: "What have they written about AI in the 21st century?". Even better: "What did people not think of when they pursued AGI in the 21st century, which later led to their extinction?" ;)
PaulDavisThe1st|1 year ago
Kudos!
dredmorbius|1 year ago
Classification: <https://www.loc.gov/catdir/cpso/lcco/>
Subject headings: <https://id.loc.gov/authorities/subjects.html>
Annual reports:
- Recent: <https://www.loc.gov/about/reports-and-budgets/annual-reports...>
- Historical archive to ~1866: <https://catalog.hathitrust.org/Record/000072049>
layer8|1 year ago
[0] https://scis.edublogs.org/2017/09/28/the-dreaded-case-of-dup...
robwwilliams|1 year ago
https://m.youtube.com/watch?v=LT_x9s67yWA
And here is our IEEE paper from 2011.
Really sorry this is not a web standard.
https://www.dropbox.com/scl/fi/bl8zkjs3y47q3377hh3ya/Yan_Wil...
c-fe|1 year ago
There are more cool submissions here https://software.annas-archive.li/AnnaArchivist/annas-archiv...
Mine is at https://isbnviz.pages.dev
grues-dinner|1 year ago
Out of all the VR vapourware, a real life infinite library or infinite museum is the one thing that could conceivably get me dropping cash.
WillAdams|1 year ago
It would be far more interesting as a project which tried to make all legitimately available downloadable texts accessible, say as an interface to:
https://onlinebooks.library.upenn.edu/
phiresky|1 year ago
https://phiresky.github.io/isbn-visualization/?dataset=all&g...
It could probably be tweaked further to not show some of the texts (the N publishers part), less stuff on hover, etc.
bambax|1 year ago
https://i.imgur.com/mhw6Mub.png
dark-star|1 year ago
Although I don't know if this was the winning entry or not
omoikane|1 year ago
https://news.ycombinator.com/item?id=42652577 - Visualizing All ISBNs (2025-01-10, 139 comments)
ofou|1 year ago
Just thank you.
https://software.annas-archive.li/AnnaArchivist/annas-archiv...
ks2048|1 year ago
[1] (Gitlab page) https://software.annas-archive.li/AnnaArchivist/annas-archiv...
[2] https://github.com/phiresky/isbn-visualization
fnord77|1 year ago
https://i.imgur.com/LKDuTJP.png
omoikane|1 year ago
https://en.wikipedia.org/wiki/ISBN#ISBN-10_to_ISBN-13_conver...