Show HN: I mapped HN's favorite books with GPT-4o
285 points | pmaze | 1 year ago | hnbooks.pieterma.es
- OpenAI's embeddings were processed using UMAP and HDBSCAN. A direct 2D projection from the text embeddings didn't yield visually interesting results. Instead, HDBSCAN is first applied on a high-dimensional projection. Those clusters tend to correspond to different genres. The genre memberships are then embedded using a second round of UMAP (using Hellinger distance) which results in pleasingly dense structures.
- The books' descriptions are based on extractions from the comments and GPT's general knowledge. Quality levels vary, and it leads to some oddly specific points, but I haven't found any yet that are straight up wrong.
- There are multiple books with the same title. Currently, only the most popular one of those makes it onto the map.
- It's surprisingly hard to get high quality book cover images. I tried Google Books and a bunch of open APIs, but they all had their issues. In the end, I got the covers from GoodReads through a hacked together process that combines their autocomplete search with GPT for data linkage. Does anyone know of a reliable source?
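The same-title rule from the list above (keep only the most popular book per title) could be sketched roughly as follows. This is not the site's actual code: the `Book` record, the `mentions` field as the popularity measure, the case-insensitive title matching, and the placeholder data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    # Hypothetical record shape; the real data model is not described in the post.
    title: str
    author: str
    mentions: int  # assumed popularity measure, e.g. number of HN mentions


def most_popular_per_title(books):
    """Keep one book per title: the one with the highest mention count."""
    best = {}
    for book in books:
        # Assumption: titles are compared case-insensitively, ignoring outer whitespace.
        key = book.title.strip().casefold()
        current = best.get(key)
        if current is None or book.mentions > current.mentions:
            best[key] = book
    return list(best.values())


# Placeholder data, not taken from the real map.
books = [
    Book("Example Title", "Author A", 12),
    Book("example title ", "Author B", 3),
    Book("Another Title", "Author C", 5),
]
kept = most_popular_per_title(books)
```

A side effect of this rule is that the less popular book sharing a title is silently dropped, which matches the limitation described above.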
peteforde|1 year ago
In other words, what are the clustering shapes telling us? Can we dig in based on geography, publishing date, key terms or themes?
Either way, I can't keep the site open for more than 30-40 seconds before it crashes. I suspect that's not the goal!
Is Cryptonomicon the best fiction book, or is the data wrong?
refulgentis|1 year ago
IMHO it's a category error that results from tutorials using the king + female = queen example (which, funnily enough, wasn't even true for the original word2vec, if commentary I've read previously here is correct).
Working with them a lot has me picture them more as "a multivariate function that outputs 768 numbers, and was learned by brute force" than "something that sees in 768 dimensions" --- of course, they're both true, but the second interpretation shades more than it illuminates once you're past the very first interrogatory of "so what is this calculating, exactly?"
pmaze|1 year ago
You've got the cluster semantics spot on, to be honest. Broad genres are grouped together, with a tendency for sub-genres to be grouped locally within those.
There is no interpretation of the overall shapes or the global structure; those are more a result of a particular UMAP run than something inherent in the data.
Would love to provide different views on it and go more in depth next, thanks for the suggestion.
jdthedisciple|1 year ago
Yup, probably was about to happen to me too, had I not closed it.
CPU fan almost launched off the troposphere about 30 seconds in.
Probably a cluttered bunch of heavily unoptimized ReactJS modules in there (no offense to OP, I know it probably sped up development by 10x at least)
kristianp|1 year ago
[1] https://blog.reyem.dev/post/extracting_hn_book_recommendatio...
iwishiknewlisp|1 year ago
padolsey|1 year ago
I share the frustration with getting book covers for my project ablf.io. Amazon used to make this much easier, but they've locked it down recently, so you have to jump through affiliate hoops. I ended up implementing my own thing and storing thousands of images myself on S3. If you have the Goodreads IDs, feel free to use:
N.B. The Goodreads website itself makes this hard as well, since they have an additional UUID in their image URIs, so the URLs aren't deterministic; that's why I created this.
DantesKite|1 year ago
It even recommended me a somewhat eclectic book I’ve recently been meaning to read.
Is there a reason you limit to only 6 favorite books? Is it due to computational restraints?
renjimen|1 year ago
alabhyajindal|1 year ago
Adding direct links to the comments that mention the books could be a good feature. Hacker News Books [1] does this, and it's useful to have all the comments for a book on a single page.
1. https://hackernewsbooks.com
paulwarren|1 year ago
sleazebreeze|1 year ago
Nice project though, I love it.
mooreed|1 year ago
I also would love to hear more about the cluster shapes and cardinality of the coordinate system. I consider myself pretty well versed in data analysis, though with less expertise in NLP topics (e.g. t-SNE).
So a quick blurb like: the units on the axes in the graph are “a reduced embedding space” designed to keep structure and to reduce the dimensionality such that the clusters could be plotted on screen…
(I’m not even sure that’s correct, but I would have loved a one-sentence explanation of the visualization choice and then a pointer to t-SNE.)
Overall nice project - and it reminds me of a painful professional analysis lesson I have had to re-learn more than once.
> After working for NN hours on an analysis, and finally breaking through and completing it, overlooking the title and labels is the biggest footgun I have ever dealt with.
r_singh|1 year ago
Tastefully made. I'm gonna go over it in my leisure time.
About your question on a reliable source for book covers: I run an API that could possibly do this if you collect the Amazon ASINs (or URLs) for the books (that can also be done with the search API I host): https://docs.unwrangle.com/amazon-product-data-api/
If it seems useful, you can reach out to me and mention this chat. I'll be happy to offer free credits for your project.
Brajeshwar|1 year ago
Strongbad536|1 year ago
https://github.com/BrianVia/hacker-news-favorite-books
theturtletalks|1 year ago
pstorm|1 year ago
vismit2000|1 year ago
The links here directly refer to images on Amazon (e.g. https://m.media-amazon.com/images/I/81YkqyaFVEL._SL1500_.jpg)
dangus|1 year ago
jppope|1 year ago
ilikehurdles|1 year ago
I really like the project otherwise. We have a book club that’s deciding on what to read next and this could be very helpful.
changexd|1 year ago
namanyayg|1 year ago
i'm curious about the decision to use hellinger distance for the second round of UMAP - was that purely empirical or did you have some intuition about why it'd work well for this specific dataset?
also, out of curiosity, what's the most popular book on the map that doesn't have a clear genre cluster?
pmaze|1 year ago
The cluster memberships that come out of the first round are distributions over the different clusters, e.g. a given book is weighted 0.8 for cluster A and 0.2 for cluster B. The Hellinger distance is well-suited to quantify the difference between two distributions like that. Cosine similarity and Euclidean distance worked as well, but Hellinger gave subjectively nicer results.
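For readers unfamiliar with it, here is a minimal sketch of the Hellinger distance between two membership distributions like the ones described above. It uses only the standard library and is not the code behind the map; the function name and the example vectors (a book weighted 0.8/0.2 and a hypothetical second book weighted 0.2/0.8) are illustrative assumptions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both inputs are sequences of non-negative weights that sum to 1,
    e.g. a book's soft membership across clusters. The result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared_diff = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diff / 2)


# Illustrative membership vectors over clusters (A, B); not real data.
book_x = [0.8, 0.2]
book_y = [0.2, 0.8]

same = hellinger(book_x, book_x)          # identical distributions
mixed = hellinger(book_x, book_y)         # overlapping memberships
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # no shared cluster
```

Identical distributions give a distance of 0 and distributions with no shared cluster give 1, so books with partly overlapping memberships land somewhere in between.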
Very interesting question, I'm not sure! While developing, I noticed that the systems thinking books were spread over different genres, which I found quite pleasing. However, I'm not sure if other books were even more diffuse. I'll have to dig back in and find out :)
wtf242|1 year ago
noitpmeder|1 year ago
answerheck|1 year ago
Probably a comment on my subconscious desire for familiarity/patterns, but the left side of the map instantly made me think of NW Europe: long skinny Norway dangling between the UK and Denmark (not correctly spaced, but sizes are reasonably correct!). A few other candidates at a stretch - maybe some Baltic states off to the east, for example - but after that it breaks down unfortunately.
Cool project sir
Nathanael_M|1 year ago
Failed to load module script: Expected a JavaScript module script but the server responded with a MIME type of "text/html". Strict MIME type checking is enforced for module scripts per HTML spec.
This crashes my browser in less than a minute.
pmaze|1 year ago
dcchambers|1 year ago
ijidak|1 year ago
For example, I just finished The Phoenix Project.
I'm already seeing some related books I should take a look at.
Very useful!
23B1|1 year ago
Idea: Amazon has killed 'random' browsing of books. Would love to see this applied to topic area searches etc. so I can have the same serendipity that I used to get in all the bookstores Amazon unalived.
WillAdams|1 year ago
https://www.literature-map.com/
vismit2000|1 year ago
motohagiography|1 year ago
Eduard|1 year ago
* Google Chrome from Flathub, version 128.0.6613.119 (Official Build) (64-bit)
* Debian 12 bookworm under KDE Wayland
reducesuffering|1 year ago
lucius_verus|1 year ago
SoftTalker|1 year ago
allenu|1 year ago
ok123456|1 year ago
LudwigNagasena|1 year ago
bestinterest|1 year ago
fruktmix|1 year ago
maCDzP|1 year ago
pmaze|1 year ago
goshx|1 year ago
the__alchemist|1 year ago
kthartic|1 year ago
vegabook|1 year ago
dnlserrano|1 year ago