Show HN: Exploring five million Hacker News posts
3 points | lmcinnes | 1 year ago | lmcinnes.github.io
The dataset was filtered from https://huggingface.co/datasets/OpenPipe/hacker-news

Stories were embedded in a vector space via nomic-embed: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

A 2D representation was generated using UMAP: https://github.com/lmcinnes/umap

Clusters were generated and topics named via HDBSCAN and Toponymy using Cohere Command-R:
- https://github.com/TutteInstitute/fast_hdbscan
- https://github.com/TutteInstitute/toponymy
- https://cohere.com/command

The interactive map was generated using DataMapPlot: https://github.com/TutteInstitute/datamapplot
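For anyone who wants to try something similar, the embed, project and cluster steps above might look roughly like the sketch below. This is not the code used to build the map. The "clustering: " task prefix, the batch size, the cosine metric, the random seed, `min_cluster_size=25`, and the choice to cluster the 2D coordinates rather than the full embeddings are all assumptions made for illustration, and the function names are hypothetical. Topic naming with Toponymy and rendering with DataMapPlot are left out.

```python
# Sketch of an embed -> UMAP -> HDBSCAN pipeline for story titles.
# All parameter values here are illustrative assumptions, not the
# settings used for the actual map.

# Assumption: nomic-embed-text-v1.5 expects a task prefix on each input;
# "clustering: " is the one its model card lists for clustering use.
TASK_PREFIX = "clustering: "


def add_task_prefix(titles):
    """Prepend the task prefix to each (whitespace-stripped) title."""
    return [TASK_PREFIX + title.strip() for title in titles]


def embed(titles):
    """Embed titles with nomic-embed via sentence-transformers."""
    # Imported lazily so the helper above works without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
    )
    return model.encode(
        add_task_prefix(titles), batch_size=64, show_progress_bar=True
    )


def project_2d(vectors):
    """Reduce the embedding vectors to 2D map coordinates with UMAP."""
    import umap

    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
    return reducer.fit_transform(vectors)


def cluster(coords):
    """Cluster the 2D coordinates; HDBSCAN labels noise points as -1."""
    import fast_hdbscan

    return fast_hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(coords)


def build_map(titles):
    """Run the three steps and return (coords, cluster_labels)."""
    vectors = embed(titles)
    coords = project_2d(vectors)
    return coords, cluster(coords)
```

Calling `coords, labels = build_map(titles)` on a list of story titles gives one 2D point and one integer cluster label per story. Those are the inputs a topic-naming step and an interactive plot would then work from.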
The map provides a great way to get an overview of Hacker News stories over the years, to explore them, and to find interesting niche topics. There are limitations to both the text embedding and the 2D representation. For example, posts about John Gruber's "Daring Fireball" end up in "Sun-related phenomena" in the Astronomy region of the map, and some topics get squashed into odd places because two dimensions cannot preserve every relationship between them. Nonetheless, most topics, regions and stories are well placed. There is a wealth of knowledge and information packed in here, and a lot to explore.