The hairball was much worse before. I used a lot of techniques from this paper [1] to make it look decent, and a bunch of heuristics based on other papers to make it look informative.
[1] https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...
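For flavor, here is a minimal sketch of a generic hairball-thinning heuristic in the same spirit (not the paper's actual method; all names are illustrative): keep only each node's k strongest edges, so dense regions thin out while every node stays connected.

    import networkx as nx

    def top_k_backbone(G: nx.Graph, k: int = 3) -> nx.Graph:
        # Keep each node's k heaviest edges (unweighted edges count as 1).
        keep = set()
        for node in G:
            ranked = sorted(G.edges(node, data="weight", default=1),
                            key=lambda e: e[2], reverse=True)
            keep.update((u, v) for u, v, _ in ranked[:k])
        return G.edge_subgraph(keep).copy()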
This is very cool, but also not accurate, at least for jakeseliger.com. Henryn.ca lists 0 links from jakeseliger.com to nytimes.com, reason.com, and numerous other sites that a simple search shows are in fact linked to, for example: https://jakeseliger.com/?s=nytimes.com&submit=Search
I put up many links posts, so I probably link to an abnormally large number of sites.
> I scraped my favorite blogs and made a graph from the domains that each blog links to.
Nice analysis! However, I'm guessing these aren't your favorite blogs, as there are tens of thousands of entries! How did you decide which blogs to index? Did you use some central registry of blogs?
Very neat! So you wrote the graph visualization UI? I see that in a prior project you used Cytoscape - any motivation for doing it yourself this time (vs. one of the available libraries)?
Reminds me of the intern project I worked on at Google back in 2008.
My mentor at the time had a traceroute dataset of the Internet and wanted to render it on top of Google Maps. I implemented a MapReduce algorithm that geolocated the data points and then produced Google Maps tiles at various zoom levels to show how the Internet was connected. It was pretty cool to visualize how the data flowed throughout the world and to be able to "dig deeper" by zooming into the mess of connections. Very similar to what this project does!
The project didn't go anywhere, but it was a cool, fun experiment and a great learning opportunity for me (S2 geometry is... well, weird, but touching MapReduce and Bigtable was an invaluable exercise for my later tenure at the company). Those were very different times. I don't think you would be able to pursue such a "useless" project as an intern at Google these days.
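For anyone curious, the tiling half of that is mostly standard Web Mercator math; a sketch (not the original MapReduce code) of how a geolocated point maps to a tile index at a given zoom:

    import math

    def latlng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
        # At zoom z the world is a 2^z x 2^z grid; each point falls in one tile.
        n = 2 ** zoom
        x = int((lng + 180.0) / 360.0 * n)
        y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
        return x, y

This is presumably also part of why it fit MapReduce well: emit (zoom, x, y) as the shuffle key for each point and render one tile per reduce group.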
Dataset was something from CAIDA, like this: https://www.caida.org/catalog/datasets/ipv4_prefix_probing_d...
IIRC we used the LGL algorithm (https://lgl.sourceforge.net/) while pinning any nodes we could get geolocations for, giving a nice hybrid geo/topo layout.
I don't remember exactly how we got the geolocations, but network routers often have 3-letter airport codes in their DNS names, so maybe that? We may also have had a lookup table in el googz somewhere.
Definitely a project whose time should come again! ;)
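The airport-code trick is easy to approximate. A toy sketch, with made-up hostnames and a tiny whitelist standing in for a real IATA table:

    import re

    # Illustrative hostnames in the style of traceroute reverse-DNS results.
    HOSTNAMES = [
        "ae-1-3502.ear2.sea1.example-carrier.net",
        "core1.lax1.example-carrier.net",
        "unknown-router.example.org",
    ]

    # A real lookup table would have thousands of airport codes.
    IATA = {"sea": (47.45, -122.31), "lax": (33.94, -118.41)}

    def guess_location(hostname):
        # Scan the leftmost DNS labels for a known 3-letter airport code.
        for label in hostname.lower().split(".")[:3]:
            for token in re.findall(r"[a-z]{3}", label):
                if token in IATA:
                    return IATA[token]
        return None

    for h in HOSTNAMES:
        print(h, "->", guess_location(h))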
This is a neat idea - however, I think the graphical view of the blog graph trades "utility" for "coolness".
Have you thought of a front end that is basically just text/plain HTML (in normal size) + navigation links to explore the blogs in one frame, and the currently chosen blog in another frame? That way, you could look at the blogs while travelling your crawl graph, a kind of "blog explorer".
Reminds me of my friend's visualisation of tracks played on the popular London radio station NTS: https://www.barneyhill.com/pages/nts-tracklists/. Turns out a lot of cool artists like the same tracks... ;)
Yep, this is only for stuff that we've crawled, so we can't detect all of your links. Because we have limited crawling resources, we rate-limit the crawling by domain so we don't get stuck in spider traps. The current visualization only shows the current state of the crawl, so it won't know about all of the posts.
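The per-domain rate limiting presumably looks something like this sketch (DomainRateLimiter is a made-up name, not henryn.ca's actual crawler code):

    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    class DomainRateLimiter:
        """Allow at most one fetch per domain every `delay` seconds, so a
        single site (or a spider trap) can't monopolize the crawl budget."""

        def __init__(self, delay: float = 5.0):
            self.delay = delay
            self.next_allowed = defaultdict(float)  # domain -> monotonic deadline

        def ready(self, url: str) -> bool:
            domain = urlparse(url).netloc
            now = time.monotonic()
            if now >= self.next_allowed[domain]:
                self.next_allowed[domain] = now + self.delay
                return True
            return False  # caller requeues the URL and tries another domain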
Awesome, I've always wanted to build something like that on top of YaCy, just so that I could properly select new, potentially interesting sites to index. (I can't rely on the auto-indexing, unfortunately, because it has no option to pre-confirm before indexing.)
This is only tangentially related, but has anyone done something similar for HN comments? I'd be curious to know who responds to whom on particular topics, etc.
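One way to start: the public Algolia HN API returns a story's whole comment tree as nested JSON, so a who-replies-to-whom edge list falls out of a simple walk (a sketch; error handling omitted):

    import json
    from urllib.request import urlopen

    def reply_edges(story_id: int) -> list[tuple[str, str]]:
        # https://hn.algolia.com/api/v1/items/<id> nests comments under
        # "children", each with an "author" field.
        with urlopen(f"https://hn.algolia.com/api/v1/items/{story_id}") as resp:
            item = json.load(resp)

        edges = []  # (replier, parent_author) pairs

        def walk(node):
            for child in node.get("children") or []:
                if node.get("author") and child.get("author"):
                    edges.append((child["author"], node["author"]))
                walk(child)

        walk(item)
        return edges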
To get their topics? I used a basic Louvain community detection algorithm, then put all the URLs into GPT with some few-shot prompting tricks to get it to output a particular topic. There are some heuristics in there to break up giant communities and combine small communities, too.
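For reference, the community-detection half is a few lines with networkx; the labeling half below is just a stub where the GPT call would go (a sketch of the general approach, not the site's actual pipeline):

    import networkx as nx

    G = nx.Graph()  # toy link graph; nodes stand in for blog domains
    G.add_edges_from([
        ("alice.example", "bob.example"),
        ("bob.example", "carol.example"),
        ("dave.example", "erin.example"),
    ])

    communities = nx.community.louvain_communities(G, seed=42)

    # One of the heuristics mentioned above: fold tiny communities into a
    # catch-all bucket so the model isn't asked to label singletons.
    MIN_SIZE = 2
    big = [c for c in communities if len(c) >= MIN_SIZE]
    small = [c for c in communities if len(c) < MIN_SIZE]
    if small:
        big.append(set().union(*small))

    def label_community(domains):
        # Placeholder for the GPT call: send the domain list with a
        # few-shot prompt and return the model's topic label.
        return "topic:" + min(domains)

    for community in big:
        print(sorted(community), "->", label_community(community))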
https://cambridge-intelligence.com/how-to-fix-hairballs/
https://aviz.fr/~bbach/confluentgraphs/
You can see clusters forming of websites that talk about similar topics, like crypto, rationality, Canada, India, and even Postgres!
The visualization was made entirely in WebGL, with some neat optimizations to render that many lines and circles.
One feature that would be helpful is the ability to preview each blog.
I have my own internal links visualization, which might be a bit over the top (GPU recommended): https://taoofmac.com/static/graph