top | item 21046582

Building a Dark Web Crawler in Go

283 points| aadlani | 6 years ago |creekorful.me

107 comments


bureaucrat|6 years ago

First of all, it’s hidden services, not dark web.

Second, to anyone crawling hidden services or crawling over tor, please run a relay or decrease your hops. Don’t sacrifice others’ desperate need for anonymity for your $whatever_purpose_thats_probably_not_important. It could be some fun thing to do for you, but some people are relying on tor to use the free, secure and anonymous Internet.

jstanley|6 years ago

Actually, the opposite is true.

People who actually need anonymity need to hide among traffic that is boring. If you reduce the number of hops your crawler is using, you're reducing the amount of boring traffic and making it easier to find the interesting people.

Running a relay in addition to using Tor in the normal way is a good idea, however, as it increases the bandwidth of the network.

buildbuildbuild|6 years ago

A polite suggestion, but this is not currently possible.

The Tor Project recently added a consensus flag which can globally disable single hop client connections as a DDoS mitigation approach. It is currently enabled. (DoSRefuseSingleHopClientRendezvous)

malux85|6 years ago

> First of all, it’s hidden services, not dark web

For the uninitiated, can you please explain the differences in what they are and how they're accessed?

Myrmornis|6 years ago

Must one rely on appeals such as this (i.e. a cultural solution) or does tor have a technological solution to the problem you're describing?

alufers|6 years ago

I am no TOR expert, but how does decreasing the number of hops or running your own relay decrease the privacy of other people?

vectorEQ|6 years ago

I just like how 'dark web' turned into 'Tor' at some point :'). There are tons of others... :s Guess people forgot.

MuffinFlavored|6 years ago

> others’ desperate need for anonymity

Can somebody list some positive, legitimate, not illegal uses to desperately be anonymous?

Hitton|6 years ago

Disclaimer: I have rather little experience with Golang and just skimmed the crawler code.

From what I could see, the author made an effort to make the crawler distributed with k8s using modern buzzword technology (which I don't think is needed, considering there are only approximately 75,000 onion addresses), but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
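The missing relative-link handling is a small fix in Go's standard library; a sketch using `net/url` (the `.onion` URL and paths here are made-up examples):

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns a possibly-relative href found on a page into an
// absolute URL, using the page's own URL as the base.
func resolveLink(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// "../about.html" is resolved against the page's directory.
	abs, _ := resolveLink("http://example.onion/dir/index.html", "../about.html")
	fmt.Println(abs) // http://example.onion/about.html
}
```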

etrain|6 years ago

Assume 100 pages on each onion address (it’s probably power-law but let’s just assume that’s the mean). Latency with Tor is super high. Assume average of 5s to load a single page. This is generous because tail latency will probably dominate mean latency in this setting.

These things can happen in parallel but let’s also assume no more than 32 simultaneous TCP connections per host through a Tor proxy.

So we’re looking at ~75k × 100 × 5 / 32 seconds ≈ 14 days to run through all of them. You may not need to distribute this but there are situations (e.g. I want a fresh index daily) where it is warranted.
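That back-of-the-envelope estimate written out as code; every constant is an assumption carried over from the comment above, not a measured value:

```go
package main

import "fmt"

func main() {
	const (
		onions       = 75000 // assumed number of reachable onion addresses
		pagesPerSite = 100   // assumed mean pages per site
		secPerPage   = 5.0   // assumed mean page load time over Tor, seconds
		parallelism  = 32    // simultaneous connections through the proxy
	)
	totalSeconds := float64(onions*pagesPerSite) * secPerPage / parallelism
	days := totalSeconds / 86400
	fmt.Printf("~%.1f days\n", days) // ~13.6 days, i.e. roughly two weeks
}
```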

creekorful|6 years ago

Author here. I'm fairly new to Golang too and it's my first project.

Regarding the number of onion addresses available, you are wrong. Addresses are 16 characters encoded in Base32, which means there are 32 possible characters per position. So there are 32^16 = 1.208925819614629174706176×10^24 addresses available.

Not taken but available.

I agree with the fact that the crawler is really simplistic. But the project is new (2 months I think) and has to evolve. You can make a PR if you want to help me improve it!

jmnicolas|6 years ago

I'd be concerned that the DB is going to contain some pretty nasty stuff that might be hard to explain in front of a judge.

robjan|6 years ago

A crawler of the surface web will have this problem too.

iends|6 years ago

If you avoid storing images, are there any other items you could be liable for?

creekorful|6 years ago

You are right. That's why it's an educational project and not a public search engine

mschuster91|6 years ago

To anyone experimenting with such stuff, take care and don't make your services publicly available. The dark web especially is full of highly illegal content such as child pornography, and in some jurisdictions even "involuntary possession" (such as in browser caches) may be enough to convict you.

creekorful|6 years ago

Do you think I should add a license on GitHub to mention that? To protect me and the users who will use the crawler?

rolltiide|6 years ago

I’ve been pretty surprised at how big hidden services have become

Dread, the dark net reddit, is surprisingly vibrant

I think it's weird that people almost don't want to hear positive stories about the dark net.

It'll be funny when news articles and romcoms just start "forgetting" to qualify their plot piece with the "it's scary" trope.

Phenomenit|6 years ago

I thought dread was dead?

sbmthakur|6 years ago

A well written article with a lot of technical detail. Well done.

However, I'm wondering what would be a good practical purpose for crawling the dark web.

creekorful|6 years ago

Thank you!

There's no practical purpose for the crawler. It's more an educational project than anything.

fs111|6 years ago

Any HTTP-aware software that supports SOCKS proxies can access information on hidden services, so any crawler can do it. I fail to see what is novel about that, except that it uses k8s and mongo and has a catchy blog title.

woodandsteel|6 years ago

So how well would this thing work? What I am asking is: what percentage of all the Tor hidden service sites out there would it detect?

Havoc|6 years ago

Sounds like a recipe to score yourself a free FBI visit

penagwin|6 years ago

Generally the FBI doesn't give a hoot until you start distributing illegal stuff....

getpolarized|6 years ago

Go is a horrible language in which to write a crawler. The main problem is that NLP and machine learning code simply isn't as prevalent and robust as it is in Java and Python.

marcrosoft|6 years ago

Go is great for a crawler. What does NLP and ML have to do with crawling?