bureaucrat|6 years ago
Second, to anyone crawling hidden services or crawling over Tor: please run a relay, or decrease your hop count. Don't sacrifice others' desperate need for anonymity for your $whatever_purpose_thats_probably_not_important. It might be a fun thing for you to do, but some people rely on Tor to use a free, secure, and anonymous Internet.
jstanley|6 years ago
People who actually need anonymity need to hide among traffic that is boring. If you reduce the number of hops your crawler is using, you're reducing the amount of boring traffic and making it easier to find the interesting people.
Running a relay in addition to using Tor in the normal way is a good idea, however, as it increases the bandwidth of the network.
A polite suggestion, but this is not currently possible.
buildbuildbuild|6 years ago
The Tor Project recently added a consensus flag that can globally disable single-hop client connections as a DDoS mitigation (DoSRefuseSingleHopClientRendezvous). It is currently enabled.
Hitton|6 years ago
Disclaimer: I have rather little experience with Golang and just skimmed the crawler code.
From what I could see, the author made an effort to make the crawler distributed with k8s (which I don't think is needed, considering there are only approximately 75,000 onion addresses) using modern buzzword technology, but the crawler itself is rather simplistic. It doesn't even seem to index/crawl relative URLs, just absolute ones.
etrain|6 years ago
Assume 100 pages on each onion address (it's probably power-law, but let's just assume that's the mean). Latency over Tor is very high; assume an average of 5 s to load a single page. This is generous, because tail latency will probably dominate mean latency in this setting.
These things can happen in parallel, but let's also assume no more than 32 simultaneous TCP connections per host through a Tor proxy.
So we're looking at ~75k × 100 × 5 / 32 seconds ≈ 14 days to run through all of them. You may not need to distribute this, but there are situations (e.g., wanting a fresh index daily) where it is warranted.
creekorful|6 years ago
Author here. I'm fairly new to Golang too, and it's my first project.
Regarding the number of onion addresses, you are wrong: addresses are encoded in base32, which means there are 32 characters available. With 16-character addresses, that gives 32^16 = 1.208925819614629174706176×10^24 possible addresses. Not taken, but available.
I agree that the crawler is really simplistic. But the project is new (2 months, I think) and has to evolve. You can make a PR if you want to help me improve it!
To anyone experimenting with such stuff: take care, and don't make your services publicly available. The dark web in particular is full of highly illegal content such as child pornography, and in some jurisdictions even "involuntary possession" (such as in browser caches) may be enough to convict you.
Any HTTP-aware software that supports SOCKS proxies can access information on hidden services, so any crawler can do it. I fail to see what is novel here, except that it uses k8s and Mongo and has a catchy blog title.
Go is a horrible language in which to write a crawler. The main problem is that NLP and machine learning code simply isn't as prevalent and robust as it is in Java and Python.
malux85|6 years ago
For the uninitiated, can you please explain the difference — what they are and how they're accessed?
MuffinFlavored|6 years ago
Can somebody list some positive, legitimate, not-illegal reasons to desperately need anonymity?
rolltiide|6 years ago
Dread, the dark-net Reddit, is surprisingly vibrant.
I think it's weird that people almost don't want to hear positive stories about the dark net.
It'll be funny when news articles and rom-coms just start "forgetting" to qualify their plot piece with the "it's scary" trope.
zhdc1|6 years ago
If you're new to the field and want something that's easy to set up & polite, I strongly recommend Apache Storm Crawler (https://github.com/DigitalPebble/storm-crawler).
sbmthakur|6 years ago
However, I'm wondering what a good practical purpose for crawling the dark web would be.
creekorful|6 years ago
There's no practical purpose for the crawler. It's more an educational project than anything.
seisvelas|6 years ago
https://github.com/torgle/torgle/blob/master/backend/torgle....