top | item 38510616

(no title)

ikatson | 2 years ago

When I built it at first a couple years ago, it all went pretty smooth until I hit concurrent communication with many peers, both for DHT and torrent downloading.

E.g. parsing bencode, and other binary protocols involved in the network (bittorrent peer protocol, tracker requests, DHT protocol) was all easy comparing to managing state.

Managing state (e.g. what parts we have downloaded so far, what are we downloading at the moment, what peers are we expecting to get pieces from, etc) has a lot of edge cases that need to be handled.

For a recent example, I was dealing with a bug that manifested like this: when resuming a torrent from "paused" state, it was stuck at ~99.87% downloaded, and never progressed. Turned out, than when I put the torrent into "paused" state, after disconnecting the peers, I wasn't marking the in-flight pieces for re-download. So when new peers connected, they had nothing to do.

DHT was also tricky, for multiple reasons:

- as it works via UDP, there's no "connection" - you need to match incoming messages to previously sent requests yourself. But then you might never get a response, so you must manage some kind of timeouts, to clean up memory of old outgoing requests that are no longer expected to be completed.

- DHT involves "recursive" requests. You query N peers at first, and each of them might return M other peers. So on next round you need to make N * M requests. And this recursion can continue forever, ever growing exponentially. You need to put some heuristics around not to explode the network and at least your own computer. For example, my MacOS UI was freezing when rqbit was trying to send too many DHT requests at once.

- Handling the above (and other) examples makes the code worse. Managing code complexity, at least for myself to understand what on earth is going on, is much harder than e.g. implementing the binary protocols.

Otherwise, it's all about tricky details, behaviours that can only be observed under certain circumstances, and only if you're curious enough to look, e.g.

- Garbage collection. Ensuring that when the client (e.g. peer, browser etc) disconnect, everything is cleaned up. E.g. when the torrent is paused, this causes a massive "stop" for all spawned tasks. If you don't account for that, they might keep running forever.

- Network issues. Re-connecting to peers, to DHT etc, re-trying everything that can be retried, might be a head-scratcher sometimes.

All that said, it's quite fun to deal with these when that's your goal by itself - to enjoy the process of coding.

discuss

anacrolix|2 years ago

I found the trickiest parts to be DHT concurrency (finally solved this adequately after 5-6 years of experimenting), and efficient block requesting (I've rewritten this every few years, but my latest implementation seems to be solid since 2020 or so). The major thing I've not solved, but also been too lazy to tackle is a peer cache. I just reaannounce when all peers are exhausted and start over.

ikatson|2 years ago

Concurrency and the shared state nature of torrents are definitely what makes it all hard and tricky. I've rewritten this part of DHT completely recently, but according to your experience looks like this time won't be the last.

For block requesting, rqbit has a pretty simple algorithm https://github.com/ikatson/rqbit/blob/main/crates/librqbit/s..., and I didn't notice it in benchmarks, thanks to Rust being fast by default I guess. I admit though, never looked how other clients do it, maybe the rqbit algorithm is too naive.