sixtyfourbits | 3 years ago
Picture this: 3 million books, at least one CID each (in practice it's often multiple, since the libgen collection uses a chunk size of 256 KiB). Section 3.1 of the paper talks about content publication: for each CID, a provider record is published on up to 20 different peers. Because the CIDs are derived from a high-quality hash function, they are evenly distributed, so a node with a sufficient number of items ends up connecting to every single node on the network. At 3 million CIDs × 20 provider records each, that's 60 million publication records every 12 hours, i.e. an average of ~1,389 publication records per second (assuming one CID per file, which is conservative). This is just to announce to the network "hi... just wanted to let you know I still have the same content I did yesterday". And every full replica of libgen is doing this.
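The back-of-the-envelope arithmetic is simple enough to check directly (this assumes one CID per book, which undercounts given the 256 KiB chunking):

```python
# Republish load for a full libgen replica, per the figures above.
books = 3_000_000
replication = 20          # provider record copies per CID
interval_s = 12 * 3600    # reprovide interval in seconds

records = books * replication
rate = records / interval_s
print(records)        # 60 million provider records per cycle
print(round(rate))    # ~1389 records per second, sustained
```

And that's the floor; with multiple CIDs per book the real number is a multiple of this.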
Another major flaw derives from the way bitswap works, as discussed in Section 3.2, which states "before entering the DHT lookup, the requesting peer asks all peers it is already connected to for the desired CID". I'm not sure whether that means all the machines the node has any type of connection to, or only the connections over which it is running bitswap. Either way, asking every peer you're connected to, even if it's only the subset with an established bitswap session, is inefficient.
Compare this to bittorrent:
- First, there is a much coarser level of granularity: in the libgen case the torrents contain 1,000 books each, so there are far fewer announcements to the bittorrent DHT and trackers. The tradeoff is that you can't look up the identifier of an individual book; instead you need to know the magnet link (which can be constructed from the torrent's info-hash) and the name of the file within that torrent.
- Secondly, a node hosting a large number of torrents (and a large number of active connections) will only send out want lists to the peers that it knows are also hosting that torrent. Peers also exchange have lists, and I think one or both can be represented as a bitfield for efficiency (rather than a list of CIDs/hashes). With bitswap you can end up asking every connected peer, just in case one of them has it.
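The bitfield trick is in the original BitTorrent spec (BEP 3): a peer's have-set over a torrent's N pieces packs into ceil(N/8) bytes, versus 32 bytes per item if each piece were addressed by hash. A minimal sketch of the packing:

```python
# Pack a set of held piece indices into a BEP 3-style bitfield:
# piece 0 is the high bit of the first byte.
def pack_bitfield(have, num_pieces):
    buf = bytearray((num_pieces + 7) // 8)
    for i in have:
        buf[i // 8] |= 0x80 >> (i % 8)
    return bytes(buf)

# A 1,000-piece torrent: the entire have-set fits in 125 bytes,
# versus ~32,000 bytes as a list of per-piece hashes.
bf = pack_bitfield({0, 3, 999}, 1000)
print(len(bf))  # 125
```

That compactness is only possible because membership in the swarm already scopes the conversation to one torrent; a global content-addressed namespace like IPFS's can't pack its announcements this way.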
On a practical note, hosting the torrents is quite manageable with adequate hardware, but even with fairly powerful machines IPFS (at least in its current main implementation, go-ipfs aka Kubo) really struggles and can bring a machine to its knees, even when hosting only a portion of the full collection. In terms of scalability, bittorrent and IPFS are in completely different leagues. Scalability is the main reason the archive of papers from sci-hub (over 80 million) isn't available via IPFS yet: it's just not going to be able to handle that at all in its current state.
Having said all this, I should state that my knowledge of the protocol and go-ipfs is incomplete: I've only used it, not done any development work on it or dived deeply into the code. I'm happy to be corrected if I've misunderstood anything above. Also, bittorrent has more than 20 years of implementation experience, and I'm sure that with further work IPFS can be made to scale better. I don't have the answers as to how to keep the granularity you get with IPFS vs bittorrent (which is a major point of difference, and something that sets IPFS apart in a significant way). But the scaling definitely has to be fixed for it to be truly capable of achieving its stated goals.
b_fiive|3 years ago
I have a fair amount of experience with the kubo (go-IPFS) codebase, and can confirm the broad strokes of what you've posted here, including the part where bittorrent is straight-up better at scaling, both in terms of protocol design choices & having robust implementations.
The chattiness of the protocol is a very real problem. It used to be _much_ worse. Further order-of-magnitude drops will require rethinking numerous aspects of the protocol. The implied star topology of the network needs more thought. What remains to be seen is whether that pile of changes can bring IPFS into the same league as bittorrent in terms of network efficiency, while keeping the "single-swarm property" that provides fine-grained content routing.
A bunch of us are committed to building this. Hopefully an HN post a few years from now will point back to this one as a reference for just how far we've come.
xani_|3 years ago
Even in massive torrents with thousands of peers you just pick a bunch to talk with and that's it.
And publishing one is essentially the same thing; each peer's network of peers meshes enough to propagate data quickly.
> Also, bittorrent has more than 20 years of implementation experience and i'm sure with further work IPFS can be made to scale better.
It honestly looks like a fundamental design issue. Bittorrent was blazing fast from the beginning, and it has only gotten better.