pilimi_anna | 2 years ago
High-speed access available for anyone who can do at-scale text extraction, or who can supply us with new collections.
sillysaurusx | 2 years ago
Please focus on your opsec. The more visible you become, the angrier people will get. Don’t do anything silly like edit your Wikipedia page from your house.
With that out of the way, someone I know happens to have the original books3 epub files. I think they can be convinced to send them to you. It’s only 200,000 books, but that could theoretically grow your collection by 10% or so. I don’t know whether that would be helpful to you (you’ve far surpassed books3 at this point), but if so, let me know.
Given the legal risks, the best course of action for AI companies is probably to ignore English and European books entirely. There is plenty of Chinese data, and the models would learn all the same concepts without exposing anyone to lawsuits.
stavros | 2 years ago
Basically, you download a client and tell it "allocate 2 TB of my disks to whatever archive.org/donate/disk.rss says", and the server/client combination ensures you download and seed the rarest 2 TB of the collection.
This design is also open, in the sense that the server can share the database of torrents it contains, and anyone can use it to fetch any of the files in the dataset from the swarm.
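The allocation step described above can be sketched roughly as follows. This is a hypothetical illustration, not a real client: the `Torrent` record and `pick_rarest` helper are invented names, and a real implementation would track per-piece availability in the swarm rather than a single seeder count per torrent. The core idea is a greedy fill of the donated disk budget, rarest torrents first:

```python
from dataclasses import dataclass

@dataclass
class Torrent:
    name: str
    size_bytes: int
    seeders: int  # how many peers in the swarm already hold this file

def pick_rarest(catalog: list[Torrent], budget_bytes: int) -> list[Torrent]:
    """Greedily fill the donated disk budget with the rarest torrents first."""
    chosen, used = [], 0
    for t in sorted(catalog, key=lambda t: t.seeders):
        if used + t.size_bytes <= budget_bytes:
            chosen.append(t)
            used += t.size_bytes
    return chosen

# Toy catalog: the client skips the rare-but-too-large file and
# tops up the remaining budget with the next-rarest that fits.
catalog = [
    Torrent("common.tar", 500, 120),
    Torrent("rare-a.tar", 400, 1),
    Torrent("rare-b.tar", 700, 2),
]
print([t.name for t in pick_rarest(catalog, 1000)])  # → ['rare-a.tar', 'common.tar']
```

In practice the server would recompute seeder counts periodically and tell clients to swap out torrents that have become well-replicated, so the aggregate storage keeps covering the long tail.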
Would something like this be at all useful? I've emailed a few archivists but got no response, and the one person I've managed to talk to about it said there have been a few attempts at this, but they always fail for one reason or another.
pilimi_anna | 2 years ago