This is a recent video presentation by Jonah Edwards, who runs the Core Infrastructure Team at the Internet Archive. He explains the IA’s server, storage, and networking infrastructure, and then takes questions from other people at the Archive.
I found it all interesting. But the main takeaway for me is his response to Brewster Kahle’s question, beginning at 13:06, about why the IA does everything in-house rather than having its storage and processing hosted by, for example, AWS. His answer: lower cost, greater control, and greater confidence that their users are not being tracked.
I have ADD and typically eschew watching video if I can get the same quality content faster in text.
I loved this and watched it to the end.
To any that feel the need to hide or disparage this because it seems to promote doing things on your own vs in the cloud, this isn’t some tech ops Total Money Makeover where you read a book and you’re suddenly in some sort of anti-credit cult. This is hard shit, and it’s the basics of the hard shit that I grew up with as being the only shit.
Yes, you can serve your own data. No one should fault you for doing that if you want. It takes the humble intelligence of the core team and everyone at IA to pull that off at this scale. If you don’t want to do the hard things, you could use the cloud. There are financial reasons for one or the other too, just as there are reasons people live with family, rent, lease, or buy homes and office space, an imperfect analogy of course.
I hope that some of the others who could go on to work at the big guys, or who have been working there and want a challenge, consider applying to IA when there’s an opening. They’ve done an incredible job, and I look forward to the cool things they accomplish in the future.
Speaking of infrastructure, it is amazing that the initial set of Apache big data projects started at the Internet Archive [0], whilst Alexa Internet, a startup Brewster Kahle sold to Amazon in 1999, formed the basis of the Alexa Web Information Service, one of the first "AWS" products [1], which is still up: https://aws.amazon.com/awis/
Curious about the cost. Does that already include manpower and the various acquisition costs of constructing their internal network (hardware, fiber links between sites)?
I guess the biggest downside is the speed at which they can scale, since it is limited by how fast they can purchase and install new storage devices. But with the Internet Archive's use case, that shouldn't matter much.
When I was a kid, my dream was to work at Google, Microsoft, Apple, or any of these big companies. Now that I am reaching 30 (and becoming very nostalgic for the old web), I think the company that would make me happiest to log in to every morning to get some work done would be the Internet Archive.
Hate their "front end". One of the most disorganized user-facing sites.
I would love to see an effort to address that.
Would it be possible for archive.org to offer an API to allow other sites to present archive.org using their own front-end? We could see lots of specialty sites that focus on the user experience for some slice of archive.org.
I get the impression Wayback Machine data is stored in powered-down drives and they only spin them up when someone accesses the data. That would explain the several-second delay, and it'd make sense that an archive wouldn't need 95% of its data ready to go at a moment's notice, since that'd be a terrible waste of power.
One aspect is that our data centers are in California, with no CDN. If you are on the other side of the world, you will have higher round-trip latency on every request, for all services.
Another is layers of caching. Popular or recently requested Wayback content is more likely to be in either an explicit cache (eg, redis), or implicitly in kernel page caches across all layers of the request.
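The fall-through behavior of those cache layers can be sketched like this (a toy model; the layer names and dictionary-backed stores are made up for illustration, not IA's actual code):

```python
# Sketch of layered cache lookup: try the cheapest layer first and
# fall back to slower ones. Names here are illustrative only.

def make_layered_getter(layers):
    """layers: list of (name, store) pairs ordered fast -> slow."""
    def get(key):
        for name, store in layers:
            if key in store:
                return name, store[key]
        return None, None
    return get

redis_cache = {"/popular": "<html>cached</html>"}   # explicit cache
page_cache = {"/recent": "<html>warm</html>"}       # kernel page cache stand-in
disk = {"/popular": "...", "/recent": "...", "/cold": "<html>slow</html>"}

get = make_layered_getter([("redis", redis_cache),
                           ("page-cache", page_cache),
                           ("disk", disk)])

print(get("/popular"))  # served from the explicit cache
print(get("/cold"))     # falls all the way through to spinning disk
```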
Every wayback replay request hits several layers of indexes (sorted by domain, path, and timestamp), which are huge and thus actually served from spinning disk over HTTP (!). This includes a timeline summary for the primary document, to display the banner. Then the actual raw records are fetched from another spinning disk over HTTP. This may result in one or more layers of internal redirect (additional fetches) if there was a "revisit" (identical HTTP body content, same URL, different timestamp). Then finally the record is re-written for replay (for HTML, CSS, JavaScript, etc., unless the raw record was requested). Some pages have many sub-resources, so this process is repeated many times, but that is the same as any page load, and you can see which resources are slow.
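The multi-step flow described above (index lookup, record fetch, revisit resolution, rewrite) could be sketched roughly like this. It is a toy model with made-up index and record structures, not IA's actual code:

```python
# Hypothetical sketch of wayback replay: look up the capture in a
# sorted index, fetch the record, chase "revisit" indirection, then
# rewrite links for replay. All data structures are illustrative.

INDEX = {  # url -> list of (timestamp, record_id), sorted by timestamp
    "example.com/": [("20200101", "rec-1"), ("20210101", "rev-of-rec-1")],
}
RECORDS = {
    "rec-1": {"body": "<a href='/x'>x</a>", "revisit_of": None},
    # later capture with identical body: stored as a pointer, not a copy
    "rev-of-rec-1": {"body": None, "revisit_of": "rec-1"},
}

def replay(url, timestamp):
    # 1. index lookup: latest capture at or before the requested timestamp
    captures = [c for c in INDEX[url] if c[0] <= timestamp]
    _, rec_id = captures[-1]
    # 2. fetch the raw record; 3. follow "revisit" indirection if needed
    rec = RECORDS[rec_id]
    while rec["revisit_of"] is not None:
        rec = RECORDS[rec["revisit_of"]]  # extra internal fetch
    # 4. rewrite links so sub-resources also go through replay
    return rec["body"].replace("href='/", f"href='/web/{timestamp}/")

print(replay("example.com/", "20210601"))
```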
As mentioned in the video, depending on where we are in the network hardware upgrade lifecycle, sometimes outbound bandwidth is tight also, which slows down transfer.
And of course most of these services operate without a ton of overhead, so if there is a spike in traffic everything will slow down a bit. There are also a lot of multi-tenancy-like situations, so if there is a very popular zip file or Flash game being served from the same storage disk as the WARC file holding a wayback resource, the replay for that specific resource will be slow due to disk I/O contention.
If you are curious about why a specific HTML wayback replay was slow, you can look in the source code of the re-written document and see some timing numbers.
Several organizations run large web archives that operate similarly to web.archive.org, and have described cost/benefit trade-offs for different components. E.g., the National Library of Australia has an alternative CDX index called OutbackCDX, which uses RocksDB on SSDs. I believe other folks store WARC files in S3 or S3-like object storage systems. The Wayback Machine is somewhat unique in the amount of (read) traffic it gets, the heterogeneity of archived content (from several crawlers, in older ARC as well as WARC), the volume of live crawling ("Save Page Now" results show up pretty fast in the main site, which is black magic), running on "boring" general-purpose hardware, and deep integration with our general-purpose storage cluster.
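For the curious, the core trick in a CDX-style index is that capture lines are keyed by SURT-ordered URL plus timestamp and kept sorted, so a lookup is a cheap range scan. A minimal sketch, with a simplified field layout (the real CDX format has more fields):

```python
# CDX-style lookup sketch: sorted lines of "surt-key timestamp file offset",
# searched with binary search. Field layout simplified for illustration.
import bisect

cdx = sorted([
    "org,example)/ 20190505 warc-003 12345",
    "org,example)/ 20200101 warc-001 777",
    "org,example)/page 20200202 warc-002 9000",
])

def captures_for(surt_key):
    """All captures of one URL: a contiguous range in the sorted index."""
    lo = bisect.bisect_left(cdx, surt_key + " ")
    hi = bisect.bisect_right(cdx, surt_key + " \xff")
    return cdx[lo:hi]

print(captures_for("org,example)/"))  # two captures of the homepage
```

Because the lines are sorted, this works the same whether they live in a flat file on spinning disk or as keys in an ordered key-value store like RocksDB.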
I see incredible value in the IA's collection of books, videos, and software. OTOH, I'm puzzled by the lack of organization.
Take for example this newer document: https://archive.org/details/manualzilla-id-5695071 The document has a horrible name and useless tags, and the content seems to be only section 2 of some software manual. How would I ever hope to find it if I needed that exact document?
Obviously such a huge archive cannot be categorized and annotated by a small team, so it would make sense to crowdsource the labeling process. Yet, as a registered user, I can only flag an item or write a review. Why doesn't IA let users label content and build their own curated collections of items?
Commented above before I saw yours. I agree, and wonder further if archive.org could play host instead to any number of spinoff sites that try to better organize/present the data (or a subset of the data) on archive.org.
To be fair, the linked document was uploaded just today to a collection that seems to be considered a "waystation" collection, so the document will probably be moved to a permanent collection later.
And I think they have bots that process the uploaded documents to do OCR and create previews.
The Internet Archive is incredibly commendable! It's impressive that such a small organization can do so much. I wish they would fix the Wayback Machine being broken in Firefox, though.
Are you getting "Fail with status: 498 No Reason Phrase"? You might have your Referer header disabled.
If that's the case, you can fix it by going to about:config and setting network.http.sendRefererHeader to 2 (or pressing the reset button to the right).
According to an IA blog post in 2016 [1], they had partial backups in Egypt and the Netherlands then and were planning to establish a partial mirror in Canada as well. Perhaps someone at the IA can tell us what the current status of those mirrors is.
Brewster Kahle does mention in the video (around 26:50) that one reason they use paired storage is so that the two disks in a pair can be in different countries. He then goes on to praise “Linux and the wonder of open source”; he recently blogged about that, too [2].
If anyone from the IA reads this, are there any plans for IPv6 support? Archiving IPv4 for future generations is an important goal, so it's perhaps fitting that in 2021 the IA is still running a historical Internet Protocol, but it would be nice to have IPv6 support for those running IPv6-only networks.
Does the IA have Data sites that are not in SF?
When he shows the map of sites, they all seem very close to each other, and a natural disaster could wipe out a lot of the archive.
They are mostly all in California, though not entirely in SF. They send some of their data to other parts of the world too, but I'm concerned that they don't have the redundancy needed. There was a project attempting to back up the Internet Archive, but it became unmaintained in 2019 and only about 200 TB were actually being backed up: http://iabak.archiveteam.org/
Honestly, I've lost a bit of confidence in the Internet Archive project's judgement and long term stability, given the risk they took in distributing books during Covid-19. I feel that they risked the entire project with massive (almost certainly fatal) copyright violation fines in order to distribute extra copies of books. That lawsuit is still pending and we don't know what the outcome of that lawsuit will be.
If they're willing to risk everything they've accomplished to date in order to issue a few extra copies of books, I don't see them surviving long term, nor do I feel comfortable donating to the project.
If I understood correctly, they are using simple physical disk mirrors for redundancy. To me that seems like a huge waste of disk space. Parity-based redundancy schemes like RAID-Z3 are way more space efficient.
I do understand that parity based schemes need more time to heal/rebuild on drive replacements, but that does not seem to outweigh the huge amount of wasted disk space IMHO.
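A quick back-of-the-envelope comparison makes the space argument concrete (simple arithmetic; the 12-wide RAID-Z3 vdev is a hypothetical example width, not IA's layout):

```python
# Usable-capacity comparison: 2-way mirroring vs a RAID-Z3-style
# parity layout. Illustrative arithmetic only.

def mirror_efficiency():
    return 1 / 2  # every byte is stored twice

def raidz3_efficiency(total_disks, parity_disks=3):
    """Fraction of raw capacity usable with triple parity."""
    return (total_disks - parity_disks) / total_disks

print(f"2-way mirror:      {mirror_efficiency():.0%} usable")
print(f"RAID-Z3 (12-wide): {raidz3_efficiency(12):.0%} usable")
```

So a 12-wide RAID-Z3 yields 75% usable space versus 50% for mirroring, at the cost of longer rebuilds and losing the mirror scheme's option of putting each half of a pair in a different location.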
The paired disks are in a different physical location so they also provide a degree of geographic redundancy.
Short of splitting a RAID array across two physical locations (a terrible idea), your proposal would require them to run mirrored RAID arrays in both locations.
This would give them greater redundancy, but would be a less efficient use of raw disk space than their current solution. It would also be more complex and difficult to maintain, and have performance impacts.
Besides the cross-DC issue others have mentioned, erasure coding everything can also exacerbate CPU or memory bottlenecks. Not sure if this is an issue for IA, but on my last project data would be initially replicated and then transparently converted to erasure codes after some time. I believe that some other exabyte-scale storage systems work similarly.
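That "replicate first, erasure-code later" lifecycle can be sketched as a simple cost function. The replica count, EC geometry, and age threshold below are purely illustrative assumptions, not any particular system's defaults:

```python
# Sketch of a tiered durability policy: fresh objects are replicated
# (cheap CPU, fast access); cold objects are erasure-coded (space
# efficient, more CPU). Parameters are illustrative.

def storage_cost(size_bytes, age_days, replicas=3, ec_data=10, ec_parity=4,
                 ec_age_threshold=30):
    """Raw bytes consumed by one object under this hypothetical policy."""
    if age_days < ec_age_threshold:
        # hot/fresh data: plain replication
        return size_bytes * replicas
    # cold data: (ec_data + ec_parity) shards per ec_data shards of payload
    return size_bytes * (ec_data + ec_parity) / ec_data

print(storage_cost(1_000_000, age_days=1))   # fresh: 3x replicated
print(storage_cost(1_000_000, age_days=90))  # cold: 1.4x with 10+4 EC
```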
I was wondering if they use some machine learning/artificial intelligence to prevent hard drives from failing, or to move the data around more efficiently.
I've been looking into Ceph a lot recently and was just wondering if they used it. Apparently not. Perhaps too abstract given their value of simplicity.
The point is that Ceph and friends have a lot of overhead. Example in Ceph: by default, a file in the S3 layer is split into 4 MB chunks, and each of those chunks is replicated or erasure-coded. Using the same erasure coding as Wasabi or B2 Cloud, which is 16+4=20 (or 17+3=20), each of those 4 MB chunks is split into 20 shards of ~200 KB each. Each of those shards ends up having ~512 B to 4 KB of metadata.
So that's 10 KB to 80 KB of metadata for a single 4 MB chunk.
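The arithmetic above, worked through using the figures from the comment:

```python
# Metadata-overhead arithmetic: a 4 MB chunk erasure-coded into 20
# shards, each shard carrying roughly 512 B to 4 KB of metadata.

CHUNK = 4 * 1024 * 1024          # 4 MB chunk, in bytes
SHARDS = 20                      # e.g. 16+4 or 17+3 erasure coding
META_LOW, META_HIGH = 512, 4096  # per-shard metadata, in bytes

shard_size = CHUNK // SHARDS
print(f"shard size: ~{shard_size // 1024} KB")       # ~204 KB per shard
print(f"metadata per chunk: {SHARDS * META_LOW // 1024} KB "
      f"to {SHARDS * META_HIGH // 1024} KB")         # 10 KB to 80 KB
```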
[0] http://radar.oreilly.com/2015/04/coming-full-circle-with-big...
[1] https://web.stanford.edu/class/ee204/Publications/Amazon-EE3...
And that right there is why I continue to donate to IA. I am sick and tired of services offloading my data to destructive companies like Amazon.
Most of the content on IA loads pretty fast, so the Wayback Machine is a notable exception.
Still, I strongly support the work they do and think it's very important work. I also think they do a good job for their size and resources.
Note: I work at IA but not on the Wayback system
[1] http://blog.archive.org/2016/12/03/faqs-about-the-internet-a...
[2] http://blog.archive.org/2021/02/04/thank-you-ubuntu-and-linu...
This is preferred for its simplicity and performance, and if I were in their position I would do the same thing.
https://www.npr.org/2020/06/03/868861704/publishers-sue-inte...
Have they lost data during a rebuild? I know he briefly talked about that risk with different HD sizes.