agamble | 8 years ago
Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.
More: The system downloads the remote HTML, parses it to extract the relevant dependencies (<script>, <link>, <img> etc.) and then downloads those as well. Tesoro even parses CSS files to extract the url('...') file dependencies, so most background images and fonts should continue to work. All dependencies (even those hosted on remote domains) are downloaded and hosted alongside the archive, and the src attributes on the original page tags are rewritten to point at the new location.
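The CSS dependency extraction described above might look something like this in Go. This is a simplified sketch, not Tesoro's actual code: the regex and the function name are illustrative, and a production parser would also need to handle CSS escape sequences and @import rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// cssURLPattern matches url('...'), url("...") and bare url(...) references
// in a stylesheet. Hypothetical sketch; not Tesoro's real parser.
var cssURLPattern = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

// extractCSSURLs returns every file dependency referenced via url(...)
// in the given CSS source, in document order.
func extractCSSURLs(css string) []string {
	var urls []string
	for _, m := range cssURLPattern.FindAllStringSubmatch(css, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	css := `body { background: url('/img/bg.png'); }
@font-face { src: url("https://fonts.example.com/a.woff2"); }`
	fmt.Println(extractCSSURLs(css))
	// [/img/bg.png https://fonts.example.com/a.woff2]
}
```

Each extracted URL would then be downloaded and the url(...) reference rewritten to the archived copy's path, the same wrangling applied to src attributes in the HTML.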
The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.
I'll write up a more comprehensive blog post in due course. Which part of this would you like to hear more about?
19eightyfour | 8 years ago
How can you pay for this if it's free? It's unreliable unless it's financially viable.
agamble | 8 years ago
For now it's a free service behind a single rate-limited form; the next step is to add specialty features that are worth paying for.
Faaak | 8 years ago
The page URI is a bit obscure though. I think something like tresoro.io/example.tld/page/foobar/timestamp would look better.
What about big media content, and/or small differences between archives?
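The path layout suggested above is easy to construct mechanically. A minimal sketch in Go, where archivePath and its parameters are hypothetical (this is the commenter's proposed scheme, not Tesoro's actual one):

```go
package main

import "fmt"

// archivePath builds a readable archive URI of the form
// host/example.tld/page/foobar/timestamp, as proposed above.
// Hypothetical helper; not part of Tesoro.
func archivePath(host, pageHost, pagePath string, unixTS int64) string {
	return fmt.Sprintf("%s/%s%s/%d", host, pageHost, pagePath, unixTS)
}

func main() {
	fmt.Println(archivePath("tresoro.io", "example.tld", "/page/foobar", 1500000000))
	// tresoro.io/example.tld/page/foobar/1500000000
}
```

Using a Unix timestamp as the final segment keeps multiple snapshots of the same page distinct and sortable.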
agamble | 8 years ago
Currently there is no global redundancy checking, only local checking within the same page, so two identical CSS files referenced from different archives are both kept. While this might not be ideal for scaling to infinity, each archive plus its dependencies is currently capped at 25MB, which should help keep costs under control until this is monetised. :)
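Global redundancy checking, if it were added later, could be as simple as content-addressing each dependency by its hash, so identical files from different archives are stored once. A hypothetical in-memory sketch, not Tesoro's implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store maps content hashes to stored blobs. In a real deployment this
// would be keyed storage in the Cloud bucket; an in-memory map is used
// here purely for illustration.
var store = map[string][]byte{}

// put stores data under its SHA-256 digest and reports whether an
// identical blob was already present (i.e. a duplicate across archives).
func put(data []byte) (key string, duplicate bool) {
	sum := sha256.Sum256(data)
	key = hex.EncodeToString(sum[:])
	if _, ok := store[key]; ok {
		return key, true
	}
	store[key] = data
	return key, false
}

func main() {
	css := []byte("body{margin:0}")
	_, dup1 := put(css) // first archive stores the file
	_, dup2 := put(css) // second archive finds it already stored
	fmt.Println(dup1, dup2)
	// false true
}
```

The trade-off is that deleting one archive can no longer delete its blobs outright; you need reference counting or periodic garbage collection over the shared store.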