agamble | 8 years ago
Briefly: Sites are archived using a system written in Golang and uploaded to a Google Cloud bucket.
More: The system downloads the remote HTML, parses it to extract the relevant dependencies (<script>, <link>, <img> etc.) and then downloads those as well. Tesoro even parses CSS files to extract the url('...') file dependencies, so most background images and fonts should continue to work. All dependencies (even those hosted on remote domains) are downloaded and hosted alongside the archive, and the src attributes on the original page tags are rewritten to point at the new location.
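The CSS dependency extraction described above might look something like this in Go. This is a simplified sketch, not Tesoro's actual code: the regex and the function name are illustrative, and a production parser would also need to handle CSS escape sequences and @import rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// cssURLPattern matches url('...'), url("...") and bare url(...) references
// in a stylesheet. Hypothetical sketch; not Tesoro's real parser.
var cssURLPattern = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

// extractCSSURLs returns every file dependency referenced via url(...)
// in the given CSS source, in document order.
func extractCSSURLs(css string) []string {
	var urls []string
	for _, m := range cssURLPattern.FindAllStringSubmatch(css, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	css := `body { background: url('/img/bg.png'); }
@font-face { src: url("https://fonts.example.com/a.woff2"); }`
	fmt.Println(extractCSSURLs(css))
	// [/img/bg.png https://fonts.example.com/a.woff2]
}
```

Each extracted URL would then be downloaded and the url(...) reference rewritten to the archived copy's path, the same wrangling applied to src attributes in the HTML.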
The whole thing is hosted on GCP Container Engine and I deploy with Kubernetes.
I'll write up a more comprehensive blog post in due course. Which part of this would you like to hear more about?
19eightyfour | 8 years ago
How can you pay for this if it's free? It's unreliable unless it's financially viable.
agamble | 8 years ago
For now it's a free service behind a single rate-limited form; the next step is to add specialty features that are worth paying for.
Faaak | 8 years ago
The page URI is a bit obscure though. I think something like tresoro.io/example.tld/page/foobar/timestamp would look better.
What about big media content, and/or small differences between archives?
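The path layout suggested above is easy to construct mechanically. A minimal sketch in Go, where archivePath and its parameters are hypothetical (this is the commenter's proposed scheme, not Tesoro's actual one):

```go
package main

import "fmt"

// archivePath builds a readable archive URI of the form
// host/example.tld/page/foobar/timestamp, as proposed above.
// Hypothetical helper; not part of Tesoro.
func archivePath(host, pageHost, pagePath string, unixTS int64) string {
	return fmt.Sprintf("%s/%s%s/%d", host, pageHost, pagePath, unixTS)
}

func main() {
	fmt.Println(archivePath("tresoro.io", "example.tld", "/page/foobar", 1500000000))
	// tresoro.io/example.tld/page/foobar/1500000000
}
```

Using a Unix timestamp as the final segment keeps multiple snapshots of the same page distinct and sortable.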
agamble | 8 years ago
Currently there is no global redundancy checking, only local checking within the same page, so two identical CSS files referenced from different archives are both kept. While this might not be ideal for scaling to infinity, each archive plus its dependencies is currently capped at 25MB, which should help keep costs under control until this is monetised. :)
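Global redundancy checking, if it were added later, could be as simple as content-addressing each dependency by its hash, so identical files from different archives are stored once. A hypothetical in-memory sketch, not Tesoro's implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store maps content hashes to stored blobs. In a real deployment this
// would be keyed storage in the Cloud bucket; an in-memory map is used
// here purely for illustration.
var store = map[string][]byte{}

// put stores data under its SHA-256 digest and reports whether an
// identical blob was already present (i.e. a duplicate across archives).
func put(data []byte) (key string, duplicate bool) {
	sum := sha256.Sum256(data)
	key = hex.EncodeToString(sum[:])
	if _, ok := store[key]; ok {
		return key, true
	}
	store[key] = data
	return key, false
}

func main() {
	css := []byte("body{margin:0}")
	_, dup1 := put(css) // first archive stores the file
	_, dup2 := put(css) // second archive finds it already stored
	fmt.Println(dup1, dup2)
	// false true
}
```

The trade-off is that deleting one archive can no longer delete its blobs outright; you need reference counting or periodic garbage collection over the shared store.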