(no title)
lappa | 9 months ago
It's easy to set up, but be warned, it takes up a lot of disk space.
$ du -h ~/archive/webpages
1.1T /home/andrew/archive/webpages
https://github.com/gildas-lormeau/SingleFile
internetter|9 months ago
1. find a way to dedup media
2. ensure content blockers are doing well
3. for news articles, put it through readability and store the markdown instead. if you wanted to be really fancy, you could instead attempt to programmatically create a "template" of sites you've visited with multiple endpoints, so the style is retained but you're not storing the content. alternatively a good compression algo could do this, if your directory looked like /home/andrew/archive/boehs.org.tar.gz, with all the boehs.org pages you visited saved inside the tar
4. add fts and embeddings over the pages
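A rough sketch of point 1, assuming the saved pages embed their media as base64 data URIs (which is how SingleFile output usually looks). The function name and the `blob:` placeholder scheme are made up for illustration; identical media hashes to the same key, so it is stored once no matter how many pages reference it.

```python
import hashlib
import re

# Matches base64 data URIs such as data:image/png;base64,AAAA...
DATA_URI = re.compile(r"data:[\w.+-]+/[\w.+-]+;base64,[A-Za-z0-9+/=]+")


def dedup_data_uris(pages):
    """Replace each embedded data URI with a hash reference.

    pages maps a page name to its HTML text. Returns (rewritten, blobs),
    where blobs maps a SHA-256 hex digest to the original data URI.
    """
    blobs = {}

    def swap(match):
        uri = match.group(0)
        digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        blobs[digest] = uri  # identical media collapses onto one key
        return f"blob:{digest}"  # hypothetical placeholder, not a real URL scheme

    rewritten = {name: DATA_URI.sub(swap, html) for name, html in pages.items()}
    return rewritten, blobs
```

Restoring a page would mean swapping each `blob:<digest>` placeholder back for `blobs[digest]`. The regex ignores data URIs with extra parameters (e.g. a charset), so treat it as a starting point.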
ashirviskas|9 months ago
windward|9 months ago
It is. 1.1TB is both:
- objectively an incredibly huge amount of information
- something that can be stored for the cost of less than a day of this industry's work
Half my reluctance to store big files is just an irrational fear of the effort of managing it.
davidcollantes|9 months ago
nirav72|9 months ago
snthpy|9 months ago
A couple of questions:
- do you store them compressed or plain?
- what about private info like bank accounts or health insurance?
I guess for privacy one could train oneself to use private browsing mode.
Regarding compression, for thousands of files don't all those self-extraction headers add up? Wouldn't there be space savings by having a global compression dictionary and only storing the encoded data?
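For what it's worth, the shared-dictionary idea can be sketched with zlib's preset dictionary support in the Python standard library. This is only an illustration of the concept, not how SingleFile stores anything; the helper names are made up, and zlib only looks at the last 32 KiB of the dictionary, so a real archive would likely want a compressor with trained dictionaries instead.

```python
import zlib


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress data against a preset dictionary shared by all pages."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()
```

The dictionary is stored once, outside the compressed files, and would hold whatever boilerplate the pages have in common. The trade-off is that losing or changing the dictionary makes every file compressed against it unreadable.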
d4mi3n|9 months ago
Can’t speak to your other issues but I would think the right file system will save you here. Hopefully someone with more insight can provide color here, but my understanding is that file systems like ZFS were specifically built for use cases like this where you have a large set of data you want to store in a space efficient manner. Rather than a compression dictionary, I believe tech like ZFS simply looks at bytes on disk and compresses those.
genewitch|9 months ago
I haven't put in the effort to make a "bookmark server" that would do what SingleFile does, but hosted on the internet, because of how well SingleFile works.
shwouchk|9 months ago
- Do you also archive logged in pages, infinite scrollers, banking sites, fb etc?
- How many entries is that?
- How often do you go back to the archive? is stuff easy to find?
- do you have any organization or additional process (eg bookmarks)?
did you try integrating it with llms/rag etc yet?
eddd-ddde|9 months ago
nyarlathotep_|9 months ago
dataflow|9 months ago
RiverCrochet|9 months ago
90s_dev|9 months ago