Please consider producing archives in WARC format, and either donating captures of public pages to the Internet Archive (and other interested archives), or supporting ways for users to download their own archives in that format for them to donate them themselves and use in systems like Webrecorder.
(Note that a download of just page content and assets isn't enough; WARC stores headers, etc., also.)
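To make the difference concrete, here is a rough sketch (hand-rolled, stdlib only; function name and field choices are illustrative, not from any PageDash code) of what a minimal WARC/1.0 response record carries beyond the raw page bytes: the original HTTP status line and headers are part of the payload, wrapped in WARC metadata.

```python
import uuid
from datetime import datetime, timezone

def make_warc_response(target_uri, http_headers, body):
    """Build one minimal WARC/1.0 'response' record by hand.

    The payload is the full HTTP message (status line + headers + body),
    which is exactly what a plain "save the page content" download loses.
    """
    payload = http_headers + b"\r\n\r\n" + body
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record_headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + target_uri.encode(),
        b"WARC-Date: " + warc_date.encode(),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"Content-Type: application/http; msgtype=response",
        # Content-Length counts the whole HTTP message block, per the spec.
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    # A record is: headers, a blank line, the payload, then two blank lines.
    return b"\r\n".join(record_headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = make_warc_response(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    b"<html>hello</html>",
)
```

Real tooling (warcio, wget's `--warc-file`) handles request records, digests, and gzip per-record compression on top of this, but the shape above is the core of the format.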
Thanks for the comment. Admittedly I bypassed WARC completely, as I felt overwhelmed by its technicalities, in favor of how I knew the web worked. If I develop a better understanding of WARC, maybe that can be done, but I make no promises.
Founder here, happy to field questions and feedback!
Right now PageDash is quite a simple product, but hopefully with sufficient traction we can continue to implement things like full-text search, tagging support, link sharing, and mobile support. Your support is absolutely crucial to making PageDash come alive even more in the future.
This is my first product, thank you for being nice :)
Would it be possible to add auto-archiving to the extension?
On the $9/mo plan, I'd probably still not hit 100GB/mo of uploads.
My main reason for wanting this feature is that I could use the full text search (when it's available) to search every webpage I've visited. I find myself more and more frequently unable to find things I know existed at one point. I've been thinking of building my own solution where I just archive every page I visit on the fly then build a personal search for pages I've previously visited.
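The "search everything I've ever visited" idea reduces to an inverted index over saved page text. A toy sketch (class and URL names are made up for illustration):

```python
import re
from collections import defaultdict

class PageIndex:
    """Toy inverted index over saved pages: word -> set of page URLs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, url, text):
        """Index every word of a saved page's extracted text."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._index[word].add(url)

    def search(self, query):
        """Return URLs containing every query word (AND semantics)."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        hits = set(self._index.get(words[0], set()))
        for word in words[1:]:
            hits &= self._index.get(word, set())
        return hits

idx = PageIndex()
idx.add("https://example.com/a", "Rust borrow checker explained")
idx.add("https://example.com/b", "A gentle intro to the borrow checker")
```

A production version would add ranking, stemming, and on-disk storage (SQLite FTS5 or similar gets you most of that for free), but the data structure is the same.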
I really like the idea of this and other, similar products/services. I haven't used any of them since they don't seem to be exactly what I want.
What I really want is auto-tagging and classification plus semantic search. I don't even really want to have to save the page. I want this functionality on my browsing history.
Maybe some increased functionality for saving specific types of pages. If I save a recipe, I want the service to recognize that it's a recipe and put it in my 'cookbook'. With a consistent format, if possible. If I save a blog post, tag the topic, technology and language used.
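One low-tech way to get part of this without any ML: many recipe and blog platforms already embed schema.org JSON-LD metadata that declares the page type outright. A hedged sketch (regex-based extraction, fine for a prototype but not a real HTML parser):

```python
import json
import re

def detect_page_type(html):
    """Return the schema.org @type declared in a page's JSON-LD, if any.

    Pages with <script type="application/ld+json"> blocks self-describe as
    "Recipe", "BlogPosting", etc., which is far more reliable than guessing
    from the visible text.
    """
    pattern = r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>'
    for match in re.finditer(pattern, html, re.DOTALL | re.IGNORECASE):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and isinstance(item.get("@type"), str):
                return item["@type"]
    return None

html = '''<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Shakshuka"}
</script></head><body>...</body></html>'''
```

Pages that declare `"@type": "Recipe"` could be routed straight into a 'cookbook' collection; ML classification would only be needed as a fallback for pages without structured data.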
I really want auto-tagging via ML classification as well; it was one of the things I wanted, besides one-click save, when I started the project. It's a nice-to-have at the moment and can only be achieved once PageDash matures more. Right now the closest workaround I can offer for your use case is to configure a keyboard shortcut for the extension's save action via chrome://extensions > Keyboard shortcuts (at the bottom).
The previous web archive launched on HN is already dead [0]. Many of the comments from that discussion also apply here. Good luck and I hope you'll manage to stay online!
Unless you are on Pinboard's archiving plan, Pinboard mostly manages just your bookmarks. PageDash doesn't claim to be a bookmark manager, but it really can be one. Bookmark the page, along with the content.
Signed up, saved my first page, and viewed my dash within 5 mins. Good stuff. Now, all you need is not to go out of business (or to open-source it before you do). Seriously though, good luck on the business side.
I should point out that PageDash also tries to handle saving nested pages and iframes; I'm not sure other archivers try to do that.
Also, Web Components (custom elements, shadow DOM) support is definitely doable and something for the pipeline. It's not something even the Internet Archive is capable of right now; the Wayback Machine's youtube.com archive is blank.
Good question. PageDash aims to preserve the page in its original format and render it just as you saw it. Right now Evernote does quite a bad job at rendering; I've used it a lot. Pocket, on the other hand, specializes in stripping out the HTML and leaving just the content in a reader-mode fashion, though I've not tried their premium offering that also archives.
PageDash archives from the front end, while many archivers send the link to a backend that then fetches the website remotely, so you might not be archiving exactly what you saw (which admittedly often doesn't matter). The upside of this approach is that you can save content that you only see when you are logged in!
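The core of any front-end archiver is rewriting asset references in the captured DOM to stable local names, then fetching those assets inside the user's own session. A naive sketch (regex attribute matching, hypothetical names; a real implementation would walk the DOM and handle `srcset`, CSS `url()`, etc.):

```python
import hashlib
import posixpath
import re
from urllib.parse import urljoin, urlparse

def rewrite_assets(html, page_url):
    """Rewrite src/href values to stable local filenames.

    Returns the rewritten HTML plus a map of absolute URL -> local name.
    In a front-end archiver, the fetches then happen inside the user's own
    browser session, so login-gated assets come back exactly as seen.
    """
    assets = {}

    def to_local(match):
        attr, url = match.group(1), match.group(2)
        absolute = urljoin(page_url, url)  # resolve relative references
        ext = posixpath.splitext(urlparse(absolute).path)[1] or ".bin"
        # Hash of the absolute URL gives a stable, collision-resistant name.
        local = hashlib.sha256(absolute.encode()).hexdigest()[:16] + ext
        assets[absolute] = local
        return '%s="%s"' % (attr, local)

    rewritten = re.sub(r'\b(src|href)=["\']([^"\']+)["\']', to_local, html)
    return rewritten, assets

page = '<img src="/logo.png"><link rel="stylesheet" href="style.css">'
rewritten, assets = rewrite_assets(page, "https://example.com/post/1")
```

A backend archiver runs the same rewriting step, but its fetches carry no cookies, which is exactly why it fails on logged-in content.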
Anything FOSS in this sphere? I'm slowly working towards building my own solution to automatically archive my Firefox bookmarks locally, but a bit too slowly.
Have you considered saving files (such as fonts and JS libs) loaded through major CDNs centrally just once, instead of storing them again each time a page is saved?
Maybe you already have plans for this, but it would be smart to implement a system that checks whether files are already present on your server so you don't waste any of your user's quota and the server's disk space.
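The standard shape of this fix is content-addressed storage: key every blob by a hash of its bytes, so identical files are stored once no matter how many pages reference them. A minimal in-memory sketch (class name hypothetical):

```python
import hashlib

class AssetStore:
    """Content-addressed asset store: identical bytes are stored once.

    A jQuery bundle pulled from a CDN by a thousand saved pages occupies
    the space of one copy; each page references it by hash.
    """

    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> bytes
        self._refs = {}   # sha256 hex digest -> reference count

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:  # only the first upload pays for storage
            self._blobs[digest] = data
        # Reference counting lets you garbage-collect when pages are deleted.
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())

store = AssetStore()
first = store.put(b"function jquery() {}")
second = store.put(b"function jquery() {}")  # second page, same CDN file
```

Object stores make this easy in practice: use the digest as the object key and an upload becomes a no-op whenever the key already exists.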
Thanks for the comment! That would be ideal and it has crossed my mind, but I have given little thought to how to do de-duplication right (premature optimization, from a maker's perspective). Right now each page and its assets sit within their own "bucket". But yes, page assets and all these dependencies can really add up fast.
Thanks for the comment! Maybe I will make that possible in the future, but for now the advantage of this approach is that you can save logged-in content, i.e. content that you only see when you're logged in. Passing the URL to a backend prevents that, as the backend is not authenticated, or, even worse, blocked.
Alright folks, it's 3am where I am at the moment, gonna hit the sack. I'll address more questions and concerns tomorrow. Thank you for all your feedback!
vitovito | 8 years ago
ernsheong | 8 years ago
AdieuToLogic | 8 years ago
http://archive-access.sourceforge.net/warc/warc_file_format-...
As indicated by the US Library of Congress[0].
0 - https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...
jaytaylor | 8 years ago
Never heard of this until now, it sounds exceptionally pragmatic and good!
I also found these WARC tools made by the folks at Internet Archive, certainly interesting:
https://github.com/internetarchive/warctools
tomc1985 | 8 years ago
ANOTHER ----id ----ing cloud service trying to replace files and programs with some BS pricing scheme?
Seriously are the "entrepreneurs" of HN even trying? How pathetic that seemingly everything on this site is someone's jobs program?
ernsheong | 8 years ago
CJKinni | 8 years ago
gooseus | 8 years ago
Is the data stored exclusively on your Google Cloud or can I see my archived pages while offline and backup my web archive locally?
Essentially, what guarantees do I have regarding access to my data should your startup, my local infrastructure, or civilization collapse?
teddyh | 8 years ago
wongarsu | 8 years ago
CabSauce | 8 years ago
ernsheong | 8 years ago
rahiel | 8 years ago
[0]: https://news.ycombinator.com/item?id=14644441
j_s | 8 years ago
Wallabag: a self-hostable application for saving web pages | https://news.ycombinator.com/item?id=14686882 (July 2017: 166 points, 53 comments)
Accacin | 8 years ago
ernsheong | 8 years ago
adityar | 8 years ago
ernsheong | 8 years ago
ernsheong | 8 years ago
michaelmior | 8 years ago
ernsheong | 8 years ago
gkya | 8 years ago
ernsheong | 8 years ago
Here's a pretty comprehensive list that someone else made: https://news.ycombinator.com/item?id=14647119
Here's another FOSS option that I found: https://github.com/pirate/bookmark-archiver
nels | 8 years ago
ernsheong | 8 years ago
abainbridge | 8 years ago
ernsheong | 8 years ago
vpvp | 8 years ago
pwenzel | 8 years ago
It would be handy if I could just enter a URL and have it saved, a la Pinboard or Instapaper.
That said, this worked very well on my first try.
ernsheong | 8 years ago
ernsheong | 8 years ago
ff7c11 | 8 years ago
ernsheong | 8 years ago
1) One option is to provide PageDash with API access to your S3/GCP bucket so that it syncs your pages out to your bucket.
2) Another is an open-source viewer to view files saved within your bucket. It's just like serving a website, really; no further processing needed.
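For option 1, the sync step reduces to diffing the archive against the user's bucket by content hash. A sketch of that decision logic (pure function, object names hypothetical; the actual S3/GCS API calls would act on its output):

```python
def plan_sync(local_pages, remote_pages):
    """Diff the local archive against the user's bucket.

    Both arguments map object name -> content hash. Returns (to_upload,
    to_delete): names that are new or changed locally, and remote names
    that no longer exist locally.
    """
    to_upload = sorted(
        name for name, digest in local_pages.items()
        if remote_pages.get(name) != digest
    )
    to_delete = sorted(name for name in remote_pages if name not in local_pages)
    return to_upload, to_delete

local = {"page1/index.html": "aaa", "page2/index.html": "bbb"}
remote = {"page1/index.html": "aaa", "old/index.html": "zzz"}
```

Because saved pages are plain files, option 2 then needs nothing beyond static hosting: point a viewer at the bucket and serve the files as-is.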
tmlee | 8 years ago
Maarius | 8 years ago
ernsheong | 8 years ago
bobbyongce | 8 years ago
tevanraj | 8 years ago