Please consider producing archives in WARC format, and either donating captures of public pages to the Internet Archive (and other interested archives), or supporting ways for users to download their own archives in that format for them to donate them themselves and use in systems like Webrecorder.
(Note that a download of just page content and assets isn't enough; WARC stores headers, etc., also.)
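To make the difference concrete, here is a rough sketch (hand-rolled, stdlib only; function name and field choices are illustrative, not from any PageDash code) of what a minimal WARC/1.0 response record carries beyond the raw page bytes: the original HTTP status line and headers are part of the payload, wrapped in WARC metadata.

```python
import uuid
from datetime import datetime, timezone

def make_warc_response(target_uri, http_headers, body):
    """Build one minimal WARC/1.0 'response' record by hand.

    The payload is the full HTTP message (status line + headers + body),
    which is exactly what a plain "save the page content" download loses.
    """
    payload = http_headers + b"\r\n\r\n" + body
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record_headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + target_uri.encode(),
        b"WARC-Date: " + warc_date.encode(),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"Content-Type: application/http; msgtype=response",
        # Content-Length counts the whole HTTP message block, per the spec.
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    # A record is: headers, a blank line, the payload, then two blank lines.
    return b"\r\n".join(record_headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = make_warc_response(
    "http://example.com/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html",
    b"<html>hello</html>",
)
```

Real tooling (warcio, wget's `--warc-file`) handles request records, digests, and gzip per-record compression on top of this, but the shape above is the core of the format.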
Thanks for the comment. Admittedly I bypassed WARC completely, as I felt overwhelmed by its technicalities, in favor of how I knew the web worked. If I develop a better understanding of WARC, maybe that can be done, but I make no promises.
Founder here, happy to field questions and feedback!
Right now PageDash is quite a simple product, but hopefully with sufficient traction we can continue to implement things like full-text search, tagging support, link sharing, and mobile support. Your support is absolutely crucial to making PageDash come alive even more in the future.
This is my first product, thank you for being nice :)
Would it be possible to add auto-archiving to the extension?
On the $9/mo plan, I'd probably still not hit 100GB/mo of uploads.
My main reason for wanting this feature is that I could use the full text search (when it's available) to search every webpage I've visited. I find myself more and more frequently unable to find things I know existed at one point. I've been thinking of building my own solution where I just archive every page I visit on the fly then build a personal search for pages I've previously visited.
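The "search everything I've ever visited" idea reduces to an inverted index over saved page text. A toy sketch (class and URL names are made up for illustration):

```python
import re
from collections import defaultdict

class PageIndex:
    """Toy inverted index over saved pages: word -> set of page URLs."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, url, text):
        """Index every word of a saved page's extracted text."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._index[word].add(url)

    def search(self, query):
        """Return URLs containing every query word (AND semantics)."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        hits = set(self._index.get(words[0], set()))
        for word in words[1:]:
            hits &= self._index.get(word, set())
        return hits

idx = PageIndex()
idx.add("https://example.com/a", "Rust borrow checker explained")
idx.add("https://example.com/b", "A gentle intro to the borrow checker")
```

A production version would add ranking, stemming, and on-disk storage (SQLite FTS5 or similar gets you most of that for free), but the data structure is the same.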
I really like the idea of this and other, similar products/services. I haven't used any of them since they don't seem to be exactly what I want.
What I really want is auto-tagging and classification plus semantic search. I don't even really want to have to save the page. I want this functionality on my browsing history.
Maybe some increased functionality for saving specific types of pages. If I save a recipe, I want the service to recognize that it's a recipe and put it in my 'cookbook'. With a consistent format, if possible. If I save a blog post, tag the topic, technology and language used.
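One low-tech way to get part of this without any ML: many recipe and blog platforms already embed schema.org JSON-LD metadata that declares the page type outright. A hedged sketch (regex-based extraction, fine for a prototype but not a real HTML parser):

```python
import json
import re

def detect_page_type(html):
    """Return the schema.org @type declared in a page's JSON-LD, if any.

    Pages with <script type="application/ld+json"> blocks self-describe as
    "Recipe", "BlogPosting", etc., which is far more reliable than guessing
    from the visible text.
    """
    pattern = r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>'
    for match in re.finditer(pattern, html, re.DOTALL | re.IGNORECASE):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and isinstance(item.get("@type"), str):
                return item["@type"]
    return None

html = '''<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Shakshuka"}
</script></head><body>...</body></html>'''
```

Pages that declare `"@type": "Recipe"` could be routed straight into a 'cookbook' collection; ML classification would only be needed as a fallback for pages without structured data.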
I really want auto-tagging via ML classification as well; it was one of the things I wanted, besides one-click save, when I started the project. It's a nice-to-have at the moment and can only be achieved once PageDash matures more. Right now the closest workaround I can offer for your use case is to configure a keyboard shortcut for the extension's save action via chrome://extensions > Keyboard shortcuts (at the bottom).
The previous web archive launched on HN is already dead [0]. Many of the comments from that discussion also apply here. Good luck and I hope you'll manage to stay online!
Unless you are on Pinboard's archiving plan, Pinboard mostly manages just your bookmarks. PageDash doesn't claim to be a bookmark manager, but it really can be one. Bookmark the page, along with the content.
Signed up, saved my first page, and viewed my dash within 5 mins. Good stuff. Now, all you need is not to go out of business (or to open-source it before you do). Seriously though, good luck on the business side.
I should point out that PageDash also tries to handle saving nested pages and iframes; I'm not sure other archivers try to do that.
Also, Web Components (custom elements, shadow DOM) support is definitely doable and something for the pipeline. It's not something even the Internet Archive is capable of right now; the Wayback Machine's youtube.com archive is blank.
Good question. PageDash aims to preserve the page in its original format and render it just as you saw it. Right now Evernote does quite a bad job at rendering; I've used it a lot. Pocket, on the other hand, specializes in stripping out the HTML and leaving just the content in a reader-mode fashion, though I've not tried their premium offering that also archives.
PageDash archives from the front end, while many archivers send the link to a backend that then fetches the website remotely, so you might not be archiving exactly what you saw (which admittedly often doesn't matter). The upside of this approach is that you can save content that you only see when you are logged in!
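The core of any front-end archiver is rewriting asset references in the captured DOM to stable local names, then fetching those assets inside the user's own session. A naive sketch (regex attribute matching, hypothetical names; a real implementation would walk the DOM and handle `srcset`, CSS `url()`, etc.):

```python
import hashlib
import posixpath
import re
from urllib.parse import urljoin, urlparse

def rewrite_assets(html, page_url):
    """Rewrite src/href values to stable local filenames.

    Returns the rewritten HTML plus a map of absolute URL -> local name.
    In a front-end archiver, the fetches then happen inside the user's own
    browser session, so login-gated assets come back exactly as seen.
    """
    assets = {}

    def to_local(match):
        attr, url = match.group(1), match.group(2)
        absolute = urljoin(page_url, url)  # resolve relative references
        ext = posixpath.splitext(urlparse(absolute).path)[1] or ".bin"
        # Hash of the absolute URL gives a stable, collision-resistant name.
        local = hashlib.sha256(absolute.encode()).hexdigest()[:16] + ext
        assets[absolute] = local
        return '%s="%s"' % (attr, local)

    rewritten = re.sub(r'\b(src|href)=["\']([^"\']+)["\']', to_local, html)
    return rewritten, assets

page = '<img src="/logo.png"><link rel="stylesheet" href="style.css">'
rewritten, assets = rewrite_assets(page, "https://example.com/post/1")
```

A backend archiver runs the same rewriting step, but its fetches carry no cookies, which is exactly why it fails on logged-in content.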
Anything FOSS in this sphere? I'm slowly working towards building my own solution to automatically archive my Firefox bookmarks locally, but a bit too slowly.
Have you considered saving files (such as fonts and JS libs) loaded through major CDNs centrally just once, instead of storing them again each time a page is saved?
Maybe you already have plans for this, but it would be smart to implement a system that checks whether files are already present on your server so you don't waste any of your user's quota and the server's disk space.
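The standard shape of this fix is content-addressed storage: key every blob by a hash of its bytes, so identical files are stored once no matter how many pages reference them. A minimal in-memory sketch (class name hypothetical):

```python
import hashlib

class AssetStore:
    """Content-addressed asset store: identical bytes are stored once.

    A jQuery bundle pulled from a CDN by a thousand saved pages occupies
    the space of one copy; each page references it by hash.
    """

    def __init__(self):
        self._blobs = {}  # sha256 hex digest -> bytes
        self._refs = {}   # sha256 hex digest -> reference count

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:  # only the first upload pays for storage
            self._blobs[digest] = data
        # Reference counting lets you garbage-collect when pages are deleted.
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())

store = AssetStore()
first = store.put(b"function jquery() {}")
second = store.put(b"function jquery() {}")  # second page, same CDN file
```

Object stores make this easy in practice: use the digest as the object key and an upload becomes a no-op whenever the key already exists.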
Thanks for the comment! That would be ideal and it has crossed my mind, but I have given little thought to how to do de-duplication right (premature optimization, from a maker's perspective). Right now each page and its assets sit within their own "bucket". But yes, page assets and all these dependencies can really add up fast.
Thanks for the comment! Maybe I will make that possible in the future, but for now the advantage of this approach is that you can save logged-in content, i.e. content that you only see when you're logged in. Passing the URL to a backend prevents that, as the backend is not authenticated, or, even worse, blocked.
Alright folks, it's 3am where I am at the moment, gonna hit the sack. I'll address more questions and concerns tomorrow. Thank you for all your feedback!
vitovito | 8 years ago
ernsheong | 8 years ago
AdieuToLogic | 8 years ago
http://archive-access.sourceforge.net/warc/warc_file_format-...
As indicated by the US Library of Congress[0].
0 - https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...
jaytaylor | 8 years ago
Never heard of this until now, it sounds exceptionally pragmatic and good!
I also found these WARC tools made by the folks at Internet Archive, certainly interesting:
https://github.com/internetarchive/warctools
tomc1985 | 8 years ago
ANOTHER ----id ----ing cloud service trying to replace files and programs with some BS pricing scheme?
Seriously are the "entrepreneurs" of HN even trying? How pathetic that seemingly everything on this site is someone's jobs program?
ernsheong | 8 years ago
CJKinni | 8 years ago
gooseus | 8 years ago
Is the data stored exclusively on your Google Cloud or can I see my archived pages while offline and backup my web archive locally?
Essentially, what guarantees do I have regarding access to my data should your startup, my local infrastructure, or civilization collapse?
teddyh | 8 years ago
wongarsu | 8 years ago
CabSauce | 8 years ago
ernsheong | 8 years ago
rahiel | 8 years ago
[0]: https://news.ycombinator.com/item?id=14644441
j_s | 8 years ago
Wallabag: a self-hostable application for saving web pages | https://news.ycombinator.com/item?id=14686882 (July 2017: 166 points, 53 comments)
Accacin | 8 years ago
ernsheong | 8 years ago
adityar | 8 years ago
ernsheong | 8 years ago
ernsheong | 8 years ago
michaelmior | 8 years ago
ernsheong | 8 years ago
gkya | 8 years ago
ernsheong | 8 years ago
Here's a pretty comprehensive list that someone else made: https://news.ycombinator.com/item?id=14647119
Here's another FOSS option that I found: https://github.com/pirate/bookmark-archiver
nels | 8 years ago
ernsheong | 8 years ago
abainbridge | 8 years ago
ernsheong | 8 years ago
vpvp | 8 years ago
pwenzel | 8 years ago
It would be handy if I could just enter a URL and have it saved, a la Pinboard or Instapaper.
That said, this worked very well on my first try.
ernsheong | 8 years ago
ernsheong | 8 years ago
ff7c11 | 8 years ago
ernsheong | 8 years ago
1) One option is to provide PageDash with API access to your S3/GCP bucket so that it syncs your pages out to your bucket.
2) Another is an open-source viewer to view files saved within your bucket. It's just like serving a website, really; no further processing needed.
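For option 1, the sync step reduces to diffing the archive against the user's bucket by content hash. A sketch of that decision logic (pure function, object names hypothetical; the actual S3/GCS API calls would act on its output):

```python
def plan_sync(local_pages, remote_pages):
    """Diff the local archive against the user's bucket.

    Both arguments map object name -> content hash. Returns (to_upload,
    to_delete): names that are new or changed locally, and remote names
    that no longer exist locally.
    """
    to_upload = sorted(
        name for name, digest in local_pages.items()
        if remote_pages.get(name) != digest
    )
    to_delete = sorted(name for name in remote_pages if name not in local_pages)
    return to_upload, to_delete

local = {"page1/index.html": "aaa", "page2/index.html": "bbb"}
remote = {"page1/index.html": "aaa", "old/index.html": "zzz"}
```

Because saved pages are plain files, option 2 then needs nothing beyond static hosting: point a viewer at the bucket and serve the files as-is.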
tmlee | 8 years ago
Maarius | 8 years ago
ernsheong | 8 years ago
bobbyongce | 8 years ago
tevanraj | 8 years ago