Show HN: Pocket Stream Archive – A personal Way-Back Machine

[+] vijucat|9 years ago|reply

I still use Firefox ScrapBook for this: https://addons.mozilla.org/en-US/firefox/addon/scrapbook/

Just for articles, mind you, not entire websites.

[+] cJ0th|9 years ago|reply

Ditto. I just hope the transition to WebExtensions will be smooth, or at least happen at all.

[+] jl6|9 years ago|reply

Screenshotting or PDFing of a website is an increasingly important archiving tool, to supplement wget. I've come across a lot of websites that won't render any content if not connected to a live server.

[+] nikisweeting|9 years ago|reply

I couldn't agree more. I wish more sites would load without needing multiple seconds of JS execution and AJAX. One of my TODOs is to get full-page screenshots working as well.

[+] driverdan|9 years ago|reply

PDFs are really not suitable for archiving websites since they're designed around pages and the web does not have pages.

A better option is to render a page with JS turned on and save the resulting HTML.
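
A minimal sketch of that approach, assuming a Chrome build that supports headless mode and its `--dump-dom` switch. The function names, the default binary name, and the timeout are made up for illustration:

```python
import shutil
import subprocess


def dump_dom_command(url, chrome_binary="google-chrome"):
    """Build the argv for a headless Chrome run that prints the rendered DOM."""
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def save_rendered_html(url, out_path, chrome_binary="google-chrome", timeout=60):
    """Run headless Chrome and write the post-JavaScript DOM to out_path."""
    if shutil.which(chrome_binary) is None:
        raise FileNotFoundError(f"{chrome_binary} not found on PATH")
    result = subprocess.run(
        dump_dom_command(url, chrome_binary),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise if Chrome exits with a non-zero status
    )
    with open(out_path, "wb") as f:
        f.write(result.stdout)
    return out_path
```

Calling `save_rendered_html("http://example.com", "example.html")` would write whatever DOM Chrome had built when it decided the page was loaded, so content that arrives late via AJAX may still be missing.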

[+] robzyb|9 years ago|reply

Wouldn't a copy of the DOM be even better than a screenshot?

I.e. DOM copy > screenshot > wget?

[+] jccalhoun|9 years ago|reply

Agreed. I research new media and archive.org is invaluable to me. I worry that current websites won't be preservable, much like the Flash sites and RealAudio streams of the past, which are largely gone.

[+] frik|9 years ago|reply

But what do you do when a website has a broken media query, essentially destroying the print layout? Then a PDF is useless.

Well, I took a screenshot, better than nothing.

[+] avian|9 years ago|reply

What version of Google Chrome do you need for the PDF export to work? I tried it on 58.0.3029.96 (Linux) and this does nothing (no error messages, it just quits without writing any files):

$ google-chrome --headless --disable-gpu --print-to-pdf 'http://example.com'

Edit: I'm completely baffled that such widely used software as Google Chrome can have this written in the man page: "Google Chrome has hundreds of undocumented command-line flags that are added and removed at the whim of the developers."

[+] nikisweeting|9 years ago|reply

59 or later; --headless is a brand new feature. apt-get install google-chrome-canary.

https://developers.google.com/web/updates/2017/04/headless-c...
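
A small sketch of a guard a script could run before trying the PDF export, assuming the version string looks like the output of `google-chrome --version`. The helper names are hypothetical; the threshold of 59 is the one stated above:

```python
import re


def chrome_major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 58.0.3029.96'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in {version_output!r}")
    return int(match.group(1))


def supports_headless(version_output, minimum_major=59):
    """Return True if the reported Chrome version is new enough for --headless."""
    return chrome_major_version(version_output) >= minimum_major
```

With this, `supports_headless("Google Chrome 58.0.3029.96")` is false, which matches the silent failure reported above.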

[+] edibleEnergy|9 years ago|reply

This is the only place I've found them parsed and documented: http://peter.sh/experiments/chromium-command-line-switches/

[+] toomuchtodo|9 years ago|reply

Highly recommend switching to wpull (https://github.com/chfoo/wpull), which was built as a wget replacement. It's what grab-site uses, which is a successor to ArchiveTeam's ArchiveBot.

"grab-site is made possible only because of wpull, written by Christopher Foo who spent a year making something much better than wget. ArchiveTeam's most pressing issue with wget at the time was that it kept the entire URL queue in memory instead of on disk. wpull has many other advantages over wget, including better link extraction and Python hooks."
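
To illustrate the memory-versus-disk point in that quote, here is a minimal sketch of a URL queue that lives in a SQLite file rather than in a Python list. This is not wpull's code; the class name and table layout are invented for the example:

```python
import sqlite3


class DiskURLQueue:
    """A FIFO URL queue stored in SQLite, so it can grow beyond available RAM."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "url TEXT UNIQUE NOT NULL, "
            "done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def add(self, url):
        """Queue a URL; the UNIQUE constraint silently drops duplicates."""
        self.db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
        self.db.commit()

    def pop(self):
        """Return the oldest unvisited URL and mark it done, or None if empty."""
        row = self.db.execute(
            "SELECT id, url FROM queue WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE queue SET done = 1 WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]
```

Because visited rows are kept and only flagged, the same table also serves as the "already seen" set, and a crawl can resume from the file after a crash.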

[+] nikisweeting|9 years ago|reply

This looks awesome, thanks for the suggestion! It'll help with WARC support as well; it looks like it can output WARCs with just a CLI flag.

[+] throw98987|9 years ago|reply

Use Zotero and you have your own personal Pocket with snapshots. In addition, you can add tags, organize stuff into folders, etc. https://www.zotero.org/

[+] nikisweeting|9 years ago|reply

Zotero is awesome! It doesn't provide a publishable stream of recently added articles though afaik.

[+] ticoombs|9 years ago|reply

I've been running wallabag[1] as my own Pocket instance. It's been running perfectly for a couple of years.

It also has a Pocket import feature.

[1] https://wallabag.org/en

[+] antman|9 years ago|reply

The demo does not have images. Maybe try

wget -nc -np -E -H -k -K -p -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' -e robots=off
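
For anyone wiring that into a script, here is a sketch that builds the same invocation as an argv list, with each flag annotated from the wget manual. The constant and function names are hypothetical, and the URL argument is an addition since the command above omits it:

```python
WGET_FLAGS = [
    "-nc",  # --no-clobber: skip files that already exist locally
    "-np",  # --no-parent: never ascend to the parent directory
    "-E",   # --adjust-extension: add .html/.css suffixes where needed
    "-H",   # --span-hosts: allow fetching from other hosts (CDNs, image servers)
    "-k",   # --convert-links: rewrite links so the copy works offline
    "-K",   # --backup-converted: keep a .orig copy of each rewritten file
    "-p",   # --page-requisites: fetch the images, CSS, and scripts a page needs
    "-e", "robots=off",  # execute a wgetrc command: ignore robots.txt
]

USER_AGENT = (
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) "
    "Gecko/20070802 SeaMonkey/1.1.4"
)


def wget_command(url, user_agent=USER_AGENT):
    """Return the argv list for archiving a single page with its requisites."""
    return ["wget", *WGET_FLAGS, "-U", user_agent, url]
```

The list can be passed straight to `subprocess.run(wget_command(url))`, which avoids shell-quoting problems with the user agent string.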

[+] nikisweeting|9 years ago|reply

I opted not to download images using wget. I figured if I needed in-article images the PDF+screenshot would be enough.

[+] unicornporn|9 years ago|reply

If it could connect to Pinboard.in as an alternative to Pocket, I would be screaming with joy. :-)

[+] 8ig8|9 years ago|reply

Pinboard offers an archive feature:

https://pinboard.in/upgrade/

[+] nikisweeting|9 years ago|reply

If you can get me a sample Pinboard export to look at, I'll whip up a regex that makes it work.

[+] joshstrange|9 years ago|reply

Shouldn't be too terribly difficult to modify, since all Pocket does is provide a list of URLs, same as Pinboard.

[+] snackai|9 years ago|reply

There is a PR with Pinboard JSON support now.
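
For reference, a rough sketch of what reading such an export could look like. It is not taken from the PR. It assumes the export is a JSON array of objects with `href`, `description`, `time` (ISO 8601, UTC), and space-separated `tags` fields; the output dict keys are invented for the example:

```python
import json
from datetime import datetime, timezone


def parse_pinboard_export(json_text):
    """Turn an assumed Pinboard JSON export into a list of link dicts."""
    links = []
    for entry in json.loads(json_text):
        added = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ")
        added = added.replace(tzinfo=timezone.utc)
        links.append({
            "url": entry["href"],
            # Fall back to the URL when a bookmark has no title.
            "title": entry.get("description") or entry["href"],
            # Unix timestamp as a string, usable as a per-link folder name.
            "timestamp": str(int(added.timestamp())),
            "tags": entry.get("tags", "").split(),
        })
    return links
```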

[+] djhworld|9 years ago|reply

This is cool.

Could you add an option to either add tagging, or separate the tagged items into folders?

e.g. "programming/", "docker/", etc. I often find myself digging through my Pocket archive for that one article I found 6 months ago, and it gets incredibly annoying.

[+] nikisweeting|9 years ago|reply

I like having the sites by timestamp because they're guaranteed to be unique, and it makes traversing them easy. I'd be happy to add a tag column to the index though, which you could use with Ctrl+F to find articles. https://github.com/pirate/pocket-archive-stream/issues/1

[+] anc84|9 years ago|reply

Now if only Chromium could learn to write WARC archives, then it would be on par! :)

Great project!

[+] rcarmo|9 years ago|reply

I've been thinking along those very same lines for a long time (this project makes me wish for more free time).

I have half a mind to fork this and add something like https://github.com/internetarchive/warcprox, or at the very least walk through the generated HTML and brute-force inline all assets as a first pass :)
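
A first pass at the brute-force inlining idea might look like the sketch below. It only handles `<img src>` with local files and uses a crude regex rather than a real HTML parser; the function name and the assumption that assets sit next to the saved HTML are hypothetical:

```python
import base64
import mimetypes
import re
from pathlib import Path

# Crude on purpose: matches the src attribute of <img> tags only.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local <img src="..."> references with base64 data URIs."""
    base = Path(base_dir)

    def replace(match):
        prefix, quote, src = match.groups()
        # Leave remote and already-inlined images untouched.
        if src.startswith(("http:", "https:", "data:", "//")):
            return match.group(0)
        asset = base / src
        if not asset.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(asset.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(asset.read_bytes()).decode("ascii")
        return f"{prefix}{quote}data:{mime};base64,{encoded}{quote}"

    return IMG_SRC.sub(replace, html)
```

Stylesheets, scripts, and fonts would need the same treatment, and base64 inflates every asset by about a third, so this trades size for a single self-contained file.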

[+] motdiem|9 years ago|reply

Can one automate extensions through headless Chrome? If so, you might be able to trigger WARCreate instead. (It would be more efficient to run the Pocket export URLs through WAIL, though; this should give you the WARCs you want.)

[+] frik|9 years ago|reply

Or EML/MHT. It's the format email programs use to store HTML mail, including all pictures, JS, CSS, etc., in one plain-text file. IE 9-11 also supports that format (File -> Save as...) but calls it MHT?

[+] arkenflame|9 years ago|reply

I wrote a Chrome extension that similarly saves copies of pages you bookmark: https://chrome.google.com/webstore/detail/backmark-back-up-t...

[+] fiatjaf|9 years ago|reply

You see something is flawed in Redux at the point you have to pass strings (uppercase constants defined somewhere) around, import them in every file, pass them as identifiers of what you should do with each data.

Strings!

[+] nikisweeting|9 years ago|reply

Did you comment on the wrong article by accident? https://news.ycombinator.com/item?id=14273549

[+] bicubic|9 years ago|reply

This seems neat, curious what are the use cases for this?

[+] nikisweeting|9 years ago|reply

Slowing down the inevitable tide of https://en.wikipedia.org/wiki/Link_rot. When I cite blog posts or want to share sites that have gone down, I can swap out the links for my archived versions.

[+] ents|9 years ago|reply

Would be cool to see this for Instapaper or Pinboard

[+] nikisweeting|9 years ago|reply

My script should work with very minimal tweaking if you can get a list of URLs + titles from those services.

Just one line of regex changes probably: https://github.com/pirate/pocket-archive-stream/blob/master/...
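
To give a sense of how small that change is, here is a sketch of the kind of regex involved. It is not the script's actual line. It assumes each link in the Pocket HTML export looks like `<a href="..." time_added="..." tags="...">title</a>`; the pattern and dict keys are illustrative:

```python
import re

# Assumed shape of one link in a Pocket HTML export.
POCKET_LINK = re.compile(
    r'<a href="(?P<url>[^"]+)" time_added="(?P<timestamp>\d+)" '
    r'tags="(?P<tags>[^"]*)">(?P<title>.*?)</a>',
    re.IGNORECASE,
)


def parse_pocket_export(html):
    """Extract url, title, timestamp, and tags from each exported link."""
    links = []
    for match in POCKET_LINK.finditer(html):
        links.append({
            "url": match.group("url"),
            "title": match.group("title") or match.group("url"),
            "timestamp": match.group("timestamp"),
            "tags": [t for t in match.group("tags").split(",") if t],
        })
    return links
```

Supporting another service would then mostly mean swapping `POCKET_LINK` for a pattern matching that service's export, as long as it yields the same four fields.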

[+] rcarmo|9 years ago|reply

This is great. All it needs is a Docker container and I'd be running it now (I need to set some time aside this weekend to do that).

[+] burnbabyburn|9 years ago|reply

This is really cool! I've always had in mind a project where you save every page you visit and somehow expose them later, so you know what you visited, maybe reminding you of important stuff based on some heuristic.

[+] anotheryou|9 years ago|reply

yes! thank you so much. Needed this badly.

[+] nikisweeting|9 years ago|reply

You're welcome! I've been wanting to build this for ages, but headless Chrome finally inspired me to actually finish it.

68 comments