$ bookmark add http://...
That will:
1. Download a static copy of the webpage as a single HTML file, along with an exported PDF copy, taking care to strip ads and unrelated content from what is stored.
2. Run something like http://smmry.com/ to create a summary of the page in a few sentences and store it.
3. Use NLP techniques to extract the principal keywords and use them as tags.
And another command like:
$ bookmark search "..."
That will:
* Not use regexps or complicated search patterns, but instead;
* Search in titles, tags, AND page content smartly and interactively, and;
* Sort/filter results smartly by relevance, number of matches, frecency, or anything else useful
Everything would be stored in a git repository or a simple file structure for easy synchronization; bonus points for browser integrations.
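The three `add` steps above can be sketched end-to-end. Everything below is a naive, hypothetical stand-in (a real tool would use readability-style extraction, an smmry.com-style summarizer, and a proper NLP library for keywords), just to make the idea concrete:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
             "it", "for", "on", "that", "this", "with"}

def strip_html(html: str) -> str:
    """Crude tag stripper; a real tool would do readability-style extraction."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def summarize(text: str, sentences: int = 2) -> str:
    """Naive stand-in for an smmry.com-style summarizer: first N sentences."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(parts[:sentences])

def extract_tags(text: str, n: int = 3) -> list[str]:
    """Naive keyword extraction: most frequent non-stopword terms."""
    words = [w.lower() for w in re.findall(r"[a-zA-Z]{3,}", text)]
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]
```

Fetching the URL, exporting the PDF, and persisting to the git repository would wrap around these three functions.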
I've been thinking along these lines; some other features I'd like:
- ability to have certain sites run site-specific extra processing, e.g. running youtube-dl on YouTube links
- ability to keep a list of sites to be archived periodically instead of only once, with the option to be notified when a site updates, even if it were run as a batch job
- ability to ingest a PDF or ebook, identify all the URLs it contains, snapshot each of those URLs, and present them as a list linking to the original, the cached version, and the location in the page
- it would also be nice if the data could be stored in a human-readable structure on a normal filesystem, so your ability to use the data isn't dependent on your ability to run the tool.
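The PDF/ebook ingestion idea could start from something as simple as scanning already-extracted text for URLs (assuming the text has been pulled out with a tool like pdftotext first; the regex is deliberately rough):

```python
import re

# Match http(s) URLs, stopping at whitespace and common delimiters.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def find_urls(text: str) -> list[str]:
    """Return unique URLs in order of first appearance, with trailing
    sentence punctuation stripped."""
    seen: dict[str, None] = {}
    for url in URL_RE.findall(text):
        seen.setdefault(url.rstrip(".,;"), None)
    return list(seen)
```

Each URL found this way would then be handed to the snapshotting step.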
Overall I think it is an interesting project but the commercial potential is limited.
EDIT: maybe the document processing and periodic checks would make more sense as a higher-level tool that depends on the bookmarking tool -- and the extra processing might also make more sense as a plugin-type architecture.
Except for the exported PDF, ad removal, and git, you've basically described Pinboard.in. I'm pretty sure it searches the content, not just the title, tags, and the comment you left. It saves a copy of the page, so the site disappearing doesn't mean you've lost the info. The archive isn't downloadable, but I don't know why I'd want to back up their backup anyway. It suggests tags/keywords (probably by harvesting the plethora of other people bookmarking the same things).
and it's got an API so you could make a command line client.
I'll build this. It sounds like a useful and fun project. I'll build it in Crystal so I can ship a single binary with no dependencies. SQLite will probably be enough for this project, so it'll ship with its own DB.
If you want other output formats, there's little you can do to improve over pandoc. That will generate ePub, .mobi, DJVU, PDF, PS, and a multitude of other formats, on the fly. HTML is a valid input for most of those.
The main problem isn't pandoc, but HTML -- the crap that passes for Web-compatible today is simply any asshat's bad idea. I'd find it highly useful to have something that looks at what's been downloaded and reduces it to a hugely simplified structure -- Markdown will almost always be sufficient.
I've found, in writing my own strippers and manually re-writing HTML, that body content rarely amounts to more than paragraphs, italic/emphasis, and anchors/hrefs. Better-written content has internal structure via headers. Bold itself is almost never used for body text within CMS systems; it's almost always a tell for advertising or promotional content.
The sad truth is that manually rewriting or re-tagging pages in Markdown is often the best option I've got for getting to something remotely reasonable. The good news is that it's actually a good tool for reading an article, even if you find, on reading it, that it's not worth keeping :)
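The reduction described here (keep only headers, paragraphs, emphasis, and links; drop everything else) is small enough to sketch with the standard-library parser. This is an illustrative sketch, not anyone's actual stripper:

```python
from html.parser import HTMLParser

class MarkdownReducer(HTMLParser):
    """Reduce HTML to the handful of structures body text actually uses:
    headers, paragraphs, emphasis, and links. Everything else is dropped."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None
        self.skip = 0  # depth inside script/style blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip -= 1
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(data)

def reduce_html(html: str) -> str:
    p = MarkdownReducer()
    p.feed(html)
    return "".join(p.out).strip()
```

A production version would also need to handle lists, blockquotes, and the boilerplate navigation that surrounds real article bodies.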
Hey-o. I'm one of the earliest users of del.icio.us and also Pinboard; I made tools similar to Buku for my Linux desktop and added a custom tabless web browser on top to make bookmarking as convenient as possible. It was a very productive setup, but I don't use it anymore: https://github.com/azer/delicious-surf
Storing bookmarks locally is definitely cool, but it's still not convenient enough for us to bookmark every page we find valuable. If I'm browsing 30 pages about my upcoming trip to Patagonia, I won't bookmark most of them, just because it's not convenient enough. If I google a solution to a problem for hours and go through tens of pages to find information, it's likely that I won't bookmark most of those pages.
You can keep bookmarks in Chrome, Safari, Pinboard, Firefox, whatever. But none of them are innovating on bookmarking, and they likely won't.
And this is exactly why I'm currently building Kozmos. It has a desktop and mobile client, and will bring a completely new perspective to bookmarking. You won't need to organize anything; it'll all be done automatically, and you'll easily find your stuff thanks to an advanced search engine. My goal is to bring good design and good tech together, and to give everyone the most convenient way to bookmark.
You can sign up for the private beta and get an invitation within a week; here is the link: http://getkozmos.com
I’ve started avoiding the whole pretence of “tagging” or “organisation” like I tried to do with Pinboard. My bookmarks are now Safari .webarchive files. I have access to them offline, I can back them up the same way I back everything else up, and I just organise them in folders however I want. I even get search!
A “bookmark manager” doesn’t need to be an app or a service -- it can just be a bunch of files.
Edit: I should say that Buku looks like a good program for those who like that way of things, though!
I hate changing command options, but if you're going to do this, do it early.
Please swap the definitions of '-s' (search any) and '-S' (search all).
Rationale: virtually every time I run a search, I'm interested in the most specific result, most especially when I have created the search space myself.
Having to hit the shift key for my default search preference is ... backward.
I know far too many online search tools which OR rather than AND arguments (probably because the underlying tools support OR more readily than AND searches) ... and ... this drives me flipping bananas. Because the more specific my search, the less specific the result.
It's the worst possible antifeature in a knowledge management tool.
I'd also suggest that the capability to distinctly search specific fields be specified:
* URL
* Title
* Tags
* Metadata (author, publisher, date).
A date-ranged search would be particularly useful.
Author of Buku here. In fact, Buku can store bookmarks directly from the browser; you have two ways to do that (including a dedicated plugin). It also has 5 different search options with a powerful prompt to find just about any bookmark you have stored (we have users who imported ~40K bookmarks from Delicious and are happy with Buku), extensive flexibility for editing and manipulation, encryption support, multithreaded full-DB refresh, and a lot more.
In addition, Buku is also developed as a library, and Shaarli can use it as a powerful Python backend over REST. ;)
Yes, one of our contributors did want to add a feature to generate a full webpage with thumbnails but we decided not to add it as it seemed simply ornamental when you think about the raw potential of Buku.
Instead, if I like a page I want to re-visit, I simply print it to PDF. Then, every week, I move all the PDFs from my Desktop into their own permanent storage location ... meaning that I have every interesting web page I've ever read since 2000.
Trouble is, now I have a large PDF collection to manage. I get along fine with "ls -l | grep <something>" this and "pdf2txt <blah.pdf> | grep <something>" that .. but of course, this is not as 'clean' as if I had a Bookmark Manager to do all my searching/grep'ing/grok'ing/etc.
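That grep workflow can be wrapped into a tiny index: extract each PDF's text once (e.g. with pdftotext or pdf2txt, as above) and cache it, then search the cached text. A sketch over an already-extracted mapping (the extraction step itself is assumed, not shown):

```python
def grep_corpus(extracted: dict[str, str], needle: str) -> list[str]:
    """Return filenames whose extracted text contains the needle,
    case-insensitively. `extracted` maps each PDF's filename to its text,
    e.g. the cached output of `pdf2txt <file>` so extraction runs only
    once per file instead of on every search."""
    needle = needle.lower()
    return sorted(name for name, text in extracted.items()
                  if needle in text.lower())
```

The win over piping `pdf2txt` into `grep` each time is that the expensive extraction is amortized across all future searches.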
I still use bookmarks (Pinboard) but I like your approach. I'm slowly trying to remove all reliances on third parties, as they're too ephemeral. I'm guessing you went with print-to-PDF because saving the page from the browser would result in broken pages? Have you found that the PDFs don't look very good for some websites?
Alternatively, you could use a tool or extension that does full-page screenshots and then run image optimisation on them. I do this a lot for a local Pinterest-type inspiration store. At the moment I use [Nimbus](https://chrome.google.com/webstore/detail/nimbus-screenshot-...) but it seems like every few months the extension I'm using starts to fail with certain websites (scrolljackers, mostly), and I switch to a different, newer extension.
Alternatively again, but back to saving websites, surely someone's created a nice tool that will download a page to store it as an archive that won't be broken? Pinboard for example has archiving for an extra fee, so I wonder how he does it at his scale.
Related to your problem grep'ing, I'm slowly working on a small idea to have a local tagging/metadata approach for finding things.
"I don't need this" doesn't mean someone else doesn't.
I would agree that bookmarks, generally, are a poor fit to current needs or requirements. When a typical hard drive was, say, 100 - 500 MB, the idea of only saving the URL, and not the content, could be argued. With mobile devices having 128 GB - 1 TB of microSD storage, there's no reason you cannot store everything you've read, or at least everything you're interested in, locally, on desktop, laptop, or mobile.
Which is what you've done.
But you're running into the underlying problem (as am I): a pile of randomly-titled, poorly-metadata'd PDFs isn't particularly useful.
There are tools -- Zotero and ... some others -- which manage references, but IMO do so quite poorly. The problem is that what they introduce is a metadata vetting and management problem, one that IMO GUI tools handle quite poorly. The fact that the tools aren't available on mobile (I use an Android tablet almost exclusively, because reasons, and yes, it sucks in a great many ways) makes this problem all the more intractable.
I have the same problem, it turns out, with bookreaders. I use two, mostly: PocketBook and FBReader, with ~2,000 or so references. Unfortunately, other than title and author search, I've little by way of organisation of these, which is ... a major problem.
I'm using Pocket, The Article Management Tool that Gets Worse The More You Use It[tm] (https://redd.it/5x2sfx), which ... suffers many of the same problems, and adds a few more of its own.
That sounds so obvious yet is brilliant... no more link-rot for any of those interesting sites you bookmarked years ago...
I suppose the next logical step would be to save the page with assets so it can be made available for posterity. Not sure what impact that would have on storage use but it would at least enable full text search.
Coming to personal preferences, I do. And that was one of the main reasons I wrote buku. And I don't store the context, just a pointer to the context. Just like you don't normally pass a full structure by value over the stack but use a pointer.
When I need the context I check the original link. If it's not there, I try Google Cache or archive.org. If it's lost, I find an alternative (thanks to the title and notes fields in Buku). That's more or less my workflow when it comes to the 8K-odd bookmarks I have.
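That fallback chain (original link, then Google Cache, then archive.org) can be generated mechanically. The URL patterns below are the commonly used public ones, but treat them as an assumption rather than a guaranteed API:

```python
from urllib.parse import quote

def fallback_urls(url: str) -> list[str]:
    """Candidate locations to try, in order, when a bookmarked page 404s."""
    return [
        url,  # the original link
        # Google's cache lookup; the URL goes in percent-encoded
        "https://webcache.googleusercontent.com/search?q=cache:" + quote(url, safe=""),
        # Wayback Machine's "latest snapshot" redirect form
        "https://web.archive.org/web/" + url,
    ]
```

A bookmark manager could try each of these in turn and store whichever one still resolves.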
I actually thought of creating a bookmark utility for the command line, because I have a lot of commands I use. I could create them as aliases, but I also want the ability to keep track of them, have a description of what they do, browse and grep them, change parameters, etc.
I'm still not sure this fits my needs or workflow, though that's more useful than the project link itself.
I'm tremendously interested in this or related tools, as I've got an exploding research problem that nothing I've seen yet comes close to addressing, and most of what I've seen introduces numerous additional problems. See:
https://ello.co/dredmorbius/post/fj5rzi8zmouyrmvg8yzzva
Short version: I've got a library of a few thousand articles, plus another few thousand books, plus another few thousand online references, which I've gathered, am continuing to gather, am trying to assess, prioritise reading of, and generate a number of outputs from, as well as use in what's likely to be a several-decades-long research and writing project.
Online services simply don't offer sufficient longevity, even should they meet my other requirements, which they don't.
Assigning metadata is a significant pain point. Coming to some agreement as to what metadata to assign is a significant pain point.
I'm coming to see librarians and library cataloguing as essential domain knowledge and experience. In all seriousness, I suggest any project looking to make use of categories and classification look to the US Library of Congress Classification System: it's extant, expert, unencumbered, comprehensive, hierarchical, extensible, has a change management process, and is applied to a store comprising 164 million works.
https://mammouth.cafe/@dredmorbius/56485 http://www.loc.gov/catdir/cpso/lcco/
There's also a top-level reduction to 21 distinct categories, and the possibility of, say, coming up with a short-list of frequently-used classifications, as well as of assigning multiple classifications to works.
The rationale for storing only bookmarks is ... generally not valid. There are a few types of online resources, generally:
1. Interactive or volatile pages.
2. Static pages.
For the first, search engines, web apps, landing pages, etc., storing a static instance isn't tremendously useful (though it can be more useful than you'd think). For the second, a locally-stored version is almost always more useful than the online instance.
And space for text is now beyond cheap.
I'm looking at this problem in terms of desired outputs, workflow, various states of resources, how to (reasonably) uniquely and persistently identify a given document, managing media (images, audio, video, other interactive elements), etc.
And yes, this starts to look rather much like Memex, for similar reasons.
apjana | 9 years ago
- Edit bookmarks in EDITOR at prompt
- Import folder names as tags from browser HTML
- Append, overwrite, delete tags at prompt using >>, >, << (familiar, eh? ;))
- Negative indices with `--print` (like `tail`)
- Update in EDITOR along with `--immutable`
- Request HTTP HEAD for immutable records
- Interface revamp (title on top in bold, colour changes...)
- Per-level colourful logs in colour mode
- Changes in program OPTIONS
- Lots of new automated test cases
- REST APIs for server-side apps
- Document, notify behaviour when not invoked from tty
- Fix Firefox tab-opening issues on Windows
Home: https://github.com/jarun/Buku
a3n | 9 years ago
https://github.com/jarun/Buku
subbz | 9 years ago
I recommend Shaarli: http://sebsauvage.net/wiki/doku.php?id=php:shaarli
a3n | 9 years ago
Really? Why? I would much rather my browsers were clients of a bookmarks server/API, so that what happens or is seen in one browser is exactly the same in all.
I use different email clients, and thank $GOD that they all consume IMAP instead of email storage "happening in the client."
js2 | 9 years ago
"Ask HN: Do you still use browser bookmarks?" (19 days ago, 451 comments):
https://news.ycombinator.com/item?id=14064096
hhandoko | 8 years ago
Coincidentally, `Buku` translates to `Book` in Indonesian...