TMSU is the only such system I found useful without being cumbersome. After years of trying to use Git Annex, it was refreshing that TMSU doesn't alter the files in any way, merely storing all the (meta)data out-of-band in a separate DB.
These days I use TMSU via my own Emacs-based UI almost every single day, so thank you for that!
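The out-of-band model being praised here can be sketched in a few lines. This is a toy illustration in the spirit of TMSU, not its real schema or query engine: the table names, column names, and helper functions below are all made up for the example. The point is simply that the files are never touched; all metadata lives in a separate SQLite database.

```python
# Toy sketch of out-of-band tagging in the spirit of TMSU: metadata lives
# in a separate SQLite database and the tagged files are never modified.
# (TMSU's real schema and query language are richer; names here are invented.)
import sqlite3

def open_catalogue(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS file (id INTEGER PRIMARY KEY, path TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS tag  (file_id INTEGER, name TEXT,
                                         UNIQUE (file_id, name));
    """)
    return db

def tag(db, path, *tags):
    """Record tags for a path -- purely in the catalogue, not on disk."""
    db.execute("INSERT OR IGNORE INTO file (path) VALUES (?)", (path,))
    (file_id,) = db.execute("SELECT id FROM file WHERE path = ?", (path,)).fetchone()
    db.executemany("INSERT OR IGNORE INTO tag VALUES (?, ?)",
                   [(file_id, t) for t in tags])

def files(db, *tags):
    """Paths carrying *all* of the given tags (an implicit 'and', as in TMSU)."""
    placeholders = ",".join("?" * len(tags))
    rows = db.execute(
        "SELECT f.path FROM file f JOIN tag t ON t.file_id = f.id "
        f"WHERE t.name IN ({placeholders}) "
        "GROUP BY f.id HAVING COUNT(DISTINCT t.name) = ?",
        (*tags, len(tags))).fetchall()
    return [r[0] for r in rows]
```

Deleting or moving the database loses only the catalogue, never the files, which is exactly the property the comment above values over Git Annex's symlink rewriting.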
Storing metadata out-of-band strikes me as key to any usable content management system within a realistically complex space.
Naming schemes and directory hierarchies have some limited application, but ultimately there will be data that simply can't be shoehorned into any such system, and an externally managed catalogue tying together disparate elements is required.
(Keeping that catalogue up to date and consistent is a whole 'nother issue.)
I do like the idea of a virtual filesystem in which path elements are effectively search dimensions, which leads to an interesting notion: that search is identity.
That is, a search will produce one of three possible result sets:
- Null, that is, no matches.
- Plural, that is, a list of matches.
- Unity, that is, one matching item.
In the last case, the search providing a single result is an identity of that result. (It may not be a stable identity over time, but it is at least for the present.)
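The null/plural/unity trichotomy is easy to make concrete. A minimal sketch, using an invented in-memory catalogue (a dict of item names to tag lists) rather than any real search backend:

```python
# "Search is identity": a query yields null (no match), unity (exactly one
# match, which serves as an identity for that item, at least for now), or
# plural (a candidate list that still needs narrowing).
def search(catalogue, *tags):
    """Every item whose tag set contains all of the given tags."""
    wanted = set(tags)
    return [item for item, item_tags in catalogue.items()
            if wanted <= set(item_tags)]

def classify(matches):
    if not matches:
        return ("null", None)
    if len(matches) == 1:
        return ("unity", matches[0])
    return ("plural", matches)
```

A unity result lets the query string itself act as a name for the item; a plural result is where the list-size considerations below come into play.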
Where a list is returned, the size of the list determines how usable it is, and how it is usable. Ten items can be quickly scanned to find the relevant item(s), if they exist. A hundred or a thousand items can often still be managed manually, though that typically takes some time. Somewhere between a hundred and a few thousand items, though, you're in the range where automated assessment or filtering becomes necessary.
Large libraries themselves typically have tens of thousands to millions of items. The largest book collections (Library of Congress, British Library) have roughly 150 million books (or equivalents). Other records may exist in greater numbers: periodicals, financial records, databases. Facebook has reported ~5 billion items posted daily for some years now. (I suspect most of those are trivial, but that still leaves a large number of potentially non-trivial items.) Surveillance and other large-scale data collection systems may be larger still.
> Storing metadata out-of-band strikes me as key to any usable content management system within a realistically complex space.
Yes and no. I was specifically comparing it to Git Annex, which is hard to categorize in these terms. It forces every file to become a symbolic link to the actual file living in `.git/annex/`, and then every query temporarily mutates the hierarchy of directories storing these symbolic links. I found the latter disruptive enough (in particular for the directory mtimes) that I actively avoided running any such queries. See: https://git-annex.branchable.com/tips/metadata_driven_views/
On the other hand my current setup involves TMSU queries which result in virtual Emacs directory editor (dired) views that don't affect anything else. I don't even use the FUSE functionality of TMSU.
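For anyone wanting to script against TMSU the same way, the query passed to `tmsu files` is just a string built from tag names, `tag=value` pairs, and `and`/`or`/`not` operators (see `tmsu help files` for the full syntax). A small hypothetical helper, where `tmsu_query` and `tmsu_files` are names I've made up for the sketch, and the latter assumes a `tmsu` binary on `PATH` with an initialised database:

```python
# Sketch of composing TMSU queries and handing them to the real CLI.
# tmsu_query() is pure string-building; tmsu_files() shells out and is
# only illustrative -- it needs tmsu installed and a tagged tree.
import subprocess

def tmsu_query(*terms, op="and"):
    """Join terms like 'video' or 'year=2023' with a TMSU boolean operator."""
    return f" {op} ".join(terms)

def tmsu_files(query):
    """Return the paths matching a TMSU query, one per output line."""
    out = subprocess.run(["tmsu", "files", query],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()
```

Feeding the resulting path list into a virtual dired buffer (or any other viewer) is then just presentation, with nothing on disk mutated.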
dredmorbius|2 years ago
vifon|2 years ago