top | item 37511227

vifon | 2 years ago

> Storing metadata out-of-band strikes me as key to any usable content management system within a realistically complex space.

Yes and no. I was specifically comparing it to Git Annex, which is hard to categorize in these terms. It forces every file to become a symbolic link to the actual content living in `.git/annex/`, and every query then temporarily mutates the hierarchy of directories storing these symbolic links. I found the latter disruptive enough (in particular for the directory mtimes) that I actively avoided doing any such queries. See: https://git-annex.branchable.com/tips/metadata_driven_views/

On the other hand, my current setup involves TMSU queries that produce virtual Emacs directory editor (dired) views which don't affect anything else. I don't even use TMSU's FUSE functionality.


dredmorbius | 2 years ago

The situation is one of compromise.

The core problem is that there are formats and storage modes which don't readily allow modifying the item itself. Editing PDFs is already a pain; applying metadata to an entry in a database, a wiki, or a third-party website isn't possible at all.

The remaining options seem to me to be analogues of practices with physical books.

It's possible to "rebind" an item and include bibliographic information in the equivalent of the fly-leaves of that work, much as a library may rebind a book and apply a label with inventory number(s), call number(s), and/or bibliographic data to it. Since physical objects are inherently modifiable and enclosable, this makes sense. The digital analogues vary, but are at least theoretically available, e.g., enclosing a work in an archive format which includes metadata. See the WARC (Web ARChive) file format for example: <https://en.wikipedia.org/wiki/WARC_(file_format)>. Epub files and software packaging formats such as RPM and Debian's .deb are other examples of standardised file structures which encapsulate others.
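To make the "rebind into a wrapper format" idea concrete, here's a toy sketch in Python. It isn't WARC or EPUB, but it uses the same trick EPUB does (EPUB is itself a zip container): the work and a metadata file travel together in one archive. All names here (`wrap_with_metadata`, `metadata.json`) are my own invention, not any standard:

```python
import io
import json
import zipfile

def wrap_with_metadata(payload: bytes, filename: str, metadata: dict) -> bytes:
    """Bundle a work and its bibliographic metadata into a single zip container."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, payload)                      # the work itself, unmodified
        zf.writestr("metadata.json", json.dumps(metadata))  # out-of-band data carried in-band
    return buf.getvalue()

def read_metadata(container: bytes) -> dict:
    """Recover the metadata without touching the enclosed work."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        return json.loads(zf.read("metadata.json"))
```

The point is that the original item (a PDF, say) is never edited; the wrapper is what carries the catalogue data, and unwrapping gives the pristine work back.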

The other practice of a library is to abstract out the metadata to a catalogue, effectively a metadata index.

In a physical library or archive, this involves a cataloguing process as part of item acquisition workflows. The metadata for many traditionally-published works is already centralised and can be obtained from specific organisations: the US Library of Congress, the British Library, the OCLC (originally the Ohio College Library Center, which has both its own item identifier and manages the Dewey Decimal Classification), the International ISBN Agency or one of its national affiliates (e.g., Bowker in the US), or the International DOI Foundation (for DOI assignments: digital object identifiers, used extensively in academic journals). Circulation of items is managed through a circulation desk, which handles external lending as well as tracking and reshelving books used within the library itself but not borrowed externally.

For digital media, the equivalents would be either some sort of management system, which would require an application-specific interface, or a filesystem which incorporates not only metadata but workflow management. I'm leaning toward the latter concept as more universal, though that also raises the question of how to deal with workflows in which contents leave or enter that filesystem context itself.
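A rough sketch of the "management system" half of that compromise, to show what I mean by pairing works with metadata and workflow status at ingest time. Everything here — class names, status values, the in-memory index — is hypothetical illustration, not any existing tool:

```python
from dataclasses import dataclass, field

# Hypothetical process-status values for a cataloguing workflow
UNCATALOGUED, DEFAULT, ENRICHED = "uncatalogued", "default", "enriched"

@dataclass
class Record:
    path: str
    metadata: dict = field(default_factory=dict)
    status: str = UNCATALOGUED

class Catalogue:
    """Toy metadata index: the pairing of works with metadata is automated at ingest."""

    def __init__(self):
        self.records = {}  # path -> Record

    def ingest(self, path, **default_meta):
        # Creating an entry also creates its metadata record, and the
        # catalogue knows whether anything beyond defaults was supplied.
        status = DEFAULT if default_meta else UNCATALOGUED
        self.records[path] = Record(path, dict(default_meta), status)

    def enrich(self, path, **meta):
        # e.g., from an external look-up or content-based heuristics
        rec = self.records[path]
        rec.metadata.update(meta)
        rec.status = ENRICHED

    def query(self, **criteria):
        # Non-destructive "view": returns matches without mutating any hierarchy
        return [r.path for r in self.records.values()
                if all(r.metadata.get(k) == v for k, v in criteria.items())]
```

A real implementation would live in (or under) the filesystem rather than in memory, but the shape is the same: ingest, enrich, and query as first-class operations with status tracked throughout.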

But with a filesystem, you have a number of additional possibilities available:

- The pairing of works with metadata is automated.

- Workflows can be integrated into filesystem actions. The act of creating a file would also create the file metadata (to a greater extent than present inode entries do).

- Additionally, introducing the notion of process status means that the filesystem itself could distinguish between works which have no or only default cataloguing data applied and those which have had additional metadata added (say, from an external look-up or from automated heuristics based on file contents).

- Renames and deletions are now managed through the filesystem itself rather than third-party tools.

- I'd like to see both versioning (changes to a given file) and relationships (source, derived, referenced, and referencing works) tracked as well.

- Different forms of a work could be tracked together: the markup-language source and generated output formats (PDF, PS, ePub, HTML, plain text, etc.) of a text; translations; audio formats; different performances of a work; optical scans of printed material; photographs of visual or plastic arts (sculpture) or architecture. See FRBR (Functional Requirements for Bibliographic Records) and its Work, Expression, Manifestation, and Item distinctions: <https://en.wikipedia.org/wiki/Functional_Requirements_for_Bi...>.

- Ideally, some sort of highly-invariant fingerprinting, such that different versions of a work can be identified and matched despite different formats or slight modifications (intentional differences between editions, translations, errors or damage introduced over time). Traditional whole-work hashes fail to offer this, though segmented and normalised hashes or vectors might.
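That last point — segmented, normalised hashing — can be sketched in a few lines. This is one possible scheme (word-shingle hashing with Jaccard similarity, parameters like the shingle size are arbitrary), not a proposal for *the* fingerprint:

```python
import hashlib
import re

def fingerprint(text: str, k: int = 8) -> set:
    """Normalised, segmented fingerprint: hash overlapping k-word windows (shingles)."""
    # Normalisation: lowercase, strip punctuation and layout differences
    words = re.findall(r"[a-z0-9]+", text.lower())
    shingles = (" ".join(words[i:i + k])
                for i in range(max(1, len(words) - k + 1)))
    return {hashlib.sha256(s.encode()).hexdigest()[:16] for s in shingles}

def similarity(a: set, b: set) -> float:
    """Jaccard similarity of two fingerprints; 1.0 means identical after normalisation."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

A one-word change perturbs only the few shingles that overlap it, so two editions of the same text score high even though their whole-work hashes differ completely — which is exactly the property a whole-file SHA-256 can't give you.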

Again, the largest problem with a filesystem approach is in imports and exports from that filesystem-based archive. That would probably best be achieved through wrapper formats and applications or servers for other organisations or the general public.