In almost every database-backed application I've ever built, someone, at some point, inevitably asks for the ability to see what changes were made, when, and by whom.
My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.
I've run this for a few toy projects and it seems to work really well. I'm ready to try it with something production-scale the next time the opportunity presents itself.
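A minimal sketch of that scheduled-dump strategy, assuming SQLite, git on the PATH, and a hypothetical `users` table (the table name, paths, and committer identity are illustrative, not from the original setup):

```python
import csv, pathlib, sqlite3, subprocess, tempfile

def dump_table_to_git(db_path, table, repo_dir):
    """Write one table as CSV into a git repo and commit the snapshot."""
    repo = pathlib.Path(repo_dir)
    repo.mkdir(parents=True, exist_ok=True)
    if not (repo / ".git").exists():
        subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
    conn = sqlite3.connect(db_path)
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY 1")  # stable order -> clean diffs
    with open(repo / f"{table}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # No check=True here: git exits non-zero when the snapshot is unchanged.
    subprocess.run(["git", "-c", "user.name=snapshot",
                    "-c", "user.email=snapshot@example.invalid",
                    "commit", "-q", "-m", f"snapshot {table}"], cwd=repo)

# Tiny demo with a throwaway database:
workdir = tempfile.mkdtemp()
db = f"{workdir}/app.db"
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'sally')")
conn.commit(); conn.close()
dump_table_to_git(db, "users", f"{workdir}/repo")
```

Run it on a cron schedule and `git log -p users.csv` becomes your change history; the `ORDER BY 1` matters, since an unstable row order would turn every snapshot into a giant diff.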
> My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.
How does that solve the problem? All this would do is give you snapshots of the state of the DB at particular points in time.
The only way to know who changed what when is to keep track of those changes as they are made, and the best way to do that is to stick the changes in an eventlog table.
Completely agreed on the inevitability of that ask.
At the risk of extreme overkill in the other direction, something like Debezium [1] doing change monitoring and dumping into S3 might be a viable industrial-strength approach. I haven't used it in production, but I've been looking for an appropriate time to try it.
Businesses usually track this with audit tables, typically by setting up triggers that insert a row into the audit table any time a delete, update, or insert occurs on the original.
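A minimal sketch of that trigger-based audit table, using SQLite syntax (the `accounts` table and column names are hypothetical; Postgres and others would use a trigger function instead, and you'd add matching triggers for INSERT and DELETE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE accounts_audit (
    id INTEGER, old_balance INTEGER, new_balance INTEGER,
    op TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- The trigger fires on every UPDATE and records before/after values.
CREATE TRIGGER accounts_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO accounts_audit (id, old_balance, new_balance, op)
    VALUES (OLD.id, OLD.balance, NEW.balance, 'UPDATE');
END;
""")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 150 WHERE id = 1")
row = conn.execute(
    "SELECT old_balance, new_balance, op FROM accounts_audit").fetchone()
```

Unlike the git-snapshot approach, this captures every intermediate change, not just the state at dump time.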
Since this is being picked apart a bit in the comments, I should clarify that I don't think this is the BEST solution to this problem. It's the solution that gives you 90% of the value with 5% of the work.
I should write a "SQL as a NoSQL database" blog post, showing a trivial one-table schema that does everything you could ever want from a NoSQL database, especially if you choose to ignore that it still has SQL.
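For illustration, a hedged sketch of what such a one-table schema might look like, using SQLite's `json_extract` (the keys and document fields are made up; Postgres's `jsonb` operators give you the same trick):

```python
import json, sqlite3

conn = sqlite3.connect(":memory:")
# The whole "NoSQL" schema: one key column, one JSON document column.
conn.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, doc TEXT NOT NULL)")
conn.execute("INSERT INTO docs VALUES (?, ?)",
             ("user:1", json.dumps({"name": "ada", "role": "admin"})))
conn.execute("INSERT INTO docs VALUES (?, ?)",
             ("user:2", json.dumps({"name": "bob", "role": "viewer"})))

# Key lookup, like any KV store:
doc = json.loads(conn.execute(
    "SELECT doc FROM docs WHERE key = ?", ("user:1",)).fetchone()[0])

# ...but ad-hoc queries over document fields still work:
admins = conn.execute(
    "SELECT key FROM docs WHERE json_extract(doc, '$.role') = 'admin'").fetchall()
```

The "it still has SQL" punchline is the last query: a pure KV store would need a manually maintained secondary index to answer it.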
This may or may not be the case anymore[1], but for a long while Postgres managed to outperform MongoDB even when restricted to tables with just a single JSON column.
One of the larger relational DBs beating the OG NoSQL will always fill me with amusement.
1. Ed. looks like this is still the case, but please remember this is MongoDB specifically. AFAICT MongoDB is mostly a dead branch at this point and a comparison against Redis or something else that's had continued investment might be more fair. I'm just happy chilling in my SQL world - I'm not super up-to-date on the NoSQL market.
> The reason why it says stupid in the man-pages is that it makes no assumptions about what content you store in it.
That's not true. The assumption that you are storing text files is very much built in to the design of git, and specifically, into the design of its diff and merge algorithm. Those algorithms treat line breaks as privileged markers of the structure of the underlying data. This is the reason git does not play well with large binary files.
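Python's `difflib` illustrates the same line orientation: the diff below is only meaningful because newlines delimit records (the sample content is invented). A binary blob is effectively one long "line", so any change makes the whole thing differ, which is exactly the problem described above.

```python
import difflib

old = "name: ada\nrole: admin\n".splitlines(keepends=True)
new = "name: ada\nrole: viewer\n".splitlines(keepends=True)

# A line-based unified diff: only the changed line is reported,
# because line breaks are treated as record boundaries.
diff = list(difflib.unified_diff(old, new, lineterm=""))
```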
The diffing/merging stuff that you see when you use the git tool is really just some sugar on top of the underlying data which is being stored. If you look at what a commit actually contains, it's just a hash of the tree root, one or more parent commits identified by their hash, and some other metadata (author, date, etc). There's nothing about the commit that cares about the contents of its tree or its parents' trees. It's the git tooling on top of those structures which does.
In git world this distinction is referred to as the "porcelain" vs "plumbing", the plumbing being the underlying structures and porcelain being the stuff most people actually use (diffs, merges, rebases, etc...)
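A sketch of that plumbing at work: building a commit with `hash-object`, `update-index`, `write-tree`, and `commit-tree`, never touching the `git add`/`git commit` porcelain (the file name, message, and committer identity are arbitrary):

```python
import pathlib, subprocess, tempfile

def run(args, cwd):
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

repo = tempfile.mkdtemp()
run(["git", "init", "-q"], repo)
pathlib.Path(repo, "hello.txt").write_text("hello\n")

blob = run(["git", "hash-object", "-w", "hello.txt"], repo)   # store the blob
run(["git", "update-index", "--add", "hello.txt"], repo)      # stage it
tree = run(["git", "write-tree"], repo)                       # snapshot the index as a tree
commit = run(["git", "-c", "user.name=plumber",
              "-c", "user.email=plumber@example.invalid",
              "commit-tree", tree, "-m", "made by plumbing"], repo)
run(["git", "update-ref", "refs/heads/master", commit], repo) # point a branch at it
```

Note how little the commit itself knows: `commit-tree` just takes a tree hash, optional parents, and a message — everything diff-shaped lives in the porcelain.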
Tangentially, I started a re-implementation of git in Common Lisp[1], and have completed parsers for most of the file formats except delta-encoded objects.
Does anyone happen to know of an implementation or tests for delta encoding I could consult that is available under an MIT-like license? (BSD, Apache v2, etc.)
I like it a lot so far, and I think it could be used for "cloud" stuff, not just backups.
I'd like to see Debian/Docker/PyPI/npm repositories in git annex, etc.
It has lazy checkouts which fits that use case. By default you just sync the metadata with 'git annex sync', and then you can get content with 'git annex get FILE', or git annex sync --content.
Can anyone see a reason why not? Those kinds of repositories all seem to have weird custom protocols. I'd rather just sync the metadata and do the query/package resolution locally. It might be bigger that way, but you only have to fully sync it once, and the rest is incremental.
No lolz here! I actually need exactly this as I'll soon be starting work on a CMS-style editor for Git-based static sites, and it's all Ruby to boot. Awesome sauce!
I haven't used it myself, but based on their blog, Dolt looks like a nicer way to share tables of data in a git-like way, since you get branches, merges, and SQL support.
I don't know the name of this type of learning style, but I would eat up a tutorial on Git that started with the plumbing commands like this and worked its way up. I think this is oddly one of the clearest explanations of Git I've ever read. I know that wasn't really the point of the post, but still.
This doesn't start with the plumbing commands, but it's along those lines and seemed helpful to some new-to-git folks at work when we started switching to it last year: https://eagain.net/articles/git-for-computer-scientists/
There’s nothing like writing it yourself for understanding something. This tutorial goes through reimplementing the basics of git in a few lines of Python, it’s surprisingly simple and doable in an afternoon: https://wyag.thb.lt/
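The core of that reimplementation exercise fits in a few lines: a git object ID is just SHA-1 over a typed, length-prefixed header plus the content, and the on-disk loose-object form is the same bytes zlib-compressed:

```python
import hashlib, zlib

def git_blob_sha(data: bytes) -> str:
    """Compute the object ID git assigns to a blob: sha1("blob <len>\\0" + data)."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def git_blob_bytes(data: bytes) -> bytes:
    """The loose-object file contents: zlib-deflated header + data."""
    return zlib.compress(b"blob %d\x00" % len(data) + data)
```

`git_blob_sha(b"hello\n")` matches what `echo hello | git hash-object --stdin` prints, which is a satisfying first milestone when writing your own git.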
> You can query by key … and that’s about it. The only piece of good news here is that you can structure your data in folders in such a way that you can easily get content by prefix, but that’s about it. Any other query is off limits, unless you want to do a full recursive search. The only option here is to build indices specifically for querying. You can do this on a scheduled basis if staleness is of no concern or you can use git hooks to update indices as soon as a commit happens.
Isn't the point of a database the ability to query? Why else would you want "Git as a NoSql database?" If this is really what you're after, maybe you should be using Fossil, for which the repo is an sqlite database that you can query like any sqlite database.
How could you reimplement the poorest of a poor man's SQL on top of a key/value store? The idea I came up with is: whatever fields you want to query by need an index as a separate key. Maybe this would allow JOINs? Probably not?
The point of a database is to hold data, and there are many key/value databases for which you can only query by key unless you add a secondary index. (e.g. Bigtable, DynamoDB, Riak, Redis, etc).
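A toy version of the secondary-index idea from the comments above, using an in-memory dict as the stand-in key/value store (the `role` field and keys are illustrative): each queryable field value gets its own index entry mapping back to primary keys.

```python
class KV:
    """A dict-backed KV store with a manually maintained secondary index."""

    def __init__(self):
        self.data = {}    # primary: key -> record
        self.index = {}   # secondary: (field, value) -> set of keys

    def put(self, key, record, indexed_fields=("role",)):
        old = self.data.get(key, {})
        for f in indexed_fields:           # unlink stale index entries
            if f in old:
                self.index.get((f, old[f]), set()).discard(key)
        self.data[key] = record
        for f in indexed_fields:           # link the new values
            if f in record:
                self.index.setdefault((f, record[f]), set()).add(key)

    def query(self, field, value):
        return {k: self.data[k] for k in self.index.get((field, value), set())}
```

A JOIN would mean querying one index, then looking up each resulting key's record to follow a reference into another index — doable, but you are reimplementing the query planner by hand, which is rather the point.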
Fascinating article. Looks like it was posted to HN at least six times before today and got no traction. In my browser, the examples are sometimes a concatenation of the command and the output. Does anyone else see this?
How about StackExchangeQL? Write your data as comments on StackExchange questions. To update the data, reply to your comment. It's like NoSQL because somebody else is hosting it for you and there's no schema.
I've been very curious to explore this type of use case with askgit (https://github.com/augmentable-dev/askgit), which was designed for running simple "slice and dice" queries and aggregations on git history (and change stats) for basic analytical purposes. I'm curious how this could be applied to a small text+git based "db". Say, for regular JSON or CSV dumps.
You don't even need to write your key/value to a file. Git accepts empty commits and allows any message to be attached to them, so the content of your database can be the git commit message.
I wrote about that concept before and tried to implement it, but never took it to its full potential (too many side projects).
https://www.emadelsaid.com/git-as-messages-queue/
https://github.com/emad-elsaid/gitmq
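The commit-messages-as-data trick can be sketched like this (the queue-style messages and committer identity are invented for illustration):

```python
import subprocess, tempfile

def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args], check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git(repo, "init", "-q")

# "Write": each record is an empty commit whose message is the payload.
for msg in ["enqueue: job-1", "enqueue: job-2"]:
    git(repo, "-c", "user.name=mq", "-c", "user.email=mq@example.invalid",
        "commit", "-q", "--allow-empty", "-m", msg)

# "Read": the log, newest first, is the database.
messages = git(repo, "log", "--format=%s").splitlines()
```

You get ordering, timestamps, and authorship for free; what you give up is any way to read a record without walking the log.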
Isn't the problem being solved here already solved by event sourcing[0]? It lets you 're-create' your changes and is, ultimately, supposed to let you log each change in such a way that it's replayable backwards and forwards.
Kind of, but there’s more to it. “Event sourcing” is the pattern being implemented by git’s immutable object database, but there’s also all the hashing and reference management.
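To make that distinction concrete, a hedged sketch of an event log that adds git-style hash chaining on top of plain event sourcing (the event shapes are invented): replay gives you the event-sourcing half, while each entry's ID commits to its parent the way a git commit commits to its history.

```python
import hashlib, json

def append_event(log, event):
    """Append an event whose ID hashes over both the event and its parent's ID."""
    parent = log[-1]["id"] if log else None
    body = json.dumps({"parent": parent, "event": event}, sort_keys=True)
    log.append({"id": hashlib.sha256(body.encode()).hexdigest(),
                "parent": parent, "event": event})
    return log

def replay(log, state=None):
    """Fold events forward to rebuild current state -- the event-sourcing half."""
    state = dict(state or {})
    for entry in log:
        state.update(entry["event"])
    return state
```

Tampering with any earlier event changes its hash and breaks every later `parent` link, which is the extra guarantee git's object database layers on top of a plain event log.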
[+] [-] simonw|5 years ago|reply
[+] [-] lisper|5 years ago|reply
[+] [-] swiley|5 years ago|reply
[+] [-] acidbaseextract|5 years ago|reply
[1] https://debezium.io/
[+] [-] Aqueous|5 years ago|reply
[+] [-] sedeki|5 years ago|reply
I guess you can't easily plug this into a common log-rotation setup, as there might be several changes between rotations.
Also, I guess you'd need a more user-friendly interface to actually display who made each change from the git repo.
Anyway, interesting solution.
[+] [-] irrational|5 years ago|reply
Sally makes a change to column 1 of record 1.
Billy makes a change to column 2 of record 1 a nanosecond later.
You commit to Git again.
Your boss wants to know who changed column 1 of record 1.
You report it was Billy.
Billy is fired.
[+] [-] simonw|5 years ago|reply
[+] [-] vishnugupta|5 years ago|reply
[+] [-] gjvnq|5 years ago|reply
[+] [-] cryptonector|5 years ago|reply
[+] [-] munk-a|5 years ago|reply
[+] [-] devoutsalsa|5 years ago|reply
[+] [-] richardwhiuk|5 years ago|reply
[+] [-] wingspan|5 years ago|reply
[+] [-] dmingod666|5 years ago|reply
[+] [-] Lorin|5 years ago|reply
[+] [-] lisper|5 years ago|reply
[+] [-] mediocregopher|5 years ago|reply
[+] [-] fiddlerwoaroof|5 years ago|reply
[1]: https://github.com/fiddlerwoaroof/cl-git
[+] [-] chubot|5 years ago|reply
https://git-annex.branchable.com/
[+] [-] GordonS|5 years ago|reply
[+] [-] ioquatix|5 years ago|reply
[+] [-] jaredcwhite|5 years ago|reply
[+] [-] pokstad|5 years ago|reply
[+] [-] skybrian|5 years ago|reply
https://www.dolthub.com/
[+] [-] schaefer|5 years ago|reply
I’m not seeing a way to do anything similar with dolthub.
[+] [-] simon_acca|5 years ago|reply
Some pointers to related ideas:
https://en.m.wikipedia.org/wiki/Persistent_data_structure - https://docs.datomic.com/cloud/index.html - https://opencrux.com/ - https://github.com/attic-labs/noms - https://researcher.watson.ibm.com/researcher/files/us-leejin...
[+] [-] Spivak|5 years ago|reply
[+] [-] bacon_waffle|5 years ago|reply
If you're already comfortable with git, the Git Internals section of the book may be better: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
[+] [-] cryptonector|5 years ago|reply
[+] [-] bspammer|5 years ago|reply
[+] [-] chmaynard|5 years ago|reply
https://maryrosecook.com/blog/post/git-from-the-inside-out
[+] [-] finnthehuman|5 years ago|reply
[+] [-] weaksauce|5 years ago|reply
[+] [-] bachmeier|5 years ago|reply
[+] [-] MuffinFlavored|5 years ago|reply
I was thinking about this in terms of the new CloudFlare Durable Objects open beta: https://blog.cloudflare.com/durable-objects-open-beta/ Storing data is really only half the battle.
[+] [-] js2|5 years ago|reply
[+] [-] chmaynard|5 years ago|reply
[+] [-] escanor|5 years ago|reply
[+] [-] 0xbadcafebee|5 years ago|reply
[+] [-] zimbatm|5 years ago|reply
One of the main advantages of this approach is that replication becomes trivial: it's just a matter of running `git fetch` on all the repositories.
The main disadvantage is that NoteDb only has one implementation (Java), and no CLI to play with it.
[1]: https://gerrit-review.googlesource.com/Documentation/note-db...
[+] [-] zachmu|5 years ago|reply
https://github.com/dolthub/dolt
[+] [-] patrickdevivo|5 years ago|reply
This also reminds me of Dolt: https://github.com/dolthub/dolt which I believe has been on HN a couple times
[+] [-] blazeeboy|5 years ago|reply
[+] [-] rrdharan|5 years ago|reply
[+] [-] abathologist|5 years ago|reply
> A distributed database built on the same principles as Git (and which can use git as a backend)
[+] [-] no_wizard|5 years ago|reply
[0]: https://martinfowler.com/eaaDev/EventSourcing.html
[+] [-] hcarvalhoalves|5 years ago|reply
[+] [-] math-dev|5 years ago|reply