top | item 26703808

Git as a NoSql Database (2016)

143 points | EntICOnc | 5 years ago | kenneth-truyers.net | reply

135 comments

[+] simonw|5 years ago|reply
In almost every database-backed application I've ever built someone, at some point, inevitably asks for the ability to see what changes were made, when, and by whom.

My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.

I've run this for a few toy projects and it seems to work really well. I'm ready to try it with something production-scale the next time the opportunity presents itself.
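For illustration, here's a minimal sketch of that scheduled dump (the table and column names are hypothetical). The one detail that matters is ordering rows and columns deterministically, so successive snapshots produce small, readable git diffs:

```python
import csv
import sqlite3

def dump_table(conn, table, out_path):
    """Dump an entire table to a CSV file with a stable row order,
    so that successive dumps produce minimal git diffs."""
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([d[0] for d in cur.description])  # header row
        writer.writerows(cur)

# A scheduled job (cron, CI, etc.) would then run something like:
#   python dump.py && cd snapshots && git add -A && git commit -m "snapshot"
```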

[+] lisper|5 years ago|reply
> My current preferred strategy for dealing with this (at least for any table smaller than a few GBs) is to dump the entire table contents to a git repository on a schedule.

How does that solve the problem? All this would do is give you snapshots of the state of the DB at particular points in time.

The only way to know who changed what when is to keep track of those changes as they are made, and the best way to do that is to stick the changes in an eventlog table.
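A minimal sketch of that eventlog approach (table and column names are made up). The point is that the data update and the log insert share one transaction, so the log can never drift from the data:

```python
import json
import sqlite3
import time

def change_email(conn, user_id, new_email, actor):
    """Apply a change and record who/what/when in an eventlog table,
    inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        old = conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE users SET email = ? WHERE id = ?", (new_email, user_id)
        )
        conn.execute(
            "INSERT INTO eventlog (ts, actor, detail) VALUES (?, ?, ?)",
            (time.time(), actor,
             json.dumps({"table": "users", "id": user_id,
                         "column": "email", "old": old, "new": new_email})),
        )
```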

[+] swiley|5 years ago|reply
ITT HN collectively learns what an "audit log" is.
[+] acidbaseextract|5 years ago|reply
Completely agreed on the inevitability of that ask.

At the risk of extreme overkill the other way, something like Debezium [1] doing change monitoring and dumping into S3 might be a viable industrial-strength approach. I haven't used it in production, but I've been looking for an appropriate time to try it.

[1] https://debezium.io/

[+] Aqueous|5 years ago|reply
Usually businesses use auditing tables to track this, typically by setting up triggers to insert a row into the auditing table any time a delete, update or insert occurs on the original table.
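As a sketch of that trigger-based pattern (SQLite syntax via Python's sqlite3; the `accounts` table is hypothetical), one trigger per operation copies the affected row into the audit table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE accounts_audit (
    ts      TEXT DEFAULT CURRENT_TIMESTAMP,
    op      TEXT,     -- 'I', 'U' or 'D'
    id      INTEGER,
    old_bal INTEGER,
    new_bal INTEGER
);

CREATE TRIGGER accounts_ai AFTER INSERT ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, old_bal, new_bal)
    VALUES ('I', NEW.id, NULL, NEW.balance);
END;
CREATE TRIGGER accounts_au AFTER UPDATE ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, old_bal, new_bal)
    VALUES ('U', NEW.id, OLD.balance, NEW.balance);
END;
CREATE TRIGGER accounts_ad AFTER DELETE ON accounts BEGIN
    INSERT INTO accounts_audit (op, id, old_bal, new_bal)
    VALUES ('D', OLD.id, OLD.balance, NULL);
END;
""")

# Ordinary writes now leave an audit trail automatically.
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 50 WHERE id = 1")
```

(A real setup would also record the acting user, which usually means threading an application-level identity into the session, since the DB only sees its own connection user.)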
[+] sedeki|5 years ago|reply
So each commit is exactly one change to a relevant table?

I guess you can't just plug this into a common system of rotating logs easily, as there might be several changes between the log rotations.

Also, I guess you'd need a more user-friendly interface to actually display who made the change from the git repo.

Anyway, interesting solution.

[+] irrational|5 years ago|reply
So you commit to Git.

Sally makes a change to column 1 of record 1.

Billy makes a change to column 2 of record 1 a nanosecond later.

You commit to Git again.

Your boss wants to know who changed column 1 of record 1.

You report it was Billy.

Billy is fired.

[+] simonw|5 years ago|reply
Since this is being picked apart a bit in the comments, I should clarify that I don't think this is the BEST solution to this problem. It's the solution that gives you 90% of the value with 5% of the work.
[+] vishnugupta|5 years ago|reply
A running joke at Uber is that every single configuration management system will end up re-implementing a half-assed version of Git.
[+] gjvnq|5 years ago|reply
Why not have a history table and use SQL triggers to insert the new data as JSON every time someone changes something?
[+] cryptonector|5 years ago|reply
I should write a SQL as a NoSQL database blog post, showing a trivial one-table schema that does all you could ever want in a NoSQL, especially after choosing to ignore that it still has SQL.
[+] munk-a|5 years ago|reply
This may or may not be the case anymore[1] - but for a long while Postgres managed to outperform MongoDB when restricted to just using single-JSON-column tables.

One of the larger relational DBs beating the OG NoSQL will always fill me with amusement.

1. Ed. looks like this is still the case, but please remember this is MongoDB specifically. AFAICT MongoDB is mostly a dead branch at this point and a comparison against Redis or something else that's had continued investment might be more fair. I'm just happy chilling in my SQL world - I'm not super up-to-date on the NoSQL market.

[+] devoutsalsa|5 years ago|reply
Someone wrote a SQL wrapper around InnoDB & called it MySQL XD
[+] richardwhiuk|5 years ago|reply
sqlite and json1 gets you pretty close to emulating a more powerful nosql database.
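A rough sketch of that idea, using the json1 functions available through Python's bundled sqlite3 (the document shape and field names are made up). An expression index makes "query by field" cheap, not just "query by key":

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT)")  # body holds JSON
# Index a field inside the JSON so lookups on it don't scan the table.
conn.execute("CREATE INDEX docs_city ON docs (json_extract(body, '$.city'))")

def put(doc_id, doc):
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
                 (doc_id, json.dumps(doc)))

put("u1", {"name": "Ada", "city": "London"})
put("u2", {"name": "Grace", "city": "Arlington"})

rows = conn.execute(
    "SELECT id FROM docs WHERE json_extract(body, '$.city') = ?", ("London",)
).fetchall()
```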
[+] Lorin|5 years ago|reply
check out RedBeanPHP - it does some interesting stuff for schema liquidity
[+] lisper|5 years ago|reply
> The reason why it says stupid in the man-pages is that it makes no assumptions about what content you store in it.

That's not true. The assumption that you are storing text files is very much built into the design of git, and specifically into the design of its diff and merge algorithms. Those algorithms treat line breaks as privileged markers of the structure of the underlying data. This is the reason git does not play well with large binary files.

[+] mediocregopher|5 years ago|reply
The diffing/merging stuff that you see when you use the git tool is really just some sugar on top of the underlying data which is being stored. If you look at what a commit actually contains, it's just a hash of the tree root, one or more parent commits identified by their hash, and some other metadata (author, date, etc). There's nothing about the commit that cares about the contents of its tree or its parents' trees. It's the git tooling on top of those structures which does.

In git world this distinction is referred to as the "porcelain" vs "plumbing", the plumbing being the underlying structures and porcelain being the stuff most people actually use (diffs, merges, rebases, etc...)
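That content-only hashing is easy to see at the plumbing level: a blob's object id is just the SHA-1 of a small header plus the raw bytes, with no notion of lines or diffs. A sketch:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object id git's plumbing assigns to a blob:
    SHA-1 over a 'blob <size>\\0' header followed by the raw bytes.
    The hash covers content only -- no diffs, no line structure."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo hello | git hash-object --stdin`:
print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```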

[+] fiddlerwoaroof|5 years ago|reply
Tangentially, I started a re-implementation of git in Common Lisp[1], and have completed parsers for most of the file formats except delta-encoded objects.

Does anyone happen to know of an implementation or tests for delta encoding I could consult that is available under an MIT-like license? (BSD, Apache v2, etc.)

[1]: https://github.com/fiddlerwoaroof/cl-git

[+] chubot|5 years ago|reply
I just started using git annex, which is an extension for large files.

https://git-annex.branchable.com/

I like it a lot so far, and I think it could be used for "cloud" stuff, not just backups.

I'd like to see Debian/Docker/PyPI/npm repositories in git annex, etc.

It has lazy checkouts which fits that use case. By default you just sync the metadata with 'git annex sync', and then you can get content with 'git annex get FILE', or git annex sync --content.

Can anyone see a reason why not? Those kinds of repositories all seem to have weird custom protocols. I'd rather just sync the metadata and do the query/package resolution locally. It might be bigger that way, but you only have to fully sync it once, and the rest is incremental.

[+] GordonS|5 years ago|reply
I'm a little confused about what git annex is - I think it's perhaps a basic file sync utility that uses git behind the scenes?
[+] ioquatix|5 years ago|reply
[+] jaredcwhite|5 years ago|reply
No lolz here! I actually need exactly this as I'll soon be starting work on a CMS-style editor for Git-based static sites, and it's all Ruby to boot. Awesome sauce!
[+] pokstad|5 years ago|reply
Looks like it is inspired by CouchDB
[+] skybrian|5 years ago|reply
I haven't used it myself, but based on their blog, Dolt looks like a nicer way to share tables of data in a git-like way, since you get both branches and merges and SQL support.

https://www.dolthub.com/

[+] schaefer|5 years ago|reply
A big part of the magic of git is that I can run it locally. And sync up with my team much later.

I’m not seeing a way to do anything similar with dolthub.

[+] Spivak|5 years ago|reply
I don't know the name of this type of learning style, but I would eat up a tutorial on Git that started with the plumbing commands like this and worked its way up. I think this is oddly one of the clearest explanations of Git I've ever read. I know that wasn't really the point of the post but still.
[+] bspammer|5 years ago|reply
There’s nothing like writing it yourself for understanding something. This tutorial goes through reimplementing the basics of git in a few lines of Python, it’s surprisingly simple and doable in an afternoon: https://wyag.thb.lt/
[+] bachmeier|5 years ago|reply
> You can query by key … and that’s about it. The only piece of good news here is that you can structure your data in folders in such a way that you can easily get content by prefix, but that’s about it. Any other query is off limits, unless you want to do a full recursive search. The only option here is to build indices specifically for querying. You can do this on a scheduled basis if staleness is of no concern or you can use git hooks to update indices as soon as a commit happens.

Isn't the point of a database the ability to query? Why else would you want "Git as a NoSql database?" If this is really what you're after, maybe you should be using Fossil, for which the repo is an sqlite database that you can query like any sqlite database.

[+] MuffinFlavored|5 years ago|reply
> Isn't the point of a database the ability to query?

I was thinking about this in terms of the new CloudFlare Durable Objects open beta: https://blog.cloudflare.com/durable-objects-open-beta/ Storing data is really only half the battle.

How could you reimplement the poorest of a poor man's SQL on top of a key/value store? The idea I came up with is: whatever fields you want to query by need an index as a separate key. Maybe this would allow JOINs? Probably not?
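One way to sketch that separate-key index idea (a toy in-memory store; a real KV store would also need atomic multi-key writes, and index cleanup when a record is updated):

```python
class KV:
    """Toy key/value store with manually maintained secondary indexes:
    each indexed field gets its own mapping of value -> set of primary keys."""

    def __init__(self, indexed_fields):
        self.data = {}
        self.index = {f: {} for f in indexed_fields}

    def put(self, key, record):
        self.data[key] = record
        for field, idx in self.index.items():
            idx.setdefault(record.get(field), set()).add(key)

    def find(self, field, value):
        """Query by an indexed field without scanning every record."""
        return {k: self.data[k] for k in self.index[field].get(value, set())}
```

A hand-rolled nested-loop join over two `find()` results would be possible, but at that point you're rebuilding a query planner one piece at a time.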

[+] js2|5 years ago|reply
The point of a database is to hold data, and there are many key/value databases for which you can only query by key unless you add a secondary index. (e.g. Bigtable, DynamoDB, Riak, Redis, etc).
[+] chmaynard|5 years ago|reply
Fascinating article. Looks like it was posted to HN at least six times before today and got no traction. In my browser, the examples are sometimes a concatenation of the command and the output. Does anyone else see this?
[+] escanor|5 years ago|reply
yes, i see this as well. at first it was confusing, but later i understood it was the output.
[+] 0xbadcafebee|5 years ago|reply
How about StackExchangeQL? Write your data as comments on StackExchange questions. To update the data, reply to your comment. It's like NoSQL because somebody else is hosting it for you and there's no schema.
[+] zimbatm|5 years ago|reply
Gerrit is using Git as a NoSql database. All the users, projects, reviews, ... are stored in Git repositories[1].

One of the main advantages of this approach is that replication becomes trivial: it's just a matter of running `git fetch` on all the repositories.

The main disadvantage is that NoteDB only has one implementation (Java), and no CLI to play with it.

[1]: https://gerrit-review.googlesource.com/Documentation/note-db...

[+] patrickdevivo|5 years ago|reply
I've been very curious to explore this type of use case with askgit (https://github.com/augmentable-dev/askgit), which was designed for running simple "slice and dice" queries and aggregations on git history (and change stats) for basic analytical purposes. I've been curious about how this could be applied to a small text+git based "db". Say, for regular JSON or CSV dumps.

This also reminds me of Dolt: https://github.com/dolthub/dolt which I believe has been on HN a couple times

[+] abathologist|5 years ago|reply
Related https://irmin.org/ (and in development since 2013)

> A distributed database built on the same principles as Git (and which can use git as a backend)

[+] no_wizard|5 years ago|reply
Isn't the problem attempting to be solved here solved by event sourcing[0]? It lets you 're-create' your changes and is, ultimately, supposed to let you log each change in such a way that it's re-playable backwards and forwards.

[0]: https://martinfowler.com/eaaDev/EventSourcing.html
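A tiny sketch of that replay property (the event shapes here are made up): the state at any point in history is just a fold of the event log up to that point.

```python
def apply(state, event):
    """Pure transition function: given the current state and one event,
    return the next state without mutating the old one."""
    kind, key, value = event
    new = dict(state)
    if kind == "set":
        new[key] = value
    elif kind == "delete":
        new.pop(key, None)
    return new

def replay(events, upto=None):
    """Rebuild state by replaying the log from the beginning.
    Passing `upto` reconstructs the state at any earlier point."""
    state = {}
    for event in events[:upto]:
        state = apply(state, event)
    return state

log = [("set", "email", "a@x.com"),
       ("set", "email", "b@x.com"),
       ("delete", "email", None)]
```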

[+] hcarvalhoalves|5 years ago|reply
Kind of, but there’s more to it. “Event sourcing” is the pattern being implemented by git’s immutable object database, but there’s also all the hashing and reference management.
[+] math-dev|5 years ago|reply
Thank you, was a nice read (particularly the conclusion, which was insightful and made sense)