top | item 6455987

Ask HN: How would one go about writing his own database?

7 points| bernatfp | 12 years ago | reply

I was thinking that writing a DB would be a good project to start getting comfortable with a language I've been learning. However I have no idea about design, strategies or best practices for creating a DB.

17 comments

order
[+] kkowalczyk|12 years ago|reply
What kind of database? An embedded key/value database like TokyoCabinet? An embedded SQL database like SQLite? An in memory data-structure database like Redis? A server SQL database like MySQL? A distributed key/value database like Cassandra? A json document database like CouchDB? I could go on...

In general, if you want to learn a new language, pick a project of limited scope (weeks) that you have done in the past in some other language.

A database is a bad choice. Even the simplest database worth being called a database requires months of effort.

[+] bernatfp|12 years ago|reply
To be precise, I want to code a key/value store exclusively oriented at geolocated queries by key. It doesn't look like a very complex project or am I underestimating it?
[+] zamalek|12 years ago|reply
I noticed you want to do geolocation. Let's put that aside for now, just looking at key/value.

1. You need to store all your data in "pages". These are fixed-length blocks (that can be chained together).

2. Use a classical "dictionary" data structure to store your index (most use B-Tree), but you place that structure on-disk instead of in-memory. Your index basically points a key to the index of the first page that contains the value.

3. You may need to store other types of data in the file, such as the free page list.

As for geo-location, B-Tree fits nicely here.

1. Intelligently lay out your B-Tree so that nearby nodes are typically within the same node-family-system. Include information about the distance spanned by descendants in the tree (you probably want to use Manhatten distance).

2. When trying to do "near" queries, first locate the node you are interested in and then start traversing up the parents until you find the ancestor that encompasses the distance you are looking for.

3. Grab all the descendants of that ancestor and re-check them against the distance.

[+] jacquesm|12 years ago|reply
This is one of those questions where if you ask it the answer is 'you shouldn't'.

If you want to get comfortable with a language that you are learning my advice would be to pick a simple real-world problem and to try to solve that rather than to tackle a major project like this. That will only cause frustration in the long run.

[+] ulisesrmzroche|12 years ago|reply
in the short run too. even setting up the project sounds like a bitch.
[+] hardwaresofton|12 years ago|reply
I think if you really wanted to get comfortable with something that could have you make a performant, database quickly, you might want to start with something like C...

Now, C is a shotgun that you will invariably use to shoot yourself in the foot with, but I think it's absolutely straightforward and powerful manual memory management, as well as raw speed, would make it worth learning.

Like, you could for example just focus on building a database of a given size, in memory -- you could even do it all in RAM, if you wanted.

You could have the database specialize in holding one kind of data structure (for example, how about tries? they're pretty rare, despite how useful they are)

[+] bernatfp|12 years ago|reply
Thanks for the constructive feedback. In fact, I wanted to focus just in latitude-longitude pairs to perform geolocated queries, this is my single goal. I'm trying to get rid of MongoDB for my backend and this is one of the last barriers...
[+] arh68|12 years ago|reply
I, for one, think you should do it. But you've got to realize, up front, this will be thrown away, will only work for 1 use case, will be slow, etc. Don't build anything general, or anything reusable. Write in Go. Put everything in one big textfile. Start with O(slow) code. Use a simple 'schema', like tsv/csv/json. It will work.

Production-ready, simple, written by you: pick 2. The 'correct' thing to do is choose the first two. Have at it.

[+] redox_|12 years ago|reply
Wow, to start getting comfortable with a language, a DB won't be my choice. You should really start with something easier.
[+] lmm|12 years ago|reply
One feature at a time, the same as any other program.

Start with something very simple - maybe a single table with a hardcoded schema, or a simple string key/value store (a la memcached). Find a use case for which this is an improvement over what you had before. Add features as and when you need them.

[+] pbrumm|12 years ago|reply
if you have a different angle on querying your data that the current solutions don't solve then you should definitely give it a shot.

You should try to limit what you build though. Can you utilize leveldb, Riak, Cassandra, Redis or even Postgres Foreign Data Wrapper.

Your data will probably need to be queried with other user data so consider that as well.

I would start with figuring out how to query the data you need that different from the current measures and prove to yourself that it beats out the existing strategies. If it does then expand by storing and loading the data to disk.

then work on connecting to it through multiple clients, look into using msgpack, thrift, message pack.

Have fun.

[+] drakaal|12 years ago|reply
The language you have been learning had best be C or Assembly. Almost anything else and your database will always be the weakest link in performance.
[+] jacquesm|12 years ago|reply
I really don't see how you could suggest to learn assembly in this context. I'm not aware of any production database for commodity hardware written in assembler. (in the embedded world there are a few but that's specialty stuff).