Ask HN: Distributed File System
28 points | ErrantX | 16 years ago
So I need some recommendations.
I've been building a distributed file system for work to store our hash tables. These are 1 GB files (about 40 TB worth of them) that are write once, read many.
It needs to duplicate the data across servers and make the files available via HTTP. Oh, and it needs to scale quite well, because from next month we are potentially adding another TB per month.
So far I haven't been able to find a DFS that does all the above, so I have been working on my own. But I am nervous - the files are mission critical. I'm not too worried about losing stuff per se (there are alternative backup solutions that make sure we have multiple static copies safe); I'm more worried about not being able to cope with the load. My current implementation is in Python and simply uses a central MySQL server to track file locations.
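For anyone curious, the current approach is roughly this shape - a minimal sketch, with sqlite3 standing in for the central MySQL server, and with table and column names that are illustrative, not our real schema:

```python
import random
import sqlite3  # stand-in for MySQL so the sketch is self-contained

# Hypothetical schema: one row per replica of each write-once file.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE replicas (
    filename TEXT,      -- e.g. 'hashtable-00042.bin'
    server   TEXT,      -- host that holds a copy
    PRIMARY KEY (filename, server))""")

def register(filename, servers):
    """Record which servers hold a copy of a newly written file."""
    db.executemany("INSERT INTO replicas VALUES (?, ?)",
                   [(filename, s) for s in servers])
    db.commit()

def locate(filename):
    """Pick one replica at random to spread read load across servers."""
    rows = db.execute("SELECT server FROM replicas WHERE filename = ?",
                      (filename,)).fetchall()
    return random.choice(rows)[0] if rows else None

register("hashtable-00042.bin", ["node1", "node2"])
```

The central database is the obvious bottleneck and single point of failure here, which is exactly the load concern above.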
So: can anyone recommend a DFS option I have missed that fulfils my requirements? Or, even better, can anyone offer technical ideas to help with the development of our code?
:)
comice | 16 years ago | reply
http://www.gluster.org
It doesn't serve via HTTP directly, but it's easy to point a web server at the filesystem (it has an Apache module to provide direct access without going through FUSE if you want higher performance).
Tichy | 16 years ago | reply
I think there must be open source clones of the FS Google uses, but I don't know the names.
ErrantX | 16 years ago | reply
Cassandra looks pretty fun - you're suggesting that as the database, right? I'm thinking a quick Python implementation for PUT (maybe DELETE) and meta operations, using Cassandra as a backend and lighttpd for the GETs (high performance), might work... Cheers.
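A sketch of what that GET path could look like - hypothetical names throughout, with a plain dict standing in for the Cassandra metadata lookup, and a WSGI app playing the "quick Python implementation" role:

```python
import random

# Stand-in for the Cassandra lookup: filename -> storage nodes holding it.
# In the real setup this would be a read against a column family.
metadata = {"hashtable-00042.bin": ["node1.internal", "node2.internal"]}

def app(environ, start_response):
    """Redirect each GET to a lighttpd instance on a node holding the file."""
    filename = environ["PATH_INFO"].lstrip("/")
    nodes = metadata.get(filename)
    if not nodes:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"no such file\n"]
    # Pick a replica at random to spread read load; lighttpd on that
    # node serves the actual bytes, so this app never touches file data.
    location = "http://%s/%s" % (random.choice(nodes), filename)
    start_response("302 Found", [("Location", location)])
    return []
```

Because the redirector never streams file contents itself, it stays cheap even with 1 GB files; the heavy lifting lands on lighttpd.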
gamache | 16 years ago | reply
Edit: OK, now I see that S3 was suggested but vetoed. Unwise decision, in my opinion. If you need data security, encrypt your data.
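Client-side encryption before upload is only a few lines; a sketch using the third-party `cryptography` package's Fernet recipe (AES plus HMAC) - the S3 upload call itself is omitted:

```python
from cryptography.fernet import Fernet

# Generate the key once and keep it safe locally; every file is encrypted
# before it leaves your machines, so S3 only ever sees ciphertext.
key = Fernet.generate_key()
fernet = Fernet(key)

def encrypt_for_upload(plaintext):
    """Return ciphertext that is safe to hand to any untrusted store."""
    return fernet.encrypt(plaintext)

def decrypt_after_download(ciphertext):
    """Recover the original bytes; fails loudly if tampered with."""
    return fernet.decrypt(ciphertext)
```

Fernet works on whole byte strings, so for 1 GB files you would want to encrypt in chunks rather than hold each file in memory, but the principle is the same.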
ErrantX | 16 years ago | reply
S3 is great for us atm, but the work to encrypt it is too much, and the cost doesn't scale for us long term. Potentially within a couple of years we will be spending half a million on bandwidth and half a million on storage - and that will continue upwards :)
We're facing the Google model: commodity hardware on a cheap T1 connection :)
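A back-of-envelope way to see why per-GB pricing bites at this scale - the rates below are placeholders for illustration, not actual S3 prices:

```python
# Assumed per-GB rates -- placeholders, not real S3 pricing.
STORAGE_USD_PER_GB_MONTH = 0.15
TRANSFER_USD_PER_GB = 0.17

def monthly_cost_usd(stored_tb, transferred_tb):
    """Rough monthly bill for a given storage footprint and egress volume."""
    return (stored_tb * 1024 * STORAGE_USD_PER_GB_MONTH
            + transferred_tb * 1024 * TRANSFER_USD_PER_GB)

# 40 TB stored and, say, 40 TB served per month: the bill recurs every
# month and grows with the footprint, whereas owned commodity hardware
# is mostly a one-off purchase plus colo costs.
print(monthly_cost_usd(40, 40))
```

With per-GB billing the monthly cost grows linearly with the dataset forever, which is the crux of the "doesn't scale long term" point.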
ErrantX | 16 years ago | reply
We have 4,000 clients atm but hope to increase that quickly (20,000 within a year). That's TBs of data a month. A SAN was considered, but it's too expensive to scale :( EDIT: well, not over-the-top expensive, but commodity servers with software are cheaper/better for us.
Also, at some point we need an HTTP interface to the outside world.