Baidu File System – A distributed file system for real-time applications

[+] notacoward|9 years ago|reply

Disclosure: I'm a Gluster developer.

Looks like a pretty good first attempt at a distributed filesystem. Initial impression is HDFS with a distributed NameNode/Nameserver. The first diagram also shows a Metaserver layer that's not mentioned at all in the more recent of the two design docs but "separate Metaerver from Nameserver" appears (unchecked) in the roadmap. All operations using access methods other than their own SDK seem to get funneled through the NameServer cluster, which will severely limit throughput. Not clear how they do replication, though weakly implied that it's driven from the client (like Gluster) or NameServer rather than the first ChunkServer (like Ceph, HDFS, everything else). No mention of how they handle consistency or repair. Likewise no information about performance or security. Not clear if it's anywhere near POSIX compliant (probably not).

FUSE support is in the diagrams, but not checked off on the roadmap. Slow-node detection and avoidance seemed like one of the most interesting features from the design, but is not checked off either. Other things not even on the roadmap, using Gluster not as a fair comparison but as a handy list of possibilities: multiple replication levels, tiering, erasure coding, NFS/SMB, caching, quota, snapshots.

As I said, looks like a good first attempt. Better than most I've seen, with lots of potential, but as of today it seems rather bare-bones. Many hard problems remain to be solved, and I wish them well.

[+] Jacky007|9 years ago|reply

> Looks like a pretty good first attempt at a distributed filesystem. You are damn right. It is. ~~ 3 years ago, the most widely used DFS in baidu was Peta which is similar to HDFS V2. We have migrate to AFS now.

[+] justinsb|9 years ago|reply

Looks nice. I know Raft better than most of the other pieces, so that's where I started; I didn't see code for dynamic membership changes nor log truncation. I can understand getting by with a fixed membership, but log truncation seems like a requirement for a production system. Would be interested to hear whether this is planned or whether there is a clever way around it!

[+] imafatboy|9 years ago|reply

Well, there's another project in the same organization named iNexus achieved in log truncation. It uses leveldb as underlying storage and the leveldb is slightly modified to clean the outdated data when compacting. Maybe BFS will do something similar. For the source code, please refer to https://github.com/baidu/ins And I'm sorry for the lack of English documents in this repo. We are working on it.

[+] usgroup|9 years ago|reply

Looking through the code it supports fuse, but the documentation in ENG is sparse. It also looks to underpin Tera: the Baidu distributed DB.

I think a low read/write latency dfs suitable for real time applications would be a game changer. I'm hoping they up the documentation from here and engage the English speaking community.

[+] lylei|9 years ago|reply

Thanks for your advice. We are working on translating all the documents :)

[+] daviesliu|9 years ago|reply

BFS has a very limited FUSE client.

There is another distributed file system that support full POSIX semantics with well tuned FUSE client, called MooseFS [1].

My ex-employer used that in production for about 8 years, the biggest cluster has more than 2PB.

Disclosure: I'm a MooseFS fan and contributor :)

[1] http://moosefs.org/

[+] bluejekyll|9 years ago|reply

Speaking of documentation, I see almost no comments in the code.

It all looks reasonable, but this takes self-documenting to an extreme.

[+] espadrine|9 years ago|reply

Based on the design[1], it has a leader / follower pattern (although you should have multiple leaders with Raft consensus to avoid having a single point of failure), where the leader is called "nameserver" and decides where to put each piece of data and metadata among a set of chunk servers and metadata servers.

That design is very reminiscent of CephFS's cluster monitors, metadata servers, object storage devices.

[1]: https://github.com/baidu/bfs/blob/master/docs/design.md, https://github.com/baidu/bfs/blob/master/docs/BFS_design.md

[+] WeaselNo7|9 years ago|reply

I thought Raft was always-single-leader? Followers can happily become leaders through elections?

[+] ergo14|9 years ago|reply

what is the leader/follower pattern? Something like master/slave approach?

[+] revelation|9 years ago|reply

They need to stop saying "real-time". Real-time does not mean "fast", it means "guaranteed performance". This is nothing of the sort.

[+] anilgulecha|9 years ago|reply

Can a distributed storage expert comment in what ways this differs from hadoop?

[+] 00k|9 years ago|reply

first of all, impl in C++ (JVM/GC is pain in the ass) - clear arch (only master and dataserver) - very concise config file and easy to deploy - most important, 10k nodes scalability without federation design of namespace

[+] gravypod|9 years ago|reply

For the distributed FS people out there I've got a complicated question. In my job I need to poll and collect data from many remote sensor devices, log all the output, and process that. Not only do I do this buy MANY of my colleges do this and have a different way to manage this process. Can a distributed file system help with this case?

Is there any file system that would be able to sync what amounts to text/binary data across many hosts and allow me to aggregate the data off the network for more secure storage?

I was thinking about using IPFS for this but this also seems better. I'd hopefully like to have a private network for this use case so that other people can't post up a device on this file system and introduce fake data.

[+] chubot|9 years ago|reply

Interesting: Google flags, protocol buffers, and Google C++ style.

[+] puzzle|9 years ago|reply

If you look at Baidu's infrastructure, it's almost like a parallel universe where the names are identical or almost identical to Google's: BFE, GTC, GSLB. And BFS does look a lot like GFS2 aka Colossus.

[+] int_handler|9 years ago|reply

Seems that most of Baidu's C++ open source projects have this pattern as well, albeit with minor variations on the Google C++ style such as 4 spaces for indentation.

[+] jpgvm|9 years ago|reply

More like a C++ clone of HDFS than most people are likely hoping. While you seem to be able to mount it with FUSE I imagine it's primarily meant to be programmed against directly.

Using Raft over a dependency on an external consensus system is nice. Definitely makes the namenode architecture much better.

[+] ciucanu|9 years ago|reply

It looks like a faster version of HDFS since it's written in C++ (vs Java).

Another important aspect is that is using SSD + SATA(I suppose) , which could be a better option than standard SATA/SSD or LV cache using SATA + SSD.

Even if it's just a new thing, if it proves to be faster it may be implemented in Hadoop ecosystem in the future. HDFS has a lot of features being a mature piece of software but it lacks on the response time.

[+] jprince|9 years ago|reply

What's the difference then between this and MapR, besides a similar CLDB pattern? (No single point of failure)

[+] luibelgo|9 years ago|reply

Is there any benchmark available?

[+] NicoJuicy|9 years ago|reply

What wonders me the most, is when they change titles. 1 chinese character sometimes matches 1 english word

Eg. lylei changed the title from "cs启动太慢" to "cs start is too slow " ( on https://github.com/baidu/bfs/issues/376 )

lylei changed the title from "其他SDK写策略" to "SDK writing strategies(fan-out write for example)" (on https://github.com/baidu/bfs/issues/243 )

[+] toxik|9 years ago|reply

"cs启动太慢" means "cs start-up too slow," where 启动 is start-moving and likely a verb-result construction, a pattern in Chinese that to my knowledge doesn't exist in Germanic languages. The second one is more accurately translated to "Other SDK writing strategies."

[+] jeffbax|9 years ago|reply

Not commenting on the BDFS so much as its really cool to see large Chinese companies contributing to open source, does anyone know of other large projects outside of the main Android forks? Pardon the ignorance.

Also wonder if there will be larger skepticism toward integrating Chinese O/S in regards to potential influence by the government (like the NSA has tried to influence in the past)

[+] marknadal|9 years ago|reply

This looks extremely promising and good. I work on distributed system, in particular on databases (so one abstraction layer above file systems). This looks like it would make for a really nice storage engine for https://github.com/amark/gun . Also it is nice to see non-English projects! Very exciting work.

[+] unknown|9 years ago|reply

[deleted]

[+] vonnik|9 years ago|reply

How is this better/different than HDFS? Is this simply an example of NIH?

[+] khc|9 years ago|reply

What I really want to know:

"Once your code has passed the code-review and merged, it will be run on thousands of servers"

And the Chinese text below says tens of thousands of servers, which is it? :-)

[+] muddyrivers|9 years ago|reply

Considering Baidu's scale, it would be tens of thousands.

There are several other discrepancies in the doc between the Chinese version and the English one. Some technical proofreading is needed.

[+] HammadB|9 years ago|reply

Has anyone found a good deep-dive on the architecture in english?

[+] unknown|9 years ago|reply

[deleted]

[+] merb|9 years ago|reply

I wonder whey they choose to rewrite raft and didn't use something with etcd or another working raft solution.

[+] chronid|9 years ago|reply

You usually don't want to add another dependency you don't control to your system, if you don't have to (in term of time and resources).

162 comments