Looks like a pretty good first attempt at a distributed filesystem. Initial impression is HDFS with a distributed NameNode/Nameserver. The first diagram also shows a Metaserver layer that's not mentioned at all in the more recent of the two design docs but "separate Metaerver from Nameserver" appears (unchecked) in the roadmap. All operations using access methods other than their own SDK seem to get funneled through the NameServer cluster, which will severely limit throughput. Not clear how they do replication, though weakly implied that it's driven from the client (like Gluster) or NameServer rather than the first ChunkServer (like Ceph, HDFS, everything else). No mention of how they handle consistency or repair. Likewise no information about performance or security. Not clear if it's anywhere near POSIX compliant (probably not).
FUSE support is in the diagrams, but not checked off on the roadmap. Slow-node detection and avoidance seemed like one of the most interesting features from the design, but is not checked off either. Other things not even on the roadmap, using Gluster not as a fair comparison but as a handy list of possibilities: multiple replication levels, tiering, erasure coding, NFS/SMB, caching, quota, snapshots.
As I said, looks like a good first attempt. Better than most I've seen, with lots of potential, but as of today it seems rather bare-bones. Many hard problems remain to be solved, and I wish them well.
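On the replication question raised above: in client-driven designs (Gluster-style) the client fans the write out to every replica itself, while in pipeline designs (HDFS/Ceph-style) the client sends one copy to the first chunkserver, which forwards it down the chain. A toy sketch of the two data paths, with everything hypothetical (not BFS code), just to show why the choice matters for client bandwidth:

```python
# Toy illustration of two replication data paths (hypothetical, not BFS code).

def client_driven_write(block, replicas):
    """Gluster-style: the client sends the block to every replica itself,
    so client uplink traffic is multiplied by the replication factor."""
    sent = 0
    for replica in replicas:
        replica.append(block)   # one full copy per replica over the client's NIC
        sent += len(block)
    return sent                 # bytes the client had to upload

def pipeline_write(block, replicas):
    """HDFS/Ceph-style: the client sends one copy to the first chunkserver,
    which forwards it down the chain; client uplink cost stays constant."""
    head, *tail = replicas
    head.append(block)
    for replica in tail:        # server-to-server forwarding, not client traffic
        replica.append(block)
    return len(block)           # bytes the client had to upload

replicas = [[], [], []]
assert client_driven_write(b"data", replicas) == 12   # 3x client upload
replicas = [[], [], []]
assert pipeline_write(b"data", replicas) == 4         # 1x client upload
```

With 3-way replication the client-driven path costs the writer three times the uplink bandwidth, which is one reason most systems push replication into the server chain.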
> Looks like a pretty good first attempt at a distributed filesystem.
You are damn right. It is.
Three years ago, the most widely used DFS at Baidu was Peta, which is similar to HDFS v2. We have since migrated to AFS.
Looks nice. I know Raft better than most of the other pieces, so that's where I started; I didn't see code for dynamic membership changes or log truncation. I can understand getting by with a fixed membership, but log truncation seems like a requirement for a production system. I'd be interested to hear whether this is planned or whether there's a clever way around it!
Well, there's another project in the same organization, named iNexus, that has already implemented log truncation. It uses LevelDB as its underlying storage, and the LevelDB is slightly modified to clean out outdated data during compaction. Maybe BFS will do something similar.
For the source code, please refer to https://github.com/baidu/ins
And I'm sorry for the lack of English documents in this repo. We are working on it.
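The compaction idea described above can be sketched in a few lines: once a snapshot covers the state machine up to some index, every Raft log entry at or below that index is redundant and can be dropped when the storage layer compacts. A minimal illustration (hypothetical, not the iNexus/BFS code):

```python
# Sketch of Raft log truncation during compaction: any entry whose index is
# <= the latest snapshot index is already captured by the snapshot and can
# be dropped. Hypothetical illustration, not actual iNexus/BFS code.

def compact_log(entries, snapshot_index):
    """entries: list of (index, command); keep only entries newer than the snapshot."""
    return [(i, cmd) for i, cmd in entries if i > snapshot_index]

log = [(1, "set a=1"), (2, "set b=2"), (3, "del a"), (4, "set c=3")]
# After snapshotting state up to index 3, only index 4 must be retained:
assert compact_log(log, 3) == [(4, "set c=3")]
```

In a LevelDB-backed log this check would live in the compaction path, so truncation happens as a side effect of normal background compaction rather than as a separate pass.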
Looking through the code, it supports FUSE, but the English documentation is sparse. It also looks to underpin Tera, Baidu's distributed database.
I think a DFS with low read/write latency, suitable for real-time applications, would be a game changer. I'm hoping they expand the documentation from here and engage the English-speaking community.
Based on the design[1], it has a leader/follower pattern (though you'd want multiple nameserver replicas under Raft consensus, so a new leader can take over, to avoid a single point of failure), where the leader, called the "nameserver", decides where to put each piece of data and metadata among a set of chunkservers and metadata servers.
That design is very reminiscent of CephFS's cluster monitors, metadata servers, object storage devices.
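As a rough illustration of the "decides where to put each piece of data" role, a nameserver-style placement function can be sketched with rendezvous (highest-random-weight) hashing, which several systems use to pick N servers per chunk deterministically. This is illustrative only; BFS's actual placement policy may be entirely different:

```python
import hashlib

# Rendezvous (HRW) hashing: for each chunk, score every chunkserver and take
# the top-N scores. Deterministic, and when a server joins or leaves, only
# the chunks it scored highest on move. Not the actual BFS algorithm.

def place_chunk(chunk_id, servers, n_replicas=3):
    def score(server):
        h = hashlib.sha256(f"{chunk_id}:{server}".encode()).hexdigest()
        return int(h, 16)
    return sorted(servers, key=score, reverse=True)[:n_replicas]

servers = [f"cs{i}" for i in range(10)]
placement = place_chunk("chunk-0001", servers)
assert len(placement) == 3
# The same chunk always maps to the same replica set:
assert placement == place_chunk("chunk-0001", servers)
```

A real nameserver would layer load, capacity, and rack-awareness on top, but the deterministic core is the same idea.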
First of all, it's implemented in C++ (JVM/GC is a pain in the ass). Also:
- clear architecture (only master and dataserver)
- very concise config file, easy to deploy
- most important, 10k-node scalability without a federated namespace design
For the distributed FS people out there, I've got a complicated question. In my job I need to poll and collect data from many remote sensor devices, log all the output, and process it. Not only do I do this, but MANY of my colleagues do too, each with a different way of managing the process. Can a distributed file system help with this case?
Is there any file system that would be able to sync what amounts to text/binary data across many hosts and allow me to aggregate the data off the network for more secure storage?
I was thinking about using IPFS for this, but this seems better. I'd like to have a private network for this use case so that other people can't put a device on the file system and introduce fake data.
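On the fake-data concern: independent of which filesystem is chosen, unauthenticated records can be rejected at ingestion by having each authorized sensor sign its payloads with a pre-shared key. A minimal sketch using HMAC (all device names and keys hypothetical):

```python
import hashlib
import hmac

# Each authorized device holds a pre-shared key; the collector verifies the
# signature before writing the record into shared storage. Sketch only; a
# real deployment would manage keys via proper provisioning and rotation.

DEVICE_KEYS = {"sensor-42": b"pre-shared-secret"}  # provisioned out of band

def sign(device_id, payload):
    key = DEVICE_KEYS[device_id]
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def accept(device_id, payload, signature):
    key = DEVICE_KEYS.get(device_id)
    if key is None:
        return False  # unknown device: reject outright
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

msg = b"temp=21.5"
tag = sign("sensor-42", msg)
assert accept("sensor-42", msg, tag)
assert not accept("sensor-42", b"temp=99.9", tag)   # tampered payload
assert not accept("sensor-99", msg, tag)            # unauthorized device
```

This keeps the trust decision at the collector, so even a shared or private network segment doesn't have to be fully trusted.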
If you look at Baidu's infrastructure, it's almost like a parallel universe where the names are identical or almost identical to Google's: BFE, GTC, GSLB. And BFS does look a lot like GFS2 aka Colossus.
Seems that most of Baidu's C++ open source projects have this pattern as well, albeit with minor variations on the Google C++ style such as 4 spaces for indentation.
More like a C++ clone of HDFS than most people are likely hoping for. While you seem to be able to mount it with FUSE, I imagine it's primarily meant to be programmed against directly.
Using Raft over a dependency on an external consensus system is nice. Definitely makes the namenode architecture much better.
It looks like a faster version of HDFS since it's written in C++ (vs Java).
Another important aspect is that it uses SSD + SATA (I suppose), which could be a better option than standard SATA/SSD or an LVM cache combining SATA + SSD.
Even if it's just a new thing, if it proves to be faster it may be adopted in the Hadoop ecosystem in the future. HDFS has a lot of features, being a mature piece of software, but it falls short on response time.
"cs启动太慢" means "cs starts up too slowly," where 启动 is "start-moving" and likely a verb-result construction, a pattern in Chinese that to my knowledge doesn't exist in Germanic languages. The second one is more accurately translated as "Other SDK write strategies."
Not commenting on BFS itself so much as noting that it's really cool to see large Chinese companies contributing to open source. Does anyone know of other large projects outside of the main Android forks? Pardon the ignorance.
I also wonder whether there will be greater skepticism toward integrating Chinese open source, given potential influence by the government (as the NSA has attempted in the past).
This looks extremely promising. I work on distributed systems, in particular on databases (one abstraction layer above file systems). This looks like it would make a really nice storage engine for https://github.com/amark/gun. It's also nice to see non-English projects! Very exciting work.
notacoward | 9 years ago
Jacky007 | 9 years ago
justinsb | 9 years ago
imafatboy | 9 years ago
usgroup | 9 years ago
lylei | 9 years ago
daviesliu | 9 years ago
There is another distributed file system that supports full POSIX semantics with a well-tuned FUSE client, called MooseFS [1].
My ex-employer used it in production for about 8 years; the biggest cluster holds more than 2 PB.
Disclosure: I'm a MooseFS fan and contributor :)
[1] http://moosefs.org/
bluejekyll | 9 years ago
It all looks reasonable, but this takes self-documenting to an extreme.
espadrine | 9 years ago
[1]: https://github.com/baidu/bfs/blob/master/docs/design.md, https://github.com/baidu/bfs/blob/master/docs/BFS_design.md
WeaselNo7 | 9 years ago
ergo14 | 9 years ago
revelation | 9 years ago
anilgulecha | 9 years ago
00k | 9 years ago
gravypod | 9 years ago
chubot | 9 years ago
puzzle | 9 years ago
int_handler | 9 years ago
jpgvm | 9 years ago
ciucanu | 9 years ago
jprince | 9 years ago
luibelgo | 9 years ago
NicoJuicy | 9 years ago
E.g., lylei changed the title from "cs启动太慢" to "cs start is too slow" (on https://github.com/baidu/bfs/issues/376)
and changed the title from "其他SDK写策略" to "SDK writing strategies (fan-out write for example)" (on https://github.com/baidu/bfs/issues/243)
toxik | 9 years ago
jeffbax | 9 years ago
marknadal | 9 years ago
vonnik | 9 years ago
khc | 9 years ago
"Once your code has passed the code-review and merged, it will be run on thousands of servers"
And the Chinese text below says tens of thousands of servers, which is it? :-)
muddyrivers | 9 years ago
There are several other discrepancies in the doc between the Chinese version and the English one. Some technical proofreading is needed.
HammadB | 9 years ago
merb | 9 years ago
chronid | 9 years ago
It's another point of failure, more infrastructure you have to keep alive...
andrewclunn | 9 years ago
codezero | 9 years ago
faizshah | 9 years ago