top | item 6609998

Seagate just reinvented the disk interface using Ethernet

235 points| slyall | 12 years ago |speakingofclouds.com | reply

117 comments

[+] noonespecial|12 years ago|reply
I really like the "it's just a server that takes a 4k key and stores and retrieves a 1M value" approach. I'm not so keen on the physical drive "repurposing" the standard pinout of existing hardware unless they are prepared to gracefully fall back to the old block device standard if it gets plugged into a "muggle" device.

This has real promise so long as it stays as radically open as they are claiming it will be. When I can grab an old scrub machine, put a minimal debian on it and apt-get seagate-drive-emulator and turn whatever junk drives I've got laying around into instant network storage (without buying magic seagate hardware), I'm sold (and then might think about buying said hardware).
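A toy version of the emulator described above is genuinely small. The sketch below is not the actual Kinetic wire protocol (the length-prefixed framing is invented for illustration); it only borrows the 4k-key / 1M-value limits from the article:

```python
import socketserver
import threading

class KVStore:
    """In-memory stand-in for the drive's key/value layer.

    The 4 KB key and 1 MB value limits come from the article; everything
    else here is an illustrative sketch, not Seagate's protocol.
    """
    MAX_KEY = 4 * 1024
    MAX_VALUE = 1024 * 1024

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        if len(key) > self.MAX_KEY or len(value) > self.MAX_VALUE:
            raise ValueError("key or value exceeds drive limits")
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

class Handler(socketserver.StreamRequestHandler):
    """One request per connection: b'PUT' or b'GET' with length-prefixed fields."""
    def handle(self):
        op = self.rfile.read(3)
        klen = int.from_bytes(self.rfile.read(4), "big")
        key = self.rfile.read(klen)
        if op == b"PUT":
            vlen = int.from_bytes(self.rfile.read(4), "big")
            self.server.store.put(key, self.rfile.read(vlen))
            self.wfile.write(b"OK")
        elif op == b"GET":
            value = self.server.store.get(key) or b""
            self.wfile.write(len(value).to_bytes(4, "big") + value)
```

Point a `socketserver.ThreadingTCPServer` with a `store` attribute at `Handler` and any old scrub machine becomes a crude "Kinetic-ish" target.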

[+] asharp|12 years ago|reply
Key value stores are useful, and they are especially useful in this form factor. On the other hand, you now have a very large black box that you have to somehow navigate in order to create a workable system. Given that this is likely an arm core running linux on the inside, I would have considered a slightly more open approach to be 'Here's a working KV store using backing db X and here's how to reflash it if it doesn't quite work for you'.
[+] lukeschlather|12 years ago|reply
I think the idea is that if you want to do that, you would use OpenStack, and your application logic must be pluggable so that it supports this protocol, OpenStack, S3, or any other KV store you can get a library for.
[+] oofabz|12 years ago|reply
I hope it will support IPv6. The article mentions DHCP and has an example address of 1.2.3.4, but IPv4 seems like a poor choice for a new LAN protocol in 2013. Not everyone has IPv6 internet connectivity but we do all have IPv6 LAN.

Apple has been using IPv6 for local network services for years now, like file sharing and Time Capsule backups, and it works great.

[+] whalesalad|12 years ago|reply
Well, on an internal network IPv4 is more than enough. With subnets you can put a LOT of hosts on your private network without any problems.
[+] Daniel_Newby|12 years ago|reply
IPv6 has gigantic variable-length headers. That is inefficient, and too hard to implement in FPGA/silicon for a niche exploratory project.
[+] sneak|12 years ago|reply
This seems like a reinvention of Coraid's ATAoE, which has the added benefit of already being in the mainline kernel, good server/target support (vblade), hardware products shipping now, a lack of IP/TCP overhead, and a dead-simple protocol.

http://aoetools.sourceforge.net/

[+] ChuckMcM|12 years ago|reply
Basically came here to say the same thing. I like Geoff, but this isn't "new" in that sense. The "newness" here is that Seagate just put it into their base board controller. Had they been a bit smarter about it they would have put in two Ethernet PHYs, and then you could dual-port the drive, much like the old DSSI drives from DEC.

Routing is also a non-issue since a single drive on the network is about as useful as a single drummer in a marching band; basically you're going to need at least three to make something with a bit of reliability, and more if you want efficient reliability. So between your actual storage 'processor' and the storage 'elements' you drop in a cheap bit of Broadcom silicon to make a 48-port GbE switch and voilà, you're much more reliable than SATA and much cheaper than FC.

I'm sure tho that the folks at Google are all over this. :-)

[+] noselasd|12 years ago|reply
Note that while the lack of TCP/IP overhead may be helpful, it also means you don't get any routing - often one of the pains of using FCoE.
[+] rektide|12 years ago|reply
ATAoE was, to my knowledge, never integrated into a drive controller.
[+] WalterBright|12 years ago|reply
I'm waiting for stereo components that connect to each other via an Ethernet cable and a hub.

Imagine a CD player, turntable, receiver, preamp, etc., that all have only two connectors: power, and Ethernet. You wouldn't have problems anymore with running out of connections on the back of your receiver. That incredible rats nest of disparate wires and cables would be gone. No more RCA cables, coax cables, HDMI, optical cables, composite video, supervideo, component video, BNC, various adapters, etc.

No more fumbling around the back trying to figure out which socket to plug the RCA cables into, which is input, which is output, etc.

[+] gcb1|12 years ago|reply
The problem with audio is delay. Even all-analog setups already had delays (granted, some of the analog stages are accumulators, so...), but all the formats you see in audio come from lots of people, all with different ideas about the trade-offs between delay and ease of use. You are on the far right of that spectrum... so don't invent a new one; use what we already have there, which is optical, I think.
[+] konstruktor|12 years ago|reply
"Über alles" will be primarily associated with Hitler's anthem of the Third Reich by most native speakers of German who know some history. Not a good choice for a title.
[+] buyx|12 years ago|reply
The article was written in English, for an English-speaking audience, who generally no longer consider "über alles" particularly offensive (it has come to mean something like "all-conquering"). I don't think it was a poor choice of title at all, especially since it sums up the tone of the article.

It's quite unfortunate that the far-right in Germany has since appropriated the phrase, but it's been appropriated differently by English-speakers. Censoring people based on a usage that is foreign to them is a little harsh.

[+] grn|12 years ago|reply
Do Germans really associate that phrase mostly with Hitler? I find it quite surprising because despite not being German (nor German-speaking) I know that this phrase comes from the first verse of the Deutschlandlied.
[+] naiv|12 years ago|reply
yes, I thought it would be some storage reinvented by Aryans
[+] rurounijones|12 years ago|reply
As a counterpoint: A slightly less gushing article with some good comments (yes, even on El Reg) http://www.theregister.co.uk/2013/10/22/seagate_letting_apps...

Comments along the lines of "Backups? Snapshots? RAID? How they handling this then?"

[+] vidarh|12 years ago|reply
How are they handling this today? You can treat an existing HD as a key-value store where the key is the location on disk and the value is a sector of binary data. Conceptually there's no difference.

The answer is: If you need those capabilities to offer up a traditional file system, you do as you do today: you layer it on top.

But many systems don't, because they already re-implement reliability measures on top of hard drives, as we want systems that are reliably available in the case of server failure too.

E.g. consider something like Sheepdog: https://github.com/sheepdog/sheepdog Sheepdog is a cluster block device solution with automatic rebalancing and snapshots. It implements this on top of normal filesystems by storing "objects" on any of a number of servers, and uses that abstraction to provide all the services. Currently sheepdog requires the sheep daemon to run on a set of servers that can mount a file system on the disks each server is meant to use. With this system, you could possibly dispense with the filesystem, and have the sheep daemons talk directly to a number of disks that are not directly attached.

For sheepdog, RAID is not really recommended, as sheepdog implements redundancy itself (and you can specify the desired number of copies of each "block device"), and it also provides snapshots, copy-on-write, extensive caching, and supports incremental snapshot-based backups of the entire cluster in one go.

So in other words, there are applications that can make very good use of this type of arrangement without any support for raid etc. at the disk level. And for applications that can't, a key value store can trivially emulate a block device - after all sheepdog emulates a block device on top of object storage on top of block devices...

You could also potentially reduce the amount of rebalancing needed in the case of failures, by having sheep daemons take over the disks of servers that die if the disks are still online and reachable.

The biggest challenge is going to be networking costs - as I mentioned elsewhere, SSDs are already hampered by 6Gbps in SATA III, and 10GE switches are ludicrously expensive still.
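The "a key value store can trivially emulate a block device" point above fits in a few lines: sector numbers become keys. A minimal sketch (key naming and sector size are illustrative, not from any real driver):

```python
SECTOR_SIZE = 4096  # illustrative; real drives expose 512 B or 4 KiB sectors

class KVBlockDevice:
    """Block-device facade over any dict-like key/value store."""
    def __init__(self, kv):
        self.kv = kv  # e.g. a dict, or a client for a networked KV drive

    def write_sector(self, lba, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("partial sector writes not supported")
        # Fixed-width hex keys keep sectors in sorted order for range scans.
        self.kv["sector/%016x" % lba] = data

    def read_sector(self, lba):
        # Unwritten sectors read back as zeroes, like a thin-provisioned disk.
        return self.kv.get("sector/%016x" % lba, b"\x00" * SECTOR_SIZE)
```

This is exactly the layering inversion the comment describes: sheepdog builds objects on block devices, while this builds a block device on objects.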

[+] asharp|12 years ago|reply
Bitrot? Error recovery? SMART data? ...?
[+] pessimizer|12 years ago|reply
Thanks for this - surprised by the Basho/Riak connection.
[+] mrb|12 years ago|reply
I wish SD cards would implement a key-value storage interface natively. It would instantly remove the need to implement a filesystem in many embedded systems eg. music players: all they need is access to keys (song filenames) and values (blob of ogg/mp3 data).
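The music-player case is a good illustration of how little code would remain. A sketch of what the player's storage layer could shrink to, backed here by a plain dict standing in for the hypothetical KV-native card:

```python
class SongStore:
    """Storage layer for an embedded music player, assuming the card itself
    speaks key/value. A dict stands in for the card; no filesystem needed."""
    def __init__(self, kv):
        self.kv = kv

    def add(self, filename, blob):
        self.kv["song/" + filename] = blob

    def play_list(self):
        # Enumerating keys replaces the directory scan a FAT driver would do.
        prefix = "song/"
        return sorted(k[len(prefix):] for k in self.kv if k.startswith(prefix))

    def load(self, filename):
        return self.kv["song/" + filename]
```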
[+] kamaal|12 years ago|reply
As someone doing a bit of embedded systems work these days, I was wondering why the MCU manufacturers don't offer a key-value store (even a small one would do) for configuration purposes.

The most common ways of managing configuration are serializing a structure to EEPROM/flash, or writing strings with their lengths as delimiters.

Even if you assume it's for saving space, the way I see it you will inevitably use that space anyway writing the code to serialize and deserialize the configuration data.
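The two approaches sit close together in code size. A sketch contrasting the fixed-struct serialization the comment describes with a tiny tagged key/value record format (field names and layout are invented for illustration):

```python
import struct

# Status quo: serialize a fixed struct to EEPROM. Adding a field breaks
# every old image already in the field.
CONFIG_FMT = "<HHB"  # baud_div, timeout_ms, flags -- illustrative fields

def pack_config(baud_div, timeout_ms, flags):
    return struct.pack(CONFIG_FMT, baud_div, timeout_ms, flags)

def unpack_config(blob):
    return struct.unpack(CONFIG_FMT, blob)

# Tiny tagged key/value records instead: 1-byte key id, 1-byte length,
# then the value (max 255 bytes). Unknown keys are skipped, so new
# firmware can add fields without breaking old readers.
def kv_encode(items):
    out = bytearray()
    for key_id, value in items.items():
        out += bytes([key_id, len(value)]) + value
    return bytes(out)

def kv_decode(blob):
    items, i = {}, 0
    while i < len(blob):
        key_id, length = blob[i], blob[i + 1]
        items[key_id] = blob[i + 2:i + 2 + length]
        i += 2 + length
    return items
```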

[+] saljam|12 years ago|reply
These hardware KV interfaces are still fairly low-level. Fixed-size values, etc. So you'll still need to have some abstraction layer that handles that. You might not call it a filesystem but I bet it'll look a lot like one.
[+] rektide|12 years ago|reply
SD cards implement a linear map of blocks. We typically introduce a hierarchical key-value store on top of that- a filing system where we file values by certain keys.

What is your ask? Get rid of hierarchy and use only a single flat directory on the SD card? Plan9 came close to the kind of vision you describe: configuration and state for applications lived in the file system.

[+] TheLoneWolfling|12 years ago|reply
Database as a file system. In some ways it actually makes an odd sort of sense...

    SELECT * FROM sdc
    WHERE Type='mp3';
I could see uses for something like that. You could even treat it like a traditional file system for fallback purposes, if one of the tags was a 'directory' tag.

Also, it would make sense in cases where you have... [whatever the equivalent for NUMA is for disks. NUDA? Things like hard drives with a limited flash cache.] Store the indexes on the flash or in RAM (periodically backed up to the disk, of course). Biggest issue would be wear on the flash, though.
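The query above can be tried today with SQLite standing in for the hypothetical smart card (table name, tag columns, and rows are all invented for illustration):

```python
import sqlite3

# In-memory table standing in for the card: every "file" is a row
# carrying its tags, including a 'directory' tag for fallback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sdc (name TEXT, type TEXT, directory TEXT, data BLOB)")
conn.executemany(
    "INSERT INTO sdc VALUES (?, ?, ?, ?)",
    [
        ("track01.mp3", "mp3", "/music", b"..."),
        ("notes.txt",   "txt", "/docs",  b"..."),
        ("track02.mp3", "mp3", "/music", b"..."),
    ],
)

# The comment's query: select by tag instead of walking directories.
mp3s = [r[0] for r in conn.execute(
    "SELECT name FROM sdc WHERE type='mp3' ORDER BY name")]

# Fallback: treat the 'directory' tag as a traditional hierarchy.
music = [r[0] for r in conn.execute(
    "SELECT name FROM sdc WHERE directory='/music' ORDER BY name")]
```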

[+] Someone|12 years ago|reply
Classical Forth systems worked that way. Being Forth, they of course went for minimalism here. Keys were unsigned ints (actually, ints interpreted as unsigned) and all values were 1024 bytes (see http://c2.com/cgi/wiki?ForthBlocks)

Programming that way was fun, but I wouldn't want to use it on a system with megabytes of RAM. Embedded, it would be fun to implement what you describe on top of that, though.
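A Forth-style block store, with its unsigned-int keys and fixed 1024-byte values, is only a few lines. A sketch of the semantics (not any particular Forth's implementation):

```python
BLOCK_SIZE = 1024  # the classical Forth block size

class ForthBlocks:
    """Minimal Forth-style block store: unsigned-int keys, 1 KB values."""
    def __init__(self):
        self.blocks = {}

    def block(self, n):
        # BLOCK n: fetch block n; never-written blocks read as blanks.
        return self.blocks.get(n, b" " * BLOCK_SIZE)

    def update(self, n, data):
        # UPDATE + FLUSH rolled into one: store exactly 1024 bytes.
        if n < 0 or len(data) != BLOCK_SIZE:
            raise ValueError("keys are unsigned; values are exactly 1024 bytes")
        self.blocks[n] = data
```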

[+] zwieback|12 years ago|reply
That would be nice and it's doable. I like eMMC for microcontroller apps since it implements wear leveling and bad block management internally. A simple key-value mapping could probably be added without too much effort.

eMMC isn't meant to be removable, though.

[+] justinsb|12 years ago|reply
I think this is an incredibly interesting approach, and I hope Seagate opens it up a little more. If we could run some computation on the drive, that could be incredibly powerful.

I can imagine that once these are SSD drives, paired with reasonably powerful (likely ARM) chips, we'll have massively parallel storage architectures (GPU-like architectures for storage). We'll have massive aggregate CPU <-> disk bandwidth, while SSD + ARM should be very low power. We could do a raw search over all data in the time it takes to scan the flash on the local CPU, and only have to ship the relevant data over (slower) Ethernet for post-processing.

I'd love to get my hands on a dev-kit :-)
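The scatter/filter pattern described above, pushing the predicate out to each drive and shipping back only the matches, might look like this from the host side (the drive representation and API are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def search_drive(drive, predicate):
    """Scan one drive locally and return only the hits.

    'drive' is just a list of (key, value) pairs standing in for one
    device's contents; a real smart drive would run this on its own core,
    so only matching records would ever cross the Ethernet.
    """
    return [(k, v) for k, v in drive if predicate(v)]

def parallel_search(drives, predicate):
    # Fan the scan out to every drive in parallel, then merge the hits.
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        results = pool.map(lambda d: search_drive(d, predicate), drives)
    return [hit for per_drive in results for hit in per_drive]
```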

[+] arkj|12 years ago|reply
Are you munching some number crunching?
[+] _wmd|12 years ago|reply
Seems like an odd invention given that the industry is moving to storage technologies with sub-microsecond latencies, at least an order of magnitude better than 10GbE usually manages. Still, 'object store' style operations are much richer, so you avoid making many round trips to the disk to resolve the location of a database record.

Hmm, which raises the question: how much RAM should a hard disk have? In a regular architecture, that database lookup could be meaningfully cached (and you could design and provision exactly to ensure your entire set is cached). An opaque K/V "disk" seems less appealing from this angle.

[+] vidarh|12 years ago|reply
If this means 10Gbps Ethernet switches finally come down in price, awesome...

Otherwise this will be hampered by the fact that the 6Gbps of SATA III is already too slow to take maximum advantage of many SSD devices (hence OCZ's experiments with effectively extending PCIe over cables to the devices).

[+] zurn|12 years ago|reply
These are 4 TB units of 5900 RPM spinning rust.
[+] peterwwillis|12 years ago|reply
"The Seagate Kinetic Open Storage platform eliminates the storage server tier of traditional data center architectures by enabling applications to speak directly to the storage device, thereby reducing expenses associated with the acquisition, deployment, and support of hyperscale storage infrastructures."

First of all: Hyperscale? I'm not a retarded non-technical manager or MBO, so I just stopped listening to your entire pitch. Second: You're still selling storage infrastructure, and I still have to support it. The expense just has a different name now.

"Companies can realize additional cost savings while maximizing storage density through reduced power and cooling costs, and receiving potentially dramatic savings in cloud data center build outs."

How does reducing my power and cooling costs maximize my storage density? Oh, by getting me to spend more money on your product instead of power and cooling. Nice try, buddy; give me the cost comparison or stfu.

Their whole pitch here is "throw away your key/value servers and use our key/value server instead". I wonder which will be more expensive: something I throw together with commodity PCs, or a SAN developed by Seagate.

[+] polskibus|12 years ago|reply
I wonder about performance - will this new storage protocol be at least as performant as current standards (ATA, SCSI)? Do we need better performing drives, or didn't the datacenter sort of already take care of that itself?
[+] mcpherrinm|12 years ago|reply
That's an interesting question, and I think the answer isn't immediately obvious.

One important thing is that the disk is doing more work now -- you offload a bunch of what the filesystem has traditionally had to do onto the disk itself. That should mean less traffic, and lower latency. Maybe not higher throughput, though.

The interface is 2x1 gigabit, so that's obviously slower than a 3-6 gigabit SAS or SATA interface. But maybe the offloaded work will be worth it? Especially if you are doing lots of "small" IO operations, the potential for lower latency might be a win.

It's a cost reduction at the end of the day, not a huge performance bonus. I am very interested to get my hands on one and see how it plays out.

[+] wmf|12 years ago|reply
I don't think we need better performing hard disks. Everyone who cares about performance should have moved to flash already. Kinetic looks like it was designed by "disk people", not "flash people".
[+] rythie|12 years ago|reply
I don't understand why this is branded as an Ethernet protocol when it's an IP protocol.
[+] notacoward|12 years ago|reply
tl;dr it's not nearly as cool as it could have been. I already posted a more detailed explanation here:

http://pl.atyp.us/2013-10-comedic-open-storage.html

I tried to post a comment on the NSOP (Not So...), but first I got "HTTP internal error" and then I got "duplicate comment" but it still hasn't shown up, so I'll post it here.

"The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems."

I was working on multi-master storage systems using parallel SCSI in 1994. Nowadays you can get an FC or SAS disk array for barely more than a JBOD enclosure. Shared storage is neither new nor expensive. It's not common at the single-disk layer, but it's not clear why that should matter.

The idea of network disks with an object interface isn't all that new either. NASD (http://www.pdl.cmu.edu/PDL-FTP/NASD/Talks/Seagate-Dec-14-99....) did it back in '99, and IMO did it better (see http://pl.atyp.us/2013-10-comedic-open-storage.html for the longer explanation).

"Don’t fall into the trap of thinking that this means we’ll see thousand upon thousands of individual smart disks on the data center LANs. That’s not the goal."

...and yet that's exactly what some of the "use cases" in the Kinetics wiki show. Is it your statement that's incorrect, or the marketing materials Seagate put up in lieu of technical information?

"they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic."

How does Kinetic do anything to help with HA? Array vendors are not particularly constrained by the interconnects they're using now. In the "big honking" market, Ethernet is markedly inferior to the interconnects they're already using internally, and doesn't touch any of the other problems that constitute their value add - efficient RAID implementations, efficient bridging between internal and external interfaces (regardless of the protocol used), tiering, fault handling, etc. If they want to support a single-vendor object API instead of several open ones that already exist, then maybe they can do that more easily or efficiently with the same API on the inside. Otherwise it's just a big "meh" to them.

At the higher level, in distributed filesystems or object stores, having an object store at the disk level isn't going to make much difference either. Because the Kinetics semantics are so weak, they'll have to do for themselves most of what they do now, and performance isn't constrained by the back-end interface even when it's file based. Sure, they can connect multiple servers to a single Kinetics disk and fail over between them, but they can do the same with a cheap dual-controller SAS enclosure today. The reason they typically don't is not because of cost but because that's not how modern systems handle HA. The battle between shared-disk and shared-nothing is over. Shared-nothing won. Even with an object interface, going back to a shared-disk architecture is a mistake few would make.

[+] perlpimp|12 years ago|reply
Radical simplification, and IMO this is great. It remains to be seen how this will fare in comparison with RAID. I'd wager that Google would be very interested, if they aren't already doing something like that in their data centers.

Nerdy me likes the idea of a PoE hub and a bunch of drives doing their own thing.

Also a pretty good time to start writing support for this into the Linux kernel and developing support apps.

my 2c

[+] dmpk2k|12 years ago|reply
I'd wager that Google would be very interested, if they aren't already doing something like that in their data centers.

I wonder about that.

It's usually a lot cheaper to move computation to data, rather than data to computation. The model that Seagate is presenting here strikes me as wrong, because it assumes very fat pipes (or specialized topologies) for any non-trivial app. At the scale Google operates at, I just don't see this happening.

That, and I have a healthy distrust of networks. Instead of having a box with an OSS OS and dumb drives with small(er) closed firmware blobs, now you have the OS, all the network devices and their closed firmware blobs, and drives with large(r) closed firmware blobs, just to access your data. A lot more can go wrong. A lot more byzantine things can go wrong. Drives are dodgy lying sacks of fecal matter as is; this looks like it'll make things much worse.

The model Seagate presents could be useful for data that is rarely accessed, but I'm not really sold on that either.

[+] bluedino|12 years ago|reply
It'd be very interesting if Backblaze open-sourced at least part of their code. It may be optimized for archival purposes, but they're sticking your data on multiple 180TB pods using an open-source stack.

JFS file system, and the only access we then allow to this totally self-contained storage building block is through HTTPS running custom Backblaze application layer logic in Apache Tomcat 5.5. After taking all this into account, the formatted (useable) space is 87 percent of the raw hard drive totals. One of the most important concepts here is that to store or retrieve data with a Backblaze Storage Pod, it is always through HTTPS. There is no iSCSI, no NFS, no SQL, no Fibre Channel.

[+] cpr|12 years ago|reply
Why do they need the overhead of HTTPS for internal use like this?