item 42480105

Introducing S2

372 points | brancz | 1 year ago | s2.dev

195 comments


animex|1 year ago

IANAL, but naming your product S2 and mentioning in the intro that AWS S3 is the tech you are enhancing is probably inviting a branding/trademark claim from Amazon. Same vertical, and it will definitely cause consumer confusion. I'm sure you've done the research about whether a trademark has been registered.

https://tsdr.uspto.gov/#caseNumber=98324800&caseSearchType=U...

fcortes|1 year ago

Fun fact: S2 and EC2 sound exactly the same in Spanish: both are "ese dos". Add that to EC2 and S3 already being hard to tell apart by ear.

volemo|1 year ago

TBF, if I were building something with the goal of enhancing S3, I would call it S4.

fasteo|1 year ago

At least Cloudflare's R2 has an argument for the naming (HAL vs. IBM, 2001: A Space Odyssey).

kevingadd|1 year ago

I'm not sure whether they consulted a bad trademark lawyer or didn't consult one at all, but it wouldn't have cost that much to do so. I say this having just recently started the process of filing a trademark - the cost is about the same as buying, e.g., 's4.dev', according to the domain registrar's website.

Having to rebrand your product after launching is a lot more painful than doing it before launching.

ChicagoDave|1 year ago

OR

Amazon just builds the same thing, calls it S3 Streams, and doesn’t care about S2.

Maybe they make a buyout offer.

I highly doubt they would sue.

rsync|1 year ago

What could possibly be better than being sued by Amazon over some nitpicky naming issue?

That’s the kind of David vs. Goliath publicity one could only dream of …

pxtail|1 year ago

Yep, the letter S plus a number is copyrighted, can't do that

_1tem|1 year ago

This is a really good idea, beautiful API, and something that I would like to use for my projects. However I have zero confidence that this startup would last very long in its current form. If it's successful, AWS will build a better and cheaper in-house version. It's just as likely to fail to get traction.

If this had been released instead as a Papertrail-like end-user product with dashboards etc., instead of a "cloud primitive" API so closely tied to AWS, it would make a lot more sense. Add the ability to bring my own S3-compatible backend (such as DigitalOcean Spaces), and boom, you have a fantastic, durable, cloud-agnostic product.

shikhar|1 year ago

(Founder) We do intend to be multi-cloud; we are just starting with AWS. Our internal architecture is not tied to AWS: it's built on interfaces that we can implement for other cloud systems.

torginus|1 year ago

It would be extra ironic if the whole thing already ran on top of AWS.

There's no end to startups which can be described as existing-open-source-software-as-a-service, marketed as a cheaper alternative to AWS offerings... and which themselves run on AWS.

qudat|1 year ago

People keep making the same argument against Aptible (https://aptible.com) and it is still a very successful PaaS over a decade later.

gr__or|1 year ago

If you do cloud infra stuff, AWS will try to undercut you on price but will never outdo you on D/UX. So I wouldn't let Bezos hold me back.

Too|1 year ago

They just did https://news.ycombinator.com/item?id=42211280 (Amazon S3 now supports the ability to append data to an object, 30 days ago). Azure has had the same with append blobs for a long time. It's still a bit more raw than S2, without the concept of record. The step for a cloud provider to offer this natively is very small. And with the concept of a record, isn't this essentially a message queue, where the competitor space is equally big? Likewise if you look into log storage solutions.

throwaway519|1 year ago

Amazon doesn't compete for price-sensitive product offerings.

If anything, they normalise an expectation with a budget-aware base.

solatic|1 year ago

Help me understand - you build on top of AWS, which charges $0.09/GB for egress to the Internet, yet you're charging $0.05/GB for egress to the Internet? Sounds like you're subsidizing egress from AWS? Or do you have access to non-public egress pricing?

shikhar|1 year ago

(Founder) We are not charging in preview. At the scale where it matters, we will work it out. Definitely some assumptions in here.

nfm|1 year ago

List pricing is $0.05 per GB after 150 TB, and at high volume it's cheaper than that.

kondro|1 year ago

They’re probably betting on most users being in AWS and only having to pay 1¢-2¢ transfer.

amazingman|1 year ago

Nobody with sufficient scale will be paying retail for data transfer.

CodesInChaos|1 year ago

Looks like they changed it to $0.08/GB, which loses them at most ~$300/month at 50 TB, and makes money beyond that.
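For anyone checking the arithmetic, here is a quick sketch, assuming AWS's classic tiered Internet egress rates (first 10 TB/month at $0.09/GB, next 40 TB at $0.085/GB, and so on):

```python
# Rough margin math for reselling AWS egress at a flat rate.
# Tier sizes in GB and $/GB rates are AWS's published list prices
# at the time (assumption; check the current pricing page).
AWS_TIERS = [(10_000, 0.09), (40_000, 0.085), (100_000, 0.07), (float("inf"), 0.05)]

def aws_egress_cost(gb: float) -> float:
    """What AWS charges for `gb` of Internet egress in a month."""
    cost, remaining = 0.0, gb
    for size, rate in AWS_TIERS:
        used = min(remaining, size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost

revenue = 50_000 * 0.08          # charging a flat $0.08/GB at 50 TB: $4,000
cost = aws_egress_cost(50_000)   # $900 + $3,400 = $4,300
print(round(revenue - cost))     # -300
```

Beyond 50 TB the marginal AWS rate drops to $0.07/GB and then $0.05/GB, so every additional GB sold at $0.08 is profitable, which matches the "makes money after that" claim.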

MicolashKyoka|1 year ago

The strategy is likely to just get users, then move off AWS if the product works.

masterj|1 year ago

So is this basically WarpStream except providing a lower-level API instead of jumping straight to Kafka compatibility?

An S3-level primitive API for streaming seems really valuable in the long-term if adopted

shikhar|1 year ago

(Founder) That about summarizes it, yes :) We also take a different approach than WarpStream architecturally, which allows us to offer much lower latencies. No disks in our system, either.

iambateman|1 year ago

These folks knowingly chose to spend the rest of their careers explaining that they are not, in fact, S3.

shikhar|1 year ago

(Founder) well 50% of our name is different

jsheard|1 year ago

How many of these letter-number storage services are there now? S3, B2, R2, S2...

andrelaszlo|1 year ago

Seems preferable to having to explain you're not a paramilitary organization responsible for unspeakable war crimes. Nothing funny about that.

OJFord|1 year ago

Including potentially in court / to lawyers? IANAL, but isn't this just inviting Amazon to claim it's deliberately leveraging their 'S3' trademark and sowing confusion in order to lift their own brand? (Correctly, and even somewhat transparently in TFA, IMO.)

cchance|1 year ago

My issue is that 2 < 3, and most people will just assume it's an older/shittier S3 lol

pram|1 year ago

It looks neat, but no Java SDK? Every company I've personally worked at is deeply reliant on Spring or the vanilla clients to produce/consume to Kafka 90% of the time. This kind of precludes even a casual PoC.

infiniteregrets|1 year ago

(S2 team member) As we move forward, a Java/Kotlin SDK and a Python SDK are on our list. There is a Rust SDK and a CLI available (https://s2.dev/docs/quickstart). Rust felt like a good starting point for us, as our core service is also written in it.

karmakaze|1 year ago

I do like this. The next part I'd like someone to build on top of this is applying the stream 'events' into a point-in-time queryable representation - basically the other part needed to make it a Datomic. Probably better if it's a pattern or framework for making specific in-memory queryable data rather than a particular database. There are lots of ways this could work, like applying events to a local SQLite, or basing it on a MySQL binlog that can be applied to a local query instance and rewound to specific points, or more application-specific apply/undo events on local state.
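The "apply the stream into a local SQLite" variant could be sketched like so (the event shape here is entirely hypothetical): replaying the log up to a chosen sequence number materializes a queryable view as of that point in time.

```python
import sqlite3

# Hypothetical event shape: (seq, op, key, value). Replaying a prefix
# of the log rebuilds the state as of that point in time.
EVENTS = [
    (1, "put", "a", "1"),
    (2, "put", "b", "2"),
    (3, "del", "a", None),
]

def view_at(events, upto_seq):
    """Materialize a point-in-time snapshot into an in-memory SQLite DB."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
    for seq, op, key, value in events:
        if seq > upto_seq:
            break
        if op == "put":
            db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        elif op == "del":
            db.execute("DELETE FROM kv WHERE key = ?", (key,))
    return db

# At seq 2 both keys exist; at seq 3, "a" has been deleted.
print(view_at(EVENTS, 2).execute("SELECT COUNT(*) FROM kv").fetchone())  # (2,)
print(view_at(EVENTS, 3).execute("SELECT COUNT(*) FROM kv").fetchone())  # (1,)
```

Rewinding is just replaying with a smaller `upto_seq`; a real implementation would add periodic snapshots so replay doesn't start from zero.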

jgraettinger1|1 year ago

Roughly ten years ago, I started Gazette [0]. Gazette is in an architectural middle-ground between Kafka and WarpStream (and S2). It offers unbounded byte-oriented log streams which are backed by S3, but brokers use local scratch disks for initial replication / durability guarantees and to lower latency for appends and reads (p99 <5ms as opposed to >500ms), while guaranteeing all files make it to S3 with niceties like configurable target sizes / compression / latency bounds. Clients doing historical reads pull content directly from S3, and then switch to live tailing of very recent appends.

Gazette started as an internal tool in my previous startup (AdTech related). When forming our current business, we very briefly considered offering it as a raw service [1] before moving on to a holistic data movement platform that uses Gazette as an internal detail [2].

My feedback is: the market positioning for a service like this is extremely narrow. You basically have to make it API compatible with a thing that your target customer is already using so that trying it is zero friction (WarpStream nailed this), or you have to move further up to the application stack and more-directly address the problems your target customers are trying to solve (as we have). Good luck!

[0]: https://gazette.readthedocs.io/en/latest/ [1]: https://news.ycombinator.com/item?id=21464300 [2]: https://estuary.dev

shikhar|1 year ago

(S2 Founder) Congrats on the success with Estuary! You are not the first person to tell me there is no/tiny market for this. Clearly _you_ thought there was something to it, when you looked to HN for validation. We may do a lot more on top of S2, like offering Kafka compatibility, but the core primitive matters. I have wanted it. It gets reinvented in all kinds of contexts and reused sub-optimally in the form of systems that have lost their soul, and that was enough for me to have this conviction and become a founder.

ED: I appreciate where you are coming from, and understand the challenges ahead. Thank you for the advice.

Scaevolus|1 year ago

This is a very useful service model, but I'm confused about the value proposition given how every write is persisted to S3 before being acknowledged.

I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?

AWS has shown their willingness to implement mostly-protocol compatible services (RDS -> Aurora), and I could see them doing the same with a Kafka reimplementation.

sensodine|1 year ago

(S2 team member here)

> I suppose the writers could batch a group of records before writing them out as a larger blob, with background processes performing compaction, but it's still an object-backed streaming service, right?

This is how it works, essentially, yes. Architecting the system so that the chunks written to object storage (before we acknowledge a write) are multi-tenant, and contain records from different streams, lets us write frequently while still targeting ideal (w/r/t price and performance) blob sizes for S3 Standard and Express PUTs respectively.
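A toy illustration of that multi-tenant chunking (all names hypothetical, not the real implementation): records from different streams accumulate in one buffer, get flushed as a single blob once a target size is reached, and only then are the individual writes acknowledged.

```python
class ChunkBuffer:
    """Toy multi-tenant chunk: batches records from many streams into
    one blob so each object-storage PUT hits an ideal size."""

    def __init__(self, put_blob, target_bytes=512 * 1024):
        self.put_blob = put_blob        # e.g. one S3 PUT per chunk
        self.target_bytes = target_bytes
        self.pending = []               # (stream_id, record, ack_callback)
        self.size = 0

    def append(self, stream_id, record, ack):
        self.pending.append((stream_id, record, ack))
        self.size += len(record)
        if self.size >= self.target_bytes:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One PUT carries records from many streams, tagged by stream_id.
        blob = b"".join(sid.encode() + b"\x00" + rec
                        for sid, rec, _ in self.pending)
        self.put_blob(blob)
        for _, _, ack in self.pending:  # acknowledge only after the PUT
            ack()
        self.pending, self.size = [], 0

blobs, acks = [], []
buf = ChunkBuffer(blobs.append, target_bytes=8)
buf.append("user-1", b"hello", ack=lambda: acks.append("a"))
buf.append("user-2", b"world", ack=lambda: acks.append("b"))
# One blob now holds both streams' records, and both writes are acked.
```

A real system would also flush on a deadline (as mentioned elsewhere in the thread) so low-traffic streams aren't stuck waiting for the size threshold.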

evantbyrne|1 year ago

Seems like really cool tech. Such a bummer that it is not source-available. I might be in the minority with this opinion, but I would absolutely consider commercial services where the core tech is all released under something like an FSL with fully supported self-hosting. Otherwise, the lock-in vs. something like Kafka is hard to justify.

shikhar|1 year ago

(Founder) We are happy for the S2 API to have alternate implementations; we are considering open-sourcing an in-memory emulator ourselves. It is not a very complicated API. If you would prefer to stick with the Kafka API but benefit from features like S2's storage classes, a very large number of topics/partitions, or high throughput per partition, we are planning an open-source Kafka compatibility layer that can be self-hosted, with features like client-side encryption so you can have even more peace of mind.

throwawayian|1 year ago

I look at the egress costs to internet and it doesn’t check out. It’s a premium product dependent on DX, marketed to funded startups.

But if I care about ingress and egress costs, which many stream-heavy infrastructure operators do... this doesn't add up.

I wish them luck, but I feel they would have had a much better chance from the start by getting some funding and having a loss leader start, then organising and passing on wholesale rates from cloud providers once they’d reached critical mass.

Instead they’re going in at retail which is very spicy. I feel like someone will clone the tech and let you self host, before big players copy it natively.

It’s a commodity space and they’re starting with a moat of a very busy 2 weeks from some Staff engineers at AWS.

shikhar|1 year ago

(Founder) Thanks for sharing your thoughts. We are early and figuring things out. I agree egress cost is going to be a big concern. We want to do the best we can for users as we unlock some scale. During preview, we are focused on getting feedback so the service is free (we will need to talk if the usage is significant though).

h05sz487b|1 year ago

Just you wait, I am launching S1 next year!

graypegg|1 year ago

Ok good, my startup S½ (also known as Ç) is still unique, phew

Lucasoato|1 year ago

Wow, imagine Debezium offering native compatibility with this: capturing changes from a Postgres database and saving them as Delta or Iceberg in a pure serverless way!

bushido|1 year ago

I wish more dev-tools startups would focus on clearly explaining the business use cases, targeting a slightly broader audience beyond highly technical users. I visited several pages on the site before eventually giving up.

I can sort of grasp what the S2 team is aiming to achieve, but it feels like I’m forced to perform unnecessary mental gymnastics to connect their platform with the specific problems it can solve for a business or product team.

I consider myself fairly technical and familiar with many of the underlying concepts, but I still couldn’t work out the practical utility without significant effort.

It’s worth noting that much of technology adoption is driven by technical product managers and similar stakeholders. However, I feel this critical audience is often overlooked in the messaging and positioning of developer tools like this.

shikhar|1 year ago

(Founder) Appreciate the feedback. We will try to do a better job on the messaging. It is geared toward being a building block for data systems. The landing page has a section talking about some of the patterns it enables (Decouple / Buffer / Journal) in a serverless manner, with example use cases. It just may not be something that resonates with you, though! We are interested in adoption by developers for now.

8n4vidtmkvmk|1 year ago

If you ever figure it out, LMK. I don't think I've ever looked at logs more than about 24 hours old. Persistence and durability is not something I care about.

Errors, OTOH, I need a week or two of. But I consider these 2 different things. Logs are kind of a last resort when you really can't figure out what's going on in prod.

rswail|1 year ago

"Replace our MSK clusters and EBS storage with S3 storage costs."

CodesInChaos|1 year ago

1. Do you support compression for data stored in segments?

2. Does the choice of storage class only affect chunks or also segments?

To me the best solution seems to be storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so. But I assume that would require significant engineering effort for applications that require data to be replicated to several AZs before acknowledging it. Though some applications might be willing to sacrifice 1s of writes on node failure in exchange for cheap and fast writes.

3. You could be clearer about what "latency" means. I see at least three different latencies that could be important to different applications:

a) time until a write is durably stored and acknowledged

b) time until a tailing reader sees a write

c) time to first byte after a read request for old data

4. How do you handle streams which are rarely written to? Will newly appended records to those streams remain in chunks indefinitely? Or do you create tiny segments? Or replace an existing segment with the concatenated data?

shikhar|1 year ago

(Founder) Thanks for the deep questions!

1) Storage is priced on uncompressed data. We don't currently compress segments.

2) It only affects chunk storage. We do have a 'Native' chunk store in mind, the sketch involves introducing NVMe disks (as a separate service the core depends on) - so we can offer under 5 millisecond end-to-end tail latencies.

3) The append ack latency and end-to-end latency with a tailing reader are largely equivalent for us, since the latest writes are held in memory for a brief period after acknowledgment. If you try the CLI ping command (see the GIF on the landing page) from the same cloud region as us (AWS us-east-1 only, currently), you'll see that end-to-end and append ack latency are basically the same. TTFB for older data is roughly the TTFB to get a segment data range from object storage, so it can be a few hundred milliseconds.

4) We have a deadline to free chunks, so we PUT a tiny segment if we have to.

jgraettinger1|1 year ago

> To me the best solution seem like combining storing writes on EBS (or even NVMe) initially to minimize the time until writes can be acknowledged, and creating a chunk on S3 standard every second or so.

Yep, this is approximately Gazette's architecture (https://github.com/gazette/core). It buys the latency profile of flash storage, with the unbounded storage and durability of S3.

An addendum is there's no need to flush to S3 quite that frequently, if readers instead tail ACK'd content from local disk. Another neat thing you can do is hand bulk historical readers pre-signed URLs to files in cloud storage, so those bytes don't need to proxy through brokers.
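A minimal sketch of that disk-then-object-storage write path (all names hypothetical): appends are fsync'd to a local spool file before being acknowledged, and a completed spool is rolled up to object storage as one larger file.

```python
import os
import tempfile

class SpoolingLog:
    """Toy version of the disk-then-S3 write path: acks come from local
    disk durability; full spool files are rolled to object storage."""

    def __init__(self, upload, roll_bytes=1024):
        self.upload = upload            # e.g. an S3 PUT of the spool file
        self.roll_bytes = roll_bytes
        self.spool = tempfile.NamedTemporaryFile(delete=False)

    def append(self, record: bytes) -> None:
        self.spool.write(record)
        self.spool.flush()
        os.fsync(self.spool.fileno())   # durable on local disk -> ack now
        if self.spool.tell() >= self.roll_bytes:
            self.roll()

    def roll(self) -> None:
        """Ship the completed spool to object storage, start a fresh one."""
        self.spool.close()
        with open(self.spool.name, "rb") as f:
            self.upload(f.read())
        os.unlink(self.spool.name)
        self.spool = tempfile.NamedTemporaryFile(delete=False)

uploads = []
log = SpoolingLog(uploads.append, roll_bytes=10)
for rec in (b"aaaa", b"bbbb", b"cccc"):
    log.append(rec)
# uploads now holds one 12-byte blob; every append was acked from disk.
```

This omits the replication across brokers that provides the actual durability guarantee, and a real broker would do the upload asynchronously rather than inline.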

johnrob|1 year ago

This is a very interesting abstraction (and service). I can’t help but feature creep and ask for something like Athena, which runs PrestoDB (map reduce) over S3 files. It could be superior in theory because anyone using that pattern must shoehorn their data stream (almost everything is really a stream) into an S3 file system. Fragmentation and file packing become requirements that degrade transactional qualities.

shikhar|1 year ago

(Founder) There are definitely some interesting possibilities. Pretty hyped about S3 Table (Iceberg) buckets. Use an S2 stream to buffer small writes so you can flush decent-sized Parquet files into the table and avoid compaction costs.

nextworddev|1 year ago

This is cool but I think it overlaps too much with something like Kinesis Data Streams from AWS which has been around for a long time. It’s good that AWS has some competition though

shikhar|1 year ago

(Founder) We plan to be multi-cloud over time. Kinesis has a pretty low ordered throughput limit (i.e. at the level of a stream shard) of 1 MBps. If you need higher, S2 will be cheaper and faster than Kinesis with the Express storage class. S2 also has a more serverless pricing model - closer to S3 - than paying for stream shard hours.

jcmfernandes|1 year ago

In the long-term, how different do you want to be from Apache Pulsar? At the moment, many differences are obvious, e.g., Pulsar offers transactions, queues and durable timers.

shikhar|1 year ago

(Founder) We want S2 to be focussed on the stream primitive (log if you prefer). There is a lot that can be built on top, which we mostly want to do as open source layers. For example, Kafka compatibility, or queue semantics.

behnamoh|1 year ago

so the naming convention for 2024-25 products seems to be <letter><number>.

o1, o3, s2, M4, r2, ...

bawolff|1 year ago

In terms of a pitch, i'm not sure i understand how this differs from existing solutions. Is the core value proposition a simpler api?

shikhar|1 year ago

(Founder) Besides simple API,

- Unlimited streams. Current cloud systems limit you to a few thousand; with dedicated clusters, a few hundred K? If you want a stream per user, you are now dealing with multiple clusters.

- Elastic throughput per stream (i.e. a partition in Kafka terms): 125 MiBps append / 500 MiBps realtime read / unlimited in aggregate for catching up. Current systems will cap you at tens. And we may grow that limit yet. We are able to live-migrate streams in milliseconds while keeping pipelined writes flowing, which gives us a lot of flexibility.

- Concurrency control mechanisms (https://s2.dev/docs/stream#concurrency-control)
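Setting the linked docs aside, the general pattern for concurrency control on a log can be sketched like this (toy code, not the S2 API): the writer states which sequence number it expects its record to land at, and the append fails if another writer got there first.

```python
class Stream:
    """Toy log with conditional appends (optimistic concurrency control).
    The real S2 API differs; all names here are hypothetical."""

    def __init__(self):
        self.records = []

    def append(self, record, match_seq=None):
        """Append `record`; if `match_seq` is given, fail unless it equals
        the next sequence number (i.e. nobody else wrote in between)."""
        next_seq = len(self.records)
        if match_seq is not None and match_seq != next_seq:
            raise RuntimeError(f"conflict: expected seq {match_seq}, at {next_seq}")
        self.records.append(record)
        return next_seq

s = Stream()
s.append("a")                  # lands at seq 0
s.append("b", match_seq=1)     # ok: no concurrent writer snuck in
# s.append("c", match_seq=1)   # would raise: seq 1 is already taken
```

The same primitive also supports fencing: a writer that has been superseded keeps failing its conditional appends instead of silently interleaving with its replacement.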

adverbly|1 year ago

Seems really good for IoT no? Been a while since I worked in that space, but having something like this would have been nice at the time.

shikhar|1 year ago

(Founder) So many possibilities! That's what I love about building building blocks. I think we will create an open-source layer for an IoT protocol over time (unless the community gets to it first), e.g. MQTT. I have to admit I don't know too much about the space.

cultofmetatron|1 year ago

I had an idea like this a few years ago: basically exposing a stream interface to a cloud-based FS to enable random-access seeking on byte streams. I envisioned it being useful for things like loading large files; it would be amazing for enabling things like cloud gaming, image processing, and CAD.

Kudos for sitting down and making it happen!

siliconc0w|1 year ago

Definitely a useful API but not super compelling until I could store the data in my own bucket

ComputerGuru|1 year ago

So is this a "serverless" named-pipe-as-a-service cloud offering? Or am I misreading?

38|1 year ago

Yep. Just tack "serverless" onto something that already exists and charge for it

unsnap_biceps|1 year ago

I really liked the landing page and the service, but it took me a while to realize it wasn't an AWS service with a snazzy landing page.

dragonwriter|1 year ago

Apparently this is "S2, a new S3 competitor", not "S2, the spatial index system based on hierarchical quadrilaterals".

zffr|1 year ago

How does this compare to Kafka? Is the primary difference that this is a hosted solution?

tdba|1 year ago

Is it possible to bring my own cloud account to provide the underlying S3 storage?

shikhar|1 year ago

(Founder) Not currently! We want to explore this.

rswail|1 year ago

Really interesting service and bookmarked.

I'd really love this extending more into the event sourcing space not just the log/event streaming space.

Dealing with problems like replay and log compaction etc.

Plus things like dealing with old events. Under GDPR, removing personal information/isolating it from the data/events themselves in an event sourced system are a PITA.

shikhar|1 year ago

(Founder) An S2 stream is a durable log and can be replayed! We do want to add compaction support. Event sourcing is a great use case for S2.

kdazzle|1 year ago

Would this be like an alternative to Delta? Am I thinking about that right?

nikolay|1 year ago

Pretty bad branding! It should have at least been S4!

BaculumMeumEst|1 year ago

S2 is, in my opinion, the sweet spot of PRS's lineup.

veqq|1 year ago

Related to an old comment of yours:

> I also kind of strongly dislike HtDP.

I'm researching programming pedagogy and I'm curious about your thoughts on this.

ThinkBeat|1 year ago

This would sell much better if it were S5 or S6, the next-level thing.

Wow man, are you still stuck on S3?

locusofself|1 year ago

"Making the world a better place through streamable, appendable object streams"

aorloff|1 year ago

Kafka as a service ?

shikhar|1 year ago

(Founder) Nope! We have a FAQ for this ;)

ms7892|1 year ago

Can someone tell me what this does? And why it's better.

shikhar|1 year ago

(Founder) There is a table on the landing page https://s2.dev/ which hopefully gives a nice overview :) It's like S3, but for streams. Cheap appends, and instead of dealing with blocks of data and byte ranges, you work with records. S2 takes care of ordering records, and letting you read from anywhere in the stream.

This is an alternative to systems like Kafka, which don't do a great job of giving a serverless experience.
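To make the "records instead of byte ranges" distinction concrete, here's a toy in-memory model (not the real SDK): you append records, get back sequence numbers, and read in order from any position.

```python
class Log:
    """Toy model of a stream: ordered records addressed by sequence
    number, rather than a blob addressed by byte ranges."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Cheap append; returns the assigned sequence number."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, start_seq=0, limit=None):
        """Read in order from anywhere in the stream."""
        end = None if limit is None else start_seq + limit
        return self._records[start_seq:end]

log = Log()
for event in ["signup", "login", "purchase"]:
    log.append(event)
print(log.read(start_seq=1))  # ['login', 'purchase']
```

The service's job is making this abstraction durable and serverless: the ordering, addressing by sequence number, and read-from-anywhere semantics are the API, while byte layout in object storage stays an internal detail.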

alanfranz|1 year ago

Sort of serverless Kafka, which natively uses object storage and promises better latencies than things like warpstream.

revskill|1 year ago

Serverless pricing to me is exactly like ETH gas pricing!