
HTTP Feeds

214 points | mcp_ | 4 years ago | http-feeds.org

95 comments

[+] lexicality|4 years ago|reply
Feels a bit silly not to have made this compatible with SSE, since all the mechanisms for accessing that are built into browsers these days.

https://developer.mozilla.org/en-US/docs/Web/API/Server-sent... https://html.spec.whatwg.org/multipage/server-sent-events.ht...

[+] rendall|4 years ago|reply
> "Feels like a bit silly not to have made this compatible with SSE ..."

The feedback is great, and valid. I wish it could have been expressed less dismissively though.

[+] r3trohack3r|4 years ago|reply
Had never heard of SSE before. Followed the link and found this warning:

> Warning: When not used over HTTP/2, SSE suffers from a limitation to the maximum number of open connections, which can be especially painful when opening multiple tabs, as the limit is per browser and is set to a very low number (6). The issue has been marked as "Won't fix" in Chrome and Firefox. This limit is per browser + domain, which means that you can open 6 SSE connections across all of the tabs to www.example1.com and another 6 SSE connections to www.example2.com (per Stackoverflow). When using HTTP/2, the maximum number of simultaneous HTTP streams is negotiated between the server and the client (defaults to 100).

This is the first time I’ve heard of per-domain connection limits. Seems… not great? Doesn’t this turn into a client side DoS? User opens 6+1 tabs and now the browser has exhausted all HTTP connections to your domain serving SSE connections?

I’ve used long polling before, I don’t understand how I’ve never observed the connection limit of 6…

[+] aiobe|4 years ago|reply
Hi! I am the author of http-feeds.org. Thank you for your feedback.

For this spec I aimed to keep it as simple as possible. And plain polling-based JSON endpoints are the simplest and most robust endpoints, IMHO.

If you need to, you could implement an SSE representation on the server endpoint via proper content negotiation.

The main reason why I dropped SSE is the lack of proper back pressure, i.e. what happens when a consumer consumes slower than the server produces messages. Plus, it is quite hard to debug SSE connections, e.g. no support by Postman and other dev tools. And long-lasting HTTP connections are still a problem in today's infrastructure. E.g. there is currently no support for SSE endpoints in the DigitalOcean App Platform, and I am not sure about them in Google Cloud Run.

Overall, plain GET endpoints felt much simpler.
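The polling model the author describes can be sketched as a minimal simulation. The `fetch_feed` helper and the event shape below are illustrative stand-ins, not the actual http-feeds.org API:

```python
# Simulated server-side event log; a real feed would serve this over HTTP GET.
EVENTS = [{"id": f"e{i}", "type": "example.event", "data": {"n": i}} for i in range(5)]

def fetch_feed(last_event_id=None, limit=2):
    """Stand-in for GET /events?lastEventId=...; returns the next batch."""
    start = 0
    if last_event_id is not None:
        start = [e["id"] for e in EVENTS].index(last_event_id) + 1
    return EVENTS[start:start + limit]

def consume_all():
    """Plain polling loop: ask with the last seen id until the batch comes back empty."""
    seen, last_id = [], None
    while True:
        batch = fetch_feed(last_id)
        if not batch:
            break  # caught up; a real client would sleep and poll again
        seen.extend(batch)
        last_id = batch[-1]["id"]
    return seen
```

Back pressure falls out naturally here: the client only requests the next batch once it has processed the previous one.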

[+] technicolorwhat|4 years ago|reply
Yeah, we're also pushing for SSE for this kind of thing.
[+] josephg|4 years ago|reply
Very cool! This looks very similar to what we're doing with the braid spec[1], though I really like how clear and concise your examples are! This is a great little website.

Some differences between our approaches:

- I think it's a good idea to support arbitrary patch formats via a content-type style field, just like we have different formats for images. This lets you use the same protocol for things like collaborative editors.

- For some data sets (like CRDTs), you want each change to be able to refer to multiple "parents".

- Your protocol is very JSON-y and not very HTTP-y. It looks like you're basically using JSON to express HTTP. Why not just use HTTP? One big downside of the JSON approach is that it makes it awkward to transmit binary patches (e.g. 'patching' an image).

Feel free to reach out if you're up for a chat! Looks like we're working on the same problem.

[1] https://github.com/braid-org/braid-spec/blob/master/draft-to...

[+] 0des|4 years ago|reply
This made me all warm and fuzzy inside that nerds can still help other nerds even when we are on different teams. Hope y'all are able to share some ideas.
[+] brasetvik|4 years ago|reply
It'd be great to get a fairly standard way of doing this. :)

Having worked in this problem space a bit recently, I find this part a bit too optimistic:

> The event.id is used as lastEventId to scroll through further events. This means that events need to be strongly ordered to retrieve subsequent events.

The example relies on time-ordered UUIDv6 and mentions time sync as a gotcha. This should work well if you only have a single writer.

Even with perfectly synced clocks, anything that lets you do _concurrent_ writes can still commit out of order, though.

Consider two transactions in a single-node-and-trivially-clock-synced Postgres, for example. If the first transaction that gets the lower timestamp commits after a second transaction that gets a higher timestamp, the second and higher timestamp might've been retrieved by a consumer already (it committed, so it's visible after all), and now you've missed writes. This is also (at least for Postgres, but I guess also in general) true for sequences.

The approach I'm currently pursuing involves having an opaque cursor that encodes enough of the MVCC information (i.e. Postgres' txid_current and xip_list) to be able to catch those situations. For a client, the cursor is opaque and they can't see the internals. For the server side, it's quite implementation-specific, however. It still has the nice property that clients keep track of where they are, without the server keeping track of where the clients are, which is desirable if the downstream client can roll back (e.g. due to recovery/restore from backup).

A base64-encoded (possibly encrypted) cursor can wrap whatever implementation specifics are needed and hide them from the client. That implementation could of course be a simple event id if the writing side is strictly serial.
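A minimal sketch of such an opaque cursor, assuming JSON wrapped in URL-safe base64 (the `txid` and `xip` field names are illustrative, loosely echoing the Postgres functions mentioned above; a real implementation could carry anything):

```python
import base64
import json

def encode_cursor(state: dict) -> str:
    """Wrap implementation-specific resume state in an opaque, URL-safe token."""
    raw = json.dumps(state, sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(token: str) -> dict:
    """Server-side only: clients treat the token as a black box."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))
```

Swapping the JSON payload for a single event ID, or encrypting it, changes nothing for the client: it just echoes the token back.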

[+] 0des|4 years ago|reply
Perhaps the time problem can be handled with an eventually consistent Lamport clock system
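For reference, a Lamport clock is only a few lines: it orders events causally without any wall-clock synchronization (a minimal sketch, not part of the spec; a total order would additionally need a tie-breaker such as a node ID):

```python
class LamportClock:
    """Logical clock: timestamps respect causality, not wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Call before a local event (e.g. appending to the feed)."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Call when a message stamped by another node arrives."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```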
[+] raggi|4 years ago|reply
JSON inside server-sent events gives you a browser-native client, with no need for streaming extensions to your JSON lib (which are kinda rare), and various intermediaries already know not to buffer the content type. Sorry to be the "what about X" guy; maybe there's something I missed?
[+] tehbeard|4 years ago|reply
Yeah this feels very much like someone retyped the Server Sent Events spec and bolted an arbitrary spec on for each message...
[+] diordiderot|4 years ago|reply
Can someone ELI5 the significance of this?

* What are some scenarios where you would need a feed?

* What's being done now to solve the problem and how is this different?

Many thanks

[+] kitd|4 years ago|reply
This basically puts a REST/HTTP GET frontend on your messaging backend (Kafka, RabbitMQ, MQTT, etc).

So whatever events you want to expose publicly can be done over plain HTTP.

Having said that, IMO using SSE or websockets would be a better fit than raw HTTP.

[+] tiernano|4 years ago|reply
I could see this being used as part of a distributed system for shopping. For example, say you have 20 stores, each with its own storage area for items, but you also have 2 large warehouses. Your online site might need to know which stores have stock now, or which warehouses have stock. The site might allow you to do "click and collect", so it would need to know, in somewhat real time, which stores actually have the stock. Each store would have its own endpoint with a data feed that the central server can get data from. Same with online orders for delivery: it needs to know which warehouses have the stock, and if none do, how to get a store to ship it to a customer.

Likewise, the stores might need to know what is in the warehouses, or even across town; someone walks into a store to order something, but it's not in that store. But they know, in real time, that it's in the warehouse for delivery next day or that the store across the city has it.

What is done now is a more centralised approach; all sites would have a connection to a single DB in head office that stores everything. This makes things more distributed and, in theory, removes a single point of failure.

I should clarify, I am not working on any of this; this is just how I think it would work... if anyone wants to step in and tell me if I'm wrong, right, or just plain stupid, please shout.

[+] wrren|4 years ago|reply
I like this idea, however I think the event ID being encoded in the response body places constraints on what each element looks like. Perhaps it would make more sense to encode the last ID/feed position in a response header and have the client submit that in a subsequent request header? That would decouple the feed position from any one element or the response structure itself.
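A sketch of that header-based variant, with the server echoing the feed position in a response header. The `Last-Event-Id` header name and the dict-based request/response shapes are assumptions for illustration, not part of the spec:

```python
def handle_poll(events, request_headers, limit=10):
    """Serve the next batch; the feed position travels in headers, not the body."""
    last = request_headers.get("Last-Event-Id")
    start = 0
    if last is not None:
        start = next(i + 1 for i, e in enumerate(events) if e["id"] == last)
    batch = events[start:start + limit]
    response_headers = {}
    if batch:
        # The client submits this value back on its next request.
        response_headers["Last-Event-Id"] = batch[-1]["id"]
    return response_headers, [e["data"] for e in batch]
```

Note the decoupling: the body entries no longer need an `id` field visible to consumers; only the server-assigned position in the header matters.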
[+] technicolorwhat|4 years ago|reply
The idea is nice and needed. However, the spec is maybe a bit too elaborate for me to adopt immediately. I've been rolling my own for some time at some clients, for our kafkaesque/event sourcing patterns.

However, what I used there was a simple HTTP stream/JSON stream like this:

- No enclosing [], but newline-delimited JSON entries: a new line is a new entry

- Using anything as an ID (we've been using Redis XSTREAMs as lightweight Kafka concepts, just 64-bit integers)

- Have a type on each event, and versioning is just done by upgrading the type; ugly, but easy

- We're considering using SSE at this moment

Compaction is not something that I would do in the protocol. I think I would just expose another version of it on a different URL, or put it in a different spec.
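The newline-delimited variant described above is easy to parse incrementally; a minimal reader, assuming one JSON object per line with blank lines ignored:

```python
import io
import json

def parse_ndjson(stream):
    """Yield one decoded object per non-empty line, as lines arrive."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Example over an in-memory stream; over HTTP this would be the response body.
sample = io.StringIO('{"id": 1, "type": "created"}\n\n{"id": 2, "type": "updated"}\n')
```

Because each line is independent, a client can resume mid-stream after a dropped connection without any framing beyond the newline.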

[+] xg15|4 years ago|reply
How do you recover when there is e.g. some connection error and a client misses some events? Can the client ask to replay the events? Then, how far back can they go?
[+] xg15|4 years ago|reply
It's an interesting idea, but I wonder how well the promise will hold up that this is less complex than using a message broker.

Yes, the network topology and protocol are certainly less complex, but there are now additional strong requirements how an endpoint has to store and manage existing events. (See the bits about strict time ordering, compaction, aggregation, etc).

A lot of this is effectively what a message broker is doing in a "traditional" system to guarantee consistency. Those tasks aren't gone, they are just pushed to the endpoints now.

[+] aiobe|4 years ago|reply
A few issues with message brokers, esp. in the system-to-system integration:

- Security: In B2B scenarios or public APIs, would you open your broker to the WWW? HTTP has a solid infrastructure, including firewalls, DDoS defence, API gateways, certificate management, ...

- Organisational dependencies: Some team needs to maintain the broker (team 1, team 2, or a third platform team). You have a dependency on this team if you need a new topic, user, ... Who is on call when something goes wrong?

- Technology ingestion: A message broker ingests technology into the system. You need compatible client libraries, handle version upgrades, resilience concepts, learn troubleshooting...

[+] efunneko|4 years ago|reply
My thoughts exactly. While this is more human readable on the wire, a message broker delivering the feed would provide many different other features that might be useful, such as transactions, load-balancing, guaranteed delivery and per-endpoint state to simplify the individual application instances.

For those not aware of what message brokers are, there are many to choose from, such as Mosquitto, RabbitMQ, ActiveMQ, Solace... If delivery over HTTP is a requirement, many of these brokers support delivery over websockets or (in the case of Solace) also support long polling.

[+] sparsely|4 years ago|reply
This seems like a really attractive way of publishing events to a third party. Platform independent and easy to understand.
[+] dotancohen|4 years ago|reply

  > Platform independent and easy to understand.
The platform independence is implicit in the "HTTP" part of the name.
[+] xg15|4 years ago|reply
With more attention going to long polling again, I wonder if it would be useful to introduce some kind of HTTP signaling (header or 1xx status) to indicate on the protocol level that long polling is going on. This might be useful information for intermediaries - e.g. proxies, firewalls, browser network tab, etc.
[+] chaz6|4 years ago|reply
My first thought is that if the client does not specify a start ID, it could be sent billions of records, depending on the elapsed time and frequency of events. What would be the best way to avoid overloading a client?
[+] quaintdev|4 years ago|reply
I think the server will give a truncated response.

They need to clearly document what the behaviour is for this case.

[+] rendall|4 years ago|reply
This could form the foundation of a distributed social media network.
[+] outsomnia|4 years ago|reply
How about "catching up" on events, if a client had an outage, can it indicate the last event ID it had and then get a natural replay of the older events?
[+] tirpen|4 years ago|reply
That's the second example on the page.
[+] catwell|4 years ago|reply
I wrote a blog post advocating for something like this recently, with an additional push to notify the client when it should poll: https://blog.separateconcerns.com/2022-03-05-push-to-poll.ht...

I wasn't aware of this but it fits the use case perfectly, I will update the post.

[+] yencabulator|4 years ago|reply
I think serving data from multiple Kafka partitions gets unnecessarily hard if your "continue from this point" token is tied to a singular event ID. For that, it'd be better to have the "cursor token" be an arbitrary blob of data you repeat back to the server. Then the server can e.g. encode a list of partition->offset values into it.
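A sketch of such a multi-partition cursor, encoding a partition -> offset map into one opaque token (JSON + base64 is chosen here purely for illustration; any server-side encoding works, since the client only repeats the blob back):

```python
import base64
import json

def make_cursor(offsets: dict) -> str:
    """Encode a partition -> offset map as a single opaque token."""
    raw = json.dumps(offsets, sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode()

def read_cursor(token: str) -> dict:
    """Server-side decode; clients never interpret the token."""
    return json.loads(base64.urlsafe_b64decode(token.encode()))

def advance(token: str, partition: str, offset: int) -> str:
    """Return a new token with one partition's offset moved forward."""
    offsets = read_cursor(token)
    offsets[partition] = offset
    return make_cursor(offsets)
```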
[+] sdze|4 years ago|reply
I always implement my own standards. I am faster that way instead of learning what others created. Obviously this is only true for less complex subjects.
[+] adenozine|4 years ago|reply
See also: technical debt
[+] yashasolutions|4 years ago|reply
This looks great. Does anyone have any insights or references about the limits of this approach compared to more traditional message queues?
[+] dgritsko|4 years ago|reply
One obvious downside is that polling at a fixed interval is going to be less efficient than having an event "pushed" to your client as soon as it's available. If your events are low-volume, then there will be periods where your polling requests return no new data, but you still have to make those requests anyway in order to determine that. And conversely, if your events are high-volume, then your polling interval represents an arbitrary amount of delay that you're introducing into the system. That might not be a big deal for some applications, but it's worth mentioning as a potential downside in situations where you want the behavior to be as near-instantaneous as possible.
[+] talideon|4 years ago|reply
Way back I recall (ab)using chunked encoding to do something similar. Now there's an underexploited element of HTTP...