top | item 21081756

Rudder, an open source Segment alternative

309 points| feross | 6 years ago |github.com | reply

99 comments

[+] martingxx|6 years ago|reply
This looks great, and I look forward to finding time to try it soon.

However, I believe it's misleading to call it "open source". The SSPL license is not generally considered to be an open source licence by any meaningful definition; in particular, it does not meet the OSI definition and is incompatible with most licences that do.

I understand the need these days to protect against aggressive cloud providers, but there are other ways to achieve that without becoming completely non-open-source, such as the BSL.

(See https://opensourceforu.com/2019/06/cockroach-labs-changes-it... )

[+] soumyadeb|6 years ago|reply
Thanks for the pointer - will check it out. We are total novices on this - we don't even have a license attorney. We just picked SSPL because that's what everyone seemed to suggest to prevent the likes of AWS from cloning it. Now that we've got some visibility, we will take a careful look at this issue.

But at heart we want to build an open-source community while still being a viable business, along the lines of Mattermost, Elastic, etc.

[+] ensignavenger|6 years ago|reply
The BSL is not open source, but it appears to be a commitment to eventually releasing code as open source.
[+] tomnipotent|6 years ago|reply
I want to see this married with the Meltano project from Gitlab - it would create an unprecedented end-to-end data environment.

https://meltano.com/

[+] soumyadeb|6 years ago|reply
Love the thought. We weren't aware of this project - thanks for the pointer. Will follow-up.
[+] tankster|6 years ago|reply
Segment has done an awesome job of building a great product but it is impossible to be provably secure and private given that they host all the data. Rudder can avoid situations such as the Segment Security incident that happened recently. https://segment.com/security/bulletins/incident090519/

Very excited about this project!

[+] vollmarj|6 years ago|reply
As a longtime segment customer, I am excited to see this exist. Segment's product is fairly good but their pricing model is really bad. We recently ran into an issue where they were billing us for almost 10x the number of users we were actually tracking. It was a nightmare that required a few months of back and forth with their support trying to get it fixed. In the end, they gave us a partial refund for the overages but we had to do a lot of technical work to resolve the issue ourselves.

It would be nice to have an open source alternative that doesn't get you locked into an unpredictable pricing model that you have very little control over.

[+] soumyadeb|6 years ago|reply
Some of our initial pilots had similar problems. Segment's pricing also doesn't work when you have a lot of free users. Those (pricing and privacy) are the two pain points we hope to address with this.

Happy to help you try this out if you want (please email [email protected]). We are behind on the number of integrations (vs. Segment) and features, but we will catch up. And we're hoping to get community support on that.

[+] torpedolaser|6 years ago|reply
We are in the mobile game industry. Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform. No other Segment-like tool can do this for us. We have been working with Rudder Labs to solve this problem. They have been really helpful and respond super fast to our requests and suggestions. With the Rudder Labs SDK, we are able to join event data on the platform (which we chose to host internally), plus we get all the other Segment-style features. Besides that, since we run a freemium game, most of our users are free users/non-payers, so cost is another thing we have to consider; current Segment market pricing is way too high for us. Rudder Labs solves this problem for us as well. Great deal!
[+] dehrmann|6 years ago|reply
> Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform.

Don't gloss over the fact that telemetry over cell networks can be costly for users (more and more plans are unlimited, so this doesn't worry me as much) and draining on batteries. However you do it, data that's not latency-critical should be buffered and batched.
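The buffer-and-batch approach suggested above can be sketched roughly like this (a minimal illustration, not any particular SDK's code; all names are hypothetical):

```javascript
// Minimal client-side event buffer: flush when the batch is full or when a
// timer fires, so one network call carries many events instead of one each.
// The network send is stubbed out via a callback; a real SDK would POST the
// batch to its collector endpoint.
class EventBuffer {
  constructor(maxBatch = 20, flushMs = 30000, send = (batch) => {}) {
    this.maxBatch = maxBatch;
    this.flushMs = flushMs;
    this.send = send;
    this.queue = [];
    this.timer = null;
  }

  track(event) {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) {
      this.flush(); // batch is full: send immediately
    } else if (!this.timer) {
      // otherwise make sure a flush happens within flushMs
      this.timer = setTimeout(() => this.flush(), this.flushMs);
    }
  }

  flush() {
    if (this.timer) { clearTimeout(this.timer); this.timer = null; }
    if (this.queue.length === 0) return;
    this.send(this.queue); // one network call for the whole batch
    this.queue = [];
  }
}
```

Latency-critical events would bypass the buffer; everything else amortizes the radio wake-up and request overhead across the batch.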

[+] soumyadeb|6 years ago|reply
Thanks for being our first pilot and for all your patience while we were fixing/improving the product.
[+] soumyadeb|6 years ago|reply
One of the authors here. This post was a pleasant surprise!! Happy to answer any questions. Or please feel free to reach out to me at [email protected].
[+] tyri_kai_psomi|6 years ago|reply
You put a spotlight on this by saying "privacy and security focused alternative to segment".

Are there things you believe make Segment not privacy and security focused? As a long-time user of Segment, I find their Protocols feature and new data and privacy features world class for this.

Having also just left their Synapse conference, privacy and security was the #1 topic of discussion throughout the conference. I would say they are very much privacy and security focused.

Not trying to shill, but it comes off as maybe you are misrepresenting segment a little bit. "Open source segment alternative" would have probably been just fine.

[+] AndrewKemendo|6 years ago|reply
This is awesome, I've been hoping someone would make this! And being enterprise focused is even better.

Thanks!

[+] cyberferret|6 years ago|reply
I am a happy Segment user, however their handling of a recent data breach did leave a bit of a sour taste. They took several days to get back to me with a definitive answer as to whether our customer data that we collected was compromised by the internal breach.

I am keen to look at competitive product where we may have more control over the data collected and can manage the risk ourselves.

[+] beager|6 years ago|reply
So what's the pricing model? Your site lists "pricing" and shows no info.

- Are you charging for support?

- Do you/will you have a paid enterprise tier that will increasingly be the only tier with a viable feature set?

- What's keeping you from dumping on Segment's market until you hit traction then ratcheting up to Segment's pricing?

- Who are you? Who are your investors?

[+] soumyadeb|6 years ago|reply
Pricing: Honestly, we haven't figured out the business model yet. Like other open-source products, it will likely be a combination of support + enterprise features (like HA, auto-scaling, etc.), but again we don't know what those enterprise features would be.

Ratcheting up pricing: Good question, and I'm not sure how to answer. Our vision is to be like other open-source companies such as Mattermost and Elastic. Our base version (which would work for 90% of users) would be free and under an open-source license. But I do understand your concern - maybe there is a way to put that in the license (that the base version will be perpetually open-source).

- Here is our company page (https://www.linkedin.com/company/rudderlabs). Our lead investor is S28 capital (Partner: Shvet) - they have also invested in Mattermost (an open-source slack competitor)

[+] sundbry|6 years ago|reply
Is it really so difficult for engineers to create a task to process a Kafka topic? It takes one day to write a program to consume from a topic of events and push to an API like Amplitude, and you have total flexibility in how you push to those integrations.

Why would you use Postgres for an event processing system? This seems like an inefficient architecture.

[+] soumyadeb|6 years ago|reply
Great question. Complications arise because of failures: one or more destinations may be down for any length of time, individual payloads may be bad, etc. To handle all of these you need to retry with timeouts while not blocking other events to other destinations. Also, not all events may be going to all destinations.

We built our own streaming abstraction on top of Postgres. Think layman's leveled compaction. We will write a blog on that soon. The code (jobsdb/jobsdb.go) has some comments too in case you want to check it out. Segment had a similar architecture and a blog post on it, but I can't seem to find it. Also, eventually we will replace Postgres with something lower-level like RocksDB or even native files.

Yes, in theory you can use Kafka's streaming abstraction and create a topic per destination. Two reasons we didn't go that route:

1) We were told Kafka is not easy to support in an on-prem environment. We are not Kafka experts, but we paid heed to people who have designed and shipped such on-prem software.

2) More importantly, for a given destination, we have dozens of writers all reading from the same stream. The only ordering requirement is that events from a given device (end consumer) are in order, so we assign the same user to the same writer. However, the writers themselves are independent. If a payload fails, we just block events from that user while other users continue. Blocking the whole stream for that one bad payload (retried 4-5 times) would slow things down quite a bit. If we had to achieve the same abstraction on Kafka, we would have had to create dozens of topics per destination.

[+] indianCoder|6 years ago|reply
How would you get ordering for millions of users? How many Kafka topics would you create? How would you manage failed events - would you reorder the whole queue?

I don't think it is inefficient. Segment blog linked below talks about specifics of the problem.

[+] Roritharr|6 years ago|reply
Interesting - we've built an in-house solution that is also Go based and also writes to a Postgres DB besides forwarding events, but it is much simpler: no UI, and it comes with backend SDKs already.

What I found interesting is that you quoted 3k events per second on a rather beefy 2xlarge machine. Our version is MUCH less demanding; I wonder if there isn't a lot of performance left on the table here.

I'll keep this in mind once we've grown out of our solution, though.

[+] soumyadeb|6 years ago|reply
The bottleneck for us (on that instance) is not Postgres but transformations. Transformations are tiny snippets of JavaScript which convert the event from Rudder JSON to whatever structure (JSON, keyval, etc.) the destination expects. We also support user-defined transformations - functions defined by the user to transform/enhance the event.

Currently, transformations run in Node.js. So for every batch of events there is a call into Node.js from Go, and that is slow. We do batching/parallel calls, but still.

I think Postgres itself gets us >15K events/sec of throughput.
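For context, a destination transformation of the kind described above might look roughly like this (a hypothetical sketch, not Rudder's actual GA/Amplitude code; all field names are made up):

```javascript
// Convert a Rudder-style track event into the flat key/value payload a
// destination API might expect, flattening nested properties into dotted keys.
function transformForDestination(event) {
  return {
    user_id: event.userId,
    event_name: event.event,
    timestamp: event.timestamp,
    ...Object.fromEntries(
      Object.entries(event.properties || {}).map(([k, v]) => [`prop.${k}`, v])
    ),
  };
}
```

Running thousands of such snippets per second across a Go/Node.js boundary is where the per-batch call overhead mentioned above comes from.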

[+] yodi|6 years ago|reply
This is awesome! I never knew anyone had built something like Segment. I will try this alternative at our company and will keep you updated here about the results.
[+] sails|6 years ago|reply
How does this compare to Snowplow?

https://github.com/snowplow/snowplow

[+] sumanthpur|6 years ago|reply
Thanks for asking - I am one of the authors of Rudder. Snowplow is a great analytics tool, especially for internal analytics. It is open-source and on-prem, and it keeps your data private. It is centered around enriching events and storing them in a data warehouse.

We are aiming at routing events reliably to destinations, transforming events in real time, storing them in your data warehouse with a dynamic schema, and eventually building a data platform with help from the community.

[+] namanyayg|6 years ago|reply
I've been using Snowplow for 2M+ events/mo and can still remember the pains of setting it up. Plus, there's a fixed schema required.

I haven't looked into rudder but I'll switch if it offers easier setup and schema.

[+] indianCoder|6 years ago|reply
One issue I had with Segment: I couldn't run real-time transformations of the event to join data from our data tables. We eventually got around it with AWS Lambda and sending the result back to Segment. Segment recently announced Functions to help with this, but I still could not get my hands on it.

Any plans on this?

[+] soumyadeb|6 years ago|reply
Yes, that is exactly the use case for our "user-defined" transformation functions. You can define any JavaScript function (right now by modifying the code, but it will be available from the UI in the release coming next week). Inside that function you can filter/transform/enhance the event in any way you like. You can look up your DB, call external APIs, etc. You can also combine multiple events into one.

Since this whole thing runs inside your VPC, you don't have to open up your production database to a 3rd party as you do with Segment.
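As a rough illustration of what such a user-defined function could look like (a hypothetical sketch, assuming the function receives one event and returns the transformed event, or null to drop it; the lookup table stands in for a real DB or API call):

```javascript
// Hypothetical user-defined transformation: drop internal test traffic and
// enrich each event with a field looked up from a local table. In practice
// the lookup could hit your own database or an external API instead.
const planByUser = new Map([['u42', 'enterprise']]); // stand-in for a DB table

function userTransform(event) {
  // filter: returning null drops the event entirely
  if (event.properties && event.properties.internal) return null;
  // enrich: attach the user's plan from the lookup table
  return {
    ...event,
    properties: {
      ...event.properties,
      plan: planByUser.get(event.userId) || 'free',
    },
  };
}
```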

Happy to work with you on your use case. Please email [email protected]

We wrote a couple of blog posts on that.

Case Study: https://rudderlabs.com/customer-case-study-casino-game/

Transformation Details:

https://rudderlabs.com/transformations-in-rudder-part-1/

https://rudderlabs.com/transformations-in-rudder-part-2/

[+] NuclearTide|6 years ago|reply
Hey there! I work at Segment, and I'm one of the engineers working on Segment Functions. If you let me know your Segment workspace's name (get in touch through [email protected]), I can grant you beta access.
[+] fuzzyfroghunter|6 years ago|reply
This looks great, nice work.

Is there an easy way for someone to set it up on their cloud infrastructure?

[+] soumyadeb|6 years ago|reply
Yes, this is exactly the use case we want to target. What's your cloud infra? We can easily run inside your AWS VPC. If you have your own private cloud, we can run there too - just need to disable the S3 dump. The only dependency we have is Postgres.

Happy to help you set up - please email [email protected] OR join our Slack: https://rudderlabs.herokuapp.com

[+] mushufasa|6 years ago|reply
also, segment was the O.G. segment alternative https://github.com/segmentio/analytics.js
[+] soumyadeb|6 years ago|reply
Analytics.js was only a client/browser-side utility. They developed an entire backend stack later - which is needed for data-warehouse dumps, event replay, etc. We are also developing a complete backend stack with all those features. This is still a work in progress.
[+] drixta|6 years ago|reply
We're lucky to find and deploy this project early at our startup. Being a cybersecurity company, we cannot have out customers data leave our aws account.
[+] rixed|6 years ago|reply
> we cannot have out customers data leave our aws account.

Unintentionally funny?

[+] amelius|6 years ago|reply
If every successful business is eventually copied by open source, can we really blame big companies for protective practices such as customer lock-in?
[+] HillRat|6 years ago|reply
It’s licensed under SSPL, so the definition of “open source” may vary. (Specifically, they have an enterprise business model, so this is more properly a proprietary, source-available on-prem Segment competitor.)
[+] dv_dt|6 years ago|reply
If there is no competition, markets cannot work as intended.
[+] jarfil|6 years ago|reply
Yes. Customer services, as in both support and feature development, should be enough to keep paying customers paying. If a bunch of "use at your own risk" software can replace it, what's the actual value of a business?
[+] platform|6 years ago|reply
Very interesting and approachable solution.

With regards to: >" Rudder runs as a single go binary with Postgres. It also needs the destination (e.g. GA, Amplitude) specific transformation code which are node scripts. "

what would be a recommended approach if I would like to keep the data internally and not use an external analytics engine?

[+] soumyadeb|6 years ago|reply
We also have a S3 destination so you can just add S3.

Support for other data-warehouses (Redshift, Bigquery etc) is coming soon.

[+] mychael|6 years ago|reply
Segment IO is currently being blocked on browsers with adblock. Does this mean it will work even with adblock?