This looks great, and I look forward to finding time to try it soon.
However, I believe it's misleading to call it "open source". The SSPL is not generally considered an open-source licence by any meaningful definition; in particular, it does not meet the OSI definition and is incompatible with most licences that do.
I understand the need these days to protect against aggressive cloud providers, but there are other ways to achieve that without becoming completely non-open-source, such as the BSL.
Thanks for the pointer - will check it out. We are total novices on this - we don't even have a licensing attorney. We just picked the SSPL because that's what everyone seemed to suggest to prevent the likes of AWS from cloning it. Now that we have some visibility, we will take a careful look at this issue.
But at heart we want to build an open-source community while still being a viable business, in the vein of Mattermost, Elastic, etc.
Segment has done an awesome job of building a great product, but it is impossible to be provably secure and private given that they host all the data. Rudder can avoid situations such as the recent Segment security incident: https://segment.com/security/bulletins/incident090519/
As a longtime segment customer, I am excited to see this exist. Segment's product is fairly good but their pricing model is really bad. We recently ran into an issue where they were billing us for almost 10x the number of users we were actually tracking. It was a nightmare that required a few months of back and forth with their support trying to get it fixed. In the end, they gave us a partial refund for the overages but we had to do a lot of technical work to resolve the issue ourselves.
It would be nice to have an open source alternative that doesn't get you locked into an unpredictable pricing model that you have very little control over.
Some of our initial pilots had similar problems. Segment's pricing also doesn't work when you have a lot of free users. That (and privacy) are the two pain points we hope to address with this.
Happy to help you try this out if you want (please email [email protected]). We are behind Segment on the number of integrations and features, but we will catch up - and we're hoping for community support on that.
We are in the mobile game industry. Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform. No other Segment-like tool can do this for us. We have been working with Rudder Labs to solve this problem. They have been really helpful and act super fast on our requests and suggestions. With the Rudder Labs SDK, we are able to join event data on the platform (which we choose to host internally), plus we get all the other Segment-style features. Besides that, since we run a freemium game, most of our users are free users/non-payers, so cost is another thing we have to consider, and the current market pricing for Segment is way too high for us. Rudder Labs solves this problem for us as well. Great deal!
> Due to the incredibly high volume of event data each user generates per day, we need to join certain events together to reduce the number of events sent to our analytics platform.
Don't gloss over the fact that telemetry over cell networks can be costly for users (more and more plans are unlimited, so this doesn't worry me as much) and draining on batteries. However you do it, data that's not latency-critical should be buffered and batched.
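To illustrate the batch-and-buffer point, here is a minimal client-side sketch in Go. The `Batcher` type and its behavior are illustrative assumptions, not any actual SDK's API:

```go
package main

import "fmt"

// Batcher is an illustrative client-side buffer: events accumulate in
// memory and go out as one request per batch. A real SDK would also
// flush on a timer and when the app goes to the background.
type Batcher struct {
	size int        // flush threshold
	buf  []string   // pending events
	sent [][]string // stands in for network sends
}

// Add buffers an event and flushes when the batch is full.
func (b *Batcher) Add(event string) {
	b.buf = append(b.buf, event)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush sends whatever is buffered as a single batch.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.sent = append(b.sent, b.buf)
	b.buf = nil
}

func main() {
	b := &Batcher{size: 3}
	for i := 0; i < 7; i++ {
		b.Add(fmt.Sprintf("evt-%d", i))
	}
	b.Flush() // final partial batch before shutdown
	// 7 events left the device in 3 requests instead of 7.
	fmt.Println(len(b.sent))
}
```

Fewer radio wakeups per event is what saves battery; the trade-off is bounded extra latency for non-critical data.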
You put a spotlight on yourselves by saying "privacy and security focused alternative to Segment".
Are there things you believe make Segment not privacy and security focused? As a longtime user of Segment, I find their Protocols feature and new data and privacy features world-class for this.
Having also just left their Synapse conference, privacy and security were the #1 topic of discussion throughout. I would say they are very much privacy and security focused.
Not trying to shill, but it comes off as though you may be misrepresenting Segment a little bit. "Open source Segment alternative" would probably have been just fine.
I am a happy Segment user; however, their handling of a recent data breach did leave a bit of a sour taste. They took several days to get back to me with a definitive answer as to whether the customer data we had collected was compromised by the internal breach.
I am keen to look at a competing product where we may have more control over the data collected and can manage the risk ourselves.
Pricing: Honestly, we haven't figured out the business model yet. Like other open-source products, it will likely be a combination of support + enterprise features (like HA, auto-scaling, etc.), but again, we don't yet know what those enterprise features would be.
Ratcheting up pricing: Good question, and I'm not sure how to answer. Our vision is to be like other open-source companies such as Mattermost and Elastic. Our base version (which would work for 90% of users) would be free and under an open-source license. But I do understand your concern - maybe there is a way to put that in the license (i.e., that the base version will be perpetually open source).
- Here is our company page (https://www.linkedin.com/company/rudderlabs). Our lead investor is S28 Capital (partner: Shvet); they have also invested in Mattermost (an open-source Slack competitor).
Is it really so difficult for engineers to create a task to process a Kafka topic? It takes one day to write a program to consume from a topic of events and push to an API like Amplitude, and you have total flexibility in how you push to those integrations.
Why would you use Postgres for an event processing system? This seems like an inefficient architecture.
Great question. The complication arises because of failures. One or more destinations may be down for any length of time, individual payloads may be bad, etc. To handle all of this you need to retry with timeouts while not blocking events to other destinations. Also, not all events may be going to all destinations.
We built our own streaming abstraction on top of Postgres - think a layman's leveled compaction. We will write a blog post on that soon. The code (jobsdb/jobsdb.go) has some comments too in case you want to check it out. Segment had a similar architecture and a blog post on it, but I can't seem to find it. Eventually, we will replace Postgres with something lower-level like RocksDB or even native files.
Yes, in theory you can use Kafka's streaming abstraction and create a topic per destination. Two reasons we didn't go that route:
1) We were told Kafka is not easy to support in an on-prem environment. We are not Kafka experts, but we paid heed to people who have designed and shipped such on-prem software.
2) More importantly, for a given destination, we have dozens of writers all reading from the same stream. The only ordering requirement is that events from a given device (end consumer) stay in order, so we assign the same user to the same writer. However, the writers themselves are independent: if a payload fails, we block only that user's events while other users continue. Blocking the whole stream for one bad payload (retried 4-5 times) would slow things down quite a bit. To achieve the same abstraction on Kafka, we would have had to create dozens of topics per destination.
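The "same user goes to the same writer" assignment described above can be sketched with a stable hash. This is an illustrative sketch in Go, not Rudder's actual code; `numWriters` and `writerFor` are made-up names:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numWriters is illustrative; the real writer count is a tuning knob.
const numWriters = 8

// writerFor pins a user/device to one writer so that user's events
// stay ordered, while unrelated users ride independent lanes. A bad
// payload being retried then blocks only its own user's lane.
func writerFor(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % numWriters)
}

func main() {
	// The same device always lands on the same writer.
	fmt.Println(writerFor("device-123") == writerFor("device-123"))
	// Different devices are spread across the independent writers.
	fmt.Println(writerFor("device-123"), writerFor("device-456"))
}
```

This is the same idea as Kafka's key-based partitioning, but with the writer count decoupled from the number of topics.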
How would you get ordering for millions of users? How many Kafka topics would you create? How would you manage failed events - would you reorder the whole queue?
I don't think it is inefficient. The Segment blog linked below talks about the specifics of the problem.
Interesting - we've built our own in-house solution that is also Go-based and also writes to a Postgres DB besides forwarding events, but it's much simpler: no UI, and it comes with backend SDKs already.
What I found interesting is that you quote 3k events per second on a rather beefy 2xlarge machine. Our version is MUCH less demanding; I wonder if there isn't a lot of performance left on the table here.
I'll keep this in mind once we've grown out of our solution, though.
The bottleneck for us (on that instance) is not Postgres but transformations. Transformations are tiny snippets of JavaScript which convert the event from Rudder JSON to whatever structure (JSON, key-value, etc.) the destination expects. We also support user-defined transformations - functions defined by the user to transform/enhance the event.
Currently, transformations run in Node.js, so for every batch of events there is a call from Go into Node.js, and that is slow. We do batching/parallel calls, but still.
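The batching mentioned here amortizes the fixed cost of crossing the Go/Node.js process boundary: one expensive call carries many events. A rough sketch of the idea in Go (`Event` and `inBatches` are illustrative names, not Rudder's actual types):

```go
package main

import "fmt"

// Event stands in for a Rudder event payload.
type Event map[string]any

// inBatches splits events into fixed-size chunks so that each call
// into the Node.js transformer carries many events, not one.
func inBatches(events []Event, size int) [][]Event {
	var batches [][]Event
	for size < len(events) {
		events, batches = events[size:], append(batches, events[:size])
	}
	return append(batches, events)
}

func main() {
	events := make([]Event, 10)
	for i := range events {
		events[i] = Event{"id": i}
	}
	calls := 0
	for range inBatches(events, 4) {
		// One cross-process call per batch in the real system;
		// here we just count how many would be made.
		calls++
	}
	fmt.Println(calls) // 3 calls instead of 10
}
```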
This is awesome! I never knew anyone was building things like Segment. I will try this alternative solution at our company and will post updates here about the results.
Thanks for asking, I am one of the authors of Rudder.
Snowplow is a great analytics tool, especially for internal analytics. It is open source and on-prem and keeps your data private. It is centered around enriching events and storing them in a data warehouse.
We are aiming at routing events reliably to destinations, transforming them in real time, storing them in your data warehouse with a dynamic schema, and eventually building a data platform with help from the community.
Haha, it is https://www.rudderlabs.com. Unfortunately, we haven't done anything re: SEO/SEM. I think we have no-index set on our website (the default with WP Engine) - we'll fix it.
One issue I had with Segment: I couldn't run a real-time transformation of the event to join in data from our own data tables. We eventually worked around it with AWS Lambda and sending the result back to Segment. Segment recently announced Functions to help with this, but I still could not get my hands on it.
Yes, that is exactly the use case for our user-defined transformation functions. You can define any JavaScript function (right now by modifying the code, but it will be available from the UI in the release coming next week). Inside that function you can filter/transform/enhance the event in any way you like - look up your DB, call external APIs, etc. You can also combine multiple events into one.
Since this whole thing runs inside your VPC, you don't have to open up your production database to a 3rd party as you do with Segment.
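To illustrate the shape of such a user-defined function, here is a minimal sketch. It is written in Go purely for illustration (Rudder's actual transformations are JavaScript snippets), and the event fields and the `lookupPlan` helper are hypothetical:

```go
package main

import "fmt"

// Event stands in for a Rudder event payload.
type Event map[string]any

// lookupPlan is a hypothetical enrichment source; in a real
// deployment this could query your own database, which stays
// inside your VPC.
func lookupPlan(userID string) string {
	plans := map[string]string{"u1": "free", "u2": "pro"}
	if p, ok := plans[userID]; ok {
		return p
	}
	return "unknown"
}

// transform filters out internal test traffic and enriches the
// remaining events, mirroring what a user-defined transformation
// function can do (filter / enhance / combine).
func transform(events []Event) []Event {
	var out []Event
	for _, e := range events {
		if e["userId"] == "internal-test" {
			continue // filter: drop internal test events
		}
		e["plan"] = lookupPlan(e["userId"].(string)) // enhance
		out = append(out, e)
	}
	return out
}

func main() {
	in := []Event{
		{"userId": "u1", "event": "level_up"},
		{"userId": "internal-test", "event": "ping"},
	}
	out := transform(in)
	fmt.Println(len(out), out[0]["plan"]) // 1 free
}
```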
Happy to work with you on your use case. Please email [email protected]
Hey there! I work at Segment, and I'm one of the engineers working on Segment Functions. If you let me know your Segment workspace's name (get in touch through [email protected]), I can grant you beta access.
Yes, this is exactly the use case we want to target. What's your cloud infra? We can easily run inside your AWS VPC. If you have your own private cloud, we can run there too - just need to disable the S3 dump. The only dependency we have is Postgres.
Analytics.js was only a client/browser-side utility. They later developed an entire backend stack, which is needed for the data-warehouse dump, event replay, etc. We are also developing a complete backend stack with all those features. This is still a work in progress.
We're lucky to have found and deployed this project early at our startup. Being a cybersecurity company, we cannot have our customers' data leave our AWS account.
It’s licensed under SSPL, so the definition of “open source” may vary. (Specifically, they have an enterprise business model, so this is more properly a proprietary, source-available on-prem Segment competitor.)
Yes. Customer services, as in both support and feature development, should be enough to keep paying customers paying. If a bunch of "use at your own risk" software can replace it, what's the actual value of a business?
With regards to:
> "Rudder runs as a single Go binary with Postgres. It also needs the destination-specific (e.g. GA, Amplitude) transformation code, which are Node scripts."
what would be the recommended approach if I would like to keep the data internally and not use an external analytics engine?
martingxx | 6 years ago: (See license: https://opensourceforu.com/2019/06/cockroach-labs-changes-it... )

tomnipotent | 6 years ago: https://meltano.com/

tankster | 6 years ago: Very excited about this project!
AndrewKemendo | 6 years ago: Thanks!
beager | 6 years ago:
- Are you charging for support?
- Do you/will you have a paid enterprise tier that will increasingly be the only tier with a viable feature set?
- What's keeping you from dumping on Segment's market until you hit traction then ratcheting up to Segment's pricing?
- Who are you? Who are your investors?
soumyadeb | 6 years ago: I think Postgres gets us >15K/sec throughput.
sails | 6 years ago: https://github.com/snowplow/snowplow
namanyayg | 6 years ago: I haven't looked into Rudder, but I'll switch if it offers easier setup and schema.
indianCoder | 6 years ago: Any plans on this?
soumyadeb | 6 years ago:
We wrote a couple of blog posts on that.
Case Study: https://rudderlabs.com/customer-case-study-casino-game/
Transformation Details:
https://rudderlabs.com/transformations-in-rudder-part-1/
https://rudderlabs.com/transformations-in-rudder-part-2/
fuzzyfroghunter | 6 years ago:
Is there an easy way for someone to set it up on their cloud infrastructure?
soumyadeb | 6 years ago: Happy to help you set up - please email [email protected], or join our Slack: https://rudderlabs.herokuapp.com
rixed | 6 years ago: Unintentionally funny?
soumyadeb | 6 years ago: Support for other data warehouses (Redshift, BigQuery, etc.) is coming soon.