Sort of interesting just to hear about the ups and downs of companies like Dubsmash. They were often cited as an example of Berlin's future as a startup city [1]. They went from 35+ employees to 27 [2] and now to 12, as stated in this post. They also moved from Berlin to New York, which seems to imply they felt the city couldn't offer what they currently need. It looks like they didn't take many of the employees with them in the move (maybe this was also a way out of strict German employment rules?). Seems like a bit of an attempted restart (co-founder Roland Grenke appears to be gone, etc.).
Looking at their rank history on App Annie, they were doing really well in 2015, but it's been downhill from there (from the top 10 to below 500 in all the major app store charts). How they were able to go from 140M to 350M downloads in the last year (compare this article with the TechCrunch one) is a complete mystery. Also, stating your number of users without any qualifier (e.g. MAU) in a tech article is a bit of a red flag; in my experience that usually means it's a vanity number (yearly actives? Who knows).
It also sounds odd that they have 3 engineers out of 12 employees. What do the other people do?
And hopefully they had more than 3 engineers back when they had 35 employees... but even then, why would they choose to let engineers go and end up with that tech-to-non-tech ratio?
Interesting: they couldn't even manage a basic signup process. The app says my email is on the user list, but when I try to log in it says no such email exists in their system. The forgot-password option gives the same result. Hire someone to manage your ACLs, guys ,_,
Thought the same. My wild guess is that those engineers are at their career peak in terms of energy and ability to deliver glue code, but a few years away from being well-rounded engineers who can live and work sustainably.
Three engineers maintain code in Java, Swift (previously Objective-C), Go, Python (both Django and Flask), and Node.js, are considering Kotlin, and additionally make use of Celery, RabbitMQ, React, Redux, Apollo, GraphQL, Postgres, Heroku, AWS, Jenkins, Kubernetes, Redis, DynamoDB, Elasticsearch, Algolia, Memcached, and more.
I might be an inexperienced engineer by comparison, but I'll be honest, that sounds absolutely fucking insane. These three people must be geniuses to be able to use all of that with sufficient mastery to effectively handle 200M users.
Sometimes I wonder if there are any internet companies (startup or otherwise) that do customer support. With numbers like that, it's hard to imagine one of those users getting even one second of attention with any problems they might have.
You can only really do customer support if it makes financial sense, which it won't unless you make a significant amount of money on your average customer. Tech companies that don't have sales, but instead take their revenue through ads or through selling data, are making cents per customer. With average profit that low, even 1 in 1,000 customers using your support for 5 minutes would destroy any chance of profit.
> We since have moved to a multi-way handshake-like upload process that uses signed URLs vendored to the clients upon request so they can upload the files directly to S3.
How does this work in practice / where can one learn more about this?
I want to make sure that I understand the security aspect of this.
You can argue that the user can upload anything using the original API anyway. But in the original setup you can do server-side validation before the upload is proxied. I'm thinking of domain-specific rules, like only allowing videos that are 6 seconds long, or something similar.
You can move the validation to the client, but the client can easily be modified. An actual user might not do this, but someone trying to steal your storage space (for serving malware or something) might.
These signed URLs also seem to expire based on time, so you could potentially save a URL and upload again later if you allow a generous expiration. (Again, not something I see being a huge problem.)
But I guess these aren't really serious issues compared to the cost savings. Am I missing other ways this can be exploited?
Not 100% sure what they mean by _vendored_ here, but I'm guessing they make a request to one of their backends to generate the URL and return it to the client for use.
One thing to keep in mind: users should be able to upload (to the specific signed URL), but they should not be able to download from that location. Don't make the files users upload publicly downloadable, otherwise you can be used to host malware. After the video/image is uploaded, you need to download and process it [1], then upload it to an S3 bucket that allows download (e.g., via a CDN).
[1] Use caution when processing user content. It is best to process media in a sandbox that can protect you against exploits in the media processing libraries.
The client makes a request to the server, passing its auth token; the server verifies the token and uses the S3 library to generate a unique, time-limited URL authorizing an upload, which it returns to the client. The client then makes a PUT request to the S3 URL. Once the expiration passes, the URL is no longer valid.
Multipart signed upload is much harder and requires signing every chunk.
Just google "s3 signed upload"; there are a few tutorials from Amazon.
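To make the handshake concrete, here is a toy sketch of the signed-URL idea using stdlib HMAC. This is not Amazon's actual SigV4 scheme (in practice you would call something like boto3's `generate_presigned_url`), and the host name, bucket, and secret are made up for illustration:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # lives only on the backend, never shipped to clients

def sign_upload_url(bucket: str, key: str, expires_in: int = 300) -> str:
    """Backend: mint a time-limited URL authorizing one PUT of one object key."""
    expires = int(time.time()) + expires_in
    payload = f"PUT\n{bucket}\n{key}\n{expires}".encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires, "signature": signature})
    return f"https://{bucket}.storage.example.com/{key}?{query}"

def verify_upload(bucket: str, key: str, expires: int, signature: str) -> bool:
    """Storage side: accept the PUT only if the signature matches and is fresh."""
    if int(time.time()) > expires:
        return False  # URL has expired
    payload = f"PUT\n{bucket}\n{key}\n{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the method, bucket, key, and expiry, a client can't reuse the token for a different key or after the expiration, yet the upload itself goes straight to storage without passing through the backend.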
> However, we discovered after some time that the custom Python implementation for those workers was dropping up to 5% of the events. This was mostly due to the nature of how reading happens with Kinesis: every stream has multiple shards (ours up to 50!) and each reading client would use a so-called shard iterator to keep track of where it was reading last. Since the used machines could always crash, be recycled, or scaled down, we needed to save those shard iterators in some serialized format to Redis and share them across machines and process boundaries. Since we had so many shards, every once in awhile we would skip events and hence lose them.
I've never worked with Kinesis, but in Kafka you'd store offsets specifically to solve this issue. When one of the members of a consumer group drops out, the partition (read: shard) is automatically reassigned to another member. This gives an at-least-once delivery guarantee, which combined with idempotent actions gives effectively-once semantics. No need to lose any messages. What was the issue that the Dubsmash engineers were solving here?
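The offset-checkpoint pattern described above can be shown with a toy, in-memory sketch (not real Kafka client code): the consumer commits its offset only after handling an event, so a crash can cause a replay, and deduplication by event ID absorbs that replay. All names here are illustrative:

```python
class Shard:
    """Stand-in for a Kafka partition / Kinesis shard: an append-only event log."""
    def __init__(self, events):
        self.events = events

    def read_from(self, offset):
        return list(enumerate(self.events))[offset:]

class Consumer:
    def __init__(self):
        self.committed = 0   # durable checkpoint (Kafka keeps this in __consumer_offsets)
        self.seen = set()    # idempotency keys of already-applied events
        self.processed = []

    def run(self, shard, crash_before_commit_at=None):
        for offset, event in shard.read_from(self.committed):
            if event["id"] not in self.seen:   # idempotent apply: duplicates are no-ops
                self.processed.append(event)
                self.seen.add(event["id"])
            if offset == crash_before_commit_at:
                return                          # crashed after applying, before committing
            self.committed = offset + 1         # commit only once the event is applied
```

Restarting after a simulated crash re-reads the uncommitted event (at-least-once delivery), but the `seen` set keeps the result correct (effectively once). In production, of course, the dedupe state would itself have to be durable and shared, just like the checkpoint.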
Home-rolling a checkpoint-free event pipeline is a rookie mistake; it's a pity they didn't come across our Snowplow project (Apache 2.0 event pipeline running on Kinesis, Kafka and NSQ, https://github.com/snowplow/snowplow/).
> Although we were using Elasticsearch in the beginning to power our in-app search, we moved this part of our processing over to Algolia a couple of months ago;
I am genuinely curious about the trade-offs, as the bad and the ugly are not mentioned. Realistically, there are already a lot of moving pieces there, and yet the team of 3 keeps experimenting?
[1] http://www.wired.co.uk/article/european-startups-2016-berlin
[2] https://techcrunch.com/2016/11/30/dubsmash-9m/
Would love to read more about whether they started with microservices or had an MVP monolith that they then cut parts off of.
It works like this:
User tells the backend, “I want to upload picture.jpeg!”
Backend tells the user, “Alright you have my permission but ONLY for that filename with that extension. Here’s a token, enjoy.”
User uses that signed token and pushes the file to your S3 bucket.
Here's how you do it in Phoenix: https://sergiotapia.me/phoenix-framework-uploading-to-amazon...
I am looking into the GCS version, not S3, if that matters: https://cloud.google.com/storage/docs/access-control/signed-...
https://devcenter.heroku.com/articles/s3#file-uploads
http://lemoinefirm.com/parody-fair-use-or-copyright-infringe...
How many records are you storing on Algolia?
Neither does Facebook login work.