
Startup Fuck-ups: How we lost 25% of our monthly revenue overnight

114 points| noellep | 11 years ago |medium.com | reply

56 comments

[+] shizcakes|11 years ago|reply
This could be solved in an even more fundamental way: don't run your own mailer as a startup. There are lots of companies that will be responsible for email deliverability on your behalf, via an API. If it took them 3 months to notice no mail was being sent at all, imagine how long it's going to take them to figure out that their IP is blacklisted in Spamhaus, or to spot any number of other deliverability issues.
[+] kenrikm|11 years ago|reply
Focus on your core competency. As someone who has run production mail servers in the past, I know it's not something you really want to do (my advice: don't). There is always some fire to put out, or some blacklist you were put on that needs to be cleaned up.

Mailgun offers 10,000 emails per month free and is dead simple to use.

[+] tankenmate|11 years ago|reply
I'd modify that by saying: if you really know how to run a mail server, go for it, but if you don't, then leave it to someone who does. The same goes for just about every other service your startup relies on: accounting, legal, etc.

Cost benefit analysis (even a brief one) will always help, even if you get the answer wrong. It's just a part of planning; your plans don't always work, but if you don't plan then you'll never know if you succeeded or not (or why) until it's too late.

[+] NDizzle|11 years ago|reply
I don't entirely agree with this.

I don't use my own mailer currently. I DO, however, use Postfix to queue and relay email to Rackspace, who actually sends my email.

I don't think the API method is appropriate because then you need to run some other queue system so your app has an instant response time for the user. Their action would create a queue entry (with whatever data) that will eventually be fired off as an API call to whoever you're using to send email via an API.

Or, you just set up an SMTP relay and use sendmail/postfix/whatever locally to handle that part of it.

I often see these startups using the API calls as part of the customer facing flow (website or otherwise) and the increased latency waiting on that API call to return really, really sucks.
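The relay approach described in this comment can be sketched roughly as below. This is a minimal Python illustration, assuming a local MTA (Postfix or similar) is listening on localhost port 25 and is already configured to relay onward; the function names and the addresses are hypothetical and do not come from the thread.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email; nothing is sent at this point."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def hand_off(msg, host="localhost", port=25, timeout=5.0):
    """Hand the message to the local MTA, which queues and relays it.

    Assumption: an MTA is listening on host:port. The call returns once the
    local queue has accepted the message, so the user-facing request does
    not wait on a remote provider.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.send_message(msg)
```

In a request handler this would be called as hand_off(build_message("app@example.com", "user@example.com", "Welcome", "Hello")). Queueing and retries are then the MTA's job, which is also why its queue needs monitoring.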

[+] michaelbuckbee|11 years ago|reply
Excellent advice, the only thing I'd add is that it's like this for everything that isn't core to your service when you are getting started.
[+] spacefight|11 years ago|reply
Not everyone wants to outsource transactional mail handling to a third party.
[+] protonfish|11 years ago|reply
If you want to send bulk email blasts from a large list then definitely use a 3rd party, but it seems their emails were triggered programmatically from many different parts of their application. The email providers I have used don't make this easy so it is usually better to run the server yourself. You don't need much knowledge or experience to run an email server (if you are competent at general IT tasks already) but you need one thing: to actually check once in a while to see if your server is sending email.

Which is the real problem here. Team members knew mail wasn't being delivered in the forums and they chose to ignore it. They must have never done any follow-up (personal email, phone call, survey) on new customers even when they were doing their big marketing "ramp-up." They must not have even checked with a test walk-through of the new user process. Leadership was just too far removed from the customer experience, whether they used a 3rd party email service or not.

[+] wmt|11 years ago|reply
"What we found was that a number of failed jobs were being kept by the system, which meant that these were taking up a ton of space that they shouldn’t have been.

To fix the issue, we put together a script to delete the failed items, since any retries to send them didn’t appear to work."

At this point my head was screaming "NOOOOooooo!" and I felt bad for the author, given whatever disaster would soon follow.

Not only was the problem not fixed, it wasn't even understood. Hiding the problem by fixing its symptoms will rarely get you far. I don’t think I'm even Captain Hindsighting here, as I've learned over and over that not understanding the root cause of any issue means you will be screwed by the issue sooner or later, and it likely will not be pretty.

Sure, sometimes you don't have the time to get to the bottom of an issue, but even then you cannot pretend that it's fixed. It'll be back with a vengeance.

[+] Bluecobra|11 years ago|reply
Well said. This is the end result of letting developers perform system administration tasks. I doubt that they were able to Google the root cause on Stack Overflow. :)
[+] aqeel|11 years ago|reply
Exactly what I thought!
[+] rbadaro|11 years ago|reply
I don't think terms like "Fuck-ups" or "screwed" should belong in corporate communications, start-up or not. It's cool they are talking about this openly, but unfortunately what I took from their write-up is that their communication style is less than professional.
[+] protonfish|11 years ago|reply
I don't care what language they use, personally. However, I am at work, and having the F-word in 72pt font blaring from the top of my browser made me scroll down very quickly. I doubt my bosses would be thrilled to see that.
[+] glibgil|11 years ago|reply
Sorry, fuck-up is an acceptable professional term now. A fuck-up is an error that is so bad that from CEO to customer, there is no sense in calling it anything other than what it is. Send 10 emails at once to a customer? Sorry, we fucked up. Lose 25% of revenue for several months? Sorry, CEO, I fucked up. It is an admission that you have made an error that will happen less than once a year and hopefully only once a career.
[+] wrs|11 years ago|reply
Further evidence that "if you aren't monitoring it, it isn't happening"! Ensuring that code is monitorable needs to be right up there with ensuring it's testable.
[+] derwiki|11 years ago|reply
"Small" is relative -- at the "small startup" I run, our email volume is low enough that I BCC myself on every email the site sends. Poor man's monitoring and it won't scale, but I usually notice within hours if something had been broken.
[+] aqeel|11 years ago|reply
Where I work, BCC-ing on all mails is not scalable. Instead a random sample of the mails is sent to ourselves.
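A rough sketch of the sampling idea, in Python. The function name, the monitoring address, and the 5% rate are illustrative assumptions; the comment does not say how the sample is actually drawn.

```python
import random


def bcc_for(monitor_address, sample_rate, rng=random):
    """Return the BCC list for one outgoing mail.

    With probability sample_rate the monitoring mailbox is copied,
    otherwise nobody is. rng is injectable so the choice can be seeded.
    """
    return [monitor_address] if rng.random() < sample_rate else []


# Example: copy roughly 1 in 20 mails to a (hypothetical) monitoring mailbox.
bcc = bcc_for("mail-monitor@example.com", 0.05)
```

The sample only proves mail is flowing if someone, or some alert, notices when the monitoring mailbox goes quiet.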
[+] Someone1234|11 years ago|reply
As someone who has created quite a few of these mailers, the queue getting stuck on a single piece of mail and hanging indefinitely is incredibly common. As time has gone on my solutions have become simpler and more pragmatic, since additional complexity breeds additional problems.

For example, if I was going to design an emailer today:

- Grab the email from a database, save it to a file (likely one or several XML files), and place it in an "Outgoing" directory (ye olde file system).

- Then have a process which grabs an atomic lock (only one running at a time!), gets the directory listings, and launches the actual "sender" for every file individually (concurrently).

- When the launcher launches the sender it records the PIDs of the process against the actual emails/XML files internally.

- After a set wait period, if any processes are still running, the launcher kills them and moves the email/XML into a "Failed" directory, which we monitor independently.

- Every email which is sent gets moved to an "Archive" directory by the sender process, and we monitor that to see if no emails have been archived for a long time (e.g. 30 minutes).

You can accomplish the same thing using a database (Outgoing, Archive, and Failed tables), but frankly with so many awesome file system tools already around it doesn't make sense to reinvent that wheel. Plus people intuitively understand that if a file is sitting in the "Outgoing" or "Failed" directories then it hasn't been sent yet (just like your client would!).
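The launcher steps above can be sketched as follows. This is a minimal, Unix-only Python sketch under stated assumptions, not the commenter's actual implementation: the function names, the lock file name, and DEMO_SENDER (a stand-in that "sends" a mail by moving the file into Archive) are all hypothetical. It also treats any file still sitting in Outgoing after its sender has exited as failed, which goes slightly beyond the timeout rule in the list.

```python
import fcntl
import shutil
import subprocess
import sys
import time
from pathlib import Path

# Hypothetical stand-in for a real sender: it "sends" the mail by moving
# the file (argv[1]) into the Archive directory (argv[2]).
DEMO_SENDER = [sys.executable, "-c",
               "import shutil, sys; shutil.move(sys.argv[1], sys.argv[2])"]


def run_launcher(base, sender_cmd=DEMO_SENDER, timeout=30.0):
    """Launch one sender per file in Outgoing; move leftovers to Failed."""
    base = Path(base)
    outgoing, failed, archive = (base / n for n in ("Outgoing", "Failed", "Archive"))
    for d in (outgoing, failed, archive):
        d.mkdir(parents=True, exist_ok=True)

    with open(base / "launcher.lock", "w") as lock:
        try:
            # Non-blocking exclusive lock: only one launcher runs at a time.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return {"started": 0, "failed": 0, "pids": {}}

        # Start all senders concurrently, remembering which process owns which file.
        procs = {p: subprocess.Popen(list(sender_cmd) + [str(p), str(archive)])
                 for p in sorted(outgoing.iterdir()) if p.is_file()}
        pids = {p.name: proc.pid for p, proc in procs.items()}

        deadline = time.monotonic() + timeout
        failed_count = 0
        for path, proc in procs.items():
            try:
                proc.wait(timeout=max(0.0, deadline - time.monotonic()))
            except subprocess.TimeoutExpired:
                proc.kill()  # sender hung past the wait period
                proc.wait()
            # Assumption: a file still in Outgoing once its sender is gone has failed.
            if path.exists():
                shutil.move(str(path), str(failed / path.name))
                failed_count += 1
        return {"started": len(procs), "failed": failed_count, "pids": pids}


def archive_is_stale(base, max_age=30 * 60):
    """True if nothing has been moved into Archive for max_age seconds.

    Uses the directory's own mtime, which changes when an entry is added;
    a renamed file keeps its old mtime, so per-file mtimes would mislead.
    """
    return time.time() - (Path(base) / "Archive").stat().st_mtime > max_age
```

Running run_launcher from cron and alerting when archive_is_stale returns True, or when Failed is non-empty, covers both monitoring points from the list. A quiet period with no mail to send would also trip the staleness check, so a real setup would likely send a periodic heartbeat mail.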

[+] specialist|11 years ago|reply
I strongly approve. That's how I implemented the backend of my medical records exchange stack.

File system based queues. Point-to-point data interchange, so no concurrency; your notion of imprinted work tasks with PIDs is a good idea.

I used a "pull" model. A thread would take work from one directory and drop into another. Poor man's workflow. Worked great. Super easy to monitor and troubleshoot.

Using Java, implementing the cross platform file locking (so a downstream process wouldn't pull a task before it was ready) took some finesse, a small caveat.

[+] brusch64|11 years ago|reply
Sounds like you just recreated Microsoft Biztalk.

But to sound less like a dick: communicating with a low probability of errors is hard!

[+] lazyant|11 years ago|reply
If you don't outsource this type of service (the preferred solution, imho), then from the very beginning you have to monitor it internally (the fix they applied after the fact) and also, most importantly, externally: in this case, with one or more monitored client-like email accounts.
[+] nasalgoat|11 years ago|reply
I have a script that sets up 90% of a full nagios/icinga server automatically in about 5 minutes.

Why, in 2014, are people still not monitoring everything as job #1?

Why isn't this being taught in schools? How do people with tech jobs not know this?

[+] meritt|11 years ago|reply
Make sure your mistakes never directly affect your consumers. Don't spam them, don't overload them with ads, and don't leak their PII. Quickest way to lose customers.
[+] baudehlo|11 years ago|reply
I'd really like to know what the mail server software was that failed this way.
[+] pbhjpbhj|11 years ago|reply
No one in the company was subscribed to their mailings?

Hope they're monitoring their backups.

[+] Animats|11 years ago|reply
"scheduled emails to check in with our users."

That's not "transactional email". That's spam.

You're a spammer. Die.

[+] general_failure|11 years ago|reply
"I am single, 41 and pregnant"...

I don't know when pregnancy became a part of a person's identity. I guess this adds to the coolness factor these days because you have to try harder? Sigh.

[+] amouat|11 years ago|reply
Where does it say this? (I guess it's been removed?)

And why does it matter to you whether or not the author considers being pregnant to be part of her identity?

IMO it would be more healthy to comment on the content of the article than the author's identity.

[+] Spooky23|11 years ago|reply
Pregnancy at 41 is generally high-risk. So you have lots of doctor's appointments, often lots of discomfort and you probably end up with mandated (and necessary) bed rest.

Being single while dealing with that is an enormous burden mentally and physically.

I didn't notice it referenced in the story, but it is definitely a factor that would influence a person's behavior that would be relevant, especially in a startup context when everyone is wearing 10 hats.

[+] ceejayoz|11 years ago|reply
The pregnancy is just as relevant as being single and 41. Why are you complaining about it and not the others?