top | item 15737215

Ask HN: How do you deal with operational work as a software engineer?

31 points| lamansion | 8 years ago | reply

As a software engineer doing infrastructure work I often find myself working on operational stuff (mostly chasing weird bugs, some on-call, etc.). In my position I am also expected to release features and do development too, but I feel like it's very difficult to focus because of all the operational issues I am dealing with. How are you guys dealing with that sort of work?

36 comments

[+] megaman22|8 years ago|reply
Badly. We've lost a couple devs/ops people in the last year, and haven't adequately replaced them. We're stretched way too thin and everyone is getting very burned out.

I haven't done any significant development work in more than six months, just chasing bugs, doing support, and fussing with email and meetings. It blows; I've got to find a different job.

[+] Jtsummers|8 years ago|reply
Identify points to automate. Automate them. Get the automation peer reviewed by the team. Establish testing for the automation. Deploy the automation.

If it's one-offs and not consistent misbehavior that the above can deal with, improve testing infrastructure. If you're unable to hit your feature development schedule, point to the problems in the present system and infrastructure.

Ask your boss for clear priorities: do they want a stable system, or more features? If the present system is this unstable, then more features will only exacerbate this. If they say they want both, and give them equal priority, ask for a pay raise and search for new jobs.

[+] kaikai|8 years ago|reply
Chasing bugs and being on-call sound like core parts of a software engineer's job, rather than operational work.

That said, some teams at my company are experimenting with having a week-long rotation for "bread box" issues. Those include tending issues/PRs in open source repos, handling bugs as they come in, etc. That frees up the rest of the team to work on core feature work.

I like to keep a running list of smaller, non-urgent tasks that would otherwise get neglected. When I have a long-running script or need to take a break from another project, I can refer to the list.

[+] mottomotto|8 years ago|reply
Chasing bugs? Yes. Being on-call? No. Not unless you signed up for that. Too many companies think they can just get Pagerduty going and sign up all their engineering staff for operations duty. This is stupid for a number of reasons, not least that managed services get rid of most of the need for this and are typically cheaper than developer time.

Do some developers on the team need to think about scale? Yes. Should all the developers be on call because perhaps the company decided to roll its own infrastructure and someone has to deal with the occasional server with full disks? No.

[+] lin_lin|8 years ago|reply
On call as a core part? Really? Thankfully I've never worked anywhere with such a "duty", tbh if my current place proposed it I'd be applying for new jobs by lunch time.

What's the standard pay for being on-call as a matter of interest?

[+] amriksohata|8 years ago|reply
Never agree to be on call and if you do, make sure you are being paid double salary as a minimum. All modern science points to working unsociable hours as a massive detriment to your health. Also, working Saturdays and Sundays does not make your team more productive, because your staff will be tired the following week. It's a false economy.
[+] scarface74|8 years ago|reply
So if your software needs to run 24 hours and something breaks with your software, how do you avoid being on call?

A developer shouldn't be the first person called, there should be an operations staff but they may have to escalate.

On the other hand, any time that a developer is routinely being called in the middle of the night, there is usually either an issue with the software or the infrastructure not being fault tolerant.

[+] bradhe|8 years ago|reply
> Never agree to be on call and if you do, make sure you are being paid double salary as a minimum

Good luck with that.

[+] flukus|8 years ago|reply
It depends on the frequency and nature of these issues, but it sounds like you are experiencing technical debt and that you're paying for it with slower development speed. Solving the stability issues should take precedence over developing new features.

Is the stuff you have to intervene for under your control or external? If you're relying on outside systems that are flaky then you need to make your systems more resilient: things like automatically retrying a few minutes later if some third party service is down, and/or being more transactional so you can deal with errors.
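
A minimal sketch of the "retry a few minutes later" idea, in Python. Everything named here is an assumption for illustration: `ServiceUnavailable` stands in for whatever error your third party client raises, and the delays are arbitrary defaults rather than recommended values.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a third party service is down."""


def call_with_retry(operation, attempts=4, base_delay=60.0, sleep=time.sleep):
    """Run operation(); on ServiceUnavailable, wait and try again.

    The wait doubles after each failure (60s, 120s, 240s by default).
    The final failure is re-raised so the caller can alert a human.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: let the error surface
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is only there so the waiting can be faked in tests. For work that is not safe to repeat, you would want the operation itself to be idempotent or wrapped in a transaction before retrying it like this.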

[+] sqldba|8 years ago|reply
We may need some clarity on the problem you are experiencing.

If the problem is that you can’t focus long enough to do non-operations work, what’s the problem with that?

Are you unhappy you’re not coding? If so then ask for a new hire to take over the part you don’t want or start looking for another job.

Are you unhappy that your boss is still pushing you for results and is an utterly clueless idiot who has no idea where your time actually goes?

Fill us in.

[+] watwut|8 years ago|reply
I see chasing weird bugs as part of the development job, not something separate. As long as it has weird bugs, the feature is not really done. As a side note, developers who do "only development" and offload all weird bugs to someone else tend to create less maintainable software over time: they lack feedback and tend to favor whatever makes them produce new stuff faster over what makes us all avoid those weird bugs.

As for infrastructure and first line support, lobbying management for more people continuously is just about the only long term solution.

The other thing is planning and transparency, which helps the above. Keep a plan with realistic estimates to show management each time you talk with them. Do your best work, definitely don't slack, etc., but don't cut corners to make something look done when it is not. Instead, move the dates in the plan and send it to management again. The point is to convince them that there is really more overall work than one person can do. (If they get offended over that or treat you badly over that, find a new job.)

[+] Kuraj|8 years ago|reply
> As a side note, developers who do "only development" and offload all weird bugs to someone else tend to create less maintainable software over time

My problem is that I have become that someone else.

[+] eitland|8 years ago|reply
Time Management for System Administrators has some ideas I think: http://shop.oreilly.com/product/9780596007836.do

(I haven't read this cover to cover, but I have more or less read his and Christina J. Hogan's book cover to cover, I think, and I've also bought a couple of copies of the above book to share.)

Summary of what I've learned and found useful from those and other resources:

Get someone to step in for you half the time. (If only to fill in a ticket or, in a real emergency, call you.)

Manage expectations. (You don't expect hard interrupts except for emergencies.)

Make support requests asynchronous. (Mail, support tickets, not calls. Even when you (or someone else) are available for real time support, make chat the preferred option.)

[+] holydude|8 years ago|reply
Yeah I really get your suffering. I really hate when software engineers try to meddle in the ops part. It usually ends up being a stupid pile of crap on top of more crap. It is also sad to see companies pushing devs to do this instead of giving it to someone who understands what they are doing.
[+] pmontra|8 years ago|reply
If I don't fix bugs and I don't help my customers with setting up servers and the like I don't think I'll get new projects with them. Why would they trust a developer that disappears? It's as simple as that.

Some of those activities are paid, but fixes close to a delivery are not and it's OK. Usually I set up a maintenance contract for quick activities, like small new features or investigating puzzling events (not necessarily bugs.) I have a ticketing system to keep track of those activities. Customers have access to it.

Obviously one has to make clear that maintenance will slow down development.

[+] dozzie|8 years ago|reply
What issues exactly are you dealing with? You only provided a very vague description of this "operational stuff" you do and are disturbed by.
[+] lamansion|8 years ago|reply
Dealing with production problems, which may be functionality, performance, and reliability related.
[+] aprdm|8 years ago|reply
You need a better system in place to prevent bugs from happening.

- Separation between development / staging / production environments.

- Integration tests.

- Service / System Metrics.

- Central logging.

- High availability.

- Alerts.

When you have a solid deployment pipeline things don't usually break. Errors and regressions are caught in the staging part of the deployment pipeline and errors in production can be rolled back automatically (and then you add an integration test for the regression!)
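
A rough sketch of what "rolled back automatically" can look like, in Python. This is not our actual pipeline: `deploy` and `health_check` are hypothetical callables you would back with your own tooling (an Ansible playbook run, an HTTP probe, and so on).

```python
def deploy_with_rollback(new_version, current_version, deploy, health_check):
    """Deploy new_version; if the health check fails, redeploy current_version.

    deploy(version) and health_check() are hypothetical hooks supplied by
    the caller. Returns the version that is left running.
    """
    deploy(new_version)
    if health_check():
        return new_version
    # Health check failed: go back to the last known-good version.
    deploy(current_version)
    return current_version
```

In a real pipeline the health check would usually watch metrics or alerts for a while after the deploy rather than probing once.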

All this devopsy work at my company is done by software engineers with advice from systems engineers. And we do it because neither of the groups wants to get called on the weekends :) It has been working really well. Last year we had 0 calls. Before we had this in place things would break on a weekly basis.

You can build all of what I mentioned with OSS like:

- Ansible (deployment)

- Jenkins (ci)

- ELK stack (metrics / logging)

- Zabbix (system metrics)

This system has been serving us, on premises, without much maintenance.

[+] thisisit|8 years ago|reply
> As a software engineer doing infrastructure work

So you are into devops but doing more ops than dev? This doesn't sound like a problem until your team's agenda and objective is to deliver more ops work.

[+] bradhe|8 years ago|reply
Treat your operations work like your engineering work. Over time things get a lot better.
[+] akulbe|8 years ago|reply
It's hard to read this and not want to offer help. I don't know if this is the best venue though.