
General guidance when working as a cloud engineer

227 points| lockedinspace | 3 years ago |lockedinspace.com

147 comments

[+] nielsole|3 years ago|reply
Another random selection:

* When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team. Chances are the next reorg is coming faster than the lifetime of the identifier and renaming is often hard.

* The industry-leading tools will contain bugs. From the Linux kernel to deploy tooling, there are bugs everywhere. Part of your job is to identify and work around them until upstream patches reach you, if they ever do.

* Maintaining a patched fork is usually more expensive than setting up a workaround.

* Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

* Purchased SaaS will break production in the middle of the night. Your own team will have the best context and motivation to fix or work around it. When choosing a vendor, include visibility into their internal monitoring as a factor in disaster recovery (exported metrics and logs of their control plane, for example).

[+] vladvasiliu|3 years ago|reply
> * Your hyperscaler cloud provider has plenty of scalability limitations. Some of which are not documented. If you want to do something out of the ordinary make sure to check with your account rep before wasting engineering time.

If only they'd tell you. We had this exact issue on AWS: seemingly random packet drops. Metrics on both clients and servers were OK; latency, specifically, was very low when it worked.

Call up support "yeah, you're running into our connection limit". "Oh. What's that limit?" "yeah, I can't tell you that". His solution was that, since this was somehow related to connection tracking in the security group, I could set this to allow all/all, and set up filtering at the NACL level. Turns out I could do it for this particular issue.

This was before there was a possibility to monitor this [0]. Called up our customer manager. "Let me check". A few days later, "yeah, that's not something we divulge".

---

[0] For those who don't know, it's now possible to keep an eye on refused connections (at least on Linux). https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitori... -> conntrack_allowance_exceeded
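On instances using the ENA driver, that counter shows up in `ethtool -S` output. A minimal sketch of reading it from Python (the interface name and the exact counter set depend on your instance type; the captured sample below is illustrative, not real output):

```python
import subprocess

def parse_counter(ethtool_output: str, counter: str) -> int:
    """Find a single named counter in `ethtool -S` style output."""
    for line in ethtool_output.splitlines():
        name, _, value = line.strip().partition(":")
        if name.strip() == counter:
            return int(value.strip())
    raise KeyError(f"{counter} not reported by this driver")

def read_ena_counter(interface: str, counter: str) -> int:
    """Shell out to ethtool and parse the requested counter."""
    out = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_counter(out, counter)

# Illustrative captured output, not from a real instance:
sample = """NIC statistics:
     bw_in_allowance_exceeded: 0
     conntrack_allowance_exceeded: 42
"""
print(parse_counter(sample, "conntrack_allowance_exceeded"))  # 42
```

A nonzero, growing `conntrack_allowance_exceeded` is the symptom described above: packets dropped because the instance hit its (undocumented) connection-tracking allowance.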

[+] anilakar|3 years ago|reply
> When choosing internal names and identifiers (e.g. DNS) do not include org hierarchy of the team.

Naming in general is hard. If you name stuff based on location, use an identifier that won't change, like provider datacenter names, street addresses or customer building codes, not the current tenant or purpose of use.

For products, come up with an internal product/project name and stick to it in everything that is not immediately visible to the customer. At one point you could see the current and three previous names of our product if you popped out an iframe and opened the inspector (logo with name, page title, URL and prefixed log messages).

> Maintaining a patched fork is usually more expensive than setting up a workaround

When your bosses demand additional features a single customer requested, you absolutely have to make them understand that the functionality must be added to the main product.

[+] throwawaaarrgh|3 years ago|reply
Truth is an interesting concept. It's often subjective and has many forms. Within the context of the cloud, almost all cloud services are mutable, so "truth" is whatever the current state of the cloud actually is. Whatever is in Git is merely idealism.
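That gap between Git and reality can be made concrete with a hypothetical drift check: diff what the repo declares against what the cloud API actually reports, and treat the live state as the thing to reconcile (the resource names here are made up):

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compare Git-declared resources against live cloud state.

    Reports what exists only in Git (never applied), only in the
    cloud (created out-of-band), or differs between the two.
    """
    return {
        "unapplied": sorted(desired.keys() - actual.keys()),
        "unmanaged": sorted(actual.keys() - desired.keys()),
        "drifted": sorted(
            k for k in desired.keys() & actual.keys()
            if desired[k] != actual[k]
        ),
    }

desired = {"web-sg": {"port": 443}, "db-sg": {"port": 5432}}
actual = {"web-sg": {"port": 443}, "debug-sg": {"port": 22}}
print(diff_state(desired, actual))
# {'unapplied': ['db-sg'], 'unmanaged': ['debug-sg'], 'drifted': []}
```

Tools like `terraform plan` do essentially this at scale; the point stands that the `actual` side, not Git, is what production is running.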

Whatever you are maintaining, read the docs completely first. And I mean cover to cover. Not just the one chapter you need to get a PoC up and running. You will wish you had later, and it will come in handy many times over your career. Consider it an investment in your future.

Read books on microservices before you implement them. Whatever two-line quip you read on a blog will not be as good as reading several whole books from experts.

Docker multi-stage builds won't work in some circumstances. Build optimization gets more complex the more you rely on "advanced" build features.

[+] crdrost|3 years ago|reply
Thanks for the alternative microservices quip, it was better than the original. Indeed, I find that “microservices should only perform a single task” is a really dangerous way to phrase it, because we have no idea what the article means by “task.” The classic microservices separation is an ordering service split from a shipping service: is each of those one “task”? Or, at the most extreme, is saving an order distinct from returning the list of your outstanding orders?

Even when people graduate to the language of DDD and refine the idea, they often settle on “one microservice per bounded context,” where “bounded context” means “separated however I want it separated at the time” and has no consistent principle behind it. This despite the fact that I think Eric Evans was quite explicit in his explanation of the idea: he meant a mapping of the software onto the fuzzy, complex world of businesspeople and business language. Perhaps a better way to phrase it is one microservice per archetype of user: “we have people from the warehouses who all speak the same shipping jargon, we should have a microservice specifically for them which speaks their language.” I think most developers target their microservices smaller than that, in which case it is definitely not “one microservice per bounded context”.

Don't get me started on how “strong coupling” is shorthand for “coupled in ways I don't like” etc. ... Sometimes I feel like I'm on an episode of “whose line is it anyway?”, where everything is made up and the points don't matter.

[+] k__|3 years ago|reply
"read the docs completely first"

I learned from using software like Photoshop and Ableton Live that you shouldn't underestimate the complexity of any software you use.

Take a few days or weeks, if you can, to read docs or do high quality courses on the topic and it will make your life easier in the long run.

[+] pabs3|3 years ago|reply
The only truth is the memory and disk contents of the devices that make up your cloud. Everything else is an abstraction of that, which discards data and potentially is out of sync with reality.
[+] thewisenerd|3 years ago|reply
While I wouldn't wish the bootstrap problem on my worst enemy, I think the idealism helps for at least versioning configuration changes, and for partial component-level tear-downs and bring-ups (you don't need this often, but when you do, you do).

Also, with k8s, there's nothing like deleting the wrong object, or making a change and not knowing what it was N revisions ago.

[+] kator|3 years ago|reply
Don't forget "pets vs cattle": thinking of servers as ephemeral and working towards quickly being able to scale up/down based on demand. So often I see people "lift and shift" from a dedicated-server model into the cloud and never convert their pets into cattle. This reduces flexibility later, not to mention making it harder to respond to patching needs, scaling, and moving to optimize latency or costs.
[+] r3trohack3r|3 years ago|reply
As an ex-FAANG engineer, this is FAANG advice. Pets are just fine. Most companies aren't FAANG and don't need that class of solution.

An R620 plugged into a switch in a colo, a bash script via cron, or a cloudflare worker are just fine for a lot of use cases. The only time it stops being fine is when you can't afford to do your pet -> cattle migration as you scale up. But I don't think this is a common death for companies.

If you call "cattle" a cloudflare worker or lambda function - fine. But when we are talking about multiple redundant servers with load balancing across them, you really need to justify the cost of that vs the value you squeeze out. Sometimes you're squeezing the juice out of the rind.

[+] voiper1|3 years ago|reply
Some replies are saying this is only for "as-scale/FAANG".

It may only be absolutely necessary there, but it's helpful even for smaller folks.

Over the years, even Debian LTS goes out of support, and new features and software need to be installed. There's moving systems, doing restores, things breaking and wanting to "reset" to a known working state. Any time you can do something simple with Docker or even just (short) step-by-step build scripts, that's a huge win.

I have playbooks for deploying a system, but with npm installs, bower installs, secrets to be hand copied from multiple places, etc, it feels more like pets and it's NOT simple to deploy.

[+] candiddevmike|3 years ago|reply
Citation needed? There are tradeoffs to both, one is not always better than the other.
[+] birdymcbird|3 years ago|reply
> A good monitoring system, well-organized repository, fault-tolerance workloads and automation mechanisms are the basis of any architecture.

Monitoring/alarming, and knowing what to monitor. Also, properly instrument your services or whatever it is you have. Take time to reflect on what are the signals that tell you operational health. An error metric alone is useless if you don’t know the denominator. Also be careful to avoid adding noisy metrics that cause panic for no reason.

I’m not sure what fault tolerance means in this context; it's a very handwavy statement. If you have dependencies, have a plan and an understanding of which ones tipping over will bring down your service, and of how you can build in resiliency. For example, say some feature on your page requires talking to a recommendations service. If that service goes down, can you fall back to a generic list of hard-coded recommendations or some static asset?

As for automation: yeah, have test workflows built into your CI/CD harness, and avoid manual steps requiring human intervention. Use canaries to test that certain functions are up and running as expected, etc.
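The denominator point above can be sketched in a few lines: alarm on error *rate* over a minimum request volume, never on a raw error count (the thresholds here are made-up example policy, not recommendations):

```python
def should_alarm(errors: int, requests: int,
                 max_error_rate: float = 0.01,
                 min_requests: int = 100) -> bool:
    """Alarm on error rate, but only once traffic is significant;
    a handful of errors at tiny volume is noise, not an outage."""
    if requests < min_requests:
        return False
    return errors / requests > max_error_rate

print(should_alarm(errors=5, requests=50))       # low traffic: noise
print(should_alarm(errors=5, requests=10_000))   # 0.05%: healthy
print(should_alarm(errors=500, requests=10_000)) # 5%: page someone
```

The same five errors mean completely different things at 50 requests versus 10,000, which is exactly why an error metric alone is useless without its denominator.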

[+] TrackerFF|3 years ago|reply
"Learn to say: I do not know about this/that. You cannot know everything that gets presented to you. The bad habit comes when the same technological asset appears for a second time and you still do not know how it works or what it does."

Absolutely. I've seen so many junior engineers / devs go on about it like this:

Someone higher up: Could you please look at this problem? I need it fixed ASAP.

Jr. Engineer, presented with a problem he's never seen before: No problem, I will look into it!

Someone higher up (the next day): Did you fix the problem?

Jr. Engineer: Sorry, I still haven't gotten around to looking at it / I'm still working on it / etc.

Someone higher up: We really need it fixed today, please prioritize it and give me a call when it is fixed.

Jr. Engineer works on the problem all night, feeling stressed out, not wanting to let down his seniors.

[+] WolfOliver|3 years ago|reply
"Microservices should only perform a single task." -> I guess this advice is the reason they are so widely misunderstood, see: https://linkedrecords.com/challenging-the-single-responsibil...
[+] adamisom|3 years ago|reply
Wow and I thought functions should only perform a single task. I need to keep up with the times! Apparently you need an entire deployable app and API to do anything these days. I guess it makes sense. How else could we justify so many software engineers!?
[+] _vertigo|3 years ago|reply
I think this advice really depends on your scaling needs. If you need to scale your services up, it’s a lot easier to do that if each service only does one thing.

It also depends on how much functionality you consider to be “one thing”.

[+] pondidum|3 years ago|reply
> Do not make production changes on Fridays

I ~hate~ dislike this advice. If you can't deploy on a Friday, you need to fix your deployment strategy. By removing Friday from when you can deploy, you're wasting 1/5 of your available days.

Note: deploy != Release[1]. Use flags, canaries etc.

[1]: https://andydote.co.uk/2022/11/02/deploy-doesnt-mean-release...

Edit: hate is far too strong a word for this
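The deploy-vs-release split is easy to sketch: ship the code dark on Friday, then flip a flag on Monday. The flag store here is a plain dict standing in for whatever flag service you actually use, and the flag name is invented:

```python
FLAGS = {"new-checkout": {"enabled": False, "canary_percent": 5}}

def is_enabled(flag: str, user_id: int) -> bool:
    """Deployed code paths stay dark until the flag turns on,
    optionally exposed first to a canary slice of users."""
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False
    if cfg["enabled"]:
        return True
    # Canary: deterministic per-user bucketing, no randomness,
    # so a given user gets a consistent experience.
    return user_id % 100 < cfg["canary_percent"]

# A Friday deploy ships this dark for 95% of users;
# raising canary_percent later is the actual release.
print(is_enabled("new-checkout", user_id=3))   # in the 5% canary
print(is_enabled("new-checkout", user_id=42))  # dark for everyone else
```

The risky moment moves from "binary hits the servers" to "flag flips", and the latter is instant to revert.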

[+] Sevii|3 years ago|reply
The point of not deploying on friday is to reduce the risk of getting paged over the weekend. It's a quality of life move for the oncall team. No deployment strategy will change the fact that deployments are the leading cause of outages.

If you can't afford to give up 1/5 of your available deployment days you have a problem somewhere in your CI/CD system.

[+] kevan|3 years ago|reply
I'm a huge advocate for CI/CD pipelines and my team owns a lot of them. We're confident enough to deploy anytime but we choose to limit deploys to our team's business hours and not on Fridays. Why? Because we think the return going from deploying 4 days/week to 5 days/week is outweighed by the stress and morale hit of ruined weekend plans if something weird happens. There's probably situations where that extra speed makes a difference but for us deploying to all regions safely can take a full day anyways so it's pretty normal to have multiple changes flowing at the same time.
[+] grogenaut|3 years ago|reply
CI/CD, flags, and canaries don't catch everything, and can still cause outages for others. We try to do pretty heavy CI/CD where I work, but not everyone does (we, like everyone, have old systems). It's actually quite easy for us to have the well-behaved systems honor release hours or not, depending on how their release history has gone, or coverage, etc. But they're well behaved, so they usually have great tests, and they're not usually panicked about rolling out after hours; they have their sh*t together.

The reason we have core-hours-only releases without director approval (aka director approval required outside core hours) is so you don't piss off another team by paging them after hours, and so you aren't trying to shove out a thing on a system that doesn't have good coverage, or by turning off the safeties. In a large company, I've noticed many engineers assume urgency even where there isn't any. As an approver myself, most of the time someone wants to rush it's because they've not even had the convo with their manager on whether it's worth the risk; they're assuming urgency because that's when the sprint ends, or because of what some TPM added to a Jira ticket 4 months ago.

I admit that sounds risky itself (the engineers not having the right risk training), but this is why we have a policy and tooling. Most of the times I've dug in, they're just very new and worried about perception as a new employee, so my job is to shepherd them through having that convo with their managers, which inevitably has the managers saying "yes, it can totally wait till Monday", and the change is inevitably a bit more hot than it should be due to accidental deadline pressure.

[+] rexarex|3 years ago|reply
I get that people really want to flex that they can deploy on Friday afternoon and NOTHING CAN GO WRONG, but it’s still foolish and flouts Murphy’s Law. It can wait.
[+] dopylitty|3 years ago|reply
This one made me laugh. I've been places that only allow deployments on Fridays because it gives the whole weekend to fix things if they break.

It's a good interview question as a candidate. If you ask the interviewer when they deploy and they say only Friday (or worse only once a month) then perhaps look elsewhere for your own sanity because it's a sign of serious malfunction either organizationally, technically, or both.

[+] doctor_eval|3 years ago|reply
You should have both the confidence that you could deploy on Fridays, and the wisdom to know that you shouldn’t.
[+] lopatin|3 years ago|reply
Interestingly my company only deploys on Friday because it has to wait for (most) markets to close for the weekend.
[+] elric|3 years ago|reply
Hating it seems a little strong. I'm sure that any team far along enough on the quality spectrum can just read this and say "we've moved beyond this worry". The post is titled "general guidance", not "absolute truths". Adjust expectations accordingly.
[+] abledon|3 years ago|reply
> If you need to build an architecture which involves microservices, I am sure that your cloud provider has a solution that fits better than Kubernetes. E.g: ECS for AWS.

Thank you! So many people running unnecessary things on Kubernetes

[+] rswail|3 years ago|reply
On the other hand, K8S provides you with orchestration abstraction across AWS, GCP, Azure, VMWare, bare metal.

There are distinct advantages to that in terms of both development (running a local K8S cluster is relatively easy) and deployment.

ECS has no distinct advantages over K8S (or EKS in AWS land). Particularly now that there are CRDs for K8S that allow you to deploy AWS functionality (eg ALBs, TGs) from K8S.

[+] raydiatian|3 years ago|reply
> If you need to build an architecture which involves microservices, I am sure that your cloud provider has a solution that fits better than Kubernetes. E.g: ECS for AWS. Kubernetes is a fantastic toolkit, but only shines when all that it has to offer, gets used.

As far as FaaS goes, I think more people need to go check out Cloud Run as a Knative implementation. Having used it for some time now, it feels like a near-perfect FaaS solution. The only gripe I have is that versioning is a bit dopey. But hey, if I can have autoscaling services with absolute impunity over how my HTTP interface is shaped (looking at you, AWS Lambda) and without needing to worry about Kubernetes headaches, I’m perfectly happy to embed version names in service domains.

[+] elric|3 years ago|reply
> Certify yourself with official courses.

Can anyone recommend some certifications that are worthwhile? I realize that this is a very broad ask, but the advice is also rather broad.

[+] hiAndrewQuinn|3 years ago|reply
_one word answer: AWS_

There's a lot of conceptual carryover between cloud platform offerings, so getting _any_ of the big 3 (GCP, AWS, Azure) is likely to help you out a ton if you're new to the space. Much like how your first programming language took much longer than your second through fifth ones, learning your first platform well enough to get employed is much more challenging than filing the serial numbers off and learning the new quirks of the other two.

In the absence of further information as to your career goals, I'd lightly recommend AWS. It came into existence years before the others and can offer SLAs that approach "this S3 bucket will outlive you in the event of thermonuclear war".

Azure is where I have lived so far in my career and it seems to be catering more towards enterprise and government needs. I actually imagine finding an Azure shop is harder than an AWS shop if you haven't already worked at one before, but it's a pretty sweet gig otherwise.

GCP goes the other direction from what I've seen - much more startup-oriented, as the newest kid on the block itself. It looked nice from the last time I played around with it.

Kubernetes exists as a useful "stage 2" if you want to go further down the pipeline, as a technology whose business raison d'etre is to commoditize cloud providers. _In theory_ a Kubernetes cluster can be engineered to run unaltered on any of the big 3, since they all offer k8s clusters.

It's also totally cool to say that's okay thanks, I'll stick with simple architectures and a focus on getting MVPs out the door rapidly. For me $DAYJOB is spent between Azure and k8s, but my side projects start the same way every time - SQLite, Django, and _maybe_ Docker Compose to sidecar Litestream if I'm feeling extra infra-inclined that day. Really there's no reason to get dogmatic about anything in a space with so many options.

[+] eikenberry|3 years ago|reply
Just about any certificate is worthwhile, depending on your reasons. The best case I've seen them used for is to help you break into new technology areas, e.g. if you want to work as an SRE for AWS services, having a few AWS certificates under your belt might be just enough to get you that interview (plus you'd kill at AWS trivia nights).
[+] zikduruqe|3 years ago|reply
EVERYTHING costs money. Tag every resource. Come up with ways to show cost avoidance and cost savings. This will be appreciated more by management than any code you can bang out.
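A hypothetical compliance pass makes "tag every resource" enforceable: sweep the fleet for anything missing the cost-allocation tags your org requires (the required keys and resource shapes below are an example policy, not an AWS rule):

```python
REQUIRED_TAGS = {"team", "cost-center", "environment"}

def untagged(resources: list[dict]) -> list[str]:
    """Return ids of resources missing any required cost-allocation tag."""
    return [
        r["id"] for r in resources
        if not REQUIRED_TAGS <= set(r.get("tags", {}))
    ]

# Illustrative inventory, e.g. as returned by a cloud API listing:
fleet = [
    {"id": "i-abc123", "tags": {"team": "payments",
                                "cost-center": "cc-42",
                                "environment": "prod"}},
    {"id": "vol-def456", "tags": {"team": "payments"}},
]
print(untagged(fleet))  # ['vol-def456']
```

Run something like this on a schedule and the "whose is this?" question at cost-review time answers itself.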
[+] rr808|3 years ago|reply
I love monitoring but after a few decades working I still haven't found a good way to monitor everything. Still a mix of email, pagerduty, prometheus, cloudwatch, websites, kibana consoles. Surely there is a good way to do this? I figure some of the new BI dashboards would be good but haven't seen much usage.
[+] nijave|3 years ago|reply
>Before jumping straight into a new technology, read and understand their docs

The number of issues I've seen that turn out to be documented features... (or, more accurately, things just being configured incorrectly)

[+] virgilp|3 years ago|reply
> Microservices should only perform a single task. If you are not able to achieve that isolation, maybe you should switch back to a monolithic architecture. Do not get fooled by the current trends, microservices are not meant for everything.

I feel like this is spectacularly bad advice. "Do not get fooled by shades of grey, things are meant to be either black or white!"

[+] mustafabisic1|3 years ago|reply
Some solid career advice in there as well.

I feel like this could be used as one of those "How to 10x your career" articles, and be better than all of them.

[+] myfirstproject|3 years ago|reply
> Git should be your only source of truth. Discard any local files or changes, what's not pushed into the repository, does not exist.

Completely agree with that.

[+] bobismyuncle|3 years ago|reply
Some of these are lessons you only really learn once you make the mistake yourself
[+] lockedinspace|3 years ago|reply
A helpful list of things to have in mind when working with anything tech related.
[+] raxits|3 years ago|reply
One more

Have a good logging & rollback strategy well communicated across stakeholders

[+] martynvandijke|3 years ago|reply
Nice guide, just curious are there more of these guides ?