top | item 46581698

crawshaw | 1 month ago

The idea that an "observability stack" is going to replace shell access on a server does not resonate with me at all. The metrics I monitor with prometheus and grafana are useful, vital even, but they are always fighting the last war. What I need are tools for when the unknown happens.

The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.
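That kind of spelunking is nothing more than reads against kernel-exported files; a minimal sketch, assuming a Linux box:

```shell
# Ad-hoc kernel spelunking of the kind dashboards never anticipate:
# read live state straight out of /proc (assumes Linux).
cat /proc/loadavg                            # load averages and run-queue snapshot
grep MemAvailable /proc/meminfo              # memory the kernel could free up now
cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range
```

No agent, no exporter, no dashboard panel: the kernel already publishes the state; the shell is just the universal reader.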

ValdikSS | 1 month ago

>It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation.

It is, and SSH is indeed the tool for that, but only because until recently we did not have better tools and interfaces.

Once you try newer tools, you don't want to go back.

Here's an example from a fairly recent debug session of mine:

    - Network is really slow on the home server, no idea why
    - Try to just reboot it, no changes
    - Run kernel perf, check the flame graph
    - Kernel spends A LOT of time in nf_* (netfilter functions, iptables)
    - Check iptables rules
    - sshguard has banned 13000 IP addresses in its table
    - Each network packet travels through all the rules
    - Fix: clean the rules/skip the table for established connections/add timeouts

You don't need debugging facilities for many issues. You need observability and tracing.

Instead of debugging the issue for tens of minutes at least, I just used an observability tool, which showed me the culprit in 2 minutes.
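The diagnosis above reduces to a question no dashboard was asking: how many rules does each chain hold? Given an `iptables-save` dump, one awk pass answers it (the dump below is a synthetic stand-in, not the actual ruleset):

```shell
# A stand-in iptables-save dump (real input would come from `iptables-save`):
dump='-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
-A INPUT -j sshguard
-A sshguard -s 203.0.113.1/32 -j DROP
-A sshguard -s 203.0.113.2/32 -j DROP
-A sshguard -s 203.0.113.3/32 -j DROP'

# Count rules per chain; a ban chain with thousands of entries is the
# smoking gun, since every packet is matched against each rule in turn.
printf '%s\n' "$dump" \
  | awk '$1 == "-A" {count[$2]++} END {for (c in count) print count[c], c}' \
  | sort -rn
```

With 13,000 bans the sshguard line dwarfs everything else, which is exactly what the flame graph's time in nf_* was hinting at.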

IgorPartola | 1 month ago

See, I would not reboot the server before figuring out what is happening. You lose a lot of info by doing that, and the worst thing that can happen is that the problem goes away for a little bit.

gerdesj | 1 month ago

I fail to understand how your approach is different from your parent's.

perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

If you are advocating newer tools, look into nft - iptables is sooo last decade 8) I've used the lot: ipfw, ipchains, iptables and nftables. You might also try fail2ban - it is still worthwhile even in the age of the massively distributed botnet, and covers more than just ssh.
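The fix ValdikSS describes (skip established connections, add timeouts) maps naturally onto an nftables set with per-element timeouts; a minimal sketch of the idea, not a drop-in config:

```
table inet filter {
  set banned {
    type ipv4_addr
    flags timeout
    timeout 1h          # entries expire on their own, no 13k-rule pileup
  }
  chain input {
    type filter hook input priority filter; policy accept;
    ct state established,related accept   # fast path: skip the set entirely
    ip saddr @banned drop                 # one hashed lookup for the rest
  }
}
```

A set lookup is a single hashed match, so the ruleset stays the same size no matter how many addresses are banned.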

I also recommend a VPN and not exposing ssh to the wild.

Finally, 13,000 addresses in an ipset is nothing particularly special these days. I hope sshguard is making a properly optimised ipset table and that you are running appropriate hardware.

My home router is a pfSense jobbie running on a rather elderly APU4 based box and it has over 200,000 IPs in its pfBlocker-NG IP block tables and about 150,000 records in its DNS tables.

johnisgood | 1 month ago

Your example is a shell debugging session. You ran perf, checked iptables, inspected sshguard - all via SSH (or locally). The "observability tool" here is shell access to system utilities.

This proves the parent's point: when the unknown happens, you need a shell.

crawshaw | 1 month ago

How did you use tracing to check the current state of a machine’s iptables rules?

kelnos | 1 month ago

That only works if the people who built the observability tool have thought of everything. They haven't, of course; no one can.

It's great that you were able to solve this problem with your observability tools. But nothing will ever be as comprehensive as what you can do with shell access.

I don't get what the big deal is here. Just... use shell access when you need it. If you have other things in place that let you easily debug and fix some classes of issues, great. But some things might be easier to fix with shell access, and you could very easily run into something you can't figure out without ssh.

Completely disabling shell access is just making things harder for you. You don't get brownie points or magical benefits from denying yourself that.

reactordev | 1 month ago

Or… you build a container that runs exactly what you specify. You ship your logs, traces, and metrics home, capture the stack traces and error messages, fix the bug, and build another container to deploy.

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted as a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

toast0 | 1 month ago

> We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

You would connect to any of the nodes having the problem.

I've worked both ways; IMHO, it's a lot faster to get to understanding in systems where you can inspect and change the system as it runs than in systems where you have to iterate through adding logs and trying to reproduce somewhere else where you can use interactive tools.

My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.

IgorPartola | 1 month ago

Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?
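Much of that starts with exactly this sort of shell one-liner: sample the process's resident set from /proc and watch whether it climbs. A sketch inspecting the current shell itself, assuming Linux:

```shell
# Sample resident set size for a PID from /proc; in a real leak hunt
# you would loop this and watch the number grow. Here the target is
# the current shell itself.
pid=$$
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
echo "pid $pid rss ${rss_kb} kB"
```

From there you escalate to strace, gdb, or a heap profiler attached to the live process, none of which works without access to the box the process is actually on.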

cyberax | 1 month ago

Because you're holding it wrong!

Dashboards look cool, but they are usually not much help for debugging. What you're looking for is per-request tracing and logging, so you can grab a request ID and trace it (get the log messages associated with it) through multiple levels of the stack, maybe even across different services.
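That per-request workflow is, at bottom, a grep for one ID across merged logs; the log lines and request ID below are made up for illustration:

```shell
# Merged logs from several services, each line tagged with a request ID
# (synthetic sample data):
logs='gateway req=abc123 GET /api/items 200 12ms
auth    req=abc123 token ok
items   req=abc123 db query 9ms
gateway req=zzz999 GET /health 200 1ms'

# Pull one request's whole path through the stack:
printf '%s\n' "$logs" | grep 'req=abc123'
```

Real systems dress this up with structured logging and a trace store, but the propagated-ID-plus-search core is the same.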

Debuggers are great, but they are not a good option for production traffic.

raggi | 1 month ago

Yep.

Observability stacks are a blind alley similar to containers: they solve a handful of well-defined problems and then immediately fall down on their own KPIs: events handled or prevented in place, efficiency, and being easier to use than what came before.

ValdikSS | 1 month ago

>What I need are tools for when the unknown happens.

There are tools which show what happens per process/thread and inside the kernel. Profiling and tracing.

Check Yandex's Perforator or Google's Perfetto. Netflix also has one; I forget the name.

cryptonector | 1 month ago

The problem lies in surveillance and others understanding what you did. Say your security department records every shell interaction with prod services: how does one then review and understand what happened? This is a fairly tricky problem. Perhaps throw it at an LLM, but it'd have to be well trained to look for malicious actions.

jeffbee | 1 month ago

I guess the question is why your observability stack isn't exposing /proc and /sys for you.

crawshaw | 1 month ago

Mine (prometheus) doesn’t because there are a lot of high-dimensional values to track in /proc and /sys that would blow out storage on a time-series database. Even if they did though, they could not let me actively inject changes to a cgroup. What do you suggest I try that does?

gear54rus | 1 month ago

Agreed, this sounds like some complicated ass-backwards way to do what k8s already does. If it's too big for you, just use k3s or k0s and you will still benefit from the absolutely massive ecosystem.

But instead we go with multiple moving parts, all configured independently: CoreOS, Terraform, and a dependency on Vultr. Lol.

Never in a million years would I think it's a good idea to disable SSH access. Like, why? Keys and a non-standard port already bring login attempts from China down to like 0 a year.