item 37598382

tkahnoski | 2 years ago

Worked at a medium-size enterprise and was trying to get detailed performance metrics out of a legacy tech stack that didn't have a drop-in APM solution. This was in the age of Graphite, which was great for aggregating metrics cheaply but not for getting detail.

Splunk was used by a much larger product (easily 10x our scale) for monitoring events, so there was no red tape to start using it.

After launching the detailed instrumentation (one structured log event per HTTP request, with a breakout of database/service activity), I was able to gain all of the insight needed and build a simple user/URL lookup dashboard page to help other engineers see what was going on. We went from being mostly blind to almost full visibility in under two weeks.
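The pattern described above (one structured log event per request, with per-bucket timing for database and downstream-service calls) can be sketched in plain Python. This is a minimal illustration, not the original implementation; the class and field names (`RequestEvent`, `db_ms`, `service_ms`) are hypothetical.

```python
import json
import time
from contextlib import contextmanager

class RequestEvent:
    """Accumulates one structured log event for a single HTTP request."""

    def __init__(self, user, url):
        # One event per request, with timing buckets broken out by activity.
        self.event = {"user": user, "url": url, "db_ms": 0.0, "service_ms": 0.0}
        self._start = time.monotonic()

    @contextmanager
    def timed(self, bucket):
        # Wrap a database or downstream-service call and add its
        # elapsed time (in milliseconds) to the named bucket.
        t0 = time.monotonic()
        try:
            yield
        finally:
            self.event[bucket] += (time.monotonic() - t0) * 1000

    def emit(self):
        # Record total request time and serialize the event as one JSON line,
        # ready to ship to a log indexer such as Splunk or ELK.
        self.event["total_ms"] = (time.monotonic() - self._start) * 1000
        return json.dumps(self.event)
```

A request handler would create one `RequestEvent`, wrap each database/service call in `timed(...)`, and emit a single line at the end; the user/URL fields are what make the lookup dashboard possible.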

The downside was that we increased our billable Splunk usage by 50%, since we were capturing so much more data per log event than the other product, which was just consuming standard IIS/Apache logs.

That type of flexibility was totally worth it. Due to some acquisition shenanigans we broke off from that group and wound up on an ELK stack, which didn't perform quite as well but was still usable with the same data. In today's day and age we could have just built an OpenTelemetry library.

hparadiz | 2 years ago

Comcast would drop all the error logs for all the cable boxes in the country into Splunk. I then queried this to figure out the error-code count in a given period. It's really the only thing that can handle the volume.
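The query described above (error-code counts within a time window) is essentially a filter-and-aggregate. A minimal sketch in plain Python, assuming each log event carries an ISO-8601 timestamp and an error code (the `ts` and `error_code` field names are hypothetical):

```python
from collections import Counter
from datetime import datetime

def error_code_counts(events, start, end):
    """Count error codes for events whose timestamp falls in [start, end)."""
    counts = Counter()
    for ev in events:
        ts = datetime.fromisoformat(ev["ts"])
        if start <= ts < end:
            counts[ev["error_code"]] += 1
    return counts
```

In Splunk itself this kind of aggregation would be a search with a time range plus a `stats count by` clause; the point of the sketch is just the shape of the computation, not the query language.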

sib | 2 years ago

No wonder Comcast subscriptions are so expensive...

closeparen | 2 years ago

We had an ELK stack I was never very happy with (granted, it was on very old versions), and then it got replaced by ClickHouse. It's been excellent.

ilyt | 2 years ago

The E in it is great, the L is fiddly but useful, but the K is easily my least-liked tool.