
Ask HN: How do you handle logging?

268 points | ElFitz | 6 years ago | reply

Hi!

I work as the backend developer at a mobile app startup, and we don't currently have any centralized logging.

So... how do you do it? Is there any way to have something similar to AWS X-Ray, to trace a single chain of events across platforms? Or is that a bad idea? I really don't know ^^'

120 comments

[+] wenc|6 years ago|reply
1) Log to local disk (most people will tell you this is bad practice and that you should directly log to socket or whatever, but it's more likely for your network to be down than for your disk to fail).

In Python, use the RotatingFileHandler to avoid running out of space (see the sketch at the end of this comment).

2) Incrementally forward your log files to a server using something like fluentd that can pre-aggregate and/or filter messages.

Big advantage of logging to disk: if the logging server is unreachable, the forwarder can resume once it's up again. If you log directly over the network and things fail, the very log messages you need to troubleshoot the failure are potentially gone.

3) Visualize. Create alerts.

I've evaluated a bunch of logging solutions. Splunk is the best, and affordable at low data volumes (they have a pricing calculator, you can check for yourself). It's medium-hard to set up.

Sumo Logic is the easiest to set up, and at low data volumes, prices are similar to Splunk. You can get something working within an hour or less.

ELK stack is free only in bits but not in engineering time.

I've not actually tried Sentry.io but I saw it at PyCon and it looks pretty impressive. If you only care about tracking errors/events and not about general-purpose logging functionality per se, I would take a serious look at it.
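
A minimal sketch of the rotating-file setup from point 1 (path, sizes, and logger name are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

# Keep at most 5 x 10 MB of log files on disk; the oldest backup gets overwritten.
handler = RotatingFileHandler(
    "app.log",                 # illustrative path
    maxBytes=10 * 1024 * 1024,
    backupCount=5,
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("myapp")  # illustrative logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("application started")
```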

[+] jasonrojas|6 years ago|reply
“ELK stack is free only in bits but not in engineering time.” — best thing I’ve read all week. Thank you.
[+] viraptor|6 years ago|reply
Sentry is not for logging really. It's designed for errors/exceptions. I believe you should use rollbar/sentry/airbrake regardless of whether you use centralised logging. Or even before it.
[+] codemac|6 years ago|reply
> it's more likely for your network to be down than for your disk to fail

For most people, the network being down means they can't reach the disk.

Buffering unsent logs on local disk or in RAM is critical given network flakiness, sure, but skipping network logging entirely is a bad idea 100% of the time.

[+] linsomniac|6 years ago|reply
I set up Sentry a few years ago, based on a PyCon hallway-track talk I saw. We had a few false starts with it, but have integrated it with a couple of our newer platforms and have liked it.

It takes a kind of "ticket" approach to messages: it'll deduplicate and combine similar errors, and you go into a dashboard and see "We got ten thousand of this error; let me track it down, fix it, ack it, and see if we keep getting it."

[+] MuffinFlavored|6 years ago|reply
It blows my mind that there isn't an easily Googleable example of how to pipe JSON logs on disk to Logstash. It's 2019. The software should be able to come online, check which logs it has already indexed, and then `tail -f` files matching a pattern. I really thought this was a more common use case, but it's obvious things went the "directly log to socket" route you mentioned.
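
For what it's worth, the checkpoint-and-tail behavior itself is only a few lines. A rough Python sketch (paths are illustrative, and it ignores file rotation):

```python
import json
import time

LOG_PATH = "/var/log/myapp/app.json"       # illustrative JSON-lines log file
STATE_PATH = "/var/tmp/myapp-tail.offset"  # checkpoint: how far we've already read

def load_offset() -> int:
    try:
        with open(STATE_PATH) as f:
            return int(f.read())
    except (FileNotFoundError, ValueError):
        return 0

def save_offset(offset: int) -> None:
    with open(STATE_PATH, "w") as f:
        f.write(str(offset))

# Follow the file like `tail -f`, resuming from the last checkpoint on restart.
with open(LOG_PATH) as f:
    f.seek(load_offset())
    while True:
        line = f.readline()
        if not line:
            save_offset(f.tell())  # persist progress, then wait for more data
            time.sleep(1)
            continue
        event = json.loads(line)
        print(event)  # stand-in: ship the parsed event to Logstash/Elasticsearch
```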
[+] jetru|6 years ago|reply
We do a similar thing. Additionally, we use logrotate to move these files into S3, which later get ETLed into Parquet so we can query them for long-term analytics.
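
The Parquet step can be tiny. A sketch assuming pandas with pyarrow installed (file names are illustrative):

```python
import pandas as pd  # assumes pandas + pyarrow installed

# Read a day's worth of JSON-lines logs pulled down from S3...
df = pd.read_json("2019-08-27.json", lines=True)

# ...and rewrite them as columnar Parquet for cheap long-term analytics.
df.to_parquet("2019-08-27.parquet", index=False)
```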
[+] caseyf7|6 years ago|reply
One should keep in mind that ELK is quite expensive once you reach the point where you need to pay for it.
[+] devonkim|6 years ago|reply
With cloud options available for both Elasticsearch and Splunk these days, difficulty of setup / ease of use may be better evaluated from the client perspective. I've set up both ES and Splunk in the past few years, and it's not terribly different at lower-end scale (< 100 GB logs / mo). But currently Splunk is not as good as ELK for metrics, and vice versa; with the recent SignalFx acquisition that may change in a couple of years, but it definitely isn't the case now. Also, there are tons of options for streaming logs to ES besides Logstash, including Filebeat, which is at least written in Go (Splunk's forwarders are probably in C, given I swear they've been mostly the same since the early 2000s).
[+] TrickyRick|6 years ago|reply
Sematext Cloud (https://sematext.com/cloud/) has been working well for me at a relatively small startup (a couple thousand active users). Cheap solution and easy to get started.
[+] mandeepj|6 years ago|reply
> network to be down than for your disk to fail

If the network is down, how are your users going to reach your app?

[+] enobrev|6 years ago|reply
Everything logs to syslog (I generally use rsyslog) in JSON format.

All syslog instances push to a central instance, also running rsyslog. This allows us to tail logs on each instance, as well as tail / grep system-wide on the central instance.

Central instance pushes everything directly into elasticsearch.

Using Kibana for searching and aggregating. Using simple scripts for generating alarms and reports.

Every day, a snapshot of the previous day is uploaded to S3, and indexes from 14 days ago are removed. This allows us to easily restore historical data, but also keeps our ES instance relatively thin for daily usage / tracking / debugging. It also makes it possible to replace our central log instance without losing too much.

All devs use some simple convention (ideally built into the logging libs) to make searching and tracing relatively easy. These include "request ids" for all logs pertaining to a single process of work, and "thread ids" for tracing multiple related "requests".

I documented how I have rsyslog and elasticsearch set up here: https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elas...
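
On the application side, getting JSON into syslog can be as simple as a custom formatter. A minimal Python stdlib sketch (field names are illustrative):

```python
import json
import logging
from logging.handlers import SysLogHandler

class JsonFormatter(logging.Formatter):
    """Render each record as a single-line JSON object."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# /dev/log is the local syslog socket on most Linux systems.
handler = SysLogHandler(address="/dev/log")
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("myapp")  # illustrative logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("user signed up")
```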

[+] porker|6 years ago|reply
How do you change everything on a system to use JSON format? My syslog (Debian) is filled with text-line entries, and I've not seen a setting to change this.
[+] vinay_ys|6 years ago|reply
Since others have answered with specific tech stacks, I'll give a more generalized/abstracted answer. While getting started, here are a few high-level principles I found useful to adhere to that will make your life easier later:

Think of a multi-stage pipeline for getting raw data from your transactional/interaction systems and extracting insights and intelligence out of them.

Stage-1: Ingestion – Keep this simple. Don't mess this up. It's a serious headache if you do.

1. Generate a request-id or message-id at the genesis of the request and propagate it throughout your call graph from client to servers (across any number of api call graph hops).

2. At each node in the call graph, emit whatever logs you want to emit, and include this ID.

3. Use whatever format you find natural and easy to use in each tech stack environment. Key is to make the logging instrumentation very natural and normal to each codebase such that the instrumentation does not get accidentally broken while adding new features.

4. Build a plumbing layer (agnostic of what is being logged) that can locally buffer these log messages, periodically compress and package them with added sequence and integrity verification mechanisms, and reliably transmit them to a central warehouse (see the sketch at the end of this comment). Use this across all your server-side nodes. Build a similar one for each of your client-side platforms.

5. At the central warehouse, immediately persist these log packages durably, and only then respond to the client indicating it is safe to purge those packages on their local nodes.

Stage-2: Use-case driven ETLs.

6. Come up with use-cases to consume this data. Define data tables (facts and dimensions) needed to support these consumption use cases.

7. Build a high-performance stream processing system that can process the raw log packages for doing ETL (extract, transform and load) on the raw data in different formats to the defined consumable data tables.

Stage-3: Actual Use-case data applications.

Run your analytics and machine learning systems on top of these stable consumable data formats.

Keep the stages separate and decoupled in code and systems. Don't do end-to-end optimizations that break the boundaries. Recognize that the actors/stakeholders involved in each stage are different. The job of the data team is to be the guardian of these stages and to run the systems and org processes to support them.
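
A bare-bones sketch of the point-4 plumbing layer (the endpoint and paths are illustrative; a real implementation adds retries, sequence numbers, and integrity checks):

```python
import glob
import gzip
import os
import time
import urllib.request

SPOOL_DIR = "/var/spool/myapp-logs"              # illustrative: where the app buffers logs
WAREHOUSE = "https://warehouse.internal/ingest"  # illustrative central endpoint

def ship_once() -> None:
    for path in sorted(glob.glob(os.path.join(SPOOL_DIR, "*.log"))):
        with open(path, "rb") as f:
            payload = gzip.compress(f.read())  # compress and package locally
        req = urllib.request.Request(
            WAREHOUSE,
            data=payload,
            headers={"Content-Encoding": "gzip"},
        )
        with urllib.request.urlopen(req) as resp:
            # Purge the local copy only after the warehouse acknowledges durable persistence.
            if resp.status == 200:
                os.remove(path)

while True:
    ship_once()
    time.sleep(60)  # forward incrementally, once a minute
```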

[+] meowface|6 years ago|reply
This is true if you plan to develop a SIEM or ELK completely from scratch. Interesting as general background info, but I can't see this information being practically useful to anyone who just wants to log stuff. It'd be like building a washing machine and drying machine from scratch because you want to wash your clothes.

You seem to be describing low level principles, not high level ones. A high level principle would be "forward your logs to a centralized logging service and let the logging library and the service do 100% of the work for you", which I think is what nearly everyone should do (and which most are already doing).

[+] mustardo|6 years ago|reply
Such an HN answer.

The dude's...

> the backend developer at a mobile app startup

How about syslog, the ELK stack, or something, and focus on building the app?

Some good points in there all the same, like correlation IDs.

[+] stareatgoats|6 years ago|reply
Having battled the curse of spurious errors in both large and small systems without centralized logging, I'd say these look like great tips to me. Even for smaller setups, which might need something like this more than anyone (with a simplified analytics setup in that case, and the machine learning excluded, I guess).
[+] alex-bender|6 years ago|reply
> and propagate it throughout your call graph from

Have you tried something like opentracing.io?

[+] hooch|6 years ago|reply
Thanks. This provides a very useful architectural map to keep in mind whilst investigating the various implementations.
[+] stickfigure|6 years ago|reply
I use the stackdriver logging in Google Cloud Platform.

My GAE apps and Google services just log there automatically. My non-GCP services require a keyfile and a couple of lines of fairly trivial setup.

I have a single logging console across my entire system with nearly zero effort and expense. It works incredibly well. Doing this in-house is a waste of engineering resources.

[+] jmb12686|6 years ago|reply
Stackdriver is remarkably awesome for log aggregation, storage, and querying. Uptime checks to arbitrary HTTP endpoints are fantastic!

Not sure about other use cases such as visualization and triggering events. I assume they have an API or integrations for such things; I just haven't needed them as of yet.

Their pricing changed recently; I don't remember the details, but I do remember that non-Google-Cloud nodes previously incurred an additional cost. Free limits are decent; I haven't paid yet for personal side stuff. But YMMV, check the pricing page: https://cloud.google.com/stackdriver/pricing

[+] bpp|6 years ago|reply
Been using Stackdriver for four years and will back up those who are saying it works very well.
[+] Axsuul|6 years ago|reply
How has their client been with searching/tailing?
[+] samblr|6 years ago|reply
This.

After using Stackdriver, setting up your whole logging mechanism in AWS, at least, feels so backward.

[+] vonseel|6 years ago|reply
Does stackdriver logging become cost-prohibitive quickly?
[+] halotrope|6 years ago|reply
Same here. It works remarkably well out of the box.
[+] jacobsenscott|6 years ago|reply
As a startup you should be using one of the many logging services out there - definitely don't waste time rolling your own or trying to install some open source log aggregator in an EC2 instance or something.

For error tracking, which is mostly what you'll care about, use a service like honeybadger, or rollbar, or whatever fits well with your stack.

For performance metrics use a dedicated service for that as well. NewRelic, or Skylight, or whatever works well for your stack.

[+] jrockway|6 years ago|reply
Yes, you want to have a single chain of events across all of your infrastructure. This is called "distributed tracing". There are a few solutions available; I recommend Jaeger.

You do need to instrument your applications to emit traces, but don't go overboard. Make sure everything can extract the trace ID from headers / metadata and that requests they generate include the trace ID. Most languages have plugins for their HTTP / gRPC server and client libraries to do this automatically.

You will want your edge proxy to start the trace for you; this is very easy with Envoy and ... possible ... with nginx and the opentracing plugins.

I use structured logs (zap specifically), so I wrote an adaptor that accepts a context.Context for a given request, extracts the trace ID from that (and x-b3-sampled), and logs it with every line. This means that when I'm looking at logs, I can easily cut-n-paste the ID into Jaeger to look at the entire request, or if I'm looking at a trace, type the ID into Kibana and see every log line associated with the request. (The truly motivated engineer would modify the Jaeger UI to pull logs along with spans since they're both stored in ES. Someday I will do this.)
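
The adaptor itself is Go/zap-specific; the same idea in Python, for illustration (a context variable stands in for context.Context, and the trace ID is illustrative):

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled;
# set it wherever you extract the ID from incoming headers/metadata.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current trace ID so logs and traces cross-reference."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace=%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logging.basicConfig(handlers=[handler], level=logging.INFO)

# In a request handler, after pulling the ID out of the headers:
trace_id_var.set("0af7651916cd43dd8448eb211c80319c")  # illustrative trace ID
logging.info("fetching user profile")  # this line now carries the trace ID
```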

As for log storage and searching, every existing solution is terrible and you will hate it. I used ELK. With 4 Amazon-managed m4.large nodes... it still takes forever to search our tiny amount of logs (O(GB/day)). It took me days to figure out how to make fluentd parse zap's output properly. And every time I use Kibana, I curse it as the query language does overly-broad full-text searches, completely ignoring my query and then spending a minute to return all log lines that contain the letter "a" or something. "kubectl logs xxx | grep whatever" was my go-to searching solution. Fast and free.

If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)

[+] dmoy|6 years ago|reply
> If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)

You can pry lingo/sawzall from my cold, dead hands.

[+] jshawl|6 years ago|reply
Disclaimer: I work for both Papertrail and Loggly's parent company: SolarWinds.

For general-purpose logging, we deploy Papertrail's remote_syslog2 (https://github.com/papertrail/remote_syslog2), which is more or less a set-it-and-forget-it setup: specify which text files to aggregate, then watch them flow into the live tail viewer.

For logging in more limited environments (where we can't sudo or apt-get install), we use Loggly's HTTP API (https://www.loggly.com/docs/http-endpoint/). Loggly's JSON support also allows us to answer questions like "how many signup events failed since the last deployment?" or "what is the most common signup error?"

Bonus! If you're looking for trace-level reporting and integrating that with your logs, check out the AppOptics and Loggly integration: https://www.loggly.com/blog/announcing-appoptics-loggly-inte...
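
Pushing a JSON event to an HTTP log endpoint is a one-liner from nearly any stack. A Python sketch (the URL and token are placeholders; check the endpoint docs linked above for the real format):

```python
import json
import urllib.request

# Placeholder URL and token -- substitute the real values from the endpoint docs.
URL = "https://logs-01.loggly.com/inputs/YOUR-CUSTOMER-TOKEN/tag/http/"

event = {"event": "signup_failed", "reason": "email_taken", "deploy": "v42"}
req = urllib.request.Request(
    URL,
    data=json.dumps(event).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    assert resp.status == 200  # endpoint acknowledges accepted events
```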

[+] pragmatic|6 years ago|reply
The remote_syslog2 project doesn't seem to be very active. Is it still supported and maintained?
[+] binarylogic|6 years ago|reply
I'm biased because my team and I created Vector [0], but I'd highly recommend investing in a vendor-agnostic data collector to start. You can use this to collect your data and send it wherever you please. This will afford you the flexibility to make changes as you learn more, which will be inevitable.

[0]: https://github.com/timberio/vector

[+] jbob2000|6 years ago|reply
Don't try to roll it out all in one shot. Just work on solving problems. Database is timing out? Add some logging there. Requests getting dropped between proxy and app servers? Add some logging there.

If you try to add logging across the entire infrastructure in one shot, you won't know what logs you actually need. And when it comes time to diagnose a problem, you probably won't be capturing the correct data.

[+] weq|6 years ago|reply
This is a good point.

For me, this looked like logging to a ring buffer and then dumping that log with an associated error report when an exception occurred. That was good enough for 99% of the errors I debugged, and we never actually needed a log-shipping solution. Logs were kept on disk and uploaded on demand when investigating specific issues.
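
Python's stdlib ships something close to this out of the box, for what it's worth: MemoryHandler buffers records in memory and only dumps them when an error shows up. Not a true ring buffer (it flushes when full instead of dropping the oldest), but the effect is similar:

```python
import logging
import logging.handlers

# Buffered records only hit disk when an ERROR (or worse) comes through.
file_handler = logging.FileHandler("crash-context.log")  # illustrative path
buffer_handler = logging.handlers.MemoryHandler(
    capacity=1000,             # keep up to 1000 records in memory
    flushLevel=logging.ERROR,  # dump the buffer when an error is logged
    target=file_handler,
)

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(buffer_handler)

logger.debug("lots of routine detail")  # stays in memory
logger.error("boom")                    # flushes the buffered context to disk
```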

It depends on what kind of startup you are in, what kind of product you ship, what kind of user base you have, and what kind of solution you have. If you cobble together a set of SaaS solutions, ETL will be your integration challenge.

[+] gtsteve|6 years ago|reply
Well, I threw together a system that assigns a GUID to each request and reports this GUID to the user if something goes wrong. The GUID is sent when calling across services internally, so you can trace log lines across API calls and services.

The logs are written from containers to CloudWatch and subsequently forwarded to Elasticsearch, where we use Kibana and LogTrail [0] to view and search them.

It's nowhere near as nice as X-Ray and other APM solutions, but it hardly took any time to throw together. Fundamentally, this is how X-Ray works, only there is a specific format for the ID.

However, X-Ray now supports our runtime, so we'll take another look at it. It looked like an interesting option at the time.

For a mobile app, you'd want to assign a GUID or some sort of user ID to the device itself so you can track the distinct API calls it makes. I believe X-Ray and other systems support this, but we don't have a mobile app, so I don't know how that'd work for you.

[0] https://github.com/sivasamyk/logtrail

[+] badrabbit|6 years ago|reply
I am shocked that no one has mentioned Graylog so far.

Check it out. It's done wonders for me. You can manipulate, sort, retain, and do other things with log events. It uses Elasticsearch to store the logs.

It has SIEM-like functionality with alerts, and they are continuing to make it more suitable as a SIEM replacement.

And it does have CloudTrail support.

[+] nullwarp|6 years ago|reply
My only real complaint with Graylog is that it seems all the modules/packs (or whatever they call them) broke between v2 and v3, and the useful ones are broken now.

Maybe it's better now than when I last tried, but it was a real negative to import some of them only to find out later they were incompatible.

[+] keyle|6 years ago|reply
I was just searching for the word "graylog" as I was about to say the exact same thing.
[+] colechristensen|6 years ago|reply
Centralized logging: SaaS services that do this are a dime a dozen. Sumologic, Datadog, Elastic, etc.

You seem to be interested in tracing or APM [1], which also has many providers.

Lots of people run a local Elasticsearch, Logstash, Kibana stack, which can be done without licensing, with a variety of forwarders.

You might be most interested in Envoy Proxy or Elastic APM (there are many others)

https://www.envoyproxy.io

https://www.elastic.co/products/apm

1. https://en.wikipedia.org/wiki/Application_performance_manage...

[+] kirktrue|6 years ago|reply
It sounds like you're looking for something like distributed tracing (vs. vanilla logging).

Zipkin (https://github.com/openzipkin) and OpenTracing (https://github.com/opentracing) purport to be vendor/platform agnostic tracing frameworks and have support with various servers/systems/etc.

X-Ray was pretty trivial to use in AWS land w/ Java as a client.

[+] ElFitz|6 years ago|reply
I really didn't expect to get this many passionate opinions on the matter.

It took me some time to... build up the courage to read through all of your answers, and you have been of tremendous help. I've learned quite a lot. Thank you very much! I deeply appreciate it!

I'll steer clear of self-hosted ELK, for now, mostly because, being the only backend developer, I can't really take the risk of holding the whole team back while getting it up and running or maintaining it.

I'll look into Splunk, Sumo Logic, Sentry & a few others, while keeping in mind the more general guidelines that were laid down here.

Also, thank you for the terminology! It's much easier to find the proper resources now that I know what to look for!

Edit: I'll also take some time to reply to the different comments; it really felt rude of me to be procrastinating while you all had taken the time to properly answer.

[+] Sevii|6 years ago|reply
Log to disk. Rotate every hour and upload to S3. Download from S3 as needed and query via grep, awk, etc.
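
A sketch of the rotate-and-upload step (assumes boto3 with AWS credentials configured, logrotate's default numeric suffix, and illustrative bucket/paths):

```python
import gzip
import os
import shutil
from datetime import datetime, timezone

import boto3  # assumes AWS credentials are configured in the environment

BUCKET = "my-log-archive"   # illustrative bucket name
LOG_DIR = "/var/log/myapp"  # illustrative log directory

s3 = boto3.client("s3")

# Run hourly (e.g. from cron), after logrotate has done its pass.
for name in os.listdir(LOG_DIR):
    if not name.endswith(".1"):  # only pick up freshly rotated files
        continue
    path = os.path.join(LOG_DIR, name)
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    key = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H/") + name + ".gz"
    s3.upload_file(gz_path, BUCKET, key)  # date-partitioned keys make grep-by-day easy
    os.remove(path)
    os.remove(gz_path)
```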
[+] cbanek|6 years ago|reply
> trace a single chain of events across platforms

Since it sounds like you also control the app, maybe make an HTTP header that the app sends that has some kind of UUID for that transaction. When your backend gets it, keep passing it on and logging it as part of your context when you emit log lines. Then using whatever log aggregation system you use, you can search for that UUID.
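
Concretely, the backend side might look like this (a minimal Python sketch; the header name and downstream URL are illustrative):

```python
import logging
import uuid

import requests  # assumes outbound calls use the 'requests' library

TRACE_HEADER = "X-Request-ID"  # illustrative header name

# Every record formatted here carries request_id (the adapter below supplies it).
logging.basicConfig(format="%(asctime)s %(request_id)s %(message)s", level=logging.INFO)

def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's ID if present; otherwise this request starts the trace.
    request_id = incoming_headers.get(TRACE_HEADER, str(uuid.uuid4()))
    log = logging.LoggerAdapter(logging.getLogger("api"), {"request_id": request_id})
    log.info("handling request")

    # Forward the same ID on every downstream call so the chain stays stitched together.
    requests.get("https://inventory.internal/items",  # illustrative downstream service
                 headers={TRACE_HEADER: request_id})
```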

As for collecting your logs, I like ELK stacks; they are easy to set up and to point all your syslogging at. There are also ready-made helm charts to install these into a Kubernetes cluster if you're using that, and they will automatically scoop up everything logged to stdout/stderr.

[+] linsomniac|6 years ago|reply
Apache logs go to rsyslog via logger (apparently the best option with Apache 2.4). Syslogs go to a central rsyslog server over RELP (which has mostly been reliable, though a recent bug in rsyslog caused us to have to reload a week's worth of logs).

Central rsyslog server uses mmnormalize/liblognorm to parse the apache logs and load them into Elasticsearch.

haproxy logs directly to rsyslog via a domain socket, RELP to central server, lognorm to load into ES.

ELB logs go into S3, and logstash pulls them down and loads them into ES.

The remainder of syslog messages just go into files on the central server.

We also have Sentry set up with some newer applications logging into that.

[+] atmosx|6 years ago|reply
The only worthy math teacher I came across once said, "The right answer to 99.9% of the questions I will ask you in the oral exam is: it depends", while smiling cunningly. So to answer your question: it depends. What you're really looking for, however, is not logging; what you're looking for is observability, which has 3 pillars:

- Logs

- Metrics

- Tracing and/or APM

The above are true for systems and applications alike, but let's talk applications. Your decision should be based on an assessment of at least the following:

- Do you have compliance requirements? (e.g. GDPR)

- What is your logs/metrics/traces retention period? (let's assume 30 days)

- What are your logs/metrics/traces lifecycle requirements? (Are you going to need logs older than 30 days? If not, I'd say don't bother; delete everything. Keeping them around has managerial and hosting costs.)

I'd advise taking a look at ElasticSearch:

- ElasticSearch for hosting logs

- For sending logs, metrics and traces, you can use filebeat, metricbeat, and ElasticSearch APM or Jaeger.

If you are a small startup, I'd say go with ElasticSearch Cloud and use their tools. They do all you need and more.

[1]: I prefer metricbeat over prometheus/grafana because it solves the high-availability headache for those who already have an ES cluster, and you don't have to support (set up, monitor, manage, scale) an additional stack. You can use a push model, which has its own pros and cons.

ps. No affiliation with elastic, I just spent some time with a variety of their products and like what I see so far.

[+] danesparza|6 years ago|reply
First, centralized logging is not just a good idea -- it's key when you start working with multiple servers (which will most likely be almost right away). You need to be able to trace requests / responses / errors across your platform. Many tools (including logging library -> database and a custom log search / viewer) can give you this. Just pick something that works for your budget and development process and start there. To track a single chain of events, you'll just need a GUID that you pass between calls in a single request (used for logging).

Next, you'll want to track analytics centrally. Etsy and Netflix have been pioneers in this area. Their engineering blogs are very good to follow. Think: something like a timeseries database (like Influx / Prometheus) and getting data into it. Use tools like Grafana to get data out of it in dashboards or reports. This is separate from your application debug / error logging system.
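
For the "getting data into it" half, a minimal sketch of exposing app metrics for Prometheus to scrape (assumes the prometheus_client package; metric names are illustrative):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics

while True:
    with LATENCY.time():
        time.sleep(random.random() / 10)  # stand-in for real request work
    REQUESTS.labels(endpoint="/signup").inc()
```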

The next step after this is developing something that consumes data from both of those systems and provides alerts based on unusual activity -- something that provides early warning to devops.

[+] nodesocket|6 years ago|reply
Recommend DataDog logs[1]. It integrates with cloud providers and pulls logs from resources like load balancers, S3 buckets, etc. Additionally, you can ingest from files on servers using the DataDog agent, and finally there are language SDKs to push log events from code.

[1] https://docs.datadoghq.com/logs/