Ask HN: How do you handle logging?
I work as the backend developer at a mobile app startup, and we don't currently have any centralized logging.
So... how do you do it? Is there any way to get something similar to AWS X-Ray, to trace a single chain of events across platforms? Or is that a bad idea? I really don't know ^^'
[+] [-] wenc|6 years ago|reply
1) Log to local disk. In Python, use RotatingFileHandler to avoid running out of space.
2) Incrementally forward your log files to a server using something like fluentd that can pre-aggregate and/or filter messages.
Big advantage of logging to disk: if the logging server is unreachable, the forwarder can resume once it's back up. If you log directly over the network and things fail, the very log messages you need to troubleshoot the failure are potentially gone.
3) Visualize. Create alerts.
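Step 1 above can be sketched in a few lines of Python; the file path, sizes, and format string here are placeholder choices:

```python
import logging
import logging.handlers
import os
import tempfile

# Log to a rotating file: at most 5 backups of 1 MB each, so disk use stays bounded.
log_path = os.path.join(tempfile.gettempdir(), "app.log")

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)

handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1_000_000, backupCount=5
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("service started")
```

A forwarder like fluentd then tails `app.log` and its rotated siblings, so the application never talks to the network directly.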
I've evaluated a bunch of logging solutions. Splunk is the best, and it's affordable at low data volumes (they have a pricing calculator; you can check for yourself). It's moderately hard to set up.
Sumo Logic is the easiest to set up, and at low data volumes, prices are similar to Splunk. You can get something working within an hour or less.
The ELK stack is free in bits, but not in engineering time.
I've not actually tried Sentry.io but I saw it at PyCon and it looks pretty impressive. If you only care about tracking errors/events and not about general-purpose logging functionality per se, I would take a serious look at it.
[+] [-] jasonrojas|6 years ago|reply
[+] [-] viraptor|6 years ago|reply
[+] [-] codemac|6 years ago|reply
For most people, the network being down means they can't reach the disk.
Buffering unsent logs on local disk or in RAM is definitely critical given network flakiness, but refusing to log over the network at all is a bad idea 100% of the time.
[+] [-] linsomniac|6 years ago|reply
It takes a kind of "ticket" approach to messages: it'll deduplicate and combine similar errors, and you go into a dashboard and see "we got ten thousand of this error; let me track it down, fix it, ack it, and see if we keep getting it."
[+] [-] MuffinFlavored|6 years ago|reply
[+] [-] jetru|6 years ago|reply
[+] [-] caseyf7|6 years ago|reply
[+] [-] devonkim|6 years ago|reply
[+] [-] TrickyRick|6 years ago|reply
[+] [-] mandeepj|6 years ago|reply
If the network is down, then how are your users going to reach your app?
[+] [-] enobrev|6 years ago|reply
All syslog instances push to a central instance, also running rsyslog. This allows us to tail logs on each instance, as well as tail / grep system-wide on the central instance.
Central instance pushes everything directly into elasticsearch.
Using Kibana for searching and aggregating. Using simple scripts for generating alarms and reports.
Every day a snapshot of the previous day is uploaded to S3 and indexes from 14 days ago are removed. This lets us easily restore historical data while keeping our ES instance relatively thin for daily usage / tracking / debugging. It also makes it possible to replace our central log instance without losing too much.
All devs use some simple convention (ideally built into the logging libs) to make searching and tracing relatively easy. These include "request ids" for all logs pertaining to a single process of work, and "thread ids" for tracing multiple related "requests".
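A minimal sketch of that request-id convention in Python, using a stdlib logging `Filter` so every line carries the id without each call site having to pass it (names here are illustrative):

```python
import io
import logging
import uuid

class RequestIdFilter(logging.Filter):
    """Attach the current request id to every record so it can be searched later."""
    def __init__(self, request_id):
        super().__init__()
        self.request_id = request_id

    def filter(self, record):
        record.request_id = self.request_id
        return True  # never drop records, just annotate them

stream = io.StringIO()  # stand-in for syslog / a file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("request_id=%(request_id)s %(message)s"))

logger = logging.getLogger("web")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# One filter per request; in a real app this is installed by middleware.
rid = uuid.uuid4().hex
logger.addFilter(RequestIdFilter(rid))
logger.info("user created")
```

Grepping (or querying Kibana) for that one id then yields every line belonging to the request.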
I documented how I have rsyslog and elasticsearch set up here: https://www.reddit.com/r/devops/comments/9g1nts/rsyslog_elas...
[+] [-] porker|6 years ago|reply
[+] [-] vinay_ys|6 years ago|reply
Think of a multi-stage pipeline for getting raw data from your transactional/interaction systems and extracting insights and intelligence out of them.
Stage-1: Ingestion – Keep this simple. Don't mess this up. It's a serious headache if you do.
1. Generate a request-id or message-id at the genesis of the request and propagate it throughout your call graph from client to servers (across any number of api call graph hops).
2. At each node in the call graph, emit whatever logs you want to emit, include this id.
3. Use whatever format you find natural and easy to use in each tech stack environment. Key is to make the logging instrumentation very natural and normal to each codebase such that the instrumentation does not get accidentally broken while adding new features.
4. Build a plumbing layer (agnostic of what is being logged) that can locally buffer these log messages, periodically compress and package them with added sequence and integrity verification mechanisms, and reliably transmit them to a central warehouse. Use this across all your server-side nodes. Build a similar one for each of your client-side platforms.
5. At the central warehouse, immediately persist these log packages durably, and only then respond to the clients indicating that it is safe to purge those packages on their local nodes.
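Steps 1–2 above can be simulated in a few lines of Python; the header name and service functions are hypothetical stand-ins for real network hops:

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"

def edge(headers):
    # Genesis of the request: mint an id only if the client did not send one.
    headers.setdefault(REQUEST_ID_HEADER, uuid.uuid4().hex)
    return service_a(headers)

def service_a(headers):
    # Each hop logs with the id and forwards the same header downstream.
    log(headers, "service_a handled request")
    return service_b(headers)

def service_b(headers):
    log(headers, "service_b handled request")
    return headers[REQUEST_ID_HEADER]

lines = []  # stand-in for each node's local log sink

def log(headers, msg):
    lines.append(f"{headers[REQUEST_ID_HEADER]} {msg}")

rid = edge({})
```

Because every hop prefixes the same id, the central warehouse can later stitch the full call graph back together from otherwise independent log streams.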
Stage-2: Use-case driven ETLs.
6. Come up with use-cases to consume this data. Define the data tables (facts and dimensions) needed to support these consumption use-cases.
7. Build a high-performance stream processing system that can process the raw log packages for doing ETL (extract, transform and load) on the raw data in different formats to the defined consumable data tables.
Stage-3: Actual Use-case data applications.
Run your analytics and machine learning systems on top of these stable consumable data formats.
Keep the stages separate and decoupled in code and systems. Don't do end-to-end optimizations that break the boundaries. Recognize that the actors/stakeholders involved in each stage are different. The job of the data team is to be the guardian of these stages and to run the systems and org processes that support them.
[+] [-] meowface|6 years ago|reply
You seem to be describing low level principles, not high level ones. A high level principle would be "forward your logs to a centralized logging service and let the logging library and the service do 100% of the work for you", which I think is what nearly everyone should do (and which most are already doing).
[+] [-] mustardo|6 years ago|reply
The dudes...
>the backend developer at a mobile app startup,
How about syslog, the ELK stack, or something similar, and focus on building the app?
Some good points in there all the same, like correlation IDs etc.
[+] [-] stareatgoats|6 years ago|reply
[+] [-] alex-bender|6 years ago|reply
Have you tried something like opentracing.io ?
[+] [-] hooch|6 years ago|reply
[+] [-] stickfigure|6 years ago|reply
My GAE apps and Google services just log there automatically. My non-GCP services require a keyfile and a couple of lines of fairly trivial setup.
I have a single logging console across my entire system with nearly zero effort and expense. It works incredibly well. Doing this in-house is a waste of engineering resources.
[+] [-] jmb12686|6 years ago|reply
Not sure about other use cases such as visualization and triggering events. I assume they have an API or integrations for such things, just haven't needed it as of yet.
Their pricing changed recently; I don't remember the details, but I do remember that non-Google-Cloud nodes previously incurred an additional cost. Free limits are decent; I haven't paid yet for personal side stuff. But YMMV, check the pricing page https://cloud.google.com/stackdriver/pricing
[+] [-] bpp|6 years ago|reply
[+] [-] Axsuul|6 years ago|reply
[+] [-] samblr|6 years ago|reply
After using Stackdriver, setting up your whole logging mechanism in AWS, at least, feels so backward.
[+] [-] vonseel|6 years ago|reply
[+] [-] halotrope|6 years ago|reply
[+] [-] jacobsenscott|6 years ago|reply
For error tracking, which is mostly what you'll care about, use a service like honeybadger, or rollbar, or whatever fits well with your stack.
For performance metrics use a dedicated service for that as well. NewRelic, or Skylight, or whatever works well for your stack.
[+] [-] jrockway|6 years ago|reply
You do need to instrument your applications to emit traces, but don't go overboard. Make sure everything can extract the trace ID from headers / metadata and that any requests it generates include the trace ID. Most languages have plugins for their HTTP / gRPC server and client libraries to do this automatically.
You will want your edge proxy to start the trace for you; this is very easy with Envoy and ... possible ... with nginx and the opentracing plugins.
I use structured logs (zap specifically), so I wrote an adaptor that accepts a context.Context for a given request, extracts the trace ID from that (and x-b3-sampled), and logs it with every line. This means that when I'm looking at logs, I can easily cut-n-paste the ID into Jaeger to look at the entire request, or if I'm looking at a trace, type the ID into Kibana and see every log line associated with the request. (The truly motivated engineer would modify the Jaeger UI to pull logs along with spans since they're both stored in ES. Someday I will do this.)
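zap and Go aside, the same adaptor idea can be sketched in Python with `contextvars`: stash the trace id once per request, and have a logging filter stamp it onto every record. The trace id value below is a made-up B3-style id:

```python
import contextvars
import io
import logging

# Holds the trace id for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        # Read the ambient trace id; no call site has to pass it explicitly.
        record.trace_id = trace_id_var.get()
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("traced")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Set once per request, e.g. from the incoming x-b3-traceid header.
trace_id_var.set("80f198ee56343ba8")
logger.info("fetching user profile")
```

With that in place, the cut-n-paste workflow described above works in both directions: the logged id goes into Jaeger, and a Jaeger span id goes into Kibana.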
As for log storage and searching, every existing solution is terrible and you will hate it. I used ELK. With 4 Amazon-managed m4.large nodes... it still takes forever to search our tiny amount of logs (O(GB/day)). It took me days to figure out how to make fluentd parse zap's output properly. And every time I use Kibana, I curse it as the query language does overly-broad full-text searches, completely ignoring my query and then spending a minute to return all log lines that contain the letter "a" or something. "kubectl logs xxx | grep whatever" was my go-to searching solution. Fast and free.
If anyone wants to pay me to write a sane log storage and searching system... send me an email ;)
[+] [-] dmoy|6 years ago|reply
You can pry lingo/sawzall from my cold, dead hands.
[+] [-] jshawl|6 years ago|reply
For general purpose logging we deploy Papertrail's remote_syslog2 https://github.com/papertrail/remote_syslog2 - which is more or less a set-it-and-forget-it setup: specify which text files you want to aggregate, then watch them flow into the live tail viewer.
For logging in more limited environments (can't sudo or apt-get install), we use Loggly's http API (https://www.loggly.com/docs/http-endpoint/). Also, Loggly's JSON support allows us to answer questions like: "how many signup events failed since the last deployment". Or "What is the most common signup error".
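Field queries like those only work if logs go out as structured JSON rather than free text. A minimal hand-rolled Python sketch (the field names are illustrative, not Loggly's schema):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so fields can be queried, not grepped."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "ok": getattr(record, "ok", None),
        })

stream = io.StringIO()  # stand-in for the HTTP shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("signup")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Extra fields ride along on the record and land as queryable JSON keys.
logger.info("signup", extra={"ok": False})
line = stream.getvalue().strip()
```

Counting failed signups then becomes a filter on `event` and `ok` instead of a regex over message text.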
Bonus! If you're looking for trace-level reporting and integrating that with your logs, check out the AppOptics and Loggly integration: https://www.loggly.com/blog/announcing-appoptics-loggly-inte...
[+] [-] pragmatic|6 years ago|reply
[+] [-] scoobyyabbadoo|6 years ago|reply
[+] [-] binarylogic|6 years ago|reply
[0]: https://github.com/timberio/vector
[+] [-] jbob2000|6 years ago|reply
If you try to add logging across the entire infrastructure in one shot, you won't know what logs you actually need. And when it comes time to diagnose a problem, you probably won't be capturing the correct data.
[+] [-] weq|6 years ago|reply
For me, this looked like logging to a ring buffer and then dumping that log, with an associated error report, when an exception occurred. That was good enough for 99% of the errors I debugged, and we never actually needed a log-shipping solution. Logs were kept on disk and uploaded on demand when investigating specific issues.
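The stdlib's MemoryHandler gets close to this pattern in Python. It isn't a true ring buffer (when full it flushes rather than discarding the oldest records), but it buffers debug context in memory and dumps it only when an error arrives:

```python
import io
import logging
import logging.handlers

stream = io.StringIO()  # stand-in for the on-disk error report
target = logging.StreamHandler(stream)
target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# Buffer up to 100 records in memory; write them out only when an ERROR arrives.
buffered = logging.handlers.MemoryHandler(
    capacity=100, flushLevel=logging.ERROR, target=target
)

logger = logging.getLogger("ring")
logger.setLevel(logging.DEBUG)
logger.addHandler(buffered)

logger.debug("step 1")
logger.debug("step 2")
assert stream.getvalue() == ""  # nothing written yet: debug lines are only buffered
logger.error("boom")            # flush: the buffered context arrives with the error
```

The payoff is the same as described above: quiet requests cost almost nothing, while a failing request ships its full lead-up automatically.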
It depends on what kind of startup you are in, what kind of product you ship, what kind of user base you have, and what kind of solution you have. If you cobble together a set of SaaS solutions, ETL will be your integration challenge.
[+] [-] gtsteve|6 years ago|reply
The logs are written from containers to CloudWatch and then forwarded to ElasticSearch, where we use Kibana and LogTrail [0] to view and search them.
It's nowhere near as nice as XRay and other APM solutions but it hardly took any time to throw together. Fundamentally, this is how XRay works, only there is a specific format for the ID.
However, XRay now supports our runtime so we'll take another look at that. It looked like an interesting option at the time.
For a mobile app you'd want to assign a guid or some sort of user id to the device itself so you can track the distinct API calls it makes. I believe XRay and other systems support this but we don't have a mobile app so I don't know how that'd work for you.
[0] https://github.com/sivasamyk/logtrail
[+] [-] badrabbit|6 years ago|reply
Check it out. It's done wonders for me. You can manipulate, sort, retain, and do other things to log events with it. It uses elasticsearch to store the logs.
It has SIEM like functionality with alerts and they are continuing to make it more suitable as a SIEM replacement.
And it does have cloudtrail support.
[+] [-] nullwarp|6 years ago|reply
Maybe it's better now than when I tried it, but it was a real negative to import some of them only to find out later that they were incompatible.
[+] [-] keyle|6 years ago|reply
[+] [-] colechristensen|6 years ago|reply
You seem to be interested in tracing or APM [1], which also has many providers.
Lots of people do a local Elasticsearch, Logstash, Kibana stack which can be done without licensing with a variety of forwarders.
You might be most interested in Envoy Proxy or Elastic APM (there are many others)
https://www.envoyproxy.io
https://www.elastic.co/products/apm
1. https://en.wikipedia.org/wiki/Application_performance_manage...
[+] [-] kirktrue|6 years ago|reply
Zipkin (https://github.com/openzipkin) and OpenTracing (https://github.com/opentracing) purport to be vendor/platform agnostic tracing frameworks and have support with various servers/systems/etc.
X-Ray was pretty trivial to use in AWS land w/ Java as a client.
[+] [-] ElFitz|6 years ago|reply
It took me some time to... build up the courage to read through all of your answers, and you have been of tremendous help. I've learned quite a lot. Thank you very much! I deeply appreciate it!
I'll steer clear of self-hosted ELK for now, mostly because, being the only backend dev, I can't really take the risk of holding the whole team back while getting it up and running or maintaining it.
I'll look into Splunk, Sumo Logic, Sentry & a few others, while keeping in mind the more general guidelines that were laid down here.
Also, thank you for the terminology! It's much easier to find the proper resources now that I know what to look for!
Edit: I'll also take some time to reply to the different comments; it really felt rude of me to be procrastinating while you all had taken the time to answer properly.
[+] [-] Sevii|6 years ago|reply
[+] [-] cbanek|6 years ago|reply
Since it sounds like you also control the app, maybe make an HTTP header that the app sends that has some kind of UUID for that transaction. When your backend gets it, keep passing it on and logging it as part of your context when you emit log lines. Then using whatever log aggregation system you use, you can search for that UUID.
As for collecting your logs, I like ELK stacks; they are easy to set up, and it's easy to point all your syslogging at them. There are also ready-made helm charts to install these into a kubernetes cluster if you're using that, and they will automatically scoop up everything logged to stdout/stderr.
[+] [-] linsomniac|6 years ago|reply
Central rsyslog server uses mmnormalize/liblognorm to parse the apache logs and load them into Elasticsearch.
haproxy logs directly to rsyslog via a domain socket, RELP to central server, lognorm to load into ES.
ELB logs go into S3, and logstash pulls them down and loads them into ES.
The remainder of syslog messages just go into files on the central server.
We also have Sentry set up with some newer applications logging into that.
[+] [-] atmosx|6 years ago|reply
- Logs
- Metrics
- Tracing and/or APM
The above are true for systems and applications, but let's talk applications. Your decision should be based on an assessment of at least the following:
- Do you have compliance requirements? (e.g. GDPR)
- What is your logs/metrics/traces retention period? (let's assume 30 days)
- What are your logs/metrics/traces lifecycle requirements? (Are you going to need logs older than 30 days? If not, I'd say don't bother; delete everything older than that, since keeping logs around has managerial and hosting costs.)
I'd advise taking a look at ElasticSearch:
- ElasticSearch for hosting logs
- For sending logs, metrics and traces you can use filebeat, metricbeat, and ElasticSearch APM or Jaeger.
If you are a small startup, I'd say go with ElasticSearch Cloud and use their tools. They do all you need and more.
[1]: I prefer metricbeat over prometheus/grafana because it solves the high-availability headache for those who already have an ES cluster, and you don't have to support (set up, monitor, manage, scale) an additional stack. You can use a push model, which has its own pros and cons.
ps. No affiliation with elastic, I just spent some time with a variety of their products and like what I see so far.
[+] [-] danesparza|6 years ago|reply
Next, you'll want to track analytics centrally. Etsy and Netflix have been pioneers in this area. Their engineering blogs are very good to follow. Think: something like a timeseries database (like Influx / Prometheus) and getting data into it. Use tools like Grafana to get data out of it in dashboards or reports. This is separate from your application debug / error logging system.
The next step after this is developing something that consumes data from both of those systems and provides alerts based on unusual activity -- something that provides early warning to devops.
[+] [-] nodesocket|6 years ago|reply
[1] https://docs.datadoghq.com/logs/