Django Metrics, Logs & Traces Observability with Opentelemetry and Grafana

May 11, 2022 · 13 mins read

Metrics, Logs and Traces (MLT) are the three pillars of observability that can provide us with complete visibility of a software system. In this article, we’re going to set up a system where MLT data is collected from a hello-world microservice application.

Contents

  1. What are metrics, logs and traces?
  2. How can we collect metrics, logs and traces?
  3. Project background
  4. Step by step implementation
  5. Demo
  6. Conclusion

1. What are metrics, logs and traces (MLT)?

Metrics

Metrics are measurements, typically numerical, of how system resources are used. For example:

  • How much CPU was used in the past hour?
  • How much disk space is consumed?
  • How much bandwidth has been used?

Logs

Logs are events recorded by software as it runs. For example:

  • Log the stack trace of a runtime error
  • Log when a user accesses the system
  • Log a critical error

Traces

Traces show the path of a program’s execution. In a microservice/distributed system, a request from a client may be processed by several services. Knowing which path a request took and how much processing time each node needed makes it possible to identify errors and bottlenecks.

More about MLT

Not all three of MLT must always be collected. Logs are the most important element to collect, as they convey what errors have occurred and how the software is executing.

Metrics are the next most important element to collect. With metrics, we can know how well the software is performing and whether we need to optimize functions or scale up servers.

Finally, with tracing, we can track the sequence in which a request has been processed. Tracing was often unnecessary in the past because most systems were monoliths. But as microservices and distributed systems become more popular, it is increasingly important to track how requests flow through each service in the system.

Thus we get complete observability of a system by collecting metrics, logs and traces. In addition, we should be able to correlate MLT easily to better understand the system. For example, from a log we should be able to jump to the trace it was produced in, and vice-versa. By the end of this article, we will have this implemented.

2. How can we collect metrics, logs and traces?

The following tools can be used to collect MLT data. There are of course alternatives, but for this tutorial we will use the tools below.

  1. Grafana [Dashboard]: This is a dashboard where we will observe the MLT data collected by the rest of the services.
  2. Prometheus [Monitoring]: Collects and stores metrics from applications.
  3. Promtail [Logging]: Collects log data from applications and sends them to Loki.
  4. Loki [Logging]: Aggregates and stores all logs sent by Promtail.
  5. Opentelemetry [Tracing]: Instruments an app to collect traces and sends them to Jaeger.
  6. Jaeger [Tracing]: Collects and stores tracing information. Also helps in trace visualization.

The following figure summarizes the technology dependencies.

[Figure: MLT in Kubernetes]

3. Project Background

3.1 Prerequisites

Before we get to the implementation, there are a few things you should be acquainted with.

  • Docker: You should be familiar with Docker. The example project relies heavily on Dockerfiles and docker-compose files.
  • Python/Django: You should be familiar with Python. Knowing Django isn’t strictly necessary, but understanding it will help you follow the hello-world microservices in the example project.

3.2 Source code

The complete implementation of this article can be found at github.com/rafed/opentelemetry-python-grafana-mlt. Clone the project using

$ git clone https://github.com/rafed/opentelemetry-python-grafana-mlt

Explore the project structure of the repository to get a general idea of what exists where.

3.3 Services and System Architecture

The project has two Django apps: a webapp and a webapi. They have the following endpoints.

  • Webapp
    • /
    • /service
  • Webapi
    • /
    • /add

The system architecture is simple. A user hits the webapp at localhost:8000/service. The webapp subsequently hits the webapi at localhost:8001/add to calculate the value of 2+2.

graph TD;
    A[webapp localhost:8000] --> B[webapi localhost:8001];

The root endpoints are provided so that we can check whether the services are running or not.
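
For reference, the webapi’s /add endpoint only needs to return the result of the addition. The snippet below is a hypothetical sketch of such a view; the actual code in the repository may differ.

from django.http import JsonResponse

def add(request):
    # Compute 2 + 2 and return it so the webapp can display the result
    result = 2 + 2
    return JsonResponse({'result': result})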

4. Step by step implementation

If you have a Django project, you can add the monitoring by following the steps below. Or you can use the Django projects in the repository provided with this tutorial.

4.1 Add/run the services in docker-compose

To run our implementation we will use docker-compose, as it makes redeploying our iterations easier. Add the services to a docker-compose.yaml file.

services:
  webapp:
    build: ./webapp
    command: python manage.py runserver 0.0.0.0:8000
    volumes:
      - ./webapp:/webapp
    ports:
      - "8000:8000"
    depends_on:
      - webapi

  webapi:
    build: ./webapi
    command: python manage.py runserver 0.0.0.0:8001
    volumes:
      - ./webapi:/webapi
    ports:
      - "8001:8001"

Run the services using

$ docker-compose up
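
Once the containers are up, you can sanity-check that both apps answer on their root endpoints. Here is a minimal sketch using Python’s requests library from the host machine (assuming requests is installed locally):

import requests

# Both root endpoints should return HTTP 200 if the containers started correctly
for url in ('http://localhost:8000/', 'http://localhost:8001/'):
    response = requests.get(url, timeout=5)
    print(url, response.status_code)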

4.2 Install Prometheus (for metrics)

To collect metrics from Django we first need to install a metrics collector and exporter in the Django apps.

In your django requirements.txt add the following.

django-prometheus==2.2.0

In settings.py add in INSTALLED_APPS

INSTALLED_APPS = [
    ...
    'django_prometheus',
]

and in MIDDLEWARE

MIDDLEWARE = [
    'django_prometheus.middleware.PrometheusBeforeMiddleware',
    # ... your other middlewares go in the middle ...
    'django_prometheus.middleware.PrometheusAfterMiddleware',
]

In urls.py add

urlpatterns = [
    ...
    path('', include('django_prometheus.urls')),
]

Now, when the Django apps run, they will expose their metrics at the /metrics endpoint.
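
You can verify this before wiring up Prometheus. The following is a minimal sketch that fetches the endpoint with Python’s requests library, assuming the webapp is running on localhost:8000:

import requests

# Fetch the Prometheus-formatted metrics exposed by django-prometheus
response = requests.get('http://localhost:8000/metrics', timeout=5)

# Print the first few exposed metric lines as a sanity check
for line in response.text.splitlines()[:10]:
    print(line)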

Next we need to run Prometheus to scrape the metrics from these endpoints. Create a file prometheus/prometheus.yaml and add the following config.

global:
  scrape_interval: 10s
  scrape_timeout: 5s

scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'prometheus:9090'

  - job_name: webapp-scraper
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'webapp:8000'
        labels:
          app: webapp

  - job_name: webapi-scraper
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'webapi:8001'
        labels:
          app: webapi

This config tells Prometheus to scrape our apps at a 10-second interval. The three jobs define which endpoints metrics should be collected from. Notice that we are adding an app label to each target so that we can later filter query results by this label.

Finally add prometheus to the docker-compose.yaml file.

services:
    prometheus:
        image: prom/prometheus:v2.35.0
        ports:
            - 9090:9090
        volumes:
            - ./prometheus:/etc/prometheus
        command: --config.file=/etc/prometheus/prometheus.yaml

Prometheus will now collect and store metrics from our Django apps. Skip to section 4.5 to install Grafana and check whether the metrics configuration is working.
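
If you want to verify the scraping without Grafana, you can also query Prometheus’s HTTP API directly. A minimal sketch, assuming Prometheus is reachable on localhost:9090 and the apps have already received some requests:

import requests

# Instant query against Prometheus's HTTP API for a django-prometheus metric
query = 'rate(django_http_requests_latency_seconds_by_view_method_count[1m])'
response = requests.get(
    'http://localhost:9090/api/v1/query',
    params={'query': query},
    timeout=5,
)

# Each result carries the labels we configured (e.g. app="webapp") plus a value
for result in response.json()['data']['result']:
    print(result['metric'].get('app'), result['value'])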

4.3 Install Loki and Promtail (for logs)

First, we need to make sure that our Django apps are writing logs.

In your Django requirements.txt, make sure the following packages are present and install them.

opentelemetry-sdk==1.10.0
opentelemetry-api==1.10.0
opentelemetry-instrumentation==0.29b0
opentelemetry-instrumentation-django==0.29b0
opentelemetry-instrumentation-logging==0.29b0

In settings.py add a LOGGING configuration. I use the following.

LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'trace_formatter': {
            'format': '[%(asctime)s] %(levelname)s [%(name)s:%(lineno)s] [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s] [%(funcName)s] %(message)s',  # optional, default is logging.BASIC_FORMAT
            'datefmt': '%Y-%m-%d %H:%M:%S',  # optional, default is '%Y-%m-%d %H:%M:%S'
        },
    },
    'handlers': {
        'file': {
            'level': 'WARNING',
            'class': 'logging.FileHandler',
            'formatter': 'trace_formatter',
            'filename': 'webapp.log',
        },
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'trace_formatter',
        },
    },
    'loggers': {
        'django': {
            'handlers': ['console'],
            'level': 'INFO',
            'propagate': True,
        },
    },
    'root': {
        'handlers': ['console', 'file'],
        'level': 'WARNING',
    },
}

Ignore the trace_formatter for now. We will need it in the next section for logging trace ids.

Make sure your Django views are writing some logs. You can use the following template for recording them.

import logging

logger = logging.getLogger(__name__)

logger.error("Log1 on Webapp!")

After confirming that logs are being written to the log file, you can continue with installing Promtail.

Create a file promtail/promtail.yaml and put the following config.

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: django-log-scraper
  static_configs:
  - targets:
      - localhost
    labels:
      __path__: "/var/log/webapp.log"
      app: webapp
  - targets:
      - localhost
    labels:
      __path__: "/var/log/webapi.log"
      app: webapi

This config tells Promtail to collect logs from /var/log/webapp.log and /var/log/webapi.log in the Promtail container and send them to a Loki endpoint. But how does Promtail get these logs into its own container? We mount the Django log files into the Promtail container using docker-compose.

In the docker-compose add the following config.

services:
    promtail:
        image: grafana/promtail:2.5.0
        volumes:
        - ./promtail/promtail.yaml:/etc/promtail/promtail.yaml
        - ./webapp/webapp.log:/var/log/webapp.log
        - ./webapi/webapi.log:/var/log/webapi.log
        command: --config.file=/etc/promtail/promtail.yaml
        depends_on:
        - webapp
        - webapi

    loki:
        image: grafana/loki:2.5.0
        ports:
        - "3100:3100"
        depends_on:
        - promtail

Here we are bind-mounting the Django log files into the Promtail container at /var/log/ so that Promtail can read them.

Voila! The log configuration should now be working. Skip to section 4.5 to install Grafana and check whether the log configuration is working, or verify Loki directly with the sketch below.
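
Here is a minimal sketch of such a direct check against Loki’s HTTP API, using the app label we attached in the Promtail config and assuming Loki is exposed on localhost:3100:

import requests

# Query Loki for recent log lines carrying the label app="webapp"
response = requests.get(
    'http://localhost:3100/loki/api/v1/query_range',
    params={'query': '{app="webapp"}', 'limit': 10},
    timeout=5,
)

# Each stream groups log lines that share the same label set
for stream in response.json()['data']['result']:
    for timestamp, line in stream['values']:
        print(line)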

4.4 Install Jaeger (for tracing)

For tracing to work make sure you have the following dependencies installed.

opentelemetry-sdk==1.10.0
opentelemetry-api==1.10.0
opentelemetry-exporter-jaeger==1.10.0
opentelemetry-instrumentation==0.29b0
opentelemetry-instrumentation-django==0.29b0
opentelemetry-instrumentation-logging==0.29b0
opentelemetry-instrumentation-requests==0.29b0
requests==2.27.1

We will use the requests library to call the webapi service from the webapp. By adding the opentelemetry-instrumentation-requests package, calls made with the requests library will automatically be injected with the context headers required to propagate traces.
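
For instance, the webapp’s /service view might call the webapi with requests like this. This is a hypothetical sketch; the WEBAPI_URL variable and the exact view code are illustrative rather than taken from the repository:

import logging
import os

import requests
from django.shortcuts import render

logger = logging.getLogger(__name__)

def service(request):
    # The instrumented requests call automatically carries the trace context
    # to the webapi, so both services end up in the same trace.
    webapi_url = os.getenv('WEBAPI_URL', 'http://webapi:8001/add')
    response = requests.get(webapi_url, timeout=5)

    logger.info('webapi responded with %s', response.text)
    return render(request, 'app/service.html', {'result': response.text})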

Now go to manage.py and add the following.

# Imports needed at the top of manage.py
import os

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Inside main(), before execute_from_command_line() runs:
    DjangoInstrumentor().instrument()
    LoggingInstrumentor().instrument()
    RequestsInstrumentor().instrument()

    jaeger_exporter = JaegerExporter(
        agent_host_name=os.getenv("TRACING_HOST"),
        agent_port=int(os.getenv("TRACING_PORT")),
    )
    trace.set_tracer_provider(TracerProvider(
        resource=Resource.create({SERVICE_NAME: 'webapp'})
    ))
    span_processor = BatchSpanProcessor(jaeger_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

The collected traces will be exported to TRACING_HOST:TRACING_PORT. But where do these values come from? They are set as environment variables in docker-compose. Modify docker-compose to something like this.

  webapp:
    build: ./webapp
    command: python manage.py runserver 0.0.0.0:8000
    environment:
      - TRACING_HOST=jaeger
      - TRACING_PORT=6831
    ...

  webapi:
    build: ./webapi
    command: python manage.py runserver 0.0.0.0:8001
    environment:
      - TRACING_HOST=jaeger
      - TRACING_PORT=6831
    ...

Now that the environment variables are set up let’s set up the host that will collect the traces. We will use Jaeger for this.

In the docker-compose add configuration for jaeger.

services:
  jaeger:
    image: jaegertracing/all-in-one:1.32
    ports:
      - 16686:16686
      - 6831:6831

Now run all the containers with docker-compose. When requests are made, traces should be collected and stored in Jaeger. Check whether traces are being collected by visiting localhost:16686, or optionally set up Grafana in section 4.5 to view them there.
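
You can also query Jaeger’s HTTP API directly instead of opening the UI. Note that this is the internal API used by the Jaeger UI rather than a stable public one, so treat the sketch below as illustrative:

import requests

# Ask Jaeger for recent traces recorded for the 'webapp' service
response = requests.get(
    'http://localhost:16686/api/traces',
    params={'service': 'webapp', 'limit': 5},
    timeout=5,
)

# Each trace contains the spans collected across webapp and webapi
for trace_data in response.json()['data']:
    print(trace_data['traceID'], len(trace_data['spans']), 'spans')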

By default we don’t need any extra code in our application to trace requests. However, if we want to measure the execution time of a sub-part of a request, we can add snippets like the following.

from django.shortcuts import render
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def service(request):
    with tracer.start_as_current_span("Heavy task") as span:
        # A compute-heavy task whose duration we want to see as its own span
        ...
        context = {}  # build the template context from the task's results

    return render(request, 'app/service.html', context)
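
Within such a span, the OpenTelemetry API also lets you attach attributes and events, which then show up on the span in Jaeger. A small sketch; the attribute and event names here are just examples:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("Heavy task") as span:
    # Attributes become searchable tags on the span in Jaeger
    span.set_attribute("items.processed", 42)
    # Events are timestamped annotations within the span
    span.add_event("halfway done")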

If you remember, in the previous section we added a formatter to the LOGGING config in settings.py, which looked like this.

'formatters': {
    'trace_formatter': {
        'format': '[%(asctime)s] %(levelname)s [%(name)s:%(lineno)s] [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s] [%(funcName)s] %(message)s',  # optional, default is logging.BASIC_FORMAT
        'datefmt': '%Y-%m-%d %H:%M:%S',  # optional, default is '%Y-%m-%d %H:%M:%S'
    },
},

This formatter adds tracing information to the logs so that we can jump to traces right from our logs. We will see how this works in the next section.

4.5 Install Grafana for monitoring

Now that we are collecting everything, it’s finally time to visualize it. We will use Grafana as our central observability system for everything.

First, make a file grafana/dashboard.yaml and put in the content below.

apiVersion: 1   # Don't remove this or you'll suffer

datasources:
  - name: prometheus
    type: prometheus
    uid: my-prometheus
    access: browser
    url: http://prometheus:9090
    isDefault: false
    version: 1
    editable: true
    basicAuth: false

  - name: loki
    type: loki
    uid: my-loki
    access: server
    orgId: 1
    url: http://loki:3100
    basicAuth: false
    isDefault: false
    version: 1
    editable: true
    jsonData:
      derivedFields:
        - datasourceUid: my-jaeger
          name: TraceID
          matcherRegex: trace_id=(\w+)
          url: '$${__value.raw}'
        - name: TraceID
          matcherRegex: trace_id=(\w+)
          url: 'http://localhost:16686/trace/$${__value.raw}'

  - name: jaeger
    type: jaeger
    uid: my-jaeger
    access: browser
    url: http://jaeger:16686
    isDefault: false
    version: 1
    editable: true
    basicAuth: false
    jsonData:
      httpMethod: GET
      tracesToLogs:
        datasourceUid: my-loki
        tags: []
        mappedTags: [{ key: 'service.name', value: 'app' }]
        mapTagNamesEnabled: true

This config defines the datasources Grafana should pull data from. Since we set up three data sources in the previous sections for metrics, logs and traces, this config has three corresponding entries.

The jsonData config under loki defines how to jump from a viewed log to its trace, and the jsonData config under jaeger defines how to jump from a viewed trace to its logs.

Now let’s add Grafana to our list of services in the docker-compose file.

services:
  grafana:
    image: grafana/grafana:8.5.2
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/dashboard.yaml:/etc/grafana/provisioning/datasources/dashboard.yaml
    depends_on:
      - loki
      - prometheus
      - jaeger

Run docker-compose and the Grafana dashboard should be up and running. Go to localhost:3000 to check.

Let’s now see the demo of how everything should work!

5. Demo

If you have followed the tutorial so far, kudos to you! Run your docker-compose file to spin up all the services. If you haven’t, then clone the example repo and run it by following the steps below.

# 1. Clone the repo
$ git clone https://github.com/rafed/opentelemetry-python-grafana-mlt

# 2. Go inside the repo
$ cd opentelemetry-python-grafana-mlt

# 3. Run the apps and services (Docker should be installed)
$ docker-compose up

Now go to localhost:3000 where the Grafana admin panel is served. Log in with username admin and password admin.

[Screenshot: Grafana]

To generate metrics, logs and traces make sure that you load the app several times (localhost:8000/service).

Then go to Explore in the left side panel. At the top of the Explore page you’ll find the datasources loki, jaeger and prometheus. You can select each to explore the logs, traces and metrics, respectively, emitted from the webapp and webapi.

[Screenshot: Grafana Explore]

To view logs, select Loki as a datasource first. Select a log source by label from the Log browser and view logs. You should see something like this.

[Screenshot: Loki logs]

From these logs you can jump to their traces as well (click red marked area).

To view metrics, select Prometheus as a datasource. Make a query (e.g. rate(django_http_requests_latency_seconds_by_view_method_count[1m])) and you should see something like this.

[Screenshot: Prometheus metrics]

Finally, to view traces, select Jaeger as a datasource. Choose a service and run the query. You should see traces like the following.

[Screenshot: Jaeger traces]

When you click a trace_id, the screen will be split to show the actual traces.

[Screenshot: Jaeger trace split view]

You can also jump to logs from traces by clicking on the red marked area. Isn’t this awesome?

6. Conclusion

MLT observability is now a necessity for distributed systems, but actually implementing it can be hard. I hope this article helps anyone who wants to achieve observability in their systems using open source technologies such as Grafana, Opentelemetry, Loki, Prometheus and Jaeger. If you find any problems with this article, feel free to comment below. You can also contact me by email if you face any issues.
