Guide for Self-Hosted Countly Clients: Setting Up Monitoring and Alerts


Setting up monitoring and alert systems is essential for the performance and reliability of self-hosted Countly environments. This guide will help you implement these systems, allowing you to detect and resolve issues.

By following the below steps, you can detect and address issues before they impact your application performance, and manage server resources efficiently.

The article outlines the necessary system and software requirements, key metrics to monitor, and the advantages of a good monitoring and alert setup. The final section will cover the setup of Grafana and Prometheus for thorough monitoring and alerts.

Monitoring & Alert Tools

  • Prometheus - Recommended for collecting and storing metrics.
  • Grafana - Used for visualizing metrics and creating dashboards.

Prometheus and Grafana are used together to set up alerts that notify you when any critical metric crosses its threshold.

Environment Setup and Tools Required

  1. To run both Prometheus and Grafana effectively on a self-hosted server, ensure your system has a minimum of a 2-core CPU and 4 GB of memory to handle the combined load of both tools.
  2. Grafana is compatible with multiple operating systems, including Debian, Ubuntu, Red Hat, RHEL, Fedora, SUSE, openSUSE, macOS, and Windows. However, for optimal stability and support, we recommend using Linux, specifically Ubuntu or CentOS. Detailed system requirements can be found here.
  3. Your setup should include Nginx to function as a reverse proxy, which is essential for managing web traffic efficiently.

Benefits of Using Monitoring and Alerts

  • Identify and resolve issues before they affect users.
  • Continuously monitor and optimize application performance.
  • Manage server resources efficiently to prevent bottlenecks.
  • Ensure a smooth and reliable user experience by addressing performance and stability issues promptly.

Key Metrics to Monitor

The metrics and thresholds monitored to ensure that the Countly service and its dependencies are working properly include:

CPU Usage

An alert should be set to send a warning when processor utilization reaches 100%. This warning is prone to false positives caused by sudden, short-lived peaks; however, if the 100% warning repeats for a sustained period, there may be a real problem that requires intervention.

Countly Uptime

The Countly service should be checked to be in the “Active” state. You can also check how long the Countly service has been Active to see whether it was restarted recently. Alerts should be set to send a warning if the Countly service is not Active or if a restart is detected. This information can be obtained from the “countly status” command or from the “systemctl status countly” output, e.g.:

countly status | grep Active | awk -F'since' '{print $2}' | awk -F';' '{print $1}'

The command shows when the Countly service became Active. If the difference between the current time and this timestamp (or the operating-system uptime recorded in “/proc/uptime”) is less than a specified time, it can be inferred that a restart or reboot has occurred.
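As a sketch of that inference, the snippet below parses a “since” timestamp and compares it against the current time. The sample line stands in for live `systemctl status countly` output, and the fixed “now”, the 300-second threshold, and the use of GNU `date` are illustrative assumptions:

```shell
# Sketch: infer a recent restart from the "since" timestamp in
# `systemctl status countly` output. The sample line stands in for
# live output; THRESHOLD and the fixed "now" are illustrative values.
sample='   Active: active (running) since Mon 2024-07-01 10:00:00 UTC; 2min ago'

# Same extraction as the command above:
since=$(echo "$sample" | grep Active | awk -F'since' '{print $2}' | awk -F';' '{print $1}')

active_at=$(date -u -d "$since" +%s)                  # GNU date
now=$(date -u -d 'Mon 2024-07-01 10:02:00 UTC' +%s)   # stand-in for `date +%s`
elapsed=$((now - active_at))

THRESHOLD=300   # seconds; tune to your alerting needs
if [ "$elapsed" -lt "$THRESHOLD" ]; then
  echo "WARNING: countly became active ${elapsed}s ago (possible restart)"
fi
```

In a real check you would replace the sample line with live `systemctl` output and the fixed “now” with `date +%s`.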

Countly Service

The Countly API should be checked to be reachable by sending a request to the http://<domain_name>/o/ping endpoint, and an alert should be set to send a warning if it is unreachable. Reachability can be verified via the HTTP status code: if the request to the endpoint returns “200 OK”, the service is reachable. It is useful to set some parameters when sending requests to this endpoint. When the servers are under load or at a momentary peak, the request may fail due to timeouts even though the Countly service is available. For this reason, the recommended timeout value for this check is 30-60 seconds.
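A minimal reachability probe along those lines is sketched below, assuming `curl` is available; the helper name is illustrative, and the domain placeholder must be replaced with your own:

```shell
# Sketch: probe the Countly ping endpoint and alert on anything other
# than HTTP 200. The 30-second timeout follows the recommendation above.
check_ping() {   # check_ping URL TIMEOUT_SECONDS
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time "$2" "$1" 2>/dev/null)
  if [ "$code" = "200" ]; then
    echo "OK: Countly API reachable"
  else
    echo "ALERT: /o/ping returned ${code:-000}"
  fi
}

# Usage (replace the placeholder with your domain):
# check_ping "http://<domain_name>/o/ping" 30
```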

Data Disk

The occupancy rate of the disks where MongoDB data is written and stored should be checked. If the data disks are full, MongoDB services will stop and Countly will become unreachable.

Although it varies with disk sizes and how fast data is processed, you can decide at which thresholds you will receive warnings by evaluating factors such as how quickly you can intervene. At Countly, on medium-density standalone servers holding MongoDB data, we send a warning when the remaining disk space drops to 14 GB and a critical warning when it drops to 9 GB. You may have a much larger setup and heavier data input, so we recommend setting this threshold based on the data density you observe as your applications go live and on your response time to disk alerts.
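The threshold logic can be sketched as a small helper. The 14 GB / 9 GB values mirror the text above; the helper name and the `/data` mount point in the usage note are assumptions:

```shell
# Sketch: classify remaining data-disk space against the thresholds
# mentioned above (warning at 14 GB, critical at 9 GB).
classify_data_disk() {   # classify_data_disk FREE_GB
  if [ "$1" -le 9 ]; then
    echo "CRITICAL"
  elif [ "$1" -le 14 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Live usage against an assumed /data mount (GNU df):
# classify_data_disk "$(df -BG --output=avail /data | tail -1 | tr -dc '0-9')"
```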

Memory

The memory used by the servers should be checked: available memory should trigger a warning at the 5% threshold and a critical warning at the 2% threshold. As with CPU usage, this metric is very prone to false positives during sudden spikes. These false positives can be prevented with conditions such as how long available memory has been below the threshold, or how many times the same warning has fired, before an alert is sent.
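One way to suppress those false positives, sketched below, is to require several consecutive breaches before alerting. The 5% threshold comes from the text; the three-check requirement and the helper name are illustrative choices:

```shell
# Sketch: only warn once available memory has been below 5% for three
# consecutive checks, filtering out momentary spikes.
consecutive=0
check_mem() {   # check_mem AVAILABLE_PERCENT
  if [ "$1" -lt 5 ]; then
    consecutive=$((consecutive + 1))
  else
    consecutive=0
  fi
  if [ "$consecutive" -ge 3 ]; then
    echo "WARNING: available memory below 5% for $consecutive checks"
  fi
}

# A brief spike does not alert; a sustained dip does:
check_mem 3; check_mem 20               # spike, then recovery: no output
check_mem 4; check_mem 4; check_mem 4   # sustained: warns on the third check
```

In practice the checks would be driven by a scheduler (e.g. cron) feeding in the live available-memory percentage.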

MongoDB Process

It should be checked that the MongoDB service (mongod) is present among the processes running on the operating system; if Countly cannot communicate with the database, it will become unreachable. An alert should be set to send a warning if there is no “mongod” entry among the active processes.
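A process check along those lines can be sketched with `pgrep` (from procps), assuming it is installed; the helper name is illustrative:

```shell
# Sketch: alert when a named service process is absent from the process
# list; used here for mongod, per the section above.
check_proc() {   # check_proc PROCESS_NAME
  if pgrep -x "$1" > /dev/null 2>&1; then
    echo "OK: $1 is running"
  else
    echo "ALERT: $1 not found in active processes"
  fi
}

# Usage:
# check_proc mongod
```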

Root Disk

The occupancy rates of the servers’ root disks should be checked. Due to runaway log entries and similar issues, root disks may fill up even when the server is otherwise idle, and a full root disk can prevent even the simplest operations on the server. The root disk should be set to send a warning when the remaining disk space reaches the 9% threshold and a critical warning at the 4% threshold.

Servers where metrics will be applied

  Metric            Countly server   MongoDB server
  CPU Usage         ✓                ✓
  Countly Uptime    ✓                -
  Countly Service   ✓                -
  Data Disk         -                ✓
  Memory            ✓                ✓
  MongoDB Process   -                ✓
  Root Disk         ✓                ✓

 

How Monitoring is Done

Countly’s monitoring setup relies on a combination of Prometheus and Grafana to provide comprehensive insights about your server usage.

Prometheus acts as the data collection engine, periodically pulling metrics from our target systems, such as the API server and client server. These metrics capture critical performance indicators and health data. The collected metrics are then stored for later analysis.

Grafana serves as the visualization layer, connecting to Prometheus to query and display the stored metrics in a user-friendly format. By creating various graphs and dashboards, we can effectively monitor system performance, identify anomalies, and gain valuable insights into overall system health.

This monitoring stack enables us to proactively identify and address potential issues, ensuring optimal system performance and reliability.

Setting Up Grafana and Prometheus

Monitoring

Prometheus is used for monitoring, and the data it collects can be visualized in Grafana. Follow the steps below to set up Prometheus and Grafana.

  1. Run install_prometheus_exporters.sh on each monitored instance to set up the exporters.
  2. Run install_blackbox_exporter.sh on the Prometheus instance to install the blackbox exporter.
  3. Add the monitored instances to Prometheus by defining them in /etc/prometheus/prometheus.yml.

  File                              File Link
  install_prometheus_exporters.sh   Link
  install_blackbox_exporter.sh      Link

After running the installation scripts, add the configuration below to /etc/prometheus/prometheus.yml on the monitoring instance.


  scrape_configs:
    - job_name: "metrics"
      static_configs:
        - targets: ["FQDN:9100", "FQDN:9216"]
          labels:
            instance: "instance-name"
            instance_type: "standalone"
            instance_project: "prod"
    - job_name: "blackbox_exporter"
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets: ["FQDN/o/ping"]
          labels:
            instance: "instance-name"
            instance_type: "standalone"
            instance_project: "prod"
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: localhost:9115

Once the instance is set up, you should see two dashboards on Grafana.

  1. Countly Server Application - which shows the usage by the Countly application.
  2. MongoDB - which shows usage by MongoDB.

Import the dashboard JSON files below to make the dashboards functional.

  File                                File Link
  dashboard_countly_monitoring.json   Link
  dashboard_mongodb_overview.json     Link

Setting up Alerts

Alerts let you handle issues and resource usage on your server efficiently. To set up alerts for the metrics defined below, use the following steps.

Data Disk

To set up alerts for the data disk, use the query below, which returns the remaining space in GB.

node_filesystem_avail_bytes{job="metrics",mountpoint="/data",fstype!="rootfs",instance_type=~"standalone|mongodb",instance_project="prod"} / 1024 / 1024 / 1024

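To turn a query like this into an actual Prometheus alert, a rules file along these lines could be used. This is a sketch: the group and alert names, `for:` durations, and severity labels are assumptions, while the 14 GB and 9 GB thresholds follow the Data Disk section above.

```yaml
groups:
  - name: countly-data-disk
    rules:
      - alert: DataDiskSpaceLow
        expr: node_filesystem_avail_bytes{job="metrics",mountpoint="/data",fstype!="rootfs"} / 1024 / 1024 / 1024 < 14
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data disk below 14 GB on {{ $labels.instance }}"
      - alert: DataDiskSpaceCritical
        expr: node_filesystem_avail_bytes{job="metrics",mountpoint="/data",fstype!="rootfs"} / 1024 / 1024 / 1024 < 9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Data disk below 9 GB on {{ $labels.instance }}"
```

The `for: 5m` clause delays firing until the condition has held for five minutes, which helps filter the momentary spikes discussed earlier.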

Root Disk

To set up alerts for the root disk, use the query below, which returns the remaining space in GB.

node_filesystem_avail_bytes{job="metrics",mountpoint="/",fstype!="rootfs"} / 1024 / 1024 / 1024


Memory

To set up alerts for memory, use the query below, which returns available memory as a percentage of total memory.

(node_memory_MemAvailable_bytes{job="metrics"} * 100) / node_memory_MemTotal_bytes{job="metrics"}


Countly Service

To set up alerts for the Countly service state, use the query below.

node_systemd_unit_state{name="countly.service",state="active",instance_type=~"standalone|countly"}


MongoDB Service

To set up alerts for the MongoDB service, use the following query.

node_systemd_unit_state{name="mongod.service",state="active",instance_type=~"standalone|mongodb"}


CPU Usage

To set up alerts for CPU usage, use the following query, which returns non-idle CPU utilization as a percentage.

(sum by(instance,instance_project) (irate(node_cpu_seconds_total{job="metrics", mode!="idle"}[$__rate_interval])) / on(instance,instance_project) group_left sum by (instance,instance_project)((irate(node_cpu_seconds_total{job="metrics"}[$__rate_interval])))) * 100


Countly API Endpoint Response Time

To set up alerts for Countly API endpoint response time, use the following query, which returns the combined probe durations in milliseconds.

(sum(probe_http_duration_seconds{phase="connect",instance_type=~"standalone|countly"}) + sum(probe_http_duration_seconds{phase="processing",instance_type=~"standalone|countly"}) + sum(probe_http_duration_seconds{phase="resolve",instance_type=~"standalone|countly"}) + sum(probe_http_duration_seconds{phase="tls",instance_type=~"standalone|countly"}) + sum(probe_http_duration_seconds{phase="transfer",instance_type=~"standalone|countly"}) + sum(probe_dns_lookup_time_seconds{instance_type=~"standalone|countly"}) + sum(probe_duration_seconds{instance_type=~"standalone|countly"})) * 1000


Countly API Endpoint Response Code

To set up alerts for the Countly API endpoint response code, use the following query.

probe_http_status_code{instance_type=~"standalone|countly"}


After completing all the steps above, monitoring and alerts for your server have been successfully set up.

Validating your setup

If the steps above were implemented correctly, you should see graphs and dials containing data in the dashboards. This confirms that your setup was successful.
