Dynamic Monitoring

At Vinli, we love data. And it's a good thing since we get a lot of data from our devices. Millions and millions of messages a day, in fact.

Since launching in October, we've worked hard to stay ahead of our incredible customer demand and scale our platform to handle the ingestion of this data from our devices, as well as hundreds of millions of external requests every week through our platform APIs. Running a platform of this size means having insight into every part of it. At a glance, we need to be able to see the overall health of the platform and recognize potential issues long before the pager goes off.

We use New Relic to monitor things at the application level, but for pure granularity it's pretty tough to beat the data available from an ever-present, built-in SNMP stack (available in most base package repos). Cacti and MRTG have been around for years; ask any tenured sysadmin or network engineer and they'll probably admit to a love-hate relationship with them, but most likely still use one of them regardless.

In theory, Cacti or MRTG would be the best tool for the job, but the complexity for us was that each host must be manually added and configured in the Cacti interface before it starts polling data. In a dynamic environment such as ours, where servers can come and go from minute to minute, this requires far too much manual interaction. For a hot minute, we considered writing a simple REST API that would write straight to the Cacti database whenever a server was bootstrapped. This would have been relatively easy to do over the private network, but it still didn't feel right.

We decided to come at it from another direction. Instead of a central monitoring server (that could go away at any time) polling each host for its data, what if each server polled itself and published its own data? Better yet, the collector service, Grafana, and a clustered InfluxDB can live in an existing Kubernetes cluster!

It turns out this was much easier than we initially thought it would be.

We wrote a small Go process that runs as a cron job every minute and POSTs JSON in the following structure to a simple collector. The collector does nothing more than authenticate the incoming request and transform the JSON into InfluxDB line protocol. This way, the InfluxDB service doesn't need to be open to the public, and we don't have to poke holes in access lists to allow UDP SNMP traffic between servers.

```json
{
  "data": [
    {
      "points": [
        {
          "fields": {
            "value": "2.98"
          },
          "measurement": "load_average"
        }
      ],
      "tags": {
        "hostname": "kube-minion.aws-us-west-2.ip-10-100-201-16",
        "ip": "10.100.201.16",
        "sample": "1_min",
        "server_sub_type": "minion",
        "server_type": "kubernetes"
      }
    }
    // More...
  ]
}
```
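
For the example above, the line the collector ends up writing to InfluxDB looks something like this (the timestamp is omitted, so InfluxDB assigns one at write time, and the field is written unquoted on the assumption that it's numeric):

```
load_average,hostname=kube-minion.aws-us-west-2.ip-10-100-201-16,ip=10.100.201.16,sample=1_min,server_sub_type=minion,server_type=kubernetes value=2.98
```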
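
To give a sense of the agent side, here's a minimal sketch of what a per-host reporter could look like. The collector URL, auth header, and static tags are hypothetical placeholders, and the real process reports more than just load average; this only shows the shape of the thing.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strings"
)

// Mirrors the JSON payload structure shown above.
type Point struct {
	Measurement string            `json:"measurement"`
	Fields      map[string]string `json:"fields"`
}

type Series struct {
	Points []Point           `json:"points"`
	Tags   map[string]string `json:"tags"`
}

type Payload struct {
	Data []Series `json:"data"`
}

func main() {
	// 1-minute load average is the first field of /proc/loadavg.
	raw, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		log.Fatal(err)
	}
	load1 := strings.Fields(string(raw))[0]

	hostname, _ := os.Hostname()

	payload := Payload{Data: []Series{{
		Points: []Point{{
			Measurement: "load_average",
			Fields:      map[string]string{"value": load1},
		}},
		Tags: map[string]string{
			"hostname":        hostname,
			"sample":          "1_min",
			"server_sub_type": "minion",     // hypothetical static tag
			"server_type":     "kubernetes", // hypothetical static tag
		},
	}}}

	body, _ := json.Marshal(payload)

	// Hypothetical collector endpoint and shared-secret token.
	req, _ := http.NewRequest("POST", "http://collector.internal/metrics", bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("COLLECTOR_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("collector responded %s", resp.Status)
}
```

Running something like this from cron once a minute keeps the agent itself stateless; if a report fails, the next minute's run simply tries again.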

With this method, we're able to have beautiful system-level graphs that update dynamically and are added automatically as each new VM is bootstrapped.

![Load Average Graph](https://s3.amazonaws.com/vinli-assets/dev-blog/monitoring/load_average.png)

![Memory Graph](https://s3.amazonaws.com/vinli-assets/dev-blog/monitoring/memory.png)

![Traffic Graph](https://s3.amazonaws.com/vinli-assets/dev-blog/monitoring/traffic.png)