The ELK-Stack Experience

Recently, here at IT:Agenten, we were faced with a web server configured to log multiple domains to a single logfile, which jumbled the analytics and rendered all usage information (page views and unique visitors) unusable.

Since these numbers were really important, we had to dig in and analyze the data “by hand”. This sounded like a good opportunity to get to know that much-praised thing called the ELK stack, i.e. the combination of Elasticsearch, Logstash and Kibana (although we didn’t need Kibana in this case).

Having worked with ELK now, I have to say I’m impressed: it’s tremendously easy to set up, has fantastic APIs to query, and makes getting started extremely simple. The praise it gets is fully justified.

So here’s a quick rundown of how you can make use of the ELK stack in no time and start analyzing Apache Server logs according to different criteria.

The idea is to use Logstash to take our unstructured, flat-file data, structure it, and enrich it with some metadata. Elasticsearch will then act as the data store for the enriched, structured data and will let us query it according to different criteria to make sense of what the logs tell us.

In this demonstration, we will try to find out how many people from Switzerland hit our servers using iPhones in one month. With this quick start you should be able to go ahead and adapt the same procedure to other criteria easily. Sounds good? So let’s start! Please go ahead and install Logstash and Elasticsearch on your machine so we can get going.
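
If you happen to be on macOS with Homebrew, something along these lines may be all it takes (assuming the elasticsearch and logstash formulae are available; otherwise, grab the archives from the Elastic download pages):

$ brew install elasticsearch logstash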

After installing LS and ES, download the Apache logs (access.log.*.gz) to your machine and start configuring Logstash. A simple logstash.conf containing the following will do the job:

input {
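    # read raw log lines from stdin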
    stdin {}
}

filter {
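    # parse each line as an Apache "combined" format log entry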
    grok {
        match => { "message" => "%{COMBINEDAPACHELOG}"}
    }
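    # add GeoIP metadata derived from the client IP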
    geoip {
        source => "clientip"
    }
}

output {
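    # index the structured events directly into Elasticsearch over HTTP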
    elasticsearch {
        protocol => "http"
    }
}

This is all it takes: we feed Logstash from stdin, let it parse the Apache logs in “combined” format, enrich the IP addresses in the requests with GeoIP metadata, and have it write the structured data directly to Elasticsearch.

Start your Elasticsearch instance and let Logstash do its job:

$ zcat access.log.*.gz | logstash -f logstash.conf

To begin with, you might want to reduce the initial data set to the first 10,000 lines or so and start querying ES right away. The full indexing can take some time, depending on the size of your logs.
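
For example, something like the following would index only the first 10,000 lines:

$ zcat access.log.*.gz | head -n 10000 | logstash -f logstash.conf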

Next up, we can start inspecting the data using Elasticsearch queries. These queries are expressed in the Elasticsearch Query DSL and are described here. Elasticsearch offers an HTTP API which you can query using, e.g., curl.
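
For a quick sanity check that documents are arriving, you can count everything and peek at a single hit (Logstash writes to its default logstash-* indices, so we simply search across all indices here):

$ curl -XGET 'localhost:9200/_count?pretty=true'
$ curl -XGET 'localhost:9200/_search?pretty=true&size=1'

The _source of each hit contains the fields extracted by grok (timestamp, clientip, agent, …) plus the geoip block added by the filter.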

For a quick turnaround time, you might want to check out Sense, a Chrome plugin specifically tailored for this use case. It’s really nice, so make sure to check it out!

To quickly recap, the data we are interested in is the following:

All users who hit our servers

  • in August 2015
  • using iPhones
  • from Switzerland

Let’s start crafting our query using the Elasticsearch Query DSL!

First off, we use a bool query, since we have a few criteria that all must match exactly.

If you are using the Sense Chrome plugin, enter the following:

POST _search
{
   "query": {
      "bool": {
         "must": [
            {
               "match": {
                  "timestamp": "Aug/2015"
               }
            }
         ]
      }
   }
}

Otherwise, if you are using curl to query Elasticsearch, you might want to do this instead:

$ curl -XGET localhost:9200/_search -d '
{
   "query": {
      "bool": {
         "must": [
            {
               "match": {
                  "timestamp": "Aug/2015"
               }
            }
         ]
      }
   }
}'

Note that Elasticsearch expects the body of the GET (!) request to /_search to contain the query. Although the spec allows a GET request with a body, not all clients support it, so Elasticsearch will also accept the query body via POST to /_search, as in the Sense example above.

In case you find it cumbersome to issue multiline commands on the command line, you might want to make use of curl’s ability to read the request body from a file instead:

$ curl -XGET localhost:9200/_search -d @query.json

Or go even further, use process substitution (i.e. <(command)), and write YAML instead if that is your thing:

$ curl -XGET localhost:9200/_search?pretty=true -d @<(yaml2json query.yaml)

Note also that you can append pretty=true to get formatted output.

Now that the first of the above-mentioned criteria is in place, let’s go about modelling the rest.

Requests coming from iPhones? Simple enough: the grok filter stores the User-Agent header in the agent field, so we just match on that:

{
   "query": {
      "bool": {
         "must": [
            {
               "match": {
                  "timestamp": "Aug/2015"
               }
            },
            {
                "match": {
                  "agent": "iPhone"
                }
            }
         ]
      }
   }
} 

While inspecting the data you might have noticed that the geoip filter we configured in logstash.conf has enriched the data sets with metadata that wasn’t available in the original server logs, namely something like the following:

"geoip": {
  "ip": "100.200.300.400",
  "country_code2": "MX",
  "country_code3": "MEX",
  "country_name": "Mexico",
  "continent_code": "NA",
  "latitude": 23,
  "longitude": -102,
  "location": [
     -102,
     23
  ]
} 

To wrap up this quick tutorial, let’s put the last missing piece in place, this time in YAML:

---
  query:
    bool:
      must:
        - match:
            timestamp: Aug/2015
        - match:
            agent: iPhone
        - match:
            geoip.country_code2: CH 
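
The number of matching requests is reported in the hits.total field of the search response. If all you need is that number, you can also send the same query to the _count endpoint, for instance reusing the yaml2json trick from above:

$ curl -XGET 'localhost:9200/_count?pretty=true' -d @<(yaml2json query.yaml)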

From here, you should be familiar with the most essential queries and with how to set up the initial import and query workflow. Have a lot of fun inspecting your data and see you around!