Analyzing access logs to find patterns, or “Just how many bots are out there anyway?”: In which we give a gentle introduction to command-line tools.

You find the most interesting stuff in access logs. I spend a fair amount of time digging around in web analytics software: looking at who is referring to us, analyzing traffic patterns, figuring out peak usage times…you get the picture. Sometimes, though, you just need to get right down into the raw logs to figure out what's going on, and that situation happened to me a while back.

Around the middle of last year a security exploit was found in the implementation of the XML-RPC protocol in PHP. PHP is obviously a very popular web scripting language, and dozens of content management systems, including the blogging software I use for this blog, are built on it. An open security hole in something that widely deployed is just an invitation for the underground to try their luck, and try they did.

It started out as just a trickle, but before long it became obvious that the underground was working this hole hard. How'd we know? By analyzing our logs.

An access log is generated by your web server, in our case Apache, and contains a lot of information. For example, here is a random line from this blog.

208.52.128.170 - - [28/Apr/2006:11:07:53 -0700] "GET /wp-content/themes/rdc/style.css HTTP/1.1" 200 7581 "http://www.creative-workflow-hacks.com/" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.2"

The first column is the IP address the request came from. After a couple of placeholder dashes come the date and time, then the request itself (the GET means a GET-style HTTP request, followed by the URL that was asked for and the protocol version), the HTTP status code (200, meaning success), the size of the response in bytes, the referring page, and finally the user agent, in this case Safari. So that's all well and good, but what's that have to do with the security exploit?
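We'll answer that in a second, but since we'll be pulling individual columns out of lines like this later, here's roughly how that sample splits up when a tool counts fields by whitespace (this assumes Apache's standard "combined" log format; your own LogFormat may differ a little):

field 1       208.52.128.170                     client IP address
fields 2-3    - -                                identity and user (usually just dashes)
fields 4-5    [28/Apr/2006:11:07:53 -0700]       date, time and timezone offset
fields 6-8    "GET /wp-content/themes/rdc/style.css HTTP/1.1"    request method, URL and protocol
field 9       200                                HTTP status code
field 10      7581                               response size in bytes
field 11      "http://www.creative-workflow-hacks.com/"          referring page
fields 12+    "Mozilla/5.0 (Macintosh; ..."      user agent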

Well, by looking at your raw access logs, you start to notice patterns. Something like this is what we started to see in our access logs.

209.213.106.25 - - [26/Apr/2006:05:46:19 -0400] "GET /xmlrpc.php HTTP/1.0" 404 1453

and

209.213.106.25 - - [26/Apr/2006:05:46:20 -0400] "GET /drupal/xmlrpc.php HTTP/1.0" 404 1453

Drupal? We're not using Drupal. These were attempts to exploit the security hole, and there were quite a few of them, and they kept coming.

So was this some lone hacker typing into her browser? Not likely. It was bots.

You've probably heard about spyware. In addition to sending you all kinds of weird popups and putting casino shortcuts on your desktop, malware like this can take over your computer and make it do things like send spam email, or, in this case, fire exploit attempts at yet more computers, which then get taken over to send spam and launch more attempts...well, you get the picture.

So by analyzing our logs we start to get a picture of what to watch out for and figure out what software *really* needs to be up to date. Better get over to WordPress and check for patches.

What I want to show today is one little combination of command-line tools I used to analyze our logs while this exploit was going on full tilt. We still see a lot of requests for xmlrpc, as the last laggards still haven't updated their software and the law of diminishing returns hasn't quite kicked in for the bots yet.

We're going to use the command line today (it's really not that scary; you might even end up with 10 different shells open on your desktop one day like a real Unix greybeard...don't laugh, it could happen). It goes without saying that the following will only work if your web server is running Apache and you have Unix shell access to the server where your log files are kept. The first order of business is to find out where those log files live. A common location is /var/log/httpd, or, if you're on a shared host, a logs directory in your home directory. Once you've figured out where your log file is located, let's cd to that directory.
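Something along these lines, assuming the common /var/log/httpd location (swap in your own path, or the logs directory in your home folder if you're on a shared host):

cd /var/log/httpd
ls -lh

The ls -lh is just there to confirm the log file's name and size before we start poking at it.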

Unix tools are designed to do one thing very well and to work in conjunction with other small tools to accomplish the task at hand. Unix tools have a concept called standard input and standard output, which lets each tool do its bit of business and then pass the result on to the next tool. The command we are going to use today is:

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq | wc -l

Scary, huh? But don't freak out. We'll go through each part, and you'll start to understand that it really is easy and that you can do some useful work with this command-line stuff.

So let's start off with the first part. Commands read from the left, like a sentence. cat access_log concatenates, or in this case simply writes out, the access_log to standard output, which here is the terminal in which you typed the command. Your access_log might be named something else, like access.log or somesuch, but you should be able to figure it out. You could also use the tail command if your access_log covers several days and you only want to check out the last bit of traffic.

Use tail -n 100 to grab the last 100 lines. Which brings us to Unix command switches: the -n 100 is a command switch. Most Unix commands let you alter their base behavior with a switch. How do you know which switches to use? You use the man command. Type man tail and see what you get.
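For example, assuming your log file is named access_log:

tail -n 100 access_log
tail -f access_log

The first grabs the last 100 lines; the second, with the -f switch, keeps printing new lines as they arrive, which is handy for watching an attack in real time (Ctrl-C stops it, and q gets you back out of a man page).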

A lot of the stuff in a man page can seem hopelessly geeky at first, but with a little perseverance we can find what we need. So we've written out our access_log file. This is the raw data that comes from the web server. Take a look around; it's pretty interesting even before you've run any analytics on it, but for most folks it's just too overwhelming to find anything useful. Let's take a look at the next part of our command.

cat access_log | grep xmlrpc

The | is the pipe character. We talked earlier about standard input and standard output. The | lets us pipe the output of cat into the input of grep. If we wanted to write out our access_log to a file instead, we would use the > character.

tail -n 100 access_log > ourfile.txt


would write a file called ourfile.txt with the last 100 lines of our access_log.
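One related trick worth knowing: doubling it up as >> appends to the file instead of overwriting it, so you can collect several runs in one place (ourfile.txt is just an example name):

tail -n 100 access_log >> ourfile.txt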

Back to our command. We pipe the output of our access_log to something called grep. What is grep? The name comes from "global regular expression print", and it lets us search our output for patterns called regular expressions. Think of a regular expression as find-and-replace on steroids. Regular expressions can get really complicated, but in our command we're just looking for the literal string xmlrpc. It won't match variations in upper case or with dashes, etc., but it's surprisingly effective here.
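If you do want to catch mixed-case variants like XMLRPC, grep's -i switch makes the match case-insensitive. And since grep can read a file directly, the cat at the front of the big pipeline is really just there for readability; this does the same search on its own:

grep -i xmlrpc access_log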

Let's move on to

cat access_log | grep xmlrpc | awk '{print $1}' 


We now pipe any matches on xmlrpc to awk. awk is a general-purpose language designed for processing text data. I don't use awk a lot, but it is really handy for grabbing a specific column from text output, and that's what we do here. awk '{print $1}' takes only the first column, the IP address field, and prints it to standard output. So with the first three parts of our command, we've printed the IP address of every request to our web server that contained the word xmlrpc. Pretty neat. Let's move on.
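As a quick aside, awk can print more than one column at a time. Using the field numbers from the sample line earlier, this variation would show both the IP address and the HTTP status code for every xmlrpc request:

cat access_log | grep xmlrpc | awk '{print $1, $9}'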

cat access_log | grep xmlrpc | awk '{print $1}' | sort

All of those IP addresses are all over the place, so let's sort them into something useful. sort puts identical addresses next to each other (its default ordering is alphabetical rather than truly numeric, but grouping is all we need here), which matters because the next command, uniq, only collapses adjacent duplicate lines. These bots might hit us once or a bunch of times, but we are only interested in grabbing the unique ones.

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq
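As a side note, sort has a -u switch that drops duplicates in one step, so the following is equivalent to sort | uniq; I've kept the two commands separate above because it makes each step of the pipeline easier to explain:

cat access_log | grep xmlrpc | awk '{print $1}' | sort -u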

If we stop here, we get a sorted list of the unique IPs that requested xmlrpc. If we had a blackhole list we could add these IPs and refuse requests from them the next time an exploit comes along. We do need to be careful with that, though: one of your most important customers might have a virus or spyware and suddenly be unable to reach your site, or your grep search string might catch IPs that are accessing your site legitimately. Use your judgement.
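If you did want to hang onto that list, the > redirect from earlier will save it to a file you could feed into a blocklist later (the filename here is just an example):

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq > xmlrpc_ips.txt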

And finally

cat access_log | grep xmlrpc | awk '{print $1}' | sort | uniq | wc -l

let's use the word count command, wc, with the -l switch to count the number of lines. 137. Wow. During the worst of our xmlrpc weirdness we saw 137 unique IPs trying to exploit our web server. Definitely time to check for patches.
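For comparison, dropping the awk, sort and uniq stages counts every xmlrpc request in the log rather than just the unique addresses, which gives you a feel for how hard each bot was hammering away:

cat access_log | grep xmlrpc | wc -l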

So that's it. This was definitely a long-winded post, but I hope you find it useful, especially if you're new to the command line. Like most things computer-oriented, try to break your big goal down into smaller pieces until you work out the combination that gets you where you need to go.

7 Comments

  1. Dr J said

    Excellent writeup. A good refresher for some and hopefully it opens up the whole new world of the command line to others.

  2. Very informative! I, too, use access logs from my site and the Linux command line in just this way. Always fun to figure out what the game is!

  3. great resource!! rather than downloading our 500MB access_log file, tail-ing the log file with a filter applied got my job finished quicker.

    great job!!
    thanks,
    regards,
    AJi

  4. […] entirely happy with the results (top thirty visitors only), so a little Googling led me to this post about grabbing just the IP addresses from log files (I’m a little rusty on the finer points of grep and the like). After a little reading, I […]

  5. John said

    Great post, sabernar. I took your methods and came up with a nice way of seeing the most common IP addresses in an apache access log:

    cat access_log | awk ‘{print $1}’ | sort | uniq -c | sort -g

  6. John said

    For some reason, my single quotes got transformed into something else in my prior post. Fix the single quotes around the {print $1} and it will work.

  7. Selvakumar Manickam said

    Great pointers for using web log analysis. Botnets are evolving and new techniques are used to carry out HTTP DDoS attacks. The technique highlighted may not be sufficient to detect bot access.
