Analyzing Web Traffic

I have another website at dev.kielthecoder.com that isn’t used for much. But I was curious who might be visiting it and where they come from. I’m not doing any cookies or session tracking, so I only have the server log files to go off of. I want to demonstrate some UNIX commands that can be used to gather information.

Access Logs

NGINX stores its access logs in a very common format, called "combined". By default, it looks like this:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"
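To see how those variables line up with a real entry, here is a minimal sketch that labels the pieces of one log line. It assumes the log uses the stock combined format; the function name label_fields and the sample address are made up for illustration. Splitting on the double-quote character keeps the request, referer, and user agent intact even though they contain spaces.

```shell
#!/bin/sh
# label_fields: hypothetical helper that prints the parts of each
# combined-format log line (read from standard input) on labeled rows.
label_fields() {
    awk -F'"' '{
        # $1 is everything before the request: addr - user [time]
        split($1, pre, " ")
        # $3 sits between the request and the referer: " status bytes "
        split($3, mid, " ")
        print "remote_addr: " pre[1]
        print "remote_user: " pre[3]
        print "request:     " $2
        print "status:      " mid[1]
        print "bytes:       " mid[2]
        print "referer:     " $4
        print "user_agent:  " $6
    }'
}

# Sample entry with a documentation-range address (not from my logs)
printf '%s\n' '192.0.2.1 - - [30/Jun/2021:00:05:44 -0400] "GET / HTTP/1.1" 301 185 "-" "-"' | label_fields
```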

There’s also a fair amount of log retention. I have 15 access logs saved (2 – 14 are also compressed). It looks like they rotate every day. If I look at yesterday’s log file I can see (I’m masking the IP addresses since they aren’t from me):

$ head -1 /var/log/nginx/access.log.1
x.x.x.x - - [30/Jun/2021:00:05:44 -0400] "GET / HTTP/1.1" 301 185 "-" "-"

$ tail -1 /var/log/nginx/access.log.1
x.x.x.x - - [30/Jun/2021:23:58:11 -0400] "HEAD /epa/scripts/win/nsepa_setup.exe HTTP/1.1" 404 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"

We can use the head command to view just the first line of yesterday’s file (access.log.1) and tail to view just the last line. If I want to look at older, compressed logs, I can pass them through zcat first like this:

$ zcat /var/log/nginx/access.log.14.gz | head -5
x.x.x.x - - [17/Jun/2021:00:03:57 -0400] "GET /plugin.php?id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&page=13 HTTP/1.1" 301 185 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

x.x.x.x - - [17/Jun/2021:00:03:58 -0400] "GET /plugin.php?id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&id=xhuaian_makefriends:main&page=13 HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

x.x.x.x - - [17/Jun/2021:00:06:20 -0400] "GET /robots.txt HTTP/1.1" 301 185 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

x.x.x.x - - [17/Jun/2021:00:06:21 -0400] "GET /robots.txt HTTP/1.1" 404 143 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

x.x.x.x - - [17/Jun/2021:00:14:13 -0400] "GET /robots.txt HTTP/1.1" 301 185 "-" "Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

Bots! Hey, if you’ve got a public web server, you’re going to get hit by lots of bots. The first two entries identify themselves as Googlebot, but the request is for a plugin.php page that doesn’t exist here, so it looks more like something probing for a vulnerable PHP application than a WordPress attack (Sorry, evil bot! Nothing like that is installed). The next three entries look like a well-behaved bot simply crawling my site.
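One quick way to see which bots are the noisiest is to tally the user-agent strings. The sketch below assumes the combined log format, where the user agent is the last quoted field; top_agents is a name I made up for this example.

```shell
#!/bin/sh
# top_agents: hypothetical helper that counts how often each user agent
# appears in combined-format log lines read from standard input.
top_agents() {
    # Splitting on double quotes makes the user agent field number 6
    awk -F'"' '{ print $6 }' | sort | uniq -c | sort -rn
}

# Usage with a compressed log:
#   zcat /var/log/nginx/access.log.14.gz | top_agents | head -5
```

Keep in mind that the user agent is whatever the client chooses to send, so this shows who the visitors say they are, not who they really are.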

Now that we can see what each line looks like, what can we do with them?

Most Requested URL

What if we want to figure out which URL is the most requested? This is where we can make good use of languages like Perl or Awk that specialize in working with text. Let’s start with this small program and name it urls.pl. It will tell us who is looking for robots.txt:

#!/usr/bin/perl

while (<>) {
   chomp;
   # Escape the dot so it matches a literal "." and not any character
   if (/robots\.txt/) {
      print "$_\n";
   }
}

This program reads from standard input (or from any files named on the command line) and prints each line that contains the string robots.txt. Make it executable (chmod +x urls.pl), then we can see some of the visiting bots by typing:

$ zcat /var/log/nginx/access.log.2.gz | ./urls.pl
18 results (mostly Google and Bing)
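For a fixed string like this, grep can do the same job without a script. A minimal sketch (the function names are mine, not part of any tool): the -F option treats the pattern as a literal string, so the dot in robots.txt is not interpreted as a wildcard.

```shell
#!/bin/sh
# robots_hits: hypothetical helper that prints log lines mentioning robots.txt.
# -F matches the pattern literally, so "." means a real dot.
robots_hits() {
    grep -F 'robots.txt'
}

# robots_count: same filter, but only reports how many lines matched.
robots_count() {
    grep -c -F 'robots.txt'
}

# Usage:
#   zcat /var/log/nginx/access.log.2.gz | robots_hits
#   zcat /var/log/nginx/access.log.2.gz | robots_count
```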

I want to use regular expressions to pull out each component on each line of the log file. I found online tools like regex101 really useful for writing and debugging the matching rules:

#!/usr/bin/perl

while (<>) {
    chomp;
    # Capture each field; skip any line that doesn't fit the expected format
    my ($remote_addr, $ident, $remote_user, $datetime,
        $request, $status, $body_bytes, $referer, $user_agent) =
        /^(\S+) (\S+) (\S+) \[([^]]+)\] "(.*)" (\d+) (\d+) "(.+)" "(.+)"$/
        or next;

    print $remote_addr, "|", $datetime, "|", $request, "|", $status, "|",
        $referer, "|", $user_agent, "\n";
}

This will assign each of the matches to a variable, like $request. Now we get results like this:

$ zcat /var/log/nginx/access.log.4.gz | ./urls.pl | head -5
x.x.x.x|27/Jun/2021:00:24:22 -0400|GET / HTTP/1.1|301|-|python-requests/2.24.0

x.x.x.x|27/Jun/2021:00:30:19 -0400|GET / HTTP/1.1|200|-|Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.90 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

x.x.x.x|27/Jun/2021:00:38:24 -0400|POST /boaform/admin/formLogin HTTP/1.1|301|http://45.79.94.20:80/admin/login.asp|Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0

x.x.x.x|27/Jun/2021:00:38:24 -0400||400|-|-

x.x.x.x|27/Jun/2021:00:57:54 -0400|POST /boaform/admin/formLogin HTTP/1.1|301|http://45.79.94.20:80/admin/login.asp|Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0

Just constantly getting hammered by bots! Instead of printing out each line, let’s keep track of each unique URL by stuffing it into a hash. Then at the end, we can sort based on the count for each one:

#!/usr/bin/perl

my %urls;   # request line => number of times seen

while (<>) {
    chomp;
    # Skip any line that doesn't fit the expected format
    my ($remote_addr, $ident, $remote_user, $datetime,
        $request, $status, $body_bytes, $referer, $user_agent) =
        /^(\S+) (\S+) (\S+) \[([^]]+)\] "(.*)" (\d+) (\d+) "(.+)" "(.+)"$/
        or next;

    $urls{$request}++;
}

# Sort requests by count, highest first
my @keys = sort { $urls{$b} <=> $urls{$a} } keys(%urls);
for (@keys) {
    printf "%4d  %s\n", $urls{$_}, $_;
}

Each URL request is tallied, then we sort by which has the most requests. Now, we can ask for the top 10 requests from a couple days ago:

$ zcat /var/log/nginx/access.log.3.gz | ./urls.pl | head -10
  59  GET / HTTP/1.1
  14  GET /.env HTTP/1.1
  13  POST / HTTP/1.1
  11  GET /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1
  11  GET /resume.html HTTP/1.1
  10  GET /robots.txt HTTP/1.1
   8  GET /favicon.ico HTTP/1.1
   8
   6  GET /assets/img/facebook.png HTTP/1.1
   6  GET /assets/css/monokai.css HTTP/1.1

Nice! OK, what if we want to get the top 10 across all the saved access logs? First, I’m going to dump them all into a file (that I’ll remove later):

$ cat /var/log/nginx/access.log{,.1} > access-temp
$ zcat /var/log/nginx/access.log.{2..14}.gz >> access-temp
$ wc -l access-temp
5359 access-temp

5,359 log entries to sort through?! Can our little script handle it? What are the top 10 most popular URLs going to be?

$ ./urls.pl < access-temp | head -10
1209  GET /phpmyadmin/ HTTP/1.1
1079  GET / HTTP/1.1
 206  GET /robots.txt HTTP/1.1
 178  GET /.env HTTP/1.1
 153  POST / HTTP/1.1
  96  GET /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1
  88  GET /resume.html HTTP/1.1
  71  GET /favicon.ico HTTP/1.1
  61
  57  GET /_ignition/execute-solution HTTP/1.1

I’m not surprised someone trying to attack phpMyAdmin is number 1, but I’m happy to see my resume still makes the top 10!
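For comparison, the same tally can be sketched with standard tools alone, and without the temporary file. This is only an approximation of urls.pl: it assumes the combined log format and takes whatever sits between the first pair of double quotes as the request. The function names stream_logs and top_requests are made up for this example.

```shell
#!/bin/sh
# stream_logs: hypothetical helper that writes every named log to standard
# output, decompressing the ones that end in .gz along the way.
stream_logs() {
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;
            *)    cat "$f" ;;
        esac
    done
}

# top_requests: count each distinct request line and list the busiest first.
# The optional argument is how many rows to show (default 10).
top_requests() {
    awk -F'"' '{ print $2 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage:
#   stream_logs /var/log/nginx/access.log* | top_requests 10
```

Streaming the files this way means there is nothing to clean up afterwards, at the cost of decompressing the logs again on every run.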

Most Visits

I’m also curious to know who visits my site the most (besides me). I want to use a similar regular expression to break apart each log entry, but this time I want to count how many requests they’ve made and which pages were requested the most. Let’s use C# this time:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace CountVisits
{
   class Program
   {
      static void Main(string[] args)
      {
         var re = new Regex(@"(\S+) (\S+) (\S+) \[([^]]+)\] ""(.*)"" (\d+) (\d+) ""(.+)"" ""(.+)""");
			
         var visits = new Dictionary<string, int>();
         var urls = new Dictionary<string, Dictionary<string, int>>();

         foreach (var arg in args)
         {
            if (File.Exists(arg))
            {
               using (var reader = new StreamReader(File.OpenRead(arg)))
               {
                  while (!reader.EndOfStream)
                  {
                     var text = reader.ReadLine();
                     var matches = re.Matches(text);

                     if (matches.Count > 0)
                     {
                        var remoteAddress = matches[0].Groups[1].Value;
                        var identity = matches[0].Groups[2].Value;
                        var remoteUser = matches[0].Groups[3].Value;
                        var dateTime = matches[0].Groups[4].Value;
                        var request = matches[0].Groups[5].Value;
                        var status = matches[0].Groups[6].Value;
                        var bodyBytes = matches[0].Groups[7].Value;
                        var referer = matches[0].Groups[8].Value;
                        var userAgent = matches[0].Groups[9].Value;

                        if (!visits.ContainsKey(remoteAddress))
                        {
                           visits.Add(remoteAddress, 0);
                        }

                        visits[remoteAddress]++;

                        if (!urls.ContainsKey(remoteAddress))
                        {
                           urls[remoteAddress] = new Dictionary<string, int>();
                        }

                        if (!urls[remoteAddress].ContainsKey(request))
                        {
                           urls[remoteAddress].Add(request, 0);
                        }

                        urls[remoteAddress][request]++;
                     }
                  }
               }

            }
            else
            {
               Console.WriteLine("File does not exist: {0}", arg);
            }
         }

         // Report once, after every file has been read
         var sortedVisits = new List<KeyValuePair<string, int>>(visits);
         sortedVisits.Sort((a, b) => b.Value.CompareTo(a.Value));

         // Top 20 visitors (or fewer, if the logs contain fewer addresses)
         for (int i = 0; i < Math.Min(20, sortedVisits.Count); i++)
         {
            Console.WriteLine("{0} {1}", sortedVisits[i].Value, sortedVisits[i].Key);

            var sortedUrls = new List<KeyValuePair<string, int>>(urls[sortedVisits[i].Key]);
            sortedUrls.Sort((a, b) => b.Value.CompareTo(a.Value));

            // Top 5 requests for this visitor
            for (int j = 0; j < Math.Min(5, sortedUrls.Count); j++)
            {
               Console.WriteLine("\t{0} {1}", sortedUrls[j].Value, sortedUrls[j].Key);
            }
         }
      }
   }
}

You can see that, compared to Perl, the C# code is a bit longer but is still pretty compact. Most of the work is being handled by the regular expression matching on line 26 (the call to re.Matches). After that, we’re just counting occurrences. When we print the results, I limit the list to the 20 busiest visitors and show up to five of each visitor’s most requested URLs, in order. Here are the top 5 visitors (mostly bots):

$ dotnet run -- ../access-temp
1205 116.1.201.38
   1205 GET /phpmyadmin/ HTTP/1.1
591 45.146.165.123
   72 GET /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1
   55 GET /wp-content/plugins/wp-file-manager/readme.txt HTTP/1.1
   54 GET /index.php?s=/Index/\x5Cthink\x5Capp/invokefunction&function=call_user_func_array&vars[0]=md5&vars[1][]=HelloThinkPHP21 HTTP/1.1
   54 GET /?XDEBUG_SESSION_START=phpstorm HTTP/1.1
   53 GET /console/ HTTP/1.1
255 119.29.99.56
   1 GET /robots.txt HTTP/1.1
   1 GET /Admin/Common/HelpLinks.xml HTTP/1.1
   1 GET /API/DW/Dwplugin/TemplateManage/login_site.htm HTTP/1.1
   1 GET /API/DW/Dwplugin/SystemLabel/SiteConfig.htm HTTP/1.1
   1 GET /Admin/Login.aspx HTTP/1.1
189 75.67.4.135
   19 GET /favicon.ico HTTP/1.1
   15 GET / HTTP/1.1
   14 GET /assets/css/styles.css HTTP/1.1
   10 GET /assets/css/monokai.css HTTP/1.1
   9 GET /assets/js/app.js HTTP/1.1
59 77.46.59.28
   10 GET / HTTP/1.1
   4
   2 GET //site/wp-includes/wlwmanifest.xml HTTP/1.1
   2 GET //wp2/wp-includes/wlwmanifest.xml HTTP/1.1
   2 GET //test/wp-includes/wlwmanifest.xml HTTP/1.1
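If all you need is a quick look from the command line, the first column of each log line is enough to count visits per address. A rough shell sketch, again assuming the combined format; top_visitors and requests_from are hypothetical names, and unlike the C# program this needs one extra command per address to get the URL breakdown.

```shell
#!/bin/sh
# top_visitors: hypothetical helper that counts requests per client address.
# The address is the first space-separated field of each log line.
# The optional argument is how many rows to show (default 5).
top_visitors() {
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# requests_from: list the five most common requests made by one address.
# Takes the address as its only argument and reads log lines on stdin.
requests_from() {
    awk -v ip="$1" '$1 == ip' | awk -F'"' '{ print $2 }' |
        sort | uniq -c | sort -rn | head -n 5
}

# Usage:
#   top_visitors 5 < access-temp
#   requests_from 192.0.2.1 < access-temp
```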

Better Analytics

If we wanted to dig deeper into who’s visiting our pages and why, it would probably require storing cookies on the visitor’s machine. WordPress gives me tons of analytics about who visits this blog, but that’s probably because it uses cookies and a full database to track users. My static website doesn’t have, or need, those things.
