Track ’em Down!

I’ve been checking the web server logs lately, looking for a way to track down the remote hosts that regularly submit, or try to submit, spam comments in bulk. Grepping the logs by hand is no fun at all, so I wrote a small Bash script to do the dirty work for me. It was written for my own Apache log rotation setup, so it may be totally useless for you.

How It Works For Me

The script is built around the assumption that a normal commenter posts a reasonable number of comments in a 30-day period. Every blog owner has an idea of the comment traffic on their blog. At least on this blog, most people won’t post more than 2-3 comments a month. Even if they fail to fill in all the required fields on the comment form and have to re-submit, which is rare, the number of POST HTTP requests they send does not exceed 10 or 12.

On the other hand, a spammer, or rather a spambot, usually does far more than that. After doing some research myself, I have concluded the following: even if they use dynamic IP addresses, which is true in most cases, the number of POST HTTP requests they send to the wp-comments-post.php file during a 30-day period is far greater than that of even the most regular commenter on this blog. I’m talking about 40-60 or even more POST HTTP requests from the same remote host in a month. This sounds like a spambot to me (correct me if I’m wrong).

The Script

Although this script was written for my own use, it contains some configuration parameters so that it can be customized for different setups. You point it at the directory that contains the rotated Apache log files and it searches all of them for POST HTTP requests. My logs are rotated daily, but recycled monthly. Every day’s log file is also kept in my main Apache log archive (yes, I keep everything). So, this script gives me a report for a whole month.

#! /bin/bash
 
##############################################################
# Configuration BEGIN
####################
# The threshold of POST HTTP requests: a remote host that has
# sent more than this number is included in the report.
# ADJUST ACCORDING TO YOUR BLOG'S COMMENT TRAFFIC
# 20 is just an example!!
MinPostReq=20
# Where to send the report
MailTo="me@example.com"
 
# Which hosts to exclude. Separate each host with an escaped |, eg \|
fexclude="desktop.example.com\|server.example.com"
 
# The directory path that contains the Apache Logs
# NO TRAILING SLASH
path="/path/to/rotated/apache/logs"
 
# The page that accepts the comments. For WordPress, this is the following:
page="/wp-comments-post.php"
# Set it to empty "" to see the remote hosts that have sent POST requests on
# any page of your web site (you will be amazed!)
#page=""
 
################
# Configuration END
#############################################################
 
if [ -z "$fexclude" ]; then
    # If no hosts are excluded, set it to a random value,
    # otherwise grep -v will exclude everything
    fexclude="dfahjgf32eDFSDFaFaD"
fi
 
(
echo -e "Req    Remote Host\n===================================================="
(
for i in "$path"/access_log.*.gz; do
    zcat "$i" | grep '"POST '"$page" | grep -v '^\('"$fexclude"'\)' | awk '{ print $1 }'
done
) | sort | uniq -c | sort -rn \
    | awk '{ if(int($1)>'"$MinPostReq"') { printf(" %-5s %s\n",$1,$2) } }'
) | mail -s "Report Of Potential Spammers" "$MailTo"
 
exit 0
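
To try the counting step without real logs or a mail setup, here is a minimal sketch that runs the same grep/awk/sort/uniq pipeline on a few made-up log lines. The hostnames, timestamps, the low threshold of 2, and the helper names count_posts and sample_log are all invented for illustration; they are not part of the script above.

```shell
#!/bin/bash
# Sketch only: the counting step of the script, fed with invented log lines.

MinPostReq=2                      # deliberately low, just for the demo
page="/wp-comments-post.php"

count_posts() {
    # Reads access-log lines on stdin and prints "count host" for every
    # host that sent more than $MinPostReq POST requests to $page.
    grep -F "\"POST $page" \
        | awk '{ print $1 }' \
        | sort | uniq -c | sort -rn \
        | awk -v min="$MinPostReq" '$1 > min { printf(" %-5s %s\n", $1, $2) }'
}

sample_log() {
    # Hypothetical log lines: one noisy host, one ordinary reader.
    local i
    for i in 1 2 3 4; do
        echo "bot.example.net - - [01/Jan/2006:10:00:0$i +0200] \"POST /wp-comments-post.php HTTP/1.1\" 200 512"
    done
    echo 'reader.example.org - - [01/Jan/2006:11:00:00 +0200] "POST /wp-comments-post.php HTTP/1.1" 302 0'
    echo 'reader.example.org - - [01/Jan/2006:11:05:00 +0200] "GET /index.php HTTP/1.0" 200 1024'
}

sample_log | count_posts
```

With this sample input only bot.example.net should be listed, with a count of 4, because the reader’s single POST stays below the threshold and the GET request is ignored.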

Configuration

Here is some info about the configuration options:

  • MinPostReq : The threshold of POST HTTP requests. A remote host that has sent more than this number of requests to the /wp-comments-post.php file is included in the report. This is the most critical option. Adjust it according to your blog’s comment submission traffic. For example, if the most regular commenter posts 5 comments/month, set this a lot higher, eg 20 or more. A well-chosen value reduces the number of innocent hosts that end up in the report.
  • MailTo : The email address the report is sent to.
  • fexclude : Which hosts to exclude. For example, you can put all the hosts from which you administer WordPress. Make sure you separate them with an escaped | , eg: \|
  • path : the path to the directory that contains the rotated apache logs. All log files of the form: access_log.#.gz will be searched.
  • page : The WordPress page where the comments are submitted to. This is the /wp-comments-post.php file. You can set this option to nothing in order to see which hosts have sent POST HTTP requests and the total number of these requests. You will be amazed with the result.
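
The effect of the fexclude option can be checked in isolation. The following sketch applies the same grep -v expression the script uses to three made-up log lines; it assumes GNU grep, where \| means alternation in a basic regular expression. The helper name filter_excluded and the host unknown.example.net are invented for this example.

```shell
#!/bin/bash
# Sketch only: how the fexclude pattern filters log lines (GNU grep assumed).

fexclude="desktop.example.com\|server.example.com"

filter_excluded() {
    # Drops every line that begins with one of the excluded hosts.
    grep -v '^\('"$fexclude"'\)'
}

printf '%s\n' \
    'desktop.example.com - - "POST /wp-comments-post.php HTTP/1.1" 200 1' \
    'server.example.com - - "POST /wp-comments-post.php HTTP/1.1" 200 1' \
    'unknown.example.net - - "POST /wp-comments-post.php HTTP/1.1" 200 1' \
    | filter_excluded
```

Only the unknown.example.net line should survive. Note that the pattern is anchored at the start of the line only and that a dot matches any character, so a host whose name merely begins with an excluded name would be dropped as well.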

The Report

This script sends a report to your email address. It lists the remote hosts that have sent POST HTTP requests to /wp-comments-post.php, along with the number of requests from each.

These MAY NOT be all spammers. Some of them could be readers who have submitted a lot of comments relevant to your blog. This report is just an overview of the POST HTTP requests towards WordPress and it will take further investigation to determine if a remote host has actually spammed you or tried to.

Read The Following Section At Least 10 Times

THIS REPORT IS NOT A LIST OF SPAMMERS.
You have been warned! Make sure you don’t accuse your innocent readers/commenters!! This will be your own mistake and not this script’s or my fault.

By using the above script or part of it, you explicitly accept the above statement.

What You Can Do

If, after your own investigation, you are 100% sure that a remote host has spammed you, you can use any whois service to find the remote host’s internet provider and send them the relevant parts of your apache log files, the contents of the comments and any other information you have collected that proves that you were spammed from this address at a specific time. And remember: be polite. These people have no other relation to the spammer, apart from the fact that the latter is their client.
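
Gathering the relevant log lines for one host can be scripted too. The sketch below is a hypothetical helper, collect_requests, that is not part of the original script; it assumes the same access_log.#.gz naming and that the remote host is the first field of each log line.

```shell
#!/bin/bash
# Sketch only: print every logged request made by a single remote host.
# collect_requests is a hypothetical helper, not part of the original script.

collect_requests() {
    local host="$1" logdir="$2" f
    for f in "$logdir"/access_log.*.gz; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        # gzip -dc is equivalent to zcat; keep lines whose first field
        # is exactly the given host.
        gzip -dc "$f" | awk -v h="$host" '$1 == h'
    done
}

# Example call (host and path are placeholders):
# collect_requests "bot.example.net" "/path/to/rotated/apache/logs" > evidence.txt
```

Matching the first field exactly, rather than grepping for the name anywhere in the line, keeps requests from similarly named hosts out of what you send to the provider.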

What If It Does Not Work?

Consider this script as a note you have seen. This is no release! I do not even care if it works or not. I’ve already stated that all this may be totally useless for you or for your setup. If this is the case, write a script for your own setup.

Track ’em Down! by George Notaras is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright © 2006 - Some Rights Reserved


About George Notaras

George Notaras is the editor of the G-Loaded Journal, a technical blog about Free and Open-Source Software. George, among other things, is an enthusiast self-taught GNU/Linux system administrator. He has created this web site to share the IT knowledge and experience he has gained over the years with other people. George primarily uses CentOS and Fedora. He has also developed some open-source software projects in his spare time.