How to Prevent Comment Spam & In-Depth Analysis

Receiving thousands of spam submissions per month on a website is enough of an impetus to push most webmasters into developing the ultimate spam filter. After carefully analyzing the patterns of thousands of spam comments, I have now built a very effective filter. This page shares some of my detailed comment spam statistics along with suggestions for prevention.

What is comment spam / blog spam?

You've undoubtedly come across a website or blog that is littered with hundreds of garbage comments and links. Invariably, you'll see this on pages that have been online for over a year without much activity. While some webmasters choose to engage in a daily manual fight (i.e. delete!), many others simply give up, as it is just too much of a hassle to do every day. In some cases, the site owner may not even be aware of the scourge taking place!

As this site hosts hundreds of content pages, each allowing readers to add their own comments, it too is a juicy target for spam attacks. The graphs below show just how much spam is sent to ImpulseAdventure every day via the comment forms. In fact, 96.5% of all entries are spam! Thankfully, you don't have to see any of it, as I have automated many unique tricks to filter the fluff out before it gets to me. Deleting 50+ comments a day by hand is simply not practical!

[Graphs: Spams per Day and Spams per Hour]

How Comment Spamming Works

While I have never had access to any blog spamming software, I do have a pretty good idea about how they must operate after analyzing thousands of spam entries. The programs typically range in cost from $100 to $300, offering many very advanced and "clever" features. The methods are continually being revised to keep ahead of developments in the counter-measures deployed by blog software and anti-spam plugin developers.

The majority of these programs take advantage of tricks such as randomized proxy servers, user agents and referer strings. A good percentage also inject random delays between harvesting a comment page and the actual submission of the comment. Altogether, these techniques make the prevention of blog spam much harder.

The following is a brief overview of how a spammer may target a site:

  • Harvest web pages
    The spammer searches for a web page containing a comment form. Thanks to Google, this can be done easily. Typically, a search for certain keywords and a series of dates will generate a good starting point. Within each of these "harvested" pages, the comment form is identified.

    Searching for "leave a reply" on Google returns about 25 million sites, "leave a comment" (72 million), "post a comment" (207 million).
  • Select suitable victim sites
    Select websites with high page rank and a lot of visitor traffic. Ideally, pages will be selected that have been dormant for a while (i.e. stale for a year). In this manner, acts of spamming may go unnoticed by the site operator as they focus their attention on more recent articles. Alternatively, a lazy spammer may simply pay $100 to purchase a "quality" list of up to 10,000 vulnerable sites. The test for suitability may also include a check for rel="nofollow" links, which would discourage spamming.
  • Identify comment form fields
    An educated guess is made about the purposes of each comment form field (e.g. name, link/url, email, comments). In most cases, the NAME parameter of the FORM INPUT tag provides an accurate description, making the spammer's job very easy!

    Given an assumption about the purpose of each field, the spammer now has the difficult task of trying to assign values to each field. One or more of these fields will be used to contain the main link spam, while the others will be set in such a way as to look reasonable (i.e. fake names, fake email addresses, etc.). Making a wrong choice about which field(s) to use is probably the easiest way for a site to detect spam content. A spammer may try several times before getting it right.
  • Identify the POST destination
    The most important part of the crawling / harvesting process is in identifying the web page address that actually receives the form submission. This is nearly always visible in the FORM tag, such as: <FORM method="post" action="/submit.php"> (where "/submit.php" would be the destination page). The key point to realize here is that once the destination (receiving) page has been identified, the spammer can simply send an HTTP POST request directly to this page, completely bypassing your comment form page!

    Making this process even easier for the bad guys is the prevalence of many off-the-shelf blogging programs. In most cases, the blogging software comes configured with default destination page URLs (such as "wp-comments-post.php" for WordPress, or "mt-comments.cgi" for Movable Type). Even worse is the fact that the form fields almost always have a predictable meaning. This commonality makes an attack extremely easy.

    Oftentimes, spammers will simply try going directly for a submission page (e.g. somedomain.com/blog/wp-comments-post.php) without even trying to harvest your comment form page! That said, most of the time they will use a program to harvest your comment form to determine the actual destination URL (a minimal sketch of such a direct POST appears after this list). Nonetheless, if you are reliant on a standard blogging program, you may be able to reduce spam exposure slightly by renaming your destination page (known as security through obscurity).
  • Optionally wait for a cool-off period.
    While 44.8% of the spams I receive are sent as soon as a page is harvested, 55.2% are actually delayed. In other words, there may be a significant time delay between when the original comment form is harvested and when the spam is sent directly to the destination URL. I have observed delays of up to 75 days between harvest and submission. Presumably, this offers the spammer an advantage in hiding their tracks, as the access log files (showing the original visit) are often deleted after a period of time.
  • Use a Proxy / Bot for anonymity and anti-blacklisting
    Since many website operators simply ban IP addresses that are responsible for generating bad submissions, a spammer who relied on only a single address would quickly lose the ability to post. This is where proxies, zombies and bots come to the rescue. A proxy allows someone to route a request through the web via an "anonymous" IP address, and there are hundreds of proxy servers available to use for free. Hopping from proxy to proxy not only makes IP blacklisting less effective, but also shields the spammer from investigation should the owner decide to track the person down.
  • Send POST request to destination
    One or more proxies or spambots are instructed to compose and send HTTP POST requests to the destination page.
  • Adjust the field content
    Optionally, the result of the form submission is captured and analyzed. This information can then be used to modify the set of fields for a re-submission attempt. If too much information is given to the spammer following a bad entry, they may be more inclined to adjust their next attempt accordingly. This is one reason I suggest responding to the first obvious spam attempt with immediate blacklisting.
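
As noted above, the direct-POST step requires remarkably little machinery. The following is a minimal PHP sketch of the kind of request a spambot composes; the destination URL is a placeholder, and the field names are assumptions based on a default WordPress install. Real spamming software layers proxies, randomized headers and delays on top of this.

    <?php
    // Sketch: an HTTP POST sent straight to a known destination page,
    // bypassing the comment form entirely. Field names assume a default
    // WordPress install; the URL is a placeholder.
    $fields = array(
        'author'  => 'Fake Name',
        'email'   => 'fake@example.com',
        'url'     => 'http://spam-link.example.com/',
        'comment' => 'Great post! Visit my site...',
    );
    $ch = curl_init('http://somedomain.com/blog/wp-comments-post.php');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);   // response can be analyzed for "re-tuning"
    curl_close($ch);
    ?>

Note that nothing in this request inherently distinguishes it from a submission made by a real browser, which is exactly why the content- and sequence-based checks described later are needed.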

Typical Approaches to Preventing Blog Spam

There are numerous techniques and tools that a website designer might use to block blog spam, but most of the methods will trade off effectiveness against hassles for either the reader or website operator. The key to great spam filtering is in producing a high degree of effectiveness (no false positives or false negatives) while still making it easy for readers (no registration) and the operator (no daily deletions).

Other methods are often used as well, and will be the subject of an upcoming article here:

  • CAPTCHAs
  • Moderation
  • User Registration
  • Blacklisting (e.g. Bad Behavior)
  • Turing Tests
  • Distributed methods (e.g. Akismet)
  • rel="nofollow" links

Focus of this Article

This page describes techniques for a full-custom approach, which is a highly effective option for those with some programming experience (e.g. PHP). But the concepts also apply to those who want to understand the various plug-ins available for pre-built blogging programs. Custom solutions have the most potential to be effective -- most automated spamming programs are designed around the weaknesses of the most prevalent website modules, so a custom solution falls outside their reach.

Comment Spam Statistics

The following statistics are based on what my custom detection code observes here on ImpulseAdventure.com. At the time of writing, I get roughly 3000 visits a day. For obvious reasons, the data is only captured at random times so that it will be less useful to the few bad apples out there who may be reading this!

P.S. On that note, any spammer reading this should recognize that I use moderation as a backup for my automated methods. Therefore, no spam will ever reach a live page, and hence no benefit will ever be gained.

Frequency
    Spams per day: 46.9
Email Injection
    Newlines in single-line field: 0.0%
    Email headers: 0.0%
Client Scripting
    JavaScript code: 0.6%
Sequencing
    Duplicate: 1.4%
    Delayed harvest: 26.5%
    IP mismatch: 31.4%
    Harvested total: 55.2%
Fixed Fields
    Empty: 5.8%
    Invalid content: 29.5%
User Constrained
    Field bad format: 51.4%
    Field bad content: 21.4%
    Empty: 5.2%
User Freeform
    Spam keywords: 33.5%
    Excessive URLs: 71.1%
    Special detect #1: 65.2%
    Special detect #2: 72.7%
Pass / Fail Summary
    Pass: 3.3%
    Fail: 96.5%
    Blacklist (new): 51.5%
    Blacklist (already listed): 45.0%
Filter Effectiveness Summary
    False negatives: 0.2%
    False positives: 0.2%
    Total effectiveness: 99.7%

Types of Fields

  • Fixed -

    Field content that is forced by the submit form (i.e. via an <input type="hidden">), which can't be edited by the user. This content may be dynamically generated (e.g. product number, page ID, action type to perform, etc.).

    Since the text or number is not user-editable, any changes to these values (outside of the expected ranges) can be immediately flagged as a forged submission.
  • User Constrained -

    Field content that the user can edit (i.e. via an <input type="text"> or <textarea>), but that has some specific constraints on expected valid content. Examples of user constrained fields include email addresses, URLs, telephone numbers or other numeric values.

    Identifying spam content through the constraints expected in these fields is possible, but one must allow for some genuine mistakes or variations.
  • User Freeform -

    Field content that the user can edit, and that doesn't have any strict constraints in content. This type of content is much harder to verify as there are no obvious restrictions on what the user can enter.

    Identifying spam content through these fields can be prone to false positives if careful analysis has not been done.
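
To make these three categories concrete, here is a minimal sketch with one check per field type. The field names, ranges and patterns are hypothetical examples, not my actual filter:

    <?php
    // Fixed field: a hidden page ID must fall within the expected range;
    // anything outside it marks a forged submission (high severity).
    function check_fixed_page_id($pageId) {
        return ctype_digit($pageId) && (int)$pageId >= 1 && (int)$pageId <= 500;
    }

    // User constrained field: an email address must follow a known format,
    // but allow for genuine typos before rejecting outright (low severity).
    function check_email_format($email) {
        return (bool)preg_match('/^[^@\s]+@[^@\s]+\.[a-z]{2,}$/i', $email);
    }

    // User freeform field: only score it (here, by counting URLs); a
    // single freeform indicator should rarely reject an entry on its own.
    function count_urls($text) {
        return preg_match_all('#https?://#i', $text, $matches);
    }
    ?>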

Indicators of Spam

High Severity Indicators

The following patterns are indicative of spam with a high degree of certainty. In nearly all cases, these patterns are impossible to enter by someone actually typing information into your form.

  • Hidden / required fields missing - Likely due to using old harvested data. Observed in 5.8% of entries.
  • Hidden / required fields modified out of range (29.5%)
  • Single-line user fields containing newlines - Malicious form hijacking. In PHP:
    preg_match("/(%0A|%0D|\n+|\r+)/i", $text);
  • Multi-line user fields containing elements of MIME email headers - High probability of malicious form hijacking. The PHP source code to do this is:
    preg_match("/(%0A|%0D|\n+|\r+)(content-type:|to:|cc:|bcc:)/i", $text);
  • Significant cool-off time between harvest and spam. Most users don't need more than an hour to compose a message, so the 26.5% of entries with extended delays are flagged as bad. One can measure this by using a time-sequence field to encode the form generation time and then comparing it against the POST submission time (a sketch appears after this list). I have seen delays of over 50 days in some cases!
  • Mismatch between IP on form generation and posting (seen in 31.4% of entries). In combination with a cool-off time after harvesting, the spammer will often also use a different proxy to send the actual form POST. No real user is going to have a different IP address between these two events -- NOTE: AOL users may need verification against the X-Forwarded-For HTTP header, as this ISP rotates its users across proxies constantly!
  • Special patterns. After careful analysis, I have created two additional pattern detectors that identify nearly all link spam that I see today. These patterns are based on observing methods that are used to help links appear on a variety of blogs. As these checks are extremely important for my overall filter effectiveness (flagging 65.2% and 72.7% of all entries), I'm not sure that I should publish the details of the mechanism here.
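
Both the cool-off and IP-mismatch checks above rely on embedding state into the form itself. The following is a minimal sketch of one way to do this with a signed hidden field; the secret key, field layout and one-hour threshold are assumptions rather than my exact implementation:

    <?php
    // Sketch: encode the form-generation time and client IP into a hidden
    // field, signed so the spammer cannot forge or refresh it.
    // $secret is a hypothetical long random server-side key.
    function make_token($secret) {
        $data = time() . '|' . $_SERVER['REMOTE_ADDR'];
        return $data . '|' . hash_hmac('sha256', $data, $secret);
    }
    // Emit as: <input type="hidden" name="tok" value="...">

    function check_token($tok, $secret, $maxAge = 3600) {
        list($ts, $ip, $sig) = array_pad(explode('|', $tok), 3, '');
        if ($sig !== hash_hmac('sha256', $ts . '|' . $ip, $secret))
            return 'forged';            // token was tampered with
        if (time() - (int)$ts > $maxAge)
            return 'delayed harvest';   // cool-off period exceeded
        if ($ip !== $_SERVER['REMOTE_ADDR'])
            return 'ip mismatch';       // POSTed from a different address
        return 'ok';
    }
    ?>

Because the token is signed, a spammer cannot simply refresh the timestamp or substitute a proxy's IP address without invalidating the signature.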

Low Severity Indicators

The following characteristics are typical in spam attacks, but could also be due to genuine user content or mistakes. One must be careful in how these are treated, as an overly tough approach will increase the likelihood of false positives.

  • Identify the acceptable patterns for each of your user fields. It is better if you can somehow influence the user's entry to contain an identifying trait that you can use to validate the input. 51.4% of entries contain questionable content of this type.
  • Multiple submissions without reloading the source page. While harvesting attacks often resubmit the same FORM POST (1.4%), it is possible that a user genuinely wanted to go back and resubmit their entry. Multiple submissions can be detected through the use of one-time tokens included in a hidden field (see the sketch after this list).
  • Evidence of keywords or key phrases in the message body. I only suggest using a very minimal list (e.g. certain pharmaceuticals), as this method is not very effective given the constant changes seen in phrases every few days. Even so, it is able to capture 33.5%.
  • Inappropriate use of field content. If a field asks for a phone number and a URL is submitted, there is a good chance that a human did not read your page! 21.4% of entries are marked in this way.
  • Link spamming often involves the submission of a huge number of links (seen in 71.1%). I have encountered as many as 170 URLs in a single entry! Clearly, the more links an entry has, the higher the probability of it advertising something worthless. The graphs below should give you an idea of the distribution of link counts that I tend to see versus what spams contain.

    [Graphs: Links per Spam Entry and Links per Legitimate Entry]
    NOTE: Entries with 0 links flagged as spam were triggered by content in a non-primary comment field.
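
As mentioned above, one-time tokens make duplicate submissions easy to detect. A minimal sketch using PHP sessions follows; the field and session names are illustrative only:

    <?php
    session_start();

    // When generating the form: issue a fresh nonce and remember it.
    function issue_nonce() {
        $nonce = md5(uniqid(mt_rand(), true));
        $_SESSION['comment_nonce'] = $nonce;
        return $nonce;   // emit as <input type="hidden" name="nonce" value="...">
    }

    // When receiving the POST: a nonce is valid exactly once, so a
    // resubmission without reloading the form page fails this check.
    function consume_nonce($posted) {
        $ok = isset($_SESSION['comment_nonce'])
            && $posted === $_SESSION['comment_nonce'];
        unset($_SESSION['comment_nonce']);
        return $ok;
    }
    ?>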

Other Suggestions

The goal should always be to identify and filter spam with virtually no false positives (you don't ever want to throw away valid content!).

  • Field names are often used by spamming software to make educated guesses about the suitability of a particular field for spam injection. Consider scrambling the field names -- this makes it very likely that one of the more constrained / restrictive fields (e.g. email or name) will be used for link spamming, which the filter can then easily detect (a sketch appears after this list).
  • One strike, you're out! Don't give the spammer an opportunity to retune their attack. Most spammers don't use that many IP addresses concurrently, so IP blocking (for periods of time) can actually be an effective remedy (takes care of 45.0%).
  • Don't give information about detected violations unless you can block effectively. Don't challenge the spammer! Maybe you shouldn't write an article about it, for that matter, either!
  • Be mindful about the run-time processing required to filter entries as you need to consider the possibility of a Denial of Service attack. Perform your most effective lightweight checks first, and only if they pass should you continue on to use more CPU-intensive checks (such as database lookups).
  • If the content you expect in your submissions does not overlap with topics that many spammers target (certain prescriptions, adult content, etc.), then you may want to use a limited form of keyword blacklisting.
  • If you are hit very frequently by a particular host, you may want to consider automatically inserting an IP Deny into the .htaccess. One particular spammer hit me 177 times in a matter of 2 days, and they have continued frequently since. By adding an IP Deny to the .htaccess you can save your host the processing otherwise required. For example, if the following line were added to your root .htaccess, any hits from this address would be blocked:

    deny from 69.46.24.44
  • When looking for patterns, toss out whitespace, certain tags and filler text, and make comparisons case-insensitive whenever possible. You want the checks to be as robust as possible, despite some very creative formatting by the spammers.
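
The field-name scrambling idea from the first suggestion above takes only a few lines. The following sketch stores the mapping in the session; the naming scheme is illustrative only:

    <?php
    session_start();

    // When generating the form: pick a random alias for each real field
    // and remember the mapping for this visitor.
    function scramble_fields($realNames) {
        $map = array();
        foreach ($realNames as $real) {
            $map[$real] = 'f_' . md5(uniqid($real, true));
        }
        $_SESSION['field_map'] = $map;
        return $map;   // use $map['email'] etc. as the NAME= attributes
    }

    // When receiving the POST: translate the aliases back. A bot that
    // guessed "email" or "url" out of habit now misses every field.
    function unscramble_post($post) {
        $out = array();
        foreach ($_SESSION['field_map'] as $real => $alias) {
            $out[$real] = isset($post[$alias]) ? $post[$alias] : null;
        }
        return $out;
    }
    ?>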

Don't challenge the Spammer!

One should be very careful about the level of feedback provided on the submission page. If the result is negative, the sender may either give up or choose to refine their approach. I have definitely seen evidence of refinement, so keep this in mind when you are trying to give a helpful error message!

That said, there are times when negative feedback is appropriate -- if the user posted something that may have accidentally triggered a low-severity threshold, you may choose to tell them to retry. On the other hand, cases that are highly likely to be invalid should get tossed without remorse.

Cut off the source: Prevent the Harvesting

Consider hiding your comment forms from bots that have been flagged as harvesters in the past. As a good amount of spam originates from a single harvest, this method can curb a great deal of it. In fact, enabling this mechanism instantly cut more than 70% of all spam here. Note that the IP address used for harvesting is often never used for the actual spamming -- therefore, you'll want to identify the original harvester to accomplish this, not the originator of the spam.
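
The following is a sketch of the idea, with a simple flat-file lookup standing in for whatever store your harvester-detection logic maintains; the file name and fallback message are placeholders:

    <?php
    // Withhold the real comment form from IPs previously flagged as
    // harvesters. harvester-ips.txt is a hypothetical list populated
    // by your own harvest-detection logic.
    function is_flagged_harvester($ip) {
        $flagged = @file('harvester-ips.txt',
                         FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        return $flagged && in_array($ip, $flagged);
    }

    if (is_flagged_harvester($_SERVER['REMOTE_ADDR'])) {
        echo '<p>Comments are temporarily unavailable.</p>';
    } else {
        // ...render the normal comment form here...
    }
    ?>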

Auditing your Filter

Tracking False Positives

For lower-certainty spam (e.g. with keyword blacklisting), it may be desirable to capture some of the flagged entries into a database so that you can analyze how effective your mechanism has been. If you detect good content getting tossed, you may then be in a position to re-approve it and/or adjust your detection mechanism. For high-certainty spam, you might want to capture a small portion into a DB or simply discard it altogether.
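
A minimal sketch of such a capture, assuming a PDO connection and a hypothetical flagged_entries table:

    <?php
    // Sketch: log a flagged entry for later auditing. The table layout
    // is hypothetical; only lower-certainty spam need be kept in full.
    function log_flagged_entry($pdo, $reason, $severity, $fields) {
        $stmt = $pdo->prepare(
            'INSERT INTO flagged_entries (ts, ip, reason, severity, content)
             VALUES (NOW(), ?, ?, ?, ?)');
        $stmt->execute(array(
            $_SERVER['REMOTE_ADDR'],
            $reason,
            $severity,
            serialize($fields),   // raw fields kept for possible re-approval
        ));
    }
    ?>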

In my case, my only false positive so far has been due to America Online's use of a proxy system that constantly changes IP addresses between page loads. I'm currently modifying my system to handle this type of scenario.

False Negatives

Analyze the content of spam that gets through so that you can adjust your detection mechanism further. Over time you should have a pretty solid idea about the tactics used by most auto-generated entries.

Upcoming Topics

  • Harvest to Spam Delay
  • Multiple attempts: multiple submits per harvest, multiple proxies per spam
  • Manual vs automated entries
  • Dangers of Spam
  • SQL Injection
  • Form Hijacking
  • Other tricks to combat

 


Reader's Comments:

Please leave your comments or suggestions below!
2013-09-23 Pandy
 That sounds similar to Sblam antispam filter: http://sblam.com/en.html

http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fsblam.com%2Ftechniki.html

It uses naive Bayes to learn keywords in text and URLs, plus distributed blacklists like Project Honeypot (that's actually clever - hidden URLs that only bots see).

It also looks at time post was sent, assuming people don't post in the middle of the night (although timezones...)

and checks for link syntax, e.g. BBCode in forms that don't allow it.
 Thanks Pandy -- looks like an interesting alternative.
2009-08-16 Poul
 Hi, I know the post is a little old but I still find it very interesting and relevant. Read through all of it, and kinda makes me want to try to develop a spam filter for WordPress myself. However, there is no need, because I currently use SpamTask and it works like a charm. No mistakes so far, and I had to manually remove 4-5 spam comments a month when I had Akismet installed. So well.. Blog spam might be ultimately defeated with SpamTask (at least at the moment, since new spam methods will be found in time). :)
Great post!
2009-07-05 nirvana
 I want to know how a blogger can avoid having his/her IP banned from WP
 It may depend on what triggered the banning in the first place, but if the result was simply that a specific IP address was banned (not a very good protection measure), then one could easily circumvent this by masking the IP address (using anonymizers or proxies).
2008-01-02 Bruce Wright
 Hello,

I am a front-end developer, and one of the problems that has reared its ugly head is comment spam. In your paragraph on Spam Patterns you stated "...I'm not sure that I should publish the details of the mechanism here." Is there a way to get that mechanism? We would like to cut down, if not stop, the spam that is hitting our customers' "Contact Us" page.

This is a good article,
Bruce Wright
CSS/HTML Web Developer
Dominion Enterprises
 Hi there Bruce --

Now that I have left my anti-spam setup unmodified for quite a few months, I have had a reasonable opportunity to compare my custom anti-spam solution against the Akismet system. My system has a lower false-positive rate but they are very similar in performance. Given the relatively high success rate in filtering comments via Akismet, I would strongly suggest starting with that system (all you need to do is get a free API key from their website). Modifying your contact form to interface with Akismet is easy if you have access to PHP -- if you need any help implementing this, let me know and I'll guide you through it. As for my custom methods, I'll send you some info via email.
 
2007-11-23 
 FYI, in Opera, when the "advanced->network->enable referrer logging" option is unchecked, no pictures are displayed on your site. I first thought it was a bug on your server, but then I noticed that the pictures appear in Explorer/Firefox.
 Thanks... you're right. I frequently got hit with image hotlinking (people posting my high-res photos/articles on web forums, blogs, etc.), which has meant that I was using Apache's mod_rewrite to check that the referrer is local. While this solved much of the bandwidth issue, it unfortunately has the downside that some firewalls and referrer-hiding browser configurations will not display images correctly. If there were a less restrictive alternative, I would certainly be open to it. In the meantime, I'll disable the workaround and see if I can leave it off. Thx.

 

