How to Prevent Comment Spam & In-Depth Analysis
Receiving thousands of spam entry submissions per month on a website is enough of an impetus to challenge most webmasters into developing the ultimate spam filter. After carefully analyzing the patterns of thousands of comment spams, I have now built a very effective filter. This page shares some of my detailed comment spam statistics and suggestions for prevention.
What is comment spam / blog spam?
You've undoubtedly come across a website or blog that is littered with hundreds of garbage comments and links. Invariably, you'll see this on pages that have been online for over a year without much activity. While some webmasters choose to engage in a daily manual fight (i.e. delete!), many others simply give up as it is just too much of a hassle to do every day. In some cases, the site owner may not even be aware of the scourge taking place!
As this site contains hundreds of content pages online, each allowing readers to add their own comments, it too is a juicy target for spam attacks. The graphs below show just how much spam is sent to ImpulseAdventure every day via the comment forms. In fact, 96.5% of all entries are spam! Thankfully, you don't have to see any of it, as I have automated many unique tricks in filtering the fluff out before it gets to me. Deleting 50+ comments a day by hand is simply not practical!
[Graphs: Spams Per Day | Spams Per Hour]
How Comment Spamming Works
While I have never had access to any blog spamming software, I do have a pretty good idea about how they must operate after analyzing thousands of spam entries. The programs typically range in cost from $100 to $300, offering many very advanced and "clever" features. The methods are continually being revised to keep ahead of developments in the counter-measures deployed by blog software and anti-spam plugin developers.
The majority of these programs take advantage of tricks such as randomized proxy servers, user agents and referer strings. A good percentage also inject random delays between harvesting a comment page and the actual submission of the comment. Altogether, these techniques make the prevention of blog spam much harder.
The following is a brief overview of how a spammer may target a site:
- Harvest web pages
The spammer searches for a web page containing a comment form. Thanks to Google, this can be done easily. Typically, a search for certain keywords and a series of dates will generate a good starting point. Within each of these "harvested" pages, the comment form is identified.
Searching for "leave a reply" on Google returns about 25 million sites, "leave a comment" (72 million), "post a comment" (207 million).
- Select suitable victim sites
Select websites with high page rank and a lot of visitor traffic. Ideally, pages will be selected that have been dormant for a while (i.e. stale for a year). In this manner, acts of spamming may go unnoticed by the site operator as they focus their attention on more recent articles. Alternatively, a lazy spammer may simply pay $100 to purchase a "quality" list of up to 10,000 vulnerable sites. The test for suitability may also include a check for rel="nofollow" links, which would discourage spamming.
- Identify comment form fields
An educated guess is made about the purposes of each comment form field (e.g. name, link/url, email, comments). In most cases, the NAME parameter of the FORM INPUT tag provides an accurate description, making the spammer's job very easy!
Given an assumption about the purpose of each field, the spammer now has the difficult task of trying to assign values to each field. One or more of these fields will be used to contain the main link spam, while the others will be set in such a way as to look reasonable (i.e. fake names, fake email addresses, etc.). Making a wrong choice about which field(s) to use is probably the easiest way for a site to detect spam content. A spammer may try several times before getting it right.
- Identify the POST destination
The most important part of the crawling / harvesting process is in identifying the web page address that actually receives the form submission. This is nearly always visible in the FORM tag, such as: <FORM method="post" action="/submit.php"> (where "/submit.php" would be the destination page). The key point to realize here is that once the destination (receiving) page has been identified, the spammer can simply send an HTTP POST request directly to this page, completely bypassing your comment form page!
Making this process even easier for the bad guys is the prevalence of many off-the-shelf blogging programs. In most cases, the blogging software comes configured with default destination page URLs (such as "wp-comments-post.php" for WordPress, or "mt-comments.cgi" for Movable Type). Even worse is the fact that the form fields almost always have a predictable meaning. This commonality makes an attack extremely easy.
Often times, spammers will simply try going directly for a submission page (e.g. somedomain.com/blog/wp-comments-post.php) without even trying to harvest your comment form page! That said, most of the time they will use a program to harvest your comment form to determine the actual destination URL. Nonetheless, if you are reliant on a standard blogging program, you may be able to reduce spam exposure slightly by renaming your destination page (known as security through obscurity).
- Optionally wait for a cool-off period.
While 44.8% of the spams I receive are sent as soon as a page is harvested, 55.2% are actually delayed. In other words, there may be a significant time delay between when the original comment form is harvested and when the spam is sent directly to the destination URL. I have observed delays of up to 75 days between harvest and the submission. Presumably, this offers the person an advantage in hiding their tracks, as often the access log files (showing the original visit) are deleted after a period of time.
- Use a Proxy / Bot for anonymity and anti-blacklisting
Since many website operators simply ban IP addresses that are responsible for generating bad submissions, a spammer would quickly lose their ability to post if they relied on only a single address. This is where proxies, zombies and bots come to their rescue. A proxy allows someone to generate a request through the web via an "anonymous" IP address. There are hundreds of proxy servers available to use for free. Hopping from proxy to proxy not only makes IP blacklisting less effective, but also shields the spammer from investigation should the owner decide to track the person down.
- Send POST request to destination
One or more proxies or spambots are instructed to compose and send HTTP POST requests to the destination page.
- Adjust the field content
Optionally, the result of the form submission is captured and analyzed. This information can then be used to modify the set of fields for a re-submission attempt. If too much information is given to the spammer following a bad entry, they may be more inclined to adjust their next attempt accordingly. This is one reason I suggest responding to the first obvious attempt of spam with immediate blacklisting.
Typical Approaches to Preventing Blog Spam
There are numerous techniques and tools that a website designer might use to block blog spam, but most of the methods will trade off effectiveness against hassles for either the reader or website operator. The key to great spam filtering is in producing a high degree of effectiveness (no false positives or false negatives) while still making it easy for readers (no registration) and the operator (no daily deletions).
Other methods that are often used, and which will be the subject of an upcoming article here:
- User Registration
- Blacklisting, Bad Behavior
- Turing Tests
- Distributed methods (e.g. Akismet)
- Rel=Nofollow links
Focus of this Article
This page describes techniques for a full-custom approach, which is a highly-effective option for those with some programming experience (e.g. PHP). But the concepts also apply to those who want to understand the various plug-ins available to pre-built blogging programs. Custom solutions have the most potential to be effective -- most automated spamming programs are designed to defeat the faults of the most prevalent website modules.
Comment Spam Statistics
The following statistics are based on what my custom detection code observes here on ImpulseAdventure.com. At the time of writing, I get roughly 3000 visits a day. For obvious reasons, the data is only captured at random times so that it will be less useful to the few bad apples out there who may be reading this!
PS> On that note, any spammer reading this should recognize that I use moderation as a backup for my automated methods. Therefore, no spam will ever reach a live page, and hence no benefit will ever be gained.
Types of Fields
Indicators of Spam
High Severity Indicators
The following patterns are indicative of spam with a high degree of certainty. In nearly all cases, these patterns are impossible to enter by someone actually typing information into your form.
- Hidden / required fields missing - Likely due to using old harvested data. Observed in 5.8% of entries.
- Hidden / required fields modified out of range (29.5%)
- Single-line user fields containing newlines - Malicious form hijacking.
- Multi-line user fields containing elements of MIME email headers - High probability of malicious form hijacking.
- Significant cool-off time between harvest and spam. Most users don't need more than an hour to compose a message, so the 26.5% of entries with extended delays are flagged as bad. One can measure the delay by using a time sequence field to encode the form generation time, then comparing it against the POST submission time. I have seen delays of over 50 days in some cases!
- Mismatch between IP on form generation and posting (seen in 31.4% of entries). In combination with a cool-off time after harvesting, many times the spammer will also use a different proxy to send the actual form POST. No real user is going to have a different IP address between these two events -- NOTE: AOL users might need verification against X-Forwarded-For HTTP header as this ISP uses proxies constantly!
- Special patterns. After careful analysis, I have created two additional pattern detectors that identify nearly all link spam that I see today. These patterns are based on observing methods that are used to help links appear on a variety of blogs. As these checks are extremely important for my overall filter effectiveness (flagging 65.2% and 72.7% of all entries), I'm not sure that I should publish the details of the mechanism here.
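The two form-hijacking checks in the list above can be sketched in a few lines. This is a minimal illustration in Python rather than PHP, and it is not the code used on this site: the function names and the short list of header names are assumptions chosen for the example.

```python
import re

# Header names that have no business starting a line in a comment body.
# Illustrative and deliberately short; "To:" and "Cc:" are left out because
# a real person could plausibly begin a line with them.
MIME_HEADER_RE = re.compile(
    r"^\s*(content-type|mime-version|content-transfer-encoding|bcc)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def single_line_field_is_suspicious(value: str) -> bool:
    """A single-line text input cannot contain CR or LF when typed by hand."""
    return "\r" in value or "\n" in value


def multi_line_field_is_suspicious(value: str) -> bool:
    """Flag text in which any line begins with a MIME / email header name."""
    return MIME_HEADER_RE.search(value) is not None
```

Both checks are cheap string operations, so they are good candidates to run before anything that needs a database lookup.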
Low Severity Indicators
The following characteristics are typical in spam attacks, but could also be due to genuine user content or mistakes. One must be careful in how these are treated, as an overly tough approach will increase likelihood of false positives.
- Identify the acceptable patterns for each of your user fields. It is better if you can somehow influence the user's entry to contain an identifying trait that you can use to validate the input. 51.4% of entries contain questionable content of this type.
- Multiple submissions without reloading the source page. While harvesting attacks often resubmit the same FORM POST (1.4%), it is possible that a user genuinely wanted to go back and resubmit their entry. Multiple submissions can be detected through the use of one-time tokens included in a hidden field.
- Evidence of keywords or key phrases in the message body. I only suggest using a very minimal list (e.g. certain pharmaceuticals), as this method is not very effective with the constant changes seen in phrases every few days. Even still, it is able to capture 33.5%.
- Inappropriate use of field content. If a field asks for a phone number and a URL is submitted, there is a good chance that a human did not read your page! 21.4% of entries are marked in this way.
- Link spamming often involves the submission of a huge number of links (seen in 71.1%). I have encountered as many as 170 URLs in a single entry! Clearly, the more links an entry has, the higher the probability of it advertising something worthless. The graphs below should give you an idea of the distribution of link counts that I tend to see versus what spams contain.
[Graphs: Links per Spam Entry | Links per Legitimate Entry]
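As a rough illustration of the link-count and field-content indicators, here is a minimal Python sketch. The field names ("name", "phone", "comment"), the flag labels and the threshold of three links are all assumptions made for the example, not values taken from my filter. Since these are low-severity indicators, the function only reports flags and leaves the decision to the caller.

```python
import re

# Anything that looks like the start of a web address.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Digits plus the punctuation people commonly type in phone numbers.
PHONE_RE = re.compile(r"^[0-9+()\-.\s]{7,20}$")


def count_links(text: str) -> int:
    """Count URL-like strings in a block of text."""
    return len(URL_RE.findall(text))


def low_severity_flags(fields: dict, max_links: int = 3) -> list:
    """Return a list of low-severity flags for one submission.

    `fields` maps hypothetical field names to the submitted strings.
    """
    flags = []
    phone = fields.get("phone", "").strip()
    if phone and not PHONE_RE.match(phone):
        flags.append("phone_not_a_phone_number")
    if URL_RE.search(fields.get("name", "")):
        flags.append("url_in_name")
    if count_links(fields.get("comment", "")) > max_links:
        flags.append("too_many_links")
    return flags
```

A sensible value for `max_links` depends on what legitimate readers of your own site tend to post, so it is worth checking it against your own link-count distribution before relying on it.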
The goal should always be to identify and filter spam with virtually no false positives (you don't ever want to throw away valid content!).
- Field names are often used by spamming software to make educated guesses about the suitability of a particular field for spam injection. Consider scrambling the field names -- in this way it is very likely that one of the more constrained / restrictive fields (e.g. email or name) may be used for link spamming, which can easily be detected by the filter.
- One strike, you're out! Don't give the spammer an opportunity to retune their attack. Most spammers don't use that many IP addresses concurrently, so IP blocking (for periods of time) can actually be an effective remedy (takes care of 45.0%).
- Don't give information about detected violations unless you can block effectively. Don't challenge the spammer! Maybe you shouldn't write an article about it, for that matter, either!
- Be mindful about the run-time processing required to filter entries as you need to consider the possibility of a Denial of Service attack. Perform your most effective lightweight checks first, and only if they pass should you continue on to use more CPU-intensive checks (such as database lookups).
- If the content you expect in your submissions does not overlap with topics that many spammers target (certain prescriptions, adult content, etc.), then you may want to use a limited form of keyword blacklisting.
- If you are hit very frequently by a particular host, you may want to consider automatically inserting an IP Deny into the .htaccess. One particular spammer hit me 177 times in a matter of 2 days, and they have continued frequently since. By adding in an IP Deny to the .htaccess you can save your host the processing otherwise required. For example, if the following line were added to your root .htaccess, any hits from this address will be blocked:
deny from 126.96.36.199
- When looking for patterns, toss out white-space, certain tags and fillers, and make the comparison case-insensitive whenever possible. You want the checks to be as robust as possible, despite some very creative formatting by the spammers.
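The last point, normalizing text before looking for patterns, might be sketched as follows. This is a minimal Python example under stated assumptions: the set of "filler" characters is a guess, the function names are hypothetical, and "example pill" stands in for whatever terms your own minimal blacklist contains.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")          # crude removal of HTML-like tags
FILLER_RE = re.compile(r"[\s\-_.*]+")    # white-space and common filler characters


def normalize(text: str) -> str:
    """Decode entities, drop tags and fillers, and fold case."""
    text = html.unescape(text)
    text = TAG_RE.sub("", text)
    text = FILLER_RE.sub("", text)
    return text.casefold()


def contains_blocked_term(text: str, blocked_terms) -> bool:
    """True if any blocked term appears in the text after normalization."""
    haystack = normalize(text)
    return any(normalize(term) in haystack for term in blocked_terms)
```

Because all white-space is removed, a blocked term can match across word boundaries in innocent text. That is one more reason to keep the list very short and to treat a hit as a low-severity indicator only.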
Don't challenge the Spammer!
One should be very careful about the level of feedback provided on the submission page. If the result is negative, the sender may either give up or choose to refine their approach. I have definitely seen evidence of refinement, so keep this in mind when you are trying to give a helpful error message!
That said, there are times when negative feedback is appropriate -- if the user posted something that may have accidentally triggered a low-severity threshold, you may choose to tell them to retry. On the other hand, cases that are highly likely to be invalid should get tossed without remorse.
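One way to express this policy in code is a small function that maps the detected severity to a response. The sketch below is in Python and is only an illustration: the `Severity` levels, the dictionary keys and the message wording are assumptions. Showing high-severity senders the same message as a successful post is one possible way of giving nothing away, not necessarily what this site does.

```python
from enum import Enum


class Severity(Enum):
    NONE = 0
    LOW = 1
    HIGH = 2


GENERIC_THANKS = "Thank you for your comment."


def respond(severity: Severity) -> dict:
    """Decide what to store, whether to block, and what to tell the sender."""
    if severity is Severity.HIGH:
        # Toss it, block the sender, and reveal nothing about what was detected.
        return {"store": False, "block_ip": True, "message": GENERIC_THANKS}
    if severity is Severity.LOW:
        # Possibly a genuine user: say it failed, but not which check fired.
        return {
            "store": False,
            "block_ip": False,
            "message": "Your comment could not be accepted as written. "
                       "Please review it and try again.",
        }
    # Clean entries go on to storage (or a moderation queue).
    return {"store": True, "block_ip": False, "message": GENERIC_THANKS}
```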
Cut off the source: Prevent the Harvesting
Consider hiding your comment forms from bots that have been flagged as harvesters in the past. As a good amount of spam originates from a single harvest, this method can actually curb a good deal of it. In fact, enabling this mechanism instantly cut more than 70% of all spam. Note that many times the IP address that is used for harvesting is never used for the actual spamming -- therefore, you'll want to identify the original harvester in order to accomplish this, not the originator of the spam.
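To identify the original harvester, the form itself has to carry some record of who requested it and when. A minimal sketch of one way to do this, in Python, is a hidden field holding the generation time and requesting IP, signed with an HMAC so that it cannot be altered. The same field also allows the cool-off delay and IP mismatch checks described earlier. Everything here is an assumption made for illustration: the token layout, the function names, the one-hour limit and the in-memory `flagged_harvesters` set (a real site would persist it).

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical; keep it private
MAX_AGE_SECONDS = 3600  # "most users don't need more than an hour"

flagged_harvesters = set()  # IPs that fetched a form later used for spam


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def make_form_token(client_ip: str, now=None) -> str:
    """Value for a hidden field: issue time, requesting IP, signature."""
    issued = int(time.time() if now is None else now)
    payload = f"{issued}|{client_ip}"
    return f"{payload}|{_sign(payload)}"


def read_form_token(token: str):
    """Return (issued, harvest_ip) if the signature verifies, else None."""
    parts = token.split("|") if token else []
    if len(parts) != 3:
        return None
    issued_text, harvest_ip, sig = parts
    expected = _sign(f"{issued_text}|{harvest_ip}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return int(issued_text), harvest_ip


def token_problems(token: str, client_ip: str, now=None) -> list:
    """List the indicators triggered by the hidden field of one POST."""
    decoded = read_form_token(token)
    if decoded is None:
        return ["token_missing_or_modified"]
    issued, harvest_ip = decoded
    problems = []
    current = time.time() if now is None else now
    if current - issued > MAX_AGE_SECONDS:
        problems.append("cool_off_delay")
    if harvest_ip != client_ip:
        problems.append("ip_mismatch")
    return problems


def flag_harvester(token: str) -> None:
    """After a submission is judged to be spam, remember who harvested the form."""
    decoded = read_form_token(token)
    if decoded is not None:
        flagged_harvesters.add(decoded[1])


def should_show_form(client_ip: str) -> bool:
    """Hide the comment form from addresses already flagged as harvesters."""
    return client_ip not in flagged_harvesters
```

Note that `flag_harvester` records the address stored inside the token, not the address that sent the POST. As mentioned for AOL users, an IP mismatch on its own can come from a legitimate proxy, so it may deserve a second look before it leads to a block.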
Auditing your Filter
Tracking False Positives
For lower-certainty spam (e.g. with keyword blacklisting), it may be desirable to capture some of the flagged entries into a database so that you can analyze how effective your mechanism has been. If you detect good content getting tossed, you may then be in a position to re-approve it and/or adjust your detection mechanism. If it is a high-certainty spam, you might want to capture a small portion into a DB or simply discard it altogether.
In my case, my only false positive so far has been due to America Online's use of a proxy system that constantly changes IP addresses between page loads. I'm currently modifying my system to handle this type of scenario.
Analyze the content of spam that gets through so that you can adjust your detection mechanism further. Over time you should have a pretty solid idea about the tactics used by most auto-generated entries.
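A minimal sketch of such an audit log, in Python with SQLite, is shown below. The table layout, the function names and the 5% sampling rate for high-certainty spam are assumptions for illustration. The idea is simply to keep every low-certainty entry, keep a small sample of the rest, and record the outcome of each manual review.

```python
import random
import sqlite3


def open_audit_db(path=":memory:"):
    """Create (if needed) a table of flagged entries awaiting review."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS flagged ("
        " id INTEGER PRIMARY KEY,"
        " received_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " severity TEXT NOT NULL,"
        " reasons TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        # NULL = not reviewed yet, 1 = good content (false positive), 0 = spam
        " reviewed_ok INTEGER)"
    )
    return db


def log_flagged(db, severity, reasons, body,
                high_sample_rate=0.05, rng=random.random):
    """Store low-certainty entries always, high-certainty ones only as a sample."""
    if severity == "high" and rng() >= high_sample_rate:
        return False
    db.execute(
        "INSERT INTO flagged (severity, reasons, body) VALUES (?, ?, ?)",
        (severity, ",".join(reasons), body),
    )
    db.commit()
    return True


def false_positive_rate(db):
    """Share of reviewed entries that turned out to be good content."""
    reviewed, good = db.execute(
        "SELECT COUNT(reviewed_ok), COALESCE(SUM(reviewed_ok), 0) FROM flagged"
    ).fetchone()
    return good / reviewed if reviewed else None
```

Storing the reasons alongside each entry makes it possible to see which check is responsible when good content does get tossed.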
- Harvest to Spam Delay
- Multiple attempts: multiple submits per harvest, multiple proxies per spam
- Manual vs automated entries
- Dangers of Spam
- SQL Injection
- Form Hijacking
- Other tricks to combat