Sharing Statistics (and why it’s a really really dumb idea)

I checked my server logs lately, and found that I was getting a TON of hits to /blog?disp=stats and /index.php?disp=stats. I’m guessing most b2evo users probably are in the same boat with that, including former b2evolution users like myself.

Like all spam, referer-header spam only exists for one reason:

Spammers can make money doing it.

They spam your site to show up in the “My Top Referers” listing, which many bloggers make public. It’s a fun thing to share, but unfortunately, it’s also a honeypot of free links and search-engine ranking if they ping the hell out of your site.

Spammers are smart. They find blog applications that have this “hole” (i.e., free links for them if they spam you), then they search for all the blogs using these systems and send tons of fake hits with their own sites in the referer header. I think that they find vulnerable sites by searching Google for phrases like “Powered by b2evolution” and “MT Forged”. Then they request the stat-display page, both to send their fake referer and to decide whether or not to keep your site on the list.

If you share your referers with the world, then You Will Get Referer Spam. I guarantee it. Just hiding them isn’t enough — you need to lock them up and explain in very simple machine-interpretable terms that they are gone, and will never be back.

If you are with me in thinking that it’s not worth the mild amusement of sharing your referers with your viewers, and want to take your log files back, I’ve cut out about 90% of my referer spam with this .htaccess rule:

RewriteCond %{QUERY_STRING} disp=stats [NC]
RewriteRule ^.*$ - [G,L]

That says: “If they’re looking for anything containing the phrase disp=stats, then return a 410 (Gone) error response, and stop processing the request.”
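
To make the matching concrete, here is a rough Python model of what that condition does. It is only a sketch: `status_for` is a made-up name, the 200 simply stands in for "let the request through as usual", and the real matching happens inside Apache's mod_rewrite, not in Python.

```python
import re

# Same pattern as the RewriteCond; re.IGNORECASE plays the role of [NC].
STATS_PATTERN = re.compile(r"disp=stats", re.IGNORECASE)


def status_for(query_string: str) -> int:
    """Return 410 when the query string contains disp=stats, else 200.

    Hypothetical helper for illustration only.
    """
    if STATS_PATTERN.search(query_string):
        return 410  # [G] = Gone
    return 200


print(status_for("disp=stats"))         # 410
print(status_for("blog=2&DISP=Stats"))  # 410
print(status_for("disp=posts"))         # 200
```

Note that the pattern is not anchored, so the phrase is matched anywhere in the query string, whatever other parameters come before or after it.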

Now, there are a few mods out there to close off the public stats portion of your page, for example by putting something like this in your stub: if( $_GET['disp'] == 'stats' ) $disp = 'posts';

However, the 410 method is even better. Why? Because spammers aren’t sitting around analyzing their logs to make sure that every “disp=stats” request reached an actual “My Stats” page. The bot just checks whether or not it gets a 200 response when it hits your stats URL. 200 means “OK”, so with the stub mod the bot doesn’t realize that your stats page isn’t there any more, and it keeps coming back. (On the plus side, at least you’re not giving them free links!)

I log all the errors on my site in my errors folder, and in the first day of adding this rule, my log file blew up. In fact, they were coming so fast that I had to stop the logging and reconfigure how I was recording these requests so that I wouldn’t run out of server space! (Now it keeps a “scorecard” of these stat hunters, with the number of hits from each REMOTE_ADDR and HTTP_REFERER, instead of a separate line for each request.)
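
One way to picture that scorecard is a counter keyed by the pair of values mentioned above. The sketch below is an assumption about the general shape, not the logger actually running on this site: `record_hit`, the in-memory `scorecard`, and the sample addresses and referers are all invented for illustration.

```python
from collections import Counter

# Hypothetical in-memory scorecard keyed by (REMOTE_ADDR, HTTP_REFERER).
scorecard: Counter = Counter()


def record_hit(remote_addr: str, referer: str) -> None:
    """Count one blocked request instead of writing a new log line."""
    scorecard[(remote_addr, referer)] += 1


# Sample requests; the addresses come from the documentation range 192.0.2.0/24.
record_hit("192.0.2.10", "http://spam.example/")
record_hit("192.0.2.10", "http://spam.example/")
record_hit("192.0.2.77", "http://pills.example/")

# Print the busiest stat hunters first.
for (addr, referer), hits in scorecard.most_common():
    print(f"{hits:5d}  {addr}  {referer}")
```

The storage cost then grows with the number of distinct address and referer pairs rather than with the number of requests, which is what keeps the log from eating the disk.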

On the second day, there were about half as many requests. I still got another 50-100 from some of them before they dropped off, so I think that some of them keep looking until they get a certain number of failures in a row. Then the bot decides that you’re no fun, and moves on.

Why is [G] (410) better than [F] (403)?

Server admins often put up a 410 response when a customer’s site is no longer active (for example, if they moved to a new host, abandoned the domain name, etc.). A 403, however, is usually the result of some sort of check, which tells the spammer that something is still there and invites them to try again with a different referer. If “phentermine.cm3.com” was rejected, how about “p-h-e-n-t-e-r-m-i-n-e.cm3.com”? Or “ph-en-te-rm-in-e.onthenet.as”? This technique is annoying, for sure, and it’s a simple brute-force attack that works really well against the most popular anti-spam measure: a (possibly centralized) blacklist.
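
The brute-force trick is easy to see with a toy blacklist. This sketch assumes the simplest possible design, an exact-match list holding only the first domain from the example; `BLACKLIST` and `is_blocked` are hypothetical names, and real blacklists often match on keywords or patterns instead.

```python
# Hypothetical exact-match blacklist containing the one domain from the text.
BLACKLIST = {"phentermine.cm3.com"}


def is_blocked(referer_host: str) -> bool:
    """Exact-match lookup, the simplest form a blacklist can take."""
    return referer_host.lower() in BLACKLIST


for host in (
    "phentermine.cm3.com",
    "p-h-e-n-t-e-r-m-i-n-e.cm3.com",
    "ph-en-te-rm-in-e.onthenet.as",
):
    print(host, "->", "403" if is_blocked(host) else "gets through")
```

Only the first host is rejected. Every new spelling needs a new blacklist entry, so the spammer can always stay one variation ahead.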

Since I moved from a 403 response to a 410 on disp=stats requests, my referer spam hits have gone down about 90%. Of course, it’s probably only a matter of time before they mutate and see through what I’m doing (in fact, I wouldn’t be surprised if some spammers keep tabs on sites that contain antispam posts like this one!) But, for now, it’s a welcome rest for my server.

Don’t get me wrong, blacklists are good, and I do use a blacklist to filter out some referrers, but if you think that a blacklist is going to keep you safe forever, you’re not thinking like a spammer. A good blacklist is a strong friend, a bouncer at the door. Locking up your stats and giving a 410 response to anyone who looks for them is more like hiding out in a bomb shelter. You don’t even hear them knocking.

Another Tip

If you want to stop comment spam dead in its tracks, this method can help. It’s not foolproof, but it helps.

  1. Use WordPress, and require logins for comments. Also moderate any users without a previously approved comment. Yeah, this is a little annoying since your users have to wait for you to approve their comments, but it’s really rock-solid at fighting comment spam.

    If you run a busy discussion site with lots of user comments, you might want to not moderate all users without a previously approved comment, since this could get quite time-consuming. Requiring logins for comment posting is an absolute must, however.

  2. Get Referrer Karma and Spam Karma. Install both of them in your /wp-content/plugins/ directory, and follow the instructions.

  3. Add these rules to your .htaccess file:

    RewriteCond %{REQUEST_METHOD} ^POST [NC]
    RewriteRule ^(cgi-bin/MT/|htsrv/comment_post|htsrv/trackback).* /wp-content/plugins/ref-karma/referrer-karma.php?rk_redirect_to=&rk_ban_this_ip=1 [QSA,L]

    This says “anyone who tries to send a POST request to the Movable Type or b2evolution comment/trackback system gets their IP banned.” Very nice.

    #replace "my.site.com" with your actual website, but make sure to keep the backslash before any periods.
    RewriteCond %{HTTP_REFERER} !my\.site\.com.* [NC]
    RewriteCond %{REQUEST_METHOD} ^POST [NC]
    RewriteRule (wp-trackback|wp-comments-post) /wp-content/plugins/ref-karma/referrer-karma.php?rk_redirect_to=/&rk_ban_this_ip=1 [QSA,L]

    This says “If anyone sends a POST request to any WP comment or trackback posting script without coming from a page on this site, ban their IP, but just in case it’s a false positive, redirect them to the front page if they click a link to un-ban themselves.”
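
Taken together, the two rules in step 3 amount to a small decision procedure. The Python below is a rough model of it, not a replacement for the rules: the three regular expressions are copied from the .htaccess lines, while `should_ban` and the sample requests are invented for illustration. Paths are written without a leading slash, which is how a RewriteRule in .htaccess sees them.

```python
import re

# Patterns copied from the two RewriteRules and the referer RewriteCond.
FOREIGN_ENDPOINTS = re.compile(r"^(cgi-bin/MT/|htsrv/comment_post|htsrv/trackback).*")
WP_ENDPOINTS = re.compile(r"(wp-trackback|wp-comments-post)")
OWN_SITE = re.compile(r"my\.site\.com.*", re.IGNORECASE)


def should_ban(method: str, path: str, referer: str) -> bool:
    """Return True when a request would be handed to the ban script."""
    # RewriteCond %{REQUEST_METHOD} ^POST [NC]
    if not method.upper().startswith("POST"):
        return False
    # First rule: nothing legitimate posts to MT or b2evolution scripts here.
    if FOREIGN_ENDPOINTS.search(path):
        return True
    # Second rule: WP posting scripts are fine only when the referer is this site.
    return bool(WP_ENDPOINTS.search(path)) and not OWN_SITE.search(referer)


print(should_ban("POST", "htsrv/trackback.php", ""))
print(should_ban("POST", "wp-comments-post.php", "http://spam.example/"))
print(should_ban("POST", "wp-comments-post.php", "http://my.site.com/a-post/"))
```

The last call is the ordinary case of a reader submitting the comment form from one of the site's own pages, and it is the only one of the three that is not banned.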

This is how I’ve eliminated spam on this site. It’s a lovely thing.

Now let’s see how the spammers figure out a way through these methods.

10 Responses to “Sharing Statistics (and why it’s a really really dumb idea)”

  1. On September 26th, 2005 at 16:25:54, jwedgeco Said:

    Hi Isaac,

    I’m pestered by referer spammers as well. One thing that I have noticed is that the worst spammers don’t even read http status codes. I can tell this by my apache logs. Some spammers drop the TCP connection as soon as the request is sent. They don’t even wait for the response. I’ve resorted to using iptables to block IP addresses. I’ve written an article for Linux Journal and I’m waiting on it to be approved. Look for it in print or on the web site.

    Jason

  2. On September 26th, 2005 at 16:54:56, Isaac Said:

    “I can tell this by my apache logs. Some spammers drop the TCP connection as soon as the request is sent.”

    I wonder if there’s a way to automagically read the logs and build the list of IPs to ban based on that.

    I’ll be keeping an eye out for your article.

  3. On September 26th, 2005 at 17:53:55, jwedgeco Said:

    “I wonder if there’s a way to automagically read the logs and build the list of IPs to ban based on that.”

    I do that now based on referers and IP addresses.

  4. On September 27th, 2005 at 09:33:26, Isaac Said:

    I think the real problem is sharing one’s referers in the first place. Once a blog system starts doing this, it’s a free-for-all for spammers.

    If they do not even wait for the HTTP response code, then it would be odd that returning a 410 would have any effect. And yet, it’s impossible to deny that this has had a huge effect on the spam that I’m receiving.

    I wonder if perhaps this is due to the fact that spammers look for targets based on a google search for “disp=stats”. If that’s the case, then maybe the real effect is that I told Google, Yahoo, et al., to remove my stats page from their listings by flagging them as 410 Gone.

    If all bloggers stopped publishing their referrers, and took active steps to prevent comment spam, then I am convinced that this problem would go away. People who depend on spam for google rankings would have to find another way (for example, through spamblogs) that doesn’t involve spamming ME with mountains of false referer headers.

  5. On October 3rd, 2005 at 18:13:20, kwa Said:

    After removing the “disp=stats” pages and links from my blog, I saw referrer spam seriously reduced over the following two weeks. However, once the “disp=stats” pages had disappeared from all my site’s blogs, I saw referrer spammers hitting other pages that never published statistics…

    I’ve just updated my .htaccess file with your “page gone” solution, since some spammers appear to hit it anyway. And filtering using a simple check with “disp=stats” is much quicker than checking every banned keyword/URL/domain. Maybe it’s not _so_ bad to leave the stats online? ;-)

    BTW, nice post! Thanks!

  6. On October 27th, 2005 at 11:12:16, Websiteentwicklung - Unser Blog und der Kampf gegen den Spam - oder: B2Evolution, Versuche Statistiken zu manipulieren, unnötigen Traffic und was man dagegen tun kann Said:

    [...] B2Evolution’s built-in antispam list works as a blacklist: when a request comes from a server listed there, the request is not counted and does not show up in the statistics. That is helpful in principle, but it is not enough, because the spammer’s request still generates system load and traffic. The spam requests mostly target the blog’s statistics page. A look at the web server’s log files is enough to see this, since the requests no longer appear in the B2Evolution statistics because of the blacklist function. That is the first point of attack: you should prevent the statistics function, that disp=stats, from being displayed. Isaac Schlueter has written an interesting English-language post on this topic: you simply tell the requester that the statistics page has been deleted for good. Since the web server itself takes care of this, it generates hardly any traffic or system load. To enable it, you simply add the following to your .htaccess:

    RewriteCond %{QUERY_STRING} disp=stats [NC]
    RewriteRule ^.*$ - [G,L]

    On our site it then looks like this. A look at the log files shows that this already prevents a great deal of traffic, which can cost money. [...]

  7. On February 12th, 2006 at 11:45:41, guchuj05 Said:

    I tried to download my htaccess file to edit it, but it said ‘for the system only.’ How do I go about adding:

    RewriteCond %{QUERY_STRING} disp=stats [NC]
    RewriteRule ^.*$ - [G,L]

    to the file?
    Thanks.

  8. On February 12th, 2006 at 13:40:26, guchuj05 Said:

    Another question: Is there anything that needs to go before or after this bit of code?

    RewriteCond %{QUERY_STRING} disp=stats [NC]
    RewriteRule ^.*$ - [G,L]

    Sorry, but I don’t know anything about this sort of thing.
    Thank you,
    JG

  9. On February 13th, 2006 at 09:55:16, Isaac Said:

    JG,

    Please see this page and read all that it has to say.

    In particular, pay attention to the sections on RTFM and STFW.

    I am happy to share information, and even help to enlighten where I can. I try not to be rude, even when annoyed. But I’m not teaching an “Intro to Web Mechanics” course.

    You’re going to have to hunt this beast on your own, Grasshopper.

  10. On April 6th, 2006 at 04:33:56, segeln » Blog Archive » Unser Blog und der Kampf gegen den Spam - oder: B2Evolution, Versuche Statistiken zu manipulieren, unnötigen Traffic und was man dagegen tun kann Said:

    [...] B2Evolution’s built-in antispam list works as a blacklist: when a request comes from a server listed there, the request is not counted and does not show up in the statistics. That is helpful in principle, but it is not enough, because the spammer’s request still generates system load and traffic. The spam requests mostly target the blog’s statistics page. A look at the web server’s log files is enough to see this, since the requests no longer appear in the B2Evolution statistics because of the blacklist function. That is the first point of attack: you should prevent the statistics function, that disp=stats, from being displayed. Isaac Schlueter has written an interesting English-language post on this topic: you simply tell the requester that the statistics page has been deleted for good. Since the web server itself takes care of this, it generates hardly any traffic or system load. To enable it, you simply add the following to your .htaccess:

    RewriteCond %{QUERY_STRING} disp=stats [NC]
    RewriteRule ^.*$ - [G,L]

    A look at the log files shows that this already prevents a great deal of traffic, which can cost money. [...]