Here at SAS, we've come a long way with how we deal with blog spam on blogs.sas.com.
Last year at this time, I was sifting through dozens of spam messages per day in order to salvage the one or two genuine comments that originated from real readers. I was just a human trying to keep up with machine-generated spam, a time-consuming -- and somewhat frustrating -- activity.
I created SAS-based reports that showed the impact not just on me, but on the hundreds of other SAS blog contributors. Thankfully, in May our internal IT support configured an industry-standard spam-filtering mechanism called Akismet. The spam flow ceased immediately, as if somebody had turned off a spigot.
Recently, the spigot has been turned on slightly, allowing a few spam messages to leak through. The spammers are like an ever-evolving virus, constantly adapting their techniques to punch through our defenses. I wondered: with the leaks I'm seeing, how effective is Akismet for us today?
It turns out that Akismet is very effective. Despite the few messages that I see leak through daily, Akismet is catching over 95% of the spam messages that target our blogs. Over the last 16 days I saw 40 spam notifications hit my inbox. Without Akismet, I would have seen 741 messages in the same period. Yikes.
I can see some of this within my WordPress blog administration screen, since the Akismet plugin embeds a sort of dashboard there. This is useful for me, but we have over 30 blogs hosted at blogs.sas.com. As far as I know, we don't have a view into how Akismet is performing for our entire set of blog properties.
Enter SAS. By using SAS to connect to the WordPress database (something I already do for other reports), I can scrape and aggregate the Akismet metrics for a more global report on spam activity. Here's an example of a chart that I created:
In this chart, the height of the blue bar indicates the total number of incoming spam messages on each day, across all blogs.sas.com blogs. The smaller green bar indicates the number of "author-identified spam" messages -- those that the filter did not catch and that the blog moderator had to mark as spam. The number at the top of each bar indicates a "percent missed" for the day. And finally, the text at the bottom of the chart provides a summary: total spam caught, total missed, and a percentage.
Since we receive so much spam, the IT folks configured Akismet to purge our spam data every 3 weeks or so. That's why the chart shows only a 16-day study period. (See an example of the full report here.)
Interested in the SGPLOT code behind this chart? It looks something like this (data prep steps omitted, of course):
proc sgplot data=reckoning;
  format comment_date dtdate5. fail_rate percentn6.2;
  label spam_caught='Spam caught' spam_allowed='NOT caught';
  vbar comment_date / response=spam_caught dataskin=pressed
    datalabel=fail_rate datalabelfitpolicy=none
    datalabelattrs=(color=black size=10pt);
  vbar comment_date / response=spam_allowed barwidth=.6 dataskin=pressed;
  yaxis label="Akismet ruling" grid;
  xaxis valueattrs=(size=9)
    label="Spam comments (&total_caught. caught, &total_allowed. allowed - &overallfail. missed overall)";
run;
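For readers wondering where fail_rate and the &total_caught, &total_allowed, and &overallfail macro variables come from, here is a rough sketch of what the omitted prep could look like. It is not the actual prep code: the spam_daily data set and its rows are made up for illustration, and the sketch assumes that spam_caught and spam_allowed are separate, non-overlapping daily counts.

```sas
/* Made-up sample rows, for illustration only: one row per day.     */
/* Assumption: spam_caught and spam_allowed do not overlap.         */
data spam_daily;
  input comment_date :datetime18. spam_caught spam_allowed;
  format comment_date dtdate5.;
  datalines;
01JUN2013:00:00:00 120 3
02JUN2013:00:00:00 95 5
03JUN2013:00:00:00 140 2
;
run;

/* Per-day miss rate: the share of that day's spam that got through */
data reckoning;
  set spam_daily;
  total = spam_caught + spam_allowed;
  if total > 0 then fail_rate = spam_allowed / total;
run;

/* Overall totals for the axis label, stored in macro variables */
proc sql noprint;
  select sum(spam_caught) format=comma12.,
         sum(spam_allowed) format=comma12.,
         sum(spam_allowed) / sum(spam_caught + spam_allowed) format=percent8.1
    into :total_caught trimmed, :total_allowed trimmed, :overallfail trimmed
    from reckoning;
quit;

/* Write the summary to the log as a quick check */
%put NOTE: &total_caught. caught, &total_allowed. allowed, &overallfail. missed overall;
```

Running this before the PROC SGPLOT step gives it a reckoning data set with the columns it expects. In a real report, the made-up spam_daily rows would be replaced by whatever query pulls the daily counts from the blog database.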
Assuming that it takes a blog moderator just a few seconds to evaluate and dispose of each spam message, Akismet has collectively saved us several hours over the weeks we see here. And as we can see from the metrics accumulated over time, this really adds up:
I don't think that Akismet costs us very much, and -- in my view -- the time that we save is definitely worth the expense. Spam will always be with us and is just part of the cost of publishing content. But dealing with it manually? Ain't nobody got time for that.