This article is actually fastidious: How spammers generate random comments for blogs

Last week Chris Hemedinger posted an article about spam that is sent to SAS blogs and discussed how anti-spam software helps to block spam. No algorithm can be 100% accurate at distinguishing spam from valid comments because of the inherent trade-off between specificity and sensitivity in any statistical test. Therefore, some spam comments slip through the anti-spam filter, and I get the pleasure of reading the comments and deciding whether to allow them to appear on my blog, or whether to manually mark them as spam.

Why spammers submit comments to blogs

When I first started getting spam comments, I wondered what the spammers were trying to achieve. A typical spam comment seems fairly innocuous. Here are some actual spam comments that I have received:

  • Thanks for a marvelous posting! I certainly enjoyed reading it, you can be a great author.
  • Thank you for the good writeup. It in fact was a amusement account it.
  • If all bloggers made good content as you did, the internet will be much more useful than ever before.
  • Wow, this article is actually fastidious.

Yes, some of the grammar and word choices are strange, but these comments are not much different from some legitimate comments that I have received from legitimate readers whose native language is not English. What makes me sure that these are spam comments?

Along with each of these comments, the commenter included a URL link to some web site. The URL in a typical spam comment links to a web site that advertises cheap "name brand" merchandise, Russian brides, or get-rich-quick schemes. The spammers get paid for each link that they can successfully embed somewhere on the web, such as on my blog. As you might know, internet search engines use the number of "incoming links" as a measure of how important a web site is, and therefore how high it should appear in the search results. The goal of the blog spammer is to embed many links in many blog articles so that internet search engines rank their sponsoring web site highly when someone searches for something like "cheap viagra."

The link is not always embedded in the comment itself. When you comment on a blog, you have the option to include your name and to link to your personal web site. In a legitimate comment, the links points to the commenter's blog or business; spammers link to their sponsoring URL.

How spammers create random comments

As you can see from the sample comments, spammers try to construct complimentary but fairly generic message that they can submit regardless of an article's content.

Suppose that a spammer decides to construct the following generic message: Your blog is truly wonderful. He could write a program that submits this comment to a million blogs. However, he would not be very successful because anti-spam software can block this simple attack by applying simple logic: IF the comment is 'Your blog is truly wonderful' AND the URL field is filled in, THEN classify the comment as spam.

To attempt to defeat anti-spam software, spammers randomly generate comments by using synonyms for the nouns, verbs, and adjectives that appear in the comment. For example, synonyms for "blog" include "post" and "article." A synonym for "wonderful" is "marvelous," and so forth. Thus the spammer could modify his spam program to generate random messages according to the following grammatical template: Your NOUN is ADJECTIVE SUBJECT_COMPLEMENT. The result is like those Mad Libs® stories you and your sibling used to create on long car rides. You can create a huge number of possible sentences, but most of them sound silly.

It is easy to write a SAS DATA step program that generates comments that fit the grammatical template. The following program uses the RAND function to randomly generate elements of a character array and uses the CATX function to concatenate the elements into a sentence:

data SpamComments(keep=msg);
array noun{4} $12 _temporary_                /* nouns */
   ("blog","post","article","commentary");
array adj{7}  $12 _temporary_                /* adjectives */
   (" ","simply","actually","truly","sincerely","honestly","really");
array sc{6}   $12 _temporary_                /* subject complements */
   ("wonderful","marvelous","fastidious","judicious","superb","fantastic");
sp = " ";                                    /* delimiter between words */
call streaminit(12345);
length msg $ 120;
do i = 1 to 20;                              /* generate 20 messages   */
   noun_i = ceil(dim(noun)*rand("uniform")); /* random index into noun array */
   adj_i  = ceil(dim(adj)*rand("uniform"));
   sc_i   = ceil(dim(sc)*rand("uniform"));
   msg = catx(sp, "Your", noun[noun_i], "is", adj[adj_i], sc[sc_i]);
   output;
end;
run;
 
proc print; run;
t_spamcomments

As shown by the output on the left, these randomly generated comments can be amusing. It is often obvious when spammers use a thesaurus to automatically generate dozens of synonyms for each word in their message template. Words have subtle connotations; they cannot be mixed and matched like a verbal closet of Garanimals®. I call comments like this "grammaticals."

Do you have any experience dealing with spammers? Share your experience by leaving a comment. I'm sure it will be fastidious. If all readers make astonishing comments as you do, the web will be much more useful than ever before.

tags: Just for Fun, Statistical Programming, Strings

3 Comments

  1. Posted May 7, 2014 at 7:56 am | Permalink

    Your post is actually marvelous... Intrigued to see how well Akismet performs ;-)

    An alternative to using arrays could be a hashtable look up to a dataset with columns for nouns, adjectives and subject complements. A large table = a lot of interesting spam comments.

  2. Melvin Alexander
    Posted May 9, 2014 at 2:23 pm | Permalink

    Interesting post. I wonder if you could use text mining techniques such as removing stopwords (words that don't help find in finding meaningful text in posts), stemming (removing characters in terms leaving only its root word), etc to filter out the spam before they are posted.

    • Posted May 9, 2014 at 2:56 pm | Permalink

      Yes, there is a huge amount of research on statistical classification schemes, and classifying spam is often a canonical example. The UC Irvine Machine Learning Repository has many spam data sets, if you'd like to try your hand at this classification problem. The spam data collected by George Forman (HP Labs) is analyzed heavily in the classic text _The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman.

One Trackback

  1. […] week, as part of an article on how spammers generate comments for blogs, I showed how to generate random messages by using the CATX function in the DATA step. In that […]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <p> <pre lang="" line="" escaped=""> <q cite=""> <strike> <strong>