This article is actually fastidious: How spammers generate random comments for blogs

5

Last week Chris Hemedinger posted an article about spam that is sent to SAS blogs and discussed how anti-spam software helps to block spam. No algorithm can be 100% accurate at distinguishing spam from valid comments because of the inherent trade-off between specificity and sensitivity in any statistical test. Therefore, some spam comments slip through the anti-spam filter, and I get the pleasure of reading the comments and deciding whether to allow them to appear on my blog, or whether to manually mark them as spam.

Why spammers submit comments to blogs

When I first started getting spam comments, I wondered what the spammers were trying to achieve. A typical spam comment seems fairly innocuous. Here are some actual spam comments that I have received:

  • Thanks for a marvelous posting! I certainly enjoyed reading it, you can be a great author.
  • Thank you for the good writeup. It in fact was a amusement account it.
  • If all bloggers made good content as you did, the internet will be much more useful than ever before.
  • Wow, this article is actually fastidious.

Yes, some of the grammar and word choices are strange, but these comments are not much different from some legitimate comments that I have received from legitimate readers whose native language is not English. What makes me sure that these are spam comments?

Along with each of these comments, the commenter included a URL link to some web site. The URL in a typical spam comment links to a web site that advertises cheap "name brand" merchandise, Russian brides, or get-rich-quick schemes. The spammers get paid for each link that they can successfully embed somewhere on the web, such as on my blog. As you might know, internet search engines use the number of "incoming links" as a measure of how important a web site is, and therefore how high it should appear in the search results. The goal of the blog spammer is to embed many links in many blog articles so that internet search engines rank their sponsoring web site highly when someone searches for something like "cheap viagra."

The link is not always embedded in the comment itself. When you comment on a blog, you have the option to include your name and to link to your personal web site. In a legitimate comment, the links points to the commenter's blog or business; spammers link to their sponsoring URL.

How spammers create random comments

As you can see from the sample comments, spammers try to construct complimentary but fairly generic message that they can submit regardless of an article's content.

Suppose that a spammer decides to construct the following generic message: Your blog is truly wonderful. He could write a program that submits this comment to a million blogs. However, he would not be very successful because anti-spam software can block this simple attack by applying simple logic: IF the comment is 'Your blog is truly wonderful' AND the URL field is filled in, THEN classify the comment as spam.

To attempt to defeat anti-spam software, spammers randomly generate comments by using synonyms for the nouns, verbs, and adjectives that appear in the comment. For example, synonyms for "blog" include "post" and "article." A synonym for "wonderful" is "marvelous," and so forth. Thus the spammer could modify his spam program to generate random messages according to the following grammatical template: Your NOUN is ADJECTIVE SUBJECT_COMPLEMENT. The result is like those Mad Libs® stories you and your sibling used to create on long car rides. You can create a huge number of possible sentences, but most of them sound silly.

It is easy to write a SAS DATA step program that generates comments that fit the grammatical template. The following program uses the RAND function to randomly generate elements of a character array and uses the CATX function to concatenate the elements into a sentence:

data SpamComments(keep=msg);
array noun{4} $12 _temporary_                /* nouns */
   ("blog","post","article","commentary");
array adj{7}  $12 _temporary_                /* adjectives */
   (" ","simply","actually","truly","sincerely","honestly","really");
array sc{6}   $12 _temporary_                /* subject complements */
   ("wonderful","marvelous","fastidious","judicious","superb","fantastic");
sp = " ";                                    /* delimiter between words */
call streaminit(12345);
length msg $ 120;
do i = 1 to 20;                              /* generate 20 messages   */
   noun_i = ceil(dim(noun)*rand("uniform")); /* random index into noun array */
   adj_i  = ceil(dim(adj)*rand("uniform"));
   sc_i   = ceil(dim(sc)*rand("uniform"));
   msg = catx(sp, "Your", noun[noun_i], "is", adj[adj_i], sc[sc_i]);
   output;
end;
run;
 
proc print; run;
t_spamcomments

As shown by the output on the left, these randomly generated comments can be amusing. It is often obvious when spammers use a thesaurus to automatically generate dozens of synonyms for each word in their message template. Words have subtle connotations; they cannot be mixed and matched like a verbal closet of Garanimals®. I call comments like this "grammaticals."

Do you have any experience dealing with spammers? Share your experience by leaving a comment. I'm sure it will be fastidious. If all readers make astonishing comments as you do, the web will be much more useful than ever before.

Share

About Author

Rick Wicklin

Distinguished Researcher in Computational Statistics

Rick Wicklin, PhD, is a distinguished researcher in computational statistics at SAS and is a principal developer of PROC IML and SAS/IML Studio. His areas of expertise include computational statistics, simulation, statistical graphics, and modern methods in statistical data analysis. Rick is author of the books Statistical Programming with SAS/IML Software and Simulating Data with SAS.

5 Comments

  1. Michelle Homes

    Your post is actually marvelous... Intrigued to see how well Akismet performs ;-)

    An alternative to using arrays could be a hashtable look up to a dataset with columns for nouns, adjectives and subject complements. A large table = a lot of interesting spam comments.

  2. Melvin Alexander on

    Interesting post. I wonder if you could use text mining techniques such as removing stopwords (words that don't help find in finding meaningful text in posts), stemming (removing characters in terms leaving only its root word), etc to filter out the spam before they are posted.

    • Rick Wicklin

      Yes, there is a huge amount of research on statistical classification schemes, and classifying spam is often a canonical example. The UC Irvine Machine Learning Repository has many spam data sets, if you'd like to try your hand at this classification problem. The spam data collected by George Forman (HP Labs) is analyzed heavily in the classic text _The Elements of Statistical Learning_ by Hastie, Tibshirani, and Friedman.

  3. Pingback: Tips for concatenating strings in SAS/IML - The DO Loop

Leave A Reply

Back to Top