One of the first things many of us do in the "monitoring and listening" phase of social media engagement is sign up for Google Alerts. Once those alerts start arriving, one thing to watch out for is scraper blogs. These sites copy and paste (or scrape) random content from all over the Web and repost it on their own sites without crediting the original authors or linking back to the original publication.
Blog scraping is especially insidious when it's your content being scraped, but it's also annoying if you're trying to "listen" online for legitimate content and you can't tell the real thing from the fake. You might want to comment on real blogs but you don't want to comment on a scraper site. You also don't want to include scraper sites in your count of blog mentions, if you're keeping count.
But how can you tell a scraper blog from a legitimate blog? Or, even more confusing, how can you tell a scraper blog from a legitimate aggregator that has permission to pull together content from different blogs and gives those bloggers credit? (SAS-X is one example.)
The main tells for me are:
- The blog has an odd name that isn’t related at all to the software industry.
- The other posts on the blog are not related to our industry, and they cover a very broad range of topics.
- There are no comments.
- The site uses a generic design theme.
- The site carries a lot of Google ads.
- The posts do not link back to the original source.
- The site does not clearly identify itself as an aggregator.
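If you are reviewing a lot of alerts, the checklist above can be turned into a simple tally. The following is a minimal sketch, not a definitive test: the names `SiteSigns`, `count_warning_signs`, and `looks_like_scraper` are hypothetical, the flags are filled in by hand after you look at the site, and the threshold of 4 is an arbitrary assumption you would tune to your own experience.

```python
from dataclasses import dataclass, fields


@dataclass
class SiteSigns:
    """One flag per warning sign from the list above (True = sign is present)."""
    unrelated_name: bool                 # odd name unrelated to the industry
    unrelated_broad_topics: bool         # other posts cover a broad, unrelated range
    no_comments: bool                    # no reader comments anywhere
    generic_theme: bool                  # generic design theme
    heavy_ads: bool                      # a lot of ads
    no_source_links: bool                # posts do not link back to the original
    not_identified_as_aggregator: bool   # site never says it is an aggregator


def count_warning_signs(signs: SiteSigns) -> int:
    """Count how many of the warning signs are present."""
    return sum(getattr(signs, f.name) for f in fields(signs))


def looks_like_scraper(signs: SiteSigns, threshold: int = 4) -> bool:
    """Flag the site as a likely scraper; the threshold is an assumption."""
    return count_warning_signs(signs) >= threshold


if __name__ == "__main__":
    suspect = SiteSigns(True, True, True, True, True, True, True)
    print(count_warning_signs(suspect), looks_like_scraper(suspect))
```

No single sign is conclusive on its own (plenty of legitimate blogs have no comments and a generic theme), which is why the sketch counts signs rather than treating any one of them as proof.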
If I’m suspicious but not sure, I grab a chunk of text from the first paragraph and do a Google search. If the search results lead me to the exact same article on a legitimate blog or publication, then I know the other site was likely the scraper.
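The search step can be sketched in a few lines as well. This is only an illustration under stated assumptions: the helper names `first_paragraph_chunk` and `exact_phrase_search_url` are hypothetical, the 12-word chunk size is arbitrary, and wrapping the chunk in double quotes to ask for an exact-phrase match is my choice rather than something the manual check requires. The code only builds a search URL to open in a browser; it does not fetch or parse any results.

```python
from urllib.parse import quote_plus


def first_paragraph_chunk(post_text: str, n_words: int = 12) -> str:
    """Return the first n_words words of the post's first paragraph."""
    first_paragraph = post_text.strip().split("\n\n")[0]
    return " ".join(first_paragraph.split()[:n_words])


def exact_phrase_search_url(chunk: str) -> str:
    """Build a Google search URL for the chunk as a quoted exact phrase."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{chunk}"')


if __name__ == "__main__":
    post = "Blog scraping is especially insidious when it is your content.\n\nSecond paragraph."
    chunk = first_paragraph_chunk(post, n_words=5)
    print(chunk)
    print(exact_phrase_search_url(chunk))
```

From there the judgment is still manual: open the URL, and see whether the same article turns up on a blog or publication you recognize as legitimate.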