June 23, 2008

Spam Insidy.

There’s a particularly insidious kind of comment spam nowadays, one that cannot be defeated by automated measures such as captcha. I’ve been noticing more and more of it on QuestionCopyright.org, and I’m sure other web site admins are experiencing the same wave.

I don’t know if this phenomenon has a name yet; perhaps you can tell me.

Basically, it’s comment spam written by people — real human beings, not robots — who are paid to surf the web. They scan each article as quickly as they can and then leave a “drive-by” comment. The comment is usually on-topic, more or less, but of extremely low quality, and contains a commercial link back to whoever’s paying that person to surf. This is now a business model: there are intermediaries who hook up people willing to leave links for money with companies looking to boost their search engine rankings. The intermediary charges a flat rate per comment (US $0.20 seems to be the going rate), keeping a percentage and paying the rest to the surfer. The customers buy in bulk, of course, and the surfers are paid in bulk; the intermediary’s business model is based on economies of scale and smoothing things out.

Such comments now comprise the vast majority of new comments on QuestionCopyright.org. That is, we still get genuine comments at the same rate we did before — in fact, that rate may even be slowly increasing — but it’s dwarfed by the number of paid-link spams we’re getting now. It’s a total deluge. To a first approximation, all our new comments are spams, and then if you look closely there are a few hams (good comments) scattered randomly in the flood.

No Turing Test can possibly solve this, because actual humans are involved. Making your captcha puzzles harder won’t help: all it does is drive up the price-per-comment a bit for the buyers, while also making it harder for legitimate commenters to leave their remarks. The only way to detect such comments is to have human editors reading and making judgements.

This has profound implications for the user interfaces by which editors filter out spam. But before we get to that, let me show you how insidious the problem is. Here are some examples, all taken recently from the same article on QuestionCopyright.org. (Note that this is just the tip of the iceberg — the same thing is happening on all the articles on the site.)

This article is very
Submitted by Anonymous on Thu, 2008-06-19 12:53.

This article is very useful.I read it carefully and I agree with the main idea of the author.The opportunities that the internet offers to make our life and work simpler should be taken advantage of. It is absolutely necessary in the field of copyright.

Internet marketing

That one was a pretty easy call. Even the link text (“Internet marketing”) practically screams spam.

But how about this next one?

Extremely Informative
Submitted by Sedona on Mon, 2008-06-09 11:32

Thank you for this extremely informative article. I agree I don’t feel its about creativity but the publishing entities sustaining a mood of “go cautiously” and keep a big legal war chest.

Thank you,
Sedona

A little harder to tell, that time. The comment doesn’t exactly say anything, but it’s not immediately clear that it’s nonsense — you have to read it somewhat carefully to figure that out. Sedona is actually a registered user of the site (note that her name is highlighted in the header line), and the link text at the bottom is just the name “Sedona” too. But the link points to www.sedona-spiritual-vacations.com.

It turned out that this comment was also pretty clearly spam: “Sedona” left similar comments elsewhere on the site, always with the same commercial link at the bottom. None of her comments said anything much, let alone responded in a meaningful way to the content of the article.

But it gets worse. Some link spam comments actually say something. The paid surfer reads the article, apparently enjoys it and has some kind of non-trivial thought about it, and leaves a halfway decent (or sometimes even better than that) comment — but still with a paid link. Like this:

Patent and CopyRight
Submitted by Anonymous on Thu, 2008-05-22 15:41

Apart from the middle man and distributors its probably Lawyers who benefit the most from these laws.

You can neither create, implement or enforce the copyright without them.

One only has to look at the case of RIM ( Blackberry ) in Canada who was forced to pay 600 Million to what was in essence a group of 30+ lawyers who pro bono backed a patent that was actually overthrown in court ( but not before RIM was told to pay ).

This is not an isolated example where a claim jumper has been given a ridiculous patent by the patent office ( who frequently revokes them after they are challenged )

My uncle, who is somewhat of an economist, likes to say that lawyers are one of the very few professional groups who do not contribute to the gross national product of a country.

I am not against lawyers, they are a useful bunch. But like many government employees ( which they are not ) when allowed, too many of them actively attempt to overvalue their services within the scope or measurement of a country’s forward economic progress.

Signed, A Poker Lover

Not a great comment, I admit, but not completely pointless either. If there had been no commercial link, nothing else about it would have raised my suspicions. It was followed up some days later by this one:

Authors & Artists
Submitted by Anonymous on Sun, 2008-06-08 14:14

I agree very much with the previous poster. How many lawsuits are brought up about copyright a year. The millions of dollars which are thrown at law companies around the world to up hold a ‘companies’ intellectual rights is pathetic to say the least.

The only person who truely has a right to claim stack is the writer, producer artist. I qould pay my way to anyone who does work for me. If they provide a service like my electric or water company I pay them. But what do these middle men companies do? They look out for themselve and only themselve. It is time the power was taken away from the big corperations and given back to the people who really deserve to be paid. Those who created it in the first place.

Regards,

David of PC Sport Live

Wow, the link spammers are following up to each other’s posts! Actually, it’s possible that the “David” of the second post is the same person who wrote the first post, even though he (or she?) portrays himself as being a different person. I’ve noticed they do that a lot. You can often tell, from a combination of the writing style and the link destination, that supposedly distinct commenters are really the same person.

The next day, someone followed up to “David”’s comment:

Copyright laws
Submitted by Anonymous on Mon, 2008-06-09 16:34

Hello Everyone,

Today, I note that RedHat Founder Bob Young also weighed in on the copyright issue :

A new open source software group has added its voice to the opposition against the Conservative government’s ( Canada )impending copyright reform bill. Lulu CEO Bob Young likens the legislation to banning screwdrivers because they could be used by burglars.

……

Young said the proposed bill will cater too heavily to the content industry and not to the engineers and software developers that are going to be most severely impacted by the new laws. The proposed anti-circumvention legislation, he said, is similar to making the use and ownership of screw-drivers and pliers illegal because they can be used to commit crimes such as burglary.

Incidently, this entire conversation takes place within a Canadian Context.

Young further says,

“The copyright philosophy behind the U.S. DMCA is that it’s illegal to do what software engineers do every day of the week and what they’ll have to continue to do in order to build better technology for all companies,” Bob Young, spokesperson for the Canadian Software Innovation Alliance (CSIA) and a former founder and CEO at Red Hat Inc., said. “The biggest concern is we’re going to have law substitute for good technology. We’re crafting these laws without having anyone from the technology industry engaged in the process.”

The complete article is here itworldCanada.com

An interested internet marketing guy

Hmmm. It’s clumsily written, and consists mostly of quotes from someone else, but there’s real content there: that quote about the screwdriver is terrific. I have to admit that the comment actually contributes something to the site. I think it probably lies somewhere between typical paid link spam and a real comment: it might be from a person who is actually associated in some permanent way with the business being linked to, and who just makes a habit of always signing his posts with a link back to his business. Or it might be the usual kind of paid link spam. I frankly can’t tell.

I could go on and on; the above is a tiny fraction of what we’ve been getting on the site. There are obvious spams, semi-spams, maybe-not-spams, clearly-not-spams, and every gradation in between. I sometimes have to exercise real judgement when doing comment moderation; it’s not always clear what’s spam and what’s not.

In fact, it is no longer possible to divide comments into “spam” and “not spam” in an unambiguous, binary way. A given comment can now fall into both categories. Paid-link spammers are humans, and may have genuine reactions to the articles they read, even though most of the time they’re reading primarily to get just enough of a sense of the topic to be able to write a drive-by comment. Editors will just have to deal with comments on a case-by-case basis. It may be possible to apply some automatable heuristics, but they will always be imperfect, because the problem of categorization has become arbitrarily complex.

This phenomenon has implications for both site editors and software designers. For the former:

Site editors need to get it through their heads that they’re editors. That is, they’re responsible for quality of the site, and that includes the comments. Whether a comment has spam-like commercial links in it or not is not the question. The real question is, Does that comment contribute anything useful to the site? It’s true that there is a strong correlation between commercial links and poor quality, but it’s the poor quality that’s the problem. If a comment is good but has commercial links, you don’t have to throw out the baby with the bathwater — just replace the links with some text like “[commercial link deleted]”. (It’s important to leave some visible sign that the comment has been edited, otherwise the commenter is effectively being misquoted by your site.)
Site editors need to educate their readers about the situation. Most readers don’t run web sites with open comment forums, and most also have no idea that this whole paid-link comment spam problem even exists (partly because the editors have been protecting them from it). When you accidentally delete one of their comments, or they see comments disappearing, they’ll wonder why; you should have an explanation at the ready, and should refer to it often. (I’m about to update QuestionCopyright.org’s editorial policy to reflect this, and then will link more prominently to that policy from various places on the site.)
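
The link-replacement step described above can be sketched in a few lines of Python. This is only an illustration, not what any particular CMS does: the function name `strip_commercial_links` and both regular expressions are assumptions made for the example, and regex-based HTML handling is fragile enough that a real site would want a proper HTML parser.

```python
import re

# Matches a whole HTML anchor element, including its visible link text.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# Matches a bare URL typed directly into the comment text.
BARE_URL_RE = re.compile(r"\b(?:https?://|www\.)\S+", re.IGNORECASE)

# Visible sign that an editor touched the comment, so the commenter
# is not silently misquoted.
MARKER = "[commercial link deleted]"


def strip_commercial_links(comment_html: str) -> str:
    """Replace every link in a comment with a visible editorial marker."""
    without_anchors = ANCHOR_RE.sub(MARKER, comment_html)
    return BARE_URL_RE.sub(MARKER, without_anchors)
```

The rest of the comment is left untouched, so a good comment survives with only its link removed.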

The implications for software designers (particularly of content-management systems such as Drupal, which is what we’re running on QuestionCopyright.org) are equally important:

Stop thinking about spam-filtering as the problem of filtering a few spams out of the stream of hams. It’s the other way around: the spams are the stream, and the problem is to pick out the rare hams. Please design interfaces accordingly!

If it takes two clicks plus a request/response loop with the server just to see the full body of a new comment, and then another click-plus-loop to mark it as spam or ham, and then another click to confirm, then site editors will waste the majority of their time clicking and waiting. If I have to visit comments by visiting the articles to which they are attached, then the interface is mis-tuned: the operative unit should be the comment, not the article. Since most new comments are obviously spam, I don’t need to see the original article to mark them as such. They are the common case, and the interface should be optimized toward them.

The ideal interface would present the editor with a single page showing all as-yet-unmoderated comments, with their full texts (or arrange them in groups of 20, or whatever, if that would make the page too long). They would all be presumed spam, and the editor’s job would simply be to glide down them marking the hams. Each comment would have a link allowing the editor to see the original article, in case that context is necessary (though it rarely is). Each comment should have a flag next to it indicating whether or not there are any links in the comment at all — if a comment has no links, it is much less likely to be paid spam, and therefore much more likely to be high quality. This flag would enable the editor to set her expectations, which is a great help when faced with hundreds of comments.
Don’t make the parent->child threading relationship between comments fatal. That is, if comment B is a reply to comment A, but A is later classified as spam and deleted, B should not be deleted along with it. Users often click “reply” just as a way of making a new comment; the fact that the comment they’re replying to is spam has no bearing on the quality of the new comment. B may not be spam, even if A is. (The version of Drupal we’re currently running at QuestionCopyright.org gets this wrong, unfortunately, but newer versions may have fixed it.)
Have a setting that allows editors to simply prohibit links. It’s a policy decision that each site must make on its own (some links are useful, as we saw in one of the examples above), but the option should be there.
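
To make the first two suggestions concrete, here is a minimal Python sketch of the data model behind such an interface. It is not Drupal code; every name in it (`Comment`, `moderation_page`, `finish_page`, `delete_comment`) is hypothetical, and the link detection is a rough regular expression chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Rough test for "does this comment contain any link at all?"
LINK_RE = re.compile(r"<a\b|https?://|\bwww\.", re.IGNORECASE)


@dataclass
class Comment:
    comment_id: int
    article_url: str              # lets the editor jump to the article if needed
    body: str
    parent_id: Optional[int] = None
    status: str = "unmoderated"   # becomes "ham" or "spam" after a pass

    @property
    def has_links(self) -> bool:
        """Flag shown next to the comment to set the editor's expectations."""
        return LINK_RE.search(self.body) is not None


def moderation_page(comments, page_size=20):
    """Return the next batch of unmoderated comments, full text included."""
    pending = [c for c in comments if c.status == "unmoderated"]
    return pending[:page_size]


def finish_page(page, ham_ids):
    """Mark the ticked comments as ham; everything else on the page is spam."""
    for c in page:
        c.status = "ham" if c.comment_id in ham_ids else "spam"


def delete_comment(comments, doomed_id):
    """Delete one comment; replies to it survive and move up one level."""
    doomed = next(c for c in comments if c.comment_id == doomed_id)
    for c in comments:
        if c.parent_id == doomed_id:
            c.parent_id = doomed.parent_id
    comments.remove(doomed)
```

The unit of work is the comment, not the article: one page load shows a batch of full comments, one action classifies the whole batch, and deleting a spam parent re-attaches its replies instead of taking them down with it.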

Any other ideas, folks? It’s a whole new world out there…

20 comments

nandita says:

June 23, 2008 at 7:29 pm

I don’t have anything particularly interesting to add although I did enjoy reading your posts. I’ve been experimenting with comments on my blogs: for example, no comments on one of my Blogspot blogs (although I just switched unmoderated comments on), moderated comments on 4dvertigo.wordpress and on LawMatters.in (another of my blogs).

I apologize for the self-‘linking’. The point though is that, for me, the decision to leave comments moderated or unmoderated depends largely on what the content of the blog is. 4dvertigo, for example, seems to attract a number of pointless and extremely nasty comments. The others are relatively bland so they don’t need to be moderated because of comments being abusive.

However, I do moderate LawMatters because I run it with someone else and I think it’s important to limit any possible liability issues. Paid links don’t bother me too much. As long as a comment isn’t overtly nasty and makes some minimal amount of sense, I’ll publish it though probably after removing the link.

I’ve had problems with Drupal although that could have quite a bit to do with the fact that I’m pretty technologically challenged.
Paul Huff says:

June 23, 2008 at 11:36 pm

Here are a couple of thoughts:

1. Bayesian filter the pages on the other side of the links in a comment. That is, your web server, on receiving a comment, goes and checks the links in it. If the linked page compares favorably with the content of the article or blog in question, the comment isn’t flagged as spam. If it doesn’t pass muster with a basic (naive) Bayesian filter, it’s marked as spam. This is a little more intensive and obviously opens up sort of reverse DDoS possibilities, since if you got enough spams your blog host would kill itself trying to filter everything out.

2. Something akin to Spamhaus for comment links. If you mark a few things as spam, they’re saved to a distributed resource (p2p even?) or web service which has a list of comment links previously marked as spam. Complex rules are then used to make sure that false positives for comment spam can be whitelisted, if necessary (i.e. 20 people have marked this link as having spammish qualities, but 75 have said that it’s really quite hammish, so it’s treated as ham by your blog).

3. Domain-specific Turing tests. On a tech blog, require a debugging problem to be solved in place of a captcha. This will restrict the domain of those who can post to people who are knowledgeable in the domain at hand. Similar domain-specific questions could be asked on themed blogs. This kind of works against the economies of scale employed by the comment spammers by requiring them to find people who are knowledgeable about specific fields to be able to post on specific blogs, meaning that they can’t hire people en masse any more, to just sort of spray comments in an arbitrary way all over.
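
As a rough illustration of idea (2), the sketch below keeps a tally of spam and ham reports per linked domain. It is not modeled on any existing service: the class name `SharedLinkVotes`, the simple majority rule standing in for the "complex rules", and the `min_votes` threshold are all assumptions made for the example.

```python
from collections import defaultdict


class SharedLinkVotes:
    """In-memory stand-in for a shared list of reported comment links."""

    def __init__(self):
        # domain -> running count of "spam" and "ham" reports
        self._votes = defaultdict(lambda: {"spam": 0, "ham": 0})

    def report(self, domain, verdict, count=1):
        """Record that `count` moderators judged links to `domain`."""
        if verdict not in ("spam", "ham"):
            raise ValueError("verdict must be 'spam' or 'ham'")
        self._votes[domain.lower()][verdict] += count

    def classify(self, domain, min_votes=5):
        """Return 'spam', 'ham', or 'unknown' when there are too few reports."""
        tally = self._votes.get(domain.lower())
        if tally is None or tally["spam"] + tally["ham"] < min_votes:
            return "unknown"
        return "spam" if tally["spam"] > tally["ham"] else "ham"
```

With 20 spam reports and 75 ham reports for the same domain, `classify` returns `"ham"`, which matches the whitelisting example in (2).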
Crosbie Fitch says:

June 26, 2008 at 5:27 am

On borderline cases, I’d suggest redaction.

This is where the comment is converted to javascript with hyperlinks converted to textual URLs.

I’d use it for off-topic comments or on-topic spam to which one wished to give the benefit of the doubt, but not page space. Inquisitive readers with too much time on their hands can still read such comments by clicking on them (a weeny link icon with alt text ‘comment redacted’).
Karl Fogel says:

June 26, 2008 at 9:57 pm

Nandita,

Hey, no need to apologize for the self-linking — I like your blogs. It sounds like you’re maybe having a different comment experience though: the fact that you get certain types of comments at certain blogs indicates that the readers are at least reacting mainly to the content of your posts. Their reactions may in some cases be, uh, reactionary (sorry, couldn’t resist), but the point is that there is usually a correlation between the type of blog or post and the type of comments it gets. But the defining characteristic of spam is its unrelatedness to its host topic (“host” in the host/parasite sense). And that’s exactly the problem with paid-link spam: any relatedness to the host topic is inserted by the spammer purely for the purpose of getting the comment moderated through. Thus, the moderator is essentially in the position of trying to detect sincerity remotely.

I get different real comments at different blogs (here versus QuestionCopyright.org, for example). But the spam all looks pretty much alike, though there are different amounts of it at different places.
Karl Fogel says:

June 26, 2008 at 10:00 pm

Paul,

Various sites are actually offering (2) as a service right now; I should probably start using it, but I don’t (yet) know how to hook it in to Drupal.

(1) and (3) are really creative; at least, I’d never heard (nor thought) of them before. You wanna do a startup? 🙂
Karl Fogel says:

June 26, 2008 at 10:03 pm

Crosbie,

Ah, so “redaction” in the sense of miniaturization, with user option to expand? That’s a great idea, although I guess it has the problem of fracturing the thread. That is: it’s no longer clear what comments are part of the conversation and what comments aren’t; each user may be experiencing a slightly different conversation, and thus things some people say might not make sense to other people, because the usual shared context wouldn’t really be there.

But in practice, I’m not sure how much of a problem that would be. Everyone would have the “known good” comments as a shared environment, and that’s probably enough to keep things focused.
Crosbie Fitch says:

June 27, 2008 at 3:41 am

Redacted comments are not intended to be part of the conversation (don’t need to be replied to or permalinked), they are there to let readers monitor the moderation/editorial policy. If you redact comments then readers can have a better idea of what passes and fails for comments worthy of ‘full’ publication. You can thus redact comments such as “This is a very good point. I’ll definitely take it to heart in future.” without necessarily upsetting non-spammers who write such trite comments.

Moreover, you can have a link within the javascript pop-up of a redacted comment that refers readers and the commenter to the site’s editorial policy, e.g. “Comments, no matter how on topic they are, that include off-topic links to poker sites, etc. will be redacted”. Obviously, pure spam gets deleted, not merely redacted.

Eventually, most readers will start to ignore redaction icons, when they become confident that the moderation policy doesn’t redact anything worthy of reading. But that small fraction of readers who do click to see redacted comments out of curiosity will probably let you know when you are being overly strict – not that that’s likely.

Another thing is that at least redaction lets an off-topic commenter get at the text of their comment if they didn’t take the precaution of saving it before submission.

This editorial policy also lets you reward good commenters by not putting ‘nofollow’ in the comment hyperlinks to their homepages.
Pingback: Labnotes
Karl Fogel says:

July 7, 2008 at 10:23 am

I left a comment at…

http://blog.labnotes.org/2008/07/06/smart-spam-and-new-comment-policy/#comment-140779

…responding to Labnotes’s automated comment (which is not a spam, by the way, despite surface appearances — see the above link for why).

-Karl
Assaf says:

July 8, 2008 at 3:44 am

Ironically, I have failed the Turing test 🙂 That trackback does look like spam, but I can assure you no male-enhancing products are hiding behind that link.
David says:

July 9, 2008 at 11:28 am

I would like you to note that I did actually make a unique reply on that page. As you can see I have not added the link to my site here since you’ll probably remove it anyway. My comment was not spammy, it was a legitimate post. I also run a blog and give posters the right to link back to their site. I do not see the problem.
Karl Fogel says:

July 9, 2008 at 2:03 pm

David,

Glad to hear you were real. I’ll keep an eye out for your reply when moderating the site.

Since my entire article above is precisely about the difficulty of distinguishing between (or even defining) “spammy” and “legitimate” posts in this new era of paid-link spam, I’m not sure how better to explain the situation.

Note that I’m not accusing you of being a paid-link spammer; it seems clear you’re not. I’m merely pointing out that it is now becoming very hard for site moderators to tell the difference quickly enough to moderate efficiently.
Steve Alexander says:

July 18, 2008 at 7:48 am

Karl, in the “internet marketing” example, the code I see in my browser is a regular-looking HTML anchor tag. I’m surprised it isn’t marked “nofollow”, as that would surely make it less useful for spammers seeking to increase their Google PageRank. http://en.wikipedia.org/wiki/Nofollow
jessica says:

September 13, 2008 at 4:29 pm

Yeah, it was just a few months ago that I had a conversation with someone whose friend is employed as a link spammer. Apparently one of her fortes is linking to tech products. She’ll go on tech blogs and forums and say something like, “Hi, guys! I’m kind of new here, but I’m looking for a new t.v., and I wanted the opinion of someone who really knows what he’s talking about! I’m looking for something light-weight, not too expensive, and something that won’t break. My friend has a [spammy link] and says I should get one too, but I don’t know. What do you guys think? Thanks! ;)”

When possible, she accompanies her post with an attractive userpic.
Karl Fogel says:

September 13, 2008 at 6:17 pm

Wow. How bizarre. I mean, people who would never walk into a quiet library and start shouting loudly and distractingly somehow don’t see anything wrong with being a paid-link spammer. (At least, I’m assuming your friend’s friend matches the preceding description.)

Anonymity really is the death of courtesy.
Karl Fogel says:

September 13, 2008 at 6:21 pm

Steve,

Thanks for the tip. I’m using a content management system (Drupal), so I’d need to persuade its code for displaying comments to add that attribute. I’ll take a look at Drupal and see what that would involve.
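
Independent of Drupal, the transformation itself is small. The Python sketch below is only an illustration of what the comment-display code would have to do: `add_nofollow` is a made-up name, not a Drupal function, and the regular expression deliberately leaves alone any anchor that already carries a `rel` attribute.

```python
import re

# An opening <a ...> tag that does not already contain a rel= attribute.
ANCHOR_WITHOUT_REL = re.compile(r"<a\b(?![^>]*\brel=)", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Add rel="nofollow" to every anchor that has no rel attribute yet."""
    return ANCHOR_WITHOUT_REL.sub('<a rel="nofollow"', comment_html)
```

Anchors with an existing `rel` value would need merging rather than skipping, which is one reason a real implementation would be better off using an HTML parser.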
Karl Fogel says:

October 9, 2008 at 11:39 am

More notes to myself (since this post and its comments have become my scratch pad for an ideal moderation system):

Response time, response time, response time. After you mark a message as ham or spam, you must be able to move on to the next item with no further delay. If there is any waiting for the action just performed, the waiting must happen in the background. Two examples of where this rule is not followed: the Drupal “delete this post” action, and my own mailreader (Gnus) when I’m approving Subversion mailing list posts.

Also, no confirmation step. Asking for confirmation is pessimal: you pay the price in advance for a mistake you might or might not be about to make. And of course, you just habituate to the confirmation step anyway; thus confirmation is only useful when the confirmation-requiring action is liable to be accidentally invoked in place of some other non-confirmation-requiring action. That’s not the case in, say, the Drupal interface.

The better alternative to confirmation is rollback — the ability to realize you’ve just made a mistake and undo it. So that means posts marked as spam go to a holding bin. A post can be automatically flushed out of that bin after 24 hours or a week or whatever; the point is that it just needs to be available for retrieval in the few seconds after one marks it as spam, because that’s the window in which one realizes one miscategorized something.

Note how this is tied in with the response time / no confirmation step rules: if the interface allows you to fly along marking things one way or the other, then your brain starts pipelining, and you will usually realize you’ve mis-marked something a few messages later. Example: deleting spams (not moderating, just deleting) in a Gnus mail group using the “e” key. Post-marking realization of miscategorization is by far the common case there, in my experience. Thus, marking must be undo-able within a reasonable window.
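
A minimal Python sketch of that holding bin, under the assumption that comments are identified by an ID and that flushing runs as a periodic background job. The class `HoldingBin` and its method names are invented for the example; the `now` parameters exist only so the behavior can be exercised without waiting.

```python
import time


class HoldingBin:
    """Spam-marked comments wait here before permanent deletion."""

    def __init__(self, retention_seconds=24 * 60 * 60):
        self.retention_seconds = retention_seconds
        self._held = {}  # comment_id -> (comment, time it was marked)

    def mark_spam(self, comment_id, comment, now=None):
        """Record the decision immediately; there is no confirmation step."""
        marked_at = time.time() if now is None else now
        self._held[comment_id] = (comment, marked_at)

    def undo(self, comment_id):
        """Retrieve a mis-marked comment, or None if it is no longer held."""
        entry = self._held.pop(comment_id, None)
        return entry[0] if entry else None

    def flush_expired(self, now=None):
        """Permanently drop comments held longer than the retention window."""
        now = time.time() if now is None else now
        expired = [cid for cid, (_, marked_at) in self._held.items()
                   if now - marked_at >= self.retention_seconds]
        for cid in expired:
            del self._held[cid]
        return expired
```

Marking is a dictionary write, so the editor can move straight on to the next item, and `undo` covers the case where the mistake is noticed a few messages later.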
PPather says:

November 11, 2008 at 10:28 am

Well thanks for the information guys it is really useful. I am going to edit my blog now and start blocking the spammers 🙂
Karl Fogel says:

August 10, 2009 at 6:50 pm

More notes for myself:

Since spams tend to group in a moderation queue, the interface must support dragging.

Also, since one usually identifies all the spams in a given selection list, the implication is that everything else is hams. Therefore, there should be a “marked items are spam, unmarked are ham” command and/or better yet the reverse: “marked items are ham, everything else is spam”. Either way, don’t force the user to scan the same item twice: if something has crossed my screen already, and I’ve identified it as one or the other, then I never want to see it again.
Pingback: Yes, please « My Cold Blog Online