Post #50

Neutralising spam

16th November 2003, lunch time | Comments (15)

The following post hinges entirely on the premise that Google places no PageRank value on un-linked, plain text URLs like this: http://webstandards.org/ as opposed to linked URLs like this: http://webstandards.org/. If I’m wrong, please tell me.

Last night I read Mark Pilgrim’s thoughts on the futility of trying to combat comment spam on blogs. I know practically nothing about past fights against spammers, so it was good to get an insight into just how much oomph these shifty-eyed swine have behind them.

Having implemented a blacklist myself last week I guess I’m one of the people Mark labels as thinking they’re special and unique. I suppose it must be frustrating to see the same battles being fought time and again with no lessons being learnt, but then not all of us are aware of these past fights. I didn’t know even a quarter of what Mark wrote, so I don’t feel bad about giving it a shot this time round.

So, since Mark has knocked some holes in the blacklist method, what else is left for us to do? The Movable Type guys have summed up their thoughts on the matter, but don’t really get any further than praising Jay Allen’s work with the Comment Spam Clearinghouse.

Personally (and leaving the blacklists aside for a moment), I see three main pathways blog owners can follow if they want to allow commenting but still fight spam on their sites:

  1. Manually review all comments made to your site before they go live

    This is a time-consuming solution and, unless you’re poised at your computer 24 hours a day, may break up the flow of conversation in your comments. However, it does seem to be the only way to ensure spam never finds its way onto your site.

  2. Try to cut spam off at the submission point

    This is where our blacklists fit in. The other option is to try to recognise spam by its format and content, but that isn’t going to work since spammers post perfectly good English sentences, as well as gobbledygook.

  3. Try to remove the reason spammers spam — the links

    If links to other sites weren’t allowed in comments then there’d be no point in posting this kind of spam.

It’s this last idea I’d like to address, though from a slightly different angle.

Removing links from comments… kind of

The main way in which cash-based businesses try to lower their chances of being robbed is to remove that cash from the premises — they send it to a bank and then publicise the fact. Why spend money on expensive alarms when removing the incentive is a much simpler way of avoiding burglary?

In the same vein, it might be reasonable to think, why would a spammer post a link on your web site if he knew that link would never appear to do its job?

But, I hear you say, a blog that won’t allow outgoing links in comments is going to be terribly unpopular — links are the lifeblood of the Web. So how is this of any use? Well, I’ll tell you…

The idea

My idea goes like this: let all comments be posted to your site straight away (thus avoiding the time lag that manual review introduces), but let the links within them (excepting internal links to your own site) be rendered as plain text.

Then, when you receive your admin notification email, you (as a human administrator) can decide if the comment is spam or not.

If it is spam then remove the comment.

If it’s not spam then click your OK button and activate all the URLs within the post, turning them from plain text into clickable links.

Since Google places no PageRank value on un-linked, plain text URLs, the spammer’s comment will do him no good in the time between him posting it and you removing it.

And if users really want to visit a link before you, as admin, make it active, then they can always cut-and-paste it into their browsers.
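
For anyone wondering how the rendering half of this might work, here’s a rough PHP sketch. The function name and the links_approved flag are invented for illustration; this isn’t lifted from my actual system:

    <?php
    // Rough sketch: show a comment with its URLs either as plain text
    // (not yet reviewed) or as clickable links (the admin has clicked OK).
    function render_comment_body($text, $links_approved)
    {
        // Escape the raw comment so nothing else becomes markup.
        $safe = htmlspecialchars($text, ENT_QUOTES);

        if (!$links_approved) {
            // Not yet reviewed: leave URLs as harmless plain text.
            return $safe;
        }

        // Reviewed and OK'd: wrap http/https URLs in anchor tags.
        return preg_replace(
            '#(https?://[^\s<]+)#i',
            '<a href="$1">$1</a>',
            $safe
        );
    }

    // The same comment before and after approval:
    echo render_comment_body('See http://webstandards.org/', false), "\n";
    echo render_comment_body('See http://webstandards.org/', true), "\n";
    ?>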

Up sides to this idea

  1. It relies on humans to detect spam, not complicated and ultimately flawed regular expressions or the like;
  2. It promotes the flow of conversation in comments, with no time delay between posts appearing;
  3. If you’re away from your computer for a while the whole system doesn’t break down, leaving users wondering where their comments have gone;
  4. You could turn such a system off at any point and just allow all posts to have clickable URLs;
  5. It doesn’t rely on any centralised, attackable resource;
  6. It doesn’t rely on storing data in any way.

Down sides to this idea

  1. There’s still the chance you could end up with a site full of spam which everyone can see before you delete it (but then, we already run that risk);
  2. Users might get frustrated by the plain text URLs if they read a comment before you’ve had time to review and activate its links;
  3. You would have to have some kind of message educating spammers as to why there’s no point posting on your site (though we already do that with the blacklists);
  4. To work fully it does rely on you reading each of the comments made to your site (but then I’m sure most of us do that anyway).

Conclusion

I think this idea would be a fair blend of manual review (where no spam is going to get through), and free-posting (where everything gets through). If we can’t stop spam, can we not neutralise it?

If you wanted you could also combine this with the blacklist technique to remove those obvious spam comments right at the start and save yourself a bit of time.
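
If you did go that route, the blacklist check itself needn’t be anything fancy. Here’s a minimal sketch (the blacklist.txt file and its one-entry-per-line format are just assumptions for the example, not Jay Allen’s actual setup):

    <?php
    // Minimal sketch of a blacklist check run at submission time.
    // Assumes a blacklist.txt file with one banned domain or URL
    // fragment per line.
    function comment_is_blacklisted($text, $blacklist_file = 'blacklist.txt')
    {
        foreach (file($blacklist_file) as $pattern) {
            $pattern = trim($pattern);
            if ($pattern !== '' && stripos($text, $pattern) !== false) {
                return true; // known spam domain found in the comment
            }
        }
        return false;
    }

    // Usage: reject the obvious rubbish before it's even stored.
    if (comment_is_blacklisted($_POST['comment'])) {
        die('Sorry, your comment looks like spam.');
    }
    ?>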

Any thoughts on this?



Comments (15)


  1. Dunstan:

    An additional thought - maybe in the time you take to review a comment, while the links are still rendered as plain text, you could use Javascript to turn them into clickable links.

    That would have the effect of keeping the majority of your users happy (it'd be seamless for those with JS), and would still provide no benefits to the spammers, as I presume Google wouldn't run the JS and recognise the URLs as links.

    Looking forward to comments on this, it'd be interesting to see if it has any merit at all :o)

    Posted 4 hours, 7 minutes after the fact
  2. Matt:

    I don't think moderation is so bad. Just have a system that lets innocent-looking comments through and holds others for approval. It could be Bayesian, blacklist-based, keyword-based, or a combination of all the above. If a comment has "viagra" in it, I'll want to review it even if it's innocent. Plus the system could be used for people who want to set up blogs for their kids but worry about weirdos leaving comments. This is the direction we're heading with WordPress. This isn't to say your solution isn't a good one, but to be really effective you need a hybrid approach.

    (BTW, thanks for the read more feature, the front page was getting unwieldy.)

    Posted 7 hours, 56 minutes after the fact
    Inspired: ↓ Dunstan
  3. Dunstan:

    (Hey Matt - thanks for the comments as always, it's all very well having these thoughts in isolation, but it's great to get some feedback and find out if they're rubbish or not :o)

    I don't know how you would tell which comments _are_ innocent? I guess if this bayesian filter thing people keep talking about can do it then great, but I don't know how to write something like that.

    But you're right, if you had the prowess to get clever filtering in, and combine it with things you know (like 'viagra' buzz words), then it'd be a much better system.

    I was trying to come up with a process that doesn't rely _at all_ on being able to spot spam automatically, but rather by its very nature neutralises spam until a hooman can get in on the act.

    I guess we're all driven by what happens around us - if I was clever enough to start writing filters then I expect I'd be going down that route, and if I had a big connection to the web and was on it all day then I'd maybe go down the monitored route.

    But neither of those things is true, so I came up with this to fit my situation :o)

    But, I think I'm gonna stick with the old blacklist for now - to be honest I don't get much spam, I'm just trying to think ahead since it's such a topical item at the mo.

    Matt: "BTW, thanks for the read more feature, the front page was getting unwieldy."

    I suddenly got really annoyed with it :op

    Thanks again.

    Posted 8 hours, 19 minutes after the fact
    Inspired by: ↑ Matt
    Inspired: ↓ Matt
  4. Matt:

    I've been looking at http://xhtml.net/php/PHPNaiveBayesianFilter.html for PHP bayesian filter code. Haven't had time to really dive into it yet though.

    Posted 8 hours, 35 minutes after the fact
    Inspired by: ↑ Dunstan
  5. Tony Crockford:

    Why turn them into links at all?

    It's a short step for someone reading a post to highlight, copy and paste a plain text URL into their browser address bar. In fact Opera has a "go to URL" right-click option that will go straight to a selected URL even if it's in plain text.

    Or am I being too simplistic again?

    Posted 10 hours, 25 minutes after the fact
    Inspired: ↓ Dunstan, ↓ Dennis Pallett
  6. Dunstan:

    Well turning them into links would be the work of a moment - simply altering an admin setting. I use a little regex to change the URLs into links as they come out of the DB, so it would be a case of simply running that (or not) depending on a flag in that record.

    But it's good to know about that thing in Opera, I recall now there's an extension for Moz FB that lets you do the same thing.

    So, plain text URLs wouldn't be such a pain for the short time they were in effect.

    Although I must say, of the 50 comments made here so far, only 4 of them have involved people other than me posting links. So in this instance it's not as if _every_ commenter would be affected by this system.

    I think it'd work rather well :o)

    Posted 11 hours, 14 minutes after the fact
    Inspired by: ↑ Tony Crockford
  7. BenM:

    Simon Willison has quite a good system at the moment. He has all his links redirected through a page that is excluded from search engine indexing using the robots.txt file.

    So for example a link to my site http://www.benmeadowcroft.com would show up as http://www.example.org/redirect?url=http://www.benmeadowcroft.com and the redirect page would be excluded from indexing.

    Posted 1 day after the fact
    Inspired: ↓ Dunstan
  8. Dennis Pallett:

    I wouldn't leave the links as plain text, as that annoys me, and I'm sure others as well, A LOT! If I see a plain text link, I probably won't go there, unless it's something I really want to see (important).

    Now, to be honest, your idea seems perfect. Blacklists aren't waterproof (for obvious reasons, see http://diveintomark.org/archives/2003/11/15/more-spam ), but if the spammers can't get any PR from your site, then your site is no use to them, and they leave. Simple as that.

    Posted 1 day, 3 hours after the fact
    Inspired by: ↑ Tony Crockford
    Inspired: ↓ Dunstan
  9. Scrivs:

    I guess I take the unusual approach of closing comments after an entry is two weeks old. I find no one posts after that, and if they did post a couple of months later no one is going to notice or care. So if the spammers don't have a form to post to, then there is no spam. Blacklists are good and all, but this way I know for sure that I am safe and that no spam slips under my radar.

    Posted 1 day, 3 hours after the fact
    Inspired: ↓ Dunstan
  10. Dunstan:

    Ben - oh yes, I'd forgotten about Simon's method, that has exactly the effect I was trying to achieve, of neutralising spam rather than just relying on stopping it at the point of submission (I think he says the referrals don't work though, but that's not such a bad side effect I guess). There's a rough sketch of that kind of redirect at the end of the comments.

    Damn, I ought to get a better brain, I could have saved myself some work there :o)

    Dennis - the links would only be plain text for a short while (while I was sleeping for example), not forever.

    Posted 1 day, 5 hours after the fact
    Inspired by: ↑ BenM, ↑ Dennis Pallett
    Inspired: ↓ Dennis Pallett
  11. Dunstan:

    Paul - that's a great idea - or maybe open it up a little bit and say that comments on anything past 2 weeks old must come through you first, before going live. That'd cut down the odds of crap making its way onto your site by a long way, and still let people have their say on the off-chance they want to comment once the excitement has passed - not all posts only relate to the now.

    I guess I'll see how things go on here, but that sounds like a jolly good idea - I guess I haven't been going long enough for it to have occurred to me, when you're new you want _lots_ of comments :op

    Posted 1 day, 5 hours after the fact
    Inspired by: ↑ Scrivs
  12. Dennis Pallett:

    Dunstan - I know, I was replying to Tony Crockford's comment, should've made that clear :)

    Posted 1 day, 5 hours after the fact
    Inspired by: ↑ Dunstan
  13. Scrivs:

    Well the spam only starts to really happen when you get linked from numerous places. Seeing how Simon and I have already linked to you, you should be getting a visit from some friends very shortly :)

    Posted 1 day, 6 hours after the fact
  14. Tara:

    Well if comment spammers are anything like guestbook spammers it won't matter if the link is only in plain text. I took out all links/HTML etc from a client's guestbook and the spammers didn't care. Probably because they're bots and can't tell the difference. Google may not rank the links, but the spammers will still come and spam your comments. If my experience has any relevance to comment spamming, I'm sorry to say Dunstan, but your plain text link method won't have any effect at all.

    Posted 2 days, 10 hours after the fact
    Inspired: ↓ Dunstan
  15. Dunstan:

    That's OK Tara, any kind of info is great, and could save me wasting hours of my time! :o)

    Posted 2 days, 11 hours after the fact
    Inspired by: ↑ Tara
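
As a footnote to the thread: the redirect approach BenM describes in comment 7 might look roughly like the sketch below. The script name, the url parameter and the robots.txt lines are illustrative guesses on my part, not Simon Willison's actual setup:

    <?php
    // redirect.php - all outbound comment links point at this local
    // script, and the script itself is blocked in robots.txt so search
    // engines never follow (or credit) the destination:
    //
    //   User-agent: *
    //   Disallow: /redirect
    //
    $url = isset($_GET['url']) ? $_GET['url'] : '';

    // Only forward to http/https addresses, so the script can't be
    // abused to bounce visitors somewhere else entirely.
    if (preg_match('#^https?://#i', $url)) {
        header('Location: ' . $url);
    } else {
        header('HTTP/1.0 400 Bad Request');
        echo 'Bad redirect URL';
    }
    exit;
    ?>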


