
NAME

MT::JunkFilter

SYNOPSIS

    use MT::JunkFilter;
    MT::JunkFilter->filter($comment);

Introduction

Movable Type 3.2 introduces a pluggable spam-detection framework that plugin developers can extend without conflicting with each other. The framework uses a "scoring" model: each plugin either assigns a numerical score to an incoming feedback item or abstains, and the scores are combined into a composite score. The composite score decides what action to take on the feedback: publish it or throw it in the junk folder.

Movable Type computes the composite score as the average (arithmetic mean) of the scores from all plugins that voted; a plugin that abstains is not included in the average. If the composite score is below a threshold (by default, 0), the comment (or TrackBack) is counted as junk.

Another way to think of it is to imagine that Movable Type weighs each comment using a balance beam that starts out flat. Each plugin, after looking at the item, can place a unit weight anywhere it wants on the balance beam (or it can abstain). If every plugin places its weight on the left side of the balance beam, it will tip to the left, which causes the item to be junked. If every plugin places its weight on the right side, the beam will of course tip that way and the item will be treated as not junk. If some plugins put their weights on the right and some on the left, the outcome will depend on how many weights are on each side and where they are placed. The average is like the center of gravity of all these little weights.

Intuitively, large positive scores will overwhelm small negative scores, or lots of scores on one side will overwhelm a few on the other side.
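
The averaging described above can be sketched in a few lines of standalone Perl. This is an illustration only, not Movable Type's own implementation: the names composite_score and is_junk are hypothetical, abstentions are modeled as undef, and the sketch assumes (as in the sample code later in this document) that negative scores mean junk, so a composite below the threshold is junked. What happens when every plugin abstains is not specified here; the sketch simply returns no composite.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Average the votes that were cast; undef stands for an abstention
    # (an assumption of this sketch).
    sub composite_score {
        my @votes = grep { defined } @_;
        return unless @votes;       # every plugin abstained: no composite
        return sum(@votes) / @votes;
    }

    # Assumption: negative means junk, so a composite below the
    # threshold (default 0) is junked.
    sub is_junk {
        my ($composite, $threshold) = @_;
        $threshold = 0 unless defined $threshold;
        return $composite < $threshold ? 1 : 0;
    }

    # Two votes and one abstention: (-7 + 1) / 2 = -3
    my $composite = composite_score(-7, 1, undef);
    printf "composite=%.1f junk=%d\n", $composite, is_junk($composite);

The abstention does not dilute the average: only the two votes that were cast are counted, and the strong negative vote outweighs the mild positive one.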

Scoring Guidelines

It is important to design your plugin carefully to be sure that it returns an appropriate score. Here are some points to remember.

The threshold, by default, is 0, but the user can adjust it. In the balance beam metaphor, this is like adjusting the "tare" weight or the bias on the beam. This gives the user some control if they find that too many legitimate comments are winding up in the junk folder or if too many junk comments are being published. But this is no excuse for a poorly calibrated plugin.

As a corollary, remember that you needn't always return a numerical value at all. There are many situations where your plugin has no information about whether the item is junk or not. In that case, 0 is not the right value to return, since 0 is in fact a vote. By default, this may make no difference, but when the user has raised the threshold, 0 will be interpreted as a vote for "junk" or, if the threshold was lowered, 0 would then be interpreted as a vote for "not junk."

Think of the balance beam as a perfectly smooth continuum from -10 to +10 -- don't count on any particular value being the cutoff between junk and not-junk. If your plugin's sensors don't know what to make of a comment, it should abstain and let other plugins take care of it.
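
To see why returning 0 differs from abstaining, here is a small standalone sketch. The threshold of -3 is a hypothetical user setting, the helper names mean and verdict are made up for illustration, and the sketch assumes that a composite below the threshold is junked.

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Arithmetic mean of the votes that were actually cast.
    sub mean { return sum(@_) / @_ }

    # Assumption: a composite below the threshold is junked.
    sub verdict {
        my ($score, $threshold) = @_;
        return $score < $threshold ? 'junk' : 'not junk';
    }

    my $threshold  = -3;            # hypothetical setting, lowered from 0
    my $abstained  = mean(-4);      # second plugin abstains: only -4 counts
    my $voted_zero = mean(-4, 0);   # second plugin returns 0: mean is -2

    printf "abstain: %.1f => %s\n", $abstained,  verdict($abstained,  $threshold);
    printf "zero:    %.1f => %s\n", $voted_zero, verdict($voted_zero, $threshold);

With the abstention the composite stays at -4 and the item is junked; with the 0 vote the composite moves to -2 and the item is published. A plugin that knew nothing about the item has changed the outcome.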

METHODS

filter($obj)

Score the object, mark as junk or not-junk and log the action.

score($obj)

Apply the defined filters and return the junk score.

task_expire_junk()

Perform junk expiration for each blog.

The API: Sample Code

To get familiar with the API, let's look at some example code.

As we all know, the "e" character is eeevil. So here is a plugin to detect any E's in an incoming feedback and place a high junk score on items that have a lot of those monsters.

    package SpamTest;

    use strict;
    use warnings;

    use MT::JunkFilter qw(ABSTAIN);
    use base 'MT::Plugin';

    sub name { "Sample Spam Detector"; }
    sub description { "Counts the number of E's, an indicator of junk."; }

    sub score {
        my ($obj) = @_;
        my @es = $obj->all_text =~ m/(e)/gi;
        my $count = scalar @es;
        return ABSTAIN unless $count;

        # Grows quickly with each E; capped to stay within [-10, +10].
        my $score = 2 ** $count - 1;
        $score = 10 if $score > 10;

        return (-$score, "Contained $count 'e' characters");
    }

    MT->add_plugin(__PACKAGE__->new);
    MT->register_junk_filter({name => 'E Junk Filter', code => \&score});

    1;

Some of this is boilerplate for defining a basic Movable Type plugin. Let's cut to the junk.

    MT->register_junk_filter({name => 'E Junk Filter', code => \&score});

This line registers my score routine to be run against incoming comments. The structure of one of these routines is straightforward:

    sub score {
        my ($obj) = @_;

        # ... return ABSTAIN if I can't find any E's ...
        # ... calculate a score based on "e" count ...

        return (-$score, "Contained $count 'e' characters");
    }

The only argument to this routine is the comment (or TrackBack) to be filtered.

The routine should return a list of the form ($score, [$log_line1, $log_line2]). The second element, a reference to an array of log messages, is optional; the examples in this document pass a single string instead. $score should be either a real number in the range [-10, +10] or the special value ABSTAIN imported from the MT::JunkFilter package.
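
As an illustration of the array-reference form of the return value, here is a standalone sketch. It runs without Movable Type, so ABSTAIN is a hypothetical stand-in (a real plugin imports it from MT::JunkFilter), and the routine score_text is made up for this example: it takes plain text rather than a comment object and uses a simple linear score.

    use strict;
    use warnings;

    # Hypothetical stand-in so the sketch runs on its own; a real
    # plugin would import ABSTAIN from MT::JunkFilter.
    use constant ABSTAIN => 'ABSTAIN';

    # Hypothetical filter: one point of junk per E, at most 10.
    sub score_text {
        my ($text) = @_;
        my $count = () = $text =~ /e/gi;    # count the matches
        return ABSTAIN unless $count;

        my $score = -$count;
        $score = -10 if $score < -10;       # stay within [-10, +10]
        return ($score, ["Found $count 'e' characters", "Score: $score"]);
    }

    my ($score, $log) = score_text("Hello there");
    print "score=$score\n";
    print "  $_\n" for @{ $log || [] };

When the routine abstains, the caller receives only the ABSTAIN value and the log element is undefined, which is why the loop above falls back to an empty list.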

Let's break down the example routine a bit further. We use a Perl regular expression to count the E's and then we apply a mathematical function so that the score increases dramatically with each additional E. One E is suspicious, but four or five E's just can't be good.
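
A quick standalone sketch shows how fast the formula 2 ** $count - 1 grows. The helper name e_score is hypothetical, and the cap at 10 is an assumption added here so that the value stays within the documented [-10, +10] range once it is negated.

    use strict;
    use warnings;

    # Raw value from the sample formula, plus a capped version that
    # stays within the documented maximum magnitude of 10.
    sub e_score {
        my ($count) = @_;
        my $raw    = 2 ** $count - 1;
        my $capped = $raw > 10 ? 10 : $raw;
        return ($raw, $capped);
    }

    for my $count (0 .. 5) {
        my ($raw, $capped) = e_score($count);
        printf "%d E's: raw %2d, capped %2d\n", $count, $raw, $capped;
    }

The raw values are 0, 1, 3, 7, 15, and 31, so from four E's onward the uncapped formula leaves the permitted range.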

It is good practice to provide a log message that includes the score you're returning. The messages will be displayed in the admin interface along with the comment, so that the weblog owner can track how a score got to be what it is. Movable Type will add a log line with the final score and the action taken.

If there are no E's in the comment, the plugin has no rightful judgment about the feedback. It might be clean, or there might be some other, unrelated signs that should flag it as junk. That's none of this plugin's business, so it returns ABSTAIN.

Note that a return value of 0 is truly a judgment on the comment, and not an abstention. What kind of judgment is it? Well, it's a judgment more spammy than 1, but less spammy than -1, for example. Since the weblog owner can adjust the threshold, a 0 result can actually tip the comment into the junk folder, or rescue a good comment from it. As a plugin developer, you don't know how the user is going to adjust his or her threshold, so you have to see the 0 value as just somewhere in the middle of the spectrum.

By contrast, ABSTAIN indicates that your plugin has no way of judging whether the item is junk or not. It won't affect the composite score one way or another. A whitelist plugin (below) never wants to tip a comment over into the junk folder -- it doesn't know what would make a comment junk, so it would be reckless to do so. And, except for the small list of whitelisted names, it doesn't know how to recognize a not-junk comment, either. So it abstains unless it sees something that is meaningful on the axis it measures.

    sub score {
        my ($obj) = @_;
        my @whitelist_terms = ('George\s+Lucas', 'Boutros\s+Boutros\s+Ghali',
                               'Neil\s+Armstrong', 'Salif\s+Keita');
        my $whitelist_expr = join "|", @whitelist_terms;
    
        if ($obj->all_text() =~ /$whitelist_expr/i) {
            return (1, "Whitelisted by spam-whitelister.pl");
        } else {
            return ABSTAIN;
        }
    }
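
To try a whitelist routine like this outside Movable Type, a small mock object is enough. In the sketch below, MockComment, whitelist_score, and the ABSTAIN stand-in are all hypothetical names introduced for illustration; the only thing assumed about a real comment object is the all_text method used above.

    use strict;
    use warnings;

    # Hypothetical stand-in; a real plugin imports ABSTAIN from
    # MT::JunkFilter.
    use constant ABSTAIN => 'ABSTAIN';

    # Hypothetical mock that offers only the all_text method.
    package MockComment;
    sub new      { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub all_text { return $_[0]{text} }

    package main;

    # Compile the whitelist once instead of on every call.
    my $whitelist_re = qr/George\s+Lucas|Neil\s+Armstrong/i;

    sub whitelist_score {
        my ($obj) = @_;
        return (1, "Whitelisted name found")
            if $obj->all_text =~ $whitelist_re;
        return ABSTAIN;
    }

    for my $text ("I met Neil  Armstrong once", "Buy cheap pills") {
        my ($score, $msg) = whitelist_score(MockComment->new($text));
        printf "%-28s => %s\n", $text, $score;
    }

The first text matches the whitelist and gets a score of 1; the second gets ABSTAIN, leaving the decision to other filters.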

You're invited to use the "SpamLookup" junk filters that were supplied with Movable Type as a basis for developing your own Junk Filter plugins. The SpamLookup code included with Movable Type is licensed under the Artistic License and may be modified and/or redistributed under the same terms as Perl itself.

LICENSE

The license that applies is the one you agreed to when downloading Movable Type.

AUTHOR & COPYRIGHT

Except where otherwise noted, MT is Copyright 2001-2007 Six Apart. All rights reserved.
