Content Editing Tool For Your Web Site posted: 4/14/08

The Curse Word Filter is our first step towards automating the editor's job. It might save you time in its current form, so we put it here for your convenience. Just add the script below to the Perl code that processes your blog or text input, and put it to work as you gradually add to your negative word list.

This is a concept piece, a work in progress. Your collaboration and development ideas would be greatly appreciated, and I would welcome the opportunity to add this to the comment section of a working web site.

Description posted: 4/14/08
    The curse word filter consists of two parts: the textarea element and the negative word file. The content of the textarea is compared with the negative word file, and if a match occurs the offending text is replaced with asterisks. With more creative programming (using if statements) you may be able to sort out different types of content and create different responses for different types of abrasive input. Maybe even sort out words like 'abrasive'.

use CGI qw/:all/;
$query = new CGI;
    Using the CGI module reduces the amount of code needed for assigning variables.
use Text::Wrap;
$Text::Wrap::columns = 62;
$Text::Wrap::huge = 'overflow';
    The Text::Wrap module allows paragraph formatting that will be used here to reassemble the comment variable after sorting out the curse words. The columns setting is the line width, and setting huge to 'overflow' lets a word longer than that width run past the margin instead of being broken.
    The content of the textarea is sent to the Perl script, typically via the form element's action attribute or with the JavaScript XMLHttpRequest object (Ajax). The script assigns it to the $comment variable.
    Next the Perl script opens up the negative word file, 'negatives.txt'. This is a comma-delimited list with no spaces between entries. The contents of the file are assigned to the $nWords variable.
    Once the negative word file is read, the contents of the $comment variable and the contents of $nWords are separated into the arrays @commentsWord and @negativeWordList. The lengths of both arrays are assigned to variables, $lengthComments and $lengthNegativeWordList.

    The script loops through the comment array checking for matches from the negative word array. If there are any matches, the script swaps the offensive word with asterisks and sets the $marker flag. An alternative caution message can be added easily by assigning it to a variable, here labeled $cussWordRemovedMarker, and then displayed if a value has been changed.
    Fancier scripts might throw in a few 'if' operators to sort entries further. You might want to provide different levels of comment warnings or give users a second chance to choose their words.
    The negative word file can also be a negative phrase list, although the word-by-word comparison used here would need to be extended before it can match an entry that contains spaces. There are no absolutes and the combinations are endless. The creative antagonist is going to get around your filter, but at least you made them stop and think. Plus, with continued improvements to your list, you may be able to work towards a high rate of negative word and/or phrase extraction that will work for your application. Generally what I do is display a message to the user if a word is flagged, like 'Sorry. Content Flagged. Editor will review for inclusion'.
# **********************************************
# ********* BEGIN NEGATIVE WORD EDIT ***********

use CGI qw/:all/;
$query = new CGI;
#  include text wrap for negative word list
use Text::Wrap;
$Text::Wrap::columns = 62;
$Text::Wrap::huge = 'overflow';

$comment = $query->param('COMMENT');

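# &fileNotOpen is an error-handling subroutine assumed to be defined elsewhere in your script
open(NEGATIVE, "<negatives.txt") || die &fileNotOpen;
$nWords = <NEGATIVE>;
chomp($nWords);  # drop the trailing newline so the last entry can still match
close (NEGATIVE);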

@commentsWord = split(/\b/, $comment);
$lengthComments = @commentsWord;
@negativeWordList = split('\,', $nWords);
$lengthNegativeWordList = @negativeWordList;

for($count = 0; $count < $lengthComments; $count++) {
  $currentCommentWord = $commentsWord[$count];
  for($count2 = 0; $count2 < $lengthNegativeWordList; $count2++) {
    if($currentCommentWord eq $negativeWordList[$count2]) {
      $marker = 1;
      $cussWordRemovedMarker = " Negative word(s) removed.  Contact editor for inclusion. ";
      $currentCommentWord = " **** ";
    }
  }
  $newCommentsA .= $currentCommentWord;
}
$newComment = wrap("", "", $newCommentsA);
# ********* END NEGATIVE WORD EDIT ********
# *****************************************
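
The 'fancier' variation mentioned in the description, with different responses for different kinds of words, might look something like the sketch below. This is only one possible approach and not part of the script above: the %severity and %message hashes, the level names 'mild' and 'strong', and the filter_with_levels subroutine are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical word-to-level map. In practice this could be loaded from
# a file, with a level stored alongside each entry.
my %severity = (
    dimwit => 'mild',
    dweeb  => 'mild',
    dunce  => 'strong',
);

# Hypothetical messages, one per level.
my %message = (
    mild   => 'Negative word(s) removed.',
    strong => 'Content flagged. Editor will review for inclusion.',
);

# Returns the filtered comment plus the message for the worst level found
# (an empty string if nothing was flagged).
sub filter_with_levels {
    my ($comment) = @_;
    my $worst = '';
    my @out;
    for my $token (split /\b/, $comment) {
        my $level = $severity{ lc $token };
        if (defined $level) {
            push @out, '****';
            # 'strong' always wins; 'mild' only counts if nothing is set yet
            $worst = $level if $level eq 'strong' or $worst eq '';
        }
        else {
            push @out, $token;
        }
    }
    return (join('', @out), $worst ? $message{$worst} : '');
}

my ($clean, $note) = filter_with_levels('You dweeb, you dunce.');
print "$clean\n$note\n";
```

A hash lookup replaces the inner loop over the word list, so adding a third level is just a matter of adding entries and one more message.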

Try it.  
    Try the words 'dimwit' or 'dweeb', which we've included in the curse word list for this demo. Notice dimwit and dweeb work but dimwitted and dweebalo don't. The script needs a more precise match allowing for leading and trailing spaces around the flagged word or phrase. The script will, however, sort out 'dweeb.' with trailing punctuation such as a period. It's a work in progress. Your feedback will be greatly appreciated (with or without the bad words deleted).
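
One way to catch variants like 'dimwitted' and 'dweebalo' might be to treat each list entry as the start of a word and ignore upper and lower case. The sketch below is an assumption about how that could be done, not what the demo does; @negative_words and mask_variants are hypothetical names. Be aware that prefix matching can also flag innocent words that merely begin with a listed entry.

```perl
use strict;
use warnings;

# Hypothetical in-line list (assumed non-empty); the real script reads
# its entries from negatives.txt.
my @negative_words = ('dimwit', 'dweeb');

# Build one case-insensitive pattern: a word boundary, any listed entry,
# then whatever word characters follow (so "dimwitted" matches "dimwit").
# quotemeta() keeps punctuation in an entry from being read as regex syntax.
my $alternatives = join '|', map { quotemeta } @negative_words;
my $pattern      = qr/\b(?:$alternatives)\w*/i;

# Returns the masked comment and a 1/0 flag saying whether anything changed.
sub mask_variants {
    my ($comment) = @_;
    my $hits = ($comment =~ s/$pattern/****/g);   # count of replacements
    return ($comment, $hits ? 1 : 0);
}

my ($clean, $flagged) = mask_variants('What a Dimwitted dweebalo.');
print "$clean\n";
```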

Here's what I would do. First I would copy and paste the code above into the Perl code I'm using to process blog or text entries. If you have already imported the CGI or Text::Wrap modules, delete the calls to those modules here so they are not duplicated.

Next create your negative word text file, negatives.txt. Here it's in the same directory as the Perl code. The word list is a comma-separated list with no spaces, for example: dweeb,dweebette,dweeber,dunce . . . etc. etc. There's no comma after the last entry. You might also want to add phrases: donkey ass, cow pusher, dimwitted dweebalo, etc. Careful. You're practicing a new brand of censorship here that garners some responsibility.
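
If you do add phrases, or if stray spaces and line breaks creep into the file, a more forgiving loader may help. The sketch below is one hedged way to do it: load_negative_list and mask_listed are hypothetical names, and the list is read from an in-memory string standing in for negatives.txt so the example runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the contents of negatives.txt.
my $file_contents = "dweeb,dweebette, donkey ass ,dimwitted dweebalo,dunce\n";

# Reads a comma-separated list from a filehandle and returns clean entries.
sub load_negative_list {
    my ($fh) = @_;
    my @entries;
    while (my $line = <$fh>) {
        chomp $line;                          # drop the trailing newline
        for my $entry (split /,/, $line) {
            $entry =~ s/^\s+|\s+$//g;         # trim stray spaces
            push @entries, lc $entry if length $entry;
        }
    }
    # Longest first, so a phrase is tried before a shorter word inside it.
    my @sorted = sort { length($b) <=> length($a) } @entries;
    return @sorted;
}

open(my $fh, '<', \$file_contents) or die "Cannot open list: $!";
my @negative_list = load_negative_list($fh);
close($fh);

# Whole-word or whole-phrase match, ignoring case.
sub mask_listed {
    my ($comment) = @_;
    my $alternatives = join '|', map { quotemeta } @negative_list;
    $comment =~ s/\b(?:$alternatives)\b/****/gi;
    return $comment;
}

print mask_listed('Quiet, you Donkey Ass.'), "\n";
```

Matching against the whole comment with one pattern, rather than word by word, is what lets an entry with a space in it be found.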

I would call the script using either the form action attribute or the JavaScript XMLHttpRequest object (Ajax). The second is more popular and allows increased design flexibility. You might want to take a look at the JavaScript file we use on this page if you're using Ajax. Ajax is difficult to debug because, if the script doesn't run, you don't get the standard error messages you would normally see from the Perl code being called.

That's pretty much it, except for a couple of other details that are a matter of style and practice. I think it's a good idea, for example, to put in some sort of character filter so people can't submit script or code to your site: the < and > characters, for example, or the parenthesis characters '(' and ')' that are used in function calls. More labor-intensive code, with a budget, would scramble these characters and unscramble them for display. It would also be interesting to add a few 'if' operators in the Perl code to flag certain key words or phrases and splash different messages for different types of offensive words.
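
For the character filter, one common way to 'scramble' risky characters so they still display correctly is to swap them for HTML entities. The sketch below assumes that approach; the %entity table and the encode_for_display subroutine are hypothetical names, and the set of characters is just the ones mentioned above plus the ampersand and quote marks. The CGI module's escapeHTML function covers a similar job for the basic characters.

```perl
use strict;
use warnings;

# Map each risky character to its HTML entity. The browser shows the
# original character but never treats it as markup or script.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
    '(' => '&#40;',
    ')' => '&#41;',
);

# Returns a copy of the text with every listed character replaced.
sub encode_for_display {
    my ($text) = @_;
    $text =~ s/([&<>"'()])/$entity{$1}/g;
    return $text;
}

print encode_for_display('<script>alert("hi")</script>'), "\n";
```

Because the browser does the 'unscrambling' when it renders the entities, no separate decode step is needed for display.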

The curse word filter is a work in progress. There are a few companies around that let you link to their word lists, but I think I would prefer the flexibility of designing my own negative word list, suitable for individual web sites. A basic list would have a lot of obvious words in it, which I set up for download here. However, it might be kind of iffy to email this around as there are government agencies, I hear, monitoring our emails. I don't want to be at the top of that list. Good luck. Call me. Just don't call me anything bad.