
How does the Bayesian filter work exactly?


#1

I found the following statement:

1.4. SPAM - HelpSpot’s integrated Bayesian filtering technology automatically detects SPAM that enters the system via the email account integration feature. The SPAM filter is smart and learns from previous SPAM received by the system.

I searched Google for that because I saw one of the agents getting rid of a number of pseudo-test messages by marking them as spam (he was probably looking for some kind of batch status change or batch close). In general the wording of those messages was fine, so now I’m concerned that by doing that we have given spam weight to words that are legitimate.

We don’t receive much real spam yet because most proxy e-mail addresses are new, and I am concerned that some legitimate messages may end up in the SPAM folder as false positives because the agent mislabeled some mails when he only wanted to get rid of them.

If you could share some insight into how the spam engine works and which caveats apply when it is misused, it would be appreciated.

Another thing I would like to mention is that we have written a quick and dirty Windows service in Delphi to do the mail polling, which is less intrusive than having to schedule a task that pops up every minute when the administrator is logged in. If you can do Delphi and are interested, I could send you the Delphi code for the Windows service. Right now everything is hardcoded though (the URL and the polling interval).


#2

Yes, you should tell the agents to not mark things as spam which are not spam. The system does analysis on the words in the message along with the email account and some of the email headers. Future emails with those words or from those email accounts could end up being marked as spam, especially in the beginning when you don’t have many messages for the system to learn from. The emails would then be filtered into the spam queue and you’d have to take them out.

You can easily rectify this by emptying the HS_Bayesian_Corpus table and the HS_Bayesian_MsgCounts table. Don’t delete the tables (you’ll lose the indexes if you do), just empty them.
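As a rough illustration of "empty, don't delete", here is a minimal sketch using an in-memory SQLite database. Only the two table names come from this thread; the column names, the index name, and the use of SQLite are assumptions made for the demo, and a real HelpSpot install would run the equivalent `DELETE FROM` statements against its own database (ideally after a backup).

```python
import sqlite3

# In-memory stand-in for the HelpSpot database (assumption: demo only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: only the table names are taken from the thread.
cur.execute("CREATE TABLE HS_Bayesian_Corpus (sWord TEXT, iSpamCount INTEGER)")
cur.execute("CREATE INDEX idx_corpus_word ON HS_Bayesian_Corpus (sWord)")
cur.execute("CREATE TABLE HS_Bayesian_MsgCounts (sCategory TEXT, iCount INTEGER)")
cur.execute("INSERT INTO HS_Bayesian_Corpus VALUES ('printer', 3)")
cur.execute("INSERT INTO HS_Bayesian_MsgCounts VALUES ('spam', 3)")

# Empty the tables: the rows go away, the tables and indexes stay.
for table in ("HS_Bayesian_Corpus", "HS_Bayesian_MsgCounts"):
    cur.execute(f"DELETE FROM {table}")
conn.commit()

remaining_rows = cur.execute(
    "SELECT COUNT(*) FROM HS_Bayesian_Corpus"
).fetchone()[0]
surviving_indexes = [
    row[0]
    for row in cur.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'HS_Bayesian_Corpus'"
    )
]
print(remaining_rows, surviving_indexes)
```

A `DROP TABLE` in the same place would have removed the index along with the table, which is the situation the advice above is meant to avoid.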

By default the system doesn’t allow batch closing but you can enable this option in Admin->Settings (I use it all the time myself). We don’t default it to on though since some organizations don’t want to risk it. It’s obviously a little more dangerous because you could accidentally check the wrong box.

Sure, I’d love to see the script if you don’t mind sharing. I can put it up alongside the VBS script we have up there now. I don’t know anything about Delphi; does it need to be compiled or anything?


#3

Oh, I should also note that a message isn’t actually learned as spam until you delete it from the spam queue. So if you haven’t actually deleted any, you can simply move the messages out of the spam queue and everything will be fine.


#4

OK, the SPAM messages were actually deleted so these tables were populated with wrong input data. Are you only feeding these tables with negative words (not positive)?

BTW, I found another i18n-related issue when cleaning the database; there seems to be a problem with the recognition of words. For example, the Spanish word “notificación” appeared as “notificaci” in the HS_Bayesian_Corpus table.
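One possible cause of this kind of truncation (an assumption, since the thread does not say how HelpSpot splits messages into words) is a tokenizer that only recognises ASCII letters. The sketch below reproduces the symptom with such a pattern and compares it with a Unicode-aware one; both patterns are illustrative, not HelpSpot’s code.

```python
import re

# "notificación", with the accented letter written as an explicit escape.
word = "notificaci\u00f3n"

# ASCII-only pattern (hypothetical): the accented letter ends the word early.
ascii_tokens = re.findall(r"[A-Za-z]+", word)

# Unicode-aware pattern: \w matches accented letters for str input in Python 3.
unicode_tokens = re.findall(r"\w+", word)

print(ascii_tokens)    # ['notificaci', 'n']
print(unicode_tokens)  # ['notificación']
```

A character-set mismatch between the mail and the database column could produce a similar result, so this is only one candidate explanation.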

Delphi is a compiled language. Also, a service is not as plug-and-play as a script, and some instructions are required on how to install it.

If you did Delphi, I just wanted to share it with you so you could add it to the product, but as it is now it would require step-by-step installation instructions, plus the ability to change the source code and compile it.

If we have the time to make it more user friendly and documented so that it is really useful, I’ll send it to you.


#5

Sorry, I should have been more clear. It does populate with positive words as well. When you close a request the words, headers, etc are trained into the system at that point.
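To make the effect of training in both directions concrete, here is a toy naive Bayes style scorer. It is a sketch under stated assumptions, not HelpSpot’s algorithm: the class name, the token format (including the `from:` token standing in for the sender address), and the smoothing are all invented for illustration.

```python
import math
from collections import Counter


class ToyBayesFilter:
    """Hypothetical minimal spam scorer; not HelpSpot's implementation."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        # Count each distinct token once per message.
        self.token_counts[label].update(set(tokens))
        self.msg_counts[label] += 1

    def spam_probability(self, tokens):
        spam_n, ham_n = self.msg_counts["spam"], self.msg_counts["ham"]
        # Smoothed prior odds of spam versus legitimate mail.
        log_odds = math.log((spam_n + 1) / (ham_n + 1))
        for tok in set(tokens):
            # Laplace-smoothed share of messages in each class with the token.
            p_spam = (self.token_counts["spam"][tok] + 1) / (spam_n + 2)
            p_ham = (self.token_counts["ham"][tok] + 1) / (ham_n + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


bayes = ToyBayesFilter()
message = ["from:tester@example.com", "printer", "password", "reset"]

# An agent deletes three legitimate test messages from the spam queue.
for _ in range(3):
    bayes.train(["from:tester@example.com", "test", "printer", "password"], "spam")
before = bayes.spam_probability(message)

# Five similar requests are later closed normally and learned as legitimate.
for _ in range(5):
    bayes.train(message, "ham")
after = bayes.spam_probability(message)

print(f"before: {before:.2f}, after: {after:.2f}")
```

With only the mislabeled messages in the corpus, the legitimate request scores as likely spam; once closed requests with the same words and sender are learned, the score drops. This mirrors the point made earlier in the thread that mistakes matter most while the system has few messages to learn from.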

Also under Admin->Tools you’ll see a screen which you can use while in trial mode to reset all the requests. This will clear out all the tables including the spam ones. This may be useful if you decide to purchase HelpSpot and want to start fresh but not have to re-enter your categories, mailboxes, etc.

Thanks for the note, I’ll look at that. We’re reviewing all the character set issues and trying to do a comprehensive update in a future release.

No problem. I like the idea of that little program, perhaps we’ll look into writing one and sending it along with HelpSpot in a future release so that there’s no need for the scheduled tasks.


#6

Thanks. One additional question regarding spam: my greatest concern with spam protection is false positives.

I was wondering if the current version has any kind of automatic whitelisting, meaning that if, for example, we respond once to a given e-mail address, that address is always accepted as a legitimate user and never classified as spam in future requests.


#7

Currently there’s no whitelisting. However, the sender’s email address is part of the consideration along with some of the other email headers, so responding to an email will considerably help that sender pass the spam filter in future requests.

Also, the filter never deletes anything until you tell it to. False positives will be placed in the spam folder, but that’s all, and they can easily be moved out.