by Viper
20. June 2009 20:11
Download SpamTrainer Binaries
Download SpamTrainer Source
As more and more people tweet, spam on Twitter is growing as well. Every time I search for a topic, almost half of the messages seem to fall into one of the following categories:
- Somebody is trying to sell something
- Somebody is posting links to send you to affiliate web sites to make some money
- Job agencies are posting jobs
- .... and more
This week I decided to try a Bayesian spam filter, the kind most email servers use to filter spam, on Twitter messages. While searching around I found Bayesian Spam Filter for C#, which gave me a good starting point. Without making any changes or training with any additional corpus, I got very good filtering results: I observed close to 90% spam detection. I studied the messages that fell through the cracks and also studied the false positives. Based on these observations, I figured that the issue is the very limited context of 140 characters on Twitter: a lot of good and spam Twitter messages look pretty much the same. So the key to improving the filtering results was to train the filter with Twitter messages rather than rely only on a corpus taken from email or similar sources. I therefore decided to build an application that I could use to generate a corpus of Twitter messages classified as spam and good.
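To make the idea concrete, here is a minimal sketch of a Bayesian (naive Bayes) text filter in Python. It is not the code of the C# library mentioned above or of SpamTrainer. The class name `NaiveBayesFilter`, the tokenizer, the add-one smoothing and the sample messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


class NaiveBayesFilter:
    """Hypothetical minimal naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Add one labelled message ("spam" or "good") to the corpus."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return the estimated probability that the text is spam."""
        vocab = set(self.counts["spam"]) | set(self.counts["good"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "good"):
            total_tokens = sum(self.counts[label].values())
            # Smoothed prior: how common this class is in the corpus.
            score = math.log((self.messages[label] + 1) / (total_messages + 2))
            for token in tokenize(text):
                # Smoothed likelihood; the extra +1 leaves room for unseen tokens.
                seen = self.counts[label][token]
                score += math.log((seen + 1) / (total_tokens + len(vocab) + 1))
            log_score[label] = score
        # Normalise the two log scores into a probability.
        top = max(log_score.values())
        spam = math.exp(log_score["spam"] - top)
        good = math.exp(log_score["good"] - top)
        return spam / (spam + good)


spam_filter = NaiveBayesFilter()
spam_filter.train("Buy cheap followers now, click here", "spam")
spam_filter.train("Make money fast with this affiliate link", "spam")
spam_filter.train("Reading a great article about C# generics", "good")
spam_filter.train("Heading to the conference talk on spam filtering", "good")

is_spam = spam_filter.spam_probability("click here to make money") > 0.5
```

With so few tokens per message, each word carries a lot of weight, which is why training on short Twitter-style messages rather than on email matters.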
How does it work
- Start the application.
- Enter a search term and click the "More Data" button.
- The application does an initial classification of the messages. All spam messages are displayed in orange or light blue.
- Double-click any message to change its classification.

- Once you are satisfied with the results, click the "Accept" button and the results are saved in the appropriate good and spam files.
- You can load the new corpus results by clicking on "Reload Corpus".
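The accept and reload steps above could be sketched roughly as follows. This is only an illustration in Python, not SpamTrainer's actual code: the file names `good.txt` and `spam.txt`, the one-message-per-line layout, and the helper names `toggle`, `accept_results` and `reload_corpus` are all assumptions.

```python
from pathlib import Path

LABELS = ("good", "spam")


def toggle(label):
    """Flip a classification, as a double-click on a message would."""
    return "good" if label == "spam" else "spam"


def accept_results(messages, corpus_dir):
    """Append each reviewed (text, label) pair to the file for its label."""
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for text, label in messages:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        # Store one message per line, so collapse embedded newlines and tabs.
        line = " ".join(text.split())
        with open(corpus_dir / f"{label}.txt", "a", encoding="utf-8") as handle:
            handle.write(line + "\n")


def reload_corpus(corpus_dir):
    """Read the good and spam files back into lists of messages."""
    corpus = {label: [] for label in LABELS}
    for label in LABELS:
        path = Path(corpus_dir) / f"{label}.txt"
        if path.exists():
            lines = path.read_text(encoding="utf-8").splitlines()
            corpus[label] = [line for line in lines if line.strip()]
    return corpus
```

Appending rather than overwriting lets the corpus grow across search sessions, so each reload gives the filter more Twitter-specific training data.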
Spam Filter Service
I have created a service that you can use to classify your text if you do not want to build one of your own. The following link provides more details about the service.
Spam Filter Service