
A couple of weeks ago I wrote a quick and dirty Python script to scrape the local Zehrs flier, and last night I tossed a GUI around it and hooked it up to a Bayesian classifier to filter the items I'm interested in from those I'm not.

Unfortunately, the classifier seems too unstable. Marking interest in a few things drags many other, apparently unrelated items over to the 'interested' side. Telling it that I'm not actually interested in adult diapers causes it to decide that I'm not interested in the items I originally marked as interesting.

Can anybody who's more familiar with Bayesian classifiers explain why telling it I'm not interested in VEET IN-SHOWER HAIR REMOVER makes it think I'm less interested in MAPLE LEAF BACON, even though the two have no words in common?

I'm using a pair of classifiers, one for 'good' and the other for 'bad'. If one scores high and the other low, the item gets marked as interested or not interested accordingly. Otherwise it's undecided.
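The decision rule above might look something like the following sketch. The thresholds, the `score` method, and the classifier objects are all assumptions for illustration, not the actual code:

```python
# Hypothetical sketch of the two-classifier decision rule described above.
# The thresholds and the .score() interface are assumptions, not the
# author's real code.

GOOD_THRESHOLD = 0.8
BAD_THRESHOLD = 0.2

def classify(item, good_clf, bad_clf):
    """Return 'interested', 'not interested', or 'undecided' for a flier item."""
    good = good_clf.score(item)  # how strongly the 'good' classifier fires
    bad = bad_clf.score(item)    # how strongly the 'bad' classifier fires
    if good >= GOOD_THRESHOLD and bad <= BAD_THRESHOLD:
        return "interested"
    if bad >= GOOD_THRESHOLD and good <= BAD_THRESHOLD:
        return "not interested"
    return "undecided"
```

Anything that doesn't win clearly on one side and lose clearly on the other falls into the undecided bucket.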

Edit: Problem solved. Reason given in comments. Now it's working like a dream.


( 4 comments — Leave a comment )
Mar. 26th, 2007 02:22 pm (UTC)
What's your data for the classifier? The words in the item as a unigram (with an independence assumption)?
Mar. 26th, 2007 04:04 pm (UTC)
I was using the spambayes package. After poking around a bit, it seems that they (somewhat reasonably) don't tokenize the wordstream themselves, but require you to provide the tokenized version. As a result, it was classifying based on letters and not words, giving expectedly unpredictable results.
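That failure mode is easy to reproduce in plain Python: iterating over a string yields individual characters, so a classifier handed a raw string instead of a token list ends up training on letters. The snippet below is just an illustration of that pitfall, not the actual scraper code:

```python
# Illustration of the bug described above: a raw string iterates as
# characters, so a tokenizer-less classifier sees letters, not words.

item = "MAPLE LEAF BACON"

char_tokens = list(item)            # what iterating the raw string yields
word_tokens = item.lower().split()  # what the classifier should be given

print(char_tokens[:5])  # ['M', 'A', 'P', 'L', 'E']
print(word_tokens)      # ['maple', 'leaf', 'bacon']
```

Since "VEET IN-SHOWER HAIR REMOVER" and "MAPLE LEAF BACON" share plenty of letters (E, A, R, ...), character-level features make the two items look related even though they share no words, which explains the mysterious cross-contamination.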
Mar. 26th, 2007 04:50 pm (UTC)
lolol :)
Mar. 27th, 2007 12:35 am (UTC)
There's a Da Vinci Code-esque plot in here somewhere.
