It's been a while since I worked on this, but it is still on my mind a lot. I've been mulling over ways to improve categorization without the semi-supervised tweaking I've had to do so far.
To recap, this is what I am currently doing:
- I have a bunch of 'category exemplars': one set of key terms per category. These are what currently require some manual work;
- for each article, I extract the plain text, normalize capitalization, remove stop words, then use tf-idf to extract the set of most significant terms (I'm not yet doing stemming, although I'll probably start);
- I then use a distance metric between each article's significant terms and the exemplars to assign category scores to the article. If the score for a category exceeds a threshold, the article is considered to be in that category.
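
For anyone who wants something concrete, the three steps above can be sketched roughly as follows. This is an illustration and not the actual code: the tiny stop-word list, the helper names (`tokenize`, `top_terms`, `jaccard`, `categorize`), the smoothed idf formula, the use of Jaccard overlap as the score, and the 0.2 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Normalize capitalization, split on non-letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(docs, k=5):
    """Return, for each document, the set of its k highest tf-idf terms."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    result = []
    for tokens in tokenized:
        if not tokens:
            result.append(set())
            continue
        tf = Counter(tokens)
        weights = {
            # Smoothed idf (an assumption; other variants work too).
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by weight, breaking ties alphabetically so results are stable.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result.append(set(ranked[:k]))
    return result


def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(terms, exemplars, threshold=0.2):
    """Score an article's terms against each exemplar; keep scores above the threshold."""
    scores = {category: jaccard(terms, keys) for category, keys in exemplars.items()}
    return {category: s for category, s in scores.items() if s > threshold}


if __name__ == "__main__":
    exemplars = {
        "sport": {"goal", "match", "football", "league"},
        "politics": {"election", "parliament", "bill", "vote"},
    }
    articles = [
        "The striker scored a goal in the football match",
        "Parliament passed the election bill after a long debate",
    ]
    for text, terms in zip(articles, top_terms(articles)):
        print(text, "->", categorize(terms, exemplars))
```

Jaccard is a similarity rather than a distance (the corresponding distance is one minus the similarity), so here a higher score means a closer match, which fits the "score exceeds a threshold" rule. Cosine similarity over the tf-idf weights would slot into `categorize` the same way.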