Text Classification in WEKA

Text classification can be done using WEKA, an open-source machine learning tool. However, many people run into problems when they have separate training and test data sets. If you use the StringToWordVector filter and get the dreaded 'training and test set are not compatible' error, read on. The error occurs because the filter builds its word dictionary from the data it is given, and word occurrences differ between the training and test sets, so filtering them separately produces two incompatible files. The solution is to batch-filter the training and test sets together using the StringToWordVector filter. The test set will then contain the same attributes as the training set, eliminating the 'training and test set are not compatible' error.
NOTE: If the test data set is missing the class attribute, you have to add it (for .arff files, add an @attribute declaration for the class; for .csv files, append the class column name to the header line). Then, for each instance, put a "?" as the value of that attribute ("?" signifies a missing value in WEKA).
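For example, a minimal test set in ARFF format with unknown class values might look like this (the relation, attribute names, and class labels are hypothetical; use your own):

```
@relation test_set
@attribute text string
@attribute class {pos,neg}

@data
'some document text',?
'another document',?
```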

If you have .csv files, here's how to do it.

For each .csv file (the train and test data sets)
1. Open .csv file in WEKA Explorer
2. Convert all attributes except the class attribute(s) to String type.
Ex: If the attributes are nominal, click Filter and choose filters->unsupervised->attribute->NominalToString. Click the text area of the chosen filter to open the Generic Object Editor, and in the 'attributeIndexes' field type 1-3 if your non-class attributes are attributes 1, 2, and 3. Click Apply to apply the changes.
3. (Optional) Convert your class attribute(s) to nominal type. (Otherwise most classifiers will be disabled)
Ex: If the class attribute is numeric, click Filter and choose filters->unsupervised->attribute->NumericToNominal. Click the text area of the chosen filter to open the Generic Object Editor, and in the 'attributeIndexes' field type the class attribute index (ex: 4). Click Apply to apply the changes.
4. Click Save in the WEKA Explorer and save the new data set as a .arff file.

Now you should have two data sets (train and test) in .arff format. Next we will batch-filter them with StringToWordVector, using the following command on the command line.
NOTE 1: It's easiest to copy the .arff files to the weka.jar location.
NOTE 2: You should have the path to the weka.jar file in the CLASSPATH system variable.
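On Windows this can be done on the command prompt like so (the weka.jar path below is just an example; adjust it to your installation):

```
set CLASSPATH=%CLASSPATH%;C:\Program Files\Weka-3-6\weka.jar
```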

1. Open the command prompt (start->run-> type ‘cmd’)
2. Browse to the weka.jar location (using 'cd')
3. Type the following command

java weka.filters.unsupervised.attribute.StringToWordVector -b -i input_training_set.arff -o output_training_set.arff -c last -r input_test_set.arff -s output_test_set.arff -R 1,2,3 -O -C -T -I -N 0 -M 1

The options denote the following
-b: batch mode
This is useful for filtering two data sets at once: the first data set initializes the filter, and the second is then filtered according to that setup, so your test set will contain the same attributes as the training set.
-i: training input file
-o: training output file
-r: test input file
-s: test output file
-R 1,2,3: process the first, second, and third attributes (the string attributes); by default the whole attribute range is processed
-C: output word count rather than boolean word presence
-T: transform term frequency into log(1+tf)
-I: transform word frequency into tf*log(total # of docs / # of docs containing the word); this is the tf*idf weight without normalization
-N: 0 = don't normalize / 1 = normalize all data / 2 = normalize test data only, to the average length of the training documents (default 0 = don't normalize). (Detailed explanation on the Wekalist)
-L: Convert all tokens to lowercase before adding to the dictionary
-A: Only form tokens from contiguous alphabetic sequences (turn this off when working with phrases! After Weka 3-5-6, this option is no longer available and is replaced by weka.core.tokenizers.AlphabeticTokenizer)
-S: Ignore words that are in the stoplist (we don't use this one since we already apply our own stop list)
-M: minimum term frequency; here it is 1

The most important options are -i, -o, -r, -s, -c, -R, -C.
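The same batch filtering can also be done programmatically through WEKA's Java API. Below is a minimal sketch (not tested against every WEKA version): the file names match the command above, the class attribute is assumed to be last, and only a couple of the options are set.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("input_training_set.arff");
        Instances test = DataSource.read("input_test_set.arff");
        train.setClassIndex(train.numAttributes() - 1); // corresponds to -c last
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setAttributeIndices("1,2,3");   // corresponds to -R 1,2,3
        filter.setOutputWordCounts(true);      // corresponds to -C
        filter.setInputFormat(train);          // dictionary is built from the training set

        Instances newTrain = Filter.useFilter(train, filter);
        Instances newTest = Filter.useFilter(test, filter); // same attributes as newTrain
    }
}
```

Because setInputFormat is called on the training set only, the test set is vectorized with the training dictionary, which is exactly what the -b batch mode does on the command line.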

Now you should have two new .arff files. Load the new training set into WEKA and train a classifier on it (ex: choose Naive Bayes). Under Test Options, choose Supplied test set, click 'Set' and select the newly generated test set. Click More Options and tick 'Output predictions'. Then right-click the result in the 'Result list' and choose 'Re-evaluate model on current test set'.
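A rough command-line equivalent of these GUI steps is sketched below (it assumes the filtered files from the previous step and weka.jar on the CLASSPATH; -t is the training file, -T the test file, and -p 0 prints the predictions without extra attributes):

```
java weka.classifiers.bayes.NaiveBayes -t output_training_set.arff -T output_test_set.arff -p 0
```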

Now you should have the classified data in the results.


Tokenizing, Stopping and Stemming using Apache Lucene

If you want to tokenize, stop, and stem a line of text, you can do it using Apache Lucene. Here is a sample method that takes a line of input and returns tokenized, stopped, and stemmed output.


stop_word_set is a hash set with a list of stop words. I used the list from here.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

private static String tokenizeStopStem(String input) {

        // Tokenize, remove stop words, then Porter-stem each remaining token
        TokenStream tokenStream = new StandardTokenizer(
                Version.LUCENE_36, new StringReader(input));
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stop_word_set);
        tokenStream = new PorterStemFilter(tokenStream);

        StringBuilder sb = new StringBuilder();
        CharTermAttribute charTermAttr = tokenStream.addAttribute(CharTermAttribute.class);
        try {
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                sb.append(charTermAttr.toString());
            }
        } catch (IOException e) {
            System.err.println("Exception while tokenizing: " + e.getMessage());
        }
        return sb.toString();
}
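The stop_word_set used above can be built, for example, with Lucene's own helper (a sketch only; the words listed here are placeholders, so substitute the full stop list you actually use):

```java
import java.util.Arrays;
import java.util.Set;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.util.Version;

// Placeholder stop words; replace with your own list.
private static final Set<?> stop_word_set =
        StopFilter.makeStopSet(Version.LUCENE_36, Arrays.asList("a", "an", "the", "is"));
```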