Text Classification in WEKA

Text classification can be done with WEKA, an open source machine learning tool. However, many people run into problems when they have separate training and test sets of data. If you use the StringToWordVector filter and get the dreaded ‘training and test set are not compatible’ error, read on.

This error occurs because the filter builds its word dictionary from the data it is given, and the words that occur in the training set differ from those in the test set. Filtering the two files separately therefore produces two incompatible files. The solution is to batch filter the training and test sets together with the StringToWordVector filter. The test set will then contain the same attributes as the training set, which eliminates the ‘training and test set are not compatible’ error.
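To see why the two files end up incompatible, here is a toy sketch in Python. It is not WEKA code: the sample documents and the helper names build_dictionary and vectorize are made up for illustration. It builds a word dictionary from each set separately, then reuses the training dictionary for both sets, which is roughly what batch mode does.

```python
# Toy illustration only; the documents and helper names are hypothetical.
train_docs = ["cheap pills online", "meeting agenda attached"]
test_docs = ["cheap meeting room", "pills and agenda"]


def build_dictionary(docs):
    """Return the sorted list of distinct words seen in docs."""
    return sorted({word for doc in docs for word in doc.split()})


def vectorize(doc, dictionary):
    """Count how often each dictionary word occurs in doc."""
    words = doc.split()
    return [words.count(term) for term in dictionary]


# Filtering each set on its own gives two different attribute lists.
train_dict = build_dictionary(train_docs)
test_dict = build_dictionary(test_docs)
print("same attributes when filtered separately:", train_dict == test_dict)

# Batch style: the dictionary comes from the training set and is reused
# for the test set, so both sets get vectors over the same attributes.
train_vectors = [vectorize(doc, train_dict) for doc in train_docs]
test_vectors = [vectorize(doc, train_dict) for doc in test_docs]
print(train_dict)
print(test_vectors)
```

Words that occur only in the test set (here "room" and "and") have no attribute in the training dictionary, so they are simply dropped from the test vectors.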
NOTE: If the test data set is missing the class attribute, you have to add it (for .arff files, add an ‘@attribute classname’ line to the header; for .csv files, add ‘classname’ to the header line at the beginning of the file). Then, for each instance, put “?” as the value of that attribute (“?” signifies a missing value in WEKA).
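If the test file is large, the edit described in the note can be scripted. The following is a minimal Python sketch, assuming a dense (non-sparse) .arff file with one instance per line and the class as the last attribute. The function name add_missing_class and the sample data are hypothetical. The class declaration passed in should be copied from the training file, since the two headers have to match.

```python
def add_missing_class(arff_text, class_declaration):
    """Append a class attribute to the header and '?' to every data row."""
    out = []
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data and stripped.lower() == "@data":
            # The new attribute goes last, right before the data section.
            out.append(class_declaration)
            out.append(line)
            in_data = True
        elif in_data and stripped and not stripped.startswith("%"):
            # '?' marks the class value of this instance as missing.
            out.append(line + ",?")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical test file without a class attribute.
test_arff = """@relation reviews
@attribute text string
@data
'great phone'
'battery died fast'
"""

fixed = add_missing_class(test_arff, "@attribute sentiment {pos,neg}")
print(fixed)
```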

If you have .csv files, here’s how to do it.

For each .csv file (the train and test data sets):
1. Open .csv file in WEKA Explorer
2. Convert all attributes except the class attribute(s) to String type.
Ex: If the attributes are nominal, click Choose in the Filter panel and select filters->unsupervised->attribute->NominalToString. Click the text area of the chosen filter (this opens the Generic Object Editor) and in the ‘attributeIndexes’ field type 1-3 if your non-class attributes are attributes 1, 2 and 3. Click Apply to apply the changes.
3. (Optional) Convert your class attribute(s) to nominal type. (Otherwise most classifiers will be disabled.)
Ex: If the class attribute is numeric, click Choose in the Filter panel and select filters->unsupervised->attribute->NumericToNominal. Click the text area of the chosen filter (Generic Object Editor) and type the class attribute index (ex: 4) in the ‘attributeIndexes’ field. Click Apply to apply the changes.
4. Click Save in the WEKA Explorer and save the new data set as an .arff file.
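The same conversion can also be sketched outside the Explorer. The Python below is a rough illustration, not a replacement for WEKA's own converters: the function names quote and csv_to_arff, the column names and the sample rows are all hypothetical. It assumes the CSV has a header row and that column names and class labels contain no spaces or special characters. The class labels are passed in explicitly so that the training and test files get an identical class declaration.

```python
import csv
import io


def quote(value):
    """Quote a string value, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def csv_to_arff(csv_text, relation, class_column, labels):
    """Write text columns as string attributes and the class as nominal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    text_columns = [c for c in reader.fieldnames if c != class_column]

    lines = ["@relation " + relation]
    lines += ["@attribute {} string".format(c) for c in text_columns]
    lines.append("@attribute {} {{{}}}".format(class_column, ",".join(labels)))
    lines.append("@data")
    for row in rows:
        # The class value is written unquoted; '?' stays a missing value.
        values = [quote(row[c]) for c in text_columns] + [row[class_column]]
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Hypothetical training data with two text columns and a class column.
train_csv = "subject,body,label\nhello,see you at noon,ham\nwin now,it's free,spam\n"
arff = csv_to_arff(train_csv, "mail", "label", ["ham", "spam"])
print(arff)
```

Calling the function on the test CSV with the same labels list (and “?” in the class column) gives a header that matches the training file.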

Now you should have two data sets (train and test) in .arff format. Next, batch filter them with StringToWordVector using the following commands on the command line.
NOTE 1: It’s easiest to copy the .arff files to the folder that contains weka.jar.
NOTE 2: The path to the weka.jar file must be in the CLASSPATH system variable.

1. Open the command prompt (Start->Run->type ‘cmd’)
2. Change to the folder that contains weka.jar (using ‘cd’)
3. Type the following command:

java weka.filters.unsupervised.attribute.StringToWordVector -b -i input_training_set.arff -o output_training_set.arff -c last -r input_test_set.arff -s output_test_set.arff -R 1,2,3 -O -C -T -I -N 0 -M 1

The options denote the following
-b: batch mode
This is useful for filtering two datasets at once. The first dataset is used to initialize the filter and the second one is then filtered according to this setup, i.e., your test set will then contain the same attributes.
-i: training input file
-o: training output file
-r: test input file
-s: test output file
-c last: use the last attribute as the class attribute
-R 1,2,3: process the first, second and third attributes, which are the string attributes (by default, all string attributes are processed)
-C: output word count rather than boolean word presence
-T: transform term frequency into log(1+tf)
-I: transform each word frequency into tf*log(total number of docs / number of docs containing this word). This is the tf*idf weight without normalization
-N: 0=do not normalize, 1=normalize all data, 2=normalize test data only to the average length of training documents (default 0=don’t normalize). (Detailed explanation from Wekalist)
-L: Convert all tokens to lowercase before adding to the dictionary
-A: Only form tokens from contiguous alphabetic sequences (turn this off when working with phrases! From WEKA 3-5-6 onwards this option is no longer available; it is replaced by weka.core.tokenizers.AlphabeticTokenizer)
-S: Ignore words that are in the stoplist. (We don’t use this one since we have already applied our own stop list.)
-M: minimum term frequency, here 1.
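As a worked example of the two weighting formulas given for -T and -I, here is a small Python sketch. It only reproduces the arithmetic as stated above and is not taken from WEKA's source; the function names and the sample numbers are made up, and the natural logarithm is an assumption.

```python
import math


def tf_transform(tf):
    """-T style weight: log(1 + tf)."""
    return math.log(1 + tf)


def idf_transform(tf, total_docs, docs_with_word):
    """-I style weight: tf * log(total docs / docs containing the word)."""
    return tf * math.log(total_docs / docs_with_word)


# A word that occurs 3 times in a document and appears in 10 of 100 documents.
print(round(tf_transform(3), 3))            # log(4)    -> 1.386
print(round(idf_transform(3, 100, 10), 3))  # 3*log(10) -> 6.908

# A word that appears in every document carries no weight under -I.
print(idf_transform(3, 100, 100))           # 3*log(1)  -> 0.0
```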

The most important options are -i, -o, -r, -s, -c, -R, -C.
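If you want to run the filter from a script, the command can be assembled from just these options. This is a minimal Python sketch under a few assumptions: the function name batch_filter_command is hypothetical, the file names are the placeholders used in the command above, and weka.jar is passed to java with -cp instead of relying on the CLASSPATH variable.

```python
def batch_filter_command(train_in, train_out, test_in, test_out,
                         string_attributes="1,2,3", weka_jar="weka.jar"):
    """Build the argument list for a batch StringToWordVector run."""
    return [
        "java", "-cp", weka_jar,   # put weka.jar on the classpath explicitly
        "weka.filters.unsupervised.attribute.StringToWordVector",
        "-b",                      # batch mode
        "-i", train_in,            # training input file
        "-o", train_out,           # training output file
        "-r", test_in,             # test input file
        "-s", test_out,            # test output file
        "-c", "last",              # class attribute is the last one
        "-R", string_attributes,   # string attributes to convert
        "-C",                      # word counts instead of presence
    ]


command = batch_filter_command(
    "input_training_set.arff", "output_training_set.arff",
    "input_test_set.arff", "output_test_set.arff",
)
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) would launch the filter, assuming Java is installed and weka.jar is in the current folder.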

Now you should have two new .arff files. Load them into WEKA and classify the training set (ex: Choose Naive Bayes). In Test Options, choose Supplied Test Set. Click ‘Set’ and choose the newly generated test data set. Click More Options and tick ‘Output Predictions’. Right-click the result in the ‘Result list’ and choose ‘Re-evaluate model on current test set’.

Now you should have the classified data in the results.



6 thoughts on “Text Classification in WEKA”

  1. Laritza says:

    I hope you can help. I have two datasets training and testing, Training set has 6 class labels, testing has no class label. I am running StringToWordVector from the command line as you have on your blog post. I am getting an error:
    Input file formats differ
    Attribute differ at position 2
    Different number of labels 1 != 6

    I replaced:
    @attribute text string
    @attribute @@class@@ {N}
    @attribute text string
    @attribute @@class@@ {?}

    and N for ? on each text line but it still does not work.
    Any ideas on how I can get it to run?

    I used TextDirectory loader for both sets inside WEKA to create the arff files from directories.

    • Sagara says:

      Hi, I’m a bit rusty on WEKA at the moment, but off the top of my head, I think you need to have the same attributes in your test data as well. The value for the attribute in each test data item should be “?”. You mention “testing has no class label”, so I think this is where the problem comes up.
      Sorry if this seems stupid, but I haven’t used WEKA in a very long time.

  2. Dipawesh Pawar says:

    hi ,

    I am fairly new to WEKA. I used batch filtering some time back, but since I am using semi-supervised classification, where the test set is very large compared to the train set, I was losing info in the test set after batch filtering.

    So I shifted to the filtered classifier. Now I want to know: does this classifier create a dictionary? If yes, how can I obtain it?

  3. Anonymous says:

    java weka.filters.unsupervised.attribute.StringToWordVector -b -i train.arff -o output_train.arff -c last -r test.arff -s output_test.arff -R ,1,2 -O -C -T -I -N 0 -M 1
    I used the batch filtering command above. I put the training and testing files in the program file/weka3-7/ directory and ran this command, but it replies: unable to load weka.filters.unsupervised.attribute.StringToWordVector

    any help will be appreciated
