Tokenizing, Stopping and Stemming using Apache Lucene

If you want to tokenize a line of text, remove its stop words and stem what is left, you can do it with Apache Lucene. Here is sample code (written against Lucene 3.6) that takes a line of input and returns the tokenized, stopped and stemmed output.


stop_word_set is a hash set with a list of stop words. I used the list from here.
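A minimal sketch of how such a set could be built, using only the JDK. The class name StopWords, the helper build and the SAMPLE words are made up for illustration; they are not part of Lucene and not the list linked above. Lower-casing the entries is my own precaution: as far as I know, the three-argument StopFilter constructor used in this post matches case-sensitively, so the input would need lower-casing as well for "The" to be dropped.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopWords {

    // Hypothetical sample list; replace with the words from your own stop list.
    static final String[] SAMPLE = {"a", "an", "and", "is", "of", "the", "to"};

    // Builds an unmodifiable set of trimmed, lower-cased words, skipping blanks.
    static Set<String> build(String... words) {
        Set<String> set = new HashSet<String>();
        for (String w : words) {
            String cleaned = w.trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                set.add(cleaned);
            }
        }
        return Collections.unmodifiableSet(set);
    }

    public static void main(String[] args) {
        Set<String> stop_word_set = build(SAMPLE);
        System.out.println(stop_word_set.contains("the"));    // true
        System.out.println(stop_word_set.contains("lucene")); // false
    }
}
```

The resulting Set<String> can be passed wherever the code below expects stop_word_set.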

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

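private static String tokenizeStopStem(String input) {

        // Build the chain: tokenizer -> stop word filter -> Porter stemmer.
        TokenStream tokenStream = new StandardTokenizer(
                Version.LUCENE_36, new StringReader(input));
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stop_word_set);
        tokenStream = new PorterStemFilter(tokenStream);

        StringBuilder sb = new StringBuilder();
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttr = tokenStream.getAttribute(CharTermAttribute.class);
        try {
            // Reset before consuming; Lucene 4.x and later require this call.
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (sb.length() > 0) {
                    sb.append(" ");
                }
                // Append the current (stemmed) term text.
                sb.append(charTermAttr.toString());
            }
            tokenStream.end();
            tokenStream.close();
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        return sb.toString();
}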
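To get a feel for what the three stages do to a line without putting Lucene on the classpath, here is a rough JDK-only sketch. It is not how Lucene does it: splitting on non-letters stands in for StandardTokenizer, and the naiveStem suffix stripper is a crude placeholder, not the Porter algorithm behind PorterStemFilter. The class name, the stop list and the sample sentence are all invented for the demo.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class PipelineSketch {

    // Hypothetical stop list, for the demo only.
    static final Set<String> STOP = new HashSet<String>(
            Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Crude stand-in for a stemmer: strips the first matching suffix,
    // as long as at least three characters remain. NOT the Porter algorithm.
    static String naiveStem(String word) {
        String[] suffixes = {"ing", "ed", "es", "s"};
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    static String tokenizeStopStem(String input) {
        StringBuilder sb = new StringBuilder();
        // Stage 1: tokenize by lower-casing and splitting on non-letter, non-digit runs.
        String[] tokens = input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            // Stage 2: drop empty tokens and stop words.
            if (token.isEmpty() || STOP.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(" ");
            }
            // Stage 3: stem the surviving token.
            sb.append(naiveStem(token));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "dog jump walk park"
        System.out.println(tokenizeStopStem("The dogs jumped and walked to the parks"));
    }
}
```

A real Porter stemmer applies several ordered rule phases, so its output on other words will differ from this toy version.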

6 thoughts on “Tokenizing, Stopping and Stemming using Apache Lucene”

  1. irurumuhammad says:

    Hello Sir,

    That's a good example of source code you have there, thank you.

    But I still get an error with it: when I run it, it stops at the CharTermAttribute part with an IllegalStateException.

    I'm using Lucene 4.6 now. Can you tell me how to fix it? Thanks in advance.
