# Tagging text with Stanford POS Tagger in Java Applications

I was looking for a way to extract nouns from a set of strings in Java, and a Google search led me to the amazing POS (part-of-speech) tagger from the Stanford NLP (Natural Language Processing) Group.

The library lets you “tag” the words in your string. That is, for each word, the tagger determines whether it is a noun, a verb, etc., and assigns the result to the word. For example:

“This is a sample sentence”

will be output as


This/DT is/VBZ a/DT sample/NN sentence/NN



To do this, the tagger has to load a “trained” file that contains the information it needs to tag the string. This file is called a model and has the extension “.tagger”. The Stanford NLP Group provides several trained models for different languages.

In this post I will show you how to use this library in your Java application using the Eclipse IDE.

1. Create a new project.
2. Create a new folder called “taggers”.
3. Download the zip file provided by the Stanford NLP Group.
4. Extract the zip file and open the extracted folder.
5. You will find a folder called “models”. Open it and copy the model you want, together with its corresponding “.props” file (same name), to the “taggers” folder we created earlier.
6. Now we need to add the library to our project so that Eclipse does not complain when we use it in our code. Right-click your project > Build Path > Configure Build Path.
In the new window, open the Libraries tab (at the top) and click the Add External JARs button.
Locate the “stanford-postagger.jar” file in the extracted folder.

7. Enough configuration; let’s start coding. In your project, create a new class and in its main method write:

    // Initialize the tagger
    MaxentTagger tagger = new MaxentTagger(
            "taggers/left3words-distsim-wsj-0-18.tagger");


The MaxentTagger constructor takes the path to the model (trained file) as a parameter:

“NAME_OF_FOLDER/NAME_OF_MODEL.tagger”.

Once you write the code, Eclipse will prompt you to import MaxentTagger and warn you that the constructor throws some exceptions. Use Eclipse's quick fixes to add the import and the throws clause to the code.
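A relative path such as “taggers/…” is resolved against the directory the program is started from, which in Eclipse is usually the project folder. If the model fails to load, it can help to check where the path actually points before constructing the tagger. The following is a minimal sketch using only the standard library; the class `ModelPathCheck` and its helper methods are hypothetical names for illustration and are not part of the Stanford library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the Stanford library): checks a model path
// before it is handed to the MaxentTagger constructor.
class ModelPathCheck {

    // Resolves a (possibly relative) path against the JVM's working directory.
    static Path resolve(String modelPath) {
        return Paths.get(modelPath).toAbsolutePath().normalize();
    }

    // True only if the path points at an existing, readable regular file.
    static boolean isUsableModel(String modelPath) {
        Path p = resolve(modelPath);
        return Files.isRegularFile(p) && Files.isReadable(p);
    }

    public static void main(String[] args) {
        // Same model path as in the post; pass another path as the first argument.
        String modelPath = args.length > 0
                ? args[0]
                : "taggers/left3words-distsim-wsj-0-18.tagger";

        if (isUsableModel(modelPath)) {
            System.out.println("Model found: " + resolve(modelPath));
        } else {
            System.out.println("Model NOT found, looked at: " + resolve(modelPath));
        }
    }
}
```

If the printed location is not where you copied the model, either move the file or adjust the string passed to the constructor.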

Finally, we tag the string we want:


    // The sample string
    String sample = "This is a sample text";

    // The tagged string
    String tagged = tagger.tagString(sample);

    // Output the result
    System.out.println(tagged);

This will print the tagged string in the same format as the example at the beginning of the post.

Here’s my entire class:

    import java.io.IOException;

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class TagText {
        public static void main(String[] args) throws IOException,
                ClassNotFoundException {

            // Initialize the tagger
            MaxentTagger tagger = new MaxentTagger(
                    "taggers/left3words-distsim-wsj-0-18.tagger");

            // The sample string
            String sample = "This is a sample text";

            // The tagged string
            String tagged = tagger.tagString(sample);

            // Output the result
            System.out.println(tagged);
        }
    }


Finally, we need to know what these abbreviations mean. For example, in this output:


This/DT is/VBZ a/DT sample/NN sentence/NN



What does “NN” or “DT” mean? The tagger uses the Penn Treebank tag set for English, as stated on the library’s homepage; in that tag set, “NN” marks a singular noun and “DT” a determiner. For the full list of abbreviations, see the Penn Treebank tag set documentation. See the included README-Models.txt in the models directory for more information about the tag sets for the other languages.
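Since the original goal was to pull the nouns out of a string, here is a minimal sketch of how the tagged output could be post-processed with plain Java. It assumes the `word/TAG` format shown in this post (the separator is a parameter, because other versions of the tagger may use a different character) and treats every tag starting with “NN” as a noun. The class `TaggedOutputParser` and its methods are hypothetical names, not part of the Stanford library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the Stanford library): parses the
// "word/TAG word/TAG ..." string produced by the tagger.
class TaggedOutputParser {

    // Splits one token at the LAST separator, so a word that itself contains
    // the separator still parses. Returns null for malformed tokens.
    static String[] splitToken(String token, char separator) {
        int sep = token.lastIndexOf(separator);
        if (sep <= 0 || sep == token.length() - 1) {
            return null;
        }
        return new String[] { token.substring(0, sep), token.substring(sep + 1) };
    }

    // Groups words by tag, keeping tags in the order they first appear.
    static Map<String, List<String>> groupByTag(String tagged, char separator) {
        Map<String, List<String>> byTag = new LinkedHashMap<>();
        for (String token : tagged.trim().split("\\s+")) {
            String[] parts = splitToken(token, separator);
            if (parts != null) {
                byTag.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
            }
        }
        return byTag;
    }

    // Returns the words whose tag starts with "NN", in sentence order.
    static List<String> nouns(String tagged, char separator) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            String[] parts = splitToken(token, separator);
            if (parts != null && parts[1].startsWith("NN")) {
                result.add(parts[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(groupByTag(tagged, '/')); // {DT=[This, a], VBZ=[is], NN=[sample, sentence]}
        System.out.println(nouns(tagged, '/'));      // [sample, sentence]
    }
}
```

In a real program, the `tagged` string would come from `tagger.tagString(...)` as in the class above.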

For memory problems (quoting Akash’s comment below):

It turns out that the problem is that Eclipse allocates only 256 MB of memory by default. Right-click the project > Run As > Run Configurations, go to the Arguments tab, and under VM arguments type -Xmx2048m. This sets the maximum allocated memory to 2 GB, and all the tagger files should run now.
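To confirm that the VM argument actually took effect, a small sketch like the following can print the maximum heap size the JVM is willing to use. `Runtime.maxMemory()` is a standard Java call; the class name `HeapCheck` is only for illustration.

```java
// Hypothetical helper: reports the JVM's maximum heap size.
class HeapCheck {

    // Maximum memory the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same run configuration as the tagger; with -Xmx2048m the reported value should be in the neighborhood of 2048 MB, though the exact number depends on the JVM.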


## 96 thoughts on “Tagging text with Stanford POS Tagger in Java Applications”

1. Rahul Kumar

Hi,
When using your GUI example in eclipse, it complains about “MaxentTagger cannot be resolved to a type”. What should I do for it?

2. jothi

How can I extract noun phrases from a text file?

In your example, how can I extract the nouns and determiners separately, i.e.,
NN = sample, sentence
DT =This, a

1. Paul A

I don’t know if this has been answered already but: Check the sample TaggerDemo2.java included in the zip. Also, if you want to understand the tagger more closely, they also included the source. Just see the jar with “src” appended to its filename.

As far as I’ve read, you might need to use the class TaggedWord to access the tags separately.
A snippet from TaggerDemo2.java:

    List<HasWord> sent = Sentence.toWordList("The", "slimy", "slug", "crawled", "over", "the", "long", ",", "green", "grass", ".");
    List<TaggedWord> taggedSent = tagger.tagSentence(sent);
    for (TaggedWord tw : taggedSent) {
        if (tw.tag().startsWith("JJ")) {
            pw.println(tw.word());
        }
    }

1. Jerasak

Hi LSL, thanks for sharing a different system. Now we can take in some good new points and upgrade the worksheet. As mentioned by Reginna about “Balance Check” and “How sure are you?”, I’d like to share my point of view that there is no point achieving our goals at the expense of our relationships, health, family or other people. We don’t live alone in this world, and I think it takes wisdom, which we work on every day, to know how to satisfy our own needs and other people’s. When we understand our priorities and responsibilities, we can learn to dance with these variables and stakeholders, and rise up to the top of our goals, truly feeling proud of ourselves. I guess, ultimately, everyone is seeking some form of balance in life. In fact, going after our goals could be one way we are seeking to balance our lives. So the question is: what priorities are we willing to adjust to achieve the goal, and why?

3. Amin

Hi

I need a sample of training data. When I use a CoNLL file, it says it needs “/” between words.
Please send me a sample so I know what I should prepare!

Thanks,

4. YY

Hi, how do I incorporate the .props file when using Eclipse? It seems like the tagger can run perfectly without the .props file.

5. Sujata Mehta

    Exception in thread "main" java.lang.RuntimeException: 'model' parameter must be specified
        at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:175)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1494)

6. parnika

    Exception in thread "main" java.lang.RuntimeException: 'model' parameter must be specified
        at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:175)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1494)

1. Charli

A couple of points. First, as the website points out: “It is permissible to use previously constructed lexicons, word clusters or other resources provided that they are made available for other participants.” So you can use clusters, but in the spirit of open competition, we ask that these resources be made available.

Second, I agree that taking a domain, running system X on the domain, doing an error analysis and then adding features, changing the model or annotating some more data is a very good way to adapt systems. I don’t think anyone is “scared” of this approach. In fact, outside academia, this is the standard way of doing business, not the exception. However, this is not as easy as it sounds. First, you need the resources (human resources, that is) to do this for every domain on the web or domain you might be interested in. Second, the annotations you wish to collect must be easily created by you or via a system like Mechanical Turk. It is one thing to annotate some short Twitter posts with 12-15 part-of-speech tags and a whole other thing to annotate consumer reviews with syntactic structure. I have tried both. They are not comparable. Even the former cannot be done reliably by Turkers, which means you will need grad students, staff research scientists or costly third-party vendors to do this every time you want to study a new domain.

So the spirit of the competition was to see, from a modeling/algorithm perspective, what the best methods are for training robust syntactic analyzers on the data currently available. By limiting the resources we are trying to make this as much an apples-to-apples comparison as we can. Even this is impossible. Some parsers require lexicons that might have been tuned for specific domains, etc. Understanding this is still valuable in the analyze, annotate and iterate approach. Don’t you want to start off with the best baseline to reduce the amount of human labor required?

8. Umang Sardesai

The post above shows how to set this up in Eclipse; here are the steps to make it work in NetBeans.
1. Create a new project.
2. Right-click your project. Select New Package. Name it “taggers”.
3. Download the zip file provided by the Stanford group (the link is given above in the steps).
4. Copy the “left3words-wsj-0-18.tagger” tagger and its props file from the models folder of the downloaded zip file and paste them into the taggers folder.
5. Now, under the project name, expand Libraries and add a new JAR/folder.
6. Select stanford-postagger.jar from the zip files. This step configures the path for the JAR file.
7. For the rest, implement the sample program given above, with one correction. Instead of MaxentTagger tagger = new MaxentTagger("taggers/left3words-distsim-wsj-0-18.tagger"); write MaxentTagger tagger = new MaxentTagger("taggers/left3words-wsj-0-18.tagger"); The “distsim” has to be removed.

Hope this helps NetBeans users!

9. Jyoti

Hello,
I am in dire need of open source code for implementation of natural language processing in java.
Can anyone give me the link for this.

1. pierce-dong

Actually, Stanford uses Java for their NLP research, so you may find a lot of useful source code there.

10. Tam

Please tell me how I can change verbs into their base form with the Stanford POS Tagger (for example: heard → hear, studied → study).

11. Reena

Hi,

I get the error below when I use the POS tagger in a dynamic web project.
Error: No such trained tagger config file found.
java.io.FileNotFoundException: models\left3words-wsj-0-18.tagger (The system cannot find the path specified)

I have the models folder inside the project folder, with the corresponding tagger file and property file in it.
Kindly help.

12. priya

I’m new to Java and the Stanford POS tagger. I’m getting the following error when I run my code, even though I specified the model file in my code. Can anyone help me solve the problem?

    Exception in thread "main" java.lang.RuntimeException: 'model' parameter must be specified
        at edu.stanford.nlp.tagger.maxent.TaggerConfig.setProperties(TaggerConfig.java:196)
        at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:155)
        at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:128)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1837)

Thanks

13. priya

    import java.io.*;
    import java.util.Scanner;

    public class checkwords {
        // **** you will use a different String literal here!! ****
        private static final String FILE_PATH = "inputp.txt";
        private static final String FILE_PATH1 = "output.txt";

        public static void main(String[] args) throws FileNotFoundException, IOException {

            File file = new File(FILE_PATH);
            File file1 = new File(FILE_PATH1);

            Scanner scan = new Scanner(file);
            Scanner scan1 = new Scanner(file1);
            try {
                // scan through file to make sure that it holds the text
                // we think it does, and that scanner works.
                while (scan.hasNext()) {
                    String line = scan.next();
                    System.out.println(line);

                    while (scan1.hasNext()) {
                        String str = scan1.next();
                        System.out.println(str);

                        if (str.equalsIgnoreCase(line)) {
                            FileWriter fw = new FileWriter("outputnew.txt", true);
                            BufferedWriter bw = new BufferedWriter(fw);
                            bw.write(str);
                            bw.newLine();
                        }
                    }
                }
            } catch (IOException iox) {
                // do stuff with exception
                iox.printStackTrace();
            }
        }
    }

I want to read each word from the input file and check whether it is present in the output file. If it is, the word should be written to “outputnew.txt”. In this code snippet, the BufferedWriter should write to the new file, but I could not figure out the problem.

14. Dilesh Sahoo

Hi, I want to use the model file in MapReduce for tagging comments.
So, for instantiating the tagger:
MaxentTagger tagger = new MaxentTagger("tagger/english-bidirectional-distsim.tagger");

How can I use this? Please let me know.