I was looking for a way to extract “nouns” from a set of strings in Java, and a Google search led me to the amazing Stanford NLP (Natural Language Processing) Group POS tagger.
The library lets you “tag” the words in your string. That is, for each word, the “tagger” determines whether it’s a noun, a verb, etc. and then attaches the result to the word. For example:
“This is a sample sentence”
will be output as
This/DT is/VBZ a/DT sample/NN sentence/NN
To do this, the tagger has to load a “trained” file that contains the information it needs to tag the string. This “trained” file is called a model and has the extension “.tagger”. The Stanford NLP Group provides several trained models for different languages.
In this post I will show you how to use this library in your Java application using the Eclipse IDE.
- Create a new project.
- Create a new folder called “taggers”.
- Download the zip file provided by the Stanford NLP Group.
- Extract the zip file and open the extracted folder.
- You will find a folder called “models”. Open it and copy the model you want, together with its corresponding “.props” file (same name), to the “taggers” folder we created earlier.
- Now we need to import the library into our project so that Eclipse does not complain when we use it in our code. So, right click your project > Build Path > Configure Build Path. In the new window, open the Libraries tab (at the top) and click the Add External JARs button. Locate the “stanford-postagger.jar” file that is found in the extracted folder.
- Now enough with the configuration and let’s start coding. In your project create a new class and in its main method write:
// Initialize the tagger
MaxentTagger tagger = new MaxentTagger("taggers/left3words-distsim-wsj-0-18.tagger");
The MaxentTagger constructor takes the path to the model (trained file) as a parameter:
“NAME_OF_FOLDER/NAME_OF_MODEL.tagger”.
Once you write the code, Eclipse will tell you to import MaxentTagger and inform you that the constructor throws some exceptions. Let Eclipse add the import and the throws clause to the code.
Finally, we tag the string we want:
// The sample string
String sample = "This is a sample text";
// The tagged string
String tagged = tagger.tagString(sample);
// Output the result
System.out.println(tagged);
This prints the tagged string in the same word/TAG format as the example at the beginning of the post.
Here’s my entire class
import java.io.IOException;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagText {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Initialize the tagger
        MaxentTagger tagger = new MaxentTagger("taggers/left3words-distsim-wsj-0-18.tagger");
        // The sample string
        String sample = "This is a sample text";
        // The tagged string
        String tagged = tagger.tagString(sample);
        // Output the result
        System.out.println(tagged);
    }
}
Finally, we need to know what these “abbreviations” mean. For example in this output:
This/DT is/VBZ a/DT sample/NN sentence/NN
What does “NN” or “DT” mean? The tagger uses the Penn Treebank tag set for English, as stated on the library’s homepage. For a list of the abbreviations click here. See the README-Models.txt file included in the models directory for more information about the tag sets for the other languages.
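To make the abbreviations concrete, here is a minimal sketch that looks up a handful of Penn Treebank tags and prints a description next to each word. The class name TagGlossary and the explain method are made up for this illustration, the table covers only a few of the tags, and the sketch assumes the default word/TAG output format shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Stanford library.
class TagGlossary {
    // A small subset of the Penn Treebank tag set (the full set is much larger).
    static final Map<String, String> TAGS = new LinkedHashMap<>();
    static {
        TAGS.put("DT", "determiner");
        TAGS.put("NN", "noun, singular or mass");
        TAGS.put("NNS", "noun, plural");
        TAGS.put("VB", "verb, base form");
        TAGS.put("VBZ", "verb, 3rd person singular present");
        TAGS.put("JJ", "adjective");
    }

    // Turns "word/TAG word/TAG ..." into one "word -> description" line per token.
    static String explain(String tagged) {
        StringBuilder sb = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip tokens that are not in word/TAG form
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            sb.append(word).append(" -> ")
              .append(TAGS.getOrDefault(tag, "unknown tag: " + tag))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(explain("This/DT is/VBZ a/DT sample/NN sentence/NN"));
    }
}
```

For the sample sentence this reports “This” and “a” as determiners, “is” as a verb, and “sample” and “sentence” as nouns.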
That was perfect for my application and that’s what I know. =D
Updated:
Click here to download a sample project (for usage with Eclipse). It contains a tagger and a GUI example.
References
http://nlp.stanford.edu/software/tagger.shtml
http://www.englishclub.com/grammar/parts-of-speech_1.htm

hey thanx a lot
You’re welcome
Can you guide me on how to initialize the tagger? When I run the Java application, it is unable to find or open the file when I pass the folder name
along with the .tagger file name.
I am passing “models/left3words-distsim-wsj-0-18.tagger”, as the file is under the models folder.
Should this folder be in the same workspace as my project?
FYI, I am working in a Windows environment.
yeah the folder should be inside your project’s folder. To do this from Eclipse, Right click the project’s name and choose to create a new folder.
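A quick way to debug this kind of problem is to print where Java actually looks for the file. The sketch below is only an illustration: the class name ModelPathCheck and its methods are invented here, and it assumes the model is passed as a relative path, which Java resolves against the working directory (by default the project folder when you run from Eclipse).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; it does not touch the Stanford library.
class ModelPathCheck {
    // Resolves a (possibly relative) path against the current working directory.
    static Path resolve(String modelPath) {
        return Paths.get(modelPath).toAbsolutePath().normalize();
    }

    // True only if the path points to an existing, readable regular file.
    static boolean isReadableFile(String modelPath) {
        Path path = resolve(modelPath);
        return Files.isRegularFile(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        // Model path from the post; pass a different one as the first argument.
        String modelPath = args.length > 0
                ? args[0]
                : "taggers/left3words-distsim-wsj-0-18.tagger";
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Looking for model: " + resolve(modelPath));
        System.out.println("Found: " + isReadableFile(modelPath));
    }
}
```

If “Found” is false, compare the printed path with where the “taggers” (or “models”) folder really is on disk.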
Thank you for sharing the knowledge, Galal. Really helped a lot.
I can not unzip the file when I download it over Windows…
Use 7-zip whenever unpacking .tar files, I had problems using WinRAR.
thanks a lot for this post….
How can I create a model file? I want to train this software.
I didn’t try training a model =D You can check their mailing list or something.
It worked perfectly
… I would like to know how to extract verbs and nouns alone from the tagged text ???
Yes, I have the same question, is it possible?
This should be done with normal Java String manipulation. Since we know the tags, it won’t be a problem to extract the verbs.
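As a rough sketch of that string manipulation: split the tagged text on whitespace, split each token at its last “/”, and keep the words whose tag starts with “NN” (the Penn Treebank noun tags NN, NNS, NNP, NNPS) or “VB” (the verb tags VB, VBD, VBG, VBN, VBP, VBZ). The class and method names are invented for this example, and it assumes the default word/TAG format with “/” as the separator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; works on the tagger's output string only.
class TagFilter {
    // Returns the words whose tag starts with the given prefix, in order.
    static List<String> wordsWithTagPrefix(String tagged, String prefix) {
        List<String> words = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/'); // last slash, in case the word contains one
            if (slash <= 0) {
                continue; // not a word/TAG token
            }
            if (token.substring(slash + 1).startsWith(prefix)) {
                words.add(token.substring(0, slash));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println("Nouns: " + wordsWithTagPrefix(tagged, "NN")); // Nouns: [sample, sentence]
        System.out.println("Verbs: " + wordsWithTagPrefix(tagged, "VB")); // Verbs: [is]
    }
}
```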
Thanks! Very useful and concise!
thanks for the tutorial. Do you have another tutorial for using lemmatizer in stanford core NLP?
Actually I did not use it, but I guess it’s pretty much the same. Download the library, add it to the Eclipse external libraries and start using it.
Hey thank you very much. This neat little tutorial saved me a lot of time!!
hey.. m getting this following error.. pls help asap.
“Loading default properties from trained tagger taggers/left3words-distsim-wsj-0-18.tagger
Error: No such trained tagger config file found.
java.io.IOException: Unable to resolve "taggers/left3words-distsim-wsj-0-18.tagger" as either class path, filename or URL
at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:331)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:724)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:186)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:131)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
at TagText.main(TagText.java:11)
java.io.IOException: Unable to resolve "taggers/left3words-distsim-wsj-0-18.tagger" as either class path, filename or URL”
Are you sure you created the taggers folder in the right place and copied the tagger file to it?
thanx aton ..
Hey, I got an error while using this code.
The error is: “usage: Relation treebank numberRanges”
Can you please guide me what should I do ?
Thanks for your clear explanation.
It helped me.
Whenever I am running this program, it just gives this output.:
dep: []
pred: [Root (S|SINV <# VP=target )]
aux: [Root (VP < VP < /^(?:TO|MD|VB.*|AUXG?|POS)$/=target ), Root (SQ|SINV < (/^(?:VB|MD|AUX)/=target $++ /^(?:VP|ADJP)/ )), Root (CONJP < TO=target < VB ), Root (SINV < (VP=target < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ )$-- (VP < VBG )))]
auxpass: [Root (VP < (/^(?:VB|AUX|POS)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|become|becomes|became|becoming|get|got|getting|gets|gotten|remains|remained|remain)$/ )< (VP|ADJP [< VBN|VBD |< (VP|ADJP < VBN|VBD )< CC ])), Root (SQ|SINV < (/^(?:VB|AUX|POS)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ $++ (VP < /^VB[DN]$/ ))), Root (SINV < (VP=target < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ )$– (VP < /^VB[DN]$/ ))), Root (SINV < (VP=target < (VP < (/^(?:VB|AUX|POS)/ < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai)$/ ))$– (VP < /^VB[DN]$/ )))]
cop: [Root (VP < (/^(?:VB|AUX)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|stay|stays|stayed|remain|remains|remained|resemble|resembles|resembled|resembling|become|becomes|became|becoming)$/ [$++ (/^(?:ADJP|NP$|WHNP$)/ !< VBN|VBD ) |$++ (S <: (ADJP < JJ ))])), Root (SQ|SINV < (/^(?:VB|AUX)/=target < /^(?i:am|is|are|be|being|'s|'re|'m|was|were|been|s|ai|seem|seems|seemed|seeming|appear|appears|appeared|stay|stays|stayed|remain|remains|remained|resemble|resembles|resembled|resembling|become|becomes|became|becoming)$/ [$++ (ADJP !< VBN|VBD ) |$++ (NP $++ NP ) |$++ (S <: (ADJP < JJ ))]))]
conj: [Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $-- !/^(?:``|-LRB-|PRN|PP|ADVP|RB)/ $+ !/^(?:PRN|``|''|-[LR]RB-|,|:|\.)$/=target )), Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $– !/^(?:“|-LRB-|PRN|PP|ADVP|RB)/ $+ (ADVP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/=target ))), Root (VP|S|SBAR|SBARQ|SINV|SQ < (CC|CONJP $– !/^(?:“|-LRB-|PRN|PP|ADVP|RB)/ )< (/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/ $+ /^S$|^(?:A|N|V|PP|PRP|J|W|R)/=target )), Root (/^(?:ADJP|JJP|PP|QP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/=target )), Root (/^(?:ADJP|PP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ $+ (ADVP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/=target ))), Root (/^(?:ADJP|PP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC|CONJP $– !/^(?:“|-LRB-|PRN)$/ )< (/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/ $+ /^S$|^(?:A|N|V|PP|PRP|J|W|R)/=target )), Root (NX|NML < (CC|CONJP $- __ )< (/^,$/ $- /^(?:A|N|V|PP|PRP|J|W|R|S)/=target )), Root (/^(?:VP|S|SBAR|SBARQ|ADJP|PP|QP|(?:WH)?NP(?:-TMP|-ADV)?|ADVP|UCP|NX|NML)$/ < (CC $++ (CC|CONJP $+ !/^(?:PRN|“|''|-[LR]RB-|,|:|\.)$/=target )))]
cc: [Root (/^(?:S|VP|(?:WH)?NP(?:-TMP|-ADV)?|QP|ADJP|PP|ADVP|UCP|NX|SBAR|SBARQ|SINV|SQ|JJP|NML|CONJP)/ [< (CC=target !< /^(?i:either|neither|both)$/ ) |< (CONJP=target !< (RB < /^(?i:not)$/ $+ (RB|JJ < /^(?i:only|just|merely)$/ )))])]
punct: [Root (__ < /^(?:\.|:|,|''|``|-LRB-|-RRB-)$/=target )]
arg: []
subj: [] ….
more like this!!
Can you please possibly say where is it going wrong?
By the way, FYI, I use a Macbook…MAC OS X 10.4.11!
Actually I didn’t get that error while working on my project so I do not know the answer. Try contacting them on the mailing list.
If we have to make our own tagger file, then what should we do? Can you guide us?
No I haven’t tried it. Sorry
Thanks! It is working when I am using it this way but I am getting an error when I am trying to use this part of the code in a different applet program where the input text to be tagged is taken from the text box of the applet!! The error shown is as follows:
Error: No such trained tagger config file found.
And also IO exception is shown.
Kindly help me if you can.
Here is an archive. Import it using Eclipse. It contains a frame with an input box and a button. When you click the button the text in the inputbox is tagged and re-displayed. Repeat step 6 in this post if necessary.
http://bachelor.galalaly.me/gui-example.zip
Hope it helps.
Hey,
Thanks for the tutorial, really helped me get started. Just wondering, does the POS Tagger help with identifying phrases (not just words alone)?
Thank you very much for this nice presentation
Hi Galal,
Thanks for your nice presentation.
Could you please let me know how can I use POS tagger using NetBeans? It will be very helpful for me. Waiting for your reply.
Thanks again
Shuvo
I will try to post a tutorial soon isA.
How can I get the output in TSV format? I am trying everything, but I am not able to get it in TSV format. :(
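One possible workaround, sketched here under the assumption that you already have the tagged string in the default word/TAG format, is to convert it yourself: write one line per token with the word and the tag separated by a tab. The class name TaggedToTsv and the method toTsv are invented for this example and are not part of the Stanford library.

```java
// Hypothetical helper for illustration; converts "word/TAG ..." text to TSV.
class TaggedToTsv {
    // Produces one "word<TAB>tag" line per token of the tagged string.
    static String toTsv(String tagged) {
        StringBuilder sb = new StringBuilder();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip anything that is not word/TAG
            }
            sb.append(token, 0, slash)          // the word
              .append('\t')
              .append(token.substring(slash + 1)) // the tag
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toTsv("This/DT is/VBZ a/DT sample/NN sentence/NN"));
    }
}
```

The returned string can be written to a .tsv file as-is.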
I have the same interest as Shuvo on how to use POS with NetBeans.
Thank you so much for this tutorial
Short and effective
We have this project to convert English sentences to first-order logic form. Your code to tag text with POS tags helped us. Can you please help us with the Java code to convert these tagged sentences into FOL (first-order logic) form? A little guidance would matter a lot.
We are working on our final year project on the concept of OPINION MINING, so we are in need of a POS tagger, but we don’t know how to use it. Your instructions are simple and understandable, but we still got some errors when implementing your code.
Here they are:
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Loading default properties from trained tagger taggers/left3words-distsim-wsj-0-18.tagger
Error: No such trained tagger config file found.
java.io.FileNotFoundException: taggers\left3words-distsim-wsj-0-18.tagger (The system cannot find the file specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:736)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:184)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
at TagText.main(TagText.java:9)
Exception in thread "main" java.io.FileNotFoundException: taggers\left3words-distsim-wsj-0-18.tagger (The system cannot find the file specified)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(TaggerConfig.java:736)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:667)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:280)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:240)
at TagText.main(TagText.java:9)
I got the same error, but I don’t know how to solve it.
Please help me ASAP!!
Can you please provide me with your folder structure?
Hey I got the output..
Thanks man
Hey can we extract back the data/corpus from the model file?? Thanks
Thanks a lot!
I wanted to try an Arabic example. The model left3words-wsj-0-18.tagger cannot handle Arabic, so I tried an Arabic model, but errors were still generated:
Loading default properties from trained tagger sources/arabic-fast.tagger
Reading POS tagger model from sources/arabic-fast.tagger … Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3209)
at java.lang.String.<init>(String.java:215)
at java.io.DataInputStream.readUTF(DataInputStream.java:644)
at java.io.DataInputStream.readUTF(DataInputStream.java:547)
at edu.stanford.nlp.tagger.maxent.FeatureKey.read(FeatureKey.java:79)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:758)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:702)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:286)
at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:244)
at taging.Main.main(Main.java:45)
Java Result: 1
BUILD SUCCESSFUL (total time: 8 seconds)
Are you sure that you’re using an Arabic model? As far as I remember the left3words-wsj-0-18.tagger is an English one.
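The “Java heap space” message in that trace means the JVM ran out of heap memory while reading the model, so it may also be worth checking how much heap your program is allowed to use. The sketch below only prints the current limit; the class name HeapInfo is invented for this example, and how much memory a given model needs is not something I measured.

```java
// Hypothetical helper for illustration; prints the JVM's maximum heap size.
class HeapInfo {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If the limit looks small, you can raise it with the standard -Xmx JVM option (for example -Xmx1g) in the VM arguments of your IDE's run configuration or on the java command line.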
Dear friends,
I have a question about using the POS Tagger. I’m doing my PhD research and I need to extract N+N combinations from texts. I’ve got a general idea of how to use the program, but what I need to know is how many nouns the program has in its dictionary. I need my project data to be as accurate as possible. Will it be able to find 99% of them in the text?
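On the extraction side of this question, noun + noun combinations can be pulled out of the tagged string by looking for two adjacent tokens whose tags both start with “NN”. This is only a sketch under that assumption: the class name NounPairs and the method nounNounPairs are invented for this example, it expects the default word/TAG format, and it says nothing about how accurate the tagger itself is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; finds adjacent noun + noun pairs.
class NounPairs {
    // Returns "noun noun" strings for every two neighbouring tokens tagged NN*.
    static List<String> nounNounPairs(String tagged) {
        List<String> pairs = new ArrayList<>();
        String previousNoun = null; // previous token's word if it was a noun, else null
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                previousNoun = null; // malformed token breaks the sequence
                continue;
            }
            String word = token.substring(0, slash);
            boolean isNoun = token.substring(slash + 1).startsWith("NN");
            if (isNoun && previousNoun != null) {
                pairs.add(previousNoun + " " + word);
            }
            previousNoun = isNoun ? word : null;
        }
        return pairs;
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ a/DT sample/NN sentence/NN";
        System.out.println(nounNounPairs(tagged)); // [sample sentence]
    }
}
```

A run of three nouns yields two overlapping pairs, which may or may not be what a given study wants.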
Hey thank you very much for the tutorial. It got my project started very fast.
Keep up the good work!
thanks a lot!!!
Thanks Galal! You really helped me a lot!