Python – NLTK – Chunking

With NLTK we can split text into sentences and tag each word with its part of speech. However, we may also want to group words by subject, which is generally a person, place, or thing. Once you know the named entity, you can find the words that modify or affect it (the entity is usually a noun). A text may contain many named entities or nouns: in a single sentence you might be talking about two things and hold a different opinion about each of them.

Most people chunk into noun phrases: a noun together with the modifiers that describe it. The downside of this approach is that a chunk can only contain words that are adjacent to each other.

We will use a combination of part of speech tagging and regular expressions.

Code snippet for chunking:

import nltk

chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

chunkParser = nltk.RegexpParser(chunkGram)

chunked = chunkParser.parse(tagged)  # tagged is a list of (word, tag) pairs, e.g. from nltk.pos_tag

chunked.draw()  # call draw() to open a window showing the chunk tree
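To try the grammar without downloading a tagger model, the sketch below feeds RegexpParser a hand-tagged sentence and lists the words in each chunk instead of drawing the tree. The sentence and its tags are made up for illustration (they are not output from nltk.pos_tag), and the variable names are my own.

```python
import nltk

# Hand-tagged tokens: an assumption for illustration, not real tagger output.
tagged = [
    ("Yesterday", "NN"),
    ("President", "NNP"),
    ("Lincoln", "NNP"),
    ("quickly", "RB"),
    ("visited", "VBD"),
    ("Springfield", "NNP"),
    ("and", "CC"),
    ("slowly", "RB"),
    ("left", "VBD"),
    (".", "."),
]

# Same grammar as in the snippet: optional adverbs and verbs,
# then one or more proper nouns, then an optional common noun.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)
chunked = chunk_parser.parse(tagged)

# Collect the words of every subtree labeled "Chunk".
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
# [['President', 'Lincoln'], ['quickly', 'visited', 'Springfield']]
```

The trailing "slowly left" is not chunked because no proper noun follows it directly, which shows that a chunk only covers tokens that sit next to each other.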

Notes regarding code snippet:

The output is not very user friendly when you just print chunked. Instead, call chunked.draw(), which opens a window showing the chunked words as a tree. NLTK draws this with Tkinter, not matplotlib.

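Most people put an r in front of the triple quotes to make the grammar a raw string, so that backslashes in the regular expression are not treated as escape characters

You put the pattern for the chunk you want to find inside the curly braces; the name before the colon (Chunk) is the label given to each match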

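<RB.?> looks for any adverb tag: RB itself, plus RBR and RBS, because .? allows one optional extra character. <VB.?> does the same for verb tags such as VBD and VBZ

The period matches any single character except a newline in a regular expression

The question mark means zero or one of the preceding item

The asterisk means zero or more, and the plus sign means one or more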

NNP is a singular proper noun

NN is a singular common noun
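To see what these quantifiers do to tag names, here is a small sketch that uses only Python's re module. It mimics the idea that the pattern inside each pair of angle brackets has to match a whole tag. NLTK's own matching differs in its details, and the helper matching_tags and the tag list are assumptions made for illustration.

```python
import re

# A few Penn Treebank tags, chosen for illustration.
tags = ["RB", "RBR", "RBS", "VB", "VBD", "VBZ", "NN", "NNP", "NNS"]


def matching_tags(pattern, candidates):
    """Return the candidate tags that the pattern matches in full."""
    return [tag for tag in candidates if re.fullmatch(pattern, tag)]


print(matching_tags(r"RB.?", tags))  # ['RB', 'RBR', 'RBS']
print(matching_tags(r"VB.?", tags))  # ['VB', 'VBD', 'VBZ']
print(matching_tags(r"NNP", tags))   # ['NNP']
print(matching_tags(r"NN", tags))    # ['NN']
```

Note that a plain NN does not pick up NNP or NNS; you would write NN.? to cover those as well.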