With NLTK we can split text into sentences and tag each word with its part of speech. However, we may also want to group words by subject, which is generally a person, place, or thing. Once you know the named entity (usually a noun), you can find the words that modify or affect it. A text may contain many named entities or nouns: a single sentence might talk about two things, and you might have an opinion about each of them.
Most people chunk into noun phrases: a noun together with the modifiers that describe it. The downside of this approach is that a chunk can only contain words that are adjacent to each other.
We will use a combination of part-of-speech tagging and regular expressions.
Code snippet for chunking:
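import nltk

# tagged is the part-of-speech-tagged text: a list of (word, tag) pairs
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)   # plain-text view of the chunk tree
chunked.draw()   # opens a window showing the tree as a diagram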
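To try the same grammar without a tagger, here is a minimal sketch that feeds it a hand-tagged sentence. The sentence and its tags are made up for illustration (they are not tagger output), and the variable names are just a choice. Instead of drawing the tree, it walks the subtrees labeled Chunk and joins their words, which is easier to read in a terminal.

```python
import nltk

# Hand-tagged tokens, assumed for illustration so that no tagger model is needed.
tagged = [
    ("President", "NNP"),
    ("George", "NNP"),
    ("Bush", "NNP"),
    ("quickly", "RB"),
    ("thanked", "VBD"),
    ("Congress", "NNP"),
    ("yesterday", "NN"),
    (".", "."),
]

# Same grammar as in the snippet above.
chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunk_parser = nltk.RegexpParser(chunk_gram)
chunked = chunk_parser.parse(tagged)

# Keep only the subtrees labeled "Chunk" and join the words in each one.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
```

Each chunk is made of neighboring words only, so the two proper-noun groups in this sentence come out as separate chunks.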
Notes regarding code snippet:
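The output is not very user friendly when you just print chunked. Instead, call the tree's draw() method, i.e. chunked.draw(), which opens a separate window showing the chunked words as a tree diagram
Most people put an r in front of the three quotes to make the grammar a raw string, so that backslashes in the regular expression are not treated as escape characters
You put the chunk pattern you want to find inside the curly braces, with each part-of-speech tag inside angle brackets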
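<RB.?> looks for any adverb tag: RB followed by zero or one more character, which covers RB, RBR and RBS
In a regular expression, the period matches any character except a newline
The question mark means zero or one of the preceding item
The asterisk means zero or more of the preceding item
The plus sign means one or more of the preceding item
NNP is a singular proper noun
NN is a singular common noun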
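The quantifiers in the notes above can be checked with Python's standard re module alone. This is only a rough sketch: it treats a tag sequence as a plain string of tags in angle brackets, which is an assumption made for illustration. NLTK translates tag patterns into regular expressions itself before matching. The sample tags and sequences are made up.

```python
import re

# "RB.?" means RB followed by zero or one extra character.
adverb_tag = re.compile(r"RB.?")
tags = ["RB", "RBR", "RBS", "RBXX", "VB"]
adverb_matches = [tag for tag in tags if adverb_tag.fullmatch(tag)]
print(adverb_matches)

# "*" is zero or more, "+" is one or more, "?" is zero or one.
sequence_pattern = re.compile(r"(<RB.?>)*(<NNP>)+(<NN>)?")
sequences = ["<NNP>", "<RB><RBR><NNP><NNP><NN>", "<RB>"]
results = {seq: bool(sequence_pattern.fullmatch(seq)) for seq in sequences}
for seq, matched in results.items():
    print(seq, matched)
```

The last sequence fails because the pattern requires at least one NNP, while the adverbs and the trailing NN are optional.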