Natural Language Processing (NLP,自然语言处理)

In short, natural language processing (NLP) is about developing applications and services that can understand human languages. Practical examples include:

  • Search engines like Google, Yahoo, etc.
  • Social website feeds like the Facebook news feed. The news feed algorithm uses natural language processing to infer your interests, and is more likely to show you related ads and posts than other content.
  • Speech engines like Apple Siri.
  • Spam filters like Google sp

NLP tools

  • Natural Language Toolkit

    NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

    Natural Language Processing with Python provides a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Concepts in text processing (文本处理基本概念)

Corpora (语料库)

A corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based. A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving either all the occurrences of a particular word or structure for inspection, or randomly selected samples. Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.


A token is the technical name for a sequence of characters that we want to treat as a group. The vocabulary of a text is the set of distinct tokens that it uses, since in a set all duplicates are collapsed together. In Python we can obtain the vocabulary by passing the list of tokens to set().
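As a small illustration of the difference between tokens and vocabulary, the sketch below splits a made-up sentence on whitespace (a simplifying assumption; real tokenizers also handle punctuation) and builds the vocabulary with set():

```python
# A made-up sentence; splitting on whitespace is a simplification.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# set() collapses duplicates, leaving one entry per distinct token.
vocabulary = set(tokens)

print(len(tokens))         # number of tokens, duplicates included
print(len(vocabulary))     # number of distinct tokens
print(sorted(vocabulary))  # sorted only to get a stable display order
```

The token count is larger than the vocabulary size whenever a word is repeated, as "the" and "sat" are here.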

Stopwords (停词)

Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as "the" and "a". Most search engines filter stopwords out of search queries and documents in order to save space in their index.
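The idea can be sketched in plain Python. The stopword list and the remove_stopwords helper below are hypothetical and deliberately tiny; they are for illustration only and are not the list a real search engine or NLTK would use:

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "is"}

def remove_stopwords(tokens):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

query = "The history of the printing press in Europe".split()
filtered = remove_stopwords(query)
print(filtered)
```

Lowercasing before the lookup is a design choice: it lets "The" at the start of a sentence match the lowercase entry "the".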

Stemming (词根检索)

Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of "cooking" is "cook", and a good stemming algorithm knows that the "ing" suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
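To make the indexing argument concrete, here is a toy suffix stripper. It is not the Porter algorithm or any other real stemmer; the suffix list, the naive_stem name, and the minimum stem length are all assumptions chosen for this sketch:

```python
# Hypothetical suffix list, checked in order. Not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["cooking", "cooked", "cooks", "cook"]
stems = [naive_stem(w) for w in words]

# Four surface forms collapse to a single index entry.
index = set(stems)
print(stems)
print(index)
```

The minimum stem length keeps short words such as "sing" intact. A real stemmer applies many more rules, so this sketch will give poor results on most vocabulary.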

Frequency Counts (频数统计)

A frequency count is the number of hits for a feature. Frequency counts require finding all the occurrences of a particular feature in the corpus, so they are implicit in concordancing. Software is used for this purpose, and the resulting counts can be analyzed statistically.
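A minimal sketch of frequency counting with the standard library, assuming a made-up sentence that is tokenized by whitespace. The positions list shows the link to concordancing: counting a word amounts to finding every place it occurs.

```python
from collections import Counter

# Made-up text; whitespace splitting stands in for a real tokenizer.
text = "the whale swam and the ship followed the whale"
tokens = text.split()

# Count how often each token occurs.
counts = Counter(tokens)
print(counts["the"])
print(counts.most_common(2))

# Finding every occurrence of a word gives its positions as well as its count.
positions = [i for i, token in enumerate(tokens) if token == "whale"]
print(positions)
```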

Word Segmenter (分词)

Word segmentation is the problem of dividing a string of written language into its component words.

In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for can not.)

However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
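One simple dictionary-based approach to segmenting unspaced text is forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below is only an illustration. The mini-dictionary and the forward_max_match name are hypothetical, and practical segmenters rely on large lexicons or statistical models.

```python
# Hypothetical mini-dictionary; a real segmenter needs a far larger lexicon.
DICTIONARY = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text):
    """Greedily split text into the longest dictionary words, left to right."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words

segments = forward_max_match("我喜欢自然语言处理")
print(segments)
```

Because the match is greedy, "自然语言" is preferred over the shorter entries "自然" and "语言". Greedy matching can pick the wrong split for ambiguous strings, which is one reason the problem is considered difficult.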

Part-Of-Speech Tagger (词性标注工具)

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Named Entity Recognizer (命名实体识别工具)

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages.
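As a rough illustration of "locate and classify", the sketch below finds two of the simpler categories, monetary values and percentages, with regular expressions. The patterns, labels, and the find_entities helper are assumptions made for this example. Recognizing persons, organizations, and locations generally needs a trained model and cannot be done this way.

```python
import re

# Hypothetical patterns covering only simple dollar amounts and percentages.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}

def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Sort by position so the result follows the order of the text.
    return [(entity, label) for _, entity, label in sorted(found)]

entities = find_entities("Revenue rose 12.5% to $1,200,000 last year.")
print(entities)
```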

Tokenizing text into sentences (断句)

The sent_tokenize() function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages. So it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.

In [2]:
para = "Python is a widely used general-purpose, high-level programming language. \
        Its design philosophy emphasizes code readability, and its syntax allows programmers \
        to express concepts in fewer lines of code than would be possible in languages such as \
        C++ or Java. The language provides constructs intended to enable clear programs \
        on both a small and large scale."
from nltk.tokenize import sent_tokenize
sent_tokenize(para)
['Python is a widely used general-purpose, high-level programming language.',
 'Its design philosophy emphasizes code readability, and its syntax allows programmers         to express concepts in fewer lines of code than would be possible in languages such as         C++ or Java.',
 'The language provides constructs intended to enable clear programs         on both a small and large scale.']

Tokenizing sentences into words (分词)

In [3]:
from nltk.tokenize import word_tokenize
word_tokenize('Hello World.')
['Hello', 'World', '.']

Tokenizing sentences using regular expressions (使用正则表达式分词)

In [4]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning for \w
tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
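The same pattern can be tried without NLTK. This sketch assumes that collecting every match of the pattern is all that is needed here, which the standard library's re.findall does directly:

```python
import re

# Match runs of word characters and apostrophes, so contractions stay whole.
pattern = re.compile(r"[\w']+")

tokens = pattern.findall("Can't is a contraction.")
print(tokens)
```

The trailing period is dropped because it matches neither \w nor the apostrophe.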

Filtering stopwords in a tokenized sentence (过滤停词)

In [5]:
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
english_stops
{'any', 'few', 'will', 'these', 're', 'that', "doesn't", 'aren', 'herself', 'why', 'those', "haven't", "you've", 'further', 'wasn', 'hadn', 'hers', 'each', 'yours', 'off', 'its', 'into', 'over', 'me', 'mustn', 'weren', "that'll", 'wouldn', 'it', 'doesn', 'myself', 'such', 'needn', 'whom', 'very', "weren't", "you'll", 'after', 'than', "you're", 'couldn', "don't", 'hasn', 'up', 'been', 'he', 'who', 'being', 'and', 'where', 'she', 'only', 'because', 'y', 'do', 's', 'haven', 'or', 'above', 'yourself', 'under', 'don', "hasn't", 'have', 'd', 'but', 'not', "won't", "wouldn't", 'from', 'itself', 'has', "you'd", 'had', 've', 'how', "aren't", 'him', 'with', 'themselves', 'am', 'here', "wasn't", 'himself', 'won', 'which', "she's", 'an', 'so', 'should', 'them', 'is', 'was', 'o', 'some', "shouldn't", 'all', 'other', 'same', 'at', 'having', 't', 'll', 'then', 'you', 'yourselves', 'mightn', 'own', "didn't", 'before', "couldn't", 'are', "mustn't", 'be', 'm', 'her', 'there', 'my', 'didn', 'as', 'when', 'were', "it's", 'both', 'their', 'about', 'most', 'this', 'more', 'can', 'isn', 'your', 'down', 'below', 'in', 'too', 'on', 'of', "mightn't", 'shouldn', 'if', 'the', 'against', "hadn't", 'doing', "isn't", 'until', 'out', 'ourselves', 'our', 'ma', 'now', 'shan', 'what', 'while', 'for', 'through', 'to', 'i', 'his', 'just', "should've", 'a', 'ain', 'nor', 'during', "needn't", 'they', 'theirs', 'between', 'again', 'ours', 'no', 'we', 'by', 'once', 'does', "shan't", 'did'}
In [6]:
words = ["Can't", 'is', 'a', 'contraction']
[word for word in words if word not in english_stops]
["Can't", 'contraction']

Get Synonyms from WordNet (同义词)

WordNet is a lexical database of English that is widely used in natural language processing. It groups words into sets of synonyms (synsets), each with a brief definition and, in many cases, example sentences.

You can get these definitions and examples for a given word like this:

In [8]:
from nltk.corpus import wordnet
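syn = wordnet.synsets("pain")
print(syn[0].definition())  # definition of the first synset
print(syn[0].examples())    # example sentences for the first synset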
a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']

You can use WordNet to get synonymous words like this:

In [9]:
from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']

Get Antonyms from WordNet (反义词)

You can get antonyms the same way: before adding anything to the list, check whether the lemma has any antonyms.

In [10]:
from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
['large', 'big', 'big']

Stemming (词干提取)

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words.

In [7]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmer.stem('cooking')
'cook'

Frequency Counts

In [9]:
from nltk.book import *  # loads the sample texts, including text1
fdist1 = FreqDist(text1)
print(fdist1)
fdist1.most_common(50)
<FreqDist with 19317 samples and 260819 outcomes>
[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]

POS tagger

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word (don't forget to import nltk):

In [5]:
import nltk
from nltk.tokenize import word_tokenize
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]