Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site
A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based.
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or working from randomly selected samples.
Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
A token is the technical name for a sequence of characters that we want to treat as a group.
The vocabulary of a text is just the set of tokens that it uses, since in a set all duplicates are collapsed together. In Python we can obtain the vocabulary items with the built-in set() function.
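A minimal sketch (the sample sentence here is made up for illustration):
text = "the quick brown fox jumps over the lazy dog the end"
tokens = text.split()                 # naive whitespace tokenization
vocabulary = set(tokens)              # duplicates ("the") are collapsed
print(len(tokens), len(vocabulary))   # 11 tokens, 9 vocabulary items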
Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing.
These are words such as "the" and "a". Most search engines filter out stopwords from search queries and documents in order to save space in their index.
Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook , and a good stemming algorithm knows that the ing suffix can be removed.
Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of its index while increasing retrieval accuracy.
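As a quick illustration outside Spark, a sketch assuming the NLTK package is installed:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("cooking"))   # 'cook'
print(stemmer.stem("cookery"))   # 'cookeri' -- a stem need not be a real word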
Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for cannot.)
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages without a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
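For Chinese, a dedicated segmenter is required; a minimal sketch assuming the third-party jieba package is installed:
import jieba
print(jieba.lcut("我来到北京清华大学"))   # ['我', '来到', '北京', '清华大学']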
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.
A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
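A minimal sketch with NLTK, assuming the package and its tagger models are available (the example sentence shows why context matters):
import nltk
nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")   # one-time model downloads
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# the first "refuse" is tagged as a verb (VBP), the second as a noun (NN)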
This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]
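Embeddings are useful because geometric closeness tracks semantic similarity. A minimal NumPy sketch of cosine similarity (the 3-dimensional vectors below are made-up stand-ins, not real GloVe values):
import numpy as np

def cosine(u, v):
    # cosine similarity: 1 means same direction, 0 means orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.50, 0.69, -0.60])
queen = np.array([0.45, 0.71, -0.55])
apple = np.array([-0.90, 0.10, 0.80])
print(cosine(king, queen))   # close to 1
print(cosine(king, apple))   # much smaller (here negative)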
import findspark
findspark.init('/usr/lib/spark-current')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Python Spark with TM").getOrCreate()
from pyspark.ml.feature import Tokenizer
sentenceData = spark.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer
Tokenizer_80e25bab6922
wordsData = tokenizer.transform(sentenceData)
wordsData.show()
+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+
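Tokenizer simply lowercases the text and splits on whitespace. When punctuation matters, Spark's RegexTokenizer gives more control; a sketch (the pattern "\\W" splits on non-word characters):
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
regexTokenizer.transform(sentenceData).show()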
Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t,d)$ is the number of times that term $t$ appears in document $d$.
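Document frequency $DF(t,D)$ is the number of documents in which term $t$ appears. Spark ML's IDF estimator uses a smoothed inverse document frequency, giving the TF-IDF weight
$$IDF(t,D)=\log\frac{|D|+1}{DF(t,D)+1}, \qquad TFIDF(t,d,D)=TF(t,d)\cdot IDF(t,D).$$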
# CountVectorizer can be used to get term frequency vectors
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
model = cv.fit(wordsData)
result = model.transform(wordsData)
result.show(truncate=False)
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|label|sentence                           |words                                     |rawFeatures                                         |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(16,[0,3,7,10,13],[1.0,1.0,1.0,1.0,1.0])            |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(16,[0,1,2,8,9,11,14],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(16,[4,5,6,12,15],[1.0,1.0,1.0,1.0,1.0])            |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
# We use IDF to rescale the feature vectors
from pyspark.ml.feature import IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)
rescaledData.select("label", "features").show(truncate=False)
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                      |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(16,[0,3,7,10,13],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                          |
|0.0  |(16,[0,1,2,8,9,11,14],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |(16,[4,5,6,12,15],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                          |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
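These weights match the formula above: "i" appears in 2 of the 3 documents, so its IDF is $\log(4/3)\approx 0.2877$, while terms occurring in a single document get $\log(4/2)\approx 0.6931$.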
# Alternatively, we can use HashingTF to extract features
from pyspark.ml.feature import HashingTF, IDF
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show(featurizedData.count(), truncate=False)
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence                           |words                                     |rawFeatures                              |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(20,[0,5,9,17],[1.0,1.0,1.0,2.0])        |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
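With only numFeatures=20 buckets, hash collisions are visible: the first document has 5 distinct words but only 4 active indices, one of which has count 2.0. In practice a much larger numFeatures (the default is $2^{18}$) makes collisions rare.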
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel.
The model maps each word to a unique fixed-size vector.
The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
from pyspark.ml.feature import Word2Vec
# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
("Hi I heard about Spark".split(" "), ),
("I wish Java could use case classes".split(" "), ),
("Logistic regression models are neat".split(" "), )
], ["text"])
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
Text: [Hi, I, heard, about, Spark] =>
Vector: [0.04043976259417832,0.012058253586292268,-0.04951667487621308]

Text: [I, wish, Java, could, use, case, classes] =>
Vector: [-0.020121478608676364,-0.048725567758083344,-0.03504794144204684]

Text: [Logistic, regression, models, are, neat] =>
Vector: [-0.049713856726884845,0.06829165453091264,0.03470015302300453]
from pyspark.ml.feature import StopWordsRemover
sentenceData = spark.createDataFrame([
(0, ["I", "saw", "the", "red", "balloon"]),
(1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+
An $n$-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer).
The parameter $n$ determines the number of terms in each $n$-gram.
from pyspark.ml.feature import NGram
wordDataFrame = spark.createDataFrame([
(0, ["Hi", "I", "heard", "about", "Spark"]),
(1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
(2, ["Logistic", "regression", "models", "are", "neat"]),
(3, ["I", "like", "regression", "models"]),
], ["id", "words"])
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
|[I like, like regression, regression models]                      |
+------------------------------------------------------------------+
LDA is an unsupervised method that models documents and topics based on the Dirichlet distribution, wherein each document is treated as a distribution over topics and each topic as a distribution over words.
Therefore, given a collection of documents, LDA outputs a set of topics, with each topic being associated with a set of words.
To model the distributions, LDA also requires the number of topics (often denoted by $k$) as an input.
from pyspark.ml.clustering import LDA
# Loads data.
dataset = spark.read.format("libsvm").load("/opt/apps/ecm/service/spark/2.4.5-hadoop3.1-1.0.2/package/spark-2.4.5-hadoop3.1-1.0.2/data/mllib/sample_lda_libsvm_data.txt")
dataset.head(10)
[Row(label=0.0, features=SparseVector(11, {0: 1.0, 1: 2.0, 2: 6.0, 4: 2.0, 5: 3.0, 6: 1.0, 7: 1.0, 10: 3.0})),
 Row(label=1.0, features=SparseVector(11, {0: 1.0, 1: 3.0, 3: 1.0, 4: 3.0, 7: 2.0, 10: 1.0})),
 Row(label=2.0, features=SparseVector(11, {0: 1.0, 1: 4.0, 2: 1.0, 5: 4.0, 6: 9.0, 8: 1.0, 9: 2.0})),
 Row(label=3.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 3.0, 10: 9.0})),
 Row(label=4.0, features=SparseVector(11, {0: 3.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 3.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=5.0, features=SparseVector(11, {0: 4.0, 1: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 4.0})),
 Row(label=6.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 2.0, 10: 9.0})),
 Row(label=7.0, features=SparseVector(11, {0: 1.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 2.0, 5: 1.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=8.0, features=SparseVector(11, {0: 4.0, 1: 4.0, 3: 3.0, 4: 4.0, 5: 2.0, 6: 1.0, 7: 3.0})),
 Row(label=9.0, features=SparseVector(11, {0: 2.0, 1: 8.0, 2: 2.0, 4: 3.0, 6: 2.0, 8: 2.0, 9: 7.0, 10: 2.0}))]
# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
The lower bound on the log likelihood of the entire corpus: -797.6827874983526
The upper bound on perplexity: 3.06801072114751
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights                                                    |
+-----+-----------+---------------------------------------------------------------+
|0    |[7, 5, 6]  |[0.1021365328082399, 0.09694223822423166, 0.09466798944772135] |
|1    |[0, 10, 8] |[0.10464150943142521, 0.10331051116614971, 0.09665549177356285]|
|2    |[1, 0, 3]  |[0.10212840880470624, 0.10055201895131596, 0.10045007154595613]|
|3    |[10, 3, 6] |[0.2493186996338478, 0.19913457316982602, 0.14749249425469355] |
|4    |[3, 9, 8]  |[0.10905545960705024, 0.10073376600446866, 0.09528055695492436]|
|5    |[6, 9, 2]  |[0.1039600702523239, 0.10344827221025912, 0.09789995178753787] |
|6    |[4, 9, 5]  |[0.10717177600684129, 0.10045733284545605, 0.09794485955196937]|
|7    |[5, 3, 8]  |[0.10744369332435673, 0.104111549656985, 0.09159886235132612]  |
|8    |[1, 4, 7]  |[0.16343370091191292, 0.14735055650519843, 0.11133274242674586]|
|9    |[2, 5, 1]  |[0.18428935702320365, 0.10686639650367562, 0.09588486337546435]|
+-----+-----------+---------------------------------------------------------------+
# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                       |topicDistribution                                                                                                                                                                                                    |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0])      |[0.004681172149622019,0.00468117819091479,0.004681135694970299,0.005739935862359076,0.004681139647508738,0.004681163035836322,0.004681195802725201,0.0046811988040349585,0.00492812552906486,0.9565637552829637]     |
|1.0  |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0])                  |[0.007811273626304089,0.007811346701466485,0.007811242960027281,0.00957734766276134,0.007811265509513229,0.007811206689049855,0.00781128109751437,0.0078112984040495026,0.9276462050215541,0.008097532327759652]     |
|2.0  |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0])             |[0.004069740485454562,0.004069658893835722,0.0040697050383498926,0.9630081173528389,0.004069702412079345,0.00406974198724303,0.004069717726907355,0.004069776983942488,0.0042843192677678825,0.004219519851581043]   |
|3.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0])            |[0.0035992571907178302,0.003599276707503996,0.0035992682713546016,0.9672853316462869,0.0035992683635499363,0.003599272546483701,0.003599266495766857,0.003599262254110511,0.003788685598710634,0.0037311109255152987]|
|4.0  |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0])      |[0.0038998378836786023,0.0038998553942561263,0.003899803769702433,0.9645523683214465,0.003899822505440322,0.0038997916363409093,0.003899891029212763,0.003899855475584653,0.004105845066093223,0.004042928918244333] |
|5.0  |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |[0.0035995247227519982,0.0035994886642426605,0.003599466814932717,0.4419674910361024,0.0035995149928232024,0.0035994746107467393,0.003599550978920746,0.003599554111656215,0.529104393329512,0.0037315407383113684]  |
|6.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0])            |[0.00374343260874588,0.003743454291543376,0.0037434442312325156,0.9659748718375448,0.003743441853212792,0.003743444727613381,0.003743440140793345,0.003743439961638659,0.003940457995780985,0.003880572351894354]    |
|7.0  |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|[0.004254926616580389,0.004254909371345522,0.00425488380254243,0.9613250069912941,0.004254915456616531,0.0042548916475503955,0.004254962578481665,0.00425495739004296,0.0044794590948799414,0.0044110870506661164]   |
|8.0  |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0])             |[0.00425496839016557,0.004254953915974418,0.0042549351461166705,0.005217635249656695,0.004254945196253455,0.004254917606045874,0.004254942673277737,0.004254996332101531,0.9605867274027159,0.004410978087692148]    |
|9.0  |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0])      |[0.0032266185995705058,0.0032266291741003574,0.0032266392566645758,0.48930724451664287,0.0032266232502273532,0.003226642148167788,0.0032266468774465418,0.00322661358315771,0.48476130361577885,0.003345038978243395]|
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0])      |[0.004069548150438102,0.004069509675152526,0.004069543568092783,0.9630104183150475,0.004069558541395517,0.00406955896965846,0.0040695467882306555,0.004069589573650498,0.004283816903229528,0.00421890951510439]     |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0])             |[0.004681370318789852,0.004681298632475264,0.004681258109640178,0.005739283441512521,0.00468130536850138,0.004681292713463788,0.004681366952427264,0.004681391399335244,0.9566382652716945,0.004853167792160142]     |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+