Big Data Essentials

L13: Text Mining with Spark





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

Concepts in text processing

Corpora

  • A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based.

  • A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or retrieving randomly selected samples.

  • Corpus analysis provides lexical, morphosyntactic, semantic and pragmatic information.

Tokens

  • A token is the technical name for a sequence of characters that we want to treat as a group.

  • The vocabulary of a text is just the set of tokens that it uses, since in a set all duplicates are collapsed together. In Python we can obtain the vocabulary items by applying set() to the list of tokens, as sketched below.
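A minimal sketch in plain Python (the toy sentence is made up for illustration):

# Naive whitespace tokenization; the vocabulary is the set of distinct tokens.
text = "to be or not to be"
tokens = text.split()
vocabulary = set(tokens)
print(tokens)            # ['to', 'be', 'or', 'not', 'to', 'be']
print(len(vocabulary))   # 4 distinct tokens: 'to', 'be', 'or', 'not'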

Stopwords

  • Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing.

  • These are words such as the and a. Most search engines will filter out stopwords from search queries and documents in order to save space in their index.
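A hedged illustration in plain Python; the stopword list below is a tiny made-up subset, not the list any particular search engine or library ships with:

# Filter stopwords out of a tokenized sentence.
stopwords = {"the", "a", "an", "and", "of", "is"}   # illustrative subset only
tokens = ["the", "cat", "sat", "on", "a", "mat"]
content_words = [t for t in tokens if t.lower() not in stopwords]
print(content_words)   # ['cat', 'sat', 'on', 'mat']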

Stemming

  • Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed.

  • Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of index while increasing retrieval accuracy.
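A minimal sketch using NLTK's Porter stemmer (this assumes the third-party nltk package is installed; it is not part of Spark):

# Porter stemming strips common suffixes such as -ing and -s.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("cooking"))   # 'cook'
print(stemmer.stem("books"))     # 'book'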

Frequency Counts

  • A frequency count is the number of hits (occurrences) of a feature.
  • Producing a frequency count requires finding all the occurrences of a particular feature in the corpus.
  • Frequency counting is therefore implicit in concordancing; software is used for this purpose, and the resulting counts can be analysed statistically (a minimal example is sketched below).
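A minimal frequency count in plain Python, using collections.Counter on a toy token list:

# Count how often each token occurs.
from collections import Counter

tokens = ["to", "be", "or", "not", "to", "be"]
freq = Counter(tokens)
print(freq.most_common(2))   # [('to', 2), ('be', 2)]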

Word Segmenter

  • Word segmentation is the problem of dividing a string of written language into its component words.

  • In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for cannot.)

  • However, an equivalent character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages that do not have a trivial word-segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
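A small sketch: whitespace splitting works as a first approximation for English, while a language such as Chinese needs a dedicated segmenter (the jieba package named in the comment is a common third-party choice, mentioned here only as an assumption):

# English: the space character approximates a word delimiter.
sentence = "Hi I heard about Spark"
print(sentence.split())   # ['Hi', 'I', 'heard', 'about', 'Spark']

# Chinese has no spaces between words, so a statistical segmenter is needed, e.g.:
# import jieba                        # third-party package
# print(list(jieba.cut("我喜欢大数据")))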

Part-Of-Speech Tagger

  • In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph.

  • A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
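A hedged sketch with NLTK (assumes the nltk package plus its tokenizer and tagger resources have been downloaded; the exact tags depend on the tagger model):

# Tokenize a sentence and tag each token with its part of speech.
import nltk

tokens = nltk.word_tokenize("Logistic regression models are neat")
print(nltk.pos_tag(tokens))
# Produces (word, tag) pairs, e.g. ('models', 'NNS') and ('are', 'VBP').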

Named Entity Recognizer

  • Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values and percentages.
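A hedged sketch with spaCy (assumes the third-party spacy package and its small English model en_core_web_sm are installed; neither ships with Spark):

# Run spaCy's pre-trained pipeline and print the recognized entities.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typically prints: Apple ORG, U.K. GPE, $1 billion MONEY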

Word embeddings

  1. Frequency-based (e.g. count or TF-IDF vectors)
  2. Prediction-based (e.g. Word2Vec)

Word embeddings

This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):

[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]

(Figure: visualization of the “king” embedding.)

(Figure: word-vector analogy.)
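As a hedged sketch of the analogy idea: with pre-trained embeddings, the classic query king − man + woman is answered by searching for the nearest word vector by cosine similarity (typically queen). The tiny 3-dimensional vectors below are made up purely for illustration; real GloVe vectors have 50 to 300 dimensions.

import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy "embeddings", not real GloVe values.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "man":   np.array([0.70, 0.10, 0.05]),
    "woman": np.array([0.70, 0.10, 0.80]),
    "queen": np.array([0.80, 0.60, 0.85]),
}

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)   # queen (in practice the query words would be excluded from the search)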

Text Feature Extractors

In [2]:
import findspark
findspark.init('/usr/lib/spark-current')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Python Spark with TM").getOrCreate()
In [49]:
from pyspark.ml.feature import Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])
In [50]:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer
Out[50]:
Tokenizer_80e25bab6922
In [51]:
wordsData = tokenizer.transform(sentenceData)
wordsData.show()
+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+

Count vectorizer

Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t,d)$ is the number of times that term t appears in document $d$.

In [53]:
# CountVectorizer can be used to get term frequency vectors
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
model = cv.fit(wordsData)
result = model.transform(wordsData)
result.show(truncate=False)
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|label|sentence                           |words                                     |rawFeatures                                         |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(16,[0,3,7,10,13],[1.0,1.0,1.0,1.0,1.0])            |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(16,[0,1,2,8,9,11,14],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(16,[4,5,6,12,15],[1.0,1.0,1.0,1.0,1.0])            |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+

IDF

  • If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g., “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document.
  • IDF (Inverse document frequency) is a numerical measure of how much information a term provides: $$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},$$ where $|D|$ is the total number of documents in the corpus.
  • Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus.
In [55]:
# We use IDF to rescale the feature vectors
from pyspark.ml.feature import IDF

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)


rescaledData.select("label", "features").show(truncate=False)
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                      |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(16,[0,3,7,10,13],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                          |
|0.0  |(16,[0,1,2,8,9,11,14],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |(16,[4,5,6,12,15],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                           |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

HashingTF

In [52]:
# Alternatively, we can use hashingTF to extract features
from pyspark.ml.feature import HashingTF, IDF
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show(featurizedData.count(), truncate=False)    
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence                           |words                                     |rawFeatures                              |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(20,[0,5,9,17],[1.0,1.0,1.0,2.0])        |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+

Word2Vec

  • Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel.

  • The model maps each word to a unique fixed-size vector.

  • The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

In [8]:
from pyspark.ml.feature import Word2Vec

# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])
In [9]:
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
Text: [Hi, I, heard, about, Spark] => 
Vector: [0.04043976259417832,0.012058253586292268,-0.04951667487621308]

Text: [I, wish, Java, could, use, case, classes] => 
Vector: [-0.020121478608676364,-0.048725567758083344,-0.03504794144204684]

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.049713856726884845,0.06829165453091264,0.03470015302300453]

Remove stop words

In [10]:
from pyspark.ml.feature import StopWordsRemover

sentenceData = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+

$n$-gram

  • An $n$-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.

  • NGram takes as input a sequence of strings (e.g. the output of a Tokenizer).

  • The parameter n is used to determine the number of terms in each $n$-gram.
  • The output will consist of a sequence of $n$-grams where each $n$-gram is represented by a space-delimited string of $n$ consecutive words. If the input sequence contains fewer than $n$ strings, no output is produced.
In [48]:
from pyspark.ml.feature import NGram

wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"]),
    (3, ["I", "like", "regression", "models"]),
], ["id", "words"])

ngram = NGram(n=2, inputCol="words", outputCol="ngrams")

ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
|[I like, like regression, regression models]                      |
+------------------------------------------------------------------+

Topic modelling with LDA

  • LDA (Latent Dirichlet Allocation) is an unsupervised method that models documents and topics using Dirichlet distributions: each document is treated as a distribution over topics, and each topic is modeled as a distribution over words.

  • Therefore, given a collection of documents, LDA outputs a set of topics, with each topic being associated with a set of words.

  • To model the distributions, LDA also requires the number of topics (often denoted by k) as an input. For instance, the following are the topics extracted from a random set of tweets from Canadian users where k = 3:

    • Topic 1: great, day, happy, weekend, tonight, positive experiences
    • Topic 2: food, wine, beer, lunch, delicious, dining
    • Topic 3: home, real estate, house, tips, mortgage, real estate
In [12]:
from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("/opt/apps/ecm/service/spark/2.4.5-hadoop3.1-1.0.2/package/spark-2.4.5-hadoop3.1-1.0.2/data/mllib/sample_lda_libsvm_data.txt")
dataset.head(10)
Out[12]:
[Row(label=0.0, features=SparseVector(11, {0: 1.0, 1: 2.0, 2: 6.0, 4: 2.0, 5: 3.0, 6: 1.0, 7: 1.0, 10: 3.0})),
 Row(label=1.0, features=SparseVector(11, {0: 1.0, 1: 3.0, 3: 1.0, 4: 3.0, 7: 2.0, 10: 1.0})),
 Row(label=2.0, features=SparseVector(11, {0: 1.0, 1: 4.0, 2: 1.0, 5: 4.0, 6: 9.0, 8: 1.0, 9: 2.0})),
 Row(label=3.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 3.0, 10: 9.0})),
 Row(label=4.0, features=SparseVector(11, {0: 3.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 3.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=5.0, features=SparseVector(11, {0: 4.0, 1: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 4.0})),
 Row(label=6.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 2.0, 10: 9.0})),
 Row(label=7.0, features=SparseVector(11, {0: 1.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 2.0, 5: 1.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=8.0, features=SparseVector(11, {0: 4.0, 1: 4.0, 3: 3.0, 4: 4.0, 5: 2.0, 6: 1.0, 7: 3.0})),
 Row(label=9.0, features=SparseVector(11, {0: 2.0, 1: 8.0, 2: 2.0, 4: 3.0, 6: 2.0, 8: 2.0, 9: 7.0, 10: 2.0}))]
In [13]:
# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
In [14]:
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
The lower bound on the log likelihood of the entire corpus: -797.6827874983526
The upper bound on perplexity: 3.06801072114751
In [15]:
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights                                                    |
+-----+-----------+---------------------------------------------------------------+
|0    |[7, 5, 6]  |[0.1021365328082399, 0.09694223822423166, 0.09466798944772135] |
|1    |[0, 10, 8] |[0.10464150943142521, 0.10331051116614971, 0.09665549177356285]|
|2    |[1, 0, 3]  |[0.10212840880470624, 0.10055201895131596, 0.10045007154595613]|
|3    |[10, 3, 6] |[0.2493186996338478, 0.19913457316982602, 0.14749249425469355] |
|4    |[3, 9, 8]  |[0.10905545960705024, 0.10073376600446866, 0.09528055695492436]|
|5    |[6, 9, 2]  |[0.1039600702523239, 0.10344827221025912, 0.09789995178753787] |
|6    |[4, 9, 5]  |[0.10717177600684129, 0.10045733284545605, 0.09794485955196937]|
|7    |[5, 3, 8]  |[0.10744369332435673, 0.104111549656985, 0.09159886235132612]  |
|8    |[1, 4, 7]  |[0.16343370091191292, 0.14735055650519843, 0.11133274242674586]|
|9    |[2, 5, 1]  |[0.18428935702320365, 0.10686639650367562, 0.09588486337546435]|
+-----+-----------+---------------------------------------------------------------+

In [16]:
# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                       |topicDistribution                                                                                                                                                                                                    |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0])      |[0.004681172149622019,0.00468117819091479,0.004681135694970299,0.005739935862359076,0.004681139647508738,0.004681163035836322,0.004681195802725201,0.0046811988040349585,0.00492812552906486,0.9565637552829637]     |
|1.0  |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0])                  |[0.007811273626304089,0.007811346701466485,0.007811242960027281,0.00957734766276134,0.007811265509513229,0.007811206689049855,0.00781128109751437,0.0078112984040495026,0.9276462050215541,0.008097532327759652]     |
|2.0  |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0])             |[0.004069740485454562,0.004069658893835722,0.0040697050383498926,0.9630081173528389,0.004069702412079345,0.00406974198724303,0.004069717726907355,0.004069776983942488,0.0042843192677678825,0.004219519851581043]   |
|3.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0])            |[0.0035992571907178302,0.003599276707503996,0.0035992682713546016,0.9672853316462869,0.0035992683635499363,0.003599272546483701,0.003599266495766857,0.003599262254110511,0.003788685598710634,0.0037311109255152987]|
|4.0  |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0])      |[0.0038998378836786023,0.0038998553942561263,0.003899803769702433,0.9645523683214465,0.003899822505440322,0.0038997916363409093,0.003899891029212763,0.003899855475584653,0.004105845066093223,0.004042928918244333] |
|5.0  |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |[0.0035995247227519982,0.0035994886642426605,0.003599466814932717,0.4419674910361024,0.0035995149928232024,0.0035994746107467393,0.003599550978920746,0.003599554111656215,0.529104393329512,0.0037315407383113684]  |
|6.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0])            |[0.00374343260874588,0.003743454291543376,0.0037434442312325156,0.9659748718375448,0.003743441853212792,0.003743444727613381,0.003743440140793345,0.003743439961638659,0.003940457995780985,0.003880572351894354]    |
|7.0  |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|[0.004254926616580389,0.004254909371345522,0.00425488380254243,0.9613250069912941,0.004254915456616531,0.0042548916475503955,0.004254962578481665,0.00425495739004296,0.0044794590948799414,0.0044110870506661164]   |
|8.0  |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0])             |[0.00425496839016557,0.004254953915974418,0.0042549351461166705,0.005217635249656695,0.004254945196253455,0.004254917606045874,0.004254942673277737,0.004254996332101531,0.9605867274027159,0.004410978087692148]    |
|9.0  |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0])      |[0.0032266185995705058,0.0032266291741003574,0.0032266392566645758,0.48930724451664287,0.0032266232502273532,0.003226642148167788,0.0032266468774465418,0.00322661358315771,0.48476130361577885,0.003345038978243395]|
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0])      |[0.004069548150438102,0.004069509675152526,0.004069543568092783,0.9630104183150475,0.004069558541395517,0.00406955896965846,0.0040695467882306555,0.004069589573650498,0.004283816903229528,0.00421890951510439]     |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0])             |[0.004681370318789852,0.004681298632475264,0.004681258109640178,0.005739283441512521,0.00468130536850138,0.004681292713463788,0.004681366952427264,0.004681391399335244,0.9566382652716945,0.004853167792160142]     |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Lab 8

  • Find an English corpus and load it into Spark
  • Perform text processing on the documents
  • Perform topic modelling on your data
  • Visualize your results with Python