Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site
A corpus is a large collection of texts: a body of written or spoken material upon which a linguistic analysis is based.
A corpus provides grammarians, lexicographers, and other interested parties with better descriptions of a language. Computer-processable corpora allow linguists to adopt the principle of total accountability, retrieving all the occurrences of a particular word or structure for inspection, or working from randomly selected samples.
Corpus analysis provides lexical, morphosyntactic, semantic, and pragmatic information.
A token is the technical name for a sequence of characters that we want to treat as a group.
The vocabulary of a text is just the set of tokens that it uses, since in a set all duplicates are collapsed together. In Python we can obtain the vocabulary items with the built-in set() function.
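A minimal sketch (the sample sentence here is made up for illustration):
text = "the quick brown fox jumps over the lazy dog the end"
tokens = text.split()                 # naive whitespace tokenization
vocabulary = set(tokens)              # duplicates ("the") are collapsed
print(len(tokens), len(vocabulary))   # 11 tokens, 9 vocabulary items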
Stopwords are common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing.
These are words such as "the" and "a". Most search engines filter out stopwords from search queries and documents in order to save space in their index.
Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook , and a good stemming algorithm knows that the ing suffix can be removed.
Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of its index while increasing retrieval accuracy.
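As a quick illustration outside Spark, a sketch assuming the NLTK package is installed:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("cooking"))   # 'cook'
print(stemmer.stem("cookery"))   # 'cookeri' -- a stem need not be a real word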
Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter). (Some examples where the space character alone may not be sufficient include contractions like can't for cannot.)
However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages without a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.
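For Chinese, a dedicated segmenter is required; a minimal sketch assuming the third-party jieba package is installed:
import jieba
print(jieba.lcut("我来到北京清华大学"))   # ['我', '来到', '北京', '清华大学']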
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.
A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
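A minimal sketch with NLTK, assuming the package and its tagger models are available (the example sentence shows why context matters):
import nltk
nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")   # one-time model downloads
tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(tokens))
# the first "refuse" is tagged as a verb (VBP), the second as a noun (NN)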
This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]
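Embeddings are useful because geometric closeness tracks semantic similarity. A minimal NumPy sketch of cosine similarity (the 3-dimensional vectors below are made-up stand-ins, not real GloVe values):
import numpy as np

def cosine(u, v):
    # cosine similarity: 1 means same direction, 0 means orthogonal
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

king = np.array([0.50, 0.69, -0.60])
queen = np.array([0.45, 0.71, -0.55])
apple = np.array([-0.90, 0.10, 0.80])
print(cosine(king, queen))   # close to 1
print(cosine(king, apple))   # much smaller (here negative)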
import findspark
findspark.init('/usr/lib/spark-current')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Python Spark with TM").getOrCreate()
from pyspark.ml.feature import Tokenizer
sentenceData = spark.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer
Tokenizer_80e25bab6922
wordsData = tokenizer.transform(sentenceData)
wordsData.show()
+-----+--------------------+--------------------+
|label|            sentence|               words|
+-----+--------------------+--------------------+
|  0.0|Hi I heard about ...|[hi, i, heard, ab...|
|  0.0|I wish Java could...|[i, wish, java, c...|
|  1.0|Logistic regressi...|[logistic, regres...|
+-----+--------------------+--------------------+
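Tokenizer simply lowercases the text and splits on whitespace. When punctuation matters, Spark's RegexTokenizer gives more control; a sketch (the pattern "\\W" splits on non-word characters):
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
regexTokenizer.transform(sentenceData).show()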
Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t,d)$ is the number of times that term $t$ appears in document $d$.
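Document frequency $DF(t,D)$ is the number of documents in which term $t$ appears. Spark ML's IDF estimator uses a smoothed inverse document frequency, giving the TF-IDF weight
$$IDF(t,D)=\log\frac{|D|+1}{DF(t,D)+1}, \qquad TFIDF(t,d,D)=TF(t,d)\cdot IDF(t,D).$$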
# CountVectorizer can be used to get term frequency vectors
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol="words", outputCol="rawFeatures")
model = cv.fit(wordsData)
result = model.transform(wordsData)
result.show(truncate=False)
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|label|sentence                           |words                                     |rawFeatures                                         |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(16,[0,3,7,10,13],[1.0,1.0,1.0,1.0,1.0])            |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(16,[0,1,2,8,9,11,14],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(16,[4,5,6,12,15],[1.0,1.0,1.0,1.0,1.0])            |
+-----+-----------------------------------+------------------------------------------+----------------------------------------------------+
# We use IDF to rescale the feature vectors
from pyspark.ml.feature import IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(result)
rescaledData = idfModel.transform(result)
rescaledData.select("label", "features").show(truncate=False)
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                      |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(16,[0,3,7,10,13],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                          |
|0.0  |(16,[0,1,2,8,9,11,14],[0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])|
|1.0  |(16,[4,5,6,12,15],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453])                                          |
+-----+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
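These weights match the formula above: "i" appears in 2 of the 3 documents, so its IDF is $\log(4/3)\approx 0.2877$, while terms occurring in a single document get $\log(4/2)\approx 0.6931$.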
# Alternatively, we can use HashingTF to extract features
from pyspark.ml.feature import HashingTF, IDF
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show(featurizedData.count(), truncate=False)
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|label|sentence                           |words                                     |rawFeatures                              |
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
|0.0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |(20,[0,5,9,17],[1.0,1.0,1.0,2.0])        |
|0.0  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|(20,[2,7,9,13,15],[1.0,1.0,3.0,1.0,1.0]) |
|1.0  |Logistic regression models are neat|[logistic, regression, models, are, neat] |(20,[4,6,13,15,18],[1.0,1.0,1.0,1.0,1.0])|
+-----+-----------------------------------+------------------------------------------+-----------------------------------------+
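With only numFeatures=20 buckets, hash collisions are visible: the first document has 5 distinct words but only 4 active indices, one of which has count 2.0. In practice a much larger numFeatures (the default is $2^{18}$) makes collisions rare.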
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel.
The model maps each word to a unique fixed-size vector.
The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
from pyspark.ml.feature import Word2Vec
# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
("Hi I heard about Spark".split(" "), ),
("I wish Java could use case classes".split(" "), ),
("Logistic regression models are neat".split(" "), )
], ["text"])
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
result = model.transform(documentDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
Text: [Hi, I, heard, about, Spark] =>
Vector: [0.04043976259417832,0.012058253586292268,-0.04951667487621308]

Text: [I, wish, Java, could, use, case, classes] =>
Vector: [-0.020121478608676364,-0.048725567758083344,-0.03504794144204684]

Text: [Logistic, regression, models, are, neat] =>
Vector: [-0.049713856726884845,0.06829165453091264,0.03470015302300453]
from pyspark.ml.feature import StopWordsRemover
sentenceData = spark.createDataFrame([
(0, ["I", "saw", "the", "red", "balloon"]),
(1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show(truncate=False)
+---+----------------------------+--------------------+
|id |raw                         |filtered            |
+---+----------------------------+--------------------+
|0  |[I, saw, the, red, balloon] |[saw, red, balloon] |
|1  |[Mary, had, a, little, lamb]|[Mary, little, lamb]|
+---+----------------------------+--------------------+
An $n$-gram is a sequence of $n$ tokens (typically words) for some integer $n$. The NGram class can be used to transform input features into $n$-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer).
The parameter $n$ determines the number of terms in each $n$-gram.
from pyspark.ml.feature import NGram
wordDataFrame = spark.createDataFrame([
(0, ["Hi", "I", "heard", "about", "Spark"]),
(1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
(2, ["Logistic", "regression", "models", "are", "neat"]),
(3, ["I", "like", "regression", "models"]),
], ["id", "words"])
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(truncate=False)
+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
|[I like, like regression, regression models]                      |
+------------------------------------------------------------------+
LDA is an unsupervised method that models documents and topics based on the Dirichlet distribution, wherein each document is treated as a distribution over topics and each topic as a distribution over words.
Therefore, given a collection of documents, LDA outputs a set of topics, with each topic being associated with a set of words.
To model the distributions, LDA also requires the number of topics (often denoted by $k$) as an input.
from pyspark.ml.clustering import LDA
# Loads data.
dataset = spark.read.format("libsvm").load("/opt/apps/ecm/service/spark/2.4.5-hadoop3.1-1.0.2/package/spark-2.4.5-hadoop3.1-1.0.2/data/mllib/sample_lda_libsvm_data.txt")
dataset.head(10)
[Row(label=0.0, features=SparseVector(11, {0: 1.0, 1: 2.0, 2: 6.0, 4: 2.0, 5: 3.0, 6: 1.0, 7: 1.0, 10: 3.0})),
 Row(label=1.0, features=SparseVector(11, {0: 1.0, 1: 3.0, 3: 1.0, 4: 3.0, 7: 2.0, 10: 1.0})),
 Row(label=2.0, features=SparseVector(11, {0: 1.0, 1: 4.0, 2: 1.0, 5: 4.0, 6: 9.0, 8: 1.0, 9: 2.0})),
 Row(label=3.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 3.0, 10: 9.0})),
 Row(label=4.0, features=SparseVector(11, {0: 3.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 3.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=5.0, features=SparseVector(11, {0: 4.0, 1: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 1.0, 7: 1.0, 8: 1.0, 9: 4.0})),
 Row(label=6.0, features=SparseVector(11, {0: 2.0, 1: 1.0, 3: 3.0, 6: 5.0, 8: 2.0, 9: 2.0, 10: 9.0})),
 Row(label=7.0, features=SparseVector(11, {0: 1.0, 1: 1.0, 2: 1.0, 3: 9.0, 4: 2.0, 5: 1.0, 6: 2.0, 9: 1.0, 10: 3.0})),
 Row(label=8.0, features=SparseVector(11, {0: 4.0, 1: 4.0, 3: 3.0, 4: 4.0, 5: 2.0, 6: 1.0, 7: 3.0})),
 Row(label=9.0, features=SparseVector(11, {0: 2.0, 1: 8.0, 2: 2.0, 4: 3.0, 6: 2.0, 8: 2.0, 9: 7.0, 10: 2.0}))]
# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))
The lower bound on the log likelihood of the entire corpus: -797.6827874983526
The upper bound on perplexity: 3.06801072114751
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights                                                    |
+-----+-----------+---------------------------------------------------------------+
|0    |[7, 5, 6]  |[0.1021365328082399, 0.09694223822423166, 0.09466798944772135] |
|1    |[0, 10, 8] |[0.10464150943142521, 0.10331051116614971, 0.09665549177356285]|
|2    |[1, 0, 3]  |[0.10212840880470624, 0.10055201895131596, 0.10045007154595613]|
|3    |[10, 3, 6] |[0.2493186996338478, 0.19913457316982602, 0.14749249425469355] |
|4    |[3, 9, 8]  |[0.10905545960705024, 0.10073376600446866, 0.09528055695492436]|
|5    |[6, 9, 2]  |[0.1039600702523239, 0.10344827221025912, 0.09789995178753787] |
|6    |[4, 9, 5]  |[0.10717177600684129, 0.10045733284545605, 0.09794485955196937]|
|7    |[5, 3, 8]  |[0.10744369332435673, 0.104111549656985, 0.09159886235132612]  |
|8    |[1, 4, 7]  |[0.16343370091191292, 0.14735055650519843, 0.11133274242674586]|
|9    |[2, 5, 1]  |[0.18428935702320365, 0.10686639650367562, 0.09588486337546435]|
+-----+-----------+---------------------------------------------------------------+
# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                       |topicDistribution                                                                                                                                                                                                    |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0  |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0])      |[0.004681172149622019,0.00468117819091479,0.004681135694970299,0.005739935862359076,0.004681139647508738,0.004681163035836322,0.004681195802725201,0.0046811988040349585,0.00492812552906486,0.9565637552829637]     |
|1.0  |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0])                  |[0.007811273626304089,0.007811346701466485,0.007811242960027281,0.00957734766276134,0.007811265509513229,0.007811206689049855,0.00781128109751437,0.0078112984040495026,0.9276462050215541,0.008097532327759652]     |
|2.0  |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0])             |[0.004069740485454562,0.004069658893835722,0.0040697050383498926,0.9630081173528389,0.004069702412079345,0.00406974198724303,0.004069717726907355,0.004069776983942488,0.0042843192677678825,0.004219519851581043]   |
|3.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0])            |[0.0035992571907178302,0.003599276707503996,0.0035992682713546016,0.9672853316462869,0.0035992683635499363,0.003599272546483701,0.003599266495766857,0.003599262254110511,0.003788685598710634,0.0037311109255152987]|
|4.0  |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0])      |[0.0038998378836786023,0.0038998553942561263,0.003899803769702433,0.9645523683214465,0.003899822505440322,0.0038997916363409093,0.003899891029212763,0.003899855475584653,0.004105845066093223,0.004042928918244333] |
|5.0  |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |[0.0035995247227519982,0.0035994886642426605,0.003599466814932717,0.4419674910361024,0.0035995149928232024,0.0035994746107467393,0.003599550978920746,0.003599554111656215,0.529104393329512,0.0037315407383113684]  |
|6.0  |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0])            |[0.00374343260874588,0.003743454291543376,0.0037434442312325156,0.9659748718375448,0.003743441853212792,0.003743444727613381,0.003743440140793345,0.003743439961638659,0.003940457995780985,0.003880572351894354]    |
|7.0  |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|[0.004254926616580389,0.004254909371345522,0.00425488380254243,0.9613250069912941,0.004254915456616531,0.0042548916475503955,0.004254962578481665,0.00425495739004296,0.0044794590948799414,0.0044110870506661164]   |
|8.0  |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0])             |[0.00425496839016557,0.004254953915974418,0.0042549351461166705,0.005217635249656695,0.004254945196253455,0.004254917606045874,0.004254942673277737,0.004254996332101531,0.9605867274027159,0.004410978087692148]    |
|9.0  |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0])      |[0.0032266185995705058,0.0032266291741003574,0.0032266392566645758,0.48930724451664287,0.0032266232502273532,0.003226642148167788,0.0032266468774465418,0.00322661358315771,0.48476130361577885,0.003345038978243395]|
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0])      |[0.004069548150438102,0.004069509675152526,0.004069543568092783,0.9630104183150475,0.004069558541395517,0.00406955896965846,0.0040695467882306555,0.004069589573650498,0.004283816903229528,0.00421890951510439]     |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0])             |[0.004681370318789852,0.004681298632475264,0.004681258109640178,0.005739283441512521,0.00468130536850138,0.004681292713463788,0.004681366952427264,0.004681391399335244,0.9566382652716945,0.004853167792160142]     |
+-----+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+