School of Economics and Management
Beihang University
http://yanfei.site

## Raw human-written text $$\Rightarrow$$ Structured information

• The biggest difference between text mining and general data analysis is that text mining works with text data rather than numeric values.
• Text mining is sometimes called 'Natural Language Processing (NLP)', especially in computer science, although the two fields overlap rather than coincide.
• Most text mining methods are based on how frequently words occur in real-world text.

## Concepts in text mining

• Corpus
• a collection of documents (e.g., a collection of different job description documents)
• Word segmentation
• Segment each text into words.
• Stopwords: common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. Examples are "the" and "a". Most search engines filter stopwords out of search queries and documents to save space in their index.
• DocumentTermMatrix
• Each row is a document and each column is a term; each entry is the frequency of that term in that document.
• This is the most basic data structure for text mining.
• TermDocumentMatrix
• The transpose of the DocumentTermMatrix: each row is a term and each column is a document.
• Text clustering
• Group documents according to how similar they are to one another.
• Topic models
• Find the topics that the corpus discusses.
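
The first few concepts above (corpus, word segmentation, stopwords, and the two matrix layouts) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the three-document corpus, the two-word stopword list, and the helper name `segment` are assumptions made up for the example, and whitespace splitting stands in for a real word segmenter.

```python
from collections import Counter

# Toy corpus: hypothetical job-description snippets, invented for illustration.
corpus = [
    "the analyst builds a statistical model",
    "the engineer builds a data pipeline",
    "a statistical model needs data",
]

# A tiny hand-picked stopword list (real lists are much longer).
stopwords = {"the", "a"}


def segment(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


# Word segmentation: one token list per document.
tokens = [segment(doc) for doc in corpus]

# Vocabulary: every distinct term in the corpus, in alphabetical order.
vocabulary = sorted({word for doc in tokens for word in doc})

# Document-term matrix: one row per document, one column per term.
counts = [Counter(doc) for doc in tokens]
dtm = [[c[term] for term in vocabulary] for c in counts]

# Term-document matrix: the transpose (one row per term).
tdm = [list(column) for column in zip(*dtm)]

print(vocabulary)
for row in dtm:
    print(row)
```

Real corpora need a proper tokenizer (and, for languages such as Chinese, a dedicated segmenter), but the resulting matrix has the same shape: documents by terms, or terms by documents after transposing.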

## Latent Semantic Analysis (LSA)

• LSA extracts relationships between documents and terms, assuming that terms that are close in meaning appear in similar (i.e., correlated) pieces of text.
• LSA uses the singular value decomposition (SVD) of a term-document matrix $$A$$ to extract these relationships: $$A = U\Sigma V^T$$.
• The columns of $$U$$ are the eigenvectors of the term-term correlation matrix, $$AA^T$$.
• The columns of $$V$$ are the eigenvectors of the document-document correlation matrix, $$A^TA$$.
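
The factorization and the two eigenvector statements can be checked numerically. The sketch below assumes NumPy is available and uses a small term-document matrix whose counts are invented for illustration; it is a sanity check of the identities, not a full LSA pipeline.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
# The counts are made up purely for illustration.
A = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

# Thin SVD: U is 5x3, s holds the 3 singular values, Vt is V transposed (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A = U Sigma V^T
reconstruction_ok = np.allclose(A, U @ np.diag(s) @ Vt)

# Columns of U are eigenvectors of A A^T, with eigenvalues s**2.
left_ok = np.allclose(A @ A.T @ U, U * s**2)

# Columns of V are eigenvectors of A^T A, with the same eigenvalues.
V = Vt.T
right_ok = np.allclose(A.T @ A @ V, V * s**2)

print(reconstruction_ok, left_ok, right_ok)
```

Both eigenvalue checks use the squared singular values, which is why the singular values rank the directions shared by the term space and the document space.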

## LSA to the Rescue!

• LSA often mitigates the curse of dimensionality in text analytics:
• The matrix factorization effectively combines correlated columns, which can strengthen the signal in the data.
• By keeping only the largest singular values, LSA can dramatically reduce dimensionality.
• SVD is effective and is a staple of text analytics pipelines!
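
The dimensionality reduction described above amounts to truncating the SVD. The following is a minimal sketch, assuming NumPy and a made-up 6-term by 4-document matrix; the choice `k = 2` is arbitrary and would normally be tuned for the corpus at hand.

```python
import numpy as np

# Toy term-document matrix (6 terms x 4 documents); counts are invented.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Each document is now described by k coordinates instead of 6 term counts.
doc_coords = Vt_k.T * s_k          # shape: (n_documents, k)

# Rank-k approximation of the original matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Share of the squared singular values retained by the k components.
retained = (s_k**2).sum() / (s**2).sum()

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(A - A_k)

print(doc_coords.shape, round(float(retained), 3), round(float(error), 3))
```

Downstream steps such as text clustering can then work on `doc_coords` rather than on the full term counts.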