Data Processing and Statistical Analysis

School of Economics and Management
Beihang University
http://yanfei.site

Objectives

Text processing in R
Web scraping in R
Text mining in R

Packages required in this lecture

install.packages(c("stringr", "rvest", "knitr", "jiebaR", "wordcloud2", "tm", 
"slam" , "proxy", "topicmodels", "RColorBrewer"))

Text Processing

Read and write text data into R

## read text data into R
## WMTnews.txt can be found on my Github.
wmt.news <- readLines('WMTnews.txt')
## You can also read from the file I put online. It takes roughly 2 mins to read.
## wmt.news <- readLines("https://yanfei.site/docs/dpsa/WMTnews.txt",
##                       encoding = 'UTF-8') 
length(wmt.news)

## [1] 449

## print the first news article without quotes
noquote(wmt.news[1])

## [1] 沃尔玛在中国强推综合工时制，引发多地门店员工罢工由于不满沃尔玛近期在中国推行的“综合工时制”改革，从7月1日开始，沃尔玛多地门店的基层员工发起罢工。据江西南昌当地媒体的报道，近日南昌沃尔玛八一广场店成了“闹市”，这里的沃尔玛员工正在集体罢工，工装背后贴着的A4纸上写着“沃尔玛员工站起来，抵制综合工时制度，反对欺骗，坚决维权。”据悉，之所以会发生这样的事，是因为沃尔玛要实行新的薪酬制度。员工称，他们本来是与沃尔玛签订了长期劳动合同，现在沃尔玛要求更改合同，本来的月薪制更改成小时制，并强制让员工签字。在大家看来，用工合同的更改，意味着他们的保障得到了根本性的改变。不仅如此，这些员工称，新的合同也变相提高了基本工资，从而规避城市基本工资上调政策，变相降低总体工资。除南昌外，成都、重庆、深圳、哈尔滨的个别商场员工也组织了罢工，以抗议这一次的“综合工时制”改革。由沃尔玛员工自发组织的中国员工联谊会介绍，此前沃尔玛在中国一直以来采用的是标准工时制，全职工每天工作8小时，每周工作5天，每周40小时；但沃尔玛自今年5月开始在中国各地分店推行综合计算工时工作制。新规则下，全职工每天工作4-11小时，每周工作3-6天，每周20-66小时，每月平均标准工时174小时、加班工时不超过36小时。沃尔玛员工认为，新规则可能导致工作时间安排不稳定，而且单方面实施新工时制度在程序上是违法的。知情人士告诉澎湃新闻，沃尔玛在美国实行的就是综合工时制，在中国推行这一改革也是为了与总部统一标准。在中国，综合工时制符合法律要求。《劳动法》共规定三种工时计算标准，即标准工时制、综合工时制和不定时工时制。标准工时制和综合工时制的区别在于，标准工时制以“天”为计算单位，而综合工时制以“周、月、年”为计算单位。也就是说，标准工时制是按照社会上最常见的8小时工作制，每周不超过40小时的标准来计算员工的工作量。而综合工时制是在每周不超过40小时的工作总量下，灵活分配每天的工作时间，工作长度。只是，从标准工时制改为综合工时制，还需要得到中国各地相关劳动主管部门的批准，而各地的要求也不尽相同。有的地方主管部门直接批准即可执行；有的地方主管部门则要求获得绝大部分员工的同意才可推行新政。因此，沃尔玛一些城市门店的基层员工需要签字同意这一改革。在这一过程中，部分员工对新政有各种各样的担忧，罢工事件由此爆发，沃尔玛在中国的“综合工时制”改革遭遇强大的挑战。据江西当地电视台7月3日报道，南昌市总工会的领导已经要求沃尔玛华中地区负责人向沃尔玛中国总部反映，恢复标准工时制。截止澎湃新闻发稿之时，尚未获得沃尔玛中国总部针对这一事件的回应。业内分析，沃尔玛在中国推行“综合工时制”改革的目的还是为了降低人力成本。沃尔玛这两年来一直在“做减法”，减去他们认为不利于管控、不利于标准化、不利于规模化、不利于降低成本的任何环节、商品、配置等。这种做法的好处在于，沃尔玛进一步加强中央管控，门店更加“听话”，并且可以节省成本，在利润上有直接体现。但这种做法也存在门店的本地化、个性化日益下降，商品竞争力逐渐下滑的弊端。（来源：澎湃新闻）进入【新浪财经股吧】讨论

## write text data into R
cat(wmt.news, file = "WMTnews.txt", sep = "\n")

You may use other functions like read.table(), write.table() etc.

Length of each article

nchar()
stringr::str_length()

## number of characters in each news article
nchar(wmt.news)

##   [1] 1308 1005 1066 2886  440  270 2313  452 3099  683 3119 3140  397 2781
##  [15]  419  460 2839 2519 2934  572 1181  156 1723 3301 2245 2401 2872 2849
##  [29] 1226 2048 2324 3439 3182 1055 1698 1881  673 2877 1719  254 2342  770
##  [43] 1325  771 1923  497 3145  139 2096  323  251 1652  495  301 1227  419
##  [57] 1720 3219  736 3245 2602  722 1345  748  524 1537  924   62   54  632
##  [71] 3143 3120  966 1790  962  940 2636 1497  964  250  218  338 1677  346
##  [85]  366 1679  841 2011  866 1314  393  664 1708 1320 1977 1592  285  833
##  [99]  731  336 1882 3241 2270 1251 1455  215 2264  345  344  955  782  267
## [113]  411    0  683  158  285 1155  168  382 1759 5446  292  651 3927  578
## [127]  607  144  142  540  867 1136 1874  654  539  141   42 1033  229  140
## [141] 1246  892 1442  863 2293  289 2583  142  564  773  694  121  122 1958
## [155] 1251 1284  928 2195  493 1334 2537 1724 1229  728 1552  548 3479  763
## [169] 1907    0  789  997  626  797  855  560  882 1166   48  552   52  729
## [183]  248  701  624  632  751  356  542  346 1054 2617  237 3197  582  827
## [197] 1366  491 1016  538  956 3938  963 1188    0 2352 1176 1247 2533 1249
## [211] 2039 1426  463  456 1042  128  551 1589  296  468 3950  198  855  870
## [225]  254  660  207  599  362  430  442  285  843  735 5897  308  149  300
## [239]  598  592  399  835 2299  866  840  211  379  397 1415  456  945  439
## [253] 1485  995  242 2253  239  399  879 1630  690  825  740  201  300   71
## [267]   52 1246  653  998  554 1623 1134 1138 1066  360  302  709  828  159
## [281]  598  170  420  432 3448  513  292 3305  136 1883 2184  794  534  782
## [295] 1919 1527 1562 1638  811    0  931  576 1168 1218 2130  798  291  465
## [309]  720 4068  563 1806   90   25 1014   72 1496  468    0 1571 1769 2775
## [323]  972 3515 1898  181 1263  376   92 1903  138    0  410 2128  465  575
## [337] 1740  583  856  879 1214  398 1084 1114 2146  341 2159  480 1952  469
## [351] 1127  113   95 2922  132  872 5488  205  137  145  706  469  470 1095
## [365]  231 1093 1068  950  650  205  887  380   25 1297 1642  821 3251  830
## [379] 2028    0  522  146  347  560  598  662  400  218 1965 1627 1551  235
## [393]  182  777  534 1048 7061  414  167 2986 1214  425 1107 1138 1229 1186
## [407] 1216    0  727 1758  773 2073  588  864  299  415    0 3256  741 3691
## [421]  403 1979 2050    0    0  661 1458 1231 1445  814  799 2477 1920 1493
## [435]  166  554  114  711 1456 2669    0 1183 1123  209 1527  115  297 1425
## [449]  576

## library(stringr); str_length(wmt.news)

Concatenate strings

Concatenate two strings

paste()
stringr::str_c()

## concatenate characters 
paste('2015', '06-04', sep = '-')

## [1] "2015-06-04"

paste('2015', c('06-04', '06-05'), sep = '-')

## [1] "2015-06-04" "2015-06-05"

paste('2015', c('06-04', '06-05'), sep = '-', collapse = ' ')

## [1] "2015-06-04 2015-06-05"

## str_c() in stringr
library(stringr)
str_c('2015', '06-04', '00:00', sep = '-')

## [1] "2015-06-04-00:00"

## frequently used in web scraping
paste('http://search.sina.com.cn/?q=%BE%A9%B6%AB&c=news&from=index&page=', 
      '1', sep = '')

## [1] "http://search.sina.com.cn/?q=%BE%A9%B6%AB&c=news&from=index&page=1"

Combine text and variable values

sprintf() is a superior choice over paste.

## combine text and variable values
kw <- '大数据'
start <- 15
url <- sprintf('https://book.douban.com/subject_search?search_text=%s&cat=1001&start=%d', kw, start)
URLencode(url)

## [1] "https://book.douban.com/subject_search?search_text=%E5%A4%A7%E6%95%B0%E6%8D%AE&cat=1001&start=15"

Split strings

strsplit()
stringr::str_split()

## split characters
dates <- c('2015-06-04', '2015-06-05')
strsplit(dates, "-")

## [[1]]
## [1] "2015" "06"   "04"  
## 
## [[2]]
## [1] "2015" "06"   "05"

strsplit('2015-06-04', '-')

## [[1]]
## [1] "2015" "06"   "04"

## another way
library(stringr)
str_split(dates, '-')

## [[1]]
## [1] "2015" "06"   "04"  
## 
## [[2]]
## [1] "2015" "06"   "05"

str_split('2015-06-04', '-')

## [[1]]
## [1] "2015" "06"   "04"

Pattern Matching

## search for matches
mySentences <- c('沃尔玛还与微信跨界合作，顾客可通过沃尔玛微信服务号的付款功能在实体门店秒付买单。', 
                 '沃尔玛移动支付应用已经部署在其全美4,600家超市中。')
grep('沃尔玛', mySentences)

## [1] 1 2

grepl('沃尔玛', mySentences)

## [1] TRUE TRUE

library(stringr); str_detect(mySentences, '沃尔玛')

## [1] TRUE TRUE

regexpr('沃尔玛', mySentences)

## [1] 1 1
## attr(,"match.length")
## [1] 3 3

gregexpr('沃尔玛', mySentences)

## [[1]]
## [1]  1 18
## attr(,"match.length")
## [1] 3 3
## 
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 3

Pattern Replacement

## replace white spaces 
messySentences <- c('沃尔玛还与微信   跨界合作，顾客可通过沃尔玛微信服务号的付款功能在实体门店
                    秒付买单。', 
                    '沃尔玛移动支付应 用已经部  署在其全美4,600家超市中。')

## patten replacement
## sub(pattern, replacement, x, ...)
sub(' ', '', messySentences)

## [1] "沃尔玛还与微信  跨界合作，顾客可通过沃尔玛微信服务号的付款功能在实体门店\n                    秒付买单。"
## [2] "沃尔玛移动支付应用已经部  署在其全美4,600家超市中。"

## gsub(pattern, replacement, x, ...)
gsub(' ', '', messySentences)

## [1] "沃尔玛还与微信跨界合作，顾客可通过沃尔玛微信服务号的付款功能在实体门店\n秒付买单。"
## [2] "沃尔玛移动支付应用已经部署在其全美4,600家超市中。"

Extract substrings in a character vector

## extract substrings: substr(x, start, stop)
x <- c('月薪：5000元', '月薪：8000元')
substr(x,4,7)

## [1] "5000" "8000"

Your turn

Load text from the https://yanfei.site/docs/dpsa/BABAnews.txt and print it on screen. Text file contains some of the news of Alibaba.
How many paragraphs are there in the article?
Trim leading whitespaces of each paragraph (try ??trim).
How many characters are there in the article?
Collapse paragraphs into one and display it on the screen (un-list it).
Does the text contain word ‘技术架构’?
Split the article into sentences (by periods).
Replace ‘双11’ with ‘双十一’.

Reference and further reading

Please see Text processing on Wiki for more details, examples, R packages and R functions used for text processing in R.

Web Scraping

We need data

Data is messy!

But we want a tidy format of data!
- rows == observations
- columns == attributes

Movie	Score	Length (mins)	Language
爱乐之城	8.4	128	English
看不见的客人	8.7	106	Spanish
…	…	…	…

Get familiar with the structure of a html (tags)

When we do web scraping, we deal with html tags to find the path of the information we want to extract.
A simple html source code: tree structure of html tags. HTML tags normally come in pairs.

<!DOCTYPE html>
<html>
  <title> My title
  </title>
  <body>
    <h1> My first heading </h1>
      <p> My first paragraph </p>
  </body>
</html>

<!DOCTYPE html> : HTML documents must start with a type declaration
HTML document is contained between <html> and </html>
The visible part of the HTML document is between <body> and </body>
HTML headings are defined with the <h1> to <h6> tags
HTML paragraphs are defined with the <p> tag

Work with other useful tags

HTML links are defined with the <a> tag

<a href="http://www.test.com">This is a link for test.com</a>

HTML tables are defined with <table>, row as <tr> and rows are divided into data as <td>

<table style="width:100%">
  <tr>
    <td> 中文名称 </td>
    <td> 英文名称 </td>
    <td> 简称     </td>
  </tr>
  <tr>
    <td> 北京航空航天大学 </td>
    <td> Beihang University </td>
    <td> 北航 </td>
  </tr>   
</table>

HTML list starts with <ul> (unordered) and <ol> (ordered). Each item of list starts with <li>

<ol>
  <li> 科技获奖 </li>
  <li> 服务国家战略 </li>
  <li> 标志性成果 </li>
</ol>

You can try https://html-online.com/editor/ to learn more about html.

XPath

Path used to select nodes and info in html. Here are some examples of XPath expressions and their meanings. XPath is one of the most crucial step to do web scraping.

Example 1

<!DOCTYPE html>
<html>
  <title> My title
  </title>
  <body>
    <h1> My first heading </h1>
      <p> My first paragraph </p>
  </body>
</html>

/html/title: selects the <title> element of an HTML document
//p: selects all the <p> elements

Example 2

<html> 
  <head>
    <base href='http://example.com/' />
    <title>Example website</title> 
  </head>
  <body>
    <div id='images', class='img'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg'/></a> 
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg'/></a> 
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg'/></a> 
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg'/></a> 
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg'/></a>
    </div> 
    <div>
      <a href='img.html'> text <img src='img.jpg'/></a>
    </div>
  </body>
</html>

//div[@id="images"]: selects all the <div> elements which contain an attribute id="images". Note its difference with //div
- //div[@class="img"]
- //body/div[1]
- Xpath is not unique.
- attributes like ‘class’ and ‘id’ are mostly used.
//div[@id="images"]/a/: selects all the <a> elements inside the aforementioned element.

Your turn

<td class="zwmc" style="width: 250px;">
  <div style="width: 224px;*width: 218px; _width:200px; float: left">
    <a style="font-weight: bold">金融分析师</a>
  </div>
</td>

Q: Please extract the <a> element from the source above.
A1: //td[@class="zwmc"]/div/a
A2: //td[@class="zwmc"]//a

Where to go?

Scrape news data from http://search.sina.com.cn related to ‘京东’.

Inspect a web page (easily found in Chrome).
Find the xpath for the elements you want to extract. You can also find xpath from viewing the whole page source.

Your turn

Can you find xpath for news titles, abstracts and links?

Scrapping a webpage using rvest package

Parse the entire website: read_html().
Find and extract the pieces of the website you need using XPath: html_nodes(). It pull out the entire node.
The following are done after using html_nodes() to extract content we need.
- html_table(): extract all data inside a html table.
- html_text(): extract all text within the node.
- html_attr(): extract contents of a single attribute.
- html_attrs(): extract all attributes.
Cleanup

Go back to previous examples

library(rvest)
web <- read_html('<!DOCTYPE html>
<html>
  <title> My title
  </title>
  <body>
    <h1> My first heading </h1>
      <p> My first paragraph </p>
  </body>
</html>')
title_node <- html_nodes(web, xpath = '//title')
title_node

## {xml_nodeset (1)}
## [1] <title> My title\n  </title>

html_text(title_node)

## [1] " My title\n  "

str_trim(html_text(title_node))

## [1] "My title"

Let’s do it!

Now we want to scrape data from a html table

url <- "http://www.bls.gov/web/empsit/cesbmart.htm"
web <- read_html(url)
table1 <-
  html_nodes(web, xpath = '//*[@id="Table1"]')
employdata <- html_table(table1, fill = TRUE)
library(knitr)
kable(head(employdata[[1]]), format = "html")

2018	Levels	Levels	Levels	Over-the-month Changes	Over-the-month Changes	Over-the-month Changes
2018	As Previously Published	As Revised	Difference	As Previously Published	As Revised	Difference
January	147,801	147,767	-34	176	171	-5
February	148,125	148,097	-28	324	330	6
March	148,280	148,279	-1	155	182	27
April	148,455	148,475	20	175	196	21
May	148,723	148,745	22	268	270	2

Let’s do it!

Now let’s scrape some news data

library(rvest)
url <- 'http://search.sina.com.cn/?q=%BE%A9%B6%AB&c=news&from=index&page=1'
web <- read_html(url, encoding = 'gbk')
news_title <- html_nodes(web, xpath = '//div[@class="r-info r-info2"]/h2[1]/a')
length(news_title)

## [1] 8

titles <- html_text(news_title)
titles[1:5]

## [1] "盘前提示 | 沪指在2800点上方蓄势 A股进一步下跌的空间不大"
## [2] "京东供应链集结到仓、商务仓、经济仓三大标准化服务"       
## [3] "京东(JD.US)1Q19季报点评：规模经济下的“新“京东"          
## [4] "多城出台推动夜间经济发展举措夜排档的“转正”之路"         
## [5] "多城出台推动夜间经济发展举措夜排档的“转正”之路"

link <- html_attr(news_title, 'href')
link[1:5]

## [1] "https://finance.sina.com.cn/roll/2019-05-15/doc-ihvhiqax8801723.shtml"            
## [2] "http://mp.sina.cn/article/2019-05-15/detail-i-ihvhiqax8801276.d.html"             
## [3] "https://finance.sina.com.cn/stock/relnews/us/2019-05-15/doc-ihvhiews1985710.shtml"
## [4] "https://k.sina.com.cn/article_2286908003_884f726302001a82v.html?from=food"        
## [5] "https://finance.sina.com.cn/roll/2019-05-15/doc-ihvhiqax8795795.shtml"

Pipeable!

news_title <- html_nodes(web, xpath = '//div[@class="r-info r-info2"]/h2[1]/a')
titles <- html_text(news_title)

\(\Downarrow\)

titles <- web %>%
  html_nodes(xpath = '//div[@class="r-info r-info2"]/h2[1]/a') %>%
  html_text()

Let’s extract as much information as we can

library(rvest)
url <- 'http://search.sina.com.cn/?q=%BE%A9%B6%AB&c=news&from=index&page=1'
web <- read_html(url, encoding = "gbk")
news_title <- web %>%
    html_nodes(xpath = '//div[@class="r-info r-info2"]/h2[1]/a') %>%
    html_text(trim = TRUE)
news_time <- web %>%
    html_nodes(xpath = '//div[@class="r-info r-info2"]/h2[1]/span') %>%
    html_text(trim = TRUE)
news_abstract <- web %>%
    html_nodes(xpath = '//div[@class="r-info r-info2"]/p[1]') %>%
    html_text(trim = TRUE)
news_link <- web %>%
    html_nodes(xpath = '//div[@class="r-info r-info2"]/h2[1]/a') %>%
    html_attr('href')
news_details <- data.frame(news_title, news_time, news_abstract, news_link)
library(knitr)
kable(head(news_details), format = "html")

news_title	news_time	news_abstract	news_link
盘前提示 \| 沪指在2800点上方蓄势 A股进一步下跌的空间不大	绝对值 2019-05-15 09:00:28	盘前提示 \| 沪指在2800点上方蓄势 A股进一步下跌的空间不大一、大势研判昨日两市集体低开开盘之后中小创发力创业板指急速翻红全天指数以震荡为主盘中指数多次翻红可见有资金活跃进场	https://finance.sina.com.cn/roll/2019-05-15/doc-ihvhiqax8801723.shtml
京东供应链集结到仓、商务仓、经济仓三大标准化服务	中国物流与采购网 2019-05-15 08:56:12	5月14日消息近日京东供应链标准化系列产品全面升级发布到仓服务、商务仓、经济仓三大产品及数十项增值服务为618活动做准备　　贯穿商品出工厂仓到消费者的B2C正逆向全业务场景	http://mp.sina.cn/article/2019-05-15/detail-i-ihvhiqax8801276.d.html
京东(JD.US)1Q19季报点评：规模经济下的“新“京东	格隆汇 2019-05-15 08:40:30	Non-GAAP 归属股东净利润创上市以来最高记录本季度京东总体净收入同比增长 20.9%至 RMB1,210.8 亿超市场预期 0.8% 其中自营电商业务同比增长 18.7%至 RMB1,086.5 亿；服务及其他业务同比增长 44.0%至 RMB124.3 亿	https://finance.sina.com.cn/stock/relnews/us/2019-05-15/doc-ihvhiews1985710.shtml
多城出台推动夜间经济发展举措夜排档的“转正”之路	人民网 2019-05-15 08:35:00	家住北京东三环劲松街道的陈先生记得过去每逢夏季原本宽阔的人行道上总是摆满了塑料桌椅行人只能溜边走 “大半夜还能听到大吼大叫根本休息不好” 如今	https://k.sina.com.cn/article_2286908003_884f726302001a82v.html?from=food
多城出台推动夜间经济发展举措夜排档的“转正”之路	人民网 2019-05-15 08:35:00	家住北京东三环劲松街道的陈先生记得过去每逢夏季原本宽阔的人行道上总是摆满了塑料桌椅行人只能溜边走 “大半夜还能听到大吼大叫根本休息不好” 如今	https://finance.sina.com.cn/roll/2019-05-15/doc-ihvhiqax8795795.shtml
福建今年一季度新开工一批好项目大项目，在你家乡吗？	福建尤溪电视台 2019-05-15 08:28:48	京东（仙游）数字经济产业园、中科智谷电子信息产业园、宸鸿科技 SNW 导电膜等电子信息项目古雷奇美化工 ABS 及	https://k.sina.com.cn/article_3966988353_ec73704102000i6u7.html?from=news&subch=onews

Questions

How about you want to scrape news data for multiple companies?
Think about how to turn pages?

Your turn

Please get from the web any data you are interested in.

Examples: news data, travelling data, books, movies, reviews, etc.

About assignments

Subject of your email: “DPSA2019-姓名-学号”
You’d better describe what you did in the body of the email and let me know the questions or problems you have.
Email attachments:
1. R script named as “DPSA2019-姓名-学号.R”
2. Your data named as “DPSA2019-姓名-学号.csv” or with other extensions.

Some notes

When you scrape a website too frequently, the server may reject your request. One possible solution is to stop for several seconds irregularly.
Not every website is scrappable! Some websites go with really high technoloy to protect their data from being extracted. For example, they use javascript, or really complex captcha codes.
Python has more functionality for web scraping. It is more flexible to deal with the problems mentioned above. If you are interested in that, please refer to this book. Basics of web scraping with Python are similar.

Text Mining

What is text mining?

Raw human written text \(\Rightarrow\) Structured information

The biggest difference between text mining and general data analysis is that it deals with text data, instead of numeric values.
Sometimes text mining is called ‘Natural Language Processing (NLP)’, especially in computer science.
Most text mining methods are based on word frequency in real world.

Topics in text mining

Corpus
- a collection of documents (e.g., a collection of different news documents)
Word segment
- segment each text into words
- stopwords: common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as the and a. Most search engines will filter out stopwords from search queries and documents in order to save space in their index.
DocumentTermMatrix
- Each row is a document, while each column shows word frequencies of the corresponding word.
- This is the very basic data structure for text mining.
Text clustering
- Group similar documents together according to their similarities.
Topic models
- Find topics which the corpus is talking about.

Case study: news corpus

## read text data into R
news_abstract <- as.character(news_details$news_abstract)

Word segment

## load the word segment package
library(jiebaR)
## build a segment engine
engine1 <- worker(stop_word = 'stopwords.txt')
## add news words into the engine 
new_user_word(engine1, c("电子信息"))

## [1] TRUE

## for each news, perform word segmentation
Words <-  c()
for(i in 1:length(news_abstract)){
  Words <- c(Words,c(segment(news_abstract[i], engine1)))
}

## we need to consider other stopwords in this specific case
myStopwords <- c('提示', '京东', '公司')
Words <- Words[-which(Words %in% myStopwords)]

## remove all the numbers
Words <- gsub("[0-9]+?",'', Words)

## only keep terms
Words <- Words[nchar(Words) > 1]
head(Words)

## [1] "盘前"   "上方"   "蓄势"   "A股"    "进一步" "下跌"

Word frequency and wordcloud

## word frequencies
wordsNum <- table(unlist(Words))  
wordsNum <- sort(wordsNum, decreasing = TRUE)   
words.top150 <- head(wordsNum,150) 
library(RColorBrewer); colors <- brewer.pal(8,"Dark2")  
Sys.setlocale("LC_CTYPE")

## [1] "en_US.UTF-8"

library(wordcloud2)
wordcloud2(words.top150, color = "random-dark", shape = 'circle', backgroundColor = 'white')

Clustering

## load the text mining packages
library(tm)
library(slam)
wmt.news <- readLines("WMTnews.txt") 
## wmt.news <- readLines("https://yanfei.site/docs/dpsa/WMTnews.txt", encoding = 'UTF-8')

## build word segment engine
mixseg <- worker(stop_word = "stopwords.txt")
## mixseg <- worker(stop_word = "https://yanfei.site/docs/dpsa/stopwords.txt")
mixseg$bylines <- TRUE

## word segmentation for each of the 449 articles
word_list <- mixseg[wmt.news]

f <- function(x){
  x <- gsub("[0-9]+?",'', x)
  x[x == '号店'] <- '1号店'
  x <- paste(x[nchar(x) > 1], collapse = ' ')
  return(x)
}
## cleanup
d.vec <- lapply(word_list,f)
corpus <- Corpus(VectorSource(d.vec))

## remove stopwords
myStopwords <- c('新浪', '沃尔玛', '年',  '月', '日','公司', '中国', '有限公司')
stopwords <- readLines('stopwords.txt')
mycorpus <- tm_map(corpus,removeWords,c(stopwords, myStopwords))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, c(stopwords,
## myStopwords)): transformation drops documents

## creat DocumentTermMatrix
control <- list(removePunctuation=T, 
             wordLengths = c(2, Inf), 
             stopwords = c(stopwords, myStopwords))
d.dtm <- DocumentTermMatrix(mycorpus, control)
d.dtm <- d.dtm[row_sums(d.dtm)!=0, ]

## remove sparse ones
d.dtm.sub <- removeSparseTerms(d.dtm, sparse = 0.99)

## text clustering
library(proxy)
d.dist <- proxy::dist(as.matrix(d.dtm.sub), method='cosine')
fit <- hclust(d.dist, method="ward.D")
memb <- cutree(fit, k = 2)
plot(fit)

Clustering

findFreqTerms(d.dtm.sub[memb==1, ], 300)

##  [1] "商品"   "美国"   "门店"   "市场"   "服务"   "记者"   "亿美元"
##  [8] "会员"   "增长"   "消费者" "超市"   "企业"   "全球"   "销售"  
## [15] "零售"   "食品"   "业务"   "山姆"   "电商"

findFreqTerms(d.dtm.sub[memb==2, ], 300)

## [1] "合作"  "业务"  "京东"  "电商"  "1号店"

Topic models

library(topicmodels)
ctm <- topicmodels::CTM(d.dtm.sub, k = 2)
terms(ctm, 2, 0.01)

## $`Topic 1`
## [1] "京东"  "1号店"
## 
## $`Topic 2`
## [1] "门店"

Objectives

Packages required in this lecture

Text Processing

Read and write text data into R

Length of each article

Concatenate strings

Concatenate two strings

Combine text and variable values

Split strings

Pattern Matching

Pattern Replacement

Extract substrings in a character vector

Your turn

Reference and further reading

Web Scraping

We need data

Get familiar with the structure of a html (tags)

Work with other useful tags

XPath

Example 1

Example 2

Your turn

Where to go?

Your turn

Scrapping a webpage using rvest package

Go back to previous examples

Let’s do it!

Now we want to scrape data from a html table

Let’s do it!

Now let’s scrape some news data

Pipeable!

Let’s extract as much information as we can

Questions

Your turn

Please get from the web any data you are interested in.

About assignments

Some notes

Text Mining

What is text mining?

Raw human written text \(\Rightarrow\) Structured information

Topics in text mining

Case study: news corpus

Word segment

Word frequency and wordcloud

Clustering

Clustering

Topic models

References

Further readings