School of Economics and Management
Beihang University
http://yanfei.site

Primary R Functions

The primary R functions for dealing with text data are

  • grep(), grepl(): These functions search for matches of a regular expression pattern in a character vector. grep() returns the indices of the elements that contain a match or, with value = TRUE, the matching strings themselves. grepl() returns a TRUE/FALSE vector indicating which elements of the character vector contain a match.

  • regexpr(), gregexpr(): Search a character vector for pattern matches and return the position within each string where the match begins and the length of the match.

  • sub(), gsub(): Search a character vector for pattern matches and replace that match with another string.

  • substr(): Extract substrings in a character vector.

  • regexec(): This function searches a character vector for a pattern, much like regexpr(), but it additionally returns the locations of any parenthesized sub-expressions. This is easiest to explain through demonstration.
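As a quick orientation before the running example, here is a minimal sketch of each function applied to a small made-up vector. The vector x and all variable names are illustrative assumptions, not part of the homicides data.

```r
# Toy character vector, made up for illustration (not from the homicides data)
x <- c("apple pie", "banana split", "cherry pie")

idx  <- grep("pie", x)                 # indices of matching elements: 1 3
vals <- grep("pie", x, value = TRUE)   # the matching strings themselves
hit  <- grepl("pie", x)                # TRUE FALSE TRUE

first_pos <- regexpr("an", x[2])       # first match in "banana split" starts at 2
every_pos <- gregexpr("an", x[2])[[1]] # every match: starts at 2 and 4

once  <- sub("a", "A", x[2])           # "bAnana split" (first match only)
every <- gsub("a", "A", x[2])          # "bAnAnA split" (all matches)

head5 <- substr(x[1], 1, 5)            # "apple"

# regexec() also reports the parenthesized sub-expressions
parts <- regmatches(x[1], regexec("([a-z]+) ([a-z]+)", x[1]))[[1]]
parts                                  # "apple pie" "apple" "pie"
```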

Text data

Our running example uses data on homicides in Baltimore City. You can get the file from https://yanfei.site/docs/sc/data/homicides.txt. The original data are from https://homicides.news.baltimoresun.com.

homicides <- readLines("../data/homicides.txt")
## Total number of events recorded
length(homicides)
## [1] 1571
homicides[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
homicides[1000]
## [1] "39.33626300000, -76.55553990000, icon_homicide_shooting, 'p1200', '<dl><dt><a href=\"http://essentials.baltimoresun.com/micro_sun/homicides/victim/1200/davon-diggs\">Davon Diggs</a></dt><dd class=\"address\">4100 Parkwood Ave<br />Baltimore, MD 21206</dd><dd>Race: Black<br />Gender: male<br />Age: 21 years old</dd><dd>Found on November  5, 2011</dd><dd>Victim died at Johns Hopkins Bayview Medical Center </dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Originally reported in 5000 Belair Road; later determined to be rear alley of 4100 block Parkwood</p></dd></dl>'"

We have the latitude and longitude of where the victim was found; then there's the street address; the age, race, and gender of the victim; the date on which the victim was found; in which hospital the victim ultimately died; the cause of death.

grep()

Suppose we want to identify the records for all victims of shootings (as opposed to other causes). How could we do that?

Here I use grep() to match the literal string iconHomicideShooting against the character vector of homicides.

g <- grep("iconHomicideShooting", homicides)
length(g)
## [1] 228

Using this approach I get 228 shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as icon_homicide_shooting. It's not uncommon over time for web site maintainers to change the names of files or update files. What happens if we now grep() on both icon names using the | operator?

g <- grep("iconHomicideShooting|icon_homicide_shooting", homicides)
length(g)
## [1] 1263
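The effect of the | operator can be checked on a small vector. This is a sketch only: the stabbing labels are hypothetical, and only the two shooting spellings are taken from the data. The last pattern is one possible compact alternative, and it is slightly looser than the explicit alternation.

```r
# Hypothetical icon labels; only the two shooting spellings come from the data
flags <- c("iconHomicideShooting", "icon_homicide_shooting",
           "iconHomicideStabbing", "icon_homicide_stabbing")

# Matching a single spelling misses the other one
one_spelling <- grep("iconHomicideShooting", flags)

# Alternation: match either spelling
both_spellings <- grep("iconHomicideShooting|icon_homicide_shooting", flags)

# Optional underscores plus character classes cover both spellings in one
# pattern (it would also accept mixed forms such as icon_Homicideshooting)
compact <- grep("icon_?[Hh]omicide_?[Ss]hooting", flags)

one_spelling    # 1
both_spellings  # 1 2
compact         # 1 2
```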

grep()

Another possible way to do this is to grep() on the cause of death field, which seems to have the format Cause: shooting. We can grep() on this literally and get

g <- grep("Cause: shooting", homicides)
length(g)
## [1] 228

Notice that we seem to be undercounting again. This is because some entries spell "shooting" with a capital "S" while others use a lowercase "s". We can handle this variation by using a character class in our regular expression.

g <- grep("Cause: [Ss]hooting", homicides)
length(g)
## [1] 1263
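An alternative to the character class is the ignore.case argument of grep(). The sketch below uses a made-up vector of cause strings. Note that ignore.case = TRUE is broader than [Ss], since it ignores case for every letter in the pattern.

```r
# Made-up cause strings mimicking the two capitalizations seen in the data
causes <- c("Cause: shooting", "Cause: Shooting", "Cause: stabbing")

# Character class: accept either "S" or "s" in that one position
by_class <- grep("Cause: [Ss]hooting", causes)

# ignore.case = TRUE: case is ignored for the whole pattern
by_flag <- grep("cause: shooting", causes, ignore.case = TRUE)

by_class  # 1 2
by_flag   # 1 2
```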

grepl()

The function grepl() works much like grep() except that it differs in its return value. grepl() returns a logical vector indicating which elements of a character vector contain a match. For example, suppose we want to know which states in the United States begin with the word "New".

g <- grepl("^New", state.name)
g
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [46] FALSE FALSE FALSE FALSE FALSE
state.name[g]
## [1] "New Hampshire" "New Jersey"    "New Mexico"   
## [4] "New York"

Here, we can see that grepl() returns a logical vector that can be used to subset the original state.name vector.
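Because the result is a logical vector, it also combines naturally with sum(), which(), and negation. A short sketch using the built-in state.name vector (the variable names are illustrative):

```r
is_new <- grepl("^New", state.name)

n_new   <- sum(is_new)          # TRUE counts as 1, so this counts the matches
pos_new <- which(is_new)        # positions of the matches, same as grep()
others  <- state.name[!is_new]  # negation selects the non-matching states

n_new                                            # 4
pos_new                                          # 29 30 31 32
identical(pos_new, grep("^New", state.name))     # TRUE
length(others)                                   # 46
```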

regexpr()

  • Both the grep() and the grepl() functions have some limitations - they don't tell you exactly where the match occurs or what the match is for a more complicated regular expression.

  • The regexpr() function gives you
    • index into each string where the match begins
    • length of the match for that string.
  • regexpr() only gives you the first match in each string (reading left to right). gregexpr() will give you all of the matches in a given string if there is more than one.
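The difference between the two functions is easiest to see on a string with two matches. The string s below is made up for illustration.

```r
# Made-up string containing the pattern twice
s <- "found here, found there"

# regexpr(): only the first match, with its length as an attribute
r1 <- regexpr("found", s)
first_start <- as.integer(r1)            # 1
first_len   <- attr(r1, "match.length")  # 5

# gregexpr(): a list with one element per input string, holding all matches
rall <- gregexpr("found", s)[[1]]
all_starts <- as.integer(rall)           # 1 13

# regmatches() extracts the matched text from either kind of result
all_text <- regmatches(s, gregexpr("found", s))[[1]]  # "found" "found"
```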

regexpr() Example

In our Baltimore City homicides dataset, we might be interested in finding the date on which each victim was found. Taking a look at the dataset

homicides[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"

it seems that we might be able to just grep on the word "Found". However, the word "found" may be found elsewhere in the entry, such as in this entry, where the word "found" appears in the narrative text at the end.

homicides[954]
## [1] "39.30677400000, -76.59891100000, icon_homicide_shooting, 'p816', '<dl><dt><a href=\"http://essentials.baltimoresun.com/micro_sun/homicides/victim/816/kenly-wheeler\">Kenly Wheeler</a></dt><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd><dd>Race: Black<br />Gender: male<br />Age: 29 years old</dd><dd>Found on March  3, 2010</dd><dd>Victim died at Scene</dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\'s body was&nbsp;found on the grounds of Dr. Bernard Harris Sr.&nbsp;Elementary School</p></dd></dl>'"

But we can see that the date is typically preceded by "Found on" and is surrounded by <dd></dd> tags, so let's use the pattern <dd>[Ff]ound(.*)</dd> and see what it brings up.

regexpr("<dd>[Ff]ound(.*)</dd>", homicides[1:10])
##  [1] 177 178 188 189 178 182 178 187 182 183
## attr(,"match.length")
##  [1] 93 86 89 90 89 84 85 84 88 84
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

We can use the substr() function to extract the first match in the first string.

substr(homicides[1], 177, 177 + 93 - 1)
## [1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"

That picked up too much information, because .* is greedy and runs to the last </dd> tag in the string. We need to add the ? metacharacter to make the quantifier "lazy" so that the match stops at the first </dd> tag.

regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:10])
##  [1] 177 178 188 189 178 182 178 187 182 183
## attr(,"match.length")
##  [1] 33 33 33 33 33 33 33 33 33 33
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

Now when we look at the substrings indicated by the regexpr() output, we get

substr(homicides[1], 177, 177 + 33 - 1)
## [1] "<dd>Found on January 1, 2007</dd>"
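The greedy and lazy behaviours can be compared side by side on a shortened string. This is a sketch: x is an abbreviated version of the first record, and the negated character class in the last pattern is an alternative technique that is not used elsewhere in these notes.

```r
# Shortened version of the first record (two <dd> fields only)
x <- "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>"

# Greedy: .* extends to the LAST </dd> in the string
greedy <- regmatches(x, regexpr("<dd>[Ff]ound(.*)</dd>", x))

# Lazy: .*? stops at the FIRST </dd>
lazy <- regmatches(x, regexpr("<dd>[Ff]ound(.*?)</dd>", x))

# Alternative: forbid "<" inside the match, so it cannot run past a tag
negated <- regmatches(x, regexpr("<dd>[Ff]ound[^<]*</dd>", x))

greedy   # the whole string
lazy     # "<dd>Found on January 1, 2007</dd>"
negated  # "<dd>Found on January 1, 2007</dd>"
```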

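Instead of computing the positions for substr() by hand, it is handier to pass the regexpr() output to regmatches(), which extracts the matched substrings directly.

r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])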
regmatches(homicides[1:5], r)
## [1] "<dd>Found on January 1, 2007</dd>"
## [2] "<dd>Found on January 2, 2007</dd>"
## [3] "<dd>Found on January 2, 2007</dd>"
## [4] "<dd>Found on January 3, 2007</dd>"
## [5] "<dd>Found on January 5, 2007</dd>"

sub() and gsub()

Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the date from this string?

x <- substr(homicides[1], 177, 177 + 33 - 1)
x
## [1] "<dd>Found on January 1, 2007</dd>"

We want to strip out the stuff surrounding the "January 1, 2007" portion. We can do that by matching on the text that comes before and after it using the | operator and then replacing it with the empty string.

sub("<dd>[Ff]ound on |</dd>", "", x)
## [1] "January 1, 2007</dd>"

Notice that the sub() function found the first match (at the beginning of the string) and replaced it and then stopped. However, there was another match at the end of the string that we also wanted to replace. To get both matches, we need the gsub() function.

gsub("<dd>[Ff]ound on |</dd>", "", x)
## [1] "January 1, 2007"
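Another way to get the same result is a single sub() call with a back-reference: match the whole string, capture the date in parentheses, and replace everything with the captured group \\1. This is a sketch of an alternative technique, not the approach used in the rest of these notes.

```r
x <- "<dd>Found on January 1, 2007</dd>"

# The pattern matches the entire string; "\\1" keeps only the captured date
date_only <- sub("<dd>[Ff]ound on (.*?)</dd>", "\\1", x)

# Same result as deleting both surrounding pieces with gsub()
via_gsub <- gsub("<dd>[Ff]ound on |</dd>", "", x)

date_only  # "January 1, 2007"
```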

The sub() and gsub() functions can take vector arguments so we don't have to process each string one by one.

r <- regexpr("<dd>[Ff]ound(.*?)</dd>", homicides[1:5])
m <- regmatches(homicides[1:5], r)
m
## [1] "<dd>Found on January 1, 2007</dd>"
## [2] "<dd>Found on January 2, 2007</dd>"
## [3] "<dd>Found on January 2, 2007</dd>"
## [4] "<dd>Found on January 3, 2007</dd>"
## [5] "<dd>Found on January 5, 2007</dd>"
d <- gsub("<dd>[Ff]ound on |</dd>", "", m)
## Nice and clean
d
## [1] "January 1, 2007" "January 2, 2007" "January 2, 2007"
## [4] "January 3, 2007" "January 5, 2007"

Finally, it may be useful to convert these strings to the Date class so that we can do some date-related computations.

as.Date(d, "%B %d, %Y")
## [1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03"
## [5] "2007-01-05"
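A few date-related computations become possible once the strings are Date objects. The sketch below retypes the five date strings shown above so that it runs on its own. It assumes an English locale, since %B parses month names according to the current locale.

```r
# The five date strings extracted above, retyped so the example is standalone
d <- c("January 1, 2007", "January 2, 2007", "January 2, 2007",
       "January 3, 2007", "January 5, 2007")

# Assumes an English locale: %B matches full month names in the current locale
dates <- as.Date(d, "%B %d, %Y")

gaps   <- as.numeric(diff(dates))  # days between consecutive records: 1 0 1 2
span   <- range(dates)             # earliest and latest date
months <- format(dates, "%Y-%m")   # "2007-01" for every record here
```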

regexec()

The regexec() function works like regexpr() except it gives you the indices for parenthesized sub-expressions. For example, take a look at the following expression.

regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1])
## [[1]]
## [1] 177 190
## attr(,"match.length")
## [1] 33 15
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

Here's the overall expression match.

substr(homicides[1], 177, 177 + 33 - 1)
## [1] "<dd>Found on January 1, 2007</dd>"

And here's the parenthesized sub-expression.

substr(homicides[1], 190, 190 + 15 - 1)
## [1] "January 1, 2007"

All this can be done much more easily with the regmatches() function.

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1:2])
regmatches(homicides[1:2], r)
## [[1]]
## [1] "<dd>Found on January 1, 2007</dd>"
## [2] "January 1, 2007"                  
## 
## [[2]]
## [1] "<dd>Found on January 2, 2007</dd>"
## [2] "January 2, 2007"
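regexec() becomes more useful with several parenthesized sub-expressions. The following sketch splits the date into month, day, and year. The pattern is an illustrative assumption and is not used elsewhere in these notes. The input mimics the date field of record 1000, which has two spaces before the day.

```r
# Date field as it appears in record 1000 (note the double space before "5")
x <- "<dd>Found on November  5, 2011</dd>"

# Three groups: month name, day, four-digit year; " +" tolerates extra spaces
r <- regexec("[Ff]ound on ([A-Za-z]+) +([0-9]+), ([0-9]{4})", x)
parts <- regmatches(x, r)[[1]]

parts[1]    # full match: "Found on November  5, 2011"
parts[2:4]  # "November" "5" "2011"
```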

regexec()

As an example, we can make a plot of monthly homicide counts. First we need a regular expression to capture the dates.

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides)
m <- regmatches(homicides, r)

Then we can loop through the list returned by regmatches() and extract the second element of each (the parenthesized sub-expression).

dates <- sapply(m, function(x) x[2])
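It is worth knowing what this step does for a record in which the pattern does not match. In that case regmatches() yields character(0), and indexing its second element gives NA. The list m_toy below is a hand-built stand-in for the regmatches() output, used only for illustration.

```r
# Hand-built stand-in for regmatches() output: one match, one non-match
m_toy <- list(
  c("<dd>Found on January 1, 2007</dd>", "January 1, 2007"),
  character(0)  # what regmatches() returns when regexec() finds no match
)

# x[2] on an empty vector is NA, so non-matching records become NA
dates_toy <- sapply(m_toy, function(x) x[2])
dates_toy             # "January 1, 2007" NA

# vapply() does the same but checks that each result is a single string
dates_chk <- vapply(m_toy, function(x) x[2], character(1))
```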

Finally, we can convert the date strings into the Date class and make a histogram of the counts.

dates <- as.Date(dates, "%B %d, %Y")
hist(dates, "month", freq = TRUE, main = "Monthly Homicides in Baltimore")

We can see from the picture that homicides do not occur uniformly throughout the year and appear to have some seasonality to them.
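The monthly counts behind such a plot can also be tabulated directly. The sketch below uses a handful of made-up dates rather than the real data, so the numbers are purely illustrative.

```r
# Made-up dates for illustration only (not from the homicides data)
toy_dates <- as.Date(c("2007-01-01", "2007-01-02", "2007-02-10",
                       "2007-02-11", "2007-02-20", "2007-03-05"))

# Reduce each date to its year-month label, then count the labels
monthly <- table(format(toy_dates, "%Y-%m"))
monthly
names(monthly)       # "2007-01" "2007-02" "2007-03"
as.vector(monthly)   # 2 3 1
```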

Summary

The primary R functions for dealing with regular expressions are

  • grep(), grepl(): Search for matches of a regular expression/pattern in a character vector

  • regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches()

  • sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string

  • regexec(): Gives you indices of parenthesized sub-expressions.

Lab Session 3

In this lab, you will practice text processing in R.

  1. Load the text from https://yanfei.site/docs/dpsa/BABAnews.txt and print it on screen. The text file contains some news about Alibaba.

  2. How many paragraphs are there in the article?

  3. Trim leading whitespaces of each paragraph (try ??trim).

  4. How many characters are there in the article?

  5. Collapse paragraphs into one and display it on the screen (un-list it).

  6. Does the text contain the word '技术架构'?

  7. Split the article into sentences (by periods).

  8. Replace '双11' with '双十一'.

References

Roger D. Peng, "R Programming for Data Science", Chapter 19.