In our Baltimore City homicides dataset, we might be interested in finding the date on which each victim was found. Taking a look at the dataset
homicides[1]
## [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
it seems that we might be able to just grep on the word "Found". However, the word "found" may appear elsewhere in an entry, as in the one below, where it also shows up in the narrative text at the end.
homicides[954]
## [1] "39.30677400000, -76.59891100000, icon_homicide_shooting, 'p816', '<dl><dt><a href=\"http://essentials.baltimoresun.com/micro_sun/homicides/victim/816/kenly-wheeler\">Kenly Wheeler</a></dt><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd><dd>Race: Black<br />Gender: male<br />Age: 29 years old</dd><dd>Found on March 3, 2010</dd><dd>Victim died at Scene</dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\'s body was found on the grounds of Dr. Bernard Harris Sr. Elementary School</p></dd></dl>'"
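The problem can be reproduced on a small scale. The sketch below uses a made-up entry shaped like the ones in the dataset (the string and the variable names are assumptions for illustration, not real records) and counts how many times the bare word matches.

```r
# Made-up entry in the same shape as the dataset (not a real record)
entry <- paste0(
  "<dl><dd>Found on March 3, 2010</dd>",
  "<dd class=\"popup-note\"><p>The body was found near a school</p></dd></dl>"
)

# gregexpr() returns the start of every match, not just the first one
hits <- gregexpr("[Ff]ound", entry)[[1]]

# Both the date field and the narrative note contain the word
n_hits <- length(hits)
n_hits
```

Because the word occurs twice, a search on the word alone cannot tell the date field apart from the note.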
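The date, however, is typically preceded by "Found on" and is surrounded by <dd></dd> tags, so let's use the pattern <dd>[F|f]ound(.*)</dd> and see what it brings up. (Strictly speaking, [Ff] would be enough: inside a character class the | is a literal character, so [F|f] also matches a pipe, which is harmless here.)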
regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1:10])
## [1] 177 178 188 189 178 182 178 187 182 183
## attr(,"match.length")
## [1] 93 86 89 90 89 84 85 84 88 84
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
We can use the substr() function to extract the first match in the first string.
substr(homicides[1], 177, 177 + 93 - 1)
## [1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"
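The pattern picked up too much information. The * quantifier is greedy, so (.*) extends the match as far as it can, up to the last </dd> tag in the string. We need to add the ? metacharacter after the * to make the regular expression "lazy" so that it stops at the first </dd> tag.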
regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:10])
## [1] 177 178 188 189 178 182 178 187 182 183
## attr(,"match.length")
## [1] 33 33 33 33 33 33 33 33 33 33
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
Now when we look at the substrings indicated by the regexpr() output, we get
substr(homicides[1], 177, 177 + 33 - 1)
## [1] "<dd>Found on January 1, 2007</dd>"
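The difference between the greedy and lazy forms is easy to check on a short string. This is a minimal sketch on a made-up input (the string x and the variable names are assumptions, not part of the dataset), using [Ff] in place of [F|f].

```r
# Made-up string with two <dd> elements (not taken from the dataset)
x <- "<dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd>"

# Greedy: (.*) runs on to the last </dd> in the string
greedy <- regexpr("<dd>[Ff]ound(.*)</dd>", x)

# Lazy: (.*?) stops at the first </dd> it can reach
lazy <- regexpr("<dd>[Ff]ound(.*?)</dd>", x)

greedy_len <- attr(greedy, "match.length")
lazy_len <- attr(lazy, "match.length")

# Pull out both matches to compare them side by side
greedy_txt <- substr(x, greedy, greedy + greedy_len - 1)
lazy_txt <- substr(x, lazy, lazy + lazy_len - 1)
```

Here the greedy match covers the whole string, while the lazy match covers only the first <dd> element.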
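Instead of computing the offsets for substr() by hand, it is handier to use regmatches(), which takes the original strings together with the regexpr() output and returns the matched text.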
r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
regmatches(homicides[1:5], r)
## [1] "<dd>Found on January 1, 2007</dd>"
## [2] "<dd>Found on January 2, 2007</dd>"
## [3] "<dd>Found on January 2, 2007</dd>"
## [4] "<dd>Found on January 3, 2007</dd>"
## [5] "<dd>Found on January 5, 2007</dd>"
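One detail worth knowing is that regmatches() combined with regexpr() returns only the strings that matched, so the result can be shorter than the input. The sketch below illustrates this on made-up entries (the vector x and all variable names are assumptions for illustration) and then uses sub() with a backreference as one possible way to reduce each match to the date itself.

```r
# Made-up entries; the second has no "Found on" field
x <- c(
  "<dl><dd>Found on January 1, 2007</dd><dd>Cause: shooting</dd></dl>",
  "<dl><dd>Cause: stabbing</dd></dl>",
  "<dl><dd>found on March 3, 2010</dd><dd>Cause: shooting</dd></dl>"
)

r <- regexpr("<dd>[Ff]ound(.*?)</dd>", x)

# Elements without a match (reported as -1 by regexpr) are dropped
m <- regmatches(x, r)

# Indices of the inputs that did match, to line results up with x
idx <- which(r != -1)

# Keep only the text captured after "Found on "
dates <- sub("<dd>[Ff]ound on (.*?)</dd>", "\\1", m)
dates
```

Keeping idx alongside the matches makes it possible to tell which input each extracted date came from.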