So, let’s use our extractNonemptyWords function on our abt vector.
Here’s a loop-based way to do it:
allWords <- NULL  # start with empty vector
for (i in 1:70) {
   thisLine <- extractNonemptyWords(abt[i])
   allWords <- c(allWords,thisLine)
}
Note that the result, i.e. the final value of allWords, will be one
long vector, consisting of all the words in the file.
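As a side note, the same accumulation can be written without an explicit loop. The sketch below is only an illustration under stated assumptions: the definition of extractNonemptyWords shown here is a guessed stand-in (split on spaces, drop empty strings), and abtToy is a made-up three-line vector, not the real abt. It also uses seq_along in place of a hard-coded 1:70, which avoids having to know the number of lines in advance.

```r
# Assumed stand-in for extractNonemptyWords: split a line on single
# spaces and drop the empty strings that come from extra blanks.
extractNonemptyWords <- function(s) {
   z <- strsplit(s, ' ', fixed = TRUE)[[1]]
   z[z != ""]
}

# Made-up toy data standing in for abt (hypothetical, not the real file)
abtToy <- c("What is R?", "", "R is a  language")

# Loop version, as in the text, but sized by the data via seq_along()
allWordsLoop <- NULL
for (i in seq_along(abtToy)) {
   allWordsLoop <- c(allWordsLoop, extractNonemptyWords(abtToy[i]))
}

# Loop-free version: lapply() gives one list element per line,
# and unlist() flattens the list into one long vector
allWordsApply <- unlist(lapply(abtToy, extractNonemptyWords))

print(allWordsLoop)
print(identical(allWordsLoop, allWordsApply))
```

Both versions should produce the same vector; which one reads better is largely a matter of taste at this stage.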
As usual, we must inspect the result, say the first 25 elements:
> head(allWords,25)
[1] "What" "is" "R?"
[4] "Introduction" "to" "R"
[7] "R" "is" "a"
[10] "language" "and" "environment"
[13] "for" "statistical" "computing"
[16] "and" "graphics." "It"
[19] "is" "a" "GNU"
[22] "project" "which" "is"
[25] "similar"
Good, all the words seem to be there, and the "" are NOT there, just as
desired. But how to get the word counts? Why, it’s our old friend,
tapply!
> q <- tapply(allWords,allWords,length)
> head(q,25)
; …) “environment” (easily) (formerly
1 1 1 1 1
(including (linear * © a
1 1 5 1 13
about accretion activity. add additional
3 1 1 1 1
Advanced algorithmic allows an analysis
1 1 1 5 2
analysis, and are around arrays,
2 27 4 1 1
Actually, this is really the same pattern we saw before, with the
length function as our third argument. It may look a little odd
that the first two arguments are identical, but it makes sense:
We split up the allWords vector into piles, according to the
second argument, which happens to be the same vector.
We apply the length function to each pile, giving us the count
in each pile, exactly what we needed.
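To see the "split into piles, then count each pile" idea on something small enough to check by eye, here is a sketch on a made-up five-element vector (x is hypothetical, not part of the data above). It also compares the result with base R's table function, which is not used in the text but performs the same kind of count.

```r
# Made-up toy vector: "a" appears 3 times, "b" and "c" once each
x <- c("a", "b", "a", "c", "a")

# Split x into piles according to x itself, then take the length of each pile
q1 <- tapply(x, x, length)
print(q1)

# table() is a base R function that counts occurrences directly
q2 <- table(x)
print(q2)

# The counts agree, pile by pile
print(all(as.vector(q1) == as.vector(q2)))
```

The tapply form is kept in the text because it reuses a pattern already seen; table is simply a more specialized tool for the same job.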
Tip: In coding, certain patterns arise often, as one did here. In fact, there are coding books with “design patterns” in their titles. Take note when you see the same pattern a lot.
We’re not fully done yet. For instance, we have a punctuation problem,
where periods, commas and so on are considered parts of words, such as
the period in allWords[17] seen above, ‘graphics.’ We also probably
should change capital letters to lower case.
For serious use, we should consider using one of the advanced R packages
for text processing. For instance, the tm package has a
removePunctuation function. But let’s see how we can do this with
the basics.
We’ll use R’s gsub function. Its call form, as we’ll use it, is
gsub(string_to_change,replacement,input_vector,fixed=TRUE)
E.g.
> a <- c('abc','de.')
> gsub('.','',a,fixed=TRUE) # replace '.' by empty string
[1] "abc" "de"
(The fixed argument is complex, and pops up throughout R’s string manipulation functions. Here, fixed=TRUE tells gsub to treat '.' as a literal period rather than as a pattern. This again is something you should use for now, and look into when you become more skilled at R.)
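For the curious, a short sketch of what goes wrong if fixed=TRUE is left out. By default gsub interprets its first argument as a regular expression, in which '.' stands for "any single character", so every character gets deleted. The escaped form '\\.' in the last line is the regular-expression way to ask for a literal period; it is shown only for comparison.

```r
a <- c('abc', 'de.')

# Default: '.' is a regular expression matching ANY character,
# so every character in every string is removed
noFixed <- gsub('.', '', a)
print(noFixed)      # [1] "" ""

# fixed=TRUE: '.' is taken literally, so only periods are removed
withFixed <- gsub('.', '', a, fixed = TRUE)
print(withFixed)    # [1] "abc" "de"

# Regular-expression alternative: escape the period
escaped <- gsub('\\.', '', a)
print(escaped)      # [1] "abc" "de"
```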
So, to remove all periods in allWords, we can do:
> awNoPers <- gsub('.','',allWords,fixed=TRUE)
> awNoPers[17] # check that it worked
[1] "graphics"
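The same idea extends to other punctuation marks and to the capitalization issue mentioned earlier. The following is a minimal sketch, not a complete cleaner: words is a made-up sample standing in for allWords, the list of punctuation characters is an arbitrary choice for illustration, and tolower is the base R function for converting to lower case.

```r
# Made-up sample standing in for allWords (hypothetical data)
words <- c("R", "is", "a", "language,", "and", "graphics.", "It", "is", "GNU")

# Remove each punctuation mark in turn; fixed=TRUE treats every
# character literally, including '.', '(' and ')'
cleaned <- words
for (p in c('.', ',', ';', '(', ')')) {
   cleaned <- gsub(p, '', cleaned, fixed = TRUE)
}

# Change capital letters to lower case
cleaned <- tolower(cleaned)
print(cleaned)

# Recount with the same tapply pattern as before
counts <- tapply(cleaned, cleaned, length)
print(counts)
```

One caveat: a token made up entirely of removed characters would become an empty string, so on real data it may be worth dropping "" entries again before counting.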