Lesson 30: Simple Text Processing, II

So, let’s use our extractNonemptyWords function on our abt vector. Here’s a loop way to do it:

allWords <- NULL  # start with empty vector
for (i in 1:70) {
   thisLine <- extractNonemptyWords(abt[i])
   allWords <- c(allWords,thisLine)
}

Note that the result, i.e. the final value of allWords, will be one long vector, consisting of all the words in the file.

As usual, it is a must to inspect the result, say the first 25 elements:

> head(allWords,25)
 [1] "What"         "is"           "R?"          
 [4] "Introduction" "to"           "R"           
 [7] "R"            "is"           "a"           
[10] "language"     "and"          "environment" 
[13] "for"          "statistical"  "computing"   
[16] "and"          "graphics."    "It"          
[19] "is"           "a"            "GNU"         
[22] "project"      "which"        "is"          
[25] "similar"     

Good, all the words seem to be there, and the "" are NOT there, just as desired. But how to get the word counts? Why, it’s our old friend, tapply!

> q <- tapply(allWords,allWords,length)
> head(q,25)
            ;            …) “environment(easily)     (formerly 
            1             1             1             1             1 
   (including       (linear             *             ©             a 
            1             1             5             1            13 
        about     accretion     activity.           add    additional 
            3             1             1             1             1 
     Advanced   algorithmic        allows            an      analysis 
            1             1             1             5             2 
    analysis,           and           are        around       arrays, 
            2            27             4             1             1 

Actually, this really the same pattern we saw before, with the length function as our third argument. It may look a little odd that the first two arguments are identical, but it makes sense:

  1. We split up the allWords vector into piles, according to the second argument, which happens to be the same vector.

  2. We apply the length function to each pile, giving us the count in each pile, exactly what we needed.

Tip: In coding, certain patterns do arise often, one did here. In fact, there are coding books with “design patterns” in their titles. Take note when you see the same pattern a lot.

We’re not fully done yet. For instance, we have a punctuation problem, where periods, commas and so on are considered parts of words, such as the period in allWords[17] seen above, ‘graphics.’ We also probably should change capital letters to lower

For major usage, we should consider using one of the advanced R packages in text processing. For instance, the tm package has a removePunctuation function. But let’s see how we can do this with the basics.

We’ll use R’s gsub function. It’s call form, as we’ll use it, is

gsub(string_to_change,replacement,input_vector,fixed=TRUE)

E.g.

> a <- c('abc','de.')
> gsub('.','',a,fixed=TRUE)  # replace '.' by empty stringW
[1] "abc" "de" 

(The fixed argument is complex, and pops up in all the R string manipulation packages. This again is something you should use for now, and look into when you become more skilled at R.)

So, to remove all periods in allWords, we can do:

> awNoPers <- gsub('.','',allWords,fixed=TRUE) 
> awNoPers[17]  # check that it worked
[1] "graphics"