These days, text processing is big in the Data Science field, e.g. in
Natural Language Processing applications. In this lesson, we’ll do a
simple yet practical example, in order to illustrate some key functions
in base-R. (R has many packages for advanced text work, such as tm.)
Our example will cover reading in a file of text, and compiling a word count, i.e. calculating the number of times each word appears. This kind of task is at the core of many text classification algorithms.
The file is here. It’s basically the About section of the R Project home page. Here are the first few lines:
What is R?
Introduction to R
R is a language and environment for statistical computing and graphics.
Now, how can we read the file? For instance, read.table won’t work,
as it expects the same number of nonblank fields on each line. As you
can see above, our file has a variable number of such fields per line.
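
To see the problem concretely, here is a small sketch that feeds read.table three lines with differing numbers of words. The vector txt is a made-up stand-in for the file, and we wrap the call in try so that the error does not halt our session:

```r
# hypothetical stand-in for a few lines of the file (3, 3 and 11 words)
txt <- c("What is R?",
         "Introduction to R",
         "R is a language and environment for statistical computing and graphics.")

# read.table wants the same number of fields on every line, so this fails;
# try() catches the error instead of stopping us
res <- try(read.table(text = txt), silent = TRUE)
inherits(res, "try-error")   # TRUE if read.table gave up
```
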
Instead, we read the lines of the file via a function named, not
surprisingly, readLines:
> abt <- readLines('https://raw.githubusercontent.com/matloff/fasteR/master/data/aboutR.txt')
So, what exactly is in abt now? Let’s turn to our usual inspection
tools, str and head.
> str(abt)
chr [1:70] "" "What is R?" "" "Introduction to R" "" ...
So, abt is a vector of 70 elements, of type character. Each element
of this vector is one line from the file:
> head(abt)
[1] ""
[2] "What is R?"
[3] ""
[4] "Introduction to R"
[5] ""
[6] "   R is a language and environment for statistical computing and graphics."
The first line in the file was empty, so abt[1] is "", and so on.
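
If you would like to try readLines without a network connection, one minimal sketch is to write a few lines to a temporary file and read them back. The file contents and the names tmp and demo are made up for illustration:

```r
# write three made-up lines, the first one empty, to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("", "What is R?", "Introduction to R"), tmp)

# read them back: one vector element per line of the file
demo <- readLines(tmp)
str(demo)
length(demo)    # number of lines read
demo[1] == ""   # the empty first line comes back as ""
```
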
Now, to count the number of words in the file, we’ll need a way to count
the number in each line, which we will then sum. R’s strsplit
function will serve us well here, e.g. for line 4:
> y <- strsplit(abt[4],' ')
> y
[[1]]
[1] "Introduction" "to" "R"
(That second argument, ' ', means we want the blank character to be our
splitting delimiter.)
Good, it split the line into the three words on that line,
"Introduction", "to", and "R".
But be careful! What is that [[1]] doing there? Remember, the double
bracket notation is for R lists. So, strsplit has split abt[4] into
a list with one element, and that element is in turn the three-element
character vector c("Introduction","to","R"). So for instance,
> y[[1]][2]
[1] "to"
> y[[1]][3]
[1] "R"
Why the R list form? Well, strsplit can be applied to the entire
character vector abt, yielding a list of 70 elements; the i-th such
element will contain the split form of the i-th line in the file, e.g.
line 6:
> w <- strsplit(abt,' ')
> w[[6]] # example
[1] "" "" "" "R" "is"
[6] "a" "language" "and" "environment" "for"
[11] "statistical" "computing" "and" "graphics."
Yep, that’s the split form of line 6.
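
Here is the same idea on a tiny made-up vector, so that the whole result fits on the screen. The names v and sp are ours, just for illustration:

```r
v <- c("What is R?", "", "Introduction to R")   # made-up "file" of 3 lines
sp <- strsplit(v, ' ')   # a list, one element per element of v
length(sp)               # 3, same as length(v)
sp[[3]]                  # the split form of v[3]
sp[[3]][1]               # "Introduction"
lengths(sp)              # number of pieces found in each line
```
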
The material beginning with the # sign is what is called a comment in
the programming world. It does not get executed by R, but it is a memo
to us, the programmer, a note to help us remember what we did.
Comments are extremely important. When we read our code six months from now, we will have forgotten most of it, so comments help us reorient. The same holds if someone else reads our code. Comments — meaningful comments — are key to good coding. More on this in a future lesson.
But we also see another snag. The above output tells us that R took
line 6, which has 11 words, and split it into 14 “words,” the first 3 of
which are empty strings "". This is because the first three characters in
line 6 are blanks. Each blank at the start of a line, and each extra
blank in a run of blanks between words, makes strsplit produce an empty
“word.” (This comes as quite a surprise to Python programmers.)
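
A tiny experiment makes this behavior easy to see. The strings below are made up for illustration:

```r
strsplit("a b", ' ')[[1]]       # one blank between words: "a" "b"
strsplit("a  b", ' ')[[1]]      # two blanks: "a" "" "b"
strsplit("   R is", ' ')[[1]]   # three leading blanks: "" "" "" "R" "is"
```
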
So, how to fix it? Answering that question will give us a chance to learn more about R in general.
For that particular line, we could, say, remove those empty “words” as follows:
> z <- w[[6]]
> z <- z[z != ""]
> z
[1] "R" "is" "a" "language" "and"
[6] "environment" "for" "statistical" "computing" "and"
[11] "graphics."
R’s != means “not equal to.” By the way, ! means “not” in R, e.g.
> 3 < 8
[1] TRUE
> !(3 < 8)
[1] FALSE
So what we did to z above followed our usual pattern:
The expression z != "" yields a bunch of TRUEs and FALSEs.
The operation z[bunch of TRUEs and FALSEs] extracts those
elements of z at which there are TRUEs, which are exactly the
ones we want to keep here.
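
Spelled out one step at a time on a short made-up vector (the name keep is ours, used only for this illustration):

```r
z <- c("", "", "R", "is", "", "a")
keep <- z != ""   # FALSE FALSE TRUE TRUE FALSE TRUE
keep
z[keep]           # "R" "is" "a"
```
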
Tip: When you write some code that looks like it will be generally useful, make a function out of it, and save it for future use. The above code to delete the empty “words” sounds like something worth keeping. So, let’s write it in function form:
extractNonemptyWords <- function(s)
{
   # split s at blanks; [[1]] pulls the character vector out of the list
   z <- strsplit(s,' ')[[1]]
   # keep only the nonempty "words"
   z[z != ""]
}
Recall that in R functions, the last computed value is automatically
returned. The expression z[z != ""] evaluates to the vector of
nonempty words, and it is returned.
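
As a quick check, we can try the function on a made-up line with leading and repeated blanks. The function definition is repeated here so that the snippet runs on its own:

```r
extractNonemptyWords <- function(s)
{
   z <- strsplit(s,' ')[[1]]
   z[z != ""]
}

s <- "   R is  a language"        # made-up test line
extractNonemptyWords(s)           # "R" "is" "a" "language"
length(extractNonemptyWords(s))   # 4 words
```
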
Tip:
As mentioned, we should probably save that function for future use.
We could save it using the save function as discussed earlier, but it
would be better to save our “home grown” functions in one or more
packages, maybe even submitting them to CRAN. More on this in a later
lesson.
We’ll continue with this example in the next lesson, but first, time for a Your Turn session.
Your Turn: That [[1]] expression in the body of extractNonemptyWords was crucial! Try the code without it, and see if you can explain the result, which is not what we desire. Tip: This illustrates a common error for beginners and veterans alike. The error message probably won’t be helpful! So keep this frequent error in mind, both when you’re writing code and viewing cryptic error messages.
We can then call our extractNonemptyWords function on each line of
the file, say in a loop. We’ll do this in the next section.
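
As a rough preview only, and not necessarily the approach the next section will take, such a loop might look something like this. The vector lns is a made-up stand-in for abt, and the names nWords and i are ours:

```r
extractNonemptyWords <- function(s)
{
   z <- strsplit(s,' ')[[1]]
   z[z != ""]
}

lns <- c("", "What is R?", "   R is a language")   # stand-in for abt
nWords <- 0
for (i in seq_along(lns)) {   # i takes on the values 1, 2, 3
   # add the number of nonempty words on line i
   nWords <- nWords + length(extractNonemptyWords(lns[i]))
}
nWords   # total number of words: 0 + 3 + 4 = 7
```
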
Your Turn: Write a function with call form delNAs(x), that returns x with NAs deleted.