Note: On Unix-family systems such as Linux, what Windows calls a folder is called a directory. You will frequently see this term in Mac discussions as well. (macOS is a Unix-family system.) We will typically use the term directory here, as that is what R uses.
In assembling a dataset for my `regtools` package, I needed to collect
the records of several of my course offerings. I started in a directory
that had one subdirectory for each offering. Within an offering's
subdirectory, in turn, there was a file named `Results`. As an
intermediate step, I wanted to find all such files, placing the text for
each one in an R list `resultsFiles`. Only some specific columns of each
file were to be retained. (The discussion here is a slightly adapted
version.)
The chief R functions I used were:
- **`list.dirs()`:** Returns a character vector with the names of all the
  directories (i.e. subdirectories) within the current directory.
- **`dir()`:** Returns a character vector with the names of all files
  within the current directory.
- **`%in%`:** Determines whether a specified object is an element in a
  specified vector.
- **`setwd()`:** Changes to the specified directory.
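
To see these four functions working together, here is a small sketch run on a throwaway directory tree. The directory names (`offeringsDemo`, `fall`, `spring`) and the variable names are made up for illustration; only the `Results` file name comes from the discussion above.

```r
# Build a small throwaway directory tree (hypothetical names) under tempdir()
root <- file.path(tempdir(), "offeringsDemo")
unlink(root, recursive = TRUE)             # start clean if this is rerun
dir.create(file.path(root, "fall"), recursive = TRUE)
dir.create(file.path(root, "spring"))
writeLines("some text", file.path(root, "fall", "Results"))

startDir <- getwd()      # remember where we began
setwd(root)              # move into the demo tree

subdirs <- list.dirs(recursive = FALSE)    # first-level subdirectories only
print(subdirs)

filesInFall <- dir("fall")                 # names of the files inside fall
print(filesInFall)

hasResults <- "Results" %in% filesInFall   # is "Results" among them?
print(hasResults)

setwd(startDir)          # return to where we began
```

Note that `dir()` is given a path here, whereas `getData()` below calls it with no argument after changing into the directory; the two approaches list the same files.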
Here is the code:

    getData <- function() {
       currDir <- getwd()  # leave a trail of bread crumbs
       dirs <- list.dirs(recursive=FALSE)
       numCourseOfferings <- 0
       # create empty R list, into which we'll store our course records
       resultsFiles <- list()
       for (d in dirs) {
          setwd(d)  # descend into d directory
          # check if there is a Results file there
          fls <- dir()
          if (!('Results' %in% fls)) {  # not there, skip this dir
             setwd(currDir)
             next
          }
          # ah, there is such a file; increment our count
          numCourseOfferings <- numCourseOfferings + 1
          # open it
          resultsLines <- readLines('Results')
          # delete the comment lines; look at 1st character in each line
          resultsLines <- delComments(resultsLines)
          resultsFiles[[numCourseOfferings]] <- extractCols(resultsLines)
          setwd(currDir)
       }
       resultsFiles  # return all the grades records
    }

Before we go into the details, note the following:

- The code is written in a top-down manner. Much of the work of
  `getData()` is offloaded to other functions (code not shown),
  `delComments()` and `extractCols()`.
- There are lots of comments!

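
Since the two helper functions are not shown, here is a purely hypothetical sketch of what they might look like, just to make the top-down structure concrete. Everything in it is an assumption: that comment lines begin with `#`, that fields are separated by whitespace, and that the first two fields are the ones to keep. The actual helpers may differ in all of these respects.

```r
# HYPOTHETICAL helpers: the real code is not shown in the text.
# Assumed here: comment lines begin with '#', and the fields in a
# line are separated by whitespace.

delComments <- function(lines, commentChar = "#") {
   # keep only the lines whose 1st character is not the comment character
   lines[substr(lines, 1, 1) != commentChar]
}

extractCols <- function(lines, cols = 1:2) {
   # split each line into fields on runs of whitespace
   fields <- strsplit(trimws(lines), "[[:space:]]+")
   # retain only the requested fields, giving one matrix row per line
   do.call(rbind, lapply(fields, function(f) f[cols]))
}

# made-up sample data
demoLines <- c("# name score letter", "ann 91 A", "bob 78 C")
cleaned <- delComments(demoLines)
print(cleaned)
print(extractCols(cleaned))
```

With helpers along these lines in place, `getData()` itself stays short and reads almost like an outline of the task.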
Now, consider the line

    dirs <- list.dirs(recursive=FALSE)

As mentioned, `list.dirs()` will determine all the subdirectories
within the current directory. But what about subdirectories of
subdirectories, and subdirectories of subdirectories of subdirectories,
and so on? Setting `recursive` to `FALSE` means we want only
first-level subdirectories.
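
The effect of `recursive` can be checked on a small throwaway tree. In this sketch the directory names (`recursiveDemo`, `a`, `a/deeper`, `b`) are invented, and `full.names = FALSE` is set only to keep the printed names short.

```r
# Throwaway tree with one nested subdirectory (hypothetical names)
root <- file.path(tempdir(), "recursiveDemo")
unlink(root, recursive = TRUE)             # start clean if this is rerun
dir.create(file.path(root, "a", "deeper"), recursive = TRUE)
dir.create(file.path(root, "b"))

# first-level subdirectories only
firstLevel <- list.dirs(root, recursive = FALSE, full.names = FALSE)
# every level, so the nested a/deeper shows up as well
allLevels <- list.dirs(root, recursive = TRUE, full.names = FALSE)

print(firstLevel)
print(allLevels)
```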
So, the line

    for (d in dirs) {

will then have us process each (first-level) subdirectory, one by one.
When we enter one of those subdirectories, the line

    fls <- dir()

will determine all the files there, storing the result as a character
vector `fls`.
Then, as the comment notes, the lines

    if (!('Results' %in% fls)) {  # not there, skip this dir
       setwd(currDir)
       next
    }

will, in the event that there is no `Results` file in this
subdirectory, skip this subdirectory. The R keyword `next` says, “Go
to the next iteration of this loop,” which here means to process the
next subdirectory. Note that to prepare for that, we need to move back
to the original directory:

    setwd(currDir)

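
Why the move back matters: the names returned by `list.dirs()` with its default arguments are relative paths such as `./a`, so they are only meaningful from the directory in which they were generated. A small sketch, with invented directory and variable names, illustrates this.

```r
# Throwaway tree with two sibling subdirectories (hypothetical names)
root <- file.path(tempdir(), "nextDemo")
unlink(root, recursive = TRUE)             # start clean if this is rerun
dir.create(file.path(root, "a"), recursive = TRUE)
dir.create(file.path(root, "b"))

startDir <- getwd()
setwd(root)
topDir <- getwd()
dirs <- list.dirs(recursive = FALSE)       # relative paths such as "./a"

setwd(dirs[1])                             # now inside the first subdirectory
visibleFromInside <- dir.exists(dirs[2])   # second path looked up from here
setwd(topDir)                              # move back up, as getData() does
visibleFromTop <- dir.exists(dirs[2])      # same path looked up from the top

print(c(visibleFromInside, visibleFromTop))
setwd(startDir)                            # restore the original location
```

From inside the first subdirectory the second one cannot be found under its relative name, so a `setwd(d)` in the next iteration would fail if we had not first returned to the top.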
On the other hand, if this subdirectory does contain a file named
`Results`, the remaining code increments our count of such files,
reads in the file, removes its comment lines, extracts the desired
columns, and assigns the result as a new element of our
`resultsFiles` list.
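
The pattern of growing a list with a separate counter can be seen in isolation in the following sketch. The data and names here are made up; the point is that the counter advances only for items we keep, so the list has no gaps even when some iterations are skipped.

```r
records <- list()      # empty list, as in getData()
count <- 0             # number of items stored so far
for (txt in c("first", "second", "third")) {
   if (txt == "second") next         # mimic skipping a directory
   count <- count + 1
   records[[count]] <- toupper(txt)  # assigning at index count grows the list
}
print(length(records))
print(records[[2]])
```

Had we indexed the list by the loop position instead, the skipped item would have left an empty slot in the middle.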