Authors: Helen Lockstone, Matthieu Miossec, Ben Wright
Bioinformatics Core, Wellcome Centre for Human Genetics
This course is hosted at: https://www.well.ox.ac.uk/bioinformatics/training/MSc_GM_2022/CM4.1_Programming_in_R/CM4.1_Programming_in_R.html.
This course has workshop content intended to be worked through using RStudio. The course tutor will lead work through that content during the session, but you are encouraged to work at your own pace and read ahead if you feel comfortable doing so.
The course is intended to start at the very beginning, for people with no previous programming experience, using R or otherwise. If you're finding it at too basic a level, feel free to skip ahead or use this time for something more useful to you! Some suggestions for more R material are at the end of this document.
R and RStudio are available for a variety of platforms, including Windows, macOS and Linux. R is open source software. R Studio is open source software, supplemented by non-free add-ons.
These should already be installed and ready.
This workshop will introduce you to the R environment, with the aim of becoming familiar with how commands are written and structured and how to perform basic tasks associated with data analysis using R.
R is, at its heart, a command line language like bash. Text commands are entered and text output is printed to the console. Similarly, you can write R scripts to be executed when you need them. R supports stepping through the commands you have previously entered with the up and down arrow keys, and will attempt to complete function and file names whne you press the tab key, just like bash.
However, unlike bash, R also keeps data in its own internal store rather than only keeping data in memory while the command is running. R can also open new windows to display plots it has created, or show help files.
When you close R, it will ask you whether you want to save the workspace. If you say yes, then R creates a file called '.Rdata' (which is a hidden file on most operating systems by default). This file contains all of the data you had stored in your R session. This makes is easy to pick up where you left off, but be aware that this file can become very large if you load a lot of data into R.
RStudio is an integrated development environment (IDE) that wraps around R and makes a lot of common tasks much more convenient. Launch RStudio to open a new R session.
By default, RStudio displays 4 different panels. If there's only one panel on the left, then one of them is currently hidden – you can make it visible by opening a new R script file from the 'File' menu.
images/RStudio_screenshot.png
The bottom left panel is R itself, in the tab 'Console'. If you were running R without RStudio, this is exactly what you would see.
The >
prompt is telling you that R is ready to
receive new commands. You can type R code here and press return to have
R run that code. R will print any output from that command and then show
the >
prompt again to say it is ready for new input.
Note that something may have happened 'behind the scenes' – such
creating or altering a data object. It is important to know what each
command has done and it was as intended.
Two useful things to know about this panel are that you can use the up and down arrows to scroll through previous commands – very useful to fix an error without retyping the entire line! – and you can use the tab key to auto-complete commands, variable names and file names.
The other tabs you don't need to use now.
Try typing a simple command into the R console (remember to press Return):
6 * 7
You'll see that it will give you the answer. The [1]
is
a little cryptic for now, but that will be explained later. See how it
shows the >
prompt again afterwards.
The top left panel is a simple text editor (like Notepad), intended to be used to write R scripts. The editor can have multiple files open at a time.
This text editor is integrated into RStudio and has several convenient features. If you click on the green 'Run' button, it will copy the current line of code into the R window and run it. If you highlight a portion of code, the same button will copy and run that selection. There is also a 'Source' button, what will copy the entire file into the R panel and run it.
Try this by typing this line into the editor:
6 * 9
and clicking 'Run'.
The top right panel contains information about the current R session. The 'Environment' tab lists all the data objects currently stored in R. Each entry is an 'object', in R's terminology. The 'History' tab shows all the commands that the main R panel has executed so far. This is the same list you can browse with the arrow keys in the R panel. The other tabs here you don't need to worry about.
The bottom right panel is used to show a variety of things in a tabbed interface. 'Files' allows you to browse your computer's directories. 'Plots' displays any plots you have made using R in the current seesion. 'Help' is where R will display its help files.
R is a very interactive environment; commands can be entered into the console one by one, and each is interpreted and executed by R in real time. R is a high-level programming language, meaning that it is fairly human-readable.
If a command is not valid in the way it is constructed (its syntax), R will print an error message to the screen. These can sometimes be hard to interpret but particularly common culprits are simple typing mistakes or quotes and brackets, whether they are in the wrong place, missing, or not in pairs (e.g. missing a closing bracket).
You can type the R commands directly, which is useful at first to get used to the syntax (structure) of commands. However, you can use RStudio's editor to copy commands into the console if you prefer. This way, it's easier to go back and run a command again, or change it slightly first.
If you see an error message, first double check that the command is entered exactly as shown. If you can't find the problem, please ask for help. The advantage of copying the commands into the text editor panel of RStudio means that you can save the file afterwards to have a record of what you have done and can easily re-run again in future.
To do anything useful in R, we need to use objects to hold data and perform various operations on them. The terms object, variable and data structure can all refer generally to 'objects' created in R.
Although variable is a widely used programming term and would be the preferred term in certain situations, we use 'object' as a general term throughout to refer to any of R's data structures. These include vectors, factors, matrices, data frames and lists. Don't worry about the differences between those things for now.
The editor also understands some other R-related files. This includes a '.rmd' file, which is a special kind of R script where you can put pieces of R code in a text document.
Type the following into the console (bottom left) in RStudio:
download.file("https://www.well.ox.ac.uk/bioinformatics/training/R_Intro_2022/R_Introduction_to_Basic_Features.rmd","./R_Introduction_to_Basic_Fetures.rmd")
Now go to the 'File' menu and select 'Open', then navigate to and select that file (it's likely to be in 'Downloads'). You'll see this entire document appear in the top left panel. The snippets of R code each have their own little run button (the green arrow on the right) that lets you run that section (called a 'chunk') with one click.
Now you have this course open you can follow it from inside RStudio. Feel free to work through the material at your own pace. As well as example code snippets to run, there are questions to think about and try your own R code out on. You're not expected to answer these questions before moving on to the next section.
The following few commands give some sense of how R stores
information in objects and how simple manipulations of data can be
performed. The #
symbol indicates a comment. We'll
sometimes use them here to show what R's output is without needing to
run the command.
The first example assigns the value 1 to a new object we create and
name x
.
x <- 1
The assignment operator is <-
and running this
command creates a new object named x
in R's memory. If you
look in the 'Environment' panel in the top right, you'll see that an
entry for x
has appeared. To inspect the contents of the
new object in the console, we simply type its name:
## Inspect the contents of the new object
x
## [1] 1
The #
symbol is a comment. R will ignore everything
after that symbol. It's being used here to show the expected output of
the command.
Note that the contents of an object will be overwritten – without any warning message – if later assigned a different value:
x <- 4
# x now contains the value 4
x
## [1] 4
We can also perform operations directly on the object (note the object itself does not change):
x * 2
## [1] 8
# x still contains the value 4
x
## [1] 4
## unless we were to re-assign the output back to the same object:
x <- x * 2
# x now contains the value 8
x
## [1] 8
Question How would you get R to calculate 3x - 2?
R is case sensitive so x
and X
are
different:
X
## An error
Trying to run this command will produce an error message:
Error: object "X" not found
. Object names are chosen by the
programmer – informative names are helpful for several reasons. You can
use capitalization, _
or .
to separate parts
of an object name but they cannot contain spaces, nor start with a
number. To avoid confusion or potential issues, it is also best not to
give them the same name as an R function, which have their own defined
names. An object named raw_data
, raw.data
,
rawData
(or even d.raw
for minimal typing!) is
fine, but trying to assign a value to a variable named
raw data
will give an error because R cannot parse it
correctly.
Elsewhere R ignores whitespace so the following commands are equivalent:
x<-4+3
x <- 4 + 3
Now we will assign a set of numbers to our variable x, which R will store as a vector. A vector is a series of different data values.
x <- c(1, 2, 3, 4, 5)
x
## [1] 1 2 3 4 5
The [1]
part of the output is R's way of displaying
vectors. The number in square brackets is the index of the first entry
in the line. If the output is long enough that it spans more than one
line, this makes it easier to work out the index of entries on that
line. Note that in R indices start at 1
(instead of
0
as in most other programming languages).
Question How might you create the vector 2, 4, 6, 8, 10 from the object x?
If we forget the closing bracket before pressing enter, then R can
detect that the command is incomplete. It gives us a different prompt:
+
. This indicates that R is waiting for you to complete the
command.
x <- c(1, 2, 3, 4, 5
## + appears as the prompt
)
If we can't simply continue our command, use Esc or control-C to
cancel the input so far and return to the >
prompt and
start again.
Rstudio and some other text editors highlight different parts of the syntax in different colours and automatically close brackets and quotation marks to help eliminate typing mistakes.
Objects can be created in many different ways and hold different kinds of information. Unlike other programming languages, there is no need to initialise a variable or object in R (i.e. define it before first use) – it can simply be created and used directly. R also automatically decides which of its data structures and types are most appropriate for the data given, rather than being explicitly specified by the programmer.
We'll work through some examples and look at ways to access or manipulate the data contained within an object. Be aware that the type (class) of an object and data it contains (numeric, character etc) can affect how it is treated by R.
There are many shortcuts in R to avoid tedious or error-prone steps. When we created our small example vector containing the numbers 1 to 5, we issued this command:
x <- c(1, 2, 3, 4, 5) # the c function tells R to concatenate these 5 numbers
## equivalently we could use this command
x <- 1:5 # handy if we needed a much longer vector such as 1 to 100, or 1 to 100000
The :
symbol is special syntax that tells R to make a
sequence of whole numbers from one value to another. We can also put
together non-consecutive strings of numbers or a mixture
x2 <- c(1, 3, 5, 7, 9)
x3 <- c(1:5, 7, 9, 10:15)
If we need to create a sequence of numbers, the function
seq
is very useful.
seq1 <- seq(from=1, to=99, by=2)
seq2 <- seq(from=0, to=1, by=0.01)
You can look at all of these objects in the 'Environment' panel to see what you get when you use these commands.
Question How might you create a sequence of numbers from 100 to 1? It can be done with the
seq
function and using:
syntax; can you find both ways?
Functions are defined blocks of code that perform a particular task.
R has many hundreds of in-built functions for common tasks, and they can
be used by 'calling' the function by its name. Above, we used the
function seq
; the function name is followed by any
arguments in parentheses – the arguments determine exactly what the
function does and enable it to be useful for many situations. By
changing the arguments, we changed the sequence of numbers that was
produced. Some arguments are required and others are optional or may
have default settings defined in the function. Documentation about the
usage of any R function is given in the corresponding help page:
help(seq)
help
is, itself, a function that opens up the help
file for a given names function. Note that the help file is opened in
the lower right panel of RStudio.
Vectors are one-dimensional objects; in the case of the object
x
we created earlier, it has length 5. There is an in-built
R function called length that we can use to check how long any given
vector object is:
length(x)
## [1] 5
## If we change what is assigned to x, the length of the vector is automatically adjusted:
x <- 1:10
length(x)
## [1] 10
Vectors are R's primary object type and many computations in R are highly efficient because they operate on the whole vector at once, rather than element by element.
Vectors can contain numeric or character data (or both). We can
create a new vector, y
, containing the letters 'a', 'b',
and 'c'.
y <- c("a", "b", "c")
Note how x
and y
are displayed differently
in the 'Environment' panel. We can also see how R has automatically
treated them differently by checking the class
of the
objects directly:
class(x)
## [1] "integer"
class(y)
## [1] "character"
R has as family of functions as.integer
,
as.character
, and so on to convert from one class to
another.
Question Can you think of value you can give an object so that it's class is neither 'integer' nor 'character'?
Now try running the following command:
y <- c(a, b, c)
## An error
What do you think R has tried to do and why does it result in an error message?
Square brackets ([
and ]
) are used to
access specific elements or subsets of a vector, factor, matrix or data
frame. Let's create a new vector as an example:
x <- c(1:5, 10:14)
## now extract 3rd element
x[3]
## or extract alternate elements
x[c(1, 3, 5, 7, 9)]
## or extract subset of elements
x[3:6]
The second and third examples above show how R can be powerful by
working with vectors. Instead of having to use square brackets to
extract each element of the vector, and then using c
to
combine them into a new vector, we use a vector inside the
square brackets and R interprets that as an instruction to return a
vector of those elements of x
.
Question How would you return the elements of
x
in reverse order? How would you extract the elements ofx
where the index is a square number?
Question Can you predict what R will do in each of these cases:
x[11]
,x[-3]
?
R will decide the most appropriate way to store the data it is provided with, including when you mix different classes in a single vector. To give more examples of how data is interpreted by R, run the following and note the results.
x2 <- c(1:5, 6.5)
class(x2)
## [1] "numeric"
x3 <- c(1:5, 6.5, "a", "b", "c")
class(x3)
## [1] "character"
This gives some idea of R's internal rules. Because the way data is being handled by R is important for both performing computations correctly and the source of many error messages, it is useful to be familiar with the common data types. Some functions, such as computing a mean for example, require numeric data objects to operate on. What happens when you try to find the mean of x2 and x3?
mean(x2)
mean(x3)
There are hundreds, probably thousands, of in-built functions in R. Some you will use very often and others rarely or never. There are always several ways to do the same thing in R, using closely-related functions.
Examples of the functions we have used so far include
length
, mean
, and class
. In the
case of length
, the argument supplied is the name of the
object we wish to find the length of. We can check the length of another
vector object simply by changing the argument.
The length
function is only applicable to vectors (or
factors) and does not work on other data types such as matrices or data
frames, which are 2-dimensional. Often, data is stored in a table format
(e.g. in Excel), and commonly handled as a 'data frame' in R. We will
introduce these ideas in the next section.
There are many thousands of additional functions available to R
through something called 'packages'. Each package is a self-contained
parcel of functions created for a specific application. For example,
'Bioconductor' is a repository of many packages related to handling data
in bioinformatics. Because it would be confusing to always have all
these functions available (and because some packages may end up using
the same function names) you need to specifically tell R to make a
package available. This is done using the install.packages
and library
functions:
## Install the package for your R environment
install.packages("ggplot2")
## Make the package available to use
library(ggplot2)
The first function might prompt you to choose from a list of 'mirrors', which are just different places on the internet where the packages are stored. It doesn't much matter which you choose, although choosing one close to where you currently are is recommended.
Note that when installing the package, you need to treat the package
name as character data, inside double-quotes. Once the package has been
installed, R is aware the package exists and can interpret the name
without quotes when you use library
. Installing packages is
something you only need to do once per computer (in theory, at least).
You must library
in each new R session before trying to use
any functions from the package; as this is what tells R to look in that
package for function names (again, to try and handle duplicate names in
different packages gracefully). If you save your R session and come back
to it later, you will need to use library
again to make
those functions available – which packages you have used
library
on is not saved with the data.
An important concept before we get started is the working directory. Your R session is active in one particular directory on your computer – by default your 'home' directory, as defined by the operating system.
When you save an R session, the .RData
file is saved
there. When you start a new R session in a particular directory, any
.RData
file is re-opened in that session. If is also
possible to load and save these session files from within RStudio.
It is useful to set the working directory to the place where you have the files you want to work with. It means you have to do less typing to load the save files and you can keep the R version of the data alongside the original files.
We need to set a working directory for the rest of the workshop. From
RStudio's Session menu, select 'Set working directory', and then 'Choose
directory...'. Create a new directory to work in and select that. You
can change the working directory using the R function
setwd
, but it's more convenient to use RStudio's
interface.
You will also need to download two data files by entering the following commands:
download.file("https://www.well.ox.ac.uk/bioinformatics/training/R_Intro_2022/inflammation_data.csv", "./inflammation_data.csv")
download.file("https://www.well.ox.ac.uk/bioinformatics/training/R_Intro_2022/sample.csv", "./sample.csv")
Sometimes you are testing things out and creating lots of objects –
that's fine but it's always good to start a new session when running or
checking your final code to be sure previous objects do not affect it in
any way. Sessions can also be cleaned up by deleting objects with the
rm
command:
rm(x3)
This is an excellent time to take a short break. Stretch your legs, get a drink.
Earlier we introduced the R software environment, some key features of the R programming language and how to start using it. We will now do some practical exercises working with example data to perform typical tasks. First, some important acknowledgements:
The tutorial that follows is in part adapted from the Software Carpentry Foundation Programming with R, specifically the Analysing Patient Data tutorial. The Software Carpentry material is available for re-use under a Creative Commons License and we are grateful to the original authors.
We set up our R session for this practical before the break. You can check the current working directory with the following command:
getwd()
The output (directory path) should match the location you are working
in today, and this folder should also contain the files named
inflammation_data.csv
and sample.csv
. If you
get an error while typing the command into the terminal, first check the
command matches that in the tutorial exactly and that you haven't
accidentally missed an earlier command out. Pay particular attention to
lower/upper case letters, underscores, dashes or dots in function or
object names, and that brackets and quotes are correctly paired. If you
see an error message when running chunks in the R markdown document, it
is probably that you accidentally changed that code. Close the file
without saving then reload it. If you can't spot the problem or have a
question at any point, please don't hesitate to ask!
.R
suffix lets RStudio know that the file is meant to be an R script.This is the usually the easiest way to work; if you need to close your session and return to it another time, it is easy to run the code again from start to finish. All of the code stored in a script can be executed in R from start to finish with the command:
source("script_name.R")
## This file doesn't exist so will cause an error
You may recognise the function name source
from the
button in the editor panel that does the same thing.
The first thing we need to do is load or read in the data from our
files so it is accessible in the current R session. There are a few
possible ways to do this but we will use the function
read.csv
because our files are saved in 'comma-separated
values' or csv format. To find out details of how to use this function
you can search RStudio's Help (bottom right panel) or type:
help(read.csv)
The help page shows us the arguments for this function and their
default values where applicable. For example, we see
sep=","
, which means the fields in a row will be separated
on commas, and header=TRUE
, which means it is expecting the
first row of the file to contain names for each column. The first file
we will work with is the one named 'inflammation_data.csv'. If we
inspect this file in Excel or a text editor (by opening it directly from
its location on your computer) we see there are no column names, just a
large set of numerical values. Therefore, we need to explicitly include
the header argument in our command, changing it to FALSE to over-ride
the default behaviour of the function. This is an example of how
arguments modify a function's precise behaviour, rather than requiring
two separate functions to exist for files with/without header rows.
inf.data <- read.csv("inflammation_data.csv", header=FALSE)
Note: if you see an error message similar to 'No such file or directory' when trying to read in a file, it is likely that either (i) the file is not located in the current working directory, or (ii) there is one (or more) typos in the filename.
It's useful to break this command down a little more.
We have given a name for a new object, inf.data
, in
which to store the contents of the file 'inflammation_data.csv'. Our
object name is descriptive without being too long, shortening
'inflammation' to 'inf' for our convenience. We use the in-built R
function read.csv
, and provide two arguments:
header=FALSE
, indicating to R that our file does not
have a header rowThere are many other arguments to the read.csv
function
to further refine its behaviour but these are either optional or the
default settings are fine for most situations. By running this command,
R creates the object inf.data
and information about it
appears in the 'Environment' panel of RStudio.
Notice that, unlike the vectors we've created before, that it doesn't
show all the data stored in the object. It is detailed as
60 obs.
(observations) of 40 variables
and if
you hover the mouse pointer over the name, inf.data
, it
indicates that the object is a data.frame
, which is how R
stores tabular data by default.
If you double-click on the object name or click on the spreadsheet
icon to the righthand side, the contents of the object are loaded in the
top-left panel in a new tab. The columns have been automatically named
by R as V1
through to V40
, as column names
have to start with a letter. The rows are simply numbered – the only
restriction to row names is that they must be unique.
What do you think will happen if you run the command above without
including the header argument? Try it by saving the contents into a new
object called test
and compare this to
inf.data
:
test <- read.csv("inflammation_data.csv")
Another object named test
now appears. View the contents
of that object and compare it to inf.data
.
This is a good example of how easily something unwanted can happen in R and the importance of checking your objects contain what you intend them to. Any mistakes can simply be corrected by re-running the command e.g. with the appropriate header argument, and overwriting any previous version of the object.
Our original file contained rows and columns of data, and R has suitable 2-dimensional data structures to store such data: matrices and dataframes. These can both be thought of as tables of data, analogous to an Excel spreadsheet. Matrices require all columns to be of the same type, while data frames can have columns of different data types. Given that experimental data is often a mixture of numeric values (e.g. measurements) and associated descriptive information, data frames are a very commonly used data structure in R.
NB while it is possible to create a matrix object with mixed data types, R will use its internal hierarchy of data types to choose one that is applicable to all of the data – often this means numerical columns get converted to character strings, and certain functions may not perform as expected if this is not noticed, or they may produce an error message.
In this case, a data frame object has been created:
class(inf.data)
We can check how each column of data has been treated by R, e.g. for the first column:
class(inf.data[, 1])
In fact, as all the columns contain data of the same type, R
could equally well store this data as a matrix object. A data frame has
been created because the functions read.csv
and
read.table
are specifically designed to deal with mixed
column classes and produce data frames by default. Another function
scan
can be used to read in matrices, especially large
ones.
It is worth noting that some functions operate on matrix objects, and so converting between classes is sometimes needed. 'matrix' is a class of data that R recognises, separate to 'integer' or 'numeric'.
inf.data <- as.matrix(inf.data)
class(inf.data)
We will continue with the matrix form of this object for now, and
load a mixed dataset later. Most operations on either kind of
2-dimensional object are the same. For example, we can find the
dimensions of a matrix or a data frame with the dim
function.
dim(inf.data)
The output of dim
is printed to the screen and shows the
number of rows the object contains, followed by the number of columns
(the convention is always rows, then columns but this can be hard to
remember at first as there is no indication of which way round it is in
the syntax).
If unsure, the functions nrow
and ncol
will
return the number of rows or columns respectively:
nrow(inf.data)
ncol(inf.data)
This particular file suffers from the lack of any labels to annotate what data is recorded in the rows and columns. The Software Carpentry tutorial provides the following information:
“We are studying inflammation in patients who have been given a new treatment for arthritis. Each row holds the observations for just one patient. Each column holds the inflammation measured in a day, so we have a set of values in successive days."
Our object has 60 rows and 40 columns, so we infer from the information above that there are 60 patients, and 40 days.
Again we see a way for mistakes to easily creep into data analysis – here we have to rely on information given to us second-hand to know what is what. What if that information were wrong? Are there any checks we can make ourselves to be sure patients are in rows? We are not told how many patients were included so simply checking the number of rows won't help. And what if there were 50 patients and measurements taken over 50 days?
Even with careful scrutiny it would be hard to know how the data are presented (patients in rows or columns) from the data alone. We could perhaps make some plots to help us, or we might spot the zero values in the first column. Scrolling down the object display in the top-left panel or displaying the first column in the console confirms they are all zeroes, and the values in each row tend to rise across the first few columns. We may be reassured by this that the patients are indeed in the rows, since we might expect inflammation to rise over time, and an individual recording 0 on every single day might be unlikely (though not impossible).
It would be prudent to add our own row and column names to reduce the chance of making a mistake later when dealing with this data:
rownames(inf.data) <- paste("Patient", 1:60, sep="_")
colnames(inf.data) <- paste("Day", 1:40, sep="_")
This introduces the very useful and versatile function
paste
. Note that adding row and column names does not
change the size of the data object, but we can see them displayed by
reloading the object. The row and column names are not part of the data
itself. They are similar to the alphabetical columns and numbered rows
in an Excel spreadsheet.
Earlier we looked at accessing elements of a one-dimensional vector object. For matrices and dataframes, a similar approach with square brackets is used:
object_name[rows, cols]
By specifying the rows and columns of interest, an object can be subset in a variety of ways to inspect or extract different parts of it.
inf.data[1, 1] # this pulls out the data value in the first row of the first column
inf.data[30, 20] # any single entry can be extracted by specifying the row and column
Think about how might you select the data in the first 5 rows for the first 5 columns.
## If we need to select non-contiguous portions of the object, we'll need the help of c():
inf.data[c(1, 3, 5), c(10, 20)]
## If you want to display all columns for selected row(s), leave blank space after the comma:
inf.data[5, ] # All columns for row 5
## Or blank space before the comma to select all rows for given column(s):
inf.data[, 1:5] # all rows, columns 1 through 5
Question Can you infer what will happen with
inf.data[,]
? Can you produce the matrix with the patients in reverse order?
When you select a single column from a data frame, R will 'helpfully' turn it into a vector:
test <- inf.data[, 2] # Select the second column
class(test)
This can cause errors if you were expecting test
to
still be a data frame in a later command.
We added column names to our object earlier. Rows and columns can
also be accessed by name (in double-quotes) inside square brackets.
These methods work for any two-dimensional class of data in R, but data
frames specifically have another: $
is a shorthand way of
referring to a named column. Both alternatives in the snippet below
return the same vector.
inf.data[, "Day_3"]
tmp <- as.data.frame(inf.data)
tmp$Day_3
Question Suppose you want to determine the maximum inflammation for patient 5 across days three to seven. To do this you would extract the relevant subset from the data frame and calculate the maximum value. Which of the following lines of R code gives the correct answer? 1.
max(inf.data[5, ])
2.max(inf.data[3:7, 5])
3.max(inf.data[5, 3:7])
4.max(inf.data[5, 3, 7])
We can perform many simple analyses of the data by applying functions such as max, min, mean, or summary to our data object. We might want to determine the maximum value per patient or the average value per day. The following examples illustrate how this can be done extremely efficiently in R, starting with an approach that is the opposite (and definitely not recommended!).
Suppose we want to find the maximum inflammation score for each patient across the 40 days of measurements. Let's start by calculating it for patient 1.
Extracting the data for patient 1 (i.e. the first row) is the first obvious step, and perhaps we decide it makes sense to store the values for this patient in a new object:
inf.patient1 <- inf.data[1, ]
We can then calculate the maximum value for Patient 1:
max(inf.patient1)
Although this seems reasonable enough, there are several issues:
If we did continue with this approach, there would be 60 new objects (all with very similar names), a high probability of having made a typing mistake somewhere (perhaps overwriting one patient's data with another), and a large set of results that we'd have to manually write down or transfer to an Excel spreadsheet – all of which is very messy and prone to error.
We can easily dispense with the intermediate step of creating a new object:
# max inflammation for patient 1
max(inf.data[1, ])
# or equivalently
max(inf.data["Patient_1", ])
These commands are the same as extracting the data for patient 1 as
we did earlier, but instead of printing to the screen or storing in a
new object, the command is used directly as an argument to the function
max
by enclosing in the parentheses.
Commands can be nested in this way to achieve multiple steps in a single line of code; too many commands in one line though can make it harder to work out what the code is doing, as well as increase the chance of the code not doing as intended – the location of brackets becomes vital.
We'd really like a way to this for all 60 patients without
duplicating the code 60 times. Loops are one option (not discussed here)
but the apply
function is the most efficient approach:
apply
allows us to repeat a function on all of the rows
(MARGIN = 1
) or all of the columns
(MARGIN = 2
) of a data frame simultaneously:
max_inf_patient <- apply(inf.data, MARGIN=1, max)
Similarly, we could compute the average inflammation per day with a single line of code:
avg_inf_day <- apply(inf.data, MARGIN=2, mean)
Comparing these two commands will help understand the
apply
function, which is not intuitive but highly
efficient. The arguments to apply
are:
MARGIN
, indicating whether to apply over rows (1) or
columns (2)We wanted to find the maximum inflammation score for each patient, so
we looked across the rows and used the max
function. To
modify the command to find the average inflammation per day, we switched
the MARGIN argument to 2 for columns, and gave the final argument as
mean
.
While the MARGIN argument is explicitly assigned above, R is
equally happy to infer from the shortened command
apply(inf.data, 1, max)
that the 1 should be assigned to
the second defined argument of apply
. You can also write
your own bespoke functions as required and use apply to run them over an
object.
We have also solved the final issue with our initial approach by storing the results in suitably-named objects for further work.
Question What functions other than
mean
andmax
can you use with theapply
function? Looking at the help files for the functions might suggest some.
length(max_inf_patient)
head(max_inf_patient)
head
is a function that shows the first six elements of
a vector or the first 6 rows of a data frame.
Another useful function is summary
. This returns the
minimum value, first quartile, median, mean, third quartile and the
maximum value, all very useful information to make an initial inspection
of your data.
summary(inf.data[, 1:4]) # for each of the first 4 days
Question What happens when you use
summary
withapply
?
We'll next read in data from another file to illustrate a few more features of data frames and how to work with them in R. In this case the file does contain a header row and the default arguments for read.csv are appropriate for this file so we only need provide the filename:
data2 <- read.csv("sample.csv")
head(data2)
This displays the first 6 rows, and we can see immediately that we have a range of different types of data in each column. Let's see how R has treated it (you can paste the following 4 lines as one block).
for(i in 1:ncol(data2)) {
print(class(data2[, i]))
}
Here, we've used a for
loop to iterate over each column
in the object data2
, and print to screen the class of each
column. The output tells us that columns 1:3 are treated as characters,
and most of the rest are treated as integers. Column 5 is 'numeric', the
basic data type in R for non-integer numbers.
Although columns 1 – 3 are currently characters, some functions in R that perform statistical analysis prefer a data type called 'factors'. Factors are useful when you have category data, where the character data can only come from a well defined set of possibilities. Factors can be useful when you want to make sure incorrect values don't creep in accidentally, because R will report an error if factor data tries to be changed to an illegal value.
We can specifically convert the Group column to a factor:
data2$Group <- as.factor(data2$Group)
class(data2$Group)
The as.factor
function returns a factor version of a
vector. However, as this doesn't change the original data2
object by itself, we need to assign that vector of factors back to the
column, replacing what was there before. Indices (and by extension, the
$
syntax) in R can be used when assigning values to
objects, not just when extracting them. You can see how using the
$
syntax makes this line easier to read.
Question If you used square brackets instead of the
$
notation for the above, how would you do it?
A very useful summary function is table
:
table(data2$Group)
table(data2$Gender)
This alerts us to the fact that data in the Gender column has not been entered consistently, which you might have already spotted from viewing the object in RStudio.
To fix this, we can make sure F
and M
are
used throughout; this involves determining which rows contain a
lowercase f
for example, and substituting with
F
. Similarly for the lowercase case m
. The
following code does each a different way.
data2$Gender <- gsub("f", "F", data2$Gender)
data2$Gender[data2$Gender=="m"] <- "M"
## checking we have modified the data as intended
table(data2$Gender)
The gsub
function replaces the first argument
"f"
with the second argument "F"
in the third
argument data2$Gender
wherever it appears, leaving the
other entries unchanged. As before, we need to assign this new vector
back to the original object.
The second line shows one of R's particularly useful features. We're
telling R to replace values in data2$Gender
, but only where
those values equal "m"
.
This is one example of how there can be multiple ways of doing a particular task in R. When deciding which method to use, in general, it is better to stick with the method you understand the best as that's the method where you are least likely to make a mistake. Some mistakes will cause R to generate an error, but the more dangerous mistakes are ones that don't result in an error but result in incorrect data.
Question Look at the help file for the function
ifelse
. Can you work out how you would to the above correction using that instead of either of those methods?
Visualising data is a vital part of statistical analysis, and R's
plotting capabilities are a key reason for its popularity. Here, we
introduce ways to make a few simple plots. Let's go back to our first
table of data and take a look at the average inflammation over time.
Recall that we already calculated these values above and saved them in
an object named avg_inf_day
. Plotting the values is done
with the function plot
:
plot(avg_inf_day)
## Default labels and settings are used but we can refine our plot with some additional arguments:
plot(avg_inf_day, main="Inflammation Scores Over Time", xlab="Day", ylab="Average_inflammation_score") # adding title and axis labels
plot(avg_inf_day, main="Inflammation Scores Over Time", xlab="Day", ylab="Average_inflammation_score", pch=4, col="red") # changing the plotting symbols and colour
## Similarly, we could plot the data per patient:
plot(max_inf_patient)
## Here, we might decide to use a boxplot instead:
boxplot(max_inf_patient, main="Maximum Inflammation Scores", ylab="Max_inf_score")
legend("topright", legend="n=60 patients", cex=0.8) # adding a legend
In RStudio, the 'Plots' panel in the bottom right contains a history of all the plots you've made using the console. If you run this code from the rmd file, then the plots will be displayed inside that file. In vanilla R, a new plot will replace the existing one unless you create a new window for it.
When we are happy with our plots, they can be saved to a file.
pdf("Inflammation_plots.pdf", onefile=T)
plot(avg_inf_day, main="Inflammation Scores Over Time", xlab="Day", ylab="Average_inflammation_score", pch=4, col="red")
boxplot(max_inf_patient, main="Maximum Inflammation Scores", ylab="Max_inf_score")
legend("topright", legend="n=60 patients", cex=0.8)
dev.off()
This will be saved to the current working directory by default so if
we check the folder, a new file named 'Inflammation_plots.pdf' should
have been created. The onefile=T
argument instructs R to
append additional plots to the same file and the dev.off()
command at the end closes the file connection. You can also export plots
directly to a pdf file from the RStudio 'Plot' panel.
Question Create some other plots based on this data using the examples as templates to work from, and save them as a new pdf file. Can you find a way of saving individual plots as .png files?
Bioconductor is a separate third-party repository of R code, specifically geared towards bioinformatics. Many of the packages in Bioconductor are also available via R's usual package installation method. However, Bioconductor has its own preferred installation mechanism which gets around some of R's more annoying limitations with version incompatibilities. In particular, the packages in Bioconductor are updated more swiftly following the release of a new version of R. To get started with Bioconductor, use the following command:
install.packages("BiocManager")
This installs the package used for managing Bioconductor tools. You can then install Bioconductor packages like this:
BiocManager::install("limma")
Loading Bioconductor packages this way is typically a longer process but more reliable. (If the console asks you if you want to update existing packages, press 'n' and 'return'.)
As you read the R help pages, you may not find them terribly helpful.
The R help
function is most useful for refreshing your
memory about specific functions you have used before. Unfortunately, it
is not very useful for learning the how to use the functions in the
first place.
There are several online forums used by R programmers, novice and expert, to get help and advice from their peers. Searching for your R problem will often give results from one of these forums.
One page 'quick reference' documents:
The Software Carpentry Foundation website contains many tutorials for learning a variety of programming languages, including R. There is also a series of domain-specific Data Carpentry courses, which focus on computational skills needed to handle and analyse data – tutorials are currently available for Ecology, Genomics, Geospatial Data and Social Sciences, with others in development.
In particular, the tutorials at the following links give further details and examples on some of the ideas already introduced or extend to other topics once you feel comfortable interacting with the R environment.
Please let us know if you have any questions and good luck on your programming journey!
We are happy to help anytime: bioinformatics@well.ox.ac.uk