Intro-R


Welcome
Welcome to this interactive tutorial. This tutorial acts as a short introduction to R and RStudio. In today's workshop you will learn some basics of using RStudio.
To start us off, what is your name? <<textbox "$name" "">>
Now, let's get started.
[[Proceed|Intro1]]
----
Table of Contents
The Table of Contents below is intended to help you navigate back to a point in the tutorial if you need to stop halfway through. You can click the 'back to start' ('<<') link at the bottom of each page to return to this page.
Preliminary Steps
[[Preliminaries: R and RStudio|Intro1]]
[[Quick Overview|Intro2]]
[[R is a calculator|Intro3]]
[[R is a scientific calculator|Intro5]]
[[R allows you to bind numbers into datasets|Intro7]]
[[The standard way to hold data in R is a 'dataframe'|Intro9]]
[[Selecting columns out of a dataframe|Intro11]]
[[Basic statistical tests and figures|Intro12]]
[[Finish|Intro14]]

<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_logo.png" alt="past" width="10%" height="auto"/> <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_studio_logo.png" alt="past" width="10%" height="auto"/>
Starting Off
Hi, $name. Before we get started, it's worth explaining what R and RStudio are and why we are using them.
R is a language for statistical computing. It was based on a language called 'S'. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. The first version was released in 1995. A stable version was made available in 2000.
Why use R?
We have a lot of good reasons for preferring R to other statistical programs like SPSS or GraphPad Prism.
* R is free.
* R is extraordinarily stable and can handle huge datasets. I've personally opened the entire US census data in R (all of it, for all censuses in one multi-gigabyte text file). No other program I tried was even remotely capable of this.
* R has a huge amount of support from generous-minded people. There is a veritable army of stats monkeys crawling all over it fixing bugs.
* Code can be saved and rerun easily. If someone questions the statistics in a ten-year-old paper, you can just go back and check your old R code. If you used a point-and-click program, there is no way you'll be able to easily reconstruct what you did ten years ago.
But R has its drawbacks
* R has a steep learning curve. Because it is a programming language, you need to learn the syntax of the language.
* The error messages are often esoteric and incomprehensible, especially if you are new to R. This can leave you feeling very lost when something goes wrong.
* Unfortunately, if you are already familiar with languages like Python or Ruby, this won't help you much. R feels like exactly what you would expect to get if two stats experts with no in-depth knowledge of programming languages decided to make up a computing language.
That said, R isn't too bad once you get used to it. It can seem scary at first, but before long you'll find that it starts to make (at least) some sort of sense.
RStudio
In this tutorial we will be using RStudio. R is a fully independent program and doesn't require RStudio, but it is hard to use on its own. RStudio is a 'wrap-around' for R. It adds a whole lot of functionality that improves our experience of using R. Luckily for us, RStudio is also free for personal use.
If you haven't already, you need to <a href="https://www.r-project.org/" target="_blank">download and install R first</a>, and then <a href="https://www.rstudio.com/products/rstudio/download/" target="_blank">download and install RStudio</a> before proceeding.
If you prefer, you can use the <a href="https://rstudio.cloud/" target="_blank">RStudio Cloud Service</a> instead (although be aware this may be moving to a paid-for-use model). The main reasons I prefer a downloaded version of RStudio are that I can work offline (i.e. in a remote location, or on fieldwork) and that the downloaded app is (generally) a bit less buggy than the online version.
Let's start with a quick overview of RStudio.
[[Next|Intro2]]
----
[[<|start]] [[<<|start]]

Quick Overview
RStudio is divided into four window spaces.
* The top-left is where code is written (and saved).
* The bottom-left is where code is executed (if you are using a standard R script).
* The top-right shows objects, datasets and other things in your working space. This is useful, because if you import something (like a dataset) and can't remember what you named it, you can check here.
* The bottom-right is where the help menu and graphs appear.
R Script versus R Notebook
You can choose to use either an R Script or an R Notebook to write and save code throughout this tutorial. I prefer R Scripts because I find them more stable and less prone to crashing. However, R Notebooks make collaborating on code in a team easier. It's just a matter of preference.
* To create an R Script select File, New File, R Script.
* To create an R Notebook select File, New File, R Notebook.
Here's what you should see if you create a Notebook.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R02.png" alt="past" width="100%" height="auto"/>
Here's what you should see if you create a Script.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R01.png" alt="past" width="100%" height="auto"/>
Step 1: Create a Script or Notebook
As per your preference, create either a Script or Notebook.
Step 2: Write a header
You can use hashtags (#) in R to add notes or headings that help keep your code clear. Anything after one (or more) hashtags on a line won't be read as 'code' by R. It's just a note for you. Type the following into your Script or Notebook.
### R INTRO ###
# This is a basic intro to R
Note that this block of text is shown in green here. I will be colour coding text to make it a bit easier for you to follow. Notes will always be in green.
Step 3: Save your file
So far, your file doesn't have any code in it, but that's fine. Let's save a copy. You can select File, Save. Alternatively, you can click on the little blue floppy disc save icon just above your Script or Notebook.
You can call this file anything you want. It could be $name R file or I_hate_R or anything you like. However, be sure to end the name with the correct extension.
* If you are using a script add .R to the end of the file name like this: this_is_my_R_file.R
* If you are using a notebook add .Rmd to the end of the file name like this: this_is_my_R_file.Rmd
Excellent. Once you have done that, you have a file to save your code to. We can start working through some of the basics of R scripting.
[[Next|Intro3]]
----
[[<|Intro1]] [[<<|start]]

R is a calculator
It's easy to get carried away with all the fancy things R can do, and forget that at its heart, R is a straightforward scientific calculator. Type the following into your script or notebook and run it.
3 + 8
You can run a line of code by placing your cursor on it, and then clicking the 'Run' button. Alternatively, you can highlight the code and use Command-Return (Mac) or Control-Enter (Windows).
Also, if you are using a Notebook, remember to create new chunks from time to time to keep your code separated into sensible portions. If you are using a Script, adding headings and notes using hashtags as you go is sensible.
Once you have run the above 'code' you should see the answer (11) appear either in the bottom-left window (Script) or under your chunk (Notebook).
The mathematical operations are all what you would expect. Try these.
11 / 2
20 - 4
13 * 7
(11 / 2)*5
You can also allocate numbers (or other things) to placeholders. The way we do this is with an assignment arrow (<-).
You'll notice now that we are using blue and black colour coding too. Blue will be used for anything that you can change or alter yourself, such as a number or the name of a variable or a dataset. Black will be used for basic syntactical things, like commas, multipliers, brackets or equal signs.
Now try this:
x <- 3
y <- 1.2
((6 / x)*5)^y
What was the answer? Write this to one decimal place. <<textbox "$q1" "">>
[[Check your answer|Intro4]]
----
[[<|Intro2]] [[<<|start]]
<<if $name is "cheat">> 15.8 <</if>>

Your answer was $q1 <<if $q1 is "15.8">>Correct! Great work, $name! [[proceed|Intro5]] <<else>>Hmm, that doesn't look right. Maybe [[try again|Intro3]] <</if>> ---- [[<|Intro3]] [[<<|start]]

R is a scientific calculator
You can use R to apply scientific functions to numbers, such as square roots or logs. Try these:
sqrt(31)
log(1.9)
Note that I have coloured these commands in orange. Both sqrt and log are 'functions' in R. Functions sit outside a bracket and apply to everything inside the bracket. Most of the statistical tools in R that you will use are functions of some sort or another.
We can 'nest' functions inside brackets if we want to. For example, an arcsine square root transformation is a standard transformation for proportions. We could write it like this:
asin(sqrt(0.8))
However, it is really easy to get confused with this sort of 'nested' coding. You can start to lose track of brackets. It is better to use the assignment arrow to drop a number (or anything else) into an object, and then work with the object. This is a better way to write the exact same operation as above.
x <- sqrt(0.8)
asin(x)
Here's a good example of why nesting can be confusing. The following code takes the inverse of 5, logs this, then squares the answer and then rounds this to two decimal places. It is confusing to read, even with colour coding.
round(((log(1/5))^2),2)
This second bit of code does the same thing, but uses assignment arrows to keep things simpler. It takes longer to write out, but it is much easier to follow step-by-step. Have a go at entering both of these segments of code into your Script or Notebook and run them. Check that they give the same answer.
x <- 1/5
y <- log(x)
z <- y^2
round(z,2)
What is the square root of 1085? Write the answer rounded to two decimal places. <<textbox "$q2" "">>
[[Check your answer|Intro6]]
----
[[<|Intro4]] [[<<|start]]
<<if $name is "cheat">> 32.94 <</if>>
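Rather than comparing the two printed numbers by eye, you can ask R whether the nested and step-by-step versions agree. This is only a minimal sketch: the object names nested and stepwise are made up for illustration, and identical() is a base R function that returns TRUE when two objects are exactly the same.

```r
# One-line, nested version
nested <- round(((log(1/5))^2), 2)

# Step-by-step version using assignment arrows
x <- 1/5
y <- log(x)
z <- y^2
stepwise <- round(z, 2)

# TRUE if both approaches produce exactly the same number
identical(nested, stepwise)
```

Storing each result in its own object is what makes this kind of check possible, which is another small argument for the step-by-step style.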

Your answer was $q2 <<if $q2 is "32.94">>Correct! Great work, $name! [[proceed|Intro7]] <<else>>That doesn't look quite right. Remember to round correctly. Best to [[try again|Intro5]] <</if>> ---- [[<|Intro5]] [[<<|start]]

R allows you to bind numbers into datasets
Although you will typically be importing data into R, you can just enter numbers and bind them into a dataset using the 'combine' function, c(). Let's imagine we had species richness counts for macroinvertebrates at eight locations in a stream (i.e. a count of the number of species caught at each site). We might want to bind these into a single dataset. Try this:
invert.species <- c(4, 5, 0, 0, 3, 3, 2, 1)
invert.species
A note on the 'namespace'
Programming languages like R have a 'namespace'. This is a list of all the names of things like functions. sqrt and log are part of the namespace. Unlike most languages, R doesn't have a very extensive protected namespace. Most languages won't let you just write over an important function. R will mostly let you do whatever you want. For example, if you try to write over the ANOVA function, aov, most languages would stop you, or at least give you a warning. R just shrugs and says 'I guess you know what you're doing'. This means that it is possible to cause all sorts of problems by writing over something you didn't want to. Some rules to help:
* Generally speaking, single letters like a, b, x, y, z are safe to use (c and t are also the names of base R functions, so they are best avoided).
* Most functions don't include underscores or full stops. Appending something on the end, like lizards_aov or lizards.aov, will usually help you avoid any problems.
* Just remember that you can always quit and restart the R session if something goes drastically wrong. Save your script before you quit out, then start R up again. It will return to the default settings.
Applying functions to a dataset
What do you think happens if you apply a square root to our species richness counts? Try it and see.
sqrt(invert.species)
Basic summary statistics
Here is a list of some useful and basic summary functions. You'll notice that Standard Error (which we use a lot in Biology) does not have its own basic function. You have to calculate it as the standard deviation divided by the square root of the number of observations. The stats people who run R don't much like Standard Error. The reason for this is that the mean plus or minus one SE is only about a 68.2% confidence interval for the mean. That is, we can only be about 68.2% confident that the true mean lies within one SE of the sample mean. That's not very high. We like SE in Biology because biological data is often messy and using 95% or 99% confidence intervals can make our graphs look depressingly uncertain with huge error bars.
mean(x) # Mean
median(x) # Median
max(x) # Highest number
min(x) # Lowest number
length(x) # Number of observations
var(x) # Variance
sd(x) # Standard Deviation
sd(x)/sqrt(length(x)) # Standard Error
summary(x) # Mean, range and quartiles (when applied to a set of numbers)
str(x) # The 'structure' of an object. Very useful for complex datasets.
is.numeric(x) # Asks, is something numeric? (true or false)
is.factor(x) # Asks, is something categorical? (true or false) In R categorical data (names, words etc) are called 'factors'.
boxplot(x) # Generates a quick boxplot of a single set of numbers.
Have a go at the above using invert.species and then answer the following.
What was the mean species count per site? Write the answer rounded to one decimal place. <<textbox "$q3" "">>
What was the variance of the species counts? Write the answer rounded to one decimal place. <<textbox "$q4" "">>
What was the standard error of the species counts? Write the answer rounded to one decimal place. <<textbox "$q5" "">>
[[Check your answer|Intro8]]
----
[[<|Intro6]] [[<<|start]]
<<if $name is "cheat">> 2.3, 3.4, 0.6 <</if>>
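Because base R has no built-in standard error function, one option is to wrap the formula above in a small helper of your own. The sketch below is illustrative only: the name standard_error and the counts vector are made up (they are not part of the tutorial data), and the helper assumes there are no missing values. It also shows one common way to build a 95% confidence interval from the SE, using qt() from base R to look up the t distribution multiplier.

```r
# Hypothetical helper: standard error of the mean
# (standard deviation divided by the square root of the number of observations)
standard_error <- function(values) {
  sd(values) / sqrt(length(values))
}

# Made-up counts, used only for illustration
counts <- c(2, 4, 4, 4, 5, 5, 7, 9)

se <- standard_error(counts)

# Approximate 95% confidence interval for the mean, based on the t distribution
t_crit <- qt(0.975, df = length(counts) - 1)
ci_95 <- mean(counts) + c(-1, 1) * t_crit * se

se     # the standard error
ci_95  # lower and upper limits of the 95% interval
```

The 95% interval is noticeably wider than the mean plus or minus one SE, which is the trade-off described above.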

Your answer was $q3 <<if $q3 is "2.3">>Correct! Great work, $name! <<else>>Hmm, that doesn't look right. <</if>> Your answer was $q4 <<if $q4 is "3.4">>Correct! Great work, $name! <<else>>Hmm, that doesn't look right. <</if>> Your answer was $q5 <<if $q5 is "0.6">>Correct! Great work, $name! <<else>>Hmm, that doesn't look right. <</if>> <<if $q3 is "2.3" and $q4 is "3.4" and $q5 is "0.6">>All correct! Great work, $name! [[proceed|Intro9]] <<else>>You might need to [[try again|Intro7]]. Remember to round correctly. <</if>> ---- [[<|Intro7]] [[<<|start]]

The standard way to hold data in R is a 'dataframe'
We have created a basic dataset, but it is currently just a list of numbers. In R such a list is called a vector. Sometimes we do want to work with vectors, but more typically, we need data to be in the form of a data frame.
Here's our set of numbers again.
invert.species <- c(4, 5, 0, 0, 3, 3, 2, 1)
invert.species
Now try this.
# Use the 'as.data.frame' function to change our first list into a data frame.
invert.survey <- as.data.frame(invert.species)
invert.species # Look at the old list
invert.survey # Look at the new data frame
Note how the layout of the numbers has changed. It has converted to a column, more like how we would enter data into a spreadsheet in a program like Excel or Numbers.
Typically, you would:
* Do your data entry in Excel (or similar)
* Save as a csv (comma-separated values) file
* Import the csv file into RStudio
If you import it correctly it will arrive as a data frame. That is, it will be arranged in a set of columns with headings for each column.
Making a 'data frame' in R
We're going to have a go at making a data frame in R. I think there are two good reasons to do this:
* You will get a clearer idea of how a data frame is made up and arranged.
* You will see that making data frames in R is fiddly and annoying.
This second point is useful only in that it will reinforce that the better way to do things is to undertake your data entry in a dedicated program such as Excel.
Let's imagine we have species richness counts from two streams, Scotchman's Creek and Salt Creek. We collected the samples over a number of days and recorded the water temperature and whether it was sunny or cloudy on the day too.
The first thing we will do is create a new set of numbers representing species richnesses. There are twenty in total. Ten come from Scotchman's Creek. Ten come from Salt Creek.
# Create a larger invert.species dataset.
# We will have 10 observations from Scotchman's Creek and 10 from Salt Creek
invert.species <- c(4, 5, 0, 0, 3, 3, 2, 1, 4, 7, 5, 3, 4, 0, 1, 4, 7, 7, 0, 3)
Now create a list of creek names. These have to align exactly with the species counts above. The first number corresponds to the first creek name. Notice also that when you do this, the new list will appear in your working space in the top-right window.
# Create a list of stream names. The first ten samples are from Scotchmans. The second ten are from Salt Creek.
stream <- c("scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "scotchmans", "salt", "salt", "salt", "salt", "salt", "salt", "salt", "salt", "salt", "salt")
Now create a list of water temperatures.
# Create a list of temperatures.
temp <- c(10.2, 11.3, 8.4, 7.9, 10.5, 9.8, 9.5, 10.1, 12.5, 13.1, 9.8, 9.5, 9.5, 7.1, 7.5, 10.4, 11.1, 11.3, 6.5, 10.2)
Now create a list of weather conditions that were recorded on the day of sampling.
# Create a list of weather conditions.
weather <- c("sunny", "sunny", "cloudy", "cloudy", "sunny", "cloudy", "cloudy", "sunny", "sunny", "sunny", "cloudy", "cloudy", "cloudy", "cloudy", "cloudy", "sunny", "sunny", "sunny", "cloudy", "sunny")
At this stage, we have a set of lists (vectors) but no data frame. There are quicker ways to create a data frame using libraries like 'dplyr', but I think it is a bit easier to understand if we just do this one step at a time using standard coding.
Here, we take 'invert.species' and use 'as.data.frame' to turn it into a data frame. Note that we are dropping it into a new object that we are calling 'invert.survey'.
# Use the 'as.data.frame' function to change our first list into a data frame.
invert.survey <- as.data.frame(invert.species)
invert.species # Look at the old list
invert.survey # Look at the new data frame
When you look at invert.survey, you should see two things have changed:
* Now the numbers are arranged vertically, like a standard spreadsheet. This arrangement is what we would call 'long' or 'tall' format.
* The data frame is called invert.survey but inside the data frame is a column named invert.species.
Now we can add the other vectors to this new data frame. We do this by using the assignment arrow to take the vector and drop it into a new column. Here, we are naming the columns using the dollar sign symbol. The dollar sign ($) means 'inside of' or 'look inside'. So you can read invert.survey$stream as:
* Look inside 'invert.survey' for a column called 'stream'.
invert.survey$stream <- stream # Add stream name
invert.survey
Do the same for the other two variables, temperature and weather.
invert.survey$temp <- temp # Add temperature
invert.survey
invert.survey$weather <- weather # Add weather
invert.survey
Now, you should have a data frame that has four columns. This would have been a lot easier to do using a data entry app like Excel, but this way you get to see us building a data frame up from the ground.
Columns inside data frames can be numbers or factors just like vectors can. Or they can be other classes of things entirely, such as 'characters'. We prefer our variables to be either numbers or factors. Use is.factor and is.numeric to check what classes the four data columns are. Here's the first one to start you off.
is.numeric(invert.survey$invert.species)
is.factor(invert.survey$invert.species)
Species richness (invert.species) is a number:
<<radiobutton "$q6" "true">> true
<<radiobutton "$q6" "false">> false
[[Check your answers|Intro10]]
----
[[<|Intro8]] [[<<|start]]
<<if $name is "cheat">> true <</if>>
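If one of your text columns turns out to be a 'character' rather than a factor, you can convert it yourself. The sketch below uses a small made-up data frame called survey (not the tutorial's invert.survey) and the base R functions factor(), levels() and str(). Setting stringsAsFactors = FALSE is an assumption made here so that the example behaves the same way in older and newer versions of R.

```r
# A small made-up data frame with one numeric and one character column
survey <- data.frame(
  count = c(4, 5, 0, 3),
  stream = c("scotchmans", "scotchmans", "salt", "salt"),
  stringsAsFactors = FALSE
)

is.factor(survey$stream)     # FALSE: the column holds characters
is.character(survey$stream)  # TRUE

# Convert the character column into a factor (categorical variable)
survey$stream <- factor(survey$stream)

is.factor(survey$stream)     # TRUE
levels(survey$stream)        # the categories found in the column
str(survey)                  # one-line summary of every column and its class
```

str() is often the quickest check, because it reports the class of every column at once instead of testing them one by one.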

Answers
Species count is a number: <<if $q6 is "true">>Correct! Great work, $name.
The way we constructed a dataframe was bit-by-bit to try and give you a sense of how they come together. It is actually possible to do it in one step. Try this and see if it works...
invert.survey.2 <- data.frame(invert.species, stream, temp, weather)
invert.survey
invert.survey.2
You should get two dataframes that look very similar.
[[Next|Intro11]]
<<else>>That doesn't look correct. Maybe [[try again|Intro9]]. <</if>>
----
[[<|Intro8]] [[<<|start]]
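To check more than 'they look very similar', you can ask R to compare the two data frames. This is a hedged sketch rather than part of the original exercise: it rebuilds the four vectors so that it runs on its own, uses rep() as a shorthand for typing each stream name ten times, and passes stringsAsFactors = FALSE (an assumption) so the text columns stay as characters in every version of R.

```r
# Rebuild the vectors from the previous page
invert.species <- c(4, 5, 0, 0, 3, 3, 2, 1, 4, 7, 5, 3, 4, 0, 1, 4, 7, 7, 0, 3)
stream <- rep(c("scotchmans", "salt"), each = 10)  # ten of each, in order
temp <- c(10.2, 11.3, 8.4, 7.9, 10.5, 9.8, 9.5, 10.1, 12.5, 13.1,
          9.8, 9.5, 9.5, 7.1, 7.5, 10.4, 11.1, 11.3, 6.5, 10.2)
weather <- c("sunny", "sunny", "cloudy", "cloudy", "sunny",
             "cloudy", "cloudy", "sunny", "sunny", "sunny",
             "cloudy", "cloudy", "cloudy", "cloudy", "cloudy",
             "sunny", "sunny", "sunny", "cloudy", "sunny")

# Bit-by-bit construction
invert.survey <- as.data.frame(invert.species)
invert.survey$stream <- stream
invert.survey$temp <- temp
invert.survey$weather <- weather

# One-step construction
invert.survey.2 <- data.frame(invert.species, stream, temp, weather,
                              stringsAsFactors = FALSE)

dim(invert.survey.2)                       # number of rows, number of columns
all.equal(invert.survey, invert.survey.2)  # TRUE when the two data frames match
```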

Selecting columns out of a dataframe
Just as you can build dataframes, you can select data out of a dataframe too. You can do this either by selecting out columns by their position (i.e. by number) or by their name (i.e. the heading). Try these:
invert.survey[4] # The fourth element.
invert.survey[-4] # All but the fourth.
invert.survey[2:4] # Elements two to four.
invert.survey[-(3:4)] # All elements except three to four.
invert.survey[c(1, 3)] # Elements one and three.
invert.survey['weather'] # The element named 'weather'
invert.survey[c('stream','weather')] # The elements named 'stream' and 'weather'
You can also separate out columns of data and drop them into a new object, like so:
# Take the elements named 'stream' and 'weather' and drop them into a new object called 'stream_and_weather_only'
stream_and_weather_only <- invert.survey[c('stream','weather')]
stream_and_weather_only
[[Proceed|Intro12]]
----
[[<|Intro10]] [[<<|start]]
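One detail worth knowing is that square brackets and the dollar sign hand back different kinds of object. The sketch below uses a small made-up data frame called survey (not the tutorial's invert.survey), and the object names are hypothetical. Single square brackets keep the result as a data frame, while the dollar sign (or double square brackets) pulls the column out as a plain vector.

```r
# A small made-up data frame
survey <- data.frame(
  count = c(4, 5, 0, 3),
  weather = c("sunny", "sunny", "cloudy", "cloudy"),
  stringsAsFactors = FALSE
)

one_column_df <- survey["weather"]   # single brackets: still a data frame
as_vector <- survey$weather          # dollar sign: a plain vector
also_vector <- survey[["weather"]]   # double brackets: the same plain vector

is.data.frame(one_column_df)         # TRUE
is.data.frame(as_vector)             # FALSE
identical(as_vector, also_vector)    # TRUE
```

This matters because functions such as mean() or sd() expect a vector, so they are usually given a column selected with the dollar sign.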

Basic statistical tests and figures
Most basic (core) statistical tests and figures in R follow the same syntax. Here is the basic structure:
function(response ~ predictor, data = your.data.frame)
Basic boxplot
Here's the code to create a basic boxplot. To make this suitable for a report you would need to add extra code or use a package like ggplot2, but if you just want to eyeball data, this is a useful option.
boxplot(invert.species ~ stream, data = invert.survey)
You can change the colour of the boxes if you like:
boxplot(invert.species ~ stream, data = invert.survey, col = c("hotpink","gold3"))
You can try other colours. Many of the colour names that R uses match the colour names supported by CSS, although the two lists are not identical (gold3, for example, is an R name but not a CSS one). You can even just have a go at guessing some colours. Does 'navy' exist? What about 'darkorange' or 'forestgreen'? What other names can you find? Note that you can run colors() in R to list every name it recognises, or search online for 'R colour names'.
Basic scatterplot
There is no 'scatterplot' function in base R. Instead you just use 'plot' and R will default to the most suitable plot.
plot(invert.species ~ temp, data = invert.survey)
To add a line of best fit, you need to use the abline function. Try running these two lines together.
plot(invert.species ~ temp, data = invert.survey)
abline(lm(invert.species ~ temp, data = invert.survey))
You can modify scatterplots just as you can modify boxplots. Here are a few things you can change (there are a whole lot more):
* lwd = line width
* pch = point character (the points)
* col = colour
plot(invert.species ~ temp, data = invert.survey, pch = 20)
abline(lm(invert.species ~ temp, data = invert.survey), lwd = 2, col = "red")
t-test
A t-test is a statistical test that tests if two means are different. If P < 0.05, then the two means are taken to be significantly different.
t.test(invert.species ~ stream, data = invert.survey)
Mann-Whitney U test
A Mann-Whitney U test (also called a Wilcoxon rank-sum test, hence wilcox.test in R) is a non-parametric version of a t-test. It is (sort of) testing for a difference in medians rather than means.
A t-test is only valid if the assumptions are met. In R, the default t-test is a Welch's unequal variance t-test. It has two assumptions:
* Observations must be independent
* Data must be normally distributed
A Mann-Whitney U test still requires independent observations, but the data doesn't need to be normally distributed. Try this:
wilcox.test(invert.species ~ stream, data = invert.survey)
The first thing to note is that you will get a warning (not an error) like so:
Warning message:
In wilcox.test.default(x = c(5, 3, 4, 0, 1, 4, 7, 7, 0, 3), y = c(4, :
cannot compute exact p-value with ties
Because a Mann-Whitney U test works by comparing ranks of numbers, it can't generate an exact P value if there are 'ties' in the data. That's usually fine. It would only be a concern if the P value was very close to 0.05, in which case we couldn't be certain if it was significant or not. Because our result is strongly non-significant (P = 0.673), any imprecision is negligible.
Given that non-parametric tests have fewer assumptions than parametric tests, you might be wondering why we don't use non-parametric tests all the time. Non-parametric tests tend to be a bit more limited in terms of the data they can accept, and they tend to inflate (or increase) 'Type II error', which is your chance of getting a false negative. Because researchers are always looking for significance, there is a tendency to start with parametric tests (making it more likely that you will obtain a significant result) and only move to a non-parametric test if you have no other option. This is actually a little bit dodgy, because we shouldn't be making decisions that increase our chance of obtaining significance.
Anyway. Great work so far, $name. Now try answering these questions:
What was the P-value for a t-test (t.test) of 'invert.species' as a function of 'weather'? Write the answer rounded to three decimal places. <<textbox "$q9" "">>
What was the P-value for a Mann-Whitney U (wilcox.test) of 'invert.species' as a function of 'weather'? Write the answer rounded to three decimal places. <<textbox "$q10" "">>
[[Check your answer|Intro13]]
----
[[<|Intro11]] [[<<|start]]
<<if $name is "cheat">> 0.006, 0.011 <</if>>
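Instead of reading the P-value off the printed output, you can store a test result in an object and pull the number out of it. This is a minimal sketch using the stream comparison from above. The object names t_result and w_result are made up, and the vectors are rebuilt here so the example runs on its own. Both t.test() and wilcox.test() return a list that includes an element called p.value. Passing exact = FALSE to wilcox.test() is an optional choice made here: it asks for the normal approximation up front, so R does not need to warn about ties.

```r
# Rebuild the species counts and stream names
invert.species <- c(4, 5, 0, 0, 3, 3, 2, 1, 4, 7, 5, 3, 4, 0, 1, 4, 7, 7, 0, 3)
stream <- rep(c("scotchmans", "salt"), each = 10)
invert.survey <- data.frame(invert.species, stream, stringsAsFactors = FALSE)

# Store the results rather than just printing them
t_result <- t.test(invert.species ~ stream, data = invert.survey)
w_result <- wilcox.test(invert.species ~ stream, data = invert.survey,
                        exact = FALSE)

# Extract and round the P-values
round(t_result$p.value, 3)
round(w_result$p.value, 3)  # 0.673, the value quoted in the text above
```

The same pattern works for the 'weather' questions once the weather column is added to the data frame.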

Your answer was $q9 <<if $q9 is "0.006">>Correct! Great work, $name! <<else>>Hmm, that doesn't look right. <</if>>
Your answer was $q10 <<if $q10 is "0.011">>Correct! Great work, $name! <<else>>Hmm, that doesn't look right. <</if>>
<<if $q9 is "0.006" and $q10 is "0.011">>All correct! Really great work, $name! You're getting the hang of this now! Note how the P-value for the non-parametric wilcox.test is higher than the P-value for the parametric t-test. Both are significant, but the result for the non-parametric test is closer to being non-significant. This is the reason why researchers (perhaps a bit sneakily) prefer parametric tests. [[proceed|Intro14]] <<else>>You might need to [[try again|Intro12]]. Remember to round correctly. <</if>>
----
[[<|Intro12]] [[<<|start]]

Finish
Great work, $name. You've reached the end of this short introductory tutorial. A lot of the information in the tutorial is presented in a quick reference PDF that you can access via <a href="https://rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf" target="_blank">this link</a>. It's a very useful little PDF and well worth keeping a copy of.
Now, you can either [[go back to the start|start]] or move on to your next interactive lab.