Your browser lacks required capabilities. Please upgrade it or switch to another to continue.
Loading…
<span class="header1">Welcome</span>
Welcome to this interactive tutorial on building and testing linear models.
In today's workshop you will learn how to build a linear model, test assumptions and extract and interpret a result.
Linear models are somewhat complicated to execute, and there are quite a few steps and considerations, but the good news is that once you learn how to build and test linear models, the rules are the same for many types of tests you may need to do in the future.
* Need to run a regression analysis? That's just a type of linear model.
* Need to run an ANOVA? Just another type of linear model.
* What about an ANCOVA? It's a linear model.
* A linear mixed effects model? You guessed it. The rules are pretty much the same, with a couple additional considerations.
<strong>To start us off, what is your name?</strong>
<<textbox "$name" "">>
Now, before we even look at data or statistical programs, we need to work through the basic steps for building a model.
[[Proceed|Steps1]]
----
<span class="header2">Table of Contents</span>
The below <strong>Table of Contents</strong> is intended to help you navigate back to a point in the tutorial if you need to stop half-way through. You can click the 'back to start' <strong>'<<'</strong> link at the bottom of each page to return to the this page.
<span class="header2">Preliminary Steps</span>
[[Preliminary 1: Steps: Building a Linear Model|Steps1]]
[[Preliminary 2: Positive and negative relationships|Steps3]]
[[Preliminary 3: Building a model|Steps5]]
[[Preliminary 4: Order of predictors|Steps8]]
[[Preliminary 5: Question: Order of predictors|Steps9]]
[[Preliminary 6: End of Preliminaries|Steps11]]
<span class="header2">PAST</span>
[[Past 1: Intro|P01]]
[[Past 2: One Way ANOVA|P02]]
[[Past 3: One Way ANOVA: Arranging Data for Past|P03]]
[[Past 4: Importing data into Past|P05]]
[[Past 5: More data wrangling|P06]]
[[Past 6: What now?|P08]]
[[Past 7: Transformations|P09]]
[[Past 8: More transformations|P11]]
[[Past 9: Understanding the output|P12]]
[[Past 10: Recap|P14]]
<span class="header2">RStudio</span>
[[RStudio 1: Intro|R01]]
[[RStudio 2: Dataset|R02]]
[[RStudio 3: Hypothesis, data and equation building, 1|R03]]
[[RStudio 4: Hypothesis, data and equation building, 2|R05]]
[[RStudio 5: Import dataset in RStudio|R08]]
[[RStudio 6: Colour coding|R09]]
[[RStudio 7: Building a model|R10]]
[[RStudio 8: Assumptions|R11]]
[[RStudio 9: Object orientated coding|R12]]
[[RStudio 10: Boxplot your data|R13]]
[[RStudio 11: Checking assumptions|R15]]
[[RStudio 12: Diagnostic plots|R16]]
[[RStudio 13: Transformations|R18]]
[[RStudio 14: Investigating interaction terms|R19]]
[[RStudio 15: Check for interactions|R21]]
[[RStudio 16: Removing interaction terms|R23]]
[[RStudio 17: The Reduced Model|R25]]
[[RStudio 18: Tukey's Test|R26]]
[[RStudio 19: Writing up for a formal report, 1|R28]]
[[RStudio 20: Writing up for a formal report, 2|R29]]
<span class="header1">Steps: Building a Linear Model</span>
Hi, $name. Before we even touch data management or a statistical program, we need to work out what we want our model to do. Ideally, you should be thinking about this well before data collection, although frequently students (and established researchers) only start thinking about the exact shape of the model when data collection is finished.
<span class="header2">(1) Hypothesis to model</span>
Let's start by going back to basics and (re)introducing the idea of the hypothesis and null hypothesis. In current biological practice at least, hypotheses tend not to be written out formally. Rather, there's a preference to write them in plain English.
<span class="header2">Simple Hypotheses</span>
* Body mass (g) will positively associate with haemoglobin (g/dL)
* Growth rate of plants (g/day) will be higher with beneficial mycorrhizal treatments than under control conditions (no mycorrhizal treatment)
* Native bird species richness will positively associate with forest fragment area (ha)
* cen-b, myo and gscnaa will show higher expression levels in actively courting guppies (<em>Poecilia reticulata</em>)
In contrast, the null hypothesis is simply the relationship stated as a negative. That is, the null is the state where there is no relationship or association.
<span class="header2">Simple Null Hypotheses</span>
* Body mass (g) will not associate with haemoglobin (g/dL) (i.e. there is no association between body mass and haemoglobin)
* Growth rate of plants (g/day) will not be different with beneficial mycorrhizal treatments than under control conditions (no mycorrhizal treatment)
* Native bird species richness will not associate with forest fragment area (ha)
* cen-b, myo and gscnaa will not show different expression levels in actively courting guppies (<em>Poecilia reticulata</em>)
Note also that I've removed the <strong>direction of effect</strong> from the null hypotheses. You don't actually require a direction of effect for a hypothesis either... that is, there might simply be a relationship, affect or association, and generally speaking null hypotheses are written without the direction. Whether or not you are expecting a direction can make differences to the interpretation of a final test. You may have run into 'one tailed' and 'two tailed' t-tests, for instance, which have different expectations in this regard. However, for our purposes, the most straightforward and rigorous thing to do is always predict there to be no affect at all at the null hypothesis stage, and just run standard two-tailed tests (which will be the default more or less always, so you don't have to change this when running tests).
<span class="header2">Terminology</span>
Notice also that I've used slightly different terminology for relationships.
* <strong> shows a relationship with... </strong> doesn't imply cause or effect
* <strong> shows an association with... </strong> doesn't imply cause or effect
* <strong> shows a correlation with... </strong> doesn't imply cause or effect (but tends to be used for datasets consisting of two numerical columns of observations)
* <strong> has an effect on... </strong> implies causality.
<span class="header2">Positive and negative</span>
* in positive relationships, as x increases, y also increased. At a very basic level, height and mass in humans will be positively correlated
* in negative relationships, as x increases y decreases. At a very basic level, number of young produced by a species and the investment per young tends to be negatively associated. That is, if you produce hundreds of young (as do many octopus or sea turtles) each individual offspring will tend to be small. If you only produce one young a year (or every few years, such as a whale), then the single young you produce will tend to be large.
Let's check your understanding of hypotheses and null hypotheses. Read the following and decide if they are hypotheses or null hypotheses.
Mass (g) of adult welcome swallows will associate with the presence of a broodpatch.
<<radiobutton "$q1" "hypothesis">> hypothesis
<<radiobutton "$q1" "null hypothesis">> null hypothesis
There will be no transcriptome differences between juvenile and adult zebra fish.
<<radiobutton "$q2" "hypothesis">> hypothesis
<<radiobutton "$q2" "null hypothesis">> null hypothesis
Male Gouldian finches of the red colour morph will not show differences in aggressive behaviour when compared to orange or black male colour morphs.
<<radiobutton "$q3" "hypothesis">> hypothesis
<<radiobutton "$q3" "null hypothesis">> null hypothesis
[[Check your answers|Steps2]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
hypothesis
null hypothesis
null hypothesis </span>
<</if>><span class="header2">Answers</span>
<strong>Mass (g) of adult welcome swallows will associate with the presence of a broodpatch.</strong>
You stated that this was a $q1.
<<if $q1 is "hypothesis">>Correct!
<<else>>Hmm, that doesn't look right. Maybe [[try again|Steps1]]
<</if>>
<strong>There will be no transcriptome differences between juvenile and adult zebra fish.</strong>
You stated that this was a $q2
<<if $q2 is "null hypothesis">>Correct!
<<else>>Hmm, that doesn't look right. Maybe [[try again|Steps1]]
<</if>>
<strong>Male Gouldian finches of the red colour morph will not show differences in aggressive behaviour when compared to orange or black male colour morphs.</strong>
You stated that this was a $q3
<<if $q3 is "null hypothesis">>Correct!
<<else>>Hmm, that doesn't look right. Maybe [[try again|Steps1]]
<</if>>
<<if $q1 eq "hypothesis" and $q2 eq "null hypothesis" and $q3 eq "null hypothesis">>Great work! [[Proceed|Steps3]]
<<else>>
Something doesn't look quite right there. Be sure to check the wording carefully.
<<endif>>
[[<|Steps1]]
[[<<|start]]Now, let's look at positive and negative relationships. Read the following and decide if they are positive or negative relationships.
As metazoan parasite number increases, eosinophil count in the blood also increases.
<<radiobutton "$q4" "positive">> positive
<<radiobutton "$q4" "negative">> negative
As accessory pigment concentration in thylakoids increases, photosynthetic rate decreases.
<<radiobutton "$q5" "positive">> positive
<<radiobutton "$q5" "negative">> negative
As disease load decreases, survivorship increases.
<<radiobutton "$q6" "positive">> positive
<<radiobutton "$q6" "negative">> negative
As red blood cell count decreases, haemoglobin (g/dL) decreases.
<<radiobutton "$q7" "positive">> positive
<<radiobutton "$q7" "negative">> negative
[[Check your answers|Steps4]]
[[<|Steps2]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
positive
negative
negative
positive
</span>
<</if>><span class="header2">Answers</span>
<strong>As metazoan parasite number increases, eosinophil count in the blood also increases.</strong>
You stated that this was a $q4 relationship.
<<if $q4 is "positive">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps3]]
<</if>>
<strong>As accessory pigment concentration in thylakoids increases, photosynthetic rate decreases.</strong>
You stated that this was a $q5 relationship.
<<if $q5 is "negative">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps3]]
<</if>>
<strong>As disease load decreases, survivorship increases.</strong>
You stated that this was a $q6 relationship.
<<if $q6 is "negative">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps3]]
<</if>>
<strong>As red blood cell count decreases, haemoglobin (g/dL) decreases.</strong>
You stated that this was a $q7 relationship.
<<if $q7 is "positive">>Correct!
<<else>>That doesn't seem correct. This is a tricky one. Re-read the sentence carefully and [[try again|Steps3]]
<</if>>
<<if $q4 eq "positive" and $q5 eq "negative" and $q6 eq "negative" and $q7 eq "positive">>Great work, $name! [[Proceed|Steps5]]
<<else>>
Something doesn't look quite right there. Be sure to check the wording carefully.
<<endif>>
[[<|Steps3]]
[[<<|start]]<span class="header1">Building a model</span>
So far we have looked at the first two steps of building a linear model. We'll now be moving onto the third step...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_01-1.png" alt="past" width="60%" height="auto"/>
I've written 'equation' in quotes, because really, this isn't going to be much like a mathematical equation.
Let's nail down some terminology too.
* <strong>Response variable:</strong> In classical statistics this is the <em>dependent variable</em>, but in biology we usually call this a response variable. It is the variable that responds to the predictors. Standard linear models (i.e. regression analyses, ANOVAs, ANCOVAs) can only have one response variable. There is a class of multivariate analysis that will accept multiple response variables (such as a MANOVA), but don't worry about that for now. We're only going to look at these single response variable or 'univariate' models for now.
* In a linear model a response must be a number.
* <strong>Predictor variable(s):</strong> In classical statistics this is the <em>independent variable</em>, but in biology we usually call this a predictor variable. It is the variable that predicts the response. Standard linear models (i.e. regression analyses, ANOVAs, ANCOVAs) can have one, two or many predictor variable(s). The number of predictor variables is limited by your sample size, and the consequent degrees of freedom. You can't have more predictors than you have degrees of freedom. A model that has too many predictors will collapse and refuse to run in a statistical program. We call such a model 'over-parameterised'.
* In a linear model a predictor can be a number, binary data (e.g. y/n), or category (e.g. sex, treatment).
<span class="header2">Why can't we have more predictors than degrees of freedom?</span>
Think of it this way: if you had the same number of predictors (mass, length, treatment, sex etc) as observations (your sample size, n), then you could allocate each predictor (in order of their means) to one observation (in order of their means) (i.e. you could just pair everything up from smallest to largest). The predictors would then explain 100% of the response variable... but, actually, you wouldn't have explained anything. We just allocated variables to observations and any set of random numbers would also explain 100% of the variation. For this reason the number of predictors is restricted to be (at most) n-1.
<span class="header2">What about 'explanatory variables'?</span>
Some researchers prefer the term 'explanatory variable' to 'predictor'. I tend to use 'response' and 'predictor' because they are both single word descriptors, however, it is worth being aware that <a href="https://scholar.google.com.au/scholar?hl=en&as_sdt=0%2C5&q=664+Citations+Regression+and+model-building+in+conservation+biology%2C+biogeography+and+ecology%3A+The+distinction+between+-+and+reconciliation+of+-+%27predictive%27+and+%27explanatory%27+models&btnG=" target="_blank">there are some specific definitions around these two terms</a>, and as long as we are dealing with single model hypothesis testing (as opposed to model selection methods), 'predictor' is the more correct term.
<span class="header2">Writing out a model as an equation</span>
We're going to use a notation that is similar to how R handles linear models. We will write a single response first, then a tilda (~), which you can read as 'as a function of' then the responses. We will also add the intercept and error term, but this is just for completeness...
<strong>
# Response
# as a function of (~)
# Intercept (the value of the response when all predictors are zero)
# Predictor 1
# Predictor 2
# Predictor 3 (etc)
# Error (all datasets contain error, and this term accounts for this 'noise' in the dataset).
</strong>
We will use asterisks (*) between the predictors. Asterisks represent a complex (multiplicative) relationship. They might not make it to the final model, but here, they act as a reminder that we need to check interaction terms. Note that a plus (+) is placed after the intercept and before the error term because this represents a simple (additive) relationship, and intercepts and error terms always have simple relationships with the other predictors.
<span style="font-family: Helvetica;"><strong><span style="color: #0000ff;">RESPONSE</span> ~ INTERCEPT + <span style="color: #0000ff;">PREDICTOR1</span> * <span style="color: #0000ff;">PREDICTOR2</span> * <span style="color: #0000ff;">PREDICTOR3</span> (etc)... + ERROR</strong></span>
The terms in blue need to be defined in your final model. The terms in black are always in the model by default, so you don't (usually) have to define them. Some modelling methods (such as structural equation models) may require you to define error terms, but that is pretty advanced stuff and well beyond anything we need to worry about for now.
Let's try taking some hypotheses and turning them into models...
<strong>Mass (g) of adult welcome swallows will associate with the presence of a broodpatch.</strong>
<<radiobutton "$q8" "a">> (a) MASS (g) ~ BROODPATCH (y/n) + Error
<<radiobutton "$q8" "b">> (b) BROODPATCH (y/n) ~ Intercept + MASS (g) + Error
<<radiobutton "$q8" "c">> (c) BROODPATCH (y/n) ~ MASS (g) + Error
<<radiobutton "$q8" "d">> (d) MASS (g) ~ Intercept + BROODPATCH (y/n) + Error
<strong>There will be transcriptome differences between juvenile and adult zebra fish.</strong>
<<radiobutton "$q9" "a">> (a) TRANSCRIPTOME (measurement) ~ Intercept * AGE (juvenile/adult) * Error
<<radiobutton "$q9" "b">> (b) TRANSCRIPTOME (measurement) ~ Intercept + AGE (juvenile/adult) + Error
<<radiobutton "$q9" "c">> (c) TRANSCRIPTOME (measurement) ~ AGE (juvenile/adult) + Intercept/Error
<<radiobutton "$q9" "d">> (d) TRANSCRIPTOME (measurement) ~ AGE (juvenile/adult) + Error/Intercept
<strong>Male Gouldian finches of the red colour morph will show differences in aggressive behaviour when compared to orange or black male colour morphs.</strong>
<<radiobutton "$q10" "a">> (a) AGGRESSIVE BEHAVIOUR (score) ~ Intercept + COLOUR MORPH (red/black/orange) + Error
<<radiobutton "$q10" "b">> (b) AGGRESSIVE BEHAVIOUR (score) ~ Error + COLOUR MORPH (red/black/orange) + Intercept
<<radiobutton "$q10" "c">> (c) AGGRESSIVE BEHAVIOUR (score) ~ Intercept * COLOUR MORPH (red/black/orange) * Error
<<radiobutton "$q10" "d">> (d) AGGRESSIVE BEHAVIOUR (score) ~ Error * COLOUR MORPH (red/black/orange) * Intercept
[[Check your answers|Steps6]]
[[<|Steps4]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
d
b
a
</span>
<</if>><span class="header2">Answers</span>
<strong>Mass (g) of adult welcome swallows will associate with the presence of a broodpatch.</strong>
<<if $q8 is "d">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps5]]
<</if>>
<strong>There will be transcriptome differences between juvenile and adult zebra fish.</strong>
<<if $q9 is "b">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps5]]
<</if>>
<strong>Male Gouldian finches of the red colour morph will show differences in aggressive behaviour when compared to orange or black male colour morphs.</strong>
<<if $q10 is "a">>Correct!
<<else>>That doesn't seem correct. Maybe [[try again|Steps5]]
<</if>>
<<if $q8 eq "d" and $q9 eq "b" and $q10 eq "a">>Great work, $name! Now, let's try a harder one. This hypothesis has multiple predictors. Can you remember how to arrange it into an equation?
<strong> Maize growth rate will be affected by nitrogen treatment (low, medium, high) and sunlight exposure (shade, direct) </strong>
<<radiobutton "$q11" "a">> (a) GROWTH RATE ~ Intercept + LOW TREATMENT * MEDIUM TREATMENT * HIGH TREATMENT * SHADE * DIRECT SUNLIGHT + Error
<<radiobutton "$q11" "b">> (b) GROWTH RATE ~ Intercept + LOW TREATMENT + MEDIUM TREATMENT + HIGH TREATMENT + SHADE + DIRECT SUNLIGHT + Error
<<radiobutton "$q11" "c">> (c) GROWTH RATE ~ Intercept + TREATMENT (low, medium, high) * SUNLIGHT (shade, direct) + Error
<<radiobutton "$q11" "d">> (d) GROWTH RATE ~ Intercept * TREATMENT (low, medium, high) + SUNLIGHT (shade, direct) * Error
[[Check Your Answer|Steps7]]
<<else>>
Something doesn't look quite right there. Be sure to check the + and * symbols carefully, as well as the overall order of variables.
<<endif>>
<<if $name is "cheat">>
<span class="cheat"> c </span>
<</if>>
[[<|Steps5]]
[[<<|start]]<span class="header2">Answers</span>
<strong> Maize growth rate will be affected by nitrogen treatment (low, medium, high) and sunlight exposure (shade, direct) </strong>
<<if $q11 is "c">>Correct! Great work, $name.
[[Proceed|Steps8]]
<<else>>That doesn't seem correct. It is a tricky one, though. [[Have another go|Steps6]]
<</if>>
[[<|Steps6]]
[[<<|start]]<span class="header1">Order of predictors</span>
If a model contains two or more predictors, we have to take a step back and think about the order that the predictors will be entered into the model. This is especially important in R, because R defaults to something called a 'Type I' sum of squares, or also sometimes a 'sequential sum of squares', when calculating P values. If your design is unbalanced (i.e. if there are uneven numbers of samples in treatments) and you use Type I sum of squares, the order of predictors can have important implications for final P values.
Here are some diagrams that attempt to explain the differences between Types I, II and II sum of squares in a reasonably accessible way.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/Type_I_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/Type_II_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/02/Type_III_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<span class="header2">What is an interaction term?</span>
We will come back to interaction terms in more detail later, but, briefly, an interaction term represents the question: <strong> Do these two (or more) Predictors have a complex relationship in their effect on the Response? </strong> If for example, sunlight is having a positive effect on plant growth rate, but only at high nitrogen treatment levels, then we might expect the interaction term to be significant. It can be thought of as a term that asks the question: does the effect of one predictor depend on the level of the other predictor? (where P < 0.05 is 'yes' and P > 0.05 is 'no').
<span class="header2">How do we order predictors?</span>
So we often need to consider the order of predictors in a model. How do we do this? What rules do we use? At the most basic level, you want to endeavour to do two things:
# Predictors that you have only included because you want to control for their effect (i.e. you are probably not interested in the P value), should be placed earlier in the sequence. This is true of classic 'covariates' that are included to control for a known effect in a model like an ANCOVA.
# Predictors that directly relate to your hypothesis should be placed later in the model. The more important the hypothesis, the later the term should be placed.
Our reasons for using these rules are two-fold: (1) if you want a model to control for another potential variable, it must be included earlier in the model or else it won't actually get to explain any of the variance before other variables. A covariate included early in a model in this way can sometimes 'clean up' noise in the data, and make it easier for a pattern related to the hypothesis to be identified, and (2) if a predictor related to your hypothesis is really important, it should still be significant even after all the other predictors have already had a shot at explaining some of the variance in the model.
[[Proceed|Steps9]]
[[<|Steps7]]
[[<<|start]]<span class="header2">Question: Order of predictors</span>
Let's return to the following hypothesis.
<strong> Maize growth rate will be affected by nitrogen treatment (low, medium, high) and sunlight exposure (shade, direct) </strong>
Let's imagine we are somewhat interested in the effect of sunlight and the interaction of nitrogen and sunlight (TREATMENT:SUNLIGHT), but our key hypothesis relates to nitrogen treatment. We also want to control for mean daily temperature in the greenhouse, because that can have an effect on plant growth rates (but we aren't interested in the P value for Temperature, rather we just want to control for it).
<strong>Things to consider</strong>
# Interaction terms are always placed last in a Type I sum of squares model.
# Temperature is not key to our hypothesis. We only wish to control for Temperature.
# We are somewhat interested in testing the effect of sunlight and the interaction for sunlight and treatment (i.e. these are secondary hypotheses).
# We are most interested in the effect of fertiliser treatment (i.e. this is the primary hypothesis).
Keeping this all in mind, which of the following is the best way to order these predictors?
<<radiobutton "$q12" "a">> (a) GROWTH RATE ~ Intercept + TEMPERATURE + TREATMENT (low, medium, high) + SUNLIGHT (shade/direct) + TREATMENT:SUNLIGHT + Error
<<radiobutton "$q12" "b">> (b) GROWTH RATE ~ Intercept + TREATMENT (low, medium, high) + SUNLIGHT (shade/direct) + TEMPERATURE + TREATMENT:SUNLIGHT (shade/direct) + Error
<<radiobutton "$q12" "c">> (c) GROWTH RATE ~ Intercept + TREATMENT (low, medium, high) + TEMPERATURE + SUNLIGHT (shade/direct) + TREATMENT:SUNLIGHT + Error
<<radiobutton "$q12" "d">> (d) GROWTH RATE ~ Intercept + TEMPERATURE + SUNLIGHT (shade/direct) + TREATMENT (low, medium, high) + TREATMENT:SUNLIGHT + Error
[[Check Your Answer|Steps10]]
<<if $name is "cheat">>
<span class="cheat"> d </span>
<</if>>
[[<|Steps8]]
[[<<|start]]<span class="header2">Answer</span>
<strong> Maize growth rate will be affected by nitrogen treatment (low, medium, high) and sunlight exposure (shade, direct) </strong>
<<if $q12 is "d">>Correct! Great work, $name. The most reasonable model would place temperature first (because all we want to do is control for temperature), and then sunlight (because this is a secondary hypothesis) and then nitrogen treatment (because this is the predictor of main interest). The interaction term is placed last because in a Type I sum of squares interaction terms are always placed last in a Type I sequential sums of squares model.
[[Proceed|Steps11]]
<<else>>That doesn't seem correct. It is a tricky one, though. [[Have another go|Steps9]]
<</if>>
[[<|Steps9]]
[[<<|start]]<span class="header1">Recap</span>
Let's revisit the chart of steps we are gradually building. We have reached Step 4:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_02-1.png" alt="past" width="60%" height="auto"/>
Now, we need to start working with a statistical program. We have to test assumptions for a linear model, but to do that we have to build the model.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_03.png" alt="past" width="60%" height="auto"/>
Now, you can decide... would you prefer to start with [[Past|P01]] or [[RStudio|R01]]?
[[<|Steps10]]
[[<<|start]]<img src="http://cpjohnstone.com/wp-content/uploads/2018/08/past_img.png" alt="past" width="10%" height="auto"/>
<span class="header1">Past</span>
Hi $name. Great. Let's get started with Past.
Past is a free point and click statistical program loosely similar to SPSS. It is an excellent option for most statistical analyses and graph-making, although it is somewhat limited with regard to some analyses. Past has been designed for analysing paleontological data, but, conveniently for us, this means it is easily applied to biological sciences data too.
There is an excellent and thorough pdf guide to using Past available from the Past team. If you plan to use Past, you should <a href="https://folk.uio.no/ohammer/past/past3manual.pdf" target="_blank">download a copy</a>.
Once you have a copy of the manual, head to the <a href="https://folk.uio.no/ohammer/past/" target="_blank">Past website</a>. and download the program to your computer. If you are working on a lab computer, you may already have a copy.
Once you have downloaded the manual and installed the Past program, you can download a data file. This file is in a comma delineated format (csv). It is actually just a text file with commas separating values. Comma delineated files are not proprietary, which means that no one owns this format, and a csv file is very stable (i.e. the requirements to open a csv file won't ever change in the future).
We're going to start with a dataset called <a href="http://cpjohnstone.com/wp-content/uploads/2018/12/agilis-morphometrics.csv.zip" target="_blank">Agilis Morphometrics</a>. This is an actual biology PhD dataset examining morphological and physiological responses of agile antechinus (a small carnivorous marsupial) to habitat fragmentation. Once you have downloaded it, you can move onto the [[next page|P02]].
[[<|Steps11]]
[[<<|start]]<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_logo.png" alt="past" width="10%" height="auto"/> <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_studio_logo.png" alt="past" width="10%" height="auto"/>
Hi $name. Great. Let's get started.
<span class="header1">What is R?</span>
R is a free to use, open source programming language used for statistical analysis and graphing. Along with Python, it is one of the default professional tools used by data scientists. R has a bit of a learning curve, but don't worry, you can do it! And once you learn the basics, the syntax is largely the same or similar across many different tasks. It's not as hard as it seems at first.
<span class="header1">What is RStudio?</span>
RStuido is a wrap-around for the free statistical programming language R. RStudio helps make R into a more user-friendly experience.
You need to <a href="https://www.r-project.org/" target="_blank">download and install R first</a>, and then <a href="https://www.rstudio.com/products/rstudio/download/" target="_blank">download and install RStudio</a>.
Once you have downloaded and installed R and RStudio you can [[proceed|R02]]
[[<<|start]]<span class="header1">One Way ANOVA</span>
Past is a great program for most statistical work, but it is somewhat limited in terms of building linear models, especially where they might include multiple numeric and categorical predictors.
What we are going to do today is just examine how to build a one-way and two-way ANOVA in Past. ANOVAs are a form of linear model that consist entirely of categorical predictors (e.g. site, sex, treatment) and a single numeric response that will usually also be continuous (e.g. mass, length, expression level etc).
Like all tests, ANOVAs have assumptions. A convenient thing with linear models, is that the assumptions are the same for all liner models.
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;"><strong>ASSUMPTIONS OF ANOVAs (and ANCOVAs etc)</strong></span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(1) Residuals are normally distributed</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(2) Residuals must be independent (collected randomly)</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(3) Residuals have equal variances</span>
<span style="color: #3366ff; font-size: 18pt;"><strong>NOT ASSUMPTIONS, BUT OTHER PROBLEMS MAY OCCUR IF…</strong></span>
<span style="color: #3366ff; font-size: 18pt;">(1) Predictor variables correlate (r > 0.6 considered problematic) (two variables explain the same thing)</span>
<span style="color: #3366ff; font-size: 18pt;">(2) The model is over-parameterised (too many predictors given your data set size)</span>
<span style="color: #3366ff; font-size: 18pt;">(3) You fail to check for significance of interaction terms (your main effects could be meaningless)</span>
Note that assumptions must be met for the model to be valid. The 'other problems' list is a bit more flexible, but you will likely save yourself a headache in the long run by paying attention to them as well.
Here is a basic flowchart to help you step through the process...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/assumption-process-basic.png" alt="past" width="70%" height="auto"/>
However, before we start, we need to get our data sorted into a shape that Past will like. Open up the Agilis Morphometrics dataset and then move to the [[next page|P03]]
[[<|P01]]
[[<<|start]]<span class="header1">Dataset</span>
We're going to take a look at how to turn a linear model equation into the code necessary to run the model in RStudio (or R for that matter, but we'll stick mostly with RStudio because it is a bit more user friendly).
We will use the same dataset of agile antechinus morphometrics and physiological data that is used in the Past tutorial. You can download it <a href="http://cpjohnstone.com/wp-content/uploads/2018/12/agilis-morphometrics.csv.zip" target="_blank">here</a>, if you haven't already.
This is an actual biology PhD dataset examining morphological and physiological responses of agile antechinus (a small carnivorous marsupial) to habitat fragmentation. Once you have downloaded it, you can move onto the [[next page|R03]]
[[<|R01]]
[[<<|start]]<span class="header1">One Way ANOVA: Arranging Data for Past</span>
Although you can move columns around in Past (see page 14 of the Past Manual), we run a slightly lower risk of confusion if we simply create a new dataset that only has the information in it that we want to test.
<span class="header2">One Way ANOVA: the equation</span>
We're going to test a very basic One Way ANOVA model in which SEX of agile antechinus predicts MASS of agile antechinus, i.e. we are asking: does the sex of agile antechinus predict mass?
<strong>What is the most sensible null hypothesis for the above model?</strong>
<<radiobutton "$q13" "a">> (a) Mass has no effect on sex in agile antechinus
<<radiobutton "$q13" "b">> (b) Female agile antechinus will weigh more than male agile antechinus
<<radiobutton "$q13" "c">> (c) There is no difference between agile antechinus masses and dusky antechinus masses
<<radiobutton "$q13" "d">> (d) Sex has no effect on mass in agile antechinus
<strong>What is the most sensible equation?</strong>
<<radiobutton "$q14" "a">> (a) MASS (g) ~ Intercept + SEX (m/f) + Error
<<radiobutton "$q14" "b">> (b) SEX (m/f) ~ Intercept + MASS (g) + Error
<<radiobutton "$q14" "c">> (c) MASS (g) ~ Intercept + MALE * FEMALE + Error
<<radiobutton "$q14" "d">> (d) MASS (g) ~ Intercept + MALE + FEMALE + MALE:FEMALE + Error
[[Check your answers|P04]]
[[<|P02]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
d
a
</span>
<</if>><span class="header2">Answers</span>
<strong>What is the most sensible null hypothesis?</strong>
<<if $q13 is "d">>Correct! The most sensible null hypothesis is that sex has no effect on mass in agile antechinus
<<else>>Hmm, that doesn't look right. Maybe [[try again|P03]]
<</if>>
<strong>What is the most sensible equation?</strong>
<<if $q14 is "a">>Correct! The most sensible equation is <strong>MASS (g) ~ Intercept + SEX (m/f) + Error</strong>. We are using Sex as an independent variable to predict the dependant variable, Mass. All linear models has an intercept and error, so they needed to be in the equation too.
<<else>>Hmm, that doesn't look right. Maybe [[try again|P03]]
<</if>>
<<if $q13 eq "d" and $q14 eq "a">>Great work! [[Proceed|P05]]
<<else>>
Something doesn't look quite right there. Be sure to check the wording of each answer carefully.
<<endif>>
[[<|P03]]
[[<<|start]]<span class="header2">Importing data into Past</span>
Now we are going to import data in Past. Rather than import the whole dataset, we are going to make life a little easier by creating a dataset that only contains the two columns of data we want, sex as a category (MF in the dataset) and Mass.
<strong>
# Open the downloaded zipped file if you haven't already
# Open the csv file called agilis-morphometrics.csv
# Select File > New to create a new Excel spreadsheet
# Copy and paste the MF column into the SECOND column of the new data sheet
# Copy and paste the MASS column into the FIRST column of the new datasheet
# Save as as comma separated file called 'agilis-mass.csv' (although you could call this anything you like as long as it is comma delineated (csv).
</strong>
Once you have create the new csv file you can close the file in Excel.
<strong>
# Open Past
# Drag and drop the new 'agilis-morphometrics.csv' file into Past.
# Choose...
* Rows contain > Only data cells
* Columns contain > Names, data
* Separator > Comma
</strong>
The Import text file window should look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/import-ANOVA.png" alt="past" width="50%" height="auto"/>
Here is a video stepping you through the process if you are unsure how to do any of the steps:
<iframe width="560" height="315" src="https://www.youtube.com/embed/C6vz-EBslxE?controls=0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Once you have imported a dataset containing only Mass and Sex (MF) into Past, you can [[proceed to the next page|P06]]
[[<|P04]]
[[<<|start]]Before we do anything else, we need to tell Past that the second column, MF, is a categorical variable. In Past a categorical variable is called a <strong>Group</strong>, which is nice and easy to remember.
Click 'Column' attributes and change the Type of data for MF to 'Group'. Here's a video showing you what to do if you are unsure.
<iframe width="560" height="315" src="https://www.youtube.com/embed/Bq8BlzA2MPM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Now, we are going to step through the process of checking assumptions. One thing to keep in mind is that we've started with such a simple dataset that you could analyse this with a t-test, but, we want to learn about linear models so we will use an ANOVA instead.
Here are the assumptions of a linear model. We're not going to worry about the 'other potential problems' for now, as I'd like us to focus just on the key assumptions.
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;"><strong>ASSUMPTIONS OF ANOVAs (and ANCOVAs etc)</strong></span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(1) Residuals are normally distributed</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(2) Residuals must be independent (collected randomly)</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(3) Residuals have equal variances</span>
And here is the process that we will step through.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/assumption-process-basic.png" alt="past" width="70%" height="auto"/>
<span class="header2">1. Boxplot the data</span>
# Highlight both columns
# Select 'Plot' > 'Barchart/Boxplot'
# Change the 'Plot type' to 'Boxplot'
Here are some images to help you produce a boxplot. Try playing with some of the options too:
* What happens if you turn on 'Notches' or 'Outliers'?
* Can you work out how to change the graph to greyscale? (hint: try clicking the 'Graph settings' option at the bottom of the graphing window.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxplot_001-2.png" alt="past" width="70%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxplot_02.png" alt="past" width="50%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_boxplot_02.png" alt="past" width="50%" height="auto"/>
Once you have played around with the boxplot settings, see if you can address the following questions:
<span class="header2">Do the groups look approximately normal?</span>
* Are the boxplots approximately symmetrical around the medians? (the thick middle line)
<<radiobutton "$q15" "yes">> yes
<<radiobutton "$q15" "no">> no
<span class="header2">Do the groups look approximately of equal variance?</span>
* Are the boxplots and whiskers (lines) about the same size as each other? i.e. is the female boxplot about the same vertical size as the male boxplot?
<<radiobutton "$q16" "yes">> yes
<<radiobutton "$q16" "no">> no
[[Proceed|P07]]
[[<|P05]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
no
no
</span>
<</if>>
<strong>Are the groups approximately normal?</strong>
You stated: $q15
<<if $q15 is "no">>This is a bit subjective, but that is what I think too. The male boxplot in particular has a very long upper whisker. It just doesn't quite look symmetrical.
<<else>>Although this is a little subjective, I'm not sure that's the best choice. Maybe [[try again|P06]]
<</if>>
<strong>Are the variances approximately equal?</strong>
You stated: $q16
<<if $q16 is "no">>This is a bit subjective, but that is what I think too. The female boxplot looks much smaller than the male boxplot.
<<else>>Although this is a little subjective, I'm not sure that's the best choice. Maybe [[try again|P06]]
<</if>>
<<if $q15 eq "no" and $q16 eq "no">>Great work, $name! [[Proceed|P08]]
<<else>>
I'm not sure those are the best choices... maybe go back and look at the boxplots again.
<<endif>>
[[<|P06]]
[[<<|start]]<span class="header2">What now?</span>
So, we have a suspicion that the groups may not have normal distribution and the variances might not be equal. We could start transforming the response variable right away, but it would be better to check the diagnostic plots first. The diagnostic plots are plots of the residuals of the linear model, and strictly speaking the assumptions for a linear model relate to the residuals, and <strong>not the underlying data</strong>. For this reason, it is better to always check the diagnostic plots. To check the diagnostic plots, we actually have to run the model.
# Highlight both columns
# Select 'Univariate' from the menu
# Select 'ANOVA etc. (several samples)'
# Select 'Several samples tests (ANOVA, Kruskal-Wallis)'
Here are some images following these steps.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/past_one-way-ANOVA.png" alt="past" width="70%" height="auto"/>
You should get this output, but don't bother looking at it right now. Until we know if the assumptions are met, the P values might just be meaningless.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/past_one_way_ANOVA_results.png" alt="past" width="50%" height="auto"/>
Instead, click on the <strong>Residuals</strong> tab. First, we'll look at the qq-plot.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/past_one_way_ANOVA_qq_plot.png" alt="past" width="50%" height="auto"/>
You should see a qq-plot that looks the same as above. The qq-plot is a plot of a set of imaginary residuals that are perfectly normal (on the horizontal axis) and the real actual residuals of the model (on the vertical axis). If you think through the logic of this, then if the actual residuals are perfectly normal, they should align exactly with the imaginary normal residuals and form a dead straight line of dots in the figure. If the dots wander off the line, then the residuals are moving away from normality.
<span class="header2">A couple pointers...</span>
# Biological data is often 'noisy'. Some departure from normality should be expected.
# Linear models are actually reasonably tolerant of small departures from normality. Transformations involve a loss of information, so there is always a judgment call involved: is it worth transforming the data?
In the case of this dataset, the dots are wandering off the line at the top of the distribution. It's a fairly minor departure though, and I'd probably be willing to accept this without transformation. But we need to check for equal variance now too.
Change the 'Plot type' to 'Residuals vs means'. You should obtain this figure:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/past_one_way_ANOVA_equal_variances.png" alt="past" width="50%" height="auto"/>
Departures from the expectation of equal variance of residuals is a much bigger problem for linear models. The usual advice is to look for a 'wedge' or 'arrowhead' in the data. Because we have just two levels of a predictor, we only have two columns of dots (one for females, one for males), but we can still think about whether the is a 'wedge' in the data. A 'wedge' would indicate that variances are smaller at one end of the distribution than the other.
Here is the same figure, but I've manually drawn an imaginary 'wedge' over the data. It certainly looks like the spread of variances on the left is smaller than the variances on the right. This is a serious problem, and suggests that we do need to transform the data...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/past_one_way_ANOVA_equal_variances_2.png" alt="past" width="50%" height="auto"/>
[[Proceed|P09]]
[[<|P07]]
[[<<|start]]<span class="header1">Transformations</span>
Transforming data in Past is extremely easy. Just highlight the column you want to transform, and then select the 'Transform' menu and the transformation you want to use.
<span class="header2">Log transformation</span>
A log transformation is a standard transformation, and it is a good one to start with. Remember that we are using an iterative cycle, so that if we are concerned about the assumptions, we need to transform the data and start over...
# Highlight the MASS column
# Select 'Transform' > 'Log'
# Produce a boxplot just as you did before. Do the distributions look approximately normal? Do the distributions look approximately equal?
# Now produce a model to look at the Diagnostic plots
# Use 'Univariate' > 'ANOVA etc. (several samples)' > 'Several sample tests (ANOVA, Kruskal-Wallis)'
# Look at the Residual tab to check the qq-plot and the residuals vs mean plot.
Based on the log transformed data...
<span class="header2">Check normality: are the dots in the qq-plot falling approximately on the straight line?</span>
<<radiobutton "$q17" "yes">> yes
<<radiobutton "$q17" "no">> no
<span class="header2">Check equal variance: are the two columns approximately the same height?</span>
<<radiobutton "$q18" "yes">> yes
<<radiobutton "$q18" "no">> no
[[Proceed|P10]]
[[<|P08]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
yes
no
</span>
<</if>><<if $q17 eq "yes" and $q18 eq "no">>Good assessment. This is always a bit subjective, but I would agree that the qq-plot looks (mostly) normal and is (mostly) falling on the line, but the two groups are still quite unequal in their variances (vertical height)
[[Proceed|P11]]
<<else>>
I'm not sure those are the best choices... maybe [[go back and look at the plots again|P09]].
<<endif>>
[[<|P09]]
[[<<|start]]<span class="header1">Transformations</span>
Our log transformation wasn't very successful. But, if you remember our helpful figure, we just try and try again until we get a transformation that works.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/assumption-process-basic.png" alt="past" width="70%" height="auto"/>
<span class="header2">Box-Cox transformation</span>
A Box-Cox transformation is a quite powerful transformation that is well liked in Biological Sciences. However, because we have <strong>already transformed the MASS data</strong>, we either need to 'undo' the transformation, or just reload the file. I think it's worth practising loading files, so, do the following:
# Close Past
# Drag and drop your agilis-mass.csv into the window
# Remember to select the following options...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/import-ANOVA.png" alt="past" width="50%" height="auto"/>
Now, have you remembered that you need to change the 'Type' of the MF data to group?
# Click 'Column attributes'
# Change the 'Type' for MF to group
# Unselect 'Column attributes'
Finally, highlight the MASS column and select 'Transform' > 'Box-Cox'. The 'Box-Cox' will suggest a default Lambda, but don't worry about this for now. Just select Transform.
You should get numbers that look like this...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox.png" alt="past" width="50%" height="auto"/>
And a boxplot like this...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox_boxplot.png" alt="past" width="50%" height="auto"/>
And a qq-plot and resdiuals vs mean that looks like this...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox_qq.png" alt="past" width="50%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox_resid.png" alt="past" width="50%" height="auto"/>
The qq-plot now looks about as straight as it can get. There is still a slight difference between the two columns in the Residuals vs means plot, but it is well within the realm of being acceptable. Good, we can finally move onto interpreting results...
[[Proceed|P12]]
[[<|P10]]
[[<<|start]]<span class="header1">Understanding the output</span>
You should see an output window that looks like this...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox-1.png" alt="past" width="50%" height="auto"/>
And you can copy and paste the output, so that you end up with it as text, like so...
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Test for equal means</span></strong></span>
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Sum of sqrs df Mean square F p (same)</span></strong></span><br /><span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Between groups: 0.15331 1 0.15331 280.4 2.046E-40</span></strong></span><br /><span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Within groups: 0.113741 208 0.00054683 Permutation p (n=99999)</span></strong></span><br /><span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Total: 0.26705 209 1E-05</span></strong></span>
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Components of variance (only for random effects):</span></strong></span><br /><span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Var(group): 0.00146138 Var(error): 0.00054683 ICC: 0.727703</span></strong></span>
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">omega2: 0.5709</span></strong></span>
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Levene´s test for homogeneity of variance, from means p (same): 0.2346</span></strong></span><br /><span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Levene´s test, from medians p (same): 0.2247</span></strong></span>
<span style="font-size: 10pt; color: #800080;"><strong><span style="font-family: 'Courier New', Courier;">Welch F test in the case of unequal variances: F=285.1, df=208, p=7.418E-41</span></strong></span>
Here is a quick breakdown of the important components.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/linear_past_boxcox_explained.png" alt="past" width="100%" height="auto"/>
Remember that a P value is only reported to three decimal places, and other parts of tests (test statistics, such as F values, or sums of squares) are usually only reported to two decimal places.
The P value of p(same) = 2.046E-40 is actually a value of 0.0000000000000000000000000000000000000002046. One very common mistake we see in student reports (and a mistake you will lose marks for) is reporting this P value as P=2.046E-40 (incorrect) instead of P<0.001 (correct). You will tend to lose substantial marks for this mistake because it implies you haven't understood what the P value is or how to read it.
<span class="header2">Interpretation</span>
If the P value for the 'Between groups' result is <0.05, then we take this to mean that the result is significant. Specially, because an ANOVA is testing for differences, there is a significant difference in the mean between the two groups (i.e. male and female antechinus have different masses). In the case of our result, the P value returned is 2.046E-40.
<strong>2.046E-40 is greater than 0.05</strong>
<<radiobutton "$q19" "true">> true
<<radiobutton "$q19" "false">> false
<strong>The result indicates that the two means are significantly the same</strong>
<<radiobutton "$q20" "true">> true
<<radiobutton "$q20" "false">> false
Here's a trickier one...
<strong>Although the result is significant, the ω-squared value indicates that only 57.1% of variance in the response was explained by the predictor(s). This is a very weak effect size. In biology we expect variance explained to be at least 80% or more to be meaningful.</strong>
<<radiobutton "$q21" "true">> true
<<radiobutton "$q21" "false">> false
Or, how about this...
<strong>Because the residuals vs means diagnostic plots indicated that the residuals were mostly equal among groups, we should be reporting the Welch's F test results instead of the standard Test for equal means.</strong>
<<radiobutton "$q22" "true">> true
<<radiobutton "$q22" "false">> false
[[Check your answers|P13]]
[[<|P11]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
false
false
false
false
</span>
<</if>><strong>2.046E-40 is greater than 0.05</strong>
You stated: $q19
<<if $q19 is "false">>Correct. 2.046E-40 is less than 0.05.
<<else>>That doesn't look right to me. Maybe [[try again|P12]]
<</if>>
<strong>The result indicates that the two means are significantly the same</strong>
You stated: $q20
<<if $q20 is "false">>Correct. The result indicates that the means are significantly different, because an ANOVA is a test for difference in means.
<<else>>That doesn't look right to me. Maybe [[try again|P12]]
<</if>>
<strong>Although the result is significant, the ω-squared value indicates that only 57.1% of variance in the response was explained by the predictor(s). This is a very weak effect size. In biology we expect variance explained to be at least 80% or more to be meaningful.</strong>
You stated: $q21
<<if $q21 is "false">>Correct. 57.1% is actually quite good for a biological result. Biological data is often quite noisy, so it's unusual to obtain percentage of variance explained (R squared etc) values that are above 80%, and anything from 40-80% is considered fairly reasonable.
<<else>>This one is a bit subjective but that's not the choice I would pick. Maybe [[try again|P12]]
<</if>>
<strong>Because the residuals vs means diagnostic plots indicated that the residuals were mostly equal among groups, we should be reporting the Welch's F test results instead of the standard Test for equal means.</strong>
You stated: $q22
<<if $q22 is "false">>Correct. We would need to use the Welch's unequal variance test results if the diagnostic plots had indicated that variances were not equal. This is an alternative option that would allow us to avoid using a transformation. The assumption of normality still needs to be met, but Welch's procedure allows us to get around the assumption of variance (at least in some simple examples... as models become more complicated it gets harder to apply Welch's procedure)
<<else>>That doesn't look right to me. Maybe [[try again|P12]]
<</if>>
<<if $q19 eq "false" and $q20 eq "false" and $q21 eq "false" and $q22 eq "false">>Great work, $name! You've nailed those answers. [[Proceed|P14]]
<<else>>
I'm not sure those are the best choices... maybe go back and look at the questions again.
<<endif>>
[[<|P12]]
[[<<|start]]<span class="header1">Recap</span>
So, where are we at with things? We have reached the interpretation step of our flow diagram.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/sequence_04.png" alt="past" width="60%" height="auto"/>
In principle, the steps we've worked through form the basis of the steps required for more complicated linear models in Past, but here's the catch: the advantage of a point-and-click program, like Past, is that everything is intuitive and relatively easy to use, but the disadvantage is that everything is split up under different menus, and the various 'linear model' options are actually scattered all over the place. Here's a basic guide...
<strong>RESPONSE ~ CATEGORICAL PREDICTOR (2 or more levels)</strong>
* Univariate > ANOVA etc. (Several samples) > Several sample tests (ANOVA, Kruskal-Wallis etc)
<strong>RESPONSE ~ NUMERICAL COVARIATE * CATEGORICAL PREDICTOR (2 or more levels)</strong>
* Univariate > One-way ANCOVA
<strong>RESPONSE ~ CATEGORICAL PREDICTOR (2 or more levels) * CATEGORICAL PREDICTOR (2 or more levels)</strong>
* Univariate > ANOVA etc. (Several samples) > Two-way ANOVA
<strong>RESPONSE ~ NUMERICAL PREDICTOR</strong>
* Model > Linear > Bivariate
<strong>RESPONSE ~ NUMERICAL PREDICTOR * NUMERICAL PREDICTOR</strong>
* Model > Linear > Multiple (1 dependant, n independant)
Once you start including multiple predictors you will need to start checking for interaction terms. And, if you have a categorical predictor with 3 or more levels (and it turns out to be significant), then you'll need a Tukey's Test to compare the levels.
For now though, let's move to R and Rstudio. R handles all these linear models as equations, which means that you can create any variation of linear model that takes your fancy, rather than being restricted to a set of pre-ordained types of models.
[[Proceed to RStudio|R01]], or you can return to the [[start|start]].<span class="header1">Hypothesis, data and equation building</span>
The dataset contains a lot of columns of data, however we are going to focus on just these columns:
* <strong>MF:</strong> The sex of the agile antechinus, male or female.
* <strong>Month:</strong> The month of capture (note this is not all caps)
* <strong>Hb:</strong> Haemoglobin in grams per decilitre
* <strong>EXO:</strong> A rough count of the number of macroscopic arthropod exoparasites (ticks, mites etc) attached permanently to the animal's ears (i.e. mobile exoparasites like fleas weren't counted).
* <strong>MASS:</strong> Mass of the animal in grams
Haemoglobin has been suggested to be an indicator of overall health status or condition in free-living vertebrates, but evidence seems to be a bit lacking. We will investigate whether Haemoglobin is affected by sex, season, parasite load or mass (mass should be a rough indicator of body reserves and fat stores).
<strong>Given the investigation, what do you think is the response (dependant) variable?</strong>
<<radiobutton "$q22" "a">> (a) Mass
<<radiobutton "$q22" "b">> (b) Sex
<<radiobutton "$q22" "c">> (c) Parasite load
<<radiobutton "$q22" "d">> (d) Haemoglobin
<strong>Open up the csv file using Excel and have a look at the data. Answer the following questions:</strong>
<strong>What class of data is 'MF' (sex)?</strong>
<<radiobutton "$q23" "a">> (a) Binary (categorical with two levels)
<<radiobutton "$q23" "b">> (b) Categorical (three or more levels)
<<radiobutton "$q23" "c">> (c) Count
<<radiobutton "$q23" "d">> (d) Continuous
<strong>What class of data is 'Month'?</strong>
<<radiobutton "$q24" "a">> (a) Binary (categorical with two levels)
<<radiobutton "$q24" "b">> (b) Categorical (three or more levels)
<<radiobutton "$q24" "c">> (c) Count
<<radiobutton "$q24" "d">> (d) Continuous
<strong>What class of data is 'Hb' (haemoglobin in grams per decilitre)?</strong>
<<radiobutton "$q25" "a">> (a) Binary (categorical with two levels)
<<radiobutton "$q25" "b">> (b) Categorical (three or more levels)
<<radiobutton "$q25" "c">> (c) Count
<<radiobutton "$q25" "d">> (d) Continuous
<strong>What class of data is 'EXO' (parasite load)?</strong>
<<radiobutton "$q26" "a">> (a) Binary (categorical with two levels)
<<radiobutton "$q26" "b">> (b) Categorical (three or more levels)
<<radiobutton "$q26" "c">> (c) Count
<<radiobutton "$q26" "d">> (d) Continuous
<strong>What class of data is MASS (mass in grams)?</strong>
<<radiobutton "$q27" "a">> (a) Binary (categorical with two levels)
<<radiobutton "$q27" "b">> (b) Categorical (three or more levels)
<<radiobutton "$q27" "c">> (c) Count
<<radiobutton "$q27" "d">> (d) Continuous
[[Proceed|R04]]
[[<|R02]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
d
a
b
d
c
d
</span>
<</if>><span class="header2">Answers</span>
<strong>What is the most sensible response variable?</strong>
<<if $q22 is "d">>Correct! The response (dependant) variable is the variable that is predicted by the other variables. We are measuring sex, month, exoparasite load and mass to check if these things have any effect on haemoglobin.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<strong>What data class is 'MF'?</strong>
<<if $q23 is "a">>Correct! Sex is a binary categorical variable. There are only two 'levels', male and female.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<strong>What data class is 'Month'?</strong>
<<if $q24 is "b">>Correct! Month is a categorical variable with three or more 'levels'. Each month is a level, that is March is a level, April is a level, May is a level etc.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<strong>What data class is 'Hb'?</strong>
<<if $q25 is "d">>Correct! Haemoglobin is a continuous variable. Continuous variables are numbers that could in principle have a decimal place. In this case some of the Hb values do have decimal places, so the variable must be continuous.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<strong>What data class is 'EXO'?</strong>
<<if $q26 is "c">>Correct! exoparasite count is a count. Even if this didn't have the word 'count' in the name, that none of the values have decimal places makes it awfully suspicious that this is probably a count rather than a measurement.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<strong>What data class is 'MASS'?</strong>
<<if $q27 is "d">>Correct! Mass is a continuous variable. Continuous variables are numbers that could in principle have a decimal place. In this case some of the mass values do have decimal places, so the variable must be continuous.
<<else>>Hmm, that doesn't look right. Maybe [[try again|R03]]
<</if>>
<<if $q22 eq "d" and $q23 eq "a" and $q24 eq "b" and $q25 eq "d" and $q26 eq "c" and $q27 eq "d">>Great work, $name! [[Proceed|R05]]
<<else>>
Something doesn't look quite right there. Be sure to check columns of data in the spreadsheet carefully.
<<endif>>
[[<|R03]]
[[<<|start]]<span class="header1">Hypothesis, data and equation building</span>
One of the advantages of RStudio (and R) is that you can build any manner of linear model, without restriction to certain types presented in menus.
On the other hand, this can be a drawback, because it means you really need to know what you are doing when using RStudio.
On the other hand again, this might be an advantage, for exactly the same reason. If you can't obtain P values <strong>without</strong> a good understanding of what you are doing, that might actually be to your advantage. It can be easy to get P values from a point-and-click program, but if you aren't clear about what you are doing, then the results might be meaningless.
We're going to look at several potential models, building up complexity...
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex</strong>
* Hb ~ Intercept + MF + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season</strong>
* Hb ~ Intercept + MF * Month + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season and parasite load</strong>
* Hb ~ Intercept + MF * Month * EXO + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season and parasite load and mass of the animal</strong>
* Hb ~ Intercept + MF * Month * EXO * MASS + Error
<span class="header1">Where are we in the sequence?</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_02-1.png" alt="sequence" width="60%" height="auto"/>
<span class="header1">Ordering predictors</span>
Because the order of predictors can have an effect on [[the way that variance is explained|aside01]], and therefore P values obtained, the order of predictors is important.
If in doubt, we want to:
# Place predictor (independent) variables that we want to control for (i.e. we maybe aren't very interested in the P value) at the start of a sequence
# Place predictor variables (dependant) variables that we are actually interested in from a hypothesis viewpoint towards the end of a sequence.
<span class="header2">Relevance of predictors</span>
* Because male and female agile antechinus have quite different life histories it would be unsurprising (and not very scientifically interesting) to discover that the sexes have different Hb levels. But, sex is something we should control for in the model.
* Antechinus are seasonal annual breeders that undergo a synchronised breeding rut and subsequent male die-off (males only live to breed once). Month will probably affect Hb, but it might not be very informative about health status. Still, it is something that needs to be controlled for.
* Exoparasite load should correlate with health status. Animals with more parasites should be in poorer condition, and at least in theory higher parasite loads should equal lower Hb levels.
* Mass is a reasonable (if rough) indicator of metabolic reserves in small mammals. Animals with higher mass should be in better condition, and in theory, animals with higher mass should have higher Hb, if Hb is actually a measure of overall health status.
Given these descriptions is the following sequence of predictors appropriate or inappropriate?
* Hb ~ Intercept + MF * Month * EXO * MASS + Error
<<radiobutton "$q28" "appropriate">> appropriate
<<radiobutton "$q28" "inappropriate">> inappropriate
[[Proceed|R06]]
[[<|R04]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
appropriate
</span>
<</if>><span class="header1">Order of predictors</span>
If a model contains two or more predictors, we have to take a step back and think about the order that the predictors will be entered into the model. This is especially important in R, because R defaults to something called a 'Type I' sum of squares, or also sometimes a 'sequential sum of squares', when calculating P values. If you design is unbalanced (i.e. if there are uneven numbers of samples in treatments) and you use Type I sum of squares, the order of predictors can have important implications for final P values.
Here are some diagrams that attempt to explain the differences between Types I, II and II sum of squares in a reasonably accessible way.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/Type_I_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/Type_II_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/02/Type_III_sum_of_squares.png" alt="past" width="100%" height="auto"/>
<span class="header2">What is an interaction term?</span>
We will come back to interaction terms in more detail later, but, briefly, an interaction term represents the question: <strong> Do these two (or more) Predictors have a complex relationship in their effect on the Response? </strong> If for example, sunlight is having a positive effect on plant growth rate, but only at high nitrogen treatment levels, then we might expect the interaction term to be significant. It can be thought of as a term that asks the question: does the effect of one predictor depend on the level of the other predictor? (where P < 0.05 is 'yes' and P > 0.05 is 'no').
<span class="header2">How do we order predictors?</span>
So we often need to consider the order of predictors in a model. How do we do this? What rules do we use? At the most basic level, you want to endeavour to do two things:
# Predictors that you have only included because you want to control for their effect (i.e. you are probably not interested in the P value), should be placed earlier in the sequence. This is true of classic 'covariates' that are included to control for a known effect in a model like an ANCOVA.
# Predictors that directly relate to your hypothesis should be placed later in the model. The more important the hypothesis, the later the term should be placed.
Our reasons for using these rules are two-fold: (1) if you want a model to control for another potential variable, it must be included earlier in the model or else it won't actually get to explain any of the variance before other variables. A covariate included early in a model in this way can sometimes 'clean up' noise in the data, and make it easier for a pattern related to the hypothesis to be identified, and (2) if a predictor related to your hypothesis is really important, it should still be significant even after all the other predictors have already had a shot at explaining some of the variance in the model.
[[Return to tutorial|R05]]<strong>Is the equation we have written out appropriate?</strong>
<<if $q28 is "appropriate">>Correct! The two variables we want to control for (sex and season) are placed earlier (to the left) in the equation, whereas the variables we want to test (parasite load and mass) are placed later in the sequence (to the right).
<<else>>Hmm, that doesn't look quite like the answer I would pick. Perhaps [[try again|R05]]
<</if>>
<<if $q28 is "appropriate">>
<span class="header1">Expectations of effect</span>
We should also work out the expected directions of effect. This is an earlier step, but we skipped it when thinking about the equations.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_02-1.png" alt="sequence" width="60%" height="auto"/>
<strong>Assume the hypotheses that exoparasite load and mass predict Hb is correct. Also assume that animals with higher Hb will be in better overall health. What is the expected direction of effect...</strong>
<strong>...for exoparasite load and Hb?</strong>
<<radiobutton "$q29" "positive">> exoparasite load and Hb should be expected to show a positive relationship (i.e. as exoparasite load increases, Hb increases)
<<radiobutton "$q29" "negative">> exoparasite load and Hb should be expected to show a negative relationship (i.e. as exoparasite increases, Hb decreases)
<strong>...for mass and Hb?</strong>
<<radiobutton "$q30" "positive">> mass and Hb should be expected to show a positive relationship (i.e. as mass increases, Hb increases)
<<radiobutton "$q30" "negative">> mass and Hb should be expected to show a negative relationship (i.e. as mass increases, Hb decreases)
[[Proceed|R07]]
[[<|R05]]
[[<<|start]]
<</if>>
<<if $name is "cheat">>
<span class="cheat">
negative
positive
</span>
<</if>><span class="header2">Answers</span>
<strong>Direction of effect: exoparasites</strong>
<<if $q29 is "negative">>Correct! Animals with more exoparasites should be expected to have lower Hb (assuming the hypothesis is correct).
<<else>>Not sure that looks right. Maybe [[try again|R06]]
<</if>>
<strong>Direction of effect: mass</strong>
<<if $q30 is "positive">>Correct! Animals with higher mass should be expected to have higher Hb (assuming the hypothesis is correct).
<<else>>Not the option I'd pick. Maybe [[try again|R06]]
<</if>>
<<if $q29 eq "negative" and $q30 eq "positive">>Excellent work, $name! [[Proceed|R08]]
<<else>>
Something doesn't look quite right there. Be sure to think carefully about the ways that parasites or mass would relate to general health.
<<endif>>
[[<|R06]]
[[<<|start]]<span class="header2">Import dataset in RStudio</span>
If you already know how to important a dataset into RStudio, do this now. If you don't know how to do this, or if you need a refresher, go to this page [[here|aside02]] first.
Call the dataset <strong>agilis</strong> when you import it. You should see a window that looks like this:
<span class="header2">Base</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/import_base.png" alt="sequence" width="40%" height="auto"/>
Or this:
<span class="header2">Readr</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/import_reader.png" alt="sequence" width="60%" height="auto"/>
[[Proceed|R09a]]
[[<|R07]]
[[<<|start]]<span class="header2">Importing dataset</span>
RStudio will import a copy of the file only, so that anything you do in RStudio won't be saved in the original file. This means that you always have a backed-up data file, which is good to have.
Now, see if you can import the dataset. The 'Import Dataset' button should be visible on the top right-hand window.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R02.png" alt="past" width="100%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R01.png" alt="past" width="100%" height="auto"/>
You might see 'base' and 'readr' options... either of these are fine. They do the same thing but rely on different packages to do so. Here is the window you see when you use the 'readr' option.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/readr_import.png" alt="RStudio" width="100%" height="auto"/>
You want to be sure that:
# You have a <strong>Name</strong> that is short and easy to type
# <strong> First Row as Names</strong> is ticked
# The <strong> Delimiter </strong> is set to <strong>Comma</strong> for a csv file (commas separated values).
Here's a short video showing the saving of a csv file, and importing it into RStudio. You shouldn't have to switch tick boxes on and off (everything should default to the correct settings), but I've done this in the video so that you can see how it changes the way R sees the dataset.
<iframe width="560" height="315" src="https://www.youtube.com/embed/SaJDmlwwm2k?rel=0&showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
[[return to tutorial|R08]]
Now, create either a new R Notebook or Script file depending on your preference. If you need a refresher on how to do this (and what the differences are), you can have a quick read of this page [[here|aside03]].
We're going to be running code soon. Here are some reminders about the way that code will be colour coded and how code is entered and run...
<span class="header1">Colour coding</span>
Code will be colour coded to make it easier to understand. Orange will be used for functions (commands), blue for anything that you can change or name yourself (the 'moving pieces' of the code, if you like), black for basic syntax requirements (arrows, brackets, commas mostly), green for comments (R doesn't read anything after a hash tag), and purple for libraries. For example:
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>library</strong></span><strong>(<span style="color: #800080;">vcd</span>) <span style="color: #008000;"># loads library 'vcd'</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">attach</span>(<span style="color: #0000ff;">skink</span>) <span style="color: #008000;"># attaches the skink dataset</span></strong></span>
<span class="header1">Running code</span>
Code will be entered into either your Script or Notebook, and run from there. It doesn't matter which you use, but for your own sake you must use either a Script or Notebook. Entering code directly into the execution window is possible, __but nothing is saved__.
Have a go at entering this code and running it.
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">data</span>(<span style="color: #0000ff;">cars</span>) <span style="color: #008000;"># loads the cars dataset, which comes bundled with R</span></span></strong>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">cars</span>) <span style="color: #008000;"># plots the cars dataset</span></span></strong>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">str</span>(<span style="color: #0000ff;">cars</span>) <span style="color: #008000;"># gives you the 'structure' of the cars dataset. Really useful for spotting errors in a dataset!</span></span></strong>
These are the instructions for running code:
# <strong>R script:</strong> Use the 'Run' button or command-return (mac) or Cntrl-enter (PC) to run a single line of code. The cursor will drop down a line.
# <strong>R script:</strong> To run a chunk, highlight what you want to run and use the run commands as per above.
# <strong>R notebook:</strong> There is no the 'Run' button. Use command-return (mac) or Cntrl-enter (PC) to run a single line of code. The cursor will drop down a line.
# <strong>R script:</strong> To run a chunk, click the little green play arrow associated with the chunk. You need organise code into chunks as you go.
[[Proceed|R10]]
[[<|R08]]
[[<<|start]]Once you have downloaded R and RStudio, open RStudio (there is no need to open R as well. It will open automatically inside R Studio). You should get a set of windows that look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/RStudio_1.png" alt="RStudio" width="80%" height="auto"/>
This might all look a bit confusing, but it's easy to navigate once you know what the parts are. Before we go on, one important bit is missing. You need a place to store and save your script. You can use either:
* <strong>R script:</strong> plain text file. Highly stable, but not very flashy to look at. Uses the name_of_file.R appended file type, but actually, it is just a name_of_file.txt with the extension changed.
* <strong>R Notebook:</strong> attemping to mimic a <a href="http://jupyter.org/" target="_blank">Jupyter Notebook</a> layout. Some people like R Notebooks. The main substantial advantage of R Notebooks is that if you are involved in a collaborative project (i.e. between you and a supervisor in an honours), then the Notebook is a live document that is easier to work on together. The only real downside of notebooks is that they are marginally more complicated to navigate than a plain R script, and they are less stable (only because nothing is more stable than a plain text file).
Either is totally fine. It's just a matter of preference.
* However, remember, if you are working in a plain text file you need to use plenty of hashtags (#) to leave notes for yourself
* And if you are working in a notebook you should be using one chunk per task. That is, when you move onto a new 'task' in the R scripting, you should create a new chunk for this.
<span class="header2">Why are we not using the learnr R package?</span>
The learnr R package is great for baby steps, but, I actually think it makes running code a little too easy. It is easy to switch off mentally and just start running stuff on autopilot. It's also only somewhat stable, and new updates in the past have sometimes caused the learnr R package to run aground.
* Using a single javascript file is stable and portable. You do need internet access for all the videos and images to load correctly, but this file should never trip over because of updating problems.
* Copying and pasting (or better yet, actually writing out code) is much closer to what your actual R coding experience will be like. It's best to get a taste of this now.
* Although it appears to be possible to 'colour code' the code in learnr, doing so is by no means straightforward. I think colour coding is hugely helpful for understanding the code, rather than just seeing it as a jumble of letters and symbols. That's a big reason why I've opted for a javascript tutorial instead.
<strong>To open a new R Script select 'File > New File > R Script'</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R-script.png" alt="RStudio" width="50%" height="auto"/>
<strong>To open a new R Notebook select 'File > New File > R Notebook'</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R-notebook.png" alt="RStudio" width="50%" height="auto"/>
Because I'm not a very clever person, I find R Notebooks a bit confusing. This means I mostly work in plain old fashioned R scripts. <strong>__However, this makes no difference to the steps for running tests.__</strong> The only difference is purely to do with how the code is run...
# <strong>R script:</strong> Use the 'Run' button or command-return (mac) or Cntrl-enter (PC) to run a single line of code. The cursor will drop down a line.
# <strong>R script:</strong> To run a chunk, highlight what you want to run and use the run commands as per above.
# <strong>R notebook:</strong> There is no 'Run' button. Use command-return (mac) or Cntrl-enter (PC) to run a single line of code. The cursor will drop down a line.
# <strong>R script:</strong> To run a chunk, click the little green play arrow associated with the chunk. You need organise code into chunks as you go.
Once you have a Script or Notebook, you should have four windows. Here are the important parts of the windows.
<span class="header1">R Script Layout</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R01.png" alt="RStudio" width="100%" height="auto"/>
<span class="header1">R Notebook Layout</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R02.png" alt="RStudio" width="100%" height="auto"/>
[[return to tutorial|R09]]<span class="header1">Building a model</span>
We are now up to Step 5: Build a model.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/11/sequence_03.png" alt="sequence" width="60%" height="auto"/>
We will start with the most basic of the models I presented earlier...
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex</strong>
* Hb ~ Intercept + MF + Error
The code used to define an equation in R is exactly the same as the equations we have been writing <strong>except that the intercept and error terms are included by default and do not need to be defined</strong>. This means that the basic equation looks like this:
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;">Hb <span style="color: #000000;">~ </span>MF</span></span></strong>
R recognises this as 'Hb as a function of MF', but, R also needs you to provide a function, otherwise the program doesn't know how to analyses this relationship. There are two functions that allow you to build a linear model.
<span class="header1">aov</span>
This is the ANOVA function. It is recommended when your primary predictors of interest are categorical (i.e. factors with two or more levels).
<span class="header1">lm</span>
This is the linear model function. It is also called a regression analysis. It is recommended when your primary predictors of interest are numerical (either count, ordinal or continuous).
But, importantly, although the maths is different, both methods (typically) arrive at the same results. A linear model (lm) just turns factors into dummy numbers. An ANOVA just treats numbers as covariates.
Try entering these equations and see what happens (you won't get P values, instead you should see a set of values describing the model).
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong><br /><strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">lm</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong>
You should see results that look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/models_01.png" alt="sequence" width="40%" height="auto"/>
The results are not interpretable, rather R is just telling you 'here are the basic descriptors of the model you just called'.
Now that you have a model built and running, let's [[test assumptions|R11]]
[[<|R09]]
[[<<|start]]<span class="header1">Assumptions</span>
Here are the assumptions of a linear model.
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;"><strong>ASSUMPTIONS OF ANOVAs (and ANCOVAs etc)</strong></span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(1) Residuals are normally distributed</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(2) Residuals must be independent (collected randomly)</span>
<span style="font-family: Arial, Helvetica, sans-serif; color: #3366ff; font-size: 18pt;">(3) Residuals have equal variances</span>
Other potential problems can occur if the following are not addressed, but we are going to put these aside for the time being to keep things simpler:
<span style="color: #3366ff; font-size: 18pt;"><strong>NOT ASSUMPTIONS BUT OTHER PROBLEMS MAY OCCUR IF…</strong></span>
<span style="color: #3366ff; font-size: 18pt;">(1) Predictor variables correlate (r > 0.6 considered problematic) (two variables explain the same thing)</span>
<span style="color: #3366ff; font-size: 18pt;">(2) The model is over-parameterised (too many predictors given your data set size)</span>
<span style="color: #3366ff; font-size: 18pt;">(3) You fail to check for significance of interaction terms (your main effects could be meaningless)</span>
<span class="header2">The process</span>
Here is the sequence diagram at the assumption testing step...
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/sequence_03.1.png" alt="sequence" width="60%" height="auto"/>
And here is a basic flowchart to help you step through the process...
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/assumption-process-basic.png" alt="transformations" width="70%" height="auto"/>
[[Proceed|R12]]
[[<|R10]]
[[<<|start]]<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex</strong>
* Hb ~ Intercept + MF + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season</strong>
* Hb ~ Intercept + MF * Month + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season and parasite load</strong>
* Hb ~ Intercept + MF * Month * EXO + Error
<strong> Hypothesis: Haemoglobin is affected by agile antechinus sex and the season and parasite load and mass of the animal</strong>
* Hb ~ Intercept + MF * Month * EXO * MASS + Error<span class="header2">Did you remember to change the name of the dataset from 'agilis_morphometrics' to 'agilis'?</span>
Look at the top right-hand environment window. Is the dataset called 'agilis' and not 'agilis_morphometrics'?
[[Yes. I feel pretty smug.|R09]]
[[No. I need to import it again...|R08]]<span class="header1">Object orientated coding</span>
We will be using 'object orientated' coding that uses the 'attribute' arrow (<strong>"<-"</strong>) to assign models to objects that can then be tested for assumptions and examined for significance. The same methods can be used for model fitting approaches (such as the use of AICs), but we will focus on classical hypothesis testing here.
<span class="header1">A note on tidyverse piping</span>
Although it turns out that it is possible to construct linear models using code based entirely on tidyverse piping (<strong>"%>%"</strong>), this is extremely unusual. Even researchers who like tidyverse piping very much, tend to restrict its use to data wrangling and management.
[[Onward|R13]]
[[<|R11]]
[[<<|start]]<span class="header1">Boxplot your data</span>
Boxplotting your data isn't really an assumption test, but it can give a sense of what the distribution looks like, and whether the underlying data might be normally distributed. It's always a good first step.
<strong>If data is normal, we expect:</strong>
* Boxplots to be roughly symmetrical above and below the median line
<strong>If data has equal variances among groups we expect:</strong>
* Boxplots should be roughly about the same height as each other.
Try this code:
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong>
You should get this boxplot:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/basic_boxplot.png" alt="past" width="60%" height="auto"/>
If you want to add labels and a different colour to the boxplots, you can try this...
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis<span style="color: #000000;">, xlab = "<span style="color: #0000ff;">Sexes</span>", ylab = "<span style="color: #0000ff;">Hb (g/dL)</span>", col = <span style="color: #ff6600;">c</span>("<span style="color: #0000ff;">white</span>","<span style="color: #0000ff;">grey</span>"</span></span>))</span></strong>
You should get this image.
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/boxplot_fancy.png" alt="past" width="60%" height="auto"/>
You can also try playing around with different colours. What happens if you replace "grey" with "hotpink4" or "firebrick2"? Here's a summary of (just some) R colours:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/colour-names.png" alt="past" width="70%" height="auto"/>
Looking at your boxplots, answer the following:
<strong>Are the variances approximately equal for the two sexes?</strong>
<<radiobutton "$q31" "a">> (a) No, they are quite different
<<radiobutton "$q31" "b">> (b) Yes, although the variance for female Hb is a little larger than for male Hb
<<radiobutton "$q31" "c">> (c) Yes, although the variance for male Hb is a little larger than for female Hb
<<radiobutton "$q31" "d">> (d) Yes, the variances are exactly the same for the two sexes
<strong>Are the boxplots approximately symmetrical above and below the median (thick central) line?</strong>
<<radiobutton "$q32" "a">> (a) Yes, more or less
<<radiobutton "$q32" "b">> (b) No, they are quite skewed
[[Check your answers|R14]]
[[<|R12]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
b
a
</span>
<</if>><span class="header2">Answers</span>
<strong>Are the variances approximately equal for the two sexes?</strong>
<<if $q31 is "b">>Correct! The variance is a little larger for the female Hb, but it isn't a substantial difference.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R13]]
<</if>>
<strong>Are the boxplots approximately symmetrical above and below the median (thick central) line?</strong>
<<if $q32 is "a">>Correct! The boxplots look fairly symmetrical above and below the median lines. This suggests they could be normally distributed (although this is not definitive... it's just a rough indicator).
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R13]]
<</if>>
<<if $q31 eq "b" and $q32 eq "a">>Excellent, $name! [[Proceed|R15]]
<<else>>
Something doesn't look quite right there. This is a bit subjective, but perhaps go back and just look at the boxplots again very carefully.
<<endif>>
[[<|R13]]
[[<<|start]]<span class="header1">Checking assumptions</span>
Although there are various assumption tests (i.e. Shapiro-Wilk's for normality, Bartlett for equal variance among groups), these need to be applied to the residuals of the model (not the original data), which complicates their use when testing linear models.
For this reason it is usually preferable to simply use diagnostic plots in R. These allow you to quickly examine the 'patterns' of the residuals and look for departures from equal variance and normality of residuals.
To do this, we simply use the R function <strong>'plot'</strong> on the model we want to examine. However, because this will generate (up to) four plots depending on the type of model we are looking at, we need to first create space in the plotting window to look at them all together.
Use this code the set the plotting window to a 2 by 2 array (i.e. four plots arranged 2 across by 2 down).
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>2</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>2</strong></span><strong>))</strong></span>
The 'par' stands for parameter, and is used to change graphing parameters. Any change you make at the level of 'par' applies to all graphs that you produce for the rest of the session. You call the help menu to see what else can be changed in 'par'.
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">?</span>par</strong></span>
If you want to restore the defaults for your graphs (i.e. if you've played around with parameters and everything has gotten a bit strange), you can use the 'dev.off' function, but be warned, <strong>this will also clear all of your graphing history too</strong>.
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">dev.off</span>()</strong></span>
Okay. We're going to fit a simple model, and then plot it to see what the residuals look like. Remember that green text is notes, and anything after a hashtag is just for your benefit (i.e. R doesn't read anything after a hashtag on a line). Now, try this...
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>2</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>2</strong></span><strong>))</strong></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># set plotting window to 2x2 array</span></span></strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><span style="color: #008000;"># create an object called 'model.aov' (it could be called anything you like), and drop your ANOVA model into it.</span></span></span></strong>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># plot the model we created above</span></span></strong></span>
You should obtain four plots that look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Basic_model_diagnostics.png" alt="diagnostics" width="90%" height="auto"/>
[[But what does it all mean?|R16]]
[[<|R14]]
[[<<|start]]<span class="header2">Residuals vs Fitted</span>
Used to check linearity and homogeneity of variances. The red line should be (relatively) straight and horizontal for the data to be linear. The cloud of data should be (relatively) amorphous (cloud like), and should not have a 'wedge' or 'arrowhead' shape.
<span class="header2">Normal Q-Q Plot</span>
Used to check normality of residuals. The horizontal axis plots the values that the residuals should show if the residuals were perfectly normal. The vertical axis shows the actual residuals. It follows then that if the data is departing from normality, the dots (real vs theoretical residuals) would wander off the red line, which represents a perfect 1:1 relationship.
<span class="header2">Scale-Location</span>
Used to check linearity and homogeneity of variances. You read this plot in the same way as the Residuals vs Fitted plot. The Residuals vs Fitted plot is consider a (slightly) better diagnosis plot for linearity, whereas the Scale-Location is considered a (slightly) better diagnosis plot for equal variances.
<span class="header2">Residuals vs Leverage</span>
This is not strictly speaking an assumption test at all, rather it is checking whether there are any data points that would have a significant effect on the model if removed. This can be thought of as a test for statistical outliers. If a data point is on the other side of the 0.5 or 1.0 Cook's Distance (from the rest of the data), then it is having a substantial independent affect on the model as a whole.
<strong>What to do with Outliers?</strong>
My standard advice is that unless you have a strong reason to believe that an outlier was due to human error or machine error, you should leave it in the data. The reason for this is that removing outliers can lead to a temptation to start 'adjusting' the data so that it fits expectations (i.e. the hypothesis), which is far from ideal.
<span class="header1">Example diagnostic plots</span>
Below is a set of diagnostic plots where there are substantial problems with the residuals, and assumptions are not being met.
Let's start with the Residuals vs Leverage plot. If one (or more) of your datapoints are on the other side of the Cook's d 0.5 line, then this is an outlier, and it is having an out of proportion effect on the whole model.
I've drawn an orange circle around an outlier that is on the other side of the 0.5 Cook's d when compared to all the other datapoints.
* The number next to the datapoint (13) is the row number of the data point in your dataset (so you can find it).
* Sometimes your data is so good, you can't even see the Cook's d line (it is outside the box somewhere). Notice that there is a solid red line as well? It is the dashed line with the 0.5 next to it that you are worried about.
* This isn't strictly speaking an assumption. Generally speaking, I prefer not to remove outliers, as long as we have confidence that the number is biologically 'real' (i.e. it isn't a result of measurement error).
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/diagnostics_1.png" alt="past" width="80%" height="auto"/>
Now, let's look at the Normal Q-Q. The straight line is relatively flat in this example (it isn't always, often it runs diagonal through the box). The series of residuals (white circles) is wandering off the line towards the right of the box. This means that the residuals are not following a normal distribution. If they were following a normal distribution, they would all fall on the line.
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/diagnostics_2.png" alt="past" width="80%" height="auto"/>
Finally, we can look at Residuals vs Fitted and Scale-Location together, as they are both assessing the same thing.
* In both plots there is a clear 'wedge' or 'arrow-head' pointing left. I've drawn orange triangles in to make these 'wedge' shapes even easier to see.
* The red line for Residuals vs Fitted is relatively straight and the red line for Scale-Location is only a little curved. This suggests that the relationship is probably linear (where we expect to see straight red lines), rather than non-linear, where we would expect to see weird wavy or squiggly red lines instead.
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/diagnostics_3.png" alt="past" width="80%" height="auto"/>=
Now, let's look at our diagnostic plot again:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Basic_model_diagnostics.png" alt="past" width="80%" height="auto"/>
Note that there are only two columns of residuals in the plots. We expect this to be the case, because we are examining haemoglobin as a function of sex, which only has two levels. Remember that females had slightly higher variance of Hb than males. Look at the Residuals vs Fitted. The column of data on the left is slightly more spread out than on the right. it would be a reasonable guess that the females are in the lefthand column, and the males are in the righthand column.
<strong>Is the relationship linear?</strong>
<<radiobutton "$q32" "a">> (a) Yes, the red lines in the Residuals vs Fitted and Scale-Location are both straight
<<radiobutton "$q32" "b">> (b) No, the red lines in the Residuals vs Fitted and Scale-Location are both wavy and have curves in them
<strong>Are the variances of residuals equal across the sample?</strong>
<<radiobutton "$q33" "a">> (a) Yes. The two columns of data in the Residuals vs Fitted and Scale-Location are both about equal in height. There is no obvious 'wedge' shape.
<<radiobutton "$q33" "b">> (b) No. The two columns of data in the Residuals vs Fitted and Scale-Location are quite unequal in height. A 'wedge' shape could easily be drawn over the plots.
<strong>Are the residuals normally distributed?</strong>
<<radiobutton "$q34" "a">> (a) Yes. The residuals (open circles) more or less fall on the straight Q-Q plot line.
<<radiobutton "$q34" "b">> (b) No. The residuals (open circles) wander away from the straight Q-Q plot line quite substantially.
<strong>Are there one or more outliers in the data?</strong>
<<radiobutton "$q35" "a">> (a) No. The data is so good that the Cook's d dashed line isn't even visible in the plot.
<<radiobutton "$q35" "b">> (b) Yes. The data points at row numbers 105 and 112 are both outliers.
[[Check your answers|R17]]
[[<|R15]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
a
a
a
a
</span>
<</if>><span class="header2">Answers</span>
<strong>Is the relationship linear?</strong>
<<if $q32 is "a">>Correct! The red lines look very straight. This suggests the relationship is linear.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R16]]
<</if>>
<strong>Are the variances of residuals equal across the sample?</strong>
<<if $q33 is "a">>Correct! The two columns look to be similar in height in both the Residuals vs Fitted and Scale-Location plots. There is no strong 'arrowhead' or 'wedge' in the plots.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R16]]
<</if>>
<strong>Are the residuals normally distributed?</strong>
<<if $q34 is "a">>Correct! Although the datapoints are wandering away from the line a little bit to the bottom left, it is a fairly minor departure. Keep in mind that real biological data is never perfect. This is good enough to accept.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R16]]
<</if>>
<strong>Are there one or more outliers in the data?</strong>
<<if $q34 is "a">>Correct! The dashed Cook's d line isn't even visible. There are no outliers.
<<else>>Remember that the Cook's d line is dashed (not solid). Go back and check whether you can see a dashed line anywhere on the figure. [[Try again|R16]]
<</if>>
<<if $q32 eq "a" and $q33 eq "a" and $q34 eq "a" and $q34 eq "a">>Fantastic work, $name! [[Proceed|R18]]
<<else>>
Something doesn't look quite right there. This is a bit subjective in places, but perhaps go back and just look at the diagnostic plots again.
<<endif>>
[[<|R16]]
[[<<|start]]<span class="header1">Transformations</span>
Our model looks fine (we could move onto interpreting it), but what do we do if the residuals are not meeting assumptions (i.e. if there is a 'wedge' in the residuals, or if the points are wandering off the qq line?). The standard (albeit it somewhat old fashioned) step is to try transforming the response variable. Note that we don't typically try transforming the predictors (the indepdendent variables).
A transformation is a simple mathematical operation that changes the distribution, but not the order of the numbers in a dataset. A really simple transformation would be to arrange all your numbers from largest to smallest and then just rank them 1, 2, 3, 4 etc. Some non-parametric methods, like a Spearman's rho correlation test actually do something similar to this.
But we may need to try out several transformations before we find one that works. Let's see how this works with a response variable that is more problematic.
The variable <strong>MASS</strong> is a measure of the individual mass (g) of agile antechinus caught. Because this species is sexually dimorphic, we might expect there to be potential problems if we were examining mass as a function of sex.
Use the same steps you did above, but with <strong>MASS</strong> instead of <strong>Hb</strong>.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/12/assumption-process-basic.png" alt="past" width="70%" height="auto"/>
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>1</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>1</strong></span><strong>))</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># reset plotting window to 1x1 array (to show one figure at a time)</span></span></strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">MASS</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong>
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>2</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>2</strong></span><strong>))</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># set plotting window to 2x2 array</span></span></strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">MASS</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong> <strong><span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><span style="color: #008000;"># create an object called 'model.aov' (it could be called anything you like), and drop your ANOVA model into it.</span></span></span></strong>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># plot the model we created above</span></span></strong></span>
R allows you to transform a variable and create a new column with the transformed variable. Try this and then scroll to the righthand end of the dataset.
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">agilis<span style="color: #000000;">$</span>logMASS </span><span style="color: #000000;"><- </span>log</span>(<span style="color: #0000ff;">agilis</span>$<span style="color: #0000ff;">MASS</span>)</span></strong>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">View</span>(<span style="color: #0000ff;">agilis</span>)</strong></span>
You should see a new variable <strong>logMass</strong> appear on the very end of the dataset. Note that you could have named <strong>logMass</strong> anything you want. I just named it <strong>logMASS</strong> because this allows me to remember that this is the log of mass.
Here are a bunch more transformations you can try. Some will work on any set of numbers, but some only work for very particular types of numbers (i.e. only on percentages entered as 0.00 to 1.0). In each instance <strong>'x'</strong> is the variable we want to transform. To make life easier, we can just drop the variable into an object called <strong>'x'</strong>. Note that 'yourdata' is just the name of your dataset. So, if we want to transform <strong>MASS</strong>, then the following...
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;">x</span> <- <span style="color: #0000ff;">yourdata</span>$<span style="color: #0000ff;">RESPONSE</span></strong></span>
Becomes...
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;">x</span> <- <span style="color: #0000ff;">agilis</span>$<span style="color: #0000ff;">MASS</span></strong></span>
Now, try transforming MASS and boxplotting the results and/or building a model using the new transformed variable. Are you able to fix the distribution so that there is no wedge in the Residuals vs Fitted and the qq plot points fall along the line? Note that there is no need to try every single transformation provided below. Just select a couple of them and give it a go.
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">INVR</span> <- 1/<span style="color: #0000ff;">x</span><span style="color: #008000;"> # Reciprocal transformation</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">NEGINV</span> <- -1/<span style="color: #0000ff;">x</span><span style="color: #008000;"> # Negative reciprocal transformation</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">SQUARED</span> <- <span style="color: #0000ff;">x</span>^2<span style="color: #008000;"> # Power transformation</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">CUBERT</span> <- <span style="color: #0000ff;">x</span>^(1/3)<span style="color: #008000;"> # Cube root transformation</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">SQRT</span> <- <span style="color: #ff6600;">sqrt</span>(<span style="color: #0000ff;">x</span>)<span style="color: #008000;"> # Square root transformation</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">LOG</span> <- <span style="color: #ff6600;">log</span>(<span style="color: #0000ff;">x</span>)<span style="color: #008000;"> # Log base e. Cannot be applied to zero</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">LOG10</span> <- <span style="color: #ff6600;">log10</span>(<span style="color: #0000ff;">x</span>)<span style="color: #008000;"> # Log base 10. Cannot be applied to zero</span></strong></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">ASQRT</span> <- <span style="color: #ff6600;">asin</span>(<span style="color: #ff6600;">sqrt</span>(<span style="color: #0000ff;">x</span>)) <span style="color: #008000;"># Arcsine square root transformation. </span></strong><span style="color: #008000;"><strong>Only useful for percentages (0.01-0.99)</strong></span></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">LOGIT</span> <- <span style="color: #ff6600;">log</span>(<span style="color: #0000ff;">x</span>/(1-<span style="color: #0000ff;">x</span>))<span style="color: #008000;"> # Logistic transformation. </span></strong><span style="color: #008000;"><strong>Only useful for percentages (0.01-0.99)</strong></span></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">FOLD1</span> <- <span style="color: #ff6600;">sqrt</span>(<span style="color: #0000ff;">x</span>)/<span style="color: #ff6600;">sqrt</span>(1-<span style="color: #0000ff;">x</span>)<span style="color: #008000;"> # Square root folded transformation.</span></strong><span style="color: #008000;"><strong> Only useful for percentages (0.01-0.99)</strong></span></span>
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>yourdata</strong></span><strong>$<span style="color: #0000ff;">FOLD2</span> <- <span style="color: #ff6600;">log</span>(<span style="color: #0000ff;">x</span>)/<span style="color: #ff6600;">log</span>(1-<span style="color: #0000ff;">x</span>)<span style="color: #008000;"> # Log folded transformation. </span></strong><span style="color: #008000;"><strong>Only useful for percentages (0.01-0.99)</strong></span></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;">yourdata</span>$<span style="color: #0000ff;">FT</span> <- <span style="color: #ff6600;">sqrt</span>(<span style="color: #0000ff;">x</span>) + <span style="color: #ff6600;">sqrt</span>(<span style="color: #0000ff;">x</span>+1)<span style="color: #008000;"> # Freeman-Tukey transformation</span></strong></span>
To check if a transformation has improved the fit of the data, try running a boxplot and/or diagnostic plots. Because you might be running through a lot of transformations, often you'll just check boxplots to start off with, and then check the diagnostic plots only if the boxplot looks promising (i.e. if the boxes look relative equal in height and symmetrical above and below the thick median line). Here is an example using <strong>INVR</strong>, produced from an inverse transformation.
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>1</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>1</strong></span><strong>))</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"> <span style="color: #008000;"># reset plotting window to 1x1 array (to show one figure at a time)</span></span></strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">INVR</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong>
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>2</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>2</strong></span><strong>))</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># set plotting window to 2x2 array</span></span></strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">INVR</span>~<span style="color: #0000ff;">MF</span>, data=<span style="color: #0000ff;">agilis</span>)</span></strong> <strong><span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><span style="color: #008000;"># create an object called 'model.aov' (it could be called anything you like), and drop your ANOVA model into it.</span></span></span></strong>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span> <span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># plot the model we created above</span></span></strong></span>
<span class="header1">Reporting Transformations</span>
If you need to transform the response data to meet assumptions, how is this reported?
* In the methods, you would state something like 'Mass (g) was log transformed to meet the assumptions of an ANOVA' (or whatever transformation was used).
* Report the P values, test statistics and degrees of freedom for the test that uses the transformed data.
* Summaries of the data (e.g. means and standard errors) are reported for the <strong>original, non-transformed data</strong>.
* Graphs and figures use the <strong>orignal, non-transformed data</strong>.
* If you find that your graph using the original data is very hard to read, you can show both the original data and the transformed data in a two part figure (i.e. Fig 1a, b). In which case, in the caption you would state something like 'Transformed data shown for clarity and comparison'.
[[Onward|R19]]
[[<|R17]]
[[<<|start]]<span class="header1">Investigating interaction terms</span>
Although there is no assumption per se that relates to interaction terms, we need to check them if we have two or more predictor variables. We are moving into some deeper steps than with the Past tutorial. Here's where we are up to so far:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/sequence_05.png" alt="past" width="60%" height="auto"/>
An interaction term is a predictor term that informs you whether two (or more) main effects (the predictors on their own), may have a complex relationship. If there is a complex interaction, then the P values for the main effects could be meaningless. For example, maybe males have a higher mass than females, but only in winter. Or maybe Hb is higher in females, but only in fragmented forest. In both cases it would be incorrect to state that males have higher mass than females (for most of the year they don't), or that females have higher Hb than males (this is only true for one type of habitat).
<span class="header2">Here's a brief diagramming out of ways in which interaction terms can be problematic:</span>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_01.png" alt="interactions" width="70%" height="auto"/>
<strong>The two sexes are no different in mass and there is no difference between sites A and B</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_02.png" alt="interactions" width="70%" height="auto"/>
<strong>The two sexes are no different in mass but there is a difference between sites A and B</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_03.png" alt="interactions" width="70%" height="auto"/>
<strong>The two sexes are different in mass, and there is a difference between sites A and B, but males and females are responding differently</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_04.png" alt="interactions" width="70%" height="auto"/>
<strong>Note how the following examples have similar statistical results but the relationships are markedly different...</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_05.png" alt="interactions" width="70%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_06.png" alt="interactions" width="70%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_07.png" alt="interactions" width="70%" height="auto"/>
<strong>You can even get instances where there don't appear to be any main effects, but they are being hidden by an interaction...</strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_08.png" alt="interactions" width="70%" height="auto"/>
<span class="header2">How do we check interaction terms?</span>
Checking interaction terms is laborious, but unfortunately necessary, especially if you want to publish your findings.
<strong>When you run any sort of linear model:</strong>
# Start by including all interaction terms
# Remove the interaction term that has the highest P value
# Run the model again… check interactions
# Remove another interaction term with a high P value
# Keep removing interaction terms until all you are left with are main effects and significant interactions
<strong>Why do we remove non-significant interactions from the model?</strong>
* Non-significant interaction terms interfere with the over-all predictive power of the model.
* By including them you are in effect telling the model to take into account an interaction and consider it important when it may not be
* When you remove non-significant terms, you will find the other P-values will change (slightly)
* But actually, it's usually no terrible problem leaving them in, and some researchers prefer to leave interactions in a model to show that they remembered to check them. Also, removing interaction terms can look like chasing significance (as P values will tend to decrease), but if significance depends on removing an interaction term, you need to seriously think about what that might mean in terms of overall effect. Relevant effect sizes will be an important next step.
<strong>Do we ever retain non-significant interaction terms in a linear model?</strong>
* Non-significant interaction terms are usually retained if they were part of your hypothesis.
** For example, if your hypothesis specifically states Male guppy preference for larger females depends on the temperature of the water then the interaction term FEMALE.SIZE:WATER.TEMPERATURE is part of your experimental design and you would retain it in the model even if it is non-significant.
* Sometimes non-significant interaction terms are retained when higher order terms including the same variables are significant. So, if you find that the two-way interaction term AGE:SEX is not significant but the three-way interaction term AGE:SEX:HEALTH is significant then there is an argument for retaining the non-significant lower order interactions.
* When sample size is large (n = thousands of samples) the retention or removal of interaction terms makes less difference to the main effects and some researchers chose to leave the non-significant terms in the model. This last point is largely one of personal preference.
Here is the process in a diagrammatic form:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_checking.png" alt="interactions" width="70%" height="auto"/>
<span class="header2">Examining a full model with interactions</span>
Let's try building the full model we were interested in earlier. Here is the basic equation and the R code...
* Hb ~ Intercept + MF * Month * EXO * MASS + Error
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF<span style="color: #000000;">*</span>Month<span style="color: #000000;">*</span>EXO<span style="color: #000000;">*</span>MASS</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># note that we are using Month and not MONTH. R is case sensitive.</span></span></span></strong>
You should check the diagnostic plots at this stage too...
<span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><strong>par</strong></span><strong>(mfrow = </strong><span style="color: #ff6600;"><strong>c</strong></span><strong>(</strong><span style="color: #0000ff;"><strong>2</strong></span><strong>, </strong><span style="color: #0000ff;"><strong>2</strong></span><strong>))</strong></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span>
You should see this plot:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/diagnostics_full_model.png" alt="interactions" width="70%" height="auto"/>
<strong>Based on an examination of the residuals in the Residuals vs Fitted and Scale-Location?</strong>
<<radiobutton "$q36" "a">> (a) The variance of residuals is reasonably equal (no wedge in the data cloud)
<<radiobutton "$q36" "b">> (b) The variance of residuals is not equal (clear wedge in the data cloud)
<strong>Based on an examination of the red line in the Residuals vs Fitted and Scale-Location?</strong>
<<radiobutton "$q37" "a">> (a) The relationship is probably linear (both lines are straight)
<<radiobutton "$q37" "b">> (b) The relationship might not be linear (one of the lines has a curve or bend in it)
<strong>Based on an examination of the qq plot?</strong>
<<radiobutton "$q38" "a">> (a) The residuals are generally normally distributed (they more or less follow the line)
<<radiobutton "$q38" "b">> (b) The residuals are clearly not normally distributed (they wander off the line quite badly)
<strong>Based on an examination of the Residuals vs Leverage plot?</strong>
<<radiobutton "$q39" "a">> (a) There are no statistical outliers
<<radiobutton "$q39" "b">> (b) There is at least one, and maybe a few statistical outliers
[[Check your answers|R20]]
[[<|R18]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
a
b
a
b
</span>
<</if>><span class="header2">Answers</span>
<strong>Based on an examination of the residuals in the Residuals vs Fitted and Scale-Location?</strong>
<<if $q36 is "a">>Correct! I don't think there are any clear 'wedge' or 'arrow-head' shapes in the cloud of residuals.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R19]]
<</if>>
<strong>Based on an examination of the red line in the Residuals vs Fitted and Scale-Location?</strong>
<<if $q37 is "b">>Correct! The Scale-Location red line is a bit bent... but if you remember, the Residuals vs Fitted is considered the better indicator of linearity. Given that the Residuals vs Fitted is showing a straight red line, I think it is okay to accept that the relationship is probably linear.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R19]]
<</if>>
<strong>Based on an examination of the qq plot?</strong>
<<if $q38 is "a">>Correct! The white circles are falling very close to the line. That's actually quite a good fit.
<<else>>This is a little subjective, but that isn't the answer I would pick. Maybe [[try again|R19]]
<</if>>
<strong>Based on an examination of the Residuals vs Leverage plot?</strong>
<<if $q39 is "b">>Correct! We definitely have at least one statistical outlier (the datapoint from row 159), but we have no reason to think this isn't a genuine biological result. It's better to just leave it in the analysis.
<<else>>That isn't the answer I would pick. Have a look at how there are two red dashed lines on the righthand side of the figure. Are there any dots to the right of either dashed line? [[Try again|R19]]
<</if>>
<<if $q36 eq "a" and $q37 eq "b" and $q38 eq "a" and $q39 eq "b">>Excellent, $name! [[Proceed|R21]]
<<else>>
Something doesn't look quite right there. Some of these are a bit subjective, but perhaps go back and just look at the plots again.
<<endif>>
[[<|R19]]
[[<<|start]]<span class="header2">Check for interactions</span>
Use the following code on the model we created to see a summary of the results for the full model.
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">summary</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># obtain a summary of the ANOVA results</span></span></strong></span>
You should see results that look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/full_model_results.png" alt="interactions" width="50%" height="auto"/>
Here is the process in a diagrammatic form:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/interaction_checking.png" alt="interactions" width="70%" height="auto"/>
<strong> When interpreting, keep in mind the following:</strong>
* P < 0.05 is 'significant'
* P < 0.10 is often called 'near significant'. However, we don't typically report 'near significant' as a finding. That said, it can be important to pay attention to 'near significant' interactions when considering interactions because a near significant interaction might still have implications for interpreting main effects. It is a little counter-intuitive, but it is actually more statistically rigorous to be concerned about any interaction term <0.1, because otherwise we are biasing ourselves towards reporting a simple main effect. If there is an interaction term, it could be that there is no simple main effect, and our hypothesis might not be supported, even if P<0.05 for the term in question.
<strong>Look at the results above and/or in your RStudio window. Are any of the interaction terms significant or near significant?</strong>
<<radiobutton "$q40" "a">> (a) No. There are no significant results at all.
<<radiobutton "$q40" "b">> (b) Yes. Month and EXO are interacting with Hb.
<<radiobutton "$q40" "c">> (c) Yes. MF:Month:MASS and MF:Month:EXO:MASS are both significant.
<<radiobutton "$q40" "d">> (d) Yes. MF:Month:MASS and MF:Month:EXO:MASS are both near significant.
[[Check your answer|R22]]
[[<|R20]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
d
</span>
<</if>><span class="header2">Answer</span>
<strong>Are any of the interaction terms significant or near significant?</strong>
<<if $q40 is "a">>I don't think that is the answer I would give. The asterisks (stars) next to the results indicate significance and a full stop indicates near significance. Maybe check the results [[again|R21]].
<</if>>
<<if $q40 is "b">>This is a common error, and it is something you might lose marks for in a report. We use the term 'interaction' only when two or more predictors are interacting <strong>together</strong> on the response. 'Month' and 'EXO' are not interactions, or interaction terms. They are what is called a <strong>main effect</strong>. The interaction terms include the : symbol. So, that MF:Month (for example) is an interaction term. Go back to the previous page and [[have another go|R21]].
<</if>>
<<if $q40 is "c">>Almost there. Just keep in mind that significant P values are <0.05. Are the P values <0.05 or <0.10? [[try again|R21]].
<</if>>
<<if $q40 is "d">>Correct! MF:Month:MASS and MF:Month:EXO:MASS are both <0.1 but >0.05. This makes them 'near significant' interaction terms. You can now [[proceed to the next page|R23]].
<</if>>
[[<|R21]]
[[<<|start]]<span class="header2">Removing interaction terms</span>
To remove an interaction term (or set of terms) change the * (asterisks) to + (pluses), like this...
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF<span style="color: #000000;">+</span>Month<span style="color: #000000;">+</span>EXO<span style="color: #000000;">+</span>MASS</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># change * to + to remove interaction terms.</span></span></span></strong>
If you ask R to show you a summary of this model, you will see that all the interaction terms have gone.
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">summary</span>(<span style="color: #0000ff;">model.aov</span>) </strong></span><span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># obtain a summary of the ANOVA results</span></span></strong></span>
But, we should actually add back those two terms that were 'near significant', just to check that they don't become significant in the simplified model. We would be aiming to keep any interaction terms that are either significant or near significant, but then remove them if the significance drops away. There were two terms that showed possible indications of complex interactions (i.e., P<0.1). These were:
* MF:Month:MASS
* MF:Month:EXO:MASS
To include them in the model, we add them onto the end using the + symbol. Like this...
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF<span style="color: #000000;">+</span>Month<span style="color: #000000;">+</span>EXO<span style="color: #000000;">+</span>MASS<span style="color: #000000;">+</span>MF:Month:MASS<span style="color: #000000;">+</span>MF:Month:EXO:MASS</span>, data=<span style="color: #0000ff;">agilis</span>) </span></strong>
Now, call a summary. Note how the interaction terms we specified have been placed back into the model.
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">summary</span>(<span style="color: #0000ff;">model.aov</span>) </strong></span><span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># obtain a summary of the ANOVA results</span></span></strong></span>
You should see a result that looks like this.
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/model_results_reduced_1.png" alt="past" width="70%" height="auto"/>
<strong>What is the P value for MF:Month:MASS? Write this to 3 decimal places, just as you would in a formal report.</strong>
<<textbox "$q41" "">>
<strong>What is the P value for MF:Month:EXO:MASS? Write this to 3 decimal places, just as you would in a formal report.</strong>
<<textbox "$q42" "">>
[[Check answers|R24]]
[[<|R22]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
0.564
0.808</span>
<</if>><span class="header2">Answers</span>
<strong>What is the P value for MF:Month:EXO?</strong>
<<if $q41 is "0.564">>Correct!
<<else>>Remember to include the 0 at the start (which we do in biology), and remember to write this to 3 decimal places and round appropriately (if needed). [[Try again|R23]]
<</if>>
<strong>What is the P value for MF:Month:EXO:MASS?</strong>
<<if $q42 is "0.808">>Correct!
<<else>>Remember to include the 0 at the start (which we do in biology), and remember to write this to 3 decimal places and <strong>round appropriately (required) </strong>. [[Try again|R23]]
<</if>>
<<if $q41 eq "0.564" and $q42 eq "0.808">>Fantastic work, $name! Because both of these interaction terms have dropped out of significance, it is safe to remove them entirely. Re-run this code to return the model to one in which there are no interaction terms:
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;"><span style="color: #0000ff;">model.aov </span><span style="color: #000000;"><- </span>aov</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">MF<span style="color: #000000;">+</span>Month<span style="color: #000000;">+</span>EXO<span style="color: #000000;">+</span>MASS</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># change * to + to remove interaction terms.</span></span></span></strong>
Then obtain a summary again:
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">summary</span>(<span style="color: #0000ff;">model.aov</span>) </strong></span><span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># obtain a summary of the ANOVA results</span></span></strong></span>
[[Proceed|R25]]
<<else>>
Something doesn't look quite right there. You'll need to go back and [[try agian|R23]].
<<endif>>
[[<|R23]]
[[<<|start]]<span class="header2">The Reduced Model</span>
Here is what your output should look like now (with all the interaction terms removed).
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/model_results_reduced_2.png" alt="reduced model" width="70%" height="auto"/>
We have two significant results, EXO and Month. However, at this stage, we don't know in what way they are significant. As a first step, we need to look at a plot of the relationships. You might not use these graphs in a formal report, but they are fine for just looking at relationships and working out what is going on.
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">plot</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">EXO</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># plot the relationship of Hb and exoparasite load</span></span></span></strong>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">abline</span>(<span style="color: #ff6600;">lm<span style="color: #000000;">(</span><span style="color: #0000ff;">Hb</span><span style="color: #000000;">~</span><span style="color: #0000ff;">EXO</span><span style="color: #000000;">,</span> <span style="color: #000000;">data=</span><span style="color: #0000ff;">agilis</span><span style="color: #000000;">),col="<span style="color: #0000ff;">red</span>") <span style="color: #0000ff;"><span style="color: #008000;"># add a line of best fit based on the linear model (lm) relationship of Hb and EXO. Make the line red. abline can just be read as 'line from a to b'.</span></span></span></span></span></strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Hb_EXO.png" alt="RStudio" width="70%" height="auto"/>
From this we can see that as exoparasite load increases, haemoglobin concentration in the blood also increases. This is counterintuitive. We would expect animals with more parasites to have less haemoglobin, but maybe these heavily infested individuals are compensating for parasites by releasing more red blood cells from their bone marrow? At any rate, this provides a basis for writing a discussion.
Now, let's look at the effect of month...
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">Month</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># plot the relationship of Hb and month</span></span></span></strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Hb_Month.png" alt="RStudio" width="70%" height="auto"/>
Two things...
* Note that the months have been laid out alphabetically. That isn't useful for presenting in a figure.
* Although we know that <strong>somehow</strong> month is having an effect on haemoglobin, we don't know how. This is a categorical variable for three or more levels. So, we know that it is significant, but we don't know how it is significant. Maybe Hb is different in April to March, or maybe Hb is different in August from all the other months. At this stage we simply can't tell.
First off, let's fix the order of the months...
<span style="font-family: 'Courier New', Courier;"><span style="color: #0000ff;"><strong>agilis</strong></span><strong>$<span style="color: #0000ff;">Month</span><-<span style="color: #ff6600;">factor</span>(<span style="color: #0000ff;">agilis</span>$<span style="color: #0000ff;">Month</span>, <span style="color: #ff6600;">c</span>("<span style="color: #0000ff;">Mar</span>", "<span style="color: #0000ff;">Apr</span>", "<span style="color: #0000ff;">May</span>", "<span style="color: #0000ff;">Jun</span>", "<span style="color: #0000ff;">Jul</span>", "<span style="color: #0000ff;">Aug</span>"))</strong></span>
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">boxplot</span>(<span style="color: #0000ff;">Hb</span>~<span style="color: #0000ff;">Month</span>, data=<span style="color: #0000ff;">agilis</span>) <span style="color: #0000ff;"><span style="color: #008000;"># plot the relationship of Hb and month</span></span></span></strong>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Hb_Month_reordered.png" alt="RStudio" width="70%" height="auto"/>
But, we still can't interpret this. We need to move to our final step before writing up. This is the use of a Tukey's (or other post-hoc) test when we have:
* A significant categorical predictor
* That has 3 or more levels
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/sequence_06.png" alt="RStudio" width="70%" height="auto"/>
[[Let's apply a Tukey's test|R26]]
[[<|R24]]
[[<<|start]]
<<set $q43 to "x">> <<set $q44 to "x">> <<set $q45 to "x">> <<set $q46 to "x">> <<set $q47 to "x">> <<set $q48 to "x">> <<set $q49 to "x">> <<set $q50 to "x">> <<set $q51 to "x">> <<set $q52 to "x">> <<set $q53 to "x">> <<set $q54 to "x">> <<set $q55 to "x">> <<set $q56 to "x">> <<set $q57 to "x">><span class="header2">Tukey's Test</span>
Luckily for us, applying a Tukey's test in R is really easy. You just apply the function <strong>TukeyHSD</strong> (Tukey honest significant difference) to the model we are working with. We also need to state which categorical variable we want to see a result for:
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">TukeyHSD</span>(<span style="color: #0000ff;">model.aov<span style="color: #000000;">, "</span>Month<span style="color: #000000;">"</span></span>) </strong></span><span style="font-family: 'Courier New', Courier;"><strong><span style="color: #0000ff;"><span style="color: #008000;"># obtain a summary of the ANOVA results for Hb as a function of Month</span></span></strong></span>
You should get this output:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/tukey_test.png" alt="RStudio" width="70%" height="auto"/>
This looks confusing, but it isn't as complicated as it looks.
* The first column can be read as one level minus another level. So, Aug-Apr is 'what is the mean of August Hb minus the mean of April Hb?'
* The next column, diff, is the difference between the means. The mean of August minue the mean of April is -4.74 g/dL Hb.
* lwr is the lower 95% confidence interval for the difference. Within a 95% level of confidence this is as low as the difference might be.
* upr is the upper 95% confidence interval for the difference. Within a 95% level of confidence this is as high as the difference might be.
* In the case of Aug-Apr, the confidence interval crosses zero... that is, the difference could between -16.0 and 6.6 g/dL.
* Because the confidence interval crosses zero, we cannot be sure that the difference is not zero.
* therefore, there is no significant difference.
* therefore, P > 0.05. In this case P = 0.833.
<strong>Which contrasts are significantly different?</strong>
<<radiobutton "$q43" "y">> Aug-Apr
<<radiobutton "$q44" "y">> Jul-Apr
<<radiobutton "$q45" "y">> Jun-Apr
<<radiobutton "$q46" "y">> Mar-Apr
<<radiobutton "$q47" "y">> May-Apr
<<radiobutton "$q48" "y">> Jul-Aug
<<radiobutton "$q49" "y">> Jun-Aug
<<radiobutton "$q50" "y">> Mar-Aug
<<radiobutton "$q51" "y">> May-Aug
<<radiobutton "$q52" "y">> Jun-Jul
<<radiobutton "$q53" "y">> Mar-Jul
<<radiobutton "$q54" "y">> May-Jul
<<radiobutton "$q55" "y">> Mar-Jun
<<radiobutton "$q56" "y">> May-Jun
<<radiobutton "$q57" "y">> May-Mar
[[Check answers|R27]]
[[<|R25]]
[[<<|start]]
<<if $name is "cheat">>
<span class="cheat">
Jun-Apr
Jun-Jul
May-Jun
</span>
<</if>><span class="header2">Tukey's Test Answers</span>
<<if $q43 eq "x" and $q44 eq "x" and $q45 eq "y" and $q46 eq "x" and $q47 eq "x" and $q48 eq "x" and $q49 eq "x" and $q50 eq "x" and $q51 eq "x" and $q52 eq "y" and $q53 eq "x" and $q54 eq "x" and $q55 eq "x" and $q56 eq "y" and $q57 eq "x">>Fantastic work, $name! You're so good at this! There are significant differences for Jun-Apr, Jun-Jul and May-Jun.
Looking at this, it seems like June is the standout, and we can frame reporting around this month. A results paragraph like this would be suitable.
<strong>Tukey's post-hoc comparisons revealed that animals sampled in June had significantly lower Hb than in Apr (mean difference = 11.5 g/dL), May (mean difference = 9.6 g/dL) and July (mean difference = 14.3 g/dL).</strong>
June is approximately the coldest month in the study area (keeping in mind, this is a southern hemisphere dataset, so June is in the middle of winter). Plausibly, there is a genuine underlying biological relationship here, which could be explored in a discussion.
[[Proceed|R28]]
<<else>>
Something doesn't look quite right there. Perhaps [[go back|R26a]] and try again.
<<endif>>
[[<|R26a]]
[[<<|start]]<span class="header1">Writing up for a formal report</span>
ANOVA results are elaborate enough that they are typically (although not always) presented as a table. You can present individual results in a results paragraph like this:
<strong> There was no effect of sex on agile antechinus haemoglobin levels (F = 0.80, df = 1, P = 0.371).</strong>
...but if there are a lot of results to work through, a table is probably better.
<span class="header2">Writing up for a formal report</span>
We also want to obtain and report an effect size wherever possible. Our handy <strong>lsr</strong> library has a couple good effect size options. Eta squared is generally well liked as a measure of the percentage of variance explained for ANOVA variables, so we will give that a go.
<strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">install.packages</span>("<span style="color: #800080;">lsr</span>") <span style="color: #008000;"># installs a library</span></span></strong><br /><strong><span style="font-family: 'Courier New', Courier;"><span style="color: #ff6600;">library</span>("<span style="color: #800080;">lsr</span>") <span style="color: #008000;"># loads a library</span></span></strong>
<span style="font-family: 'Courier New', Courier;"><strong><span style="color: #ff6600;">etaSquared</span>(<span style="color: #0000ff;">model.aov</span>)</strong></span>
You should obtain a result that looks like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/etaSquared.png" alt="RStudio" width="35%" height="auto"/>
In terms of which column to report, the eta squared (left) takes into account the variance explained by the other variables. The partial eta squared (right) is the variance explained as if each variable were alone in the model. The eta squared (right) is the more rigorous set of values to report. Note that these are given as proportions. If presenting these as percentages in a results paragraph you would rewrite them like so:
<strong>Eta squared effect sizes indicated that sex explained 0.4% of variance in Hb, whereas month explained 13.1%, exoparasite load explained 1.7% of the variance, and body mass explained just 0.1% of the variance in Hb.</strong>
At this stage it should be fairly obvious why effect sizes are useful. Although both Month and Exoparasite load both had significant effects on Hb, Month explained 13.1% of the variance, whilst in contrast exoparasite load explained only 1.7% of the variance. So, they are both significant, but month was by far the more important of the two variables for explaining Hb in antechinus.
As a final step, [[let's look at an example of how this might be written up|R29]].
[[<|R27]]
[[<<|start]]When checking for significance, just remember that any P value less than 0.05 is 'significant'.
[[Return to Question|R26]]
<<set $q43 to "x">> <<set $q44 to "x">> <<set $q45 to "x">> <<set $q46 to "x">> <<set $q47 to "x">> <<set $q48 to "x">> <<set $q49 to "x">> <<set $q50 to "x">> <<set $q51 to "x">> <<set $q52 to "x">> <<set $q53 to "x">> <<set $q54 to "x">> <<set $q55 to "x">> <<set $q56 to "x">> <<set $q57 to "x">><span class="header1">Writing up for a formal report</span>
Here is an example of how you might write up what we have produced. Note that this is not the only way to write up ANOVA results, but it is one of the more common approaches.
<strong>
RESULTS
An ANOVA revealed a significant positive effect of exoparasite load on haemoglobin (g/dL) in peripheral circulating blood (fig 1; table 1), and a significant effect of month (conducted from March to August) on haemoglobin (g/dL) (fig 2; table 1). A Tukey's post-hoc test indicated that mean haemoglobin (g/dL) was different in June and May (P = 0.005), June and April (P = 0.007) and June and July (P < 0.001), with individuals in June always having a lower mean haemoglobin concentration than in the other three months. No other contrasts were significant. Based on an eta squared, exoparasite load explained 1.7% of variance in haemoglobin levels whereas month explained 13.1% of variance in haemoglobin. All other predictors explained <1% of total variance.
Table 1. Results on an ANOVA investigating the relationships between haemoglobin (g/dL) and the predictors Sex (M/F), Month (conducted from March to August), indexed exoparasite load and mass of individuals (g).
<img src="http://cpjohnstone.com/wp-content/uploads/2019/02/final_table.png" alt="RStudio" width="75%" height="auto"/>
<img src="http://cpjohnstone.com/wp-content/uploads/2019/01/Hb_EXO.png" alt="RStudio" width="50%" height="auto"/>
Fig 1. Haemoglobin (g/dL) as a function of exoparasite load. Exoparasite load was indexed as a count of all permanently attached arthopod expoparasites attached to the left and right ears of an agile antechinus.
<img src="http://cpjohnstone.com/wp-content/uploads/2019/02/Hb_Month_reordered_sig.png" alt="RStudio" width="50%" height="auto"/>
Fig 2. Haemoglobin (g/dL) as a function of month from March to August. Boxplots show median, upper and lower quartiles and ranges. Pairwise significance was established using a Tukey's post-hoc test and is indicated using letters where the months that were not significantly different share the same letter.
</strong>
If you haven't already done the past tutorial, you can [[do so now|P01]]
Otherwise, that's us for today. Thanks so much for running through these tutorials with me. I hope they have been helpful.
[[<|R28]]
[[<<|start]]