Chi-Squared


Welcome
In today's workshop you will learn how to run a chi-squared test and make a mosaic plot. A chi-squared test is used for count data arranged by categories. The result tells you whether any given count is above or below what is expected, given the overall pattern in the set of counts. By the end of this tutorial you should know how to run a chi-squared test in Past and RStudio.
Before we start, what is your name? <<textbox "$name" "">>
Now, choose the program you want to start with. Do you want to begin with [[Past|PAST]] or [[RStudio]]?
----
Table of Contents
The Table of Contents below is intended to help you navigate back to a point in the tutorial if you need to stop half-way through. You can click the 'back to start' ('<<') link at the bottom of each page to return to this page.
PAST
[[Past 1: Intro|PAST]]
[[Past 2: Loading a Dataset|arrington]]
[[Past 3: Rearranging your data|past5]]
[[Past 4: Chi-squared test|past6]]
[[Past 5: Output & Assumption Checking|past8]]
[[Past 6: End|past12]]
RStudio
[[RStudio 1: Intro|RStudio]]
[[RStudio 2: More Intro|R1a]]
[[RStudio 3: Colour coding|R2]]
[[RStudio 4: Three ways of coding|R4]]
[[RStudio 5: Interpreting the output|R5]]
[[RStudio 6: Checking assumptions|R7]]
[[RStudio 7: Monte Carlo Simulation|R9]]
[[RStudio 8: Contingency Tables|R10]]
[[RStudio 9: Running the test|R12]]
[[RStudio 10: Mosaic Plots|R13]]
[[RStudio 11: Wildebeest: Chi-squared and Mosaic Plots|R15]]
[[RStudio 12: Final plots & writing up...|R16]]

<img src="http://cpjohnstone.com/wp-content/uploads/2018/08/past_img.png" alt="past" width="10%" height="auto"/>
Past
Hi $name. Great. Let's get started with Past.
Past is a free point-and-click statistical program loosely similar to SPSS. It is an excellent option for most statistical analyses and graph-making, although it does not cover every type of analysis. Past was designed for analysing paleontological data but, conveniently for us, this means it is easily applied to biological sciences data too.
There is an excellent and thorough PDF guide to using Past available from the Past team. If you plan to use Past, you should <a href="https://folk.uio.no/ohammer/past/past3manual.pdf" target="_blank">download a copy</a>.
Once you have a copy of the manual, head to the <a href="https://folk.uio.no/ohammer/past/" target="_blank">Past website</a> and download the program to your computer. If you are working on a lab computer, you may already have a copy.
Once you have downloaded the manual and installed Past, you can download a data file. This file is in comma-delimited format (csv). It is just a text file with commas separating the values. Comma-delimited files are not proprietary, which means that no one owns the format, and a csv file is very stable (i.e. the requirements to open a csv file are very unlikely to change in the future).
We're going to start with a dataset called <a href="http://cpjohnstone.com/wp-content/uploads/2018/08/arrington.csv.zip" target="_blank">Arrington</a>. Once you have downloaded it, you can move on to the [[next page|arrington]].
[[<|start]]

<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_logo.png" alt="past" width="10%" height="auto"/> <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R_studio_logo.png" alt="past" width="10%" height="auto"/>
Hi $name. Great choice. Once you've worked through R and RStudio you can move on to the Past tutorial.
What is R?
R is a free, open source programming language used for statistical analysis and graphing. Along with Python, it is one of the default professional tools used by data scientists. R has a bit of a learning curve, but don't worry, you can do it! And once you learn the basics, the syntax is largely the same or similar across many different tasks. It's not as hard as it seems at first.
What is RStudio?
RStudio is a wrapper for the free statistical programming language R. RStudio helps make R a more user-friendly experience. You need to <a href="https://www.r-project.org/" target="_blank">download and install R first</a>, and then <a href="https://www.rstudio.com/products/rstudio/download/" target="_blank">download and install RStudio</a>.
Once you have downloaded and installed R and RStudio you can [[proceed|R1a]]
[[<|start]]

http://www.adamhammond.com/twineguide/

Loading a Dataset
Open the Arrington zip file that you downloaded by double-clicking it. A file called 'arrington.csv' should appear. You can import this file into Past by selecting 'Past > Open' or you can drag and drop the file onto your Past window. Just make sure that you select:
* Rows contain... Only data cells
* Columns contain... Names, data
* Separator... comma
Here is a short video demonstrating how to open a csv in Past.
<iframe width="560" height="315" src="https://www.youtube.com/embed/MtDI7sElB3M?rel=0&showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
----
Breaking Down Arrington
Now look at the Arrington dataset. This dataset was collected by researchers who were interested in whether a species of fish shifted its predominant diet as individuals mature from juvenile to adult. There are two variables.
* STOMACH: the researchers measured stomach length to age a fish. Fish with a stomach length less than 16.2 cm were deemed to be juveniles. Fish with a stomach length greater than 16.2 cm were deemed to be adults.
* TROPHIC: the researchers examined stomach contents and classified individual fish into detritivores (DET), omnivores (OMN), invertivores (INV) or piscivores (PISC). Detritivores were mostly eating detritus. Omnivores had mixed stomach contents. Invertivores were mostly eating invertebrates. Piscivores were primarily eating other fish.
The way the data has been entered is convenient for data entry but is not suitable for analysis. We need to change the layout to something that can be analysed. As a first step, count up how many fish of each type there were. That is:
# How many juveniles were detritivores?
# How many juveniles were omnivores?
# How many juveniles were invertivores?
# How many juveniles were piscivores?
# How many adults were detritivores?
# How many adults were omnivores?
# How many adults were invertivores?
# How many adults were piscivores?
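If you would rather let software do the tallying, the eight counts listed above can be sketched in R with a cross-tabulation. This is only an illustration under stated assumptions: the six rows below are made up (they are not the Arrington data), and the column name STAGE, holding 'juvenile' or 'adult', is a hypothetical recoding of STOMACH that you would have to create yourself.

```r
# Hypothetical mini-dataset (NOT the real Arrington data).
# STAGE is an assumed column: STOMACH recoded as juvenile/adult.
fish <- data.frame(
  STAGE   = c("juvenile", "juvenile", "adult", "adult", "adult", "juvenile"),
  TROPHIC = c("INV", "INV", "PISC", "PISC", "OMN", "DET")
)

# Cross-tabulate: one count for each combination of diet and life stage
counts <- table(fish$TROPHIC, fish$STAGE)
print(counts)
```

Each cell of the printed table answers one of the 'How many...?' questions for the toy data. Your own counts from arrington.csv will be different.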
Note that you can just open the csv in Excel and count the categories of fish there, if you find that easier. Let's check a couple of your numbers: How many juveniles were invertivores? <<textbox "$q1" "">> [[Check your answer|test2]] <<if $name is "cheat">> 58 <</if>> [[<|PAST]] [[<<|start]]

Your answer was $q1 <<if $q1 is "58">>Correct! [[proceed|test3]] <<else>>Hmm, that doesn't look right. Maybe [[try again|arrington]] <</if>> [[<|arrington]] [[<<|start]]

How many adults were piscivores? <<textbox "$q2" "">> [[Check your answer|test4]] <<if $name is "cheat">> 34 <</if>> [[<|test2]] [[<<|start]]

Your answer was $q2 <<if $q2 is "34">>Correct! [[proceed|past5]] <<else>>Hmm, that doesn't look right. Maybe [[try again|test3]] <</if>> [[<|test3]] [[<<|start]]

Rearranging your data We need to rearrange the Arrington data into a form that Past can recognise as data suitable for a chi-squared test. You've already done the hard part by counting up the categories. We now need to arrange the data into a 'contingency table'. This is an array of frequencies of occurrence (counts) arranged by category. # Open a new Excel file # Type in row and column names as per the image below # Fill in the rest of the empty cells. I've filled in the cells for Juvenile Invertivores (n = 58) and adult piscivores (n = 34). <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/cont_table.png" alt="A contingency table" width="50%" height="auto"/> Once you have filled in all the cells, 'save as' a csv file. Call it 'fish.csv'. Now you can import it into Past. Just like we did earlier, you can either 'drag and drop' the file onto the Past window, or you can use 'Past > Open' to navigate to the file and open it. Because we have both column and row names now, you need to select a slightly different set-up when importing your newly arranged data. Make sure to select: * Rows contain... Names, data * Columns contain... Names, data * Separator... comma Once you have imported your new file, you can [[proceed|past6]] to the next page. [[<|test4]] [[<<|start]]

Chi-squared test
Now that we have imported the fish.csv data, we can apply a chi-squared test. Note that Past will attempt to run several tests that can be used for contingency tables under the 'contingency table' option. You will see an error if one of these tests isn't suitable. In the case of our data based on Arrington, 'Fisher's Exact test' is not suitable, so you will see an error message. But don't worry! The other tests will work. The error is simply telling you that Fisher's Exact test isn't suitable for this dataset.
Here is a short video in which I apply the contingency table set of tests to our data. You need to:
# Highlight the cells you want to apply the test to
# Select 'Univariate > Contingency table (chi^2 etc)'
<iframe width="560" height="315" src="https://www.youtube.com/embed/AOtH5QobRzo?rel=0&showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
A chi-squared statistic is a ratio of signal-to-noise. The higher the number, the more likely it is that some of the observed values are departing from the expected values. What are these values? Briefly:
* Observed values: the counts you actually recorded.
* Expected values: the counts you should obtain if there are no patterns in the data other than sampling differences. In a simple two-category example, the expected values are taken to be a 50:50 split of the total. That is, if you count 62 pink flowers and 44 white flowers, 62 and 44 are the observed values, and the expected values are the total split evenly: (62 + 44) / 2 = 53 for each colour, i.e. 53:53 if there were no pattern.
* Residuals: the difference between observed and expected values.
If you want to go into this in more depth, see Video 2 on the <a href="http://cpjohnstone.com/video-tutorials/" target="_blank">video tutorial page</a>, which explains this.
The degrees of freedom can be thought of as a number that tells you how much independent information you have in your test.
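To make the observed, expected and residual values concrete, here is a minimal sketch in R using the pink and white flower counts from the example above. The variable names are my own, and the 'by hand' formula used is the standard chi-squared sum of (observed - expected)^2 / expected; treat it as an illustration rather than a description of what Past does internally.

```r
# Observed counts from the flower example: 62 pink, 44 white
observed <- c(pink = 62, white = 44)

# Under a 50:50 null, each colour is expected to get half of the total
expected <- rep(sum(observed) / length(observed), length(observed))
names(expected) <- names(observed)

# Raw residuals: observed minus expected
residuals_raw <- observed - expected

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum(residuals_raw^2 / expected)

# Degrees of freedom for one row of counts: number of categories minus 1
deg_free <- length(observed) - 1

print(expected)
print(residuals_raw)
print(chi_sq)
print(deg_free)

# Cross-check against R's built-in test (equal proportions by default)
print(chisq.test(observed))
```

The expected values come out as 53 and 53, the raw residuals as +9 and -9, and the hand-calculated statistic should agree with the X-squared value that chisq.test reports.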
The P value represents the probability of obtaining a result at least as extreme as the one you got by chance alone, assuming that the null hypothesis is true.
* Chi-squared (and other test statistics) are typically reported but not interpreted
* Degrees of freedom are typically reported but not interpreted
* P values are reported, and form the basis for your interpretation in the Discussion of a scientific paper.
Now highlight the block of numbers and select 'Univariate > Contingency table (chi^2 etc)' yourself, and answer the following questions...
What is the value of the chi-squared statistic? Write your answer to one decimal place. <<textbox "$q3" "">>
What is the value of the degrees of freedom for the chi-squared test? Don't include any decimal places. <<textbox "$q4" "">>
What is the P value for the chi-squared test? Write your answer to three decimal places. If the P value is less than 0.001, this is written as <0.001. <<textbox "$q5" "">>
[[Check your answer|past7]] <<if $name is "cheat">> 43.8 3 <0.001 <</if>>
[[<|past5]] [[<<|start]]

You answered that the chi-squared value was $q3 <<if $q3 is "43.8">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past6]] <</if>> You answered that the degrees of freedom was $q4 <<if $q4 is "3">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past6]] <</if>> You answered that the P-value for the chi-squared test was $q5 <<if $q5 is "<0.001">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past6]] <</if>> <<if $q3 eq "43.8" and $q4 eq "3" and $q5 eq "<0.001">>Great work! [[Proceed|past8]] <<else>> Something doesn't look quite right there. Be sure to check the decimal places carefully. <<endif>> [[<|past6]] [[<<|start]]

Output & Assumption Checking
Here's an image of what you should obtain from your Past contingency table test.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/chisq_past.png" alt="Past output for our contingency table test of the Arrington data" width="50%" height="auto"/>
You can click the 'copy' button and paste the output as text, too.
Chi squared
Rows, columns: 4, 2
Degrees freedom: 3
Chi2: 43.835
p (no assoc.): 1.6364E-09
Monte Carlo p : 0.0001
Fisher´s exact Not available
Other statistics
Cramer´s V : 0.47052
Contingency C : 0.42574
What does the output mean?
* Rows, columns: This is the number of rows and columns in your contingency table.
* Degrees freedom: This is the degrees of freedom.
* Chi2: This is the chi-squared value. In a manuscript it is often written using the Greek letter chi with an exponent, like this: χ²
* p (no assoc.): This is the P value, given that the null (no association) is true. Note that the (no assoc.) here can be confusing. It looks like the test is telling you that there is no association, but the output is just reminding you that the test is against a null of 'no association'.
* Monte Carlo p : This is a P value derived from a Monte Carlo picking method. It is useful if your assumptions are not met, as the Monte Carlo method allows you to get around one of the assumptions (see below).
* Fisher´s exact: Not available here, but this is another test that can (sometimes) get around the assumptions of a chi-squared test.
* Cramer's V : Technically, this should be written Cramér's V (with an acute accent over the 'e'). This is an 'effect size', and is a way of measuring the strength of association in contingency tables, including those larger than 2x2. Cramér's V ranges from 0 (no association) to 1 (perfect association). It is much like a Pearson's r, except that there cannot be a negative association, so Cramér's V will never be a negative number. We interpret it in the same way we interpret a Pearson's r. If Cramér's V is small (below 0.3-0.4), then even if we do get a significant result, we should treat it with caution because the real biological effect might be quite marginal.
* Contingency C : The contingency coefficient. This is another effect size very similar to Cramér's V. The contingency coefficient also ranges from 0 (no association) to 1 (perfect association), except that because of the way a contingency coefficient is calculated, it may never reach 1 even in a table with perfect associations. For this reason, Cramér's V tends to be preferred (i.e. you wouldn't report both the contingency coefficient and Cramér's V, but rather, just pick one or the other).
Effect Sizes
We always like to report a P value and an effect size. The reason for this is that P values are influenced by both the test statistic and the degrees of freedom, whereas the effect size is simply informative about the strength of the relationship. The reason that Cramér's V or Contingency C is used instead of only reporting the chi-squared value is that Cramér's V and Contingency C have been scaled so that they are easily comparable among different chi-squared tests. Comparing the raw chi-squared values of different tests is not straightforward, and tends not to be done.
Assumptions of a Chi-squared test
Before we interpret the results, we need to check the assumptions of a chi-squared test. You must always check that the assumptions of a given test are met before interpreting the results. If the assumptions are not met, the result might be meaningless. Luckily for us, a chi-squared test only has two assumptions:
# Categories are independent
# No more than 20% of the expected values can be 5 or less
Independence of samples is an assumption of all statistical tests. If you are asked for the assumptions of a test in an exam, and you can't remember them, you might as well write 'independence of samples', as that will always be one requirement.
There are no statistical tests for independence that are widely accepted. It is a matter of good experimental design, and is linked strongly to the methods and considerations used to avoid pseudoreplication.
However, it is easy to check the assumption that no more than 20% of the expected values can be 5 or less. The way the assumption is phrased may seem a bit confusing, but it is literally just a count of how many expected values are 5 or less. So, if your expected values were 4, 10, 14 and 8, then 4 is 5 or less, while 10, 14 and 8 are greater than 5. This means one value out of four was 5 or less, so 25% of the expected values were 5 or less... which means we haven't met the assumption in this case.
Now...
* Navigate back to your contingency table results window
* Select the 'Residuals' tab
* Select the drop-down menu (it will read 'Raw residuals' at the moment) and select 'Expected values' instead.
How many expected values are 5 or less? Don't include any decimal places. <<textbox "$q6" "">>
What is the value of the smallest expected value? Write this to one decimal place. <<textbox "$q7" "">>
What is the value of the largest expected value? Write this to one decimal place. <<textbox "$q8" "">>
[[Check your answer|past9]] <<if $name is "cheat">> 0 6.8 50.5 <</if>>
[[<|past7]] [[<<|start]]
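The '20% rule' check described on this page is just a count followed by a percentage, so it is easy to sketch in R. The example below uses the expected values 4, 10, 14 and 8 from the worked example in the text (not the Arrington expected values); the variable names are my own.

```r
# Expected values from the worked example in the text
expected <- c(4, 10, 14, 8)

# Count how many expected values are 5 or less
n_small <- sum(expected <= 5)

# Convert the count to a percentage of all expected values
pct_small <- 100 * n_small / length(expected)

# The assumption is met only if no more than 20% are 5 or less
assumption_met <- pct_small <= 20

print(n_small)
print(pct_small)
print(assumption_met)
```

If you later run a test in R, the expected values are stored in the result of chisq.test under the name 'expected', so the same three lines can be applied to them directly.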

You answered that the number of expected values that were 5 or less was $q6 <<if $q6 is "0">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past8]] <</if>>
You answered that the smallest expected value was $q7 <<if $q7 is "6.8">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past8]] <</if>>
You answered that the largest expected value was $q8 <<if $q8 is "50.5">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|past8]] <</if>>
<<if $q6 eq "0" and $q7 eq "6.8" and $q8 eq "50.5">>Great work! Now answer this: what percentage of expected values are 5 or less? <<radiobutton "$q9" "0%">> 0% <<radiobutton "$q9" "12.5%">> 12.5% <<radiobutton "$q9" "22.8%">> 22.8% <<radiobutton "$q9" "40%">> 40% [[Check your answer|past10]] <<else>> Something doesn't look quite right there. Be sure to check the decimal places carefully. <</if>>
<<if $name is "cheat">> 0% <</if>>
[[<|past8]] [[<<|start]]

You answered that the observed count of Juvenile Detritivores is $q10 <<if $q10 is "not significantly different to expected">>Correct! A value of 0.71196 is between -2 and +2, so isn't significant. <<else>>Hmm, your answer doesn't look right. Maybe [[try again|past10]] <</if>>
You answered that the observed count of Adult Omnivores is $q11 <<if $q11 is "significant at the 0.05 level">>Correct! A value of -2.061 is between -2 and -4, so is significant at the 0.05 level. <<else>>Hmm, your answer doesn't look right. Maybe [[try again|past10]] <</if>>
You answered that the observed count of Juvenile Invertivores is $q12 <<if $q12 is "not significantly different to expected">>Correct! A value of -1.5794 is between -2 and +2, so isn't significant. <<else>>Hmm, your answer doesn't look right. Maybe [[try again|past10]] <</if>>
You answered that the observed count of Adult Piscivores is $q13 <<if $q13 is "significant at the 0.01 level">>Correct! A value of 4.7381 is greater than +4, so is significant at the 0.01 level. <<else>>Hmm, your answer doesn't look right. Maybe [[try again|past10]] <</if>>
<<if $q10 is "not significantly different to expected" and $q11 is "significant at the 0.05 level" and $q12 is "not significantly different to expected" and $q13 is "significant at the 0.01 level">>Great work $name! In practice, we tend not to differentiate between significance at the 0.01 and 0.05 levels when writing up results. The way this would be written up in a paragraph would be...
The results of the chi-squared test indicated a significant departure of observed from expected values (χ² = 43.8, df = 3, P < 0.001, Cramér's V = 0.47). Juvenile piscivores were significantly below expected (residual = -3.2). Adult omnivores were significantly below expected (residual = -2.1). Adult piscivores were significantly above expected (residual = +4.7). All other counts did not depart from expected values.
What this means is that there were fewer juveniles that ate fish and fewer adults that had an omnivorous diet than we would have expected if there were no associations in the data. However, there were more adults that ate other fish, than what we would expect if there were no associations in the data. [[Proceed|past12]] <<else>> Something doesn't look quite right there. You might need to try those answers again. <<endif>> [[<|past10]] [[<<|start]]

Mosaic Plot
Finally, we need to generate a graph that is suitable for a chi-squared test. The most suitable plot is a Mosaic Plot. You will find that RStudio has a default option for showing significant residuals on a Mosaic Plot. Unfortunately, at the time of writing, Past does not have this option (but Past is updated regularly, so it's worth checking in case this feature has been added).
We will generate a standard mosaic plot with colours showing the counts by categories. Navigate back to Past and highlight the whole set of numbers. Now select 'Plot > Mosaic Plot'.
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/mosaic_plot.png" alt="past" width="50%" height="auto"/>
You will probably find that the names of the categories are on top of each other. The easiest way to fix this is simply to grab the window and drag it to reshape it into a rectangle. Like so:
<iframe width="560" height="315" src="https://www.youtube.com/embed/R8VH8KDnA_E?rel=0&showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
You can also select 'Graph settings' and play around with the settings, and then 'Copy' the graph to an image file. If you want to save a PDF version of the image, select 'print' and print to a PDF. A higher quality SVG file option is available inside the 'Graph settings'.
You should be able to obtain a plot that looks something like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/mosaic_plot_arrington.png" alt="past" width="50%" height="auto"/>
However, remember that (at the time of writing) Past doesn't have an automatic function for marking values that are significantly above or below expected on the mosaic plot itself. You either need to create a mosaic plot in R, or write the significance levels manually onto the Past plot using a simple paint program or similar.
You could also write the significance levels into the caption, although it is better to present these results visually if you can.
Excellent work $name. You've now completed a chi-squared test in Past. You can either return to the [[start]] or try applying a chi-squared test in [[RStudio|RStudio_alt_start]].
[[<|past11]] [[<<|start]]

You answered that $q9 of expected values had values of 5 or less <<if $q9 is "0%">>Correct! <<else>>I don't think that looks correct. Perhaps best to [[try again|past9]] <</if>> Okay. Let's move on. The assumptions are met and we have a significant result. Under classical statistical norms, if a P value is <0.05 we deem it to be 'significant'. This is an arbitrary threshold, but is more or less standard in the sciences. If a P value is <0.05, then we can report this as a result, and interpret the result in our Discussion. You would write this in the Results section like so: The results of the chi-squared test indicated a significant departure of observed from expected values (χ2 = 43.8, df = 3, P < 0.001, Cramér's V = 0.47). Note that the test statistic tends to be written to one decimal place. Degrees of freedom don't (typically) have decimal places, P values are written to 3 decimal places and effect sizes are usually written to either two or three decimal places. But how are counts significant? The chi-squared test has told us that the counts are somehow different to expected values. But all this tells us is that there is at least one association somewhere in the data. We need to look at the standardised residuals to identify how they depart from expected values. Navigate back to your contingency table results and select the Residuals tab. Now select the drop-down menu and select Standardized residuals. You should get an image like this: <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/standard_resid.png" alt="past" width="50%" height="auto"/> How to interpret this? Standardised residuals that are between -2 and -4 or between +2 and +4 are significant at the 0.05 level. Standardised residuals that are less than -4 or greater than +4 are significant at the 0.01 level. Residuals between -2 and +2 are not significant. 
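The rule of thumb above for reading standardised residuals can be sketched as a small R function. Two things here are assumptions on my part: the function name classify_residual is made up for this example, and a residual of exactly 2 or exactly 4 is placed in the higher band, because the rule as stated does not say how to treat the boundaries. The residuals are invented for illustration and are not the Arrington results.

```r
# Classify standardised residuals using the rule of thumb from the text.
# Assumption: values of exactly 2 or 4 go into the higher band, because
# the text does not say how to treat the boundaries.
classify_residual <- function(r) {
  size <- abs(r)
  ifelse(size >= 4, "significant at the 0.01 level",
         ifelse(size >= 2, "significant at the 0.05 level",
                "not significantly different to expected"))
}

# Made-up residuals for illustration (not the Arrington results)
example_residuals <- c(-4.5, -2.5, 0.3, 3.1)
result <- classify_residual(example_residuals)

print(data.frame(residual = example_residuals, interpretation = result))
```

The sign of the residual is ignored when judging significance; it only tells you whether the count was above (+) or below (-) expected.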
Note that this method makes no correction for multiple comparison error; however, this tends not to be raised as a concern when interpreting the residuals of a chi-squared test in this way.
Now, look at the residuals. Based on the rules given above:
The observed count of Juvenile Detritivores is... <<radiobutton "$q10" "not significantly different to expected">> ...not significantly different to expected. <<radiobutton "$q10" "n">> ...significant at the 0.05 level. <<radiobutton "$q10" "n">> ...significant at the 0.01 level.
The observed count of Adult Omnivores is... <<radiobutton "$q11" "n">> ...not significantly different to expected. <<radiobutton "$q11" "significant at the 0.05 level">> ...significant at the 0.05 level. <<radiobutton "$q11" "n">> ...significant at the 0.01 level.
The observed count of Juvenile Invertivores is... <<radiobutton "$q12" "not significantly different to expected">> ...not significantly different to expected. <<radiobutton "$q12" "n">> ...significant at the 0.05 level. <<radiobutton "$q12" "n">> ...significant at the 0.01 level.
The observed count of Adult Piscivores is... <<radiobutton "$q13" "n">> ...not significantly different to expected. <<radiobutton "$q13" "n">> ...significant at the 0.05 level. <<radiobutton "$q13" "significant at the 0.01 level">> ...significant at the 0.01 level.
[[Check your answers|past11]] <<if $name is "cheat">> non-sig sig at 0.05 non-sig sig at 0.01 <</if>>
[[<|past9]] [[<<|start]]

Hi $name. Now that you've completed the tutorial on Past, we can take a look at R and RStudio.
What is R?
R is a free, open source programming language used for statistical analysis and graphing. Along with Python, it is one of the default professional tools used by data scientists. R has a bit of a learning curve, but don't worry, you can do it! And once you learn the basics, the syntax is largely the same or similar across many different tasks. It's not as hard as it seems at first.
What is RStudio?
RStudio is a wrapper for the free statistical programming language R. RStudio helps make R a more user-friendly experience. You need to <a href="https://www.r-project.org/" target="_blank">download and install R first</a>, and then <a href="https://www.rstudio.com/products/rstudio/download/" target="_blank">download and install RStudio</a>.
Once you have downloaded and installed R and RStudio you can [[proceed|R1b]]
[[<|start]]

Once you have downloaded R and RStudio, open RStudio (there is no need to open R as well; it runs automatically inside RStudio). You should get a set of windows that look like this:
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/RStudio_1.png" alt="RStudio" width="80%" height="auto"/>
This might all look a bit confusing, but it's easy to navigate once you know what the parts are.
Before we go on, one important bit is missing. You need a place to store and save your script. You can use either:
* R script: a plain text file. Highly stable, but not very flashy to look at. It uses the .R file extension (name_of_file.R), but it is really just a plain text file (like name_of_file.txt) with the extension changed.
* R Notebook: attempts to mimic a <a href="http://jupyter.org/" target="_blank">Jupyter Notebook</a> layout. Some people like R Notebooks. The main substantial advantage of R Notebooks is that if you are involved in a collaborative project (e.g. between you and a supervisor in an honours project), the Notebook is a live document that is easier to work on together. The only real downside of notebooks is that they are marginally more complicated to navigate than a plain R script, and they are less stable (only because nothing is more stable than a plain text file).
Either is totally fine. It's just a matter of preference.
* However, remember, if you are working in a plain text file you need to use plenty of hash symbols (#) to leave notes (comments) for yourself
* And if you are working in a notebook you should be using one chunk per task. That is, when you move on to a new 'task' in the R scripting, you should create a new chunk for it.
Why are we not using the learnr R package?
The learnr R package is great for baby steps, but I actually think it makes running code a little too easy. It is easy to switch off mentally and just start running stuff on autopilot. It's also only somewhat stable, and new updates in the past have sometimes caused the learnr R package to run aground.
* Using a single JavaScript file is stable and portable. You do need internet access for all the videos and images to load correctly, but this file should never trip over because of updating problems.
* Copying and pasting (or better yet, actually writing out code) is much closer to what your actual R coding experience will be like. It's best to get a taste of this now.
* Although it appears to be possible to 'colour code' the code in learnr, doing so is by no means straightforward. I think colour coding is hugely helpful for understanding the code, rather than just seeing it as a jumble of letters and symbols. That's a big reason why I've opted for a JavaScript tutorial instead.
To open a new R Script select 'File > New File > R Script'
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R-script.png" alt="RStudio" width="50%" height="auto"/>
To open a new R Notebook select 'File > New File > R Notebook'
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R-notebook.png" alt="RStudio" width="50%" height="auto"/>
Because I'm not a very clever person, I find R Notebooks a bit confusing. This means I mostly work in plain old-fashioned R scripts. __However, this makes no difference to the steps for running tests.__ The only difference is in how the code is run...
# R script: Use the 'Run' button or Command-Return (Mac) or Ctrl-Enter (PC) to run a single line of code. The cursor will drop down a line.
# R script: To run a chunk, highlight what you want to run and use the run commands as per above.
# R notebook: There is no 'Run' button. Use Command-Return (Mac) or Ctrl-Enter (PC) to run a single line of code. The cursor will drop down a line.
# R notebook: To run a chunk, click the little green play arrow associated with the chunk. You need to organise code into chunks as you go.
Once you have a Script or Notebook, you should have four windows. Here are the important parts of the windows.
R Script Layout <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R01.png" alt="RStudio" width="100%" height="auto"/> R Notebook Layout <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R02.png" alt="RStudio" width="100%" height="auto"/> On the [[next page|R2]] we will start running some basic code. [[<|RStudio_alt_start]] [[<<|start]]

Three ways of coding
We're going to start off by using an extremely simple dataset that consists of just two numbers, '52' and '20'. You can think of these as two independent sets of counts. It could be counts of males and females, or counts of two distinct behaviours, or counts of two levels of gene expression, or two phenotypes. A chi-squared test could be used to ask the question: Have these counts departed significantly from a 50:50 ratio?
There are three commonly used ways to code in R. We'll call these 'nested', 'object orientated' and 'piping'. Of these, 'piping' requires that you install and load a library. The other two methods just require writing out syntax in a particular way.
So, first off, install and load 'tidyverse'. This is a collection of different libraries that all work together to help with data and coding management, as well as a few other things.
install.packages("tidyverse") # installs a library
library(tidyverse) # loads a library
A library for effect sizes
We are going to obtain a Cramér's V for the chi-squared test. This is an 'effect size'. Cramér's V ranges from 0 (no association) to 1 (perfect association). It is much like a Pearson's r, except that there cannot be a negative association, so Cramér's V will never be a negative number.
Cramér's V also requires a library (i.e. it is not part of the core R package). R has a lot of libraries, which is great. Just about anything you could ever want to do has probably already been put into a library by someone, somewhere. So, now install and load the 'lsr' library, which contains a Cramér's V command.
install.packages("lsr") # installs a library
library(lsr) # loads a library
Note that you have to be connected to the internet to download and install libraries. You only ever need to install a library once, but you do need to load it every time you start a new RStudio session.
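To see where a Cramér's V value comes from, here is a minimal base-R sketch that calculates it by hand for a two-way table. It assumes the standard textbook formula, V = sqrt(chi-squared / (n * (k - 1))), where n is the total count and k is the smaller of the number of rows and columns. The 2 x 2 counts are made up purely for illustration, and this is not necessarily how the 'lsr' library computes its value in every case (for example, R applies a continuity correction to 2 x 2 tables by default, which is switched off here).

```r
# Made-up 2 x 2 table of counts, for illustration only
counts <- matrix(c(30, 10,
                   10, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 outcome = c("yes", "no")))

# Chi-squared test without the continuity correction, so that the
# statistic matches the plain textbook formula
test <- chisq.test(counts, correct = FALSE)

chi_sq <- unname(test$statistic)  # chi-squared value
n      <- sum(counts)             # total number of observations
k      <- min(dim(counts))        # smaller of (rows, columns)

# Cramér's V = sqrt(chi-squared / (n * (k - 1)))
cramers_v <- sqrt(chi_sq / (n * (k - 1)))
print(cramers_v)
```

Because the value is scaled by the sample size and the table dimensions, it stays between 0 and 1 regardless of how many observations you collected, which is what makes it comparable across tests.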
I'm going to step through each of these coding methods just to give you a taster, but we are going to proceed with standard object orientated coding rather than nested or piping for the rest of this tutorial. Why? * Everyone (more or less) agrees that nested coding is a great way to give yourself a terrible headache. It is super easy to make mistakes with brackets, and then everything falls apart. * Even researchers who really love tidyverse piping tend to only use it when 1) they have a chain of functions they need to run together or 2) they are doing data manipulation (subsetting etc). It is highly unusual to use piping for simple activities like running chi squared tests or building and testing linear models. It would be like crossing a river to get a drink of water, if that makes sense. Nonetheless, let's (quickly) look at how these three methods compare. 1. Nested Coding Nested coding is the 'native' coding mode in R. It is quick and requires the least typing, but it is extremely easy to insert errors or lose track of parentheses. We won't be using nested coding. It is prone to error, and can turn into a confusing mess quickly. I recommend you either use object orientated coding or piping. 2. Object Orientated Object orientated coding is also 'native' to R, but represents a bit of conceptual finesse. In object orientated coding you create an object (e.g. dataset, ANOVA result), and then do things to it (e.g. look at the structure, ask for a summary). I personally like object orientated coding because humans naturally think in terms of objects (nouns) and actions (verbs). It is also time-saving, because once you've created an object once, you can do multiple tasks with it. You don't have to keep writing out the early steps. However, a downside to object orientated coding is that you may have to keep track of a lot of names. Using appended extensions like .aov or .lm can help, but it can be easy to lose track of which name links to what. 3. 
Piping Piping is part of the 'tidyverse' ecosystem, and is largely intended to mimic piping in Linux. A 'pipe' is simply a way of sending something from one place to another, much like a physical pipe in a house. Some people hold that a pipe also needs to be linking functional steps together, but for our purposes an exact definition is neither here nor there. The left arrow used for object orientated coding (<-) is (arguably) also a form of pipe, albeit one that is extremely limited. The piping system used in 'tidyverse' allows you to link functions together as steps (which the left arrow won't). However, it is also worth being aware that there are inconsistencies in the syntax. For instance, you can't pipe things to a combine 'c' function (as far as I can tell), and there is no clear way to use piping in conjunction with some non-parametric tests. For these reasons, tidyverse piping seems to be mostly restricted to big data manipulation. Finally, you can actually use an object orientated approach combined with tidyverse piping and nested code, but now you're really running the risk of creating all sorts of weird errors. There are times when this makes sense, such as when you might want to use tidyverse piping to run data through a series of subsets, and then you want to apply linear models using object orientated code, but the key thing is just to know why you are using the code you are using. People sometimes get into the habit of just copying code from online forums without really understanding it, and that is another way to mix up methods and get confused. Now, here are examples of each approach doing the same tasks in RStudio. Try each of them (remember to copy and paste the code only... if you copy and paste the numbered instructions, R will be confused). 1. Nested Coding 1) Apply the 'chisq.test' function to a combined, 'c', set of two numbers, 52 and 20. chisq.test(c(52,20)) 2) Apply the 'cramersV' function to a combined, 'c', set of two numbers, 52 and 20. 
cramersV(c(52,20)) 2. Object Orientated 1) Create an object called 'my.data'. Drop a combined, 'c', set of numbers, 52 and 20, into 'my.data'. my.data <- c(52, 20) 2) Apply the 'chisq.test' function to the 'my.data' object. chisq.test(my.data) 3) Apply the 'cramersV' function to the 'my.data' object. cramersV(my.data) 3. Piping 1) Apply the combine, 'c', function to the numbers 52 and 20, and then pipe '%>%' this... c(52, 20) %>% 2) ...to the 'chisq.test' function. chisq.test() 3) Apply the combine, 'c', function to the numbers 52 and 20, and then pipe '%>%' this... c(52, 20) %>% 4) ...to the 'cramersV' function. cramersV() All right. That is the last we will see of 'nested coding' or 'tidyverse piping' in these tutorials. They exist, and you can use them if you really like those approaches, but, object orientated coding is much easier to use and understand when running simple tests. Make sure you keep your answers visible in RStudio. You will need them to answer questions on the [[next page|R5]]. [[<|R3]] [[<<|start]]
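The time-saving side of object orientated coding can be sketched as follows. This is an illustration only: the object name 'my.chisq' is arbitrary and not used elsewhere in the tutorial, and only base R functions are involved.

```r
# Create the data object once
my.data <- c(52, 20)

# Store the test result as an object too (the name 'my.chisq' is arbitrary)
my.chisq <- chisq.test(my.data)

# Reuse the stored object for several tasks without re-running the test
my.chisq$statistic  # the chi-squared value
my.chisq$parameter  # the degrees of freedom
my.chisq$p.value    # the P value
names(my.chisq)     # a list of everything stored inside the object
```

Each of the lines after the test simply looks inside the stored object, so the early steps never have to be written out again.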

Colour coding Code will be colour coded to make it easier to understand. Orange will be used for functions (commands), blue for anything that you can change or name yourself (the 'moving pieces' of the code, if you like), black for basic syntax requirements (arrows, brackets, commas mostly), green for comments (R doesn't read anything after a hash symbol), and purple for libraries. For example: library(vcd) # loads library 'vcd' attach(skink) # attaches the skink dataset Running code Code will be entered into either your Script or Notebook, and run from there. It doesn't matter which you use, but for your own sake you must use either a Script or Notebook. Entering code directly into the execution window is possible, __but nothing is saved__. Have a go at entering this code and running it. data(cars) # loads the cars dataset, which comes bundled with R plot(cars) # plots the cars dataset str(cars) # gives you the 'structure' of the cars dataset. Really useful for spotting errors in a dataset! Remember, these are the instructions for running code: # R script: Use the 'Run' button or command-return (Mac) or Ctrl-Enter (PC) to run a single line of code. The cursor will drop down a line. # R script: To run a chunk, highlight what you want to run and use the run commands as per above. # R notebook: There is no 'Run' button. Use command-return (Mac) or Ctrl-Enter (PC) to run a single line of code. The cursor will drop down a line. # R notebook: To run a chunk, click the little green play arrow associated with the chunk. You need to organise code into chunks as you go. Now, look at the 'structure' of the dataset and answer these questions: How many observations are in the cars dataset? hint: it is shown as 'obs.' <<textbox "$q14" "">> How many variables are in the cars dataset? hint: it is shown next to the observations <<textbox "$q15" "">> Speed and distance are the same type of variable. They are both... 
<<radiobutton "$q16" "factors">> Factors <<radiobutton "$q16" "characters">> Characters <<radiobutton "$q16" "numbers">> Numbers <<radiobutton "$q16" "integers">> Integers [[Check your answer|R3]] <<if $name is "cheat">> 50 2 Numbers <</if>> [[<|R1a]] [[<<|start]]

You answered that there were $q14 observations in the cars dataset <<if $q14 is "50">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R2]] <</if>> You answered that there were $q15 variables in the cars dataset <<if $q15 is "2">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R2]] <</if>> You answered that 'speed' and 'distance' were both $q16 <<if $q16 is "numbers">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R2]] <</if>> <<if $q14 eq "50" and $q15 eq "2" and $q16 eq "numbers">>Great work $name! Now, we'll move onto [[chi-squared tests|R4]]. <</if>> [[<|R2]] [[<<|start]]

Interpreting the output You should have obtained output that looks like this: > chisq.test(my.data) Chi-squared test for given probabilities data: my.data X-squared = 14.222, df = 1, p-value = 0.0001624 > cramersV(my.data) [1] 0.4444444 What does the output mean? * data: my.data This is reminding you what dataset this test was applied to. * X-squared: This is the chi-squared value. In a manuscript it is often written using the Greek letter chi with an exponent, like this χ². A chi-squared statistic is a ratio of signal-to-noise. The higher the number, the more likely it is that some of the observed values are departing from the expected values. I won't get into how observed and expected values are calculated here, but Video 2 on the <a href="http://cpjohnstone.com/video-tutorials/" target="_blank">video tutorial page</a> explains this. * df: This is the degrees of freedom. The degrees of freedom can be thought of as a number that tells you how much independent information you have in your test. * p-value: This is the P value, given that the null (no association) is true. Unpacking this a little, the P value represents the probability of obtaining a result at least as extreme as the one you got by chance alone, assuming that the null hypothesis is true. * Cramer's V : Technically, this should be written Cramér's V (with an acute accent over the 'e'). This is an 'effect size', and is a way of calculating correlation in tables with more than 2x2 rows and columns. Cramér's V ranges from 0 (no association) to 1 (perfect association). It is much like a Pearson's r, except that there cannot be a negative association, so Cramér's V will never be a negative number. We interpret it in the same way we interpret a Pearson's r. If Cramér's V is small (below 0.3-0.4), then even if we do get a significant result, we should treat it with caution because the real biological effect might be quite marginal. Now, answer the following questions... What is the value of the Chi-squared statistic? 
Write your answer to one decimal place. <<textbox "$q17" "">> What is the value of the degrees of freedom for the chi-squared test? Don't include any decimal places. <<textbox "$q18" "">> What is the P-value for the chi-squared test? Write your answer to three decimal places. If the P-value is less than 0.001, this is written as <0.001. <<textbox "$q19" "">> [[Check your answer|R6]] <<if $name is "cheat">> 14.2 1 <0.001 <</if>> [[<|R4]] [[<<|start]]
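For anyone curious about where the numbers in the output come from, here is a rough sketch that rebuilds them by hand using base R only. The object names are made up for this example. The effect size line uses the square root of chi-squared divided by the total count, which reproduces the Cramér's V shown above for this two-category case; I am assuming, rather than claiming, that this is how the 'lsr' library arrives at it.

```r
observed <- c(52, 20)

# Expected counts under a 50:50 ratio: the total split evenly across the categories
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi.sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
deg.free <- length(observed) - 1

# P value: upper tail of the chi-squared distribution
p.value <- pchisq(chi.sq, df = deg.free, lower.tail = FALSE)

# Effect size for this two-category case: sqrt(chi-squared / total count)
v <- sqrt(chi.sq / sum(observed))

# Print the four values so they can be compared with the RStudio output
chi.sq
deg.free
p.value
v
```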

You answered that the chi-squared value was $q17 <<if $q17 is "14.2">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R5]] <</if>> You answered that the degrees of freedom was $q18 <<if $q18 is "1">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R5]] <</if>> You answered that the P-value for the chi-squared test was $q19 <<if $q19 is "<0.001">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R5]] <</if>> <<if $q17 eq "14.2" and $q18 eq "1" and $q19 eq "<0.001">>Great work! [[Proceed|R7]] <<else>> Something doesn't look quite right there. Be sure to check the decimal places carefully. <<endif>> [[<|R5]] [[<<|start]]

Checking assumptions As with all tests, we need to check the assumptions of a chi-squared test before we interpret the results. A chi-squared test only has two assumptions: # Categories are independent # No more than 20% of expected values are <5 There are no widely accepted tests for independence. This is a matter of good experimental design, and involves taking care not to pseudoreplicate your samples. As for the second assumption, R will print a warning message when expected values fall below 5, but we can also check this by looking at the expected values in RStudio. If there is a problem, you will get this warning message. Warning message: In chisq.test(c(3, 5)) : Chi-squared approximation may be incorrect We didn't see that message, which means assumptions have been met, but let's check the expected values anyway. The way the assumption is phrased may seem a bit confusing, but it is literally just a count of how many expected values are less than 5. So, if your expected values were 4, 10, 14 and 8, then 4 is less than 5. 10, 14 and 8 are greater than 5. This means one value was less than 5, and three values were more than 5. This means that 25% of the expected values were less than 5... which means we haven't met the assumption in this case. Now, let's check this. Expected values Create the chi-squared object with two counts, 52 and 20, and then ask to see the expected values (exp, short for 'expected'), which are inside the chi-squared object. The dollar sign is used to indicate that one variable is inside another variable. my.data <- c(52, 20) chisq.test(my.data)$exp Because there are only two counts, and we are testing a departure from a 50:50 ratio, you should get the same expected value twice. What is the expected value you obtained? <<textbox "$q20" "">> [[Check your answer|R8]] <<if $name is "cheat">> 36 <</if>> [[<|R6]] [[<<|start]]
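The 20% rule described above can be sketched in a couple of lines of base R. The helper name 'prop.small' is made up for this example; it simply reports the proportion of expected values that are below 5, which you can then compare against 0.20.

```r
# Hypothetical helper: proportion of expected values that are below 5
prop.small <- function(expected) {
  mean(expected < 5)
}

# The worked example from the text: one of the four expected values is below 5
prop.small(c(4, 10, 14, 8))  # 0.25, which is more than 0.20

# The two-count example: none of the expected values are below 5
prop.small(chisq.test(c(52, 20))$expected)  # 0
```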

You answered that the expected value was $q20. <<if $q20 is "36">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R7]] <</if>> <<if $q20 is "36">>Great work $name! Because 36 is >5, we don't have any expected values <5. Assumptions have been met. This means that we can write our results for this test. The results of the chi-squared test indicated a significant departure of observed from expected values (χ² = 14.2, df = 1, P < 0.001, Cramér's V = 0.44). However, let's look at how to get around the problem of not meeting assumptions when it happens. [[Proceed|R9]] <</if>> [[<|R7]] [[<<|start]]

Monte Carlo Simulation If your chi-squared test does not meet assumptions, you can use a Monte Carlo simulation to generate valid results. To do this, you set 'simulate' to 'true' in your code. Let's try an example where the assumptions are not met. We'll create a 2x2 table of values. * Let's imagine you are counting pink and white flowers across male and female plants, and you want to know whether male or female plants might be more likely to produce either pink or white flowers. After counting flowers, we get this result: <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/flowers_table.png" alt="RStudio" width="40%" height="auto"/> We could use the 'matrix' function in R to generate a table inside R, but let's get some practice creating a csv file and importing it. # Open Excel # Type the numbers in as shown below # Save as flowers.csv <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/flowers_csv.png" alt="RStudio" width="40%" height="auto"/> You can leave the csv file open, or close it. RStudio will import a copy of the file only, so that anything you do in RStudio won't be saved in the original file. This means that you always have a backed-up data file, which is good to have. Now, see if you can import the dataset. The 'Import Dataset' button should be visible on the top right-hand window. <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R02.png" alt="RStudio" width="100%" height="auto"/> <img src="http://cpjohnstone.com/wp-content/uploads/2018/09/R01.png" alt="RStudio" width="100%" height="auto"/> You might see 'base' and 'readr' options... either of these is fine. They do the same thing but rely on different packages to do so. Here is the window you see when you use the 'readr' option. 
<img src="http://cpjohnstone.com/wp-content/uploads/2018/09/readr_import.png" alt="RStudio" width="100%" height="auto"/> You want to be sure that: # You have a Name that is short and easy to type # First Row as Names is ticked # The Delimiter is set to Comma for a csv file (comma-separated values). Here's a short video showing the saving of a csv file, and importing it into RStudio. You shouldn't have to switch tick boxes on and off (everything should default to the correct settings), but I've done this in the video so that you can see how it changes the way R sees the dataset. <iframe width="560" height="315" src="https://www.youtube.com/embed/SaJDmlwwm2k?rel=0&showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe> Once you have imported the dataset, you can [[move onto the next step|R10]]. [[<|R8]] [[<<|start]]
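If you would rather import a csv file with code than with the 'Import Dataset' button, the base R function 'read.csv' does the same job. The sketch below is an illustration under stated assumptions: the counts are HYPOTHETICAL (they are not the numbers from the flowers table), the file is written to a temporary folder so that nothing on your computer is overwritten, and the layout of one row per combination with columns called FLOWERS, SEX and COUNT is my assumption based on the names used in the code later in this tutorial.

```r
# HYPOTHETICAL counts for illustration only; they are not the tutorial's flower counts
flowers.example <- data.frame(
  FLOWERS = c("pink", "white", "pink", "white"),
  SEX     = c("female", "female", "male", "male"),
  COUNT   = c(10, 2, 3, 12)
)

# Save as a csv file in a temporary folder, then read it back in
csv.path <- file.path(tempdir(), "flowers_example.csv")
write.csv(flowers.example, csv.path, row.names = FALSE)

# header = TRUE tells R that the first row holds the column names
flowers.check <- read.csv(csv.path, header = TRUE)

# Check the structure for errors, just as we did with the cars dataset
str(flowers.check)
```

With your own file you would replace 'csv.path' with the location of flowers.csv on your computer.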

Contingency Tables First, we need to take our dataset and turn it into a contingency table. To do this, we use the xtabs function. We will then apply a chi-squared test to the contingency table. Create a contingency table Create a contingency table using the xtabs function, and then apply a chi-squared test to the flowers.xtab object. Note that the flowers.xtab object could be called anything at all. I've just picked this name because it makes it easy to remember that this is a contingency table. flowers.xtab <- xtabs(COUNT ~ FLOWERS + SEX, data = flowers) flowers.xtab # check what the contingency table looks like chisq.test(flowers.xtab) Warning message Because more than 20% of the expected values are less than 5, you should get this output and warning message: Pearson's Chi-squared test with Yates' continuity correction data: flowers.xtab X-squared = 17.644, df = 1, p-value = 2.663e-05 Warning message: In chisq.test(flowers.xtab) : Chi-squared approximation may be incorrect Okay, now, rather than just give you the code, I'm going to ask you to look at the help file for chi-squared tests and see if you can work it out. Run this line of code: ?chisq.test Now, scroll through the help file. The screed of code up the top tells you what the default settings are for this particular test. You should see something like this (under 'Usage') chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p = FALSE, simulate.p.value = FALSE, B = 2000) Now, have a look through the Arguments. Which logical Argument is used to set a Monte Carlo simulation to either TRUE or FALSE for this test? <<radiobutton "$q21" "NULL">> NULL <<radiobutton "$q21" "rescale.p">> rescale.p <<radiobutton "$q21" "simulate.p.value">> simulate.p.value <<radiobutton "$q21" "length">> length Now answer this: which Argument is used to specify an integer that dictates the number of replicates used in the Monte Carlo test? 
<<radiobutton "$q22" "x">> x <<radiobutton "$q22" "y">> y <<radiobutton "$q22" "p">> p <<radiobutton "$q22" "B">> B [[Check your answer |R11]] <<if $name is "cheat">> simulate.p.value B <</if>> [[<|R9]] [[<<|start]]

You answered that the logical argument that specifies whether to run a Monte Carlo simulation was $q21. <<if $q21 is "simulate.p.value">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R10]] <</if>> You answered that the integer that specifies the number of Monte Carlo replicates to use is $q22. <<if $q22 is "B">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R10]] <</if>> <<if $q21 eq "simulate.p.value" and $q22 eq "B">>Great work! [[Proceed|R12]] <<else>> Something doesn't look quite right there. Best go back and check the help file again carefully. <<endif>> [[<|R10]] [[<<|start]]

Running the test Alright, now see if you can use a Monte Carlo simulation to obtain a valid P value for the flowers dataset that we've created. Create a contingency table and then apply a chi-squared test to the flowers.xtab object. Use the simulation commands to tell R that we want to use a Monte Carlo simulation based on 2000 simulations. flowers.xtab <- xtabs(COUNT ~ FLOWERS + SEX, data = flowers) # apply the chi squared test and drop it into an object called flowers.chisq flowers.chisq <- chisq.test(flowers.xtab, simulate.p.value = TRUE, B = 2000) # Look at the results flowers.chisq # Look at the expected values (exp is short for expected) flowers.chisq$exp # Look at the standardised residuals (res is short for residuals) flowers.chisq$res Stable results Try rerunning the code you used above. Do the results change slightly? Because the results are based on generating random numbers, they will differ slightly each time you run the test. You need to set a seed if you want a test like this to be stable. The 'seed' is a number (any number will do) that is used to start the generation of random numbers. If you set a seed, then you will get the same random numbers produced, over and over again. This is important if you want someone else to be able to run your test (that is based on random number generation) and get the same result, or if you want to come back to a test and get the same result in the future. Remember that you only have to do this if you are using something like a simulation which depends on random numbers. I've picked the number '42' for no particular reason except that it is a reference to a very funny book. You could use any number, as long as you use the same number in the future. set.seed(42) # set the seed first # apply the chi squared test and run it flowers.chisq <- chisq.test(flowers.xtab, simulate.p.value = TRUE, B = 2000) flowers.chisq Now try re-running the test a few times, running the set.seed line each time. Is the result stable now? A warning... 
Sometimes, when a student uses a randomisation method (like the Monte Carlo method we used above), and they get a borderline P value (like P = 0.052), they then re-run the test until (by chance) it is significant. This is definitely an example of P hacking, and you should not do it. If you do have a result that you are unsure about, you can run the test multiple times and take the average result, but don't go hunting for the result you want. All right, let's now try some [[plotting|R13]]. [[<|R11]] [[<<|start]]
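To see the effect of a seed for yourself, here is a small sketch using a HYPOTHETICAL 2x2 table (the counts are invented for illustration and are not the flowers data). The seed is set immediately before each run, so both runs draw the same random numbers and return the same P value.

```r
# HYPOTHETICAL 2x2 table of counts, for illustration only
example.tab <- matrix(c(10, 2, 3, 12), nrow = 2,
                      dimnames = list(FLOWERS = c("pink", "white"),
                                      SEX = c("female", "male")))

# Set the same seed before each run, so the same random numbers are drawn
set.seed(42)
run1 <- chisq.test(example.tab, simulate.p.value = TRUE, B = 2000)

set.seed(42)
run2 <- chisq.test(example.tab, simulate.p.value = TRUE, B = 2000)

# TRUE: the two runs give exactly the same P value
identical(run1$p.value, run2$p.value)
```

If you leave out the second 'set.seed' line, the two P values will usually differ slightly, which is the instability described on this page.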

Mosaic Plots The most appropriate plot to use with a Chi Squared test is a mosaic plot. R has a default mosaic plot option, which is nice enough, but there are some very nice mosaic plot options inside of a library called vcd. We'll use both options here. Incidentally, the implementation of the mosaic plot in ggplot2 lacks the functionality needed to show significant results automatically (you could manually work out the significance levels based on standardised residuals and then code the colours you want into the ggplot2 figure, but really, who wants to do that?). For this step, we're going to use a dataset called wildebeest.csv. In this study, researchers were interested in whether male and female wildebeest showed different frequencies of death due to 'predation' and 'other' causes. You can download the file <a href="http://cpjohnstone.com/wp-content/uploads/2018/10/wildebeest.csv" target="_blank">here</a>. Once you have downloaded the file (if it arrives as a zip file, unzip it by double-clicking), you should have a csv file. You can open the csv file in Excel (if you want to), to look at it, or you can just go straight to RStudio, and load it. Do you remember how to load a csv file into RStudio? Remember that there is an 'Import Dataset' command in the top-right window pane. Either 'Base' or 'Readr' are perfectly okay options to use. Just be sure that your header is set to 'yes' or 'true', and that the separator is set to 'comma' and not 'tab' or 'white space'. Once you have imported the dataset you should see a copy of it that looks like this in RStudio. <img src="http://cpjohnstone.com/wp-content/uploads/2018/10/Wildebeest.png" alt="past" width="40%" height="auto"/> The hypothesis that the researchers were interested in was: * There is a difference in the frequency of causes of wildebeest death dependent on sex. The null hypothesis would therefore be: * There is no difference in the frequency of causes of wildebeest death dependent on sex. 
In biology, we more usually talk about variables in terms of being 'predictors' and 'responses' rather than 'independent' and 'dependent' variables. The predictor(s) (sometimes, 'explanatory variable') are the same as the independent variable(s). The 'response' is the same as the 'dependent variable'. What do you think is the response that best relates to the hypothesis that the researchers were interested in testing? <<radiobutton "$q22" "site">> Site <<radiobutton "$q22" "sex">> Sex <<radiobutton "$q22" "death">> Death <<radiobutton "$q22" "count">> Count What do you think are the predictors that best relate to the hypothesis that the researchers were interested in testing? <<radiobutton "$q23" "site, sex and death">> Site, Sex and Death <<radiobutton "$q23" "sex and death">> Sex and Death <<radiobutton "$q23" "sex only">> Sex only <<radiobutton "$q23" "death only">> Death only [[Check your answers|R14]] <<if $name is "cheat">> Count Sex and Death (site wasn't a part of the hypothesis) <</if>> [[<|R12]] [[<<|start]]

You answered that the response is most likely $q22 <<if $q22 is "count">>Correct! <<else>>Hmm, that doesn't look right. Maybe [[try again|R13]]. Keep in mind that responses almost always have to be numerical (i.e. either continuous or a count). <</if>> You answered that the predictors were most likely $q23 <<if $q23 is "sex and death">>Correct! Remember, 'site' wasn't a part of the hypothesis, so it isn't a predictor. <<else>>Hmm, that doesn't look right. Maybe [[try again|R13]]. Keep in mind that the researchers were interested in an interaction of two biological variables, and 'site' was never mentioned in the hypothesis. <</if>> <<if $q22 eq "count" and $q23 eq "sex and death">>Great work $name! Now, we'll move onto [[a chi-squared test and mosaic plots|R15]]. <</if>> [[<|R13]] [[<<|start]]

Wildebeest: Chi-squared and Mosaic Plots Let's start off by creating a contingency table and applying a chi-squared test as before... we shouldn't need to use a Monte Carlo simulation because no more than 20% of the expected values are less than 5, but we'll still look at the expected values just to be sure of this. wildebeest.xtab <- xtabs(COUNT ~ SEX + DEATH, data = wildebeest) # apply the chi squared test and drop it into an object called wildebeest.chisq wildebeest.chisq <- chisq.test(wildebeest.xtab) # Look at the results wildebeest.chisq # Look at the expected values wildebeest.chisq$exp # Look at the standardised residuals wildebeest.chisq$res Base Mosaic Plot The base mosaic plot in R, conveniently enough, uses the function 'mosaicplot'. We apply the function to the contingency table we created above. We need to include the command 'shade = TRUE' to obtain red and blue colours representing the significance of the residuals. This allows us to determine which counts are significantly above or below expected. In a sense, this means that the mosaic plot is a little bit like a post hoc test, such as a Tukey's Test (typically performed after an ANOVA). The chi-squared test tells us that there is some sort of difference (or not). The mosaic plot tells us where the differences are. If there are no differences, the squares will all come up as grey. Try using this code. You should get plots similar to the ones below. mosaicplot(wildebeest.xtab) mosaicplot(wildebeest.xtab, shade = TRUE) <img src="http://cpjohnstone.com/wp-content/uploads/2018/10/mosaicplots_grey_colour.png" alt="mosaic plot" width="100%" height="auto"/> What does it mean? Read the boxes as if they were part of a grid. There are four boxes because there are four categories: # Female x Other # Female x Predation # Male x Other # Male x Predation Now, look at the Male x Other square in the coloured version. It is a relatively small box that is coloured pale red. What is the size based on? 
What does the colour mean? * The box sizes are graphical representations of the observed counts. If you check the contingency table, you'll find that the counts are proportional to the area of the boxes. * The colours are based on the residuals of the chi-squared test. * Blue boxes denote counts that are significantly above expected * Red boxes denote counts that are significantly below expected * Grey boxes (we don't have any in the wildebeest dataset) denote counts that were not significantly different to expected. You can take this to a finer level, because 'shade = TRUE' cuts the residuals at 2 and 4: * Dark blue means the residual is above +4 * Pale blue means the residual is between +2 and +4 (roughly significant at the 0.05 level) * Grey means the residual is between -2 and +2 (not significant) * Pale red means the residual is between -4 and -2 (roughly significant at the 0.05 level) * Dark red means the residual is below -4 Finally, you can switch the direction of the plot by switching the order of predictors in the contingency table. wildebeest.xtab <- xtabs(COUNT ~ SEX + DEATH, data = wildebeest) mosaicplot(wildebeest.xtab, shade = TRUE) wildebeest.xtab <- xtabs(COUNT ~ DEATH + SEX, data = wildebeest) mosaicplot(wildebeest.xtab, shade = TRUE) <img src="http://cpjohnstone.com/wp-content/uploads/2018/10/mosaicplot_swapped.png" alt="mosaic plot" width="100%" height="auto"/> Let's wrap up by producing a slightly nicer looking plot out of the vcd library and then writing up our results. [[Next|R16]] [[<|R14]] [[<<|start]]
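The residuals that drive the colours can be rebuilt by hand, which may help make the plot feel less like a black box. This is a rough sketch using HYPOTHETICAL counts (they are not the wildebeest data), and the object names are made up for this example. Each residual is the observed count minus the expected count, divided by the square root of the expected count; the last line sorts the residuals into bands using the cut points of 2 and 4 that the help file for 'mosaicplot' gives for 'shade = TRUE'.

```r
# HYPOTHETICAL counts, for illustration only (not the wildebeest data)
example.tab <- matrix(c(30, 10, 15, 25), nrow = 2,
                      dimnames = list(SEX = c("female", "male"),
                                      DEATH = c("other", "predation")))

# Expected counts: (row total x column total) / grand total
expected <- outer(rowSums(example.tab), colSums(example.tab)) / sum(example.tab)

# Residuals: (observed - expected) / sqrt(expected)
resid.by.hand <- (example.tab - expected) / sqrt(expected)
resid.by.hand

# These match the residuals stored inside the chi-squared object
all.equal(resid.by.hand, chisq.test(example.tab)$residuals,
          check.attributes = FALSE)

# Sort each residual into the bands used for shading (cut points at 2 and 4)
cut(as.vector(resid.by.hand), breaks = c(-Inf, -4, -2, 2, 4, Inf))
```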

Final plots & writing up... The library 'vcd' mosaic plots are a little more professional looking than those made using the base R function. You will need to install and load the library. One quirk is that the 'strucplot' function, which is used to make a mosaic plot in 'vcd', flips the axes around the other way to the base 'mosaicplot' function, but you can always reorder the contingency table to get the plot looking how you want it. install.packages("vcd") # installs vcd library("vcd") # loads vcd wildebeest.xtab <- xtabs(COUNT ~ SEX + DEATH, data = wildebeest) strucplot(wildebeest.xtab, shade = TRUE) wildebeest.xtab <- xtabs(COUNT ~ DEATH + SEX, data = wildebeest) strucplot(wildebeest.xtab, shade = TRUE) <img src="http://cpjohnstone.com/wp-content/uploads/2018/10/structplot_2.png" alt="mosaic plot" width="50%" height="auto"/><img src="http://cpjohnstone.com/wp-content/uploads/2018/10/structplot_1.png" alt="mosaic plot" width="50%" height="auto"/> We might as well also obtain an effect size, before writing up our final results... and we'll re-run our test and residuals just so we have them readily to hand... library("lsr") # loads a library cramersV(wildebeest.xtab) chisq.test(wildebeest.xtab) chisq.test(wildebeest.xtab)$res <img src="http://cpjohnstone.com/wp-content/uploads/2018/10/Screenshot-2018-10-11-at-12.56.44-pm.png" alt="raw results" width="60%" height="auto"/> The results of the chi-squared test indicated that the causes of death ('predation' and 'other') were significantly different for male and female wildebeest (χ² = 22.4, df = 1, P < 0.001, Cramér's V = 0.31). Female wildebeest were significantly more likely to die from causes other than predation (res = +2.66), and less likely to die from predation (res = -2.11). In contrast, male wildebeest were more likely to die from predation (res = +2.17) and less likely to die from other causes (res = -2.72). 
The Cramér's V (0.31) suggests a moderate strength of association, although it is worth noting that an association of this size leaves much of the pattern in the counts unaccounted for by sex alone, so that further investigation into causes of death seems warranted. All right. Great work, $name. You're done with RStudio chi-squared tests. If you haven't already completed the tutorial on using Past, [[you can do so now|PAST]]. Otherwise, you're now done with the chi-squared tutorials. [[<|R15]] [[<<|start]]