Hypothesis Testing: T-Test

What is a T-Test?

A t-test is an inferential technique to assess two competing hypotheses about the population means across one or two samples. We can use the t-test for three specific purposes:

One-Sample: tests whether the population mean is less than, greater than, or not equal to a prespecified value. The t statistic compares the observed data with what we expect under the null hypothesis, which states that the population mean equals the testing value. The resulting p-value tells us how likely we would be to observe evidence for the alternative hypothesis at least as strong as what we have when the null hypothesis is true. If the p-value is less than the specified significance level (e.g., less than 0.05), we reject the null hypothesis in favor of the alternative hypothesis, which states that the population mean is less than, greater than, or not equal to the testing value. Otherwise, we fail to reject the null hypothesis, indicating we do not have significant evidence for the alternative hypothesis.

Two Independent Samples: tests whether the difference of two population means is less than, greater than, or not equal to a prespecified value, which we frequently take to be zero. The t statistic compares the observed data with what we expect under the null hypothesis, which states that the difference in population means equals the testing value; usually, we take the testing value to be zero (i.e., the means are the same). The resulting p-value tells us how likely we would be to observe evidence for the alternative hypothesis at least as strong as what we have when the null hypothesis is true. If the p-value is less than the specified significance level (e.g., less than 0.05), we reject the null hypothesis in favor of the alternative hypothesis, which states that the difference of population means is less than, greater than, or not equal to the testing value. Otherwise, we fail to reject the null hypothesis, indicating we do not have significant evidence for the alternative hypothesis.

Paired Samples: tests whether the population mean of differences is less than, greater than, or not equal to a prespecified value, which we frequently take to be zero. This procedure is used for paired measurements, such as a pre-post design. The t statistic compares the observed data with what we expect under the null hypothesis, which states that the population mean of differences equals the testing value; usually, we take the testing value to be zero (i.e., the differences have mean zero). The resulting p-value tells us how likely we would be to observe evidence for the alternative hypothesis at least as strong as what we have when the null hypothesis is true. If the p-value is less than the specified significance level (e.g., less than 0.05), we reject the null hypothesis in favor of the alternative hypothesis, which states that the population mean difference is less than, greater than, or not equal to the testing value. Otherwise, we fail to reject the null hypothesis, indicating we do not have significant evidence for the alternative hypothesis.
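To make the one-sample and paired procedures concrete, here is a minimal Python sketch that uses only the standard library. It is not the app's code: the helper names (t_cdf, one_sample_t, paired_t) and the data values are made up for illustration, and the p-value is obtained by numerically integrating the t density rather than by calling a statistics package. The paired test is simply the one-sample test applied to the within-pair differences.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus the Simpson's-rule integral of the density on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps  # negative when x < 0, which makes the integral negative
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def one_sample_t(sample, mu0, alternative="two-sided"):
    """Return (t, df, p) for H0: population mean == mu0."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)      # standard error of the mean
    t = (mean(sample) - mu0) / se
    df = n - 1
    if alternative == "less":
        p = t_cdf(t, df)
    elif alternative == "greater":
        p = 1 - t_cdf(t, df)
    else:                                  # two-sided
        p = 2 * (1 - t_cdf(abs(t), df))
    return t, df, p

def paired_t(before, after, delta0=0.0, alternative="two-sided"):
    """Paired test: a one-sample test on the differences (after - before)."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, delta0, alternative)

# Made-up illustrative data (not from the app's sample datasets).
values = [3.1, 3.6, 3.3, 3.8, 3.4, 3.5, 3.2, 3.7]
print(one_sample_t(values, mu0=3.0, alternative="greater"))

before = [2.0, 2.5, 1.8, 2.2, 2.9]
after = [1.9, 2.3, 1.8, 2.0, 2.6]
print(paired_t(before, after, alternative="less"))
```

In practice, a statistics library would supply the t distribution directly; the integration here only keeps the sketch self-contained.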

The t-test can be used under the following conditions.

1. The observations are representative of the population of interest and independent.

2. The observations within each sample are normally distributed. Note that the t-test is robust to this assumption as long as the sample size is large (i.e., at least thirty).

Note: The independent two-sample t-test requires independent samples, as its name suggests. We employ Welch's t-test for two independent samples by default according to the work of Delacre et al. (2017), who show that Welch's t-test provides better control of Type 1 error rates (i.e., false positives) than the version that assumes equal variance.
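As a rough illustration of what Welch's version computes, the following Python sketch calculates the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The function name welch_t and the data are hypothetical and are not taken from the app; the point is only that each sample contributes its own variance, so no equal-variance assumption is needed.

```python
import math
from statistics import mean, variance

def welch_t(x, y, delta0=0.0):
    """Return (t, df) for H0: mean(x) - mean(y) == delta0, without pooling variances."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx   # squared standard error contributed by sample x
    vy = variance(y) / ny   # squared standard error contributed by sample y
    t = (mean(x) - mean(y) - delta0) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Made-up illustrative measurements for two independent groups.
group_a = [18.1, 18.7, 17.9, 19.0, 18.4, 18.3]
group_b = [18.5, 17.6, 19.4, 18.0, 19.1]
t_stat, df = welch_t(group_a, group_b)
print(t_stat, df)
```

The resulting degrees of freedom always fall between the smaller sample size minus one and the pooled value (nx + ny - 2), which is why Welch's test is never more liberal than the equal-variance version about the reference distribution.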

How to use this app?

Step 1: To use this app, go to the 'Dataset and Hypothesis' tab and upload your dataset as a .csv file, or select a sample dataset.

Step 2: Next, select the type of t-test (One-Sample, Two Independent Samples, or Paired Samples).

Step 3: You can check the assumptions in the 'Assumptions' tab. We recommend assessing assumptions visually using the provided graphical summary and confirming using the numerical summaries. The app will provide results for a Shapiro-Wilk (n ≤ 5000) or a Kolmogorov-Smirnov (n > 5000) test for normality. While these tests might be helpful, they have little power for small sample sizes and can be rather sensitive for large sample sizes, leading us to detect minuscule departures from normality.

Step 4: You can check the result of the t-test procedure (test statistics, decision making, and test visualization) in the 'Hypothesis Test' and 'Confidence Interval' tabs.

Step 5 (Optional): We also provide the results of a bootstrap approach for computing a confidence interval and a randomization test. These are nonparametric alternatives to the t-test that can be used when t-test assumptions are not met or to evaluate whether the results of the t-test procedure depend on its assumptions.
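For readers curious about what such resampling procedures look like, here is a minimal Python sketch of a two-sample randomization test and a percentile bootstrap confidence interval for a difference in means. This is an assumption-laden illustration, not the app's implementation: the function names, the number of resamples, the percentile method, and the data are all choices made for this example.

```python
import random
from statistics import mean

def randomization_p(x, y, reps=5000, seed=1):
    """Two-sided randomization p-value for a difference in means."""
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)                      # reassign group labels at random
        diff = mean(pooled[:len(x)]) - mean(pooled[len(x):])
        if abs(diff) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (reps + 1)            # count the observed arrangement too

def bootstrap_ci(x, y, reps=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for mean(x) - mean(y)."""
    rng = random.Random(seed)
    diffs = sorted(
        mean(rng.choices(x, k=len(x))) - mean(rng.choices(y, k=len(y)))
        for _ in range(reps)                     # resample each group with replacement
    )
    lo = diffs[int((1 - level) / 2 * reps)]
    hi = diffs[int((1 + level) / 2 * reps) - 1]
    return lo, hi

# Made-up illustrative data for two independent groups.
x = [10.1, 10.4, 9.8, 10.6, 10.2, 10.0, 10.5, 10.3]
y = [8.9, 9.2, 9.0, 8.7, 9.4, 9.1, 8.8, 9.3]
p_value = randomization_p(x, y)
lo, hi = bootstrap_ci(x, y)
print(p_value, lo, hi)
```

Because both procedures rely on random resampling, changing the seed changes the results slightly, which is the same run-to-run variation noted in the examples.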

Contact us

Please contact us if you have any questions at datascience@colgate.edu.

Example 1

Within the t-test app, we provide the penguin data that includes measurements for penguin species inhabiting islands in the Palmer Archipelago, made available through the palmerpenguins library for R (Gorman et al., 2014). Suppose researchers aimed to evaluate whether Adelie and Chinstrap penguins have differing bill depth (mm). This is a classic example of a scenario requiring the t-test framework. Note that this is a special case of the <a href="https://shiny.colgate.edu/apps/Collaboratory-Apps/ANOVA-Test/">ANOVA test</a>.

Here, we have two samples of observations (the Adelie and Chinstrap species) and a continuous attribute (bill depth). We will use the t-test procedure to evaluate whether the data support the claim that Adelie and Chinstrap penguins have different population mean bill depths.

First, we load the t-test app. Second, we click 'Sample Data' to load the penguin data. Once the data are loaded, we select the quantitative variable (bill_depth_mm) and the categorical variable (species). We ensure that the two-sample t-test (independent) is chosen, and then we specify the independent samples as Adelie and Chinstrap. The data summary provides our first look at the data.

The first step of conducting the t-test procedure requires us to evaluate the assumptions. When we click 'Assumptions', the data are plotted for interpretation.

This plot shows that the data are roughly normally distributed within each species, as the densities are symmetric and bell-shaped. Note that the application uses Welch's independent samples t-test, so we are not concerned with equal variances. Evaluating whether the observations are representative of the population of interest and independent is more challenging. These data were collected from many penguin nests across three different islands in the Palmer Archipelago, meaning the data are likely representative. We trust that the researchers collected data in a way that made the observations nearly independent.

The 'Hypothesis Test' tab shows the result of the t-test procedure. As we might expect after viewing the plotted data, there is not significant evidence that the population mean bill depths (mm) differ across Adelie and Chinstrap penguins (t=-0.4377, p=0.6623).

We can interpret a t-confidence interval for the population mean difference to provide context. We are 95% confident that the true population mean difference of bill depths (Adelie - Chinstrap) is between -0.4096 mm and 0.2611 mm. Note that zero is in this interval, indicating that it is plausible that the population means are the same, which agrees with the interpretation of the test.
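The reported interval and test statistic can be tied together with a little arithmetic. The sketch below assumes the interval is a symmetric t interval (estimate ± critical value × standard error) and that the testing value is zero; under those assumptions, it recovers the point estimate, the standard error, and the implied critical value from the numbers reported above. The variable names are ours.

```python
# Values reported in this example.
lower, upper = -0.4096, 0.2611   # 95% t-confidence interval (mm)
t_reported = -0.4377             # t statistic for H0: difference == 0

# A symmetric interval is centered on the point estimate.
estimate = (lower + upper) / 2   # estimated mean difference (Adelie - Chinstrap)
margin = (upper - lower) / 2     # margin of error

# With a testing value of zero, t = estimate / standard error.
se = estimate / t_reported
critical = margin / se           # implied t critical value

contains_zero = lower <= 0 <= upper
print(estimate, margin, se, critical, contains_zero)
```

The implied critical value is close to 2, as expected for a 95% interval with many degrees of freedom, and the interval containing zero matches the two-sided test failing to reject at the 0.05 level.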

The conditions for using the t-test (e.g., normality) were evaluated above and deemed satisfactory. This is reflected in the randomization test, which is very close to the parametric result. Specifically, the randomization test produces a p-value of 0.692, which leads us to conclude that there is not significant evidence that the population mean bill depths (mm) differ across Adelie and Chinstrap penguins. Note that this is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

The same is true for the confidence interval. Using the bootstrap confidence interval, we are 95% confident that the true population mean difference of bill depths (Adelie - Chinstrap) is between -0.4096 mm and 0.2611 mm. Note that this too is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

Example 2

Within the t-test app, we provide the MFAP4 data that includes measurements for Hepatitis C patients collected by the German network of Excellence for Viral Hepatitis and studied by Bracht et al. (2016). Suppose these researchers wanted to show that the human microfibrillar-associated protein 4 (MFAP4, U/ml) is increased for hepatitis C patients. The researchers can use the t-test to evaluate whether the population mean log-2 transformed MFAP4 is greater than that of healthy patients, which we take to be 1.71 U/ml (Zhang et al., 2019) on the log-2 scale.

Here, we have one sample of observations and a continuous attribute (MFAP4 log-2 U/ml). We will use the t-test procedure to evaluate whether the data support the claim that the population mean log-2 MFAP4 level in hepatitis C patients is larger than 1.71.

First, we load the t-test app. Second, we click 'Sample Data' to load the MFAP4 data. Once the data are loaded, we select the variable (log2.MFAP4). The data summary provides our first look at the data.

The first step of conducting the t-test procedure requires us to evaluate the assumptions. When we click 'Assumptions', the data are plotted for interpretation.

At this point, the plots show that the data are roughly normally distributed and fit the assumptions of the t-test model, though we note there is some slight departure from normality. Evaluating whether the observations are representative of the population of interest and independent is more challenging. In their paper, Bracht et al. (2016) tell us these data were collected at different sites using a protocol meant to reduce bias, meaning the data are likely to be representative. We trust that the researchers collected data in a way that made the observations nearly independent.

The 'Hypothesis Test' tab shows the result of the t-test procedure. As we might expect after viewing graphs of the data, there is significant evidence that the population mean log-2 MFAP4 level (U/ml) in hepatitis C patients is larger than 1.71 (t=39.6488, p<0.0001).

We can interpret a t-confidence interval for the population mean to provide context. We are 95% confident that the true population mean log-2 MFAP4 level of hepatitis C patients is between 3.3246 and 3.4929 U/ml. Note that the values the interval covers are larger than 1.71 on the log-2 scale, indicating that the population mean log-2 MFAP4 level for hepatitis C patients is larger than 1.71.


The same is true for the confidence interval. Using the bootstrap confidence interval, we are 95% confident that the true population mean log-2 MFAP4 level among hepatitis C patients is between 3.3241 and 3.4939. Note that this too is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

Bracht, T., Molleken, C., Ahrens, M., Poschmann, G., Schlosser, A., Eisenacher, M., ... & Sitek, B. (2016). Evaluation of the biomarker candidate MFAP4 for non-invasive assessment of hepatic fibrosis in hepatitis C patients. Journal of Translational Medicine, 14(1), 1-9.

Zhang, X., Li, H., Kou, W., Tang, K., Zhao, D., Zhang, J., ... & Xu, Y. (2019). Increased plasma microfibrillar-associated protein 4 is associated with atrial fibrillation and more advanced left atrial remodelling. Archives of Medical Science, 15(3), 632-640.

Example 3

Within the t-test app, we provide U.S. News and World Report's College Data that includes measurements for many U.S. Colleges from the 1995 issue of U.S. News and World Report and made available through the ISLR library in R (James et al., 2017). Suppose we aimed to evaluate whether private schools have a higher percentage of new students coming from the top 10% of their high school class than public schools.

Here, we have two samples of observations (private/public) and a discrete attribute (percent of new students in the top 10% of their high school class). We will use the independent samples t-test to evaluate whether the data support the claim that there is a difference in the mean percent of new students coming from the top 10% of their high school class.

First, we load the t-test app. Second, we click 'Sample Data' to load the U.S. News College data. Once the data are loaded, we select the variable (Top10perc) and the samples (private). Be sure to choose the two-sample t-test (independent). The data summary provides our first look at the data.

The first step of conducting the t-test procedure requires us to evaluate the assumptions. When we click 'Assumptions', the data are plotted for interpretation.

This plot shows that the data are skewed for private and public institutions, so the normality condition is not met. One option is to conduct a log or an inverse hyperbolic sine transformation, which, in this case, alleviates the normality concern. Instead, we note that the t-test is robust to departures from normality for large samples, so we may proceed without conducting a transformation. Evaluating whether the observations are representative of the population of interest and independent is more challenging. We won't get into how U.S. News conducts its ratings, but its methodology has been heavily scrutinized in the media. The data may be representative, but we'd have to do more digging; for demonstration purposes, we will proceed assuming that they are.
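To illustrate the transformation option mentioned above, here is a small Python sketch comparing the skewness of made-up, right-skewed percentages before and after log and inverse hyperbolic sine (asinh) transformations. The data and the helper name skewness are hypothetical; they are not the College data or the app's code. One practical difference is that asinh is defined at zero, while the log is not.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

# Made-up right-skewed percentages (not the College data).
raw = [5, 8, 10, 12, 15, 18, 22, 30, 45, 90]

logged = [math.log(v) for v in raw]      # requires every value to be > 0
asinhed = [math.asinh(v) for v in raw]   # also defined when a value is 0

print(skewness(raw), skewness(logged), skewness(asinhed))
print(math.asinh(0.0))                   # asinh handles a zero percentage
```

For values well above 1, asinh(v) behaves like log(2v), so the two transformations reduce right skew in a very similar way.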

The 'Hypothesis Test' tab shows the result of the t-test procedure. As we might expect after viewing the data, there is significant evidence that the population mean percentage of new students coming from the top 10% of their high school class differs across institution types (t=4.8433, p < 0.0001).

We can interpret a t-confidence interval for the population mean difference to provide context. We are 95% confident that the true population mean percentage of new students coming from the top 10% of their high school class (private-public) is between 3.8596 and 9.1326 percentage points. Note that the values the interval covers are larger than 0, indicating that the population mean is larger for private schools than public schools.

The conditions for using the t-test (e.g., normality) were evaluated above and deemed satisfactory, though we did lean on the robust nature of the test. This is reflected in the randomization test, which produces a p-value < 0.0001 and leads us to the same conclusion. Note that this is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

The same is true for the confidence interval. Using the bootstrap confidence interval, we are 95% confident that the true population mean difference of the percentage of new students coming from the top 10% of their high school class (private-public) is between 3.9039 and 9.2263 percentage points. Note that this too is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.2. https://CRAN.R-project.org/package=ISLR

Example 4

Within the t-test app, we provide the well-being data of undergraduate college students collected by Binfet et al. (2021). Suppose the researchers wanted to show that reported loneliness is decreased among undergraduate college students after contact with canines. The researchers can use the t-test to evaluate whether the population mean loneliness is greater before canine contact compared to after. Loneliness is measured using the UCLA Loneliness Scale (Russell, 1996), the average of twenty questions answered on a one to four scale.

Here, we have two samples of observations (before/after) and a discrete attribute (self-reported loneliness). We will use the paired samples t-test to evaluate whether the population mean loneliness is greater before canine contact compared to after.

First, we load the t-test app. Second, we click 'Sample Data' to load Binfet's Canine Data: Contact Group. Once the data are loaded, we select the 'after' variable (lonely2) and the 'before' variable (lonely1). Be sure to choose the Paired two-sample t-test (dependent). The data summary provides our first look at the data.

The first step of conducting the t-test procedure requires us to evaluate the assumptions. When we click 'Assumptions', the data are plotted for interpretation.

This plot shows that the observed differences (after-before) are roughly normal (i.e., symmetric and bell-shaped). Evaluating whether the observations are representative of the population of interest and independent is more challenging. Binfet et al. (2021) recruited undergraduate students from one mid-sized Canadian university who were enrolled in a psychology course offering bonus credit for participating in research studies. While this sample may be representative of undergraduate students at mid-sized Canadian universities who take psychology courses, it may not represent all undergraduate students (e.g., students at non-Canadian institutions or students who don't take psychology courses).

The 'Hypothesis Test' tab shows the result of the t-test procedure. There is significant evidence that the population mean loneliness is greater before canine contact compared to after (t=-5.8159, p < 0.0001).

We can interpret a t-confidence interval for the population mean difference to provide context. We are 95% confident that the true population mean loneliness (after-before) is between -0.1007 and -0.0497. Note that this difference is based on the average of responses on a 1 to 4 scale. That is, the difference is significant but not large. Note that the interval only covers values less than 0, indicating that the population mean loneliness is larger before compared to after.

The bootstrap confidence interval leads to the same conclusion. Using the bootstrap confidence interval, we are 95% confident that the true population mean difference in loneliness (after-before) is between -0.0988 and -0.0497. Note that this is the result of random resampling, and if you run the inference yourself, the result may vary slightly.

Binfet, J. T., Green, F. L., & Draper, Z. A. (2022). The Importance of Client-Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial. Anthrozoös, 35(1), 1-22.

Russell, D. W. (1996). UCLA Loneliness Scale (Version 3): Reliability, validity, and factor structure. Journal of personality assessment, 66(1), 20-40.

References

Attali, Dean. 2020. Shinyjs: Easily Improve the User Experience of Your Shiny Apps in Seconds. https://deanattali.com/shinyjs/.

Attali, Dean, and Tristan Edwards. 2020. Shinyalert: Easily Create Pretty Popup Messages (Modals) in Shiny. https://github.com/daattali/shinyalert https://daattali.com/shiny/shinyalert-demo/.

Bailey, Eric. 2015. shinyBS: Twitter Bootstrap Components for Shiny. https://ebailey78.github.io/shinyBS.

Binfet, John-Tyler, Freya LL Green, and Zakary A Draper. 2021. “The Importance of Client–Canine Contact in Canine-Assisted Interventions: A Randomized Controlled Trial.” Anthrozoös, 1–22.

Bracht, Thilo, Christian Mölleken, Maike Ahrens, Gereon Poschmann, Anders Schlosser, Martin Eisenacher, Kai Stühler, et al. 2016. “Evaluation of the Biomarker Candidate MFAP4 for Non-Invasive Assessment of Hepatic Fibrosis in Hepatitis C Patients.” Journal of Translational Medicine 14 (1): 1–9.

Canty, Angelo, and Brian Ripley. 2021. Boot: Bootstrap Functions (Originally by Angelo Canty for S). https://CRAN.R-project.org/package=boot.

Chang, Winston. 2021. Shinythemes: Themes for Shiny. https://rstudio.github.io/shinythemes/.

Chang, Winston, Joe Cheng, JJ Allaire, Carson Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://shiny.rstudio.com/.

Cheng, Joe, and Carson Sievert. 2021. Shinymeta: Export Domain Logic from Shiny Using Meta-Programming. https://CRAN.R-project.org/package=shinymeta.

Dahl, David B., David Scott, Charles Roosen, Arni Magnusson, and Jonathan Swinton. 2019. Xtable: Export Tables to LaTeX or HTML. http://xtable.r-forge.r-project.org/.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press. http://statwww.epfl.ch/davison/BMA/.

Delacre, Marie, Daniël Lakens, and Christophe Leys. 2017. “Why Psychologists Should by Default Use Welch’s t-Test Instead of Student’s t-Test.” International Review of Social Psychology 30 (1).

Horst, Allison, Alison Hill, and Kristen Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://CRAN.R-project.org/package=palmerpenguins.

James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2017. ISLR: Data for an Introduction to Statistical Learning with Applications in R. http://www.StatLearning.com.

Nijs, Vincent, Forest Fang, Trestle Technology, LLC, and Jeff Allen. 2019. shinyAce: Ace Editor Bindings for Shiny. https://CRAN.R-project.org/package=shinyAce.

Pedersen, Thomas Lin. 2020. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.

Peng, Roger D. 2019. Simpleboot: Simple Bootstrap Routines. https://github.com/rdpeng/simpleboot.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Russell, Daniel W. 1996. “UCLA Loneliness Scale (Version 3): Reliability, Validity, and Factor Structure.” Journal of Personality Assessment 66 (1): 20–40.

Sali, Andras, and Dean Attali. 2020. Shinycssloaders: Add Loading Animations to a Shiny Output While It’s Recalculating. https://github.com/daattali/shinycssloaders.

Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

Xie, Yihui. 2021. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2021. DT: A Wrapper of the JavaScript Library DataTables. https://github.com/rstudio/DT.

Zhang, Xianlin, Hailing Li, Wenxin Kou, Kai Tang, Dongdong Zhao, Jingying Zhang, Jianhui Zhuang, et al. 2019. “Increased Plasma Microfibrillar-Associated Protein 4 Is Associated with Atrial Fibrillation and More Advanced Left Atrial Remodelling.” Archives of Medical Science 15 (3): 632–40.