2b. Sample Size Justification

What to write

Explain how the sample size was decided.

Provide details of any a priori sample size calculation, if done.

Explanation

For any type of experiment, it is crucial to explain how the sample size was determined. For hypothesis-testing experiments, in which inferential statistics are used to estimate the size of the effect and to determine the weight of evidence against the null hypothesis, the sample size needs to be justified to ensure experiments are of an optimal size to test the research question [1,2] (see Item 13. Objectives). Sample sizes that are too small (i.e., underpowered studies) produce inconclusive results, whereas sample sizes that are too large (i.e., overpowered studies) raise ethical issues over unnecessary use of animals and may produce trivial findings that are statistically significant but not biologically relevant [3]. Low power has three effects: first, within the experiment, real effects are more likely to be missed; second, when an effect is detected, this will often be an overestimation of the true effect size [4]; and finally, when low power is combined with publication bias, there is an increase in the false positive rate in the published literature [5]. Consequently, low-powered studies contribute to the poor internal validity of research and risk wasting animals used in inconclusive research [6].

Study design can influence the statistical power of an experiment, and the power calculation used needs to be appropriate for the design implemented. Statistical programmes to help perform a priori sample size calculations exist for a variety of experimental designs and statistical analyses, both freeware (web-based applets and functions in R) and commercial software [7,8]. Choosing the appropriate calculator or algorithm to use depends on the type of outcome measures and independent variables, and the number of groups. Consultation with a statistician is recommended, especially when the experimental design is complex or unusual.

When the experiment tests the effect of an intervention on the mean of a continuous outcome measure, the sample size can be calculated a priori, based on a mathematical relationship between the predefined, biologically relevant effect size, variability estimated from prior data, chosen significance level, power, and sample size (see Section 3 and [9,10] for practical advice). If you have used an a priori sample size calculation, report

  • the analysis method (e.g., two-tailed Student t test with a 0.05 significance threshold)
  • the effect size of interest and a justification explaining why an effect size of that magnitude is relevant
  • the estimate of variability used (e.g., standard deviation) and how it was estimated
  • the power selected.
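As a rough illustration of how these quantities interact, the sketch below uses the common normal-approximation formula for comparing two group means. This example is not part of the guideline, and it is no substitute for dedicated software or statistical advice; the function name and numbers are invented for illustration.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80, two_sided=True):
    """Approximate sample size per group for a two-group comparison of means.

    d      -- standardised effect size (difference in means / common SD)
    alpha  -- significance threshold
    power  -- desired power (1 - beta)

    Uses the normal approximation, which slightly underestimates the
    exact t-based answer; treat the result as a lower bound.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A large effect (d = 0.8) at alpha = 0.05, 80% power, two-sided test:
print(n_per_group(0.8))  # 25 per group
```

The exact t-based calculation gives a slightly larger answer (26 per group for this example), which is why the approximation should be read as a lower bound rather than a final animal count.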

Information used in a power calculation

Sample size calculation is based on a mathematical relationship between the following parameters: effect size, variability, significance level, power, and sample size. Questions to consider are the following:

The primary objective of the experiment—What is the main outcome measure?

The primary outcome measure should be identified in the planning stage of the experiment; it is the outcome of greatest importance, which will answer the main experimental question.

The predefined effect size—What is a biologically relevant effect size?

The effect size is estimated as a biologically relevant change in the primary outcome measure between the groups under study. This can be informed by similar studies and involves scientists exploring what magnitude of effect would generate interest and would be worth taking forward into further work. In preclinical studies, the clinical relevance of the effect should also be taken into consideration.
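One widely used way to express a standardised effect size for a comparison of two means is Cohen's d (the difference in group means divided by the pooled standard deviation). The sketch below uses invented numbers purely for illustration.

```python
from math import sqrt

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardised effect size: difference in group means over the pooled SD."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical example: treated mean 12.0 vs control mean 10.0,
# both groups with SD 2.5 and n = 10:
print(cohens_d(12.0, 2.5, 10, 10.0, 2.5, 10))  # 0.8 -- conventionally "large"
```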

What is the estimate of variability?

Estimates of variability can be obtained

  • From data collected from a preliminary experiment conducted under identical conditions to the planned experiment, e.g., a previous experiment in the same laboratory, testing the same treatment under similar conditions on animals with the same characteristics
  • From the control group in a previous experiment testing a different treatment
  • From a similar experiment reported in the literature

Significance threshold—What risk of a false positive is acceptable?

The significance level or threshold (α) is the probability of obtaining a false positive. If it is set at 0.05, then the risk of obtaining a false positive is 1 in 20 for a single statistical test. However, the threshold or the p-values will need to be adjusted in scenarios of multiple testing (e.g., by using a Bonferroni correction).
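As a concrete illustration of the multiplicity adjustment mentioned above, the snippet below applies a Bonferroni correction to a hypothetical set of p-values (all values are invented for the example).

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for m tests: adjusted threshold and adjusted p-values."""
    m = len(p_values)
    threshold = alpha / m            # each test is judged against alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]  # equivalently, inflate each p
    return threshold, adjusted

# Five hypothetical tests at an overall alpha of 0.05:
threshold, adjusted = bonferroni([0.004, 0.03, 0.12, 0.9, 0.01])
print(threshold)  # alpha / 5, i.e. 0.01
print(adjusted)
```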

Power—What risk of a false negative is acceptable?

For a predefined, biologically meaningful effect size, the power (1 − β) is the probability that the statistical test will detect the effect if it genuinely exists (i.e., true positive result). A target power between 80% and 95% is normally deemed acceptable, which entails a risk of false negative between 5% and 20%.

Directionality—Will you use a one- or two-sided test?

The directionality of a test depends on the distribution of the test statistic for a given analysis. For tests based on t or z distributions (such as t tests), whether the data will be analysed using a one- or two-sided test relates to whether the alternative hypothesis is directional or not. An experiment with a directional (one-sided) alternative hypothesis can be powered and analysed with a one-sided test with the goal of maximising the sensitivity to detect this directional effect. Controversy exists within the statistics community on when it is appropriate to use a one-sided test [11]. The use of a one-sided test requires justification of why a treatment effect is only of interest when it is in a defined direction, and why a large effect in the unexpected direction would be treated no differently from a nonsignificant difference [12]. Following the use of a one-sided test, the investigator cannot then test for the possibility of missing an effect in the untested direction. Choosing a one-tailed test for the sole purpose of attaining statistical significance is not appropriate.

Two-sided tests with a nondirectional alternative hypothesis are much more common and allow researchers to detect the effect of a treatment regardless of its direction.

Note that analyses such as ANOVA and chi-squared are based on asymmetrical distributions (F-distribution and chi-squared distribution) with only one tail. Therefore, these tests do not have a directionality option.
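The practical consequence of directionality can be seen in the critical values themselves. This minimal sketch (illustrative only, assuming a z-based test) shows why a one-sided test is more sensitive in the prespecified direction:

```python
from statistics import NormalDist

z = NormalDist()
alpha = 0.05

# Two-sided test: alpha is split between the two tails of the distribution.
z_two_sided = z.inv_cdf(1 - alpha / 2)  # approx. 1.96
# One-sided test: all of alpha sits in the single hypothesised direction.
z_one_sided = z.inv_cdf(1 - alpha)      # approx. 1.64

print(z_two_sided, z_one_sided)
```

The smaller one-sided critical value is what makes a one-sided test easier to pass, and therefore smaller in required sample size, but only for effects in the prespecified direction; an effect in the opposite direction cannot be declared significant.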

There are several types of studies in which a priori sample size calculations are not appropriate. For example, the number of animals needed for antibody or tissue production is determined by the amount required and the production ability of an individual animal. For studies in which the outcome is the successful generation of a sample or a condition (e.g., the production of transgenic animals), the number of animals is determined by the probability of success of the experimental procedure.

In early feasibility or pilot studies, the number of animals required depends on the purpose of the study. When the objective of the preliminary study is primarily logistic or operational (e.g., to improve procedures and equipment), the number of animals needed is generally small. In such cases, power calculations are not appropriate and sample sizes can be estimated based on operational capacity and constraints [13]. Pilot studies alone are unlikely to provide adequate data on variability for a power calculation for future experiments. Systematic reviews and previous studies are more appropriate sources of information on variability [14].

If no power calculation was used to determine the sample size, state this explicitly and provide the reasoning that was used to decide on the sample size per group. Regardless of whether a power calculation was used or not, when explaining how the sample size was determined take into consideration any anticipated loss of animals or data, for example, due to exclusion criteria established upfront or expected attrition (see Item 3. Inclusion and exclusion criteria).

Examples

‘The sample size calculation was based on postoperative pain numerical rating scale (NRS) scores after administration of buprenorphine (NRS AUC mean = 2.70; noninferiority limit = 0.54; standard deviation = 0.66) as the reference treatment… and also Glasgow Composite Pain Scale (GCPS) scores… using online software (Experimental design assistant; https://eda.nc3rs.org.uk/eda/login/auth). The power of the experiment was set to 80%. A total of 20 dogs per group were considered necessary’ [15].

‘We selected a small sample size because the bioglass prototype was evaluated in vivo for the first time in the present study, and therefore, the initial intention was to gather basic evidence regarding the use of this biomaterial in more complex experimental designs’ [16].

Training

The UK EQUATOR Centre runs training on how to write using reporting guidelines.

Discuss this item

Visit this item’s discussion page to ask questions and give feedback.

References

1.
Vahidy F, Schäbitz WR, Fisher M, Aronowski J. Reporting standards for preclinical studies of stroke therapy. Stroke. 2016;47(10):2435-2438. doi:10.1161/strokeaha.116.013643
2.
Muhlhausler BS, Bloomfield FH, Gillman MW. Whole animal experiments should be more like human randomized controlled trials. PLoS Biology. 2013;11(2):e1001481. doi:10.1371/journal.pbio.1001481
3.
Jennions MD. A survey of the statistical power of research in behavioral ecology and animal behavior. Behavioral Ecology. 2003;14(3):438-445. doi:10.1093/beheco/14.3.438
4.
Lazic SE, Clarke-Williams CJ, Munafò MR. What exactly is “n” in cell culture and animal experiments? PLOS Biology. 2018;16(4):e2005282. doi:10.1371/journal.pbio.2005282
5.
Button KS, Ioannidis JPA, Mokrysz C, et al. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14(5):365-376. doi:10.1038/nrn3475
6.
Würbel H. More than 3Rs: The importance of scientific validity for harm-benefit analysis of animal research. Lab Animal. 2017;46(4):164-166. doi:10.1038/laban.1220
7.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2017.
8.
Charan J, Kantharia ND. How to calculate sample size in animal studies? Journal of Pharmacology and Pharmacotherapeutics. 2013;4(4):303-306. doi:10.4103/0976-500x.119726
9.
Bate ST, Clark RA. The Design and Statistical Analysis of Animal Experiments. Cambridge University Press; 2014. doi:10.1017/cbo9781139344319
10.
Festing MF. On determining sample size in experiments involving laboratory animals. Laboratory Animals. 2018;52(4):341-350. doi:10.1177/0023677217738268
11.
Freedman LS. An analysis of the controversy over classical one-sided tests. Clinical Trials. 2008;5(6):635-640. doi:10.1177/1740774508098590
12.
Ruxton GD, Neuhäuser M. When should we use one‐tailed hypothesis testing? Methods in Ecology and Evolution. 2010;1(2):114-117. doi:10.1111/j.2041-210x.2010.00014.x
13.
Reynolds PS. When power calculations won’t do: Fermi approximation of animal numbers. Lab Animal. 2019;48(9):249-253. doi:10.1038/s41684-019-0370-2
14.
Bate ST. How to decide your sample size when the power calculation is not straightforward. NC3Rs.org.uk. 2018 Aug 1 [cited 2018 Aug 2]. Available from: https://www.nc3rs.org.uk/news/how-decide-your-sample-size-when-power-calculation-not-straightforward
15.
Bustamante R, Daza MA, Canfrán S, et al. Comparison of the postoperative analgesic effects of cimicoxib, buprenorphine and their combination in healthy dogs undergoing ovariohysterectomy. Veterinary Anaesthesia and Analgesia. 2018;45(4):545-556. doi:10.1016/j.vaa.2018.01.003
16.
Spin J. 2015;44.

Citation

For attribution, please cite this work as:
Sert NP du, Hurst V, Ahluwalia A, et al. The ARRIVE reporting guideline for writing animal research articles. The EQUATOR Network guideline dissemination platform. doi:10.1234/equator/1010101

Reporting Guidelines are recommendations to help describe your work clearly

Your research will be used by people from different disciplines and backgrounds for decades to come. Reporting guidelines list the information you should describe so that everyone can understand, replicate, and synthesise your work.

Reporting guidelines do not prescribe how research should be designed or conducted. Rather, they help authors transparently describe what they did, why they did it, and what they found.

Reporting guidelines make writing research easier, and transparent research leads to better patient outcomes.

Easier writing

Following guidance makes writing easier and quicker.

Smoother publishing

Many journals require completed reporting checklists at submission.

Maximum impact

From Nobel Prizes to null results, articles have more impact when everyone can use them.

Who reads research?

Your work will be read by different people, for different reasons, around the world, and for decades to come. Reporting guidelines help you consider all of your potential audiences. For example, your research may be read by researchers from different fields, by clinicians, patients, evidence synthesisers, peer reviewers, or editors. Your readers will need information to understand, replicate, apply, appraise, synthesise, and use your work.

Cohort studies

A cohort study is an observational study in which a group of people with a particular exposure (e.g. a putative risk factor or protective factor) and a group of people without this exposure are followed over time. The outcomes of the people in the exposed group are compared to the outcomes of the people in the unexposed group to see if the exposure is associated with particular outcomes (e.g. getting cancer or length of life).

Source.

Case-control studies

A case-control study is a research method used in healthcare to investigate potential risk factors for a specific disease. It involves comparing individuals who have been diagnosed with the disease (cases) to those who have not (controls). By analysing the differences between the two groups, researchers can identify factors that may contribute to the development of the disease.

An example would be when researchers conducted a case-control study examining whether exposure to diesel exhaust particles increases the risk of respiratory disease in underground miners. Cases included miners diagnosed with respiratory disease, while controls were miners without respiratory disease. Participants' past occupational exposures to diesel exhaust particles were evaluated to compare exposure rates between cases and controls.

Source.

Cross-sectional studies

A cross-sectional study (also sometimes called a "cross-sectional survey") serves as an observational tool, where researchers capture data from a cohort of participants at a single point in time. This approach provides a 'snapshot': a brief glimpse into the characteristics or outcomes prevalent within a designated population at that moment. The primary aim here is not to track changes or developments over an extended period but to assess and quantify the current situation regarding specific variables or conditions. Such a methodology is instrumental in identifying patterns or correlations among various factors within the population, providing a basis for further, more detailed investigation.

Source

Systematic reviews

A systematic review is a comprehensive approach designed to identify, evaluate, and synthesise all available evidence relevant to a specific research question. In essence, it collects all possible studies related to a given topic and design, and reviews and analyses their results.

The process involves a highly sensitive search strategy to ensure that as much pertinent information as possible is gathered. Once collected, this evidence is often critically appraised to assess its quality and relevance, ensuring that conclusions drawn are based on robust data. Systematic reviews often involve defining inclusion and exclusion criteria, which help to focus the analysis on the most relevant studies, ultimately synthesising the findings into a coherent narrative or statistical synthesis. Some systematic reviews will include a meta-analysis.

Source

Systematic review protocols

TODO

Meta analyses of Observational Studies

TODO

Randomised Trials

A randomised controlled trial (RCT) is a trial in which participants are randomly assigned to one of two or more groups: the experimental group or groups receive the intervention or interventions being tested; the comparison group (control group) receive usual care or no treatment or a placebo. The groups are then followed up to see if there are any differences between the results. This helps in assessing the effectiveness of the intervention.

Source

Randomised Trial Protocols

TODO

Qualitative research

Research that aims to gather and analyse non-numerical (descriptive) data in order to gain an understanding of individuals' social reality, including understanding their attitudes, beliefs, and motivation. This type of research typically involves in-depth interviews, focus groups, or field observations in order to collect data that is rich in detail and context. Qualitative research is often used to explore complex phenomena or to gain insight into people's experiences and perspectives on a particular topic. It is particularly useful when researchers want to understand the meaning that people attach to their experiences or when they want to uncover the underlying reasons for people's behavior. Qualitative methods include ethnography, grounded theory, discourse analysis, and interpretative phenomenological analysis.

Source

Case Reports

TODO

Diagnostic Test Accuracy Studies

Diagnostic accuracy studies focus on estimating the ability of the test(s) to correctly identify subjects with a predefined target condition, or the condition of interest (sensitivity) as well as to clearly identify those without the condition (specificity).
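The two quantities named above can be computed directly from a 2×2 table of test result against true condition. A minimal sketch with invented counts:

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 table of test vs true condition."""
    sensitivity = tp / (tp + fn)  # true positives among those with the condition
    specificity = tn / (tn + fp)  # true negatives among those without it
    return sensitivity, specificity

# Hypothetical counts: 90 of 100 diseased subjects test positive;
# 160 of 200 healthy subjects test negative.
sens, spec = diagnostic_accuracy(tp=90, fp=40, fn=10, tn=160)
print(sens, spec)  # 0.9 0.8
```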

Prediction Models

Prediction model research is used to test the accuracy of a model or test in estimating an outcome value or risk. Most models estimate the probability of the presence of a particular health condition (diagnostic) or whether a particular outcome will occur in the future (prognostic). Prediction models are used to support clinical decision making, such as whether to refer patients for further testing, monitor disease deterioration or treatment effects, or initiate treatment or lifestyle changes. Examples of well known prediction models include EuroSCORE II for cardiac surgery, the Gail model for breast cancer, the Framingham risk score for cardiovascular disease, IMPACT for traumatic brain injury, and FRAX for osteoporotic and hip fractures.

Source

Animal Research

TODO

Quality Improvement in Healthcare

Quality improvement research is about finding out how to improve and make changes in the most effective way. It is about systematically and rigorously exploring "what works" to improve quality in healthcare and the best ways to measure and disseminate this to ensure positive change. Most quality improvement effectiveness research is conducted in hospital settings, is focused on multiple quality improvement interventions, and uses process measures as outcomes. There is a great deal of variation in the research designs used to examine quality improvement effectiveness.

Source

Economic Evaluations in Healthcare

TODO

Meta Analyses

A meta-analysis is a statistical technique that amalgamates data from multiple studies to yield a single estimate of the effect size. This approach enhances precision and offers a more comprehensive understanding by integrating quantitative findings. Central to a meta-analysis is the evaluation of heterogeneity, which examines variations in study outcomes to ensure that differences in populations, interventions, or methodologies do not skew results. Techniques such as meta-regression or subgroup analysis are frequently employed to explore how various factors might influence the outcomes. This method is particularly effective when aiming to quantify the effect size, odds ratio, or risk ratio, providing a clearer numerical estimate that can significantly inform clinical or policy decisions.
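The simplest form of this statistical synthesis is the fixed-effect (inverse-variance) pooled estimate, in which each study is weighted by the inverse of its variance. A minimal sketch, with study values invented for illustration:

```python
from math import sqrt

def fixed_effect_pool(estimates, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    se = sqrt(1 / sum(weights))  # more precise studies shrink the pooled SE
    return pooled, se

# Three hypothetical study effect sizes with their variances:
pooled, se = fixed_effect_pool([0.30, 0.50, 0.40], [0.04, 0.02, 0.08])
print(pooled, se)  # the pooled estimate sits closest to the most precise study
```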

How Meta-analyses and Systematic Reviews Work Together

Systematic reviews and meta-analyses function together, each complementing the other to provide a more robust understanding of research evidence. A systematic review meticulously gathers and evaluates all pertinent studies, establishing a solid foundation of qualitative and quantitative data. Within this framework, if the collected data exhibit sufficient homogeneity, a meta-analysis can be performed. This statistical synthesis allows for the integration of quantitative results from individual studies, producing a unified estimate of effect size. Techniques such as meta-regression or subgroup analysis may further refine these findings, elucidating how different variables impact the overall outcome. By combining these methodologies, researchers can achieve both a comprehensive narrative synthesis and a precise quantitative measure, enhancing the reliability and applicability of their conclusions. This integrated approach ensures that the findings are not only well-rounded but also statistically robust, providing greater confidence in the evidence base.

Why Don't All Systematic Reviews Use a Meta-Analysis?

Systematic reviews do not always have meta-analyses, due to variations in the data. For a meta-analysis to be viable, the data from different studies must be sufficiently similar, or homogeneous, in terms of design, population, and interventions. When the data shows significant heterogeneity, meaning there are considerable differences among the studies, combining them could lead to skewed or misleading conclusions. Furthermore, the quality of the included studies is critical; if the studies are of low methodological quality, merging their results could obscure true effects rather than explain them.
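Heterogeneity is commonly quantified with Cochran's Q and the I² statistic (the percentage of variation across studies attributable to heterogeneity rather than chance). A minimal sketch with invented study data:

```python
def heterogeneity(estimates, variances):
    """Cochran's Q and the I-squared statistic (as a percentage)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Three hypothetical, closely agreeing studies:
q, i2 = heterogeneity([0.30, 0.50, 0.40], [0.04, 0.02, 0.08])
print(q, i2)  # Q below its degrees of freedom, so I-squared is 0%
```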

Protocol

A plan or set of steps that defines how something will be done. Before carrying out a research study, for example, the research protocol sets out what question is to be answered and how information will be collected and analysed.

Source

Animal research

When ARRIVE refers to animal research it is referring to in vivo animal research. This is the use of non-human animals, sometimes known as model organisms, in experiments that seek to control the variables that affect the behavior or biological system under study. This approach can be contrasted with field studies in which animals are observed in their natural environments or habitats. Animal research varies on a continuum from pure research, focusing on developing fundamental knowledge of an organism, to applied research, which may focus on answering some questions of great practical importance, such as finding a cure for a disease. Source

The ARRIVE guidelines apply to all areas of bioscience research involving living animals. That includes mammalian species as well as model organisms such as Drosophila or Caenorhabditis elegans. Each item is equally relevant to manuscripts centred around a single animal study and broader-scope manuscripts describing in vivo observations along with other types of experiments. The exact type of detail to report, however, might vary between species and experimental setup; this is acknowledged in the guidance provided for each item. Source

Bias

The over- or underestimation of the true effect of an intervention. Bias is caused by inadequacies in the design, conduct, or analysis of an experiment, resulting in the introduction of error.

Source

Descriptive and inferential statistics

Descriptive statistics are used to summarise the data. They generally include a measure of central tendency (e.g., mean or median) and a measure of spread (e.g., standard deviation or range). Inferential statistics are used to make generalisations about the population from which the samples are drawn. Hypothesis tests such as ANOVA, Mann-Whitney, or t tests are examples of inferential statistics.

Source
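For example, the descriptive summaries named above can be computed with Python's standard library (the data values below are hypothetical):

```python
import statistics

data = [4.2, 5.1, 4.8, 6.0, 5.5, 4.9]  # e.g. an outcome measure in one group

# Measures of central tendency:
print(statistics.mean(data))
print(statistics.median(data))
# Measure of spread (sample standard deviation):
print(statistics.stdev(data))
```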

Effect size

Quantitative measure of differences between groups, or strength of relationships between variables.

Source

Experimental unit

Biological entity subjected to an intervention independently of all other units, such that it is possible to assign any two experimental units to different treatment groups. Sometimes known as unit of randomisation.

Source

External validity

Extent to which the results of a given study enable application or generalisation to other studies, study conditions, animal strains/species, or humans.

Source

False negative

Statistically nonsignificant result obtained when the alternative hypothesis (H₁) is true. In statistics, it is known as the type II error.

Source

False positive

Statistically significant result obtained when the null hypothesis (H₀) is true. In statistics, it is known as the type I error.

Source

Independent variable

Variable that either the researcher manipulates (treatment, condition, time) or is a property of the sample (sex) or a technical feature (batch, cage, sample collection) that can potentially affect the outcome measure. Independent variables can be scientifically interesting, or nuisance variables. Also known as predictor variable.

Source

Internal validity

Extent to which the results of a given study can be attributed to the effects of the experimental intervention, rather than some other, unknown factor(s) (e.g., inadequacies in the design, conduct, or analysis of the study introducing bias).

Source

Nuisance variable

Variables that are not of primary interest but should be considered in the experimental design or the analysis because they may affect the outcome measure and add variability. They become confounders if, in addition, they are correlated with an independent variable of interest, as this introduces bias. Nuisance variables should be considered in the design of the experiment (to prevent them from becoming confounders) and in the analysis (to account for the variability and sometimes to reduce bias). For example, nuisance variables can be used as blocking factors or covariates.

Source

Null and alternative hypotheses

The null hypothesis (H₀) is that there is no effect, such as a difference between groups or an association between variables. The alternative hypothesis (H₁) postulates that an effect exists.

Source

Outcome measure

Any variable recorded during a study to assess the effects of a treatment or experimental intervention. Also known as dependent variable, response variable.

Source

Power

For a predefined, biologically meaningful effect size, the probability that the statistical test will detect the effect if it exists (i.e., the null hypothesis is rejected correctly).

Source

Sample size

Number of experimental units per group, also referred to as n.

Source