What Test Should I Use?

Tareq Kheirbek, MD ScM FACS

(I will work on posting more discussions on individual tests, but for now please use the following table to assist you with your choice of statistical test of association)

When choosing the correct statistical test you should answer the following questions:

  1. Is my outcome categorical or continuous?
    • If categorical, how many categories?
    • Do any of the categories have sparse data?
    • If continuous, is it normally distributed?
  2. Are my exposure categories independent or paired?
  3. Am I studying time to event?

For the following, we will assume your exposure categories are independent, i.e., two distinct groups of patients. At the end, I will present alternative tests to use if you have paired groups (for example, the outcome before and after an intervention in the same patients).
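The questions above amount to a small decision tree. As a rough sketch only, it can be written as a lookup function; the function name `choose_test` and its arguments are hypothetical, and the rules are simply the ones stated in this post, not a complete guide.

```python
def choose_test(outcome, paired=False, normal=True, sparse=False, groups=2):
    """Return the test this post suggests for a given study design.

    outcome: "categorical", "continuous", or "time-to-event"
    paired:  True if the exposure groups are paired/matched
    normal:  True if a continuous outcome is roughly normally distributed
    sparse:  True if the table has sparse cells (categorical outcomes only)
    groups:  number of exposure groups
    """
    if outcome == "time-to-event":
        return "survival analysis"
    if outcome == "categorical":
        if paired:
            return "McNemar test"
        return "Fisher's exact test" if sparse else "chi-square test"
    if outcome == "continuous":
        if groups > 2:
            return "repeated-measures ANOVA" if paired else "ANOVA"
        if paired:
            return "paired t-test" if normal else "signed-rank test"
        return "Student's t-test" if normal else "Wilcoxon rank-sum test"
    raise ValueError(f"unknown outcome type: {outcome!r}")


# Binary outcome, two independent groups, sparse cells
print(choose_test("categorical", sparse=True))
# Skewed continuous outcome, two independent groups
print(choose_test("continuous", normal=False))
```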

Categorical Outcome:

The majority of studies have categorical outcomes, specifically binary ones (mortality: yes/no, readmission: yes/no, infection: yes/no). When you have two categories of exposure (yes/no) and two categories of outcome (yes/no), you can generate a 2×2 table:

              Outcome +    Outcome –
Exposure +    a            b
Exposure –    c            d

The next question is whether you have an adequate number of patients in each cell. If you do, the chi-square test is the one to use. It tests the null hypothesis that the rows and columns of the table are independent. If the expected count in any cell is less than 5, you have sparse data and should use an alternative test: Fisher’s exact test.

STATA CODE:

. tab expvar outcomevar, row col chi2
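To see what the chi-square test computes, the arithmetic for a 2×2 table can be checked by hand. The following is a minimal Python sketch (standard library only), not Stata output: the function name `chi_square_2x2` and the cell counts are made up for illustration, and it computes the Pearson statistic without a continuity correction.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom) for a 2x2 table.

    Returns the statistic, the p value, and the expected counts.
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    # Expected count under independence: row total * column total / n
    expected = [[r * k / n for k in col_totals] for r in row_totals]
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # With 1 df, chi2 is a squared standard normal, so the upper tail is erfc
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, expected


# Illustrative counts: 30/100 exposed and 15/100 unexposed have the outcome
chi2, p_value, expected = chi_square_2x2(30, 70, 15, 85)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
print("smallest expected count:", min(min(row) for row in expected))
```

The expected counts returned here are the ones to inspect when deciding whether the data are too sparse for the chi-square test.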

For Fisher’s exact test:

. tab expvar outcomevar, row col chi2 exact
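Fisher’s exact test avoids the large-sample approximation by summing hypergeometric probabilities directly. As a sketch under stated assumptions, the Python below computes a two-sided p value as the total probability of all tables (with the same margins) that are no more likely than the observed one; the function name and the tiny table are invented for illustration.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p value for a 2x2 table."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties)
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Illustrative sparse table: 3/4 exposed vs 1/4 unexposed with the outcome
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"two-sided p = {p_value:.4f}")
```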

The above tests are used for hypothesis testing and produce a p value. If you want to measure the strength of the association between exposure and outcome (odds ratio), use logistic regression (simple or multiple):

. logistic outcomevar expvar var2 var3 var4

where (var2 var3 var4) are other variables that you are controlling for in the logistic model.
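With a single binary exposure and no other covariates, the odds ratio from a simple logistic regression equals the cross-product ratio ad/bc of the 2×2 table. The Python sketch below computes that unadjusted odds ratio with an approximate 95% confidence interval on the log scale; the function name and counts are illustrative assumptions, and it does not reproduce an adjusted (multiple) model.

```python
import math


def odds_ratio_2x2(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with an approximate confidence interval.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR): sqrt of the summed reciprocal cell counts
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts: 30/100 exposed and 15/100 unexposed have the outcome
odds_ratio, lower, upper = odds_ratio_2x2(30, 70, 15, 85)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```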

Continuous Outcome:

If you are comparing a continuous outcome between two groups (BP, age, ISS, weight, etc.), you first need to assess the normality of the outcome. An example of a non-normally distributed variable is time to OR: most observations cluster toward early intervention, with a few late ones, so the distribution is skewed to the right. With large samples (in the hundreds and above), parametric tests are robust to departures from normality. To check normality in STATA, run the following code:

. sum var, detail

The following is an example of describing age variable in one of my datasets:

In the results above, look for “skewness”. A perfectly symmetric distribution has a skewness of 0; you want this to be between -3 and +3 to indicate a normal distribution, or one normal enough to run a parametric test.
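For intuition about what the skewness figure measures, here is a minimal Python sketch using the simple moment-based definition (third central moment divided by the second central moment raised to the power 1.5). The function name and the time-to-OR values are made up for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Symmetric data: skewness is 0
symmetric = [1, 2, 3, 4, 5]
# Illustrative time to OR in minutes: mostly early, a few very late (right skew)
time_to_or = [10, 12, 15, 15, 20, 25, 30, 45, 90, 240]

print(f"symmetric:  {skewness(symmetric):.2f}")
print(f"time to OR: {skewness(time_to_or):.2f}")
```

A long right tail pulls the value above 0, and a long left tail pushes it below 0.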

Another useful command for describing a continuous variable and comparing basic statistics between two groups (descriptive only, not for hypothesis testing) is tabstat. Example:

p50=median, sd=standard deviation

If the outcome is normally distributed, the test to use is Student’s t-test, which compares the means of the two groups. If it is not, use the Wilcoxon rank-sum test, which compares the distributions of the two groups, or the median test, which tests the hypothesis that the medians of the two groups are equal.

. ttest outcomevar, by(expvar)

. ranksum outcomevar, by(expvar)

. median outcomevar, by(expvar)

Examples:

Student t test:
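As a hand-check of what the two-sample t-test computes, the Python sketch below calculates the equal-variance (pooled) t statistic and its degrees of freedom. The function name and the blood pressure values are invented for illustration; the p value itself would come from the t distribution, which is left to the statistics package.

```python
import math
from statistics import mean, variance


def pooled_t(group1, group2):
    """Equal-variance two-sample t statistic and degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / se_diff
    return t, df


# Illustrative systolic BP in two independent groups
exposed = [120, 125, 130, 135, 140]
unexposed = [110, 115, 120, 125, 130]

t, df = pooled_t(exposed, unexposed)
print(f"t = {t:.2f} with {df} degrees of freedom")
```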

For nonparametric testing (Rank Sum):
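The rank-sum test replaces the observed values with their ranks in the combined sample and asks whether one group's ranks are unusually low or high. The sketch below is a simplified Python illustration with a normal approximation and no tie correction in the variance; the function names and the data are assumptions for the example, so results on tied data may differ slightly from a statistics package.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group1, group2):
    """Rank sum of group1, z score, and two-sided p (normal approximation)."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    w = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (w - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w, z, p_value


# Illustrative time to OR (minutes) in two independent groups
group_a = [10, 12, 15, 20, 25]
group_b = [30, 45, 60, 90, 240]

w, z, p_value = rank_sum_test(group_a, group_b)
print(f"rank sum = {w}, z = {z:.2f}, p = {p_value:.4f}")
```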

To measure the association, a linear regression can be used for continuous data:

. regress outcomevar expvar var2 var3 var4
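With a single binary exposure coded 0/1 and no other covariates, the regression coefficient is just the difference in mean outcome between the two groups. The following Python sketch shows this with ordinary least squares for one predictor; the function name and the data are illustrative assumptions, and it does not cover the adjusted model with additional variables.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data: exposure coded 0/1, outcome is systolic BP
exposure = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
outcome = [110, 115, 120, 125, 130, 120, 125, 130, 135, 140]

intercept, slope = simple_linear_regression(exposure, outcome)
# intercept = mean outcome in the unexposed group
# slope     = mean difference between exposed and unexposed
print(f"intercept = {intercept:.1f}, slope = {slope:.1f}")
```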

Notes and Alternative Tests:

  • If exposure groups are paired, matched, or dependent (treated as equivalent here), which is rare in most study designs, then the following tests should be used:
    • For categorical outcomes: McNemar test
    • For continuous outcomes: paired t-test for normally distributed variables and Wilcoxon signed-rank test for non-normally distributed variables.
  • If there are more than two exposure groups and a continuous outcome variable, use ANOVA for independent groups and repeated-measures ANOVA for paired groups.
  • If you have a continuous exposure and a continuous outcome, you are testing the correlation between the two variables: use the Pearson correlation coefficient for normally distributed variables and Spearman’s rank correlation coefficient for non-normally distributed variables.
  • Time to event or survival analysis will be discussed separately.
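To illustrate the difference between the two correlation coefficients mentioned in the notes, the Python sketch below computes Pearson's r on the raw values and Spearman's coefficient as Pearson's r on the ranks. The function names and the data are made up for the example; the outcome rises steadily with exposure but not in a straight line, with one extreme value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative data: monotonic but non-linear, with one extreme outcome
exposure = [1, 2, 3, 4, 5]
outcome = [1, 4, 9, 16, 100]

print(f"Pearson  r   = {pearson(exposure, outcome):.3f}")
print(f"Spearman rho = {spearman(exposure, outcome):.3f}")
```

Because Spearman's coefficient depends only on the ordering, the extreme value lowers Pearson's r but leaves the rank correlation unchanged.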