Table 1 for Simple and Stratified Descriptive Statistics

Produces a descriptive table, optionally stratified by a categorical variable, reporting means/frequencies and standard deviations/percentages. The table is formatted for easy transfer to an academic article or report. It can be used within a piping framework [see library(magrittr)].

table1(
  .data,
  ...,
  splitby = NULL,
  FUN = NULL,
  FUN2 = NULL,
  total = FALSE,
  second = NULL,
  row_wise = FALSE,
  test = FALSE,
  param = TRUE,
  header_labels = NULL,
  type = "pvalues",
  output = "text",
  rounding_perc = 1,
  digits = 1,
  var_names = NULL,
  format_number = FALSE,
  NAkeep = NULL,
  na.rm = TRUE,
  booktabs = TRUE,
  caption = NULL,
  align = NULL,
  float = "ht",
  export = NULL,
  label = NULL
)

Arguments

.data: the data.frame that is to be summarized
...: variables in the data set that are to be summarized; unquoted names separated by commas (e.g. age, gender, race) or indices. Indices must be given as a single vector (e.g. c(1:5, 8, 9:20) instead of 1:5, 8, 9:20). Indices and unquoted names CANNOT currently be mixed in the same call. Any empty rows (where every selected variable is NA) are removed so that the n count is accurate.
splitby: the categorical variable to stratify by, either in formula form (splitby = ~gender) or quoted (splitby = "gender"). Alternatively, dplyr::group_by(...) can be used within a pipe; when the data object is a grouped data frame from dplyr::group_by(...), its grouping is used by default.
FUN: the function to be applied to summarize the numeric data; default is to report the means and standard deviations
FUN2: a secondary function to be applied to summarize the numeric data; default is to report the medians and 25% and 75% quartiles
total: whether a total (not stratified with the splitby or group_by()) should also be reported in the table
second: a vector or list of quoted continuous variables for which the FUN2 should be applied
row_wise: how to calculate percentages for factor variables when splitby != NULL: if FALSE calculates percentages by variable within groups; if TRUE calculates percentages across groups for one level of the factor variable.
test: logical; if set to TRUE then the appropriate bivariate tests of significance are performed if splitby has more than 1 level. A message is printed when the variances of the continuous variables being tested do not meet the assumption of Homogeneity of Variance (using Breusch-Pagan Test of Heteroskedasticity) and, therefore, the argument `var.equal = FALSE` is used in the test.
param: logical; if set to TRUE then the appropriate parametric bivariate tests of significance are performed (if `test = TRUE`). For continuous variables, it is a t-test or ANOVA (depending on the number of levels of the group). If set to FALSE, the Kruskal-Wallis Rank Sum Test is performed for the continuous variables. Either way, the chi-square test of independence is performed for categorical variables.
header_labels: a character vector that renames the header labels (e.g., the blank above the variables, the p-value label, and test value label).
type: what is displayed in the table; a string or a vector of strings. Two kinds of options can be given: 1. if test = TRUE, one of "pvalues", "full", or "stars"; and 2. "simple" and/or "condense". These are discussed in more depth in the details section below.
output: how the table is output; can be "text" or "text2" for regular console output or any of kable()'s options from knitr (e.g., "latex", "markdown", "pandoc"). A new option, 'latex2', although more limited, allows the variable name to show and has an overall better appearance.
rounding_perc: the number of digits after the decimal for percentages; default is 1
digits: the number of significant digits for the numerical variables (if using default functions); default is 1.
var_names: custom variable names to be printed in the table. Variable names can be applied directly in the list of variables.
format_number: default is FALSE; if TRUE, then the numbers are formatted with commas (e.g., 20,000 instead of 20000)
NAkeep: when set to TRUE it also shows how many missing values are in the data for each categorical variable being summarized (deprecated; use na.rm)
na.rm: when set to FALSE it also shows how many missing values are in the data for each categorical variable being summarized
booktabs: when output != "text"; option is passed to knitr::kable
caption: when output != "text"; option is passed to knitr::kable
align: when output != "text"; option is passed to knitr::kable
float: the float applied to the table in LaTeX when output is "latex2"; default is "ht".
export: character; when given, it exports the table to a CSV file in a folder named "table1" in the working directory, using the given string as the file name (e.g., "myfile" will save to "myfile.csv")
label: for output == "latex2", this provides a table reference label for LaTeX
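The two settings of row_wise can be pictured with base R alone. The sketch below is only an illustration of the two percentage directions described above, using hypothetical toy data; it does not call table1() and is not the package's internal code.

```r
# Hypothetical toy data: a factor z and a grouping factor a (unequal group sizes)
z <- factor(c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1))
a <- factor(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2))

counts <- table(z, a)  # rows = levels of z, columns = groups of a

# Analogue of row_wise = FALSE: percentages within each group
# (each column sums to 100)
within_group <- round(100 * prop.table(counts, margin = 2), 1)

# Analogue of row_wise = TRUE: percentages across groups for each level of z
# (each row sums to 100)
across_groups <- round(100 * prop.table(counts, margin = 1), 1)

print(within_group)
print(across_groups)
```

With these data, level 0 of z makes up 66.7% of group 1 when computed within groups, while 80% of all level-0 observations fall in group 1 when computed across groups.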

Value

A table with the number of observations, means/frequencies and standard deviations/percentages is returned. The object is a table1 class object with a print method. It can be printed in LaTeX form.

Details

In defining type: 1. the options are "pvalues", which displays the p-values of the tests; "full", which also shows the test statistics; or "stars", which only displays stars to highlight significance (*** p < .001, ** p < .01, * p < .05); and 2. with "simple", only percentages are shown for categorical variables, and with "condense", the means and SDs of continuous variables are placed on the same line as the variable name and dichotomous variables only show counts and percentages for the reference category.
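The test-related display options can be sketched with base R tests. This is a rough illustration under assumptions: the data are simulated, p_to_stars() is a hypothetical helper (not a function in the package), and the tests are run directly with stats functions rather than through table1().

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

g <- factor(rep(c("1", "2"), each = 100))          # two-level grouping variable
y <- rnorm(200) + 0.5 * (g == "2")                 # continuous variable, second group shifted
z <- factor(sample(c(0, 1), 200, replace = TRUE))  # categorical variable

# Hypothetical helper: map p-values to the star cut-offs listed above
p_to_stars <- function(p) {
  ifelse(p < 0.001, "***",
         ifelse(p < 0.01, "**",
                ifelse(p < 0.05, "*", "")))
}

param_test    <- t.test(y ~ g, var.equal = TRUE)  # parametric test for two groups
nonparam_test <- kruskal.test(y ~ g)              # rank-based alternative
cat_test      <- chisq.test(table(z, g))          # test of independence for a factor

p_values <- c(param_test$p.value, nonparam_test$p.value, cat_test$p.value)

# "pvalues" corresponds to the p_value column, "full" adds the statistic,
# and "stars" keeps only the stars column
summary_tbl <- data.frame(
  test      = c("t-test", "Kruskal-Wallis", "chi-square"),
  statistic = round(unname(c(param_test$statistic,
                             nonparam_test$statistic,
                             cat_test$statistic)), 2),
  p_value   = round(p_values, 3),
  stars     = p_to_stars(p_values)
)
print(summary_tbl)
```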

Examples


## Fictitious Data ##
library(furniture)
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union

x  <- runif(1000)
y  <- rnorm(1000)
z  <- factor(sample(c(0,1), 1000, replace=TRUE))
a  <- factor(sample(c(1,2), 1000, replace=TRUE))
df <- data.frame(x, y, z, a)

## Simple
table1(df, x, y, z, a)
#> 
#> 
#> ────────────────────────
#>       Mean/Count (SD/%)
#>       n = 1000         
#>  x                     
#>       0.5 (0.3)        
#>  y                     
#>       0.0 (1.0)        
#>  z                     
#>     0 503 (50.3%)      
#>     1 497 (49.7%)      
#>  a                     
#>     1 483 (48.3%)      
#>     2 517 (51.7%)      
#> ────────────────────────

## Stratified
## the calls below produce the same table (as do the piped versions further down)
table1(df, x, y, z,
       splitby = ~ a)
#> 
#> 
#> ──────────────────────────────
#>                a 
#>       1           2          
#>       n = 483     n = 517    
#>  x                           
#>       0.5 (0.3)   0.5 (0.3)  
#>  y                           
#>       0.0 (1.0)   -0.0 (1.0) 
#>  z                           
#>     0 236 (48.9%) 267 (51.6%)
#>     1 247 (51.1%) 250 (48.4%)
#> ──────────────────────────────
table1(df, x, y, z,
       splitby = "a")
#> 
#> 
#> ──────────────────────────────
#>                a 
#>       1           2          
#>       n = 483     n = 517    
#>  x                           
#>       0.5 (0.3)   0.5 (0.3)  
#>  y                           
#>       0.0 (1.0)   -0.0 (1.0) 
#>  z                           
#>     0 236 (48.9%) 267 (51.6%)
#>     1 247 (51.1%) 250 (48.4%)
#> ──────────────────────────────

## With Piping
df %>%
  table1(x, y, z, 
         splitby = ~a) 
#> 
#> 
#> ──────────────────────────────
#>                a 
#>       1           2          
#>       n = 483     n = 517    
#>  x                           
#>       0.5 (0.3)   0.5 (0.3)  
#>  y                           
#>       0.0 (1.0)   -0.0 (1.0) 
#>  z                           
#>     0 236 (48.9%) 267 (51.6%)
#>     1 247 (51.1%) 250 (48.4%)
#> ──────────────────────────────
         
df %>%
  group_by(a) %>%
  table1(x, y, z)
#> Using dplyr::group_by() groups: a
#> 
#> 
#> ──────────────────────────────
#>                a 
#>       1           2          
#>       n = 483     n = 517    
#>  x                           
#>       0.5 (0.3)   0.5 (0.3)  
#>  y                           
#>       0.0 (1.0)   -0.0 (1.0) 
#>  z                           
#>     0 236 (48.9%) 267 (51.6%)
#>     1 247 (51.1%) 250 (48.4%)
#> ──────────────────────────────

## Adjust variables within function and assign name
table1(df, 
       x2 = ifelse(x > 0, 1, 0), z = z)
#> 
#> 
#> ────────────────────────
#>       Mean/Count (SD/%)
#>       n = 1000         
#>  x2                    
#>       1.0 (0.0)        
#>  z                     
#>     0 503 (50.3%)      
#>     1 497 (49.7%)      
#> ────────────────────────
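
The examples above do not exercise the test, type, or total arguments. The following is a minimal sketch of how they might be combined, assuming the furniture package is installed; df2 is a hypothetical data set built the same way as df, and no output is shown because it depends on the random draw.

```r
library(furniture)

set.seed(42)  # arbitrary seed; any value works
df2 <- data.frame(
  x = runif(200),
  y = rnorm(200),
  z = factor(sample(c(0, 1), 200, replace = TRUE)),
  a = factor(sample(c(1, 2), 200, replace = TRUE))
)

## Bivariate tests, showing test statistics as well as p-values
tab_full <- table1(df2, x, y, z,
                   splitby = ~a,
                   test = TRUE,
                   type = "full")
tab_full

## Same stratification, with an unstratified total also reported
tab_total <- table1(df2, x, y, z,
                    splitby = ~a,
                    total = TRUE)
tab_total
```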