cb_create() builds an object of class "li_codebook" from a dataset and optional
metadata. The resulting object can be used to write an Excel workbook with variable
and data summaries (using cb_write()), extract processed data (cb_get_data()),
or generate dataset summaries (cb_summarize_numeric(), cb_summarize_categorical(),
cb_summarize_text()).
Usage
cb_create(
  data,
  metadata = NULL,
  ...,
  .name = name,
  .var_label = label,
  .val_labels = val_labels,
  .user_missing = NULL,
  .split_var_labels = NULL,
  .val_labs_sep1 = NULL,
  .val_labs_sep2 = NULL,
  .options = cb_create_options()
)
Arguments
- data
A data frame.
- metadata
A data frame containing metadata, such as variable labels and value labels.
- ...
Additional columns from metadata to preserve in the final codebook. New names can be assigned by passing named arguments. Columns for variable name, form, variable label, and value labels are included by default.
- .name, .var_label, .val_labels
Columns in metadata containing variable name, variable label, and value labels, respectively. If metadata is provided, .name must be specified. .var_label and .val_labels may be set to NULL to omit them.
- .user_missing
A formula or list of formulas specifying user missing values. Formulas should specify variables on the left-hand side (as variable names or tidyselect expressions), and missing values on the right-hand side. If the left-hand side is omitted, it defaults to tidyselect::everything(). See "Specifying user missing values" below for examples.
- .split_var_labels
A tidyselect expression or list of tidyselect expressions, indicating (sets of) variable labels with a common stem that should be extracted into a separate column.
- .val_labs_sep1, .val_labs_sep2
Regex patterns separating value labels in metadata. .val_labs_sep1 separates values from labels, and .val_labs_sep2 separates value/label pairs from one another. For example, if value labels are in the format "1, First label|2, Second label", set .val_labs_sep1 to "," and .val_labs_sep2 to "\\|".
- .options
Additional options to use for codebook creation. Must be the result of a call to cb_create_options(). See that function's help page for available options.
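To make the roles of the two separators concrete, the following sketch splits the example string from the .val_labs_sep1 / .val_labs_sep2 description using base R only. It illustrates what each pattern matches and is not the package's own parsing code; the object names (raw, pairs, parts, val_labels) are placeholders chosen for this example.

```r
# Example string and separators taken from the argument description above
raw  <- "1, First label|2, Second label"
sep1 <- ","     # separates a value from its label
sep2 <- "\\|"   # separates value/label pairs; the pipe is escaped because it is a regex

# Step 1: split the string into value/label pairs
pairs <- strsplit(raw, sep2)[[1]]

# Step 2: split each pair into value and label
# (a label that itself contains sep1 would need more careful handling)
parts  <- strsplit(pairs, sep1)
values <- trimws(vapply(parts, `[`, character(1), 1))
labels <- trimws(vapply(parts, `[`, character(1), 2))

# Named vector: names are the labels, elements are the values
val_labels <- setNames(values, labels)
val_labels
```

An unescaped "|" would be read as regex alternation rather than a literal pipe, which is why the description recommends "\\|".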
Value
An "li_codebook" object, consisting of a tibble summarizing the passed
dataset and attributes containing additional metadata. The tibble includes columns:
- name: variable name
- type: column containing simplified variable type
- class: optional column containing class(es) of each variable
- label_stem: optional column containing variable label stems, if any variables are specified in .split_var_labels
- label: variable label
- values: values, with labels if applicable
- user_missing: optional column showing user missing values, with labels if applicable. By default, this column is included only if user missings are specified for at least one variable. This behavior can be changed using the user_missing_col argument to cb_create_options().
- missing: proportion missing
- additional columns if specified in ...
Specifying user missing values
User missing values are defined by passing a formula or list of formulas to the
.user_missing argument. Formulas should specify variables on the left-hand
side and user missing values for those variables on the right-hand side:
cb <- cb_create(data, metadata, .user_missing = var1 ~ 99)
The same user missings can be applied to multiple variables using tidyselect expressions:
# for variables `var1` through `var5`
.user_missing = var1:var5 ~ 99
# for all numeric variables, plus `var6` and `var7`
.user_missing = c(where(is.numeric), var6, var7) ~ c(-9, -8, -7)
# omitted left-hand side defaults to `tidyselect::everything()`
.user_missing = ~ -99
Different user missings can be applied to different variables using a list of formulas:
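As a hedged illustration of the list form, the sketch below builds such a list from the placeholder variable names used in the snippets above (var1 through var7); the particular values are made up for the example. Formulas are stored unevaluated, so the list can be constructed and inspected without the package attached.

```r
# Placeholder variable names and missing-value codes, for illustration only
user_missing <- list(
  var1:var5 ~ 99,                  # one code for var1 through var5
  c(var6, var7) ~ c(-9, -8, -7)    # several codes for var6 and var7
)

# Each element is an ordinary formula object
vapply(user_missing, inherits, logical(1), what = "formula")

# Assumed usage, following the signature shown under "Usage":
# cb <- cb_create(data, metadata, .user_missing = user_missing)
```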
User missing values may optionally be named to set value labels:
.user_missing = ~ c(Declined = -98, "Not applicable" = -99)
If labels set in .user_missing conflict with those in metadata, the user_missing_conflict
argument to cb_create_options() controls which labels are used.
User missings may be set for numeric, character, factor/ordered factor, and haven_labelled/haven_labelled_spss
vectors. For factors, user missings are set based on factor labels (not the underlying
integer codes). For "haven_labelled" vectors, user missings are set based on
values (not value labels). By default, variables with incompatible classes (e.g.,
logical, Date, POSIXt) will be ignored if specified in .user_missing. This
behavior can be changed using the .user_missing_incompatible argument to cb_create_options().
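The difference between factor labels and integer codes can be seen with base R alone. In this sketch, status is a made-up variable, and the formula at the end shows the form a user missing specification for it would plausibly take under the rule described above.

```r
# Hypothetical survey variable; "Declined" is stored as the third level
status <- factor(
  c("Yes", "No", "Declined", "Yes"),
  levels = c("Yes", "No", "Declined")
)

as.integer(status)    # underlying integer codes
as.character(status)  # factor labels

# User missings for factors are matched on labels, so the formula names the
# label "Declined" rather than the integer code. The formula is unevaluated,
# so this line runs without the package; it would be passed as .user_missing.
status_missing <- status ~ "Declined"
```

For a "haven_labelled" vector the situation is reversed: the right-hand side would hold the stored value (for example -98), not its label.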
Examples
diamonds2 <- ggplot2::diamonds |>
  transform(
    carat_group = as.integer(cut(carat, breaks = 3, labels = 1:3)),
    price_group = as.integer(cut(
      price,
      breaks = c(0, 500, 1000, 2000, 5000, 10000, Inf),
      labels = 1:6,
      right = FALSE
    ))
  )
# basic codebook
cb_create(diamonds2)
#> # A tibble: 12 × 4
#>    name        type    values                                 missing
#>    <chr>       <chr>   <chr>                                    <dbl>
#>  1 carat       numeric NA                                           0
#>  2 cut         ordinal Fair; Good; Very Good; Premium; Ideal        0
#>  3 color       ordinal D; E; F; G; H; I; J                          0
#>  4 clarity     ordinal I1; SI2; SI1; VS2; VS1; VVS2; VVS1; IF       0
#>  5 depth       numeric NA                                           0
#>  6 table       numeric NA                                           0
#>  7 price       integer NA                                           0
#>  8 x           numeric NA                                           0
#>  9 y           numeric NA                                           0
#> 10 z           numeric NA                                           0
#> 11 carat_group integer NA                                           0
#> 12 price_group integer NA                                           0
# convert variables to factor to treat as categorical
diamonds2 |>
  transform(
    carat_group = factor(carat_group),
    price_group = factor(price_group)
  ) |>
  cb_create()
#> # A tibble: 12 × 4
#>    name        type        values                                 missing
#>    <chr>       <chr>       <chr>                                    <dbl>
#>  1 carat       numeric     NA                                           0
#>  2 cut         ordinal     Fair; Good; Very Good; Premium; Ideal        0
#>  3 color       ordinal     D; E; F; G; H; I; J                          0
#>  4 clarity     ordinal     I1; SI2; SI1; VS2; VS1; VVS2; VVS1; IF       0
#>  5 depth       numeric     NA                                           0
#>  6 table       numeric     NA                                           0
#>  7 price       integer     NA                                           0
#>  8 x           numeric     NA                                           0
#>  9 y           numeric     NA                                           0
#> 10 z           numeric     NA                                           0
#> 11 carat_group categorical 1; 2; 3                                      0
#> 12 price_group categorical 1; 2; 3; 4; 5; 6                             0
# provide metadata for variable and value labels
diamonds_meta <- data.frame(
  name = names(diamonds2),
  label = c(
    # from ?ggplot2::diamonds, listed in the column order of `diamonds2`
    "weight of the diamond (0.2–5.01)",
    "quality of the cut (Fair, Good, Very Good, Premium, Ideal)",
    "diamond colour, from D (best) to J (worst)",
    "a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))",
    "total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)",
    "width of top of diamond relative to widest point (43–95)",
    "price in US dollars ($326–$18,823)",
    "length in mm (0–10.74)",
    "width in mm (0–58.9)",
    "depth in mm (0–31.8)",
    "diamond carat (3 groups)",
    "diamond price (6 groups)"
  ),
  val_labels = c(
    rep(NA, 10),
    "1 = small; 2 = medium; 3 = large",
    "1 = <$500; 2 = $500-$999; 3 = $1,000-$1,999; 4 = $2,000-$4,999; 5 = $5,000-$9,999; 6 = $10,000+"
  )
)
cb_create(
  diamonds2, diamonds_meta,
  .val_labs_sep1 = " = ", .val_labs_sep2 = "; "
)
#> # A tibble: 12 × 5
#>    name        type        label                                  values missing
#>    <chr>       <chr>       <chr>                                  <chr>    <dbl>
#>  1 carat       numeric     weight of the diamond (0.2–5.01)       NA           0
#>  2 cut         ordinal     quality of the cut (Fair, Good, Very … Fair;…       0
#>  3 color       ordinal     diamond colour, from D (best) to J (w… D; E;…       0
#>  4 clarity     ordinal     a measurement of how clear the diamon… I1; S…       0
#>  5 depth       numeric     total depth percentage = z / mean(x, … NA           0
#>  6 table       numeric     width of top of diamond relative to w… NA           0
#>  7 price       integer     price in US dollars ($326–$18,823)     NA           0
#>  8 x           numeric     length in mm (0–10.74)                 NA           0
#>  9 y           numeric     width in mm (0–58.9)                   NA           0
#> 10 z           numeric     depth in mm (0–31.8)                   NA           0
#> 11 carat_group categorical diamond carat (3 groups)               [1] s…       0
#> 12 price_group categorical diamond price (6 groups)               [1] <…       0