Check and clean spelling or codes of multiple variables in a data frame

This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).

clean_variable_spelling(
  x = data.frame(),
  wordlists = list(),
  from = 1,
  to = 2,
  spelling_vars = 3,
  sort_by = NULL,
  classes = NULL,
  warn = FALSE
)

Arguments

x	a `data.frame`
wordlists	a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the wordlists by column in `x` (see `spelling_vars`).
from	a column name or position defining words or keys to be replaced
to	a column name or position defining replacement values
spelling_vars	character or integer. If `wordlists` is a data frame, then this column in defines the columns in `x` corresponding to each section of the `wordlists` data frame. This defaults to `3`, indicating the third column is to be used.
sort_by	a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.
classes	a vector of class definitions for each of the columns. If this is not provided, the classes will be read from the columns themselves. Practically, this is used in `clean_data()` to mark columns as protected.
warn	if `TRUE`, warnings and errors from `clean_spelling()` will be shown as a single warning. Defaults to `FALSE`, which shows nothing.

Value

a data frame with re-defined data based on the dictionary

Details

By default, this applies the function clean_spelling() to all columns specified by the column names listed in spelling_vars, or, if a global dictionary is used, this includes all character and factor columns as well.

spelling_vars

Spelling variables within wordlists represent keys that you want to match to column names in x (the data set). These are expected to match exactly with the exception of two reserved keywords that starts with a full stop:

.regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
.global: any column (see Section Global wordlists)

Global wordlists

A global wordlist is a set of definitions applied to all valid columns of x indiscriminantly.

.global spelling_var: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the spelling_var column of your wordlists data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global defintions. For example: if you have a column for cardinal directions and a definiton for N = North, then the global variable N = no will not override that. See Example.
spelling_vars = NULL: If you want your data frame to be applied to all character/factor columns indiscriminantly, then setting spelling_vars = NULL will use that wordlist globally.

Note

This function will only parse character and factor columns to protect numeric and Date columns from conversion to character.

Examples


# Read in dictionary and coded date examples --------------------

wordlist <- read.csv(linelist_example("spelling-dictionary.csv"),
                     stringsAsFactors = FALSE)
dat      <- read.csv(linelist_example("coded-data.csv"),
                     stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on wordlist ------------------------------ 

wordlist # show the wordlist
#>     options       values                 grp orders
#> 1         y          Yes         readmission      1
#> 2         n           No         readmission      2
#> 3         u      Unknown         readmission      3
#> 4  .missing      Missing         readmission      4
#> 5         0          Yes             treated      1
#> 6         1           No             treated      2
#> 7  .missing      Missing             treated      3
#> 8         1  Facility  1            facility      1
#> 9         2  Facility  2            facility      2
#> 10        3  Facility  3            facility      3
#> 11        4  Facility  4            facility      4
#> 12        5  Facility  5            facility      5
#> 13        6  Facility  6            facility      6
#> 14        7  Facility  7            facility      7
#> 15        8  Facility  8            facility      8
#> 16        9  Facility  9            facility      9
#> 17       10  Facility 10            facility     10
#> 18 .default      Unknown            facility     11
#> 19        0          0-9           age_group      1
#> 20       10        10-19           age_group      2
#> 21       20        20-29           age_group      3
#> 22       30        30-39           age_group      4
#> 23       40        40-49           age_group      5
#> 24       50          50+           age_group      6
#> 25     high         High .regex ^lab_result_      1
#> 26     norm       Normal .regex ^lab_result_      2
#> 27      inc Inconclusive .regex ^lab_result_      3
#> 28        y          yes             .global    Inf
#> 29        n           no             .global    Inf
#> 30        u      unknown             .global    Inf
#> 31      unk      unknown             .global    Inf
#> 32      oui          yes             .global    Inf
#> 33 .missing      missing             .global    Inf
head(dat) # show the data
#>       id       date readmission treated facility age_group lab_result_01
#> 1 ef267c 2019-07-08        <NA>       0        C        10           unk
#> 2 e80a37 2019-07-07           y       0        3        10           inc
#> 3 b72883 2019-07-07           y       1        8        30           inc
#> 4 c9ee86 2019-07-09           n       1        4        40           inc
#> 5 40bc7a 2019-07-12           n       1        6         0          norm
#> 6 46566e 2019-07-14           y      NA        B        50           unk
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          high           inc         <NA>        u
#> 2           unk          norm            y      oui
#> 3          norm           inc                   oui
#> 4           inc           unk            y      oui
#> 5           unk          norm         <NA>        n
#> 6           unk           inc         <NA>     <NA>

res1 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp")
head(res1)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing

# You can ensure the order of the factors are correct by specifying 
# a column that defines order.

dat[] <- lapply(dat, as.factor)
as.list(head(dat))
#> $id
#> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e
#> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389
#> 
#> $date
#> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14
#> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16
#> 
#> $readmission
#> [1] <NA> y    y    n    n    y   
#> Levels:  n u y
#> 
#> $treated
#> [1] 0    0    1    1    1    <NA>
#> Levels: 0 1
#> 
#> $facility
#> [1] C 3 8 4 6 B
#> Levels: 1 10 2 3 4 5 6 7 8 9 A B C
#> 
#> $age_group
#> [1] 10 10 30 40 0  50
#> Levels: 0 10 20 30 40 50
#> 
#> $lab_result_01
#> [1] unk  inc  inc  inc  norm unk 
#> Levels: high inc norm unk
#> 
#> $lab_result_02
#> [1] high unk  norm inc  unk  unk 
#> Levels: high inc norm unk
#> 
#> $lab_result_03
#> [1] inc  norm inc  unk  norm inc 
#> Levels: high inc norm unk
#> 
#> $has_symptoms
#> [1] <NA> y         y    <NA> <NA>
#> Levels:  n oui u unk y
#> 
#> $followup
#> [1] u    oui  oui  oui  n    <NA>
#> Levels: n oui u unk y
#> 
res2 <- clean_variable_spelling(dat,
                                wordlists = wordlist,
                                from = "options",
                                to = "values",
                                spelling_vars = "grp",
                                sort_by = "orders")
head(res2)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing
as.list(head(res2))
#> $id
#> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e
#> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389
#> 
#> $date
#> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14
#> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16
#> 
#> $readmission
#> [1] Missing Yes     Yes     No      No      Yes    
#> Levels: Yes No Unknown Missing
#> 
#> $treated
#> [1] Yes     Yes     No      No      No      Missing
#> Levels: Yes No Missing
#> 
#> $facility
#> [1] Unknown     Facility  3 Facility  8 Facility  4 Facility  6 Unknown    
#> 11 Levels: Facility  1 Facility  2 Facility  3 Facility  4 ... Unknown
#> 
#> $age_group
#> [1] 10-19 10-19 30-39 40-49 0-9   50+  
#> Levels: 0-9 10-19 20-29 30-39 40-49 50+
#> 
#> $lab_result_01
#> [1] unknown      Inconclusive Inconclusive Inconclusive Normal      
#> [6] unknown     
#> Levels: High Normal Inconclusive unknown
#> 
#> $lab_result_02
#> [1] High         unknown      Normal       Inconclusive unknown     
#> [6] unknown     
#> Levels: High Normal Inconclusive unknown
#> 
#> $lab_result_03
#> [1] Inconclusive Normal       Inconclusive unknown      Normal      
#> [6] Inconclusive
#> Levels: High Normal Inconclusive unknown
#> 
#> $has_symptoms
#> [1] missing yes     missing yes     missing missing
#> Levels: yes no unknown missing
#> 
#> $followup
#> [1] unknown yes     yes     yes     no      missing
#> Levels: yes no unknown missing
#>