Check and clean spelling or codes of multiple variables in a data frame

This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).

match_df(
  x = data.frame(),
  dictionary = list(),
  from = 1,
  to = 2,
  by = 3,
  order = NULL,
  warn = FALSE
)

Arguments

x	a character or factor vector
dictionary	a data frame or named list of data frames with at least two columns defining the word list to be used. If this is a data frame, a third column must be present to split the dictionary by column in `x` (see `by`).
from	a column name or position defining words or keys to be replaced
to	a column name or position defining replacement values
by	character or integer. If `dictionary` is a data frame, then this column in defines the columns in `x` corresponding to each section of the `dictionary` data frame. This defaults to `3`, indicating the third column is to be used.
order	a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted.
warn	if `TRUE`, warnings and errors from `match_vec()` will be shown as a single warning. Defaults to `FALSE`, which shows nothing.

Value

a data frame with re-defined data based on the dictionary

Details

By default, this applies the function match_vec() to all columns specified by the column names listed in by, or, if a global dictionary is used, this includes all character and factor columns as well.

`by` column

Spelling variables within dictionary represent keys that you want to match to column names in x (the data set). These are expected to match exactly with the exception of two reserved keywords that starts with a full stop:

.regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
.global: any column (see Section Global dictionary)

Global dictionary

A global dictionary is a set of definitions applied to all valid columns of x indiscriminantly.

.global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global defintions. For example: if you have a column for cardinal directions and a definiton for N = North, then the global variable N = no will not override that. See Example.
by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminantly, then setting by = NULL will use that dictionary globally.

Examples


# Read in dictionary and coded date examples --------------------

dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
  stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
  stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)

# Clean spelling based on dictionary -----------------------------

dict # show the dict
#>     options       values                 grp orders
#> 1         y          Yes         readmission      1
#> 2         n           No         readmission      2
#> 3         u      Unknown         readmission      3
#> 4  .missing      Missing         readmission      4
#> 5         0          Yes             treated      1
#> 6         1           No             treated      2
#> 7  .missing      Missing             treated      3
#> 8         1  Facility  1            facility      1
#> 9         2  Facility  2            facility      2
#> 10        3  Facility  3            facility      3
#> 11        4  Facility  4            facility      4
#> 12        5  Facility  5            facility      5
#> 13        6  Facility  6            facility      6
#> 14        7  Facility  7            facility      7
#> 15        8  Facility  8            facility      8
#> 16        9  Facility  9            facility      9
#> 17       10  Facility 10            facility     10
#> 18 .default      Unknown            facility     11
#> 19        0          0-9           age_group      1
#> 20       10        10-19           age_group      2
#> 21       20        20-29           age_group      3
#> 22       30        30-39           age_group      4
#> 23       40        40-49           age_group      5
#> 24       50          50+           age_group      6
#> 25     high         High .regex ^lab_result_      1
#> 26     norm       Normal .regex ^lab_result_      2
#> 27      inc Inconclusive .regex ^lab_result_      3
#> 28        y          yes             .global    Inf
#> 29        n           no             .global    Inf
#> 30        u      unknown             .global    Inf
#> 31      unk      unknown             .global    Inf
#> 32      oui          yes             .global    Inf
#> 33 .missing      missing             .global    Inf
head(dat) # show the data
#>       id       date readmission treated facility age_group lab_result_01
#> 1 ef267c 2019-07-08        <NA>       0        C        10           unk
#> 2 e80a37 2019-07-07           y       0        3        10           inc
#> 3 b72883 2019-07-07           y       1        8        30           inc
#> 4 c9ee86 2019-07-09           n       1        4        40           inc
#> 5 40bc7a 2019-07-12           n       1        6         0          norm
#> 6 46566e 2019-07-14           y      NA        B        50           unk
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          high           inc         <NA>        u
#> 2           unk          norm            y      oui
#> 3          norm           inc                   oui
#> 4           inc           unk            y      oui
#> 5           unk          norm         <NA>        n
#> 6           unk           inc         <NA>     <NA>

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp")
head(res1)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing

# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:

res1 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  warn = TRUE)
#> 
#> ── Warnings were found in the following columns ──
#> 
#> ● age_group
#>   1. ⚠ None of the variables in `age_group` were found in the global
#>     dictionary. Did you use the correct dictionary?
#> ● facility
#>   1. ⚠ None of the variables in `facility` were found in the global
#>     dictionary. Did you use the correct dictionary?
#>   2. ⚠ 'A', 'B', 'C' were changed to the default value ('Unknown')
#> ● readmission
#>   1. ⚠ None of the variables in `readmission` were found in the global
#>     dictionary. Did you use the correct dictionary?
#> ● treated
#>   1. ⚠ None of the variables in `treated` were found in the global
#>     dictionary. Did you use the correct dictionary?
#> ● id
#>   1. ⚠ None of the variables in `id` were found in `dict`. Did you use the
#>     correct dictionary?
head(res1)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing


# You can ensure the order of the factors are correct by specifying
# a column that defines order.

dat[] <- lapply(dat, as.factor)
as.list(head(dat))
#> $id
#> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e
#> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389
#> 
#> $date
#> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14
#> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16
#> 
#> $readmission
#> [1] <NA> y    y    n    n    y   
#> Levels:  n u y
#> 
#> $treated
#> [1] 0    0    1    1    1    <NA>
#> Levels: 0 1
#> 
#> $facility
#> [1] C 3 8 4 6 B
#> Levels: 1 10 2 3 4 5 6 7 8 9 A B C
#> 
#> $age_group
#> [1] 10 10 30 40 0  50
#> Levels: 0 10 20 30 40 50
#> 
#> $lab_result_01
#> [1] unk  inc  inc  inc  norm unk 
#> Levels: high inc norm unk
#> 
#> $lab_result_02
#> [1] high unk  norm inc  unk  unk 
#> Levels: high inc norm unk
#> 
#> $lab_result_03
#> [1] inc  norm inc  unk  norm inc 
#> Levels: high inc norm unk
#> 
#> $has_symptoms
#> [1] <NA> y         y    <NA> <NA>
#> Levels:  n oui u unk y
#> 
#> $followup
#> [1] u    oui  oui  oui  n    <NA>
#> Levels: n oui u unk y
#> 
res2 <- match_df(dat,
  dictionary = dict,
  from = "options",
  to = "values",
  by = "grp",
  order = "orders")
head(res2)
#>       id       date readmission treated    facility age_group lab_result_01
#> 1 ef267c 2019-07-08     Missing     Yes     Unknown     10-19       unknown
#> 2 e80a37 2019-07-07         Yes     Yes Facility  3     10-19  Inconclusive
#> 3 b72883 2019-07-07         Yes      No Facility  8     30-39  Inconclusive
#> 4 c9ee86 2019-07-09          No      No Facility  4     40-49  Inconclusive
#> 5 40bc7a 2019-07-12          No      No Facility  6       0-9        Normal
#> 6 46566e 2019-07-14         Yes Missing     Unknown       50+       unknown
#>   lab_result_02 lab_result_03 has_symptoms followup
#> 1          High  Inconclusive      missing  unknown
#> 2       unknown        Normal          yes      yes
#> 3        Normal  Inconclusive      missing      yes
#> 4  Inconclusive       unknown          yes      yes
#> 5       unknown        Normal      missing       no
#> 6       unknown  Inconclusive      missing  missing
as.list(head(res2))
#> $id
#> [1] ef267c e80a37 b72883 c9ee86 40bc7a 46566e
#> 50 Levels: 00220d 0ab27b 0f4fd2 0ff1e1 1513bb 174713 1793eb 25be8f ... fa5389
#> 
#> $date
#> [1] 2019-07-08 2019-07-07 2019-07-07 2019-07-09 2019-07-12 2019-07-14
#> 10 Levels: 2019-07-07 2019-07-08 2019-07-09 2019-07-10 ... 2019-07-16
#> 
#> $readmission
#> [1] Missing Yes     Yes     No      No      Yes    
#> Levels: Yes No Unknown Missing
#> 
#> $treated
#> [1] Yes     Yes     No      No      No      Missing
#> Levels: Yes No Missing
#> 
#> $facility
#> [1] Unknown     Facility  3 Facility  8 Facility  4 Facility  6 Unknown    
#> 11 Levels: Facility  1 Facility  2 Facility  3 Facility  4 ... Unknown
#> 
#> $age_group
#> [1] 10-19 10-19 30-39 40-49 0-9   50+  
#> Levels: 0-9 10-19 20-29 30-39 40-49 50+
#> 
#> $lab_result_01
#> [1] unknown      Inconclusive Inconclusive Inconclusive Normal      
#> [6] unknown     
#> Levels: High Normal Inconclusive unknown
#> 
#> $lab_result_02
#> [1] High         unknown      Normal       Inconclusive unknown     
#> [6] unknown     
#> Levels: High Normal Inconclusive unknown
#> 
#> $lab_result_03
#> [1] Inconclusive Normal       Inconclusive unknown      Normal      
#> [6] Inconclusive
#> Levels: High Normal Inconclusive unknown
#> 
#> $has_symptoms
#> [1] missing yes     missing yes     missing missing
#> Levels: yes no unknown missing
#> 
#> $followup
#> [1] unknown yes     yes     yes     no      missing
#> Levels: yes no unknown missing
#>

Check and clean spelling or codes of multiple variables in a data frame

Arguments

Value

Details

`by` column

Global dictionary

See also

Examples

Contents

Author

Check and clean spelling or codes of multiple variables in a data frame

Arguments

Value

Details

by column

Global dictionary

See also

Examples

Contents

Author

`by` column