This function applies several cleaning procedures to an input data.frame
,
by standardising variable names, labels used categorical variables
(characters of factors), and setting dates to Date
objects. Optionally, an
intelligent date search can be used on character strings to extract dates
from various formats mixed with other text. See details for more information.
clean_data( x, sep = "_", force_Date = TRUE, guess_dates = FALSE, error_tolerance = 0.5, wordlists = NULL, spelling_vars = 3, sort_by = NULL, warn_spelling = FALSE, protect = FALSE, ... )
x | a |
---|---|
sep | The separator used between words, and defaults to the underscore
|
force_Date | a |
guess_dates | a |
error_tolerance | a number between 0 and 1 indicating the proportion of entries which cannot be identified as dates to be tolerated; if this proportion is exceeded, the original vector is returned, and a message is issued; defaults to 0.1 (10 percent) |
wordlists | a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the wordlists by column in |
spelling_vars | character or integer. If |
sort_by | a character the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted. |
warn_spelling | if |
protect | a logical or numeric vector defining the columns to protect
from any manipulation. Note: columns in |
... | further arguments passed on to |
A data.frame
with standardised labels for characters and
factors.
When creating the wordlist for clean_variable_spelling()
, it's important
to remember that the data will first be cleaned with
clean_variable_labels()
, which will remove any capitalisation, accents,
and replace all punctuation and spaces with "_".
This function wraps three other functions:
clean_variable_names()
- to handle variable names,
clean_variables()
- to handle character/factor variables,
clean_dates()
- to handle dates.
#> 'ID Date of Onset. DisCharge.. GENDER_ Épi.Case_définition #> 1 lvijbf 2018-01-08 18/01/2018 FEMALE Confirmed #> 2 glzvne 2018-01-06 16/01/2018 Female confirmed #> 3 gslqeq 2018-01-07 17/01/2018 MALE not a case #> 4 mfaamr 2018-01-05 15/01/2018 male suspected #> 5 ewkadg 2018-01-07 17/01/2018 MALE suspected #> 6 wxbhis 2018-01-10 20/01/2018 female suspected #> 7 ifmmyu 2018-01-06 16/01/2018 Female suspected #> 8 cmrwtk 2018-01-06 16/01/2018 female PROBABLE #> 9 dbirsl 2018-01-09 19/01/2018 male Not.a.Case #> 10 sqjlmp 2018-01-03 13/01/2018 Male not a case #> 11 gudacg 2018-01-08 18/01/2018 MALE suspected #> 12 qflgkj 2018-01-06 16/01/2018 male suspected #> 13 hvqlto 2018-01-06 16/01/2018 MALE not a case #> 14 rlvuzj 2018-01-03 13/01/2018 female Not.a.Case #> 15 cdmfnb 2018-01-07 17/01/2018 FEMALE not a case #> 16 yesrej 2018-01-05 15/01/2018 MALE suspected #> 17 ldsxty 2018-01-03 13/01/2018 male Not.a.Case #> 18 qwtxjx 2018-01-04 14/01/2018 male Not.a.Case #> 19 wiyjis 2018-01-10 20/01/2018 female not a case #> 20 ysoqow 2018-01-07 17/01/2018 Female Not.a.Case #> messy/dates lat lon #> 1 <NA> 11.582811 48.26727 #> 2 01-12-2001 12.582206 46.19974 #> 3 female 13.704558 46.68366 #> 4 female 16.489427 47.76145 #> 5 male 9.714188 46.85687 #> 6 // 24//12//1989 11.441429 47.21552 #> 7 female 16.078899 47.17223 #> 8 2018 10 19 14.400111 48.94434 #> 9 2018-10-18 13.235185 47.36031 #> 10 01-12-2001 12.432755 45.82377 #> 11 // 24//12//1989 13.995818 47.78286 #> 12 2018 10 19 16.219886 48.57991 #> 13 2018 10 19 16.722272 49.95717 #> 14 female 13.772498 47.04740 #> 15 that's 24/12/1989! 10.062713 46.13261 #> 16 male 14.435050 47.03233 #> 17 <NA> 13.263820 48.79245 #> 18 01-12-2001 9.268153 49.81503 #> 19 01-12-2001 11.158673 47.06303 #> 20 // 24//12//1989 15.408058 48.86130## clean variable names, store in new object, show results clean_data <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.1) clean_data#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed <NA> #> 2 glzvne 2018-01-06 2018-01-16 female confirmed 01_12_2001 #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case female #> 4 mfaamr 2018-01-05 2018-01-15 male suspected female #> 5 ewkadg 2018-01-07 2018-01-17 male suspected male #> 6 wxbhis 2018-01-10 2018-01-20 female suspected 24_12_1989 #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected female #> 8 cmrwtk 2018-01-06 2018-01-16 female probable 2018_10_19 #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case 2018_10_18 #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case 01_12_2001 #> 11 gudacg 2018-01-08 2018-01-18 male suspected 24_12_1989 #> 12 qflgkj 2018-01-06 2018-01-16 male suspected 2018_10_19 #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case 2018_10_19 #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case female #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case that_s_24_12_1989 #> 16 yesrej 2018-01-05 2018-01-15 male suspected male #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case 01_12_2001 #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case 01_12_2001 #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case 24_12_1989 #> lat lon #> 1 11.582811 48.26727 #> 2 12.582206 46.19974 #> 3 13.704558 46.68366 #> 4 16.489427 47.76145 #> 5 9.714188 46.85687 #> 6 11.441429 47.21552 #> 7 16.078899 47.17223 #> 8 14.400111 48.94434 #> 9 13.235185 47.36031 #> 10 12.432755 45.82377 #> 11 13.995818 47.78286 #> 12 16.219886 48.57991 #> 13 16.722272 49.95717 #> 14 13.772498 47.04740 #> 15 10.062713 46.13261 #> 16 14.435050 47.03233 #> 17 13.263820 48.79245 #> 18 9.268153 49.81503 #> 19 11.158673 47.06303 #> 20 15.408058 48.86130clean_data2 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8) clean_data2#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed <NA> #> 2 glzvne 2018-01-06 2018-01-16 female confirmed 2001-12-01 #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case <NA> #> 4 mfaamr 2018-01-05 2018-01-15 male suspected <NA> #> 5 ewkadg 2018-01-07 2018-01-17 male suspected <NA> #> 6 wxbhis 2018-01-10 2018-01-20 female suspected 1989-12-24 #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected <NA> #> 8 cmrwtk 2018-01-06 2018-01-16 female probable 2018-10-19 #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case 2018-10-18 #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case 2001-12-01 #> 11 gudacg 2018-01-08 2018-01-18 male suspected 1989-12-24 #> 12 qflgkj 2018-01-06 2018-01-16 male suspected 2018-10-19 #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case 2018-10-19 #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case <NA> #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case 1989-12-24 #> 16 yesrej 2018-01-05 2018-01-15 male suspected <NA> #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case 2001-12-01 #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case 2001-12-01 #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case 1989-12-24 #> lat lon #> 1 11.582811 48.26727 #> 2 12.582206 46.19974 #> 3 13.704558 46.68366 #> 4 16.489427 47.76145 #> 5 9.714188 46.85687 #> 6 11.441429 47.21552 #> 7 16.078899 47.17223 #> 8 14.400111 48.94434 #> 9 13.235185 47.36031 #> 10 12.432755 45.82377 #> 11 13.995818 47.78286 #> 12 16.219886 48.57991 #> 13 16.722272 49.95717 #> 14 13.772498 47.04740 #> 15 10.062713 46.13261 #> 16 14.435050 47.03233 #> 17 13.263820 48.79245 #> 18 9.268153 49.81503 #> 19 11.158673 47.06303 #> 20 15.408058 48.86130## clean variable names, but keep our "messy/dates" column to_protect <- names(toy_data) %in% "messy/dates" clean_data3 <- clean_data(toy_data, guess_dates = TRUE, error_tolerance = 0.8, protect = to_protect ) clean_data3#> id date_of_onset discharge gender epi_case_definition #> 1 lvijbf 2018-01-08 2018-01-18 female confirmed #> 2 glzvne 2018-01-06 2018-01-16 female confirmed #> 3 gslqeq 2018-01-07 2018-01-17 male not_a_case #> 4 mfaamr 2018-01-05 2018-01-15 male suspected #> 5 ewkadg 2018-01-07 2018-01-17 male suspected #> 6 wxbhis 2018-01-10 2018-01-20 female suspected #> 7 ifmmyu 2018-01-06 2018-01-16 female suspected #> 8 cmrwtk 2018-01-06 2018-01-16 female probable #> 9 dbirsl 2018-01-09 2018-01-19 male not_a_case #> 10 sqjlmp 2018-01-03 2018-01-13 male not_a_case #> 11 gudacg 2018-01-08 2018-01-18 male suspected #> 12 qflgkj 2018-01-06 2018-01-16 male suspected #> 13 hvqlto 2018-01-06 2018-01-16 male not_a_case #> 14 rlvuzj 2018-01-03 2018-01-13 female not_a_case #> 15 cdmfnb 2018-01-07 2018-01-17 female not_a_case #> 16 yesrej 2018-01-05 2018-01-15 male suspected #> 17 ldsxty 2018-01-03 2018-01-13 male not_a_case #> 18 qwtxjx 2018-01-04 2018-01-14 male not_a_case #> 19 wiyjis 2018-01-10 2018-01-20 female not_a_case #> 20 ysoqow 2018-01-07 2018-01-17 female not_a_case #> messy/dates lat lon #> 1 <NA> 11.582811 48.26727 #> 2 01-12-2001 12.582206 46.19974 #> 3 female 13.704558 46.68366 #> 4 female 16.489427 47.76145 #> 5 male 9.714188 46.85687 #> 6 // 24//12//1989 11.441429 47.21552 #> 7 female 16.078899 47.17223 #> 8 2018 10 19 14.400111 48.94434 #> 9 2018-10-18 13.235185 47.36031 #> 10 01-12-2001 12.432755 45.82377 #> 11 // 24//12//1989 13.995818 47.78286 #> 12 2018 10 19 16.219886 48.57991 #> 13 2018 10 19 16.722272 49.95717 #> 14 female 13.772498 47.04740 #> 15 that's 24/12/1989! 10.062713 46.13261 #> 16 male 14.435050 47.03233 #> 17 <NA> 13.263820 48.79245 #> 18 01-12-2001 9.268153 49.81503 #> 19 01-12-2001 11.158673 47.06303 #> 20 // 24//12//1989 15.408058 48.86130## Using a wordlist ------------------------------- # location data with mis-spellings, French, and English. messy_locations <- c("hopsital", "h\u00f4pital", "hospital", "m\u00e9dical", "clinic", "feild", "field") toy_data$location <- factor(sample(messy_locations, 20, replace = TRUE)) # show data toy_data$location#> [1] hopsital clinic hopsital clinic hôpital field clinic hospital #> [9] field hôpital hospital hopsital field hopsital hopsital hôpital #> [17] hôpital hospital clinic hôpital #> Levels: clinic field hôpital hopsital hospital# add a wordlist wordlist <- data.frame( from = c("hopsital", "hopital", "medical", "feild"), to = c("hospital", "hospital", "clinic", "field"), variables = rep("location", 4), stringsAsFactors = FALSE ) clean_data4 <- clean_data(toy_data, wordlists = wordlist, spelling_vars = "variables" ) clean_data4#> id date_of_onset discharge gender epi_case_definition messy_dates #> 1 lvijbf 2018-01-08 18_01_2018 female confirmed <NA> #> 2 glzvne 2018-01-06 16_01_2018 female confirmed 01_12_2001 #> 3 gslqeq 2018-01-07 17_01_2018 male not_a_case female #> 4 mfaamr 2018-01-05 15_01_2018 male suspected female #> 5 ewkadg 2018-01-07 17_01_2018 male suspected male #> 6 wxbhis 2018-01-10 20_01_2018 female suspected 24_12_1989 #> 7 ifmmyu 2018-01-06 16_01_2018 female suspected female #> 8 cmrwtk 2018-01-06 16_01_2018 female probable 2018_10_19 #> 9 dbirsl 2018-01-09 19_01_2018 male not_a_case 2018_10_18 #> 10 sqjlmp 2018-01-03 13_01_2018 male not_a_case 01_12_2001 #> 11 gudacg 2018-01-08 18_01_2018 male suspected 24_12_1989 #> 12 qflgkj 2018-01-06 16_01_2018 male suspected 2018_10_19 #> 13 hvqlto 2018-01-06 16_01_2018 male not_a_case 2018_10_19 #> 14 rlvuzj 2018-01-03 13_01_2018 female not_a_case female #> 15 cdmfnb 2018-01-07 17_01_2018 female not_a_case that_s_24_12_1989 #> 16 yesrej 2018-01-05 15_01_2018 male suspected male #> 17 ldsxty 2018-01-03 13_01_2018 male not_a_case <NA> #> 18 qwtxjx 2018-01-04 14_01_2018 male not_a_case 01_12_2001 #> 19 wiyjis 2018-01-10 20_01_2018 female not_a_case 01_12_2001 #> 20 ysoqow 2018-01-07 17_01_2018 female not_a_case 24_12_1989 #> lat lon location #> 1 11.582811 48.26727 hospital #> 2 12.582206 46.19974 clinic #> 3 13.704558 46.68366 hospital #> 4 16.489427 47.76145 clinic #> 5 9.714188 46.85687 hospital #> 6 11.441429 47.21552 field #> 7 16.078899 47.17223 clinic #> 8 14.400111 48.94434 hospital #> 9 13.235185 47.36031 field #> 10 12.432755 45.82377 hospital #> 11 13.995818 47.78286 hospital #> 12 16.219886 48.57991 hospital #> 13 16.722272 49.95717 field #> 14 13.772498 47.04740 hospital #> 15 10.062713 46.13261 hospital #> 16 14.435050 47.03233 hospital #> 17 13.263820 48.79245 hospital #> 18 9.268153 49.81503 hospital #> 19 11.158673 47.06303 clinic #> 20 15.408058 48.86130 hospitalclean_data4$location#> [1] hospital clinic hospital clinic hospital field clinic hospital #> [9] field hospital hospital hospital field hospital hospital hospital #> [17] hospital hospital clinic hospital #> Levels: hospital clinic field