This package simplifies the cleaning and standardisation of linelist data. Given a case linelist stored as a data.frame, it aims to:

  • standardise the variable names, replacing all non-ASCII characters with their closest Latin equivalent, removing blank spaces and other separators, enforcing lower case, and using a single separator between words

  • standardise the labels used in all variables of type character and factor, as above

  • convert POSIXct and POSIXlt variables to Date objects

  • extract dates from a messy variable, automatically detecting formats and tolerating both inconsistent formats and dates flanked by other text
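
The name and label standardisation listed above can be approximated in base R. The following is a minimal sketch, not the package's own implementation: standardise_text is a hypothetical helper, the "_" separator is an assumption taken from the worked example output, and the transliteration performed by iconv with "ASCII//TRANSLIT" is platform dependent, so accented characters may be rendered differently across systems.

```r
## hypothetical helper (not part of linelist): standardise names or labels
standardise_text <- function(x) {
  x <- iconv(x, to = "ASCII//TRANSLIT") # closest ASCII equivalent (platform dependent)
  x <- tolower(x)                       # enforce lower case
  x <- gsub("[^a-z0-9]+", "_", x)       # any run of separators becomes a single "_"
  x <- gsub("^_+|_+$", "", x)           # drop leading and trailing separators
  x
}

standardise_text(c("Date of Onset.", "DisCharge..", "GENDER_ "))
#> [1] "date_of_onset" "discharge"     "gender"
```

Collapsing every run of non-alphanumeric characters in one step is what guarantees a single separator between words, whatever mix of spaces, dots and underscores the input uses.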

Installing the package

To install the current stable, CRAN version of the package, type:

install.packages("linelist")

To benefit from the latest features and bug fixes, install the development version of the package from GitHub using:

devtools::install_github("reconhub/linelist")

Note that this requires the devtools package to be installed.
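
If you are not sure whether devtools is available, the check can be folded into the installation step. This is only a convenience sketch: install_linelist_dev is a hypothetical helper name, not something provided by linelist or devtools, and it assumes a CRAN mirror is reachable.

```r
## hypothetical helper (not part of linelist or devtools)
install_linelist_dev <- function() {
  ## install devtools from CRAN only if it is missing
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  ## then install the development version of linelist from GitHub
  devtools::install_github("reconhub/linelist")
}

## nothing is installed until the helper is called:
## install_linelist_dev()
```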

What does it do?

The main features of the package include:

  • clean_data: the main function, taking a data.frame as input and applying all the variable name, label, and date processing described above

  • clean_variable_names: like clean_data, but only the variable names

  • clean_variable_labels: like clean_data, but only the variable labels

  • clean_dates: like clean_data, but only the dates

  • guess_dates: find dates in various, unspecified formats in a messy character vector
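
To give an idea of what finding dates flanked by other text involves, here is a minimal base R sketch. It is not how guess_dates is implemented: extract_date and find_first are hypothetical helpers, and only two layouts are assumed (year-month-day and day-month-year, both with four-digit years).

```r
## hypothetical helper: first substring of each element matching a pattern
find_first <- function(pattern, text) {
  m <- regexpr(pattern, text)
  res <- rep(NA_character_, length(text))
  hit <- !is.na(m) & as.integer(m) > 0
  res[hit] <- regmatches(text, m) # regmatches() returns matched elements only
  res
}

## hypothetical helper: pull one date out of each messy string
extract_date <- function(x) {
  ## turn every run of non-digits into "-", so "2018_10_17" becomes "2018-10-17"
  y <- gsub("[^0-9]+", "-", as.character(x))
  ymd <- find_first("[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}", y)
  dmy <- find_first("[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}", y)
  out <- as.Date(ymd, format = "%Y-%m-%d")
  ## fall back to day-month-year where year-month-day gave nothing
  todo <- is.na(out)
  out[todo] <- as.Date(dmy[todo], format = "%d-%m-%Y")
  out
}

extract_date(c("2018_10_17", "// 24//12//1989", "that's 24/12/1989!", "male"))
#> [1] "2018-10-17" "1989-12-24" "1989-12-24" NA
```

Normalising the separators before matching keeps the patterns short; a real implementation would need to handle many more layouts and ambiguous day/month orders.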

Worked example

Let us consider a messy data.frame as a toy example:


## make toy data (no random seed is set, so sampled values will differ between runs)
onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected  ", "Not.a.Case")
messy_dates <- sample(
                 c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
                   "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
                 20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "GENDER_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)
## show data
toy_data
#>    Date.of.Onset. DisCharge.. GENDER_. Épi.Case_définition
#> 1      2018-01-06  16/01/2018   FEMALE         suspected  
#> 2      2018-01-09  19/01/2018   FEMALE            probable
#> 3      2018-01-08  18/01/2018     male          not a case
#> 4      2018-01-02  12/01/2018     MALE            PROBABLE
#> 5      2018-01-05  15/01/2018     Male          Not.a.Case
#> 6      2018-01-04  14/01/2018     MALE          Not.a.Case
#> 7      2018-01-09  19/01/2018     male           confirmed
#> 8      2018-01-06  16/01/2018   female           confirmed
#> 9      2018-01-06  16/01/2018   FEMALE            probable
#> 10     2018-01-08  18/01/2018   female           confirmed
#> 11     2018-01-10  20/01/2018   female            PROBABLE
#> 12     2018-01-06  16/01/2018     male           Confirmed
#> 13     2018-01-09  19/01/2018     male         suspected  
#> 14     2018-01-02  12/01/2018   Female           confirmed
#> 15     2018-01-10  20/01/2018   female          Not.a.Case
#> 16     2018-01-11  21/01/2018     Male         suspected  
#> 17     2018-01-08  18/01/2018   female            PROBABLE
#> 18     2018-01-08  18/01/2018   female         suspected  
#> 19     2018-01-09  19/01/2018     Male            PROBABLE
#> 20     2018-01-08  18/01/2018   FEMALE           Confirmed
#>           messy.dates
#> 1     // 24//12//1989
#> 2              female
#> 3                <NA>
#> 4                <NA>
#> 5          2018_10_17
#> 6          2018-10-18
#> 7              female
#> 8     // 24//12//1989
#> 9          01-12-2001
#> 10             female
#> 11         2018_10_17
#> 12         2018-10-18
#> 13             female
#> 14 that's 24/12/1989!
#> 15         01-12-2001
#> 16    // 24//12//1989
#> 17               <NA>
#> 18             female
#> 19         01-12-2001
#> 20         01-12-2001
## load library
library(linelist)

## clean data with defaults
x <- clean_data(toy_data)
x
#>    date_of_onset  discharge gender epi_case_definition messy_dates
#> 1     2018-01-06 2018-01-16 female           suspected  1989-12-24
#> 2     2018-01-09 2018-01-19 female            probable        <NA>
#> 3     2018-01-08 2018-01-18   male          not_a_case        <NA>
#> 4     2018-01-02 2018-01-12   male            probable        <NA>
#> 5     2018-01-05 2018-01-15   male          not_a_case  2018-10-17
#> 6     2018-01-04 2018-01-14   male          not_a_case  2018-10-18
#> 7     2018-01-09 2018-01-19   male           confirmed        <NA>
#> 8     2018-01-06 2018-01-16 female           confirmed  1989-12-24
#> 9     2018-01-06 2018-01-16 female            probable  2001-12-01
#> 10    2018-01-08 2018-01-18 female           confirmed        <NA>
#> 11    2018-01-10 2018-01-20 female            probable  2018-10-17
#> 12    2018-01-06 2018-01-16   male           confirmed  2018-10-18
#> 13    2018-01-09 2018-01-19   male           suspected        <NA>
#> 14    2018-01-02 2018-01-12 female           confirmed  1989-12-24
#> 15    2018-01-10 2018-01-20 female          not_a_case  2001-12-01
#> 16    2018-01-11 2018-01-21   male           suspected  1989-12-24
#> 17    2018-01-08 2018-01-18 female            probable        <NA>
#> 18    2018-01-08 2018-01-18 female           suspected        <NA>
#> 19    2018-01-09 2018-01-19   male            probable  2001-12-01
#> 20    2018-01-08 2018-01-18 female           confirmed  2001-12-01

## lower the tolerance for unconverted dates: messy_dates now exceeds it and is kept as text
clean_data(toy_data, error_tolerance = 0.05)
#>    date_of_onset  discharge gender epi_case_definition       messy_dates
#> 1     2018-01-06 2018-01-16 female           suspected        24_12_1989
#> 2     2018-01-09 2018-01-19 female            probable            female
#> 3     2018-01-08 2018-01-18   male          not_a_case              <NA>
#> 4     2018-01-02 2018-01-12   male            probable              <NA>
#> 5     2018-01-05 2018-01-15   male          not_a_case        2018_10_17
#> 6     2018-01-04 2018-01-14   male          not_a_case        2018_10_18
#> 7     2018-01-09 2018-01-19   male           confirmed            female
#> 8     2018-01-06 2018-01-16 female           confirmed        24_12_1989
#> 9     2018-01-06 2018-01-16 female            probable        01_12_2001
#> 10    2018-01-08 2018-01-18 female           confirmed            female
#> 11    2018-01-10 2018-01-20 female            probable        2018_10_17
#> 12    2018-01-06 2018-01-16   male           confirmed        2018_10_18
#> 13    2018-01-09 2018-01-19   male           suspected            female
#> 14    2018-01-02 2018-01-12 female           confirmed that_s_24_12_1989
#> 15    2018-01-10 2018-01-20 female          not_a_case        01_12_2001
#> 16    2018-01-11 2018-01-21   male           suspected        24_12_1989
#> 17    2018-01-08 2018-01-18 female            probable              <NA>
#> 18    2018-01-08 2018-01-18 female           suspected            female
#> 19    2018-01-09 2018-01-19   male            probable        01_12_2001
#> 20    2018-01-08 2018-01-18 female           confirmed        01_12_2001
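
The difference between the two outputs above can be pictured as a simple threshold rule. The sketch below is an assumption about the general idea, not the package's code: convert_if_tolerable is a hypothetical helper, and it assumes that the tolerance is the largest acceptable proportion of entries that are present but cannot be read as dates.

```r
## hypothetical helper: keep the parsed dates only if few enough entries failed
convert_if_tolerable <- function(x, parsed, error_tolerance) {
  failed <- !is.na(x) & is.na(parsed) # present in the input, but not a date
  if (mean(failed) > error_tolerance) x else parsed
}

raw <- c("2018-10-18", "female", NA, "2018-10-17")
parsed <- as.Date(raw, format = "%Y-%m-%d") # "female" becomes NA

## 1 entry in 4 fails (25%)
convert_if_tolerable(raw, parsed, error_tolerance = 0.5)  # Date vector returned
convert_if_tolerable(raw, parsed, error_tolerance = 0.05) # original text returned
```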

Getting help online

Bug reports and feature requests should be posted on GitHub using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/

Contributions are welcome via pull requests.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.