Recode factors, keeping only most frequent levels

top_values() is an S3 generic with methods for factor and character objects. It lists all unique values in the input, ranks them from the most to the least frequent, and keeps the top n values. All other values are replaced by the chosen replacement. Optionally, a subset of the input data can be used to define the dominant values. Under the hood, this uses forcats::fct_lump() and forcats::fct_recode().
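The rank-and-replace idea described above can be sketched in base R. This is an illustration only, not the package's implementation (which relies on forcats); `sketch_top_values()` is a hypothetical helper name, and leaving missing values untouched is an assumption of this sketch.

```r
# Hypothetical helper, NOT the package's code: rank unique values by
# frequency, keep the top n, and replace everything else.
sketch_top_values <- function(x, n, replacement = "other",
                              ties_method = "first") {
  counts <- table(x)  # frequency of each unique value (NA excluded)
  # rank 1 = most frequent; ties are broken according to ties_method
  ranks <- rank(-as.vector(counts), ties.method = ties_method)
  keep <- names(counts)[ranks <= n]
  # assumption of this sketch: missing values are left as they are
  x[!(x %in% keep) & !is.na(x)] <- replacement
  x
}

x <- c("a", "a", "a", "b", "b", "c")
sketch_top_values(x, n = 1)
#> [1] "a"     "a"     "a"     "other" "other" "other"
```

Ranking the negated counts lets `rank()` assign rank 1 to the most frequent value, so the cut-off is simply `ranks <= n`.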

Usage

top_values(x, n, ...)

# S3 method for default
top_values(x, n, ...)

# S3 method for factor
top_values(
  x,
  n,
  replacement = "other",
  subset = NULL,
  ties_method = "first",
  ...
)

# S3 method for character
top_values(
  x,
  n,
  replacement = "other",
  subset = NULL,
  ties_method = "first",
  ...
)

Arguments

x	a `factor` or a `character` vector
n	the number of levels or values to keep
...	further arguments passed to `forcats::fct_lump()`.
replacement	a single value to replace the less frequent values with
subset	a `logical`, `integer` or `character` vector used to subset the input; only the subsetted data will be used to define the dominant values, which are then used for re-defining values in the entire input
ties_method	how to break ties when ranking factor levels; this is passed on to `rank()`. Defaults to "first" (see Details).
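To illustrate what the subset argument means, here is a base-R sketch (not the package's code) in which the dominant value is computed on the subsetted data only and then applied to the entire input. The variable names `idx` and `dominant` are hypothetical.

```r
x <- c("a", "a", "a", "b", "b", "c")
idx <- 4:6                      # positions used to define dominant values

counts <- table(x[idx])         # frequencies within the subset: b = 2, c = 1
dominant <- names(sort(counts, decreasing = TRUE))[1]  # top 1 value: "b"

# recode the whole input, not just the subset
ifelse(x %in% dominant, x, "other")
#> [1] "other" "other" "other" "b"     "b"     "other"
```

Although "a" is the most frequent value overall, it does not occur within positions 4 to 6, so it is replaced along with every other non-dominant value.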

Details

This function is an opinionated wrapper around forcats::fct_lump() with the following changes:

characters are not auto-converted to factor
the ties method defaults to "first" instead of "min"
if n = nlevels(x) - 1, then the one remaining (least frequent) level is still converted to the value of replacement (forcats will assume you didn't want to convert it)
it is possible to set the replacement to NA
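The effect of the ties method can be seen with `rank()` alone. The snippet below is a small illustration using made-up counts; it does not call forcats or top_values().

```r
# toy frequencies: a and b are tied for first place
counts <- c(a = 2, b = 2, c = 1)

r_min   <- rank(-counts, ties.method = "min")    # a = 1, b = 1, c = 3
r_first <- rank(-counts, ties.method = "first")  # a = 1, b = 2, c = 3

names(counts)[r_min <= 1]    # both tied values pass a cut-off at n = 1
names(counts)[r_first <= 1]  # only the first tied value passes
```

With "min", tied values share the lowest rank, so a cut-off at n can retain more than n values; with "first", exactly n values pass the cut-off.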

Examples


## make toy data
x <- sample(letters[1:10], 100, replace = TRUE)
sort(table(x), decreasing = TRUE)
#> x
#>  g  i  a  d  h  e  f  b  c  j 
#> 15 14 11 11 10  9  9  7  7  7 

## keep top values
top_values(x, 2) # top 2
#>   [1] "other" "other" "other" "other" "i"     "other" "other" "other" "other"
#>  [10] "g"     "other" "other" "i"     "g"     "other" "other" "i"     "i"    
#>  [19] "other" "other" "i"     "g"     "other" "other" "other" "other" "other"
#>  [28] "other" "other" "other" "other" "other" "other" "other" "g"     "other"
#>  [37] "i"     "i"     "other" "g"     "i"     "other" "i"     "other" "other"
#>  [46] "other" "g"     "other" "other" "other" "g"     "other" "other" "other"
#>  [55] "other" "other" "g"     "other" "other" "other" "other" "i"     "g"    
#>  [64] "other" "g"     "i"     "g"     "other" "other" "g"     "other" "other"
#>  [73] "other" "i"     "other" "other" "g"     "i"     "other" "g"     "other"
#>  [82] "other" "other" "other" "other" "other" "other" "g"     "other" "i"    
#>  [91] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#> [100] "other"
top_values(x, 2, NA) # top 2, replace with NA
#>   [1] NA  NA  NA  NA  "i" NA  NA  NA  NA  "g" NA  NA  "i" "g" NA  NA  "i" "i"
#>  [19] NA  NA  "i" "g" NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  "g" NA 
#>  [37] "i" "i" NA  "g" "i" NA  "i" NA  NA  NA  "g" NA  NA  NA  "g" NA  NA  NA 
#>  [55] NA  NA  "g" NA  NA  NA  NA  "i" "g" NA  "g" "i" "g" NA  NA  "g" NA  NA 
#>  [73] NA  "i" NA  NA  "g" "i" NA  "g" NA  NA  NA  NA  NA  NA  NA  "g" NA  "i"
#>  [91] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
top_values(x, 0) # extreme case, keep nothing
#>   [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [10] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [19] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [28] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [37] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [46] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [55] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [64] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [73] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [82] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#>  [91] "other" "other" "other" "other" "other" "other" "other" "other" "other"
#> [100] "other"

## dealing with ties
x <- c("a", "b", "a", "b", "c")

## in the case of a tie (a, b), the first value is ranked higher than the
## others
top_values(x, n = 1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#> [1] "a"     "other" "a"     "other" "other"

## here, the ties are ranked in reverse order, so b comes before a
top_values(x, n = 1, ties_method = "last")
#> Warning: a tie among values (a, b) was broken by choosing the last value
#> [1] "other" "b"     "other" "b"     "other"

## top_values differs from forcats::fct_lump in that if the user selects
## n = (number of unique values) - 1, it will still force the last value to
## be "other"
forcats::fct_lump(x, n = 2)
#> [1] a b a b c
#> Levels: a b c
top_values(x, n = 2)
#> [1] "a"     "b"     "a"     "b"     "other"

## If there is a tie for the last level, then it will drop the level
## depending on the ties_method

# replace "d" with other
top_values(c(x, "d"), n = 3)
#> Warning: a tie among values (c, d) was broken by choosing the first value
#> [1] "a"     "b"     "a"     "b"     "c"     "other"

# replace "c" with other
top_values(c(x, "d"), n = 3, ties_method = "last")
#> Warning: a tie among values (c, d) was broken by choosing the last value
#> [1] "a"     "b"     "a"     "b"     "other" "d"    

## using subset
x <- c("a", "a", "a", "b", "b", "c")
x
#> [1] "a" "a" "a" "b" "b" "c"
top_values(x, n = 1, subset = 4:6)
#> [1] "other" "other" "other" "b"     "b"     "other"
top_values(x, n = 2, subset = 4:6)
#> [1] "other" "other" "other" "b"     "b"     "c"    
top_values(x, n = 1, subset = -1)
#> Warning: a tie among values (a, b) was broken by choosing the first value
#> [1] "a"     "a"     "a"     "other" "other" "other"
top_values(x, n = 1, subset = -1, ties_method = "last")
#> Warning: a tie among values (a, b) was broken by choosing the last value
#> [1] "other" "other" "other" "b"     "b"     "other"
