Filtering observations with {dplyr}

February 21, 2021    R dplyr

I was using dplyr::filter() to subset a data.frame and was having trouble getting the result I wanted. I had a data.frame along the lines of:

library(dplyr)

df <- tibble::tribble(
  ~x, ~y, ~z,
   1,  2,  3,
   4,  5,  6,
  NA,  8,  9
)

I wanted to drop the rows where x was equal to 4:

df %>%
  filter(x != 4)
## # A tibble: 1 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

But in this case the third row, with an NA value for x, was also being dropped.

The documentation for dplyr::filter() explains that:

To be retained, the row must produce a value of TRUE for all conditions. Note 
that when a condition evaluates to NA the row will be dropped, unlike base 
subsetting with [.

So in my case the last row of the simple df was not kept because:

NA != 4
## [1] NA

To get the result I wanted, I needed a comparison that returns FALSE rather than NA for a missing value, along the lines of:

!(NA %in% 4)
## [1] TRUE
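
As a rough illustration of the difference, here is the same comparison on a whole vector. The vector `x` below is a made-up stand-in for the x column of df; it is only a sketch of how the two operators treat a missing value.

```r
# Hypothetical stand-in for the x column of df
x <- c(1, 4, NA)

# Comparison operators propagate NA: the third result is unknown
ne_result <- x != 4
ne_result
## [1]  TRUE FALSE    NA

# %in% is based on match(), so every element gets a definite TRUE or FALSE
in_result <- !(x %in% 4)
in_result
## [1]  TRUE FALSE  TRUE
```

Since filter() only keeps rows where the condition is TRUE, the NA in the first result is what drops the row.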

And now my call to dplyr::filter() returns the result I was looking for:

df %>%
  filter(!(x %in% 4))
## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2    NA     8     9
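
Another way to write this, which I did not use above but which spells out the intent, is to keep missing values explicitly with is.na(). This is a minimal sketch that rebuilds the same df; the name `kept` is just for illustration.

```r
library(dplyr)

df <- tibble::tribble(
  ~x, ~y, ~z,
   1,  2,  3,
   4,  5,  6,
  NA,  8,  9
)

# TRUE | NA evaluates to TRUE, so rows with a missing x are retained
kept <- df %>%
  filter(is.na(x) | x != 4)
```

This should return the same two rows as the %in% version, at the cost of a slightly longer condition.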

The documentation made me curious about subsetting in base R. This also produced results that were initially counterintuitive to me:

df[df$x != 4, ]
## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2    NA    NA    NA

The third row appeared to be kept in the subset, but its values for y and z were set to NA as well. The documentation for [ notes that:

rows containing an NA produce an NA in the result

But the same logic that I used in dplyr::filter() works here as well:

df[!(df$x %in% 4), ]
## # A tibble: 2 x 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2    NA     8     9
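
To see where the all-NA row in the base R result comes from, it can help to look at the logical index on its own. The sketch below uses a plain data.frame with the same values (the names `df_base`, `keep`, `na_row` and `dropped` are made up for this example) and also shows which(), which discards NA positions and so behaves more like filter().

```r
# Plain data.frame with the same values as df
df_base <- data.frame(x = c(1, 4, NA), y = c(2, 5, 8), z = c(3, 6, 9))

# The logical index itself contains an NA
keep <- df_base$x != 4
keep
## [1]  TRUE FALSE    NA

# An NA index selects an "unknown" row, so the whole row comes back as NA
na_row <- df_base[keep, ]

# which() returns only the positions that are TRUE, dropping the NA
dropped <- df_base[which(keep), ]
```

Here `na_row` has two rows, the second of which is entirely NA, while `dropped` contains only the first row, matching what filter(x != 4) returned.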