A crudely-drawn calendar that says '1 dependency' in the style of a date

tl;dr

I used {itdepends} to see how CRAN packages depend on {lubridate}, which was not removed from CRAN recently.

Lubrigate

A test failure in {lubridate} led to hundreds of R developers being emailed about its potential expulsion from CRAN, which also threatened the hundreds of packages that depend on it.

I see the benefit of minimising dependencies. I also understand the drawbacks of reinventing the wheel. Maybe {lubridate} is a good dependency: a simple API, part of the popular {tidyverse}, and it handles stuff you can’t be bothered with (like what’s 29 February plus one year?).
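
To make the leap-day question concrete, here is a minimal sketch of what base R does by default. The {lubridate} part only runs if the package is installed, and the behaviour noted in its comments is what the {lubridate} documentation describes, not something that was run for this post.

```r
# What is 29 February 2020 plus one year?
leap_day <- as.Date("2020-02-29")

# Base R: seq() steps the year, and the non-existent 29 Feb 2021
# rolls forward to 1 March
plus_one_year <- seq(leap_day, by = "year", length.out = 2)[2]
format(plus_one_year)  # "2021-03-01"

# {lubridate}, if available: adding a period is documented to give NA
# for an impossible date, while %m+% rolls back to the end of February
if (requireNamespace("lubridate", quietly = TRUE)) {
  print(leap_day + lubridate::years(1))
  print(lubridate::`%m+%`(leap_day, lubridate::years(1)))
}
```

Neither answer is wrong, which is sort of the point: the package makes you choose explicitly.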

Jim Hester spoke at rstudio::conf(2019) about dependencies. His {itdepends} package helps you understand their scale and impact on your package.¹

So, for fun, I’m looking at how {lubridate} is used by packages that import it.

CRANk it up

CRAN_package_db() is a convenient function that returns information about packages available on CRAN. We can filter it for the packages that import {lubridate}, i.e. they have {lubridate} in the Imports section of their DESCRIPTION file.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)

cran <- tools::CRAN_package_db()

imports_lubridate <- cran |> 
  filter(str_detect(Imports, "lubridate")) |> 
  pull(Package)

sample(imports_lubridate, 5)  # random sample

## [1] "quantdates"  "GetDFPData2" "esmprep"     "strand"      "votesmart"

Right, so that’s 494 packages out of 18515 (3%). Is that a lot? Well, the tidyverse package {dplyr}, the Swiss Army knife of data wrangling, is listed in the Imports of 2353.
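
One caveat with a plain string match on the Imports field is that it would also catch any package whose name merely contains "lubridate". Here is a hedged sketch of a stricter match and of the percentage calculation, using a made-up stand-in for the CRAN table (`cran_toy`, `parse_imports()` and the package names are all hypothetical).

```r
# Toy stand-in for tools::CRAN_package_db(); package names are invented
cran_toy <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Imports = c(
    "dplyr, lubridate (>= 1.7.0)",  # real import with a version constraint
    "lubridateExtra",               # substring match only, not a real import
    NA,                             # package with no Imports field
    "stats,\nlubridate"             # field wrapped over two lines
  ),
  stringsAsFactors = FALSE
)

# Split an Imports string into bare package names
parse_imports <- function(x) {
  if (is.na(x)) return(character(0))
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  parts <- sub("\\(.*\\)", "", parts)  # drop version constraints
  trimws(parts)                        # drop spaces and newlines
}

imports_lub <- vapply(
  cran_toy$Imports,
  function(x) "lubridate" %in% parse_imports(x),
  logical(1),
  USE.NAMES = FALSE
)

n_lub <- sum(imports_lub)           # packages importing lubridate
n_all <- nrow(cran_toy)             # all packages
pct   <- round(100 * n_lub / n_all) # percentage
```

With the real table the same arithmetic is 100 * 494 / 18515, which rounds to 3.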

InstALL

So, perhaps this is a little nuts, but we’re going to install all the {lubridate}-dependent packages because {itdepends} works with locally-installed packages.

tmp <- tempdir()  # temporary folder

purrr::walk(
  imports_lubridate,
  ~install.packages(
    .x, 
    destdir = tmp, 
    dependencies = FALSE,  # skip installing dependencies
    repos = "https://cran.ma.imperial.ac.uk/"  # mirror
  )
)

This takes a little while. There are probably faster methods, like maybe the {pak} package, but for now I just used what worked. I’ve also hidden the output, obviously. It’s also possible that some packages will error out and won’t install. Oh no! Ah well.
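
If you want to know which packages failed to install, one option is to compare the names you asked for against what is actually in your library. This is a minimal sketch with a hypothetical `wanted` vector standing in for `imports_lubridate`.

```r
# Hypothetical stand-in for imports_lubridate: two packages that ship
# with R, plus one name that should not exist in any library
wanted <- c("stats", "utils", "notARealPackage123")

# Names of everything installed in the current library paths
installed <- rownames(installed.packages())

# Anything wanted but not installed presumably failed (or was never tried)
not_installed <- setdiff(wanted, installed)
not_installed
```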

It depends on {itdepends}

{itdepends} is not available from CRAN, but you can install from GitHub.

remotes::install_github("jimhester/itdepends")

Now we can pass each package name to the dep_usage_pkg() function of {itdepends} in a loop. We get back a dataframe for each package, listing each function call it makes and the package that the function comes from.

I’ve added a mildly unorthodox use of next, borrowed from StackOverflow, because I was having trouble with the loop after a failure.

dep_list <- vector("list", length(imports_lubridate)) |> 
  setNames(imports_lubridate)

for (i in imports_lubridate) {
  
  skip <- FALSE
  
  tryCatch({ 
    dep_list[[i]] <- itdepends::dep_usage_pkg(i)
    dep_list[[i]]$focus <- i
  },
  error = function(e) { 
    # <<- is needed here: a plain <- would only assign inside the handler
    dep_list[[i]] <<- data.frame(
      pkg   = NA_character_,
      fun   = NA_character_,
      focus = NA_character_
    )
    skip <<- TRUE 
  })
  
  if (skip) next
  
}

I absolutely do not claim this to be the best, most optimised approach. But it works for me.
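
For comparison, here is a sketch of the same idea without the skip flag: wrap each call in tryCatch(), return NULL on failure, and let rbind() drop the NULLs. The `fake_usage()` function is a hypothetical stand-in for itdepends::dep_usage_pkg() so that the example runs without installing anything.

```r
# Hypothetical stand-in for itdepends::dep_usage_pkg(): fails for one input
fake_usage <- function(pkg) {
  if (pkg == "broken") stop("could not parse package")
  data.frame(
    pkg = c("base", "lubridate"),
    fun = c("<-", "ymd"),
    stringsAsFactors = FALSE
  )
}

pkgs <- c("good1", "broken", "good2")

# The handler's return value becomes the value of tryCatch(), so there is
# no need to assign into an outer variable from inside the handler
results <- lapply(pkgs, function(p) {
  tryCatch({
    out <- fake_usage(p)
    out$focus <- p
    out
  }, error = function(e) NULL)
})
names(results) <- pkgs

# Which packages failed, and the combined table for the rest
failed   <- pkgs[vapply(results, is.null, logical(1))]
combined <- do.call(rbind, results)  # NULL elements are ignored
```

Returning a value from the handler sidesteps the scoping question entirely, at the cost of looking a bit less like a traditional loop.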

Dependensheeesh

Now that {itdepends} has extracted all the function calls from each of the packages, we can take a look at their frequencies.

Example

Here’s the top 5 most-used functions from the first package alphabetically: {academictwitteR}.

ex_pkg <- "academictwitteR"

dep_list[[ex_pkg]] |> 
  count(pkg, fun, sort = TRUE) |>
  slice(1:5)

## # A tibble: 5 × 3
##   pkg   fun       n
##   <chr> <chr> <int>
## 1 base  <-      228
## 2 base  {       197
## 3 base  if      109
## 4 base  $        90
## 5 base  !        42

It’s not particularly exciting to know that the top 5 are made up of base R functions like the assignment arrow (<-), the curly brace ({) and the dollar-sign ($) data accessor². We also don’t really care about the package’s internal functions. Let’s filter out these packages and re-count.

base_pkgs <- sessionInfo()$basePkgs

dep_list[[ex_pkg]] |>
  filter(!pkg %in% c(base_pkgs, ex_pkg)) |> 
  count(pkg, fun, sort = TRUE) |> 
  slice(1:10)

## # A tibble: 10 × 3
##    pkg       fun                n
##    <chr>     <chr>          <int>
##  1 lifecycle deprecate_soft    16
##  2 magrittr  %>%               14
##  3 dplyr     bind_rows          8
##  4 dplyr     left_join          5
##  5 dplyr     select_if          5
##  6 httr      status_code        4
##  7 jsonlite  read_json          4
##  8 purrr     map_dfr            4
##  9 tibble    tibble             4
## 10 dplyr     distinct           3

Aha. We can see immediately that the authors have made use of tidyverse to write their package, since you can see {dplyr}, {tibble}, etc, in there. This makes the use of {lubridate} relatively unsurprising.

Here are the {lubridate} functions used by this package.

dep_list[[ex_pkg]] |>
  filter(pkg == "lubridate") |> 
  count(pkg, fun, sort = TRUE)

## # A tibble: 4 × 3
##   pkg       fun             n
##   <chr>     <chr>       <int>
## 1 lubridate as_datetime     1
## 2 lubridate seconds         1
## 3 lubridate with_tz         1
## 4 lubridate ymd_hms         1

So this package uses four {lubridate} functions for conversion and formatting of datetimes.

All packages

Now let’s take a look at the function calls across all the packages that import {lubridate}. I’m first going to convert the list of results to a dataframe.

dep_df <- do.call(rbind, dep_list)

Function use by package

This is a count of the number of uses of each {lubridate} function by each of the focus packages (i.e. the packages we installed).

pkg_fn_count <- dep_df |>
  filter(pkg == "lubridate") |>
  count(focus, fun, sort = TRUE)

pkg_fn_count |> slice(1:5)

## # A tibble: 5 × 3
##   focus        fun         n
##   <chr>        <chr>   <int>
## 1 PriceIndices month    1096
## 2 PriceIndices year      678
## 3 tidyndr      as_date    53
## 4 RClimacell   with_tz    52
## 5 RobinHood    ymd_hms    52

Holy moley, the {PriceIndices} package calls month() and year(), used to extract elements of a date, over 1700 times combined.

Unique function use by package

We can also look at things like the packages that make calls to the greatest number of unique {lubridate} functions. Here’s the top 5.

fn_distinct_count <- dep_df |>
  filter(pkg == "lubridate") |>
  distinct(focus, fun) |>
  count(focus, sort = TRUE) 

fn_distinct_count |> slice(1:5)

## # A tibble: 5 × 2
##   focus              n
##   <chr>          <int>
## 1 photobiology      26
## 2 mctq              25
## 3 fmdates           21
## 4 finbif            15
## 5 xml2relational    15

So these packages each use 15 or more unique functions from {lubridate}, which is pretty extensive usage. It may be especially tricky to do away with the convenience of the dependency in these cases.
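
The difference between counting unique functions and counting every call is easy to blur, so here is a small base R sketch on invented data (the `toy` dataframe and its package names are hypothetical, but the columns mirror the ones used in this post).

```r
# Invented call data with the same columns as the real results
toy <- data.frame(
  focus = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  pkg   = c("lubridate", "lubridate", "base", "lubridate", "lubridate", "base"),
  fun   = c("ymd", "ymd", "<-", "month", "year", "<-"),
  stringsAsFactors = FALSE
)

lub <- toy[toy$pkg == "lubridate", ]

# Unique functions: de-duplicate focus/fun pairs first, then count per package
distinct_pairs <- unique(lub[, c("focus", "fun")])
n_unique <- sort(table(distinct_pairs$focus), decreasing = TRUE)

# All calls: count every row per package, duplicates included
n_calls <- table(lub$focus)
```

Here pkgA calls ymd() twice, so it has two calls but only one unique function, while pkgB has two of each.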

Conversely, a quick histogram reveals that a large number of packages are actually using just a single {lubridate} function.

hist(
  fn_distinct_count$n,
  breaks = 30,
  main = "Unique {lubridate} functions used by\npackages importing {lubridate}",
  xlab = "Function count"
)

Histogram of unique lubridate functions used by the packages that import lubridate. The vast majority are using 1 or 2, with a long tail out to about 25.

Maybe the dependency could be dropped in these cases?

Out of interest, which {lubridate} function is the most frequent in packages that use just one?

focus_one_fn <- fn_distinct_count |>
  filter(n == 1) |>
  pull(focus)

pkg_fn_count |> 
  filter(focus %in% focus_one_fn) |> 
  count(fun, sort = TRUE) |> 
  slice(1:5)

## # A tibble: 5 × 2
##   fun             n
##   <chr>       <int>
## 1 as_datetime     7
## 2 as_date         6
## 3 ymd             6
## 4 ymd_hms         6
## 5 is.Date         4

Looks like some pretty standard functions: converting to a date or datetime (as_date(), as_datetime()), parsing strings in a particular format (ymd() for year-month-day, ymd_hms() for year-month-day plus hour, minute and second) and checking for the Date class (is.Date()).

I think this is interesting: some packages are importing {lubridate} in its entirety to use a single function. And these functions have base R equivalents with no package-dependency cost. Without diving too deep, this implies that people are using {lubridate} because of syntax familiarity or perhaps because they’re already loading other tidyverse packages anyway.
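
As a rough illustration of those base R equivalents, here is a sketch for well-formed ISO-style input. It is not a drop-in replacement: the {lubridate} parsers are more forgiving about separators and formats than as.Date() and as.POSIXct() are by default.

```r
# Roughly ymd("2021-11-28") or as_date("2021-11-28")
d <- as.Date("2021-11-28")

# Roughly ymd_hms("2021-11-28 14:30:00") or as_datetime(...), which use UTC
dt <- as.POSIXct("2021-11-28 14:30:00", tz = "UTC")

# Roughly is.Date(d)
is_date <- inherits(d, "Date")

# Roughly month(d) and year(d)
d_month <- as.integer(format(d, "%m"))
d_year  <- as.integer(format(d, "%Y"))
```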

Non-unique function use by package

What about total calls to {lubridate} functions by each of the dependent packages? This is non-unique, so could include one function being called multiple times by a given package.

fn_nondistinct_count <- dep_df |>
  filter(pkg == "lubridate") |>
  count(focus, sort = TRUE)

dep_df |> 
  count(focus) |> 
  left_join(
    fn_nondistinct_count,
    by = "focus",
    suffix = c("_total", "_lub")
  ) |> 
  mutate(percent_lub = round(100 * n_lub / n_total, 1)) |> 
  arrange(desc(percent_lub)) |>
  slice(1:5)

## # A tibble: 5 × 4
##   focus        n_total n_lub percent_lub
##   <chr>          <int> <int>       <dbl>
## 1 RClimacell      2241   225        10  
## 2 riem             113     9         8  
## 3 quantdates       534    42         7.9
## 4 rtrends          101     8         7.9
## 5 PriceIndices   23235  1805         7.8

Wow, 10% of calls by the {RClimacell} package involve {lubridate} functions. Makes sense: this package relates to weather readings at certain time intervals.
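
One thing to watch with a left join like this is that a package with no detected {lubridate} calls gets NA rather than zero. Here is a base R sketch of the same percentage calculation on invented counts (the `total` and `lub` dataframes are hypothetical), with the NAs filled in.

```r
# Invented per-package counts: all calls, and lubridate calls only
total <- data.frame(
  focus   = c("pkgA", "pkgB", "pkgC"),
  n_total = c(200L, 50L, 10L),
  stringsAsFactors = FALSE
)
lub <- data.frame(
  focus = c("pkgA", "pkgB"),  # pkgC made no lubridate calls
  n_lub = c(20L, 4L),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every package from 'total', like left_join()
joined <- merge(total, lub, by = "focus", all.x = TRUE)

# Treat a missing lubridate count as zero calls
joined$n_lub[is.na(joined$n_lub)] <- 0L

joined$percent_lub <- round(100 * joined$n_lub / joined$n_total, 1)
joined <- joined[order(-joined$percent_lub), ]
```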

And another quick histogram of what the distribution looks like.

hist(
  fn_nondistinct_count$n,
  breaks = 30,
  main = "Non-unique {lubridate} functions used by\npackages importing {lubridate}",
  xlab = "Function count"
)

Histogram of non-unique lubridate functions used by packages that import lubridate. The vast majority make fewer than 50 calls.

Huh, so the number of non-unique {lubridate} calls is almost always less than 50 per package. Seems in general that a small number of {lubridate} functions are called per dependent package, but they might be called a lot.

You do you

Does the information here imply that many developers could consider removing their small number of {lubridate} calls in favour of date-related base functions? Maybe. That’s up to the developers.

Ultimately, {itdepends} might be a useful tool for you to work out if you need all the dependencies you have. Other tools are out there; I read recently about Ashley Baldry’s {depcheck} package, for example.

It might be interesting to redo this investigation for all CRAN packages and their dependencies, but I don’t have a personal CRAN mirror and I don’t write particularly performant code.

Anyway, don’t listen to me: I write joke packages that I don’t put on CRAN, lol.

Session info

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.0 (2021-05-18)
##  os       macOS Big Sur 10.16         
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  ctype    en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2021-11-28                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
##  blogdown      1.4     2021-07-23 [1] CRAN (R 4.1.0)
##  bookdown      0.23    2021-08-13 [1] CRAN (R 4.1.0)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.0)
##  cli           3.1.0   2021-10-27 [1] CRAN (R 4.1.0)
##  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.0)
##  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
##  digest        0.6.28  2021-09-23 [1] CRAN (R 4.1.0)
##  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
##  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
##  generics      0.1.1   2021-10-25 [1] CRAN (R 4.1.0)
##  glue          1.5.0   2021-11-07 [1] CRAN (R 4.1.0)
##  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
##  knitr         1.36    2021-09-29 [1] CRAN (R 4.1.0)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.0)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
##  pillar        1.6.4   2021-10-18 [1] CRAN (R 4.1.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
##  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.0)
##  rmarkdown     2.10    2021-08-06 [1] CRAN (R 4.1.0)
##  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
##  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.1.0)
##  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
##  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.0)
##  tidyr       * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
##  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
##  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
##  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
##  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
##  xfun          0.26    2021-09-14 [1] CRAN (R 4.1.0)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

¹ Tim reminded me of this package/nerdsniped me.
² You should be aware of the international conspiracy behind the use of this symbol in R.

{itdepends} on {lubridate}