README

‘packageRank’ is an R package that helps put package download counts into context. It does so via two core functions, cranDownloads() and packageRank(), a set of filters that reduce download count inflation, and a host of other assorted functions.

getting started

install.packages("packageRank")

# You may need to first install 'remotes' via install.packages("remotes").
remotes::install_github("lindbrook/packageRank", build_vignettes = TRUE)

I - download counts

cranlogs::cran_downloads(packages = "HistData")

i) “spell check” for package names

cranDownloads(packages = "GGplot2")

cranDownloads(packages = "ggplot2")

cranDownloads(packages = "vr")

cranDownloads(packages = "VR")

ii) two additional date formats

With cranlogs::cran_downloads(), you specify a time frame using the from and to arguments. The downside of this is that you must use “yyyy-mm-dd”. For convenience’s sake, cranDownloads() also allows you to use “yyyy-mm” or yyyy (“yyyy” also works).

“yyyy-mm”

Let’s say you want the download counts for ‘HistData’ for February 2020. With cranlogs::cran_downloads(), you’d have to type out the whole date and remember that 2020 was a leap year:

cranlogs::cran_downloads(packages = "HistData", from = "2020-02-01",
  to = "2020-02-29")

cranDownloads(packages = "HistData", from = "2020-02", to = "2020-02")
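To make the convenience concrete, here is a minimal base R sketch of how a “yyyy-mm” string could be expanded into the full “yyyy-mm-dd” range that cranlogs::cran_downloads() expects. The helper month_range() is hypothetical and is not part of ‘packageRank’; it only illustrates the leap-year bookkeeping you no longer have to do by hand.

```r
# Hypothetical helper (not part of 'packageRank'): expand "yyyy-mm" into the
# first and last day of that month.
month_range <- function(ym) {
  first <- as.Date(paste0(ym, "-01"))
  # The first day of the following month, minus one day, is the month's last day.
  next_month <- seq(first, by = "month", length.out = 2)[2]
  c(from = first, to = next_month - 1)
}

month_range("2020-02")  # leap year
month_range("2021-02")  # non-leap year
```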

yyyy or “yyyy”

Let’s say you want the download counts for ‘rstan’ for 2020. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2020-01-01",
  to = "2020-12-31")

cranDownloads(packages = "rstan", from = 2020, to = 2020)

cranDownloads(packages = "rstan", from = "2020", to = "2020")

iii) shortcuts with from = and to = in cranDownloads()

These additional date formats help to create convenient shortcuts. Let’s say you want the year-to-date download counts for ‘rstan’. With cranlogs::cran_downloads(), you’d type something like:

cranlogs::cran_downloads(packages = "rstan", from = "2023-01-01",
  to = Sys.Date() - 1)

cranDownloads(packages = "rstan", from = 2023)

cranDownloads(packages = "rstan", to = 2023)

iv) check date validity

cranDownloads(packages = "HistData", from = "2019-01-15",
  to = "2019-01-35")
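For a sense of what a validity check involves, the sketch below flags impossible dates using base R only. It is an illustration under my own assumptions, not the check cranDownloads() actually performs; is_valid_date() is a hypothetical name.

```r
# Hypothetical helper: TRUE if the string is a real calendar date in
# "yyyy-mm-dd" form. as.Date() with an explicit format returns NA when the
# string cannot be parsed as a date.
is_valid_date <- function(x) {
  !is.na(as.Date(x, format = "%Y-%m-%d"))
}

is_valid_date(c("2019-01-15", "2019-01-35"))
```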

v) cumulative count for selected time frame

cranDownloads(packages = "HistData", when = "last-week")

pro.mode

The “spell check” or validation of packages described above requires some additional background downloads. While those data are cached via the ‘memoise’ package, this will add time the first time cranDownloads() is run. For faster results, you can bypass those features by setting pro.mode = TRUE. The downside is that you’ll see zero downloads for packages on dates before they’re published on CRAN and zero downloads for mis-spelled/non-existent packages. You also won’t be able to use the to = argument by itself.

For example, ‘packageRank’ was first published on CRAN on 2019-05-16 (you can verify this via packageHistory("packageRank")). But if you use cranlogs::cran_downloads() or cranDownloads(pro.mode = TRUE) with a time frame that starts before that date, you’ll see zero downloads for the dates before 2019-05-16:

cranDownloads("packageRank", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)
>         date count cumulative     package
> 1 2019-05-10     0          0 packageRank
> 2 2019-05-11     0          0 packageRank
> 3 2019-05-12     0          0 packageRank
> 4 2019-05-13     0          0 packageRank
> 5 2019-05-14     0          0 packageRank
> 6 2019-05-15     0          0 packageRank
> 7 2019-05-16    68         68 packageRank

You’ll notice this particularly when you’re including newer packages in cranDownloads().

cranDownloads("vr", from = "2019-05-10", to = "2019-05-16", pro.mode = TRUE)
>         date count cumulative package
> 1 2019-05-10     0          0      vr
> 2 2019-05-11     0          0      vr
> 3 2019-05-12     0          0      vr
> 4 2019-05-13     0          0      vr
> 5 2019-05-14     0          0      vr
> 6 2019-05-15     0          0      vr
> 7 2019-05-16     0          0      vr

cranDownloads(to = 2024, pro.mode = TRUE)

visualizing package download counts

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"))

If you pass a vector of package names for a single day, plot() returns a dotchart:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020-03-01", to = "2020-03-01"))

If you pass a vector of package names for multiple days, plot() uses ‘ggplot2’ facets:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"))

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), multi.plot = TRUE)

To plot those data in separate plots on the same scale, set graphics = "base" and you’ll be prompted for each plot:

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), graphics = "base")

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), graphics = "base", same.xy = FALSE)

logarithm of download counts

To use the base 10 logarithm of the download count in a plot, set log.y = TRUE:

plot(cranDownloads(packages = "HistData", from = "2019", to = "2019"),
  log.y = TRUE)

Note that for the sake of the plot, zero counts are replaced by ones so that the logarithm can be computed (This does not affect the data returned by cranDownloads()).
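The replacement is needed because log10(0) is -Inf. A minimal sketch of the idea, using made-up counts (the variable names are illustrative and not taken from the package):

```r
# Made-up daily counts, including a zero.
counts <- c(0, 1, 10, 250)

# For plotting only: swap zeros for ones so the logarithm is finite.
plot_counts <- ifelse(counts == 0, 1, counts)
log_counts <- log10(plot_counts)

log_counts  # the zero count is drawn at log10(1) = 0
counts      # the original data are left untouched
```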

packages = NULL

cranlogs::cran_downloads(packages = NULL) computes the total number of package downloads from CRAN. You can plot these data by using:

plot(cranDownloads(from = 2019, to = 2019))

packages = "R"

cranlogs::cran_downloads(packages = "R") computes the total number of downloads of the R application (note that you can only use “R” or a vector of package names, not both!). You can plot these data by using:

plot(cranDownloads(packages = "R", from = 2019, to = 2019))
> Missing: 2025-08-25, 2025-08-26, 2025-08-29, 2025-08-30, 2025-08-31, 2025-09-01, 2025-09-02

plot(cranDownloads(packages = "R", from = 2019, to = 2019), r.total = TRUE)

Note that there have been spikes in downloads of the Windows version of R on Sundays (since Sunday, 06 November 2022) and on Wednesdays (since Wednesday, 18 January 2023). Details are below in R Windows Sunday and Wednesday downloads.

smoothers and confidence intervals

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
  smooth = TRUE)

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
  from = "2020", to = "2020-03-20"), smooth = TRUE, se = TRUE)

In general, loess is the chosen smoother. Note that with base graphics, lowess is used when there are 7 or fewer observations. Thus, to control the degree of smoothness, you’ll typically use the span argument (the default is span = 0.75). With base graphics with 7 or fewer observations, you control the degree of smoothness using the f argument (the default is f = 2/3):

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
  from = "2020", to = "2020-03-20"), smooth = TRUE, span = 0.75)

plot(cranDownloads(packages = c("HistData", "rnaturalearth", "Zelig"),
  from = "2020", to = "2020-03-20"), smooth = TRUE, graphics = "ggplot2", 
  span = 0.33)
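The rule described above (loess in general, lowess for short base graphics series) can be sketched with the two smoothers that ship with R’s ‘stats’ package. This is an illustration on synthetic data, not the plot method’s actual code; smooth_counts() is a hypothetical helper.

```r
# Hypothetical helper: lowess() for 7 or fewer observations, loess() otherwise.
smooth_counts <- function(x, y, span = 0.75, f = 2 / 3) {
  if (length(y) <= 7) {
    stats::lowess(x, y, f = f)$y
  } else {
    as.numeric(stats::predict(stats::loess(y ~ x, span = span)))
  }
}

# Synthetic "daily counts": 30 observations use loess, 5 use lowess.
x <- 1:30
y <- 100 + 10 * sin(x / 3)
long_fit <- smooth_counts(x, y, span = 0.75)
short_fit <- smooth_counts(1:5, c(3, 5, 4, 6, 8), f = 2 / 3)
```

A smaller span (or f) follows the data more closely; a larger one gives a smoother curve.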

package, R and ChatGPT release dates (base graphics)

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
  package.version = TRUE, unit.observation = "week")

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
  r.version = TRUE, unit.observation = "week")

plot(cranDownloads(packages = "R", from = "2020-12", to = "2025-01"),
  chatgpt = TRUE, r.total = TRUE, unit.observation = "week")
> Missing: 2025-08-25, 2025-08-26, 2025-08-29, 2025-08-30, 2025-08-31, 2025-09-01, 2025-09-02

If you pass “line” to the package.version, r.version or chatgpt argument, a vertical line will be drawn on the plot:

plot(cranDownloads(packages = "rstan", from = "2019", to = "2019"),
  package.version = "line", unit.observation = "week")

weekends (base graphics)

With unit.observation = “day”, you can highlight weekends with an empty circle by setting weekend = TRUE:

plot(cranDownloads(packages = "rstan", from = "2024-06", to = "2024-06"), weekend = TRUE)

plot growth curves (cumulative download counts)

plot(cranDownloads(packages = c("ggplot2", "data.table", "Rcpp"),
  from = "2020", to = "2020-03-20"), statistic = "cumulative",
  multi.plot = TRUE, points = FALSE)

population plot

To visualize a package’s downloads relative to “all” other packages over time:

plot(cranDownloads(packages = "HistData", from = "2020", to = "2020-03-20"),
  population.plot = TRUE)

This longitudinal view plots the date (x-axis) against the base 10 logarithm of the selected package’s download counts (y-axis). To get a sense of how the selected package’s performance stacks up against “all” other packages, a set of smoothed curves representing a stratified random sample of packages is plotted in gray in the background (this is the “typical” pattern of downloads on CRAN for the selected time period).¹

unit of observation

The default unit of observation for both cranDownloads() and cranlogs::cran_downloads() is the day. The graph below plots the daily downloads for ‘cranlogs’ from 01 January 2022 through 15 April 2022.

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"))

To view the data from a less granular perspective, change plot.cranDownloads()’s unit.observation argument from “day” to “week”, “month”, or “year”.

unit.observation = "month"

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-04-15"),
  unit.observation = "month", smooth = TRUE, graphics = "ggplot2")

Three things to note. First, if the last/current month (far right) is still in-progress (it’s not yet the end of the month), that observation will be split in two: one point for the in-progress total (empty black square), another for the estimated total (empty red circle). The estimate is based on the proportion of the month completed. In the example above, the 635 observed downloads from April 1 through April 15 translates into an estimate of 1,270 downloads for the entire month (30 / 15 * 635). Second, if a smoother is included, it will only use “complete” observations, not in-progress or estimated data. Third, all points are plotted along the x-axis on the first day of the month.
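The arithmetic behind the estimate can be reproduced in a few lines of base R. The sketch below uses the figures quoted above (635 downloads through 15 April 2022); the variable names are mine and the package’s internal code may differ.

```r
observed <- 635
to_date <- as.Date("2022-04-15")

# Days elapsed in the month, counting the 'to' date itself.
days_elapsed <- as.integer(format(to_date, "%d"))

# Length of the month: first day of next month minus first day of this month.
first <- as.Date(format(to_date, "%Y-%m-01"))
days_in_month <- as.integer(seq(first, by = "month", length.out = 2)[2] - first)

# Scale the in-progress total by the proportion of the month completed.
estimate <- observed * days_in_month / days_elapsed
estimate  # 1270
```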

unit.observation = "week"

plot(cranDownloads(packages = "cranlogs", from = 2022, to = "2022-06-15"),
  unit.observation = "week", smooth = TRUE)

Four things to note. First, if the first week (far left) is incomplete (the ‘from’ date is not a Sunday), that observation will be split in two: one point for the observed total on the start date (gray empty square) and another point for the backdated total. Backdating involves completing the week by pushing the nominal start date back to include the previous Sunday (blue asterisk). In the example above, the nominal start date (01 January 2022) is moved back to include data through the previous Sunday (26 December 2021). This is useful because with a weekly unit of observation the first observation is likely to be truncated and would not give the most representative picture of the data. Second, if the last week (far right) is in-progress (the ‘to’ date is not a Saturday), that observation will be split in two: the observed total (gray empty square) and the estimated total based on the proportion of week completed (red empty circle). Third, just like the monthly plot, smoothers only use complete observations, including backdated data but excluding in-progress and estimated data. Fourth, with the exception of first week’s observed count, which is plotted at its nominal date, points are plotted along the x-axis on Sundays, the first day of the week.
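Both adjustments come down to simple date arithmetic. The sketch below backdates the start date from the example to the previous Sunday and scales a made-up in-progress weekly total. The count of 40 is hypothetical, and I’m assuming the week’s estimate counts the ‘to’ date as a completed day, as in the monthly example.

```r
from <- as.Date("2022-01-01")  # a Saturday
to <- as.Date("2022-06-15")    # a Wednesday

# format(x, "%w") gives the weekday as a number, with 0 = Sunday.
days_since_sunday <- as.integer(format(from, "%w"))
backdated_start <- from - days_since_sunday
backdated_start  # 2021-12-26

# Proportion of the last week completed (Sunday through the 'to' date).
days_completed <- as.integer(format(to, "%w")) + 1
observed_week <- 40  # hypothetical in-progress total
estimated_week <- observed_week * 7 / days_completed
estimated_week  # 70
```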

my default plots

For what it’s worth, below are my go-to commands for graphs. They take advantage of RStudio IDE’s plot history panel, which allows you to cycle through and compare graphs. Typically, I’ll look at the data for the last year or so at the three available units of observation: day, week and month. I use base graphics, via graphics = "base", to take advantage of prompts and “nicer” axes annotation. This also allows me to easily add graphical elements afterwards as needed, e.g., abline(h = 100, lty = "dotted").

plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
  graphics = "base", package.version = TRUE, smooth = TRUE, 
  unit.observation = "day")

plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
  graphics = "base", package.version = TRUE, smooth = TRUE, 
  unit.observation = "week")

# Note that I disable smoothing for monthly data
plot(cranDownloads(packages = c("cholera", "packageRank"), from = 2023),
  graphics = "base", package.version = TRUE, smooth = FALSE, 
  unit.observation = "month")

pro.mode

Perhaps the biggest downside of using cranDownloads(pro.mode = TRUE) is that you might draw mistaken inferences from plotting the data since it adds false zeroes to your data.

plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05", 
  pro.mode = TRUE), smooth = TRUE)

plot(cranDownloads("packageRank", from = "2019-05", to = "2019-05", 
  pro.mode = FALSE), smooth = TRUE)

II - download percentile ranks

After spending some time with nominal download counts, the “compared to what?” question will come to mind. For instance, consider the data for the ‘cholera’ package from the first week of March 2020:

plot(cranDownloads(packages = "cholera", from = "2020-03-01",
  to = "2020-03-07"))

Do Wednesday and Saturday reflect surges of interest in the package or surges of traffic to CRAN? To put it differently, how can we know if a given download count is typical or unusual?

To answer these questions, we can start by looking at the total number of package downloads:

plot(cranDownloads(from = "2020-03-01", to = "2020-03-07"))

Here we see that there’s a big difference between the work week and the weekend. This suggests that the download activity for ‘cholera’ on the weekend is relatively high. Moreover, the Wednesday peak for ‘cholera’ downloads seems higher than the mid-week peak of total downloads.

One way to better address these observations is to locate your package’s download counts in the overall frequency distribution of download counts. ‘packageRank’ allows you to do so via packageDistribution(). Below are the distributions of the logarithm of download counts for Wednesday and Saturday. Each vertical segment (along the x-axis) represents a download count. The height of a segment represents that download count’s frequency. The location of ‘cholera’ in the distribution is highlighted in red.

plot(packageDistribution(package = "cholera", date = "2020-03-04"))

plot(packageDistribution(package = "cholera", date = "2020-03-07"))

While these plots give us a better picture of where ‘cholera’ is located, comparisons between Wednesday and Saturday are still impressionistic: all we can confidently say is that the download counts for both days were greater than the mode.

To facilitate interpretation and comparison, I use the percentile rank of a download count instead of the simple nominal download count. This nonparametric statistic tells you the percentage of packages that had fewer downloads. In other words, it gives you the location of your package relative to the locations of all other packages. More importantly, by rescaling download counts to lie on the bounded interval between 0 and 100, percentile ranks make it easier to compare packages within and across distributions.

For example, we can compare Wednesday (“2020-03-04”) to Saturday (“2020-03-07”):

packageRank(package = "cholera", date = "2020-03-04")
>         date package count            rank percentile
> 1 2020-03-04 cholera    38 5,788 of 18,038       67.9

On Wednesday, we can see that ‘cholera’ had 38 downloads, came in 5,788th place out of the 18,038 different packages downloaded, and earned a spot in the 68th percentile.

packageRank(package = "cholera", date = "2020-03-07")
>         date package count            rank percentile
> 1 2020-03-07 cholera    29 3,189 of 15,950         80

On Saturday, we can see that ‘cholera’ had 29 downloads, came in 3,189th place out of the 15,950 different packages downloaded, and earned a spot in the 80th percentile.

So contrary to what the nominal counts tell us, one could say that the interest in ‘cholera’ was actually greater on Saturday than on Wednesday.

computing percentile rank

To compute percentile ranks, I do the following. For each package, I tabulate the number of downloads and then compute the percentage of packages with fewer downloads. Here are the details using ‘cholera’ from Wednesday as an example:

pkg.rank <- packageRank(packages = "cholera", date = "2020-03-04")

downloads <- pkg.rank$cran.data$count
names(downloads) <- pkg.rank$cran.data$package

round(100 * mean(downloads < downloads["cholera"]), 1)
> [1] 67.9

(pkgs.with.fewer.downloads <- sum(downloads < downloads["cholera"]))
> [1] 12250

(tot.pkgs <- length(downloads))
> [1] 18038

round(100 * pkgs.with.fewer.downloads / tot.pkgs, 1)
> [1] 67.9

competition v. nominal ranks

In the example above, 38 downloads puts ‘cholera’ in 5,788th place if we allow for ties using competition (i.e., “1224” ranking) and 5,556th place if we don’t by using nominal/ordinal (i.e., “1234” ranking).

Prior to v0.9.2.9008, only nominal/ordinal ranking was available. Competition ranking is now the default via packageRank(rank.ties = TRUE). If you want ordinal ranking, use packageRank(rank.ties = FALSE).
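To see the difference between the two schemes on a small scale, here is a toy example using base R’s rank(). The four packages and their counts are made up, and this only illustrates the “1224” and “1234” labels; it is not necessarily how packageRank() computes ranks internally.

```r
# Made-up download counts; 'b' and 'c' are tied.
downloads <- c(a = 50, b = 38, c = 38, d = 20)

# Negate so that the largest count gets rank 1.
competition <- rank(-downloads, ties.method = "min")    # ties share a rank
ordinal <- rank(-downloads, ties.method = "first")      # ties broken by order

competition  # a = 1, b = 2, c = 2, d = 4
ordinal      # a = 1, b = 2, c = 3, d = 4
```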

visualizing package download percentile ranks

plot(packageRank(packages = "cholera", date = "2020-03-04"))

plot(packageRank(packages = "cholera", date = "2020-03-07"))

The graphs above, which are customized here to be on the same scale, plot the rank order of packages’ download counts (x-axis) against the logarithm of those counts (y-axis). They then highlight (in red) a package’s position in the distribution along with its percentile rank and download count. In the background, the 75th, 50th and 25th percentiles are plotted as dotted vertical lines. The package with the most downloads, ‘magrittr’ in both cases, is at top left (in blue). The total number of downloads is at the top right (in blue).

III - inflation filters

‘cranlogs’ computes the number of package downloads by simply counting log entries. While straightforward, this approach can run into problems. Putting aside the question of whether package dependencies should be counted, what I have in mind here is what I believe to be two types of “invalid” log entries. The first, a software artifact, stems from entries that are smaller, often orders of magnitude smaller, than a package’s actual binary or source file. The second, a behavioral artifact, emerges from efforts to download all of CRAN. In both cases, a reliance on nominal counts will give you an inflated sense of the degree of interest in your package. For those interested, an early but detailed analysis and discussion of both types of inflation is included as part of this R-hub blog post.

software artifacts

When looking at package download logs, the first thing you’ll notice are wrongly sized log entries. They come in two sizes. The “small” entries are approximately 500 bytes in size. The “medium” entries vary in size, falling somewhere between a “small” entry and a full download (i.e., “small” <= “medium” <= full download). “Small” entries manifest themselves as standalone entries, paired with a full download, or as part of a triplet alongside a “medium” and a full download. “Medium” entries manifest themselves as either standalone entries or as part of a triplet.

packageLog(date = "2020-07-01")[4:6, -(4:6)]
>               date     time    size package version country ip_id
> 3998633 2020-07-01 07:56:15   99622 cholera   0.7.0      US  4760
> 3999066 2020-07-01 07:56:15 4161948 cholera   0.7.0      US  4760
> 3999178 2020-07-01 07:56:15     536 cholera   0.7.0      US  4760

The “medium” entry is the first observation (99,622 bytes). The full download is the second entry (4,161,948 bytes). The “small” entry is the last observation (536 bytes). At a minimum, what makes a triplet a triplet (or a pair a pair) is that all members share system configuration (e.g. IP address, etc.) and have identical or adjacent time stamps.

To deal with the inflationary effect of “small” entries, I filter out observations smaller than 1,000 bytes (the smallest package on CRAN appears to be ‘LifeInsuranceContracts’, whose source file weighs in at 1,100 bytes). “Medium” entries are harder to handle. I remove them using a filter function that looks up a package’s actual size.
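The 1,000 byte threshold is easy to illustrate with the triplet shown earlier. The sketch below rebuilds those three entries as a plain data frame (a stand-in for a real log, not the output of packageLog()) and applies the size cutoff:

```r
# The three 'cholera' entries from the triplet above, as a stand-in data frame.
log_entries <- data.frame(
  package = "cholera",
  size = c(99622, 4161948, 536),
  ip_id = 4760
)

# Drop "small" entries: anything under 1,000 bytes.
filtered <- log_entries[log_entries$size >= 1000, ]
nrow(filtered)  # 2
```

Note that the 99,622 byte “medium” entry survives this cutoff, which is why it needs a separate, size-aware filter.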

behavioral artifacts

While wrongly sized entries are fairly easy to spot, seeing the effect of efforts to download all of CRAN requires a change of perspective. While details and further evidence can be found in the R-hub blog post mentioned above, I’ll illustrate the problem with the following example:

packageLog(packages = "cholera", date = "2020-07-31")[8:14, -(4:6)]

Here, we see that seven different versions of the package were downloaded as a sequential bloc. A little digging shows that these seven versions represent all versions of ‘cholera’ available on that date:

packageHistory(package = "cholera")

While there are “legitimate” reasons for downloading past versions (e.g., research, container-based software distribution, etc.), I’d argue that examples like the above are “fingerprints” of efforts to download CRAN. While this is not necessarily problematic, it does mean that when your package is downloaded as part of such efforts, that download is more a reflection of an interest in CRAN itself (a collection of packages) than of an interest in your package per se. And since one of the uses of counting package downloads is to assess interest in your package, it may be useful to exclude such entries.

To do so, I try to filter out these entries in two ways. The first identifies IP addresses that download “too many” packages and then filters out campaigns, large blocs of downloads that occur in (nearly) alphabetical order. The second looks for campaigns not associated with “greedy” IP addresses and filters out sequences of past versions downloaded in a narrowly defined time window.

example usage

To get an idea of how inflated your package’s download count may be, use filteredDownloads(). Below are the results for ‘ggplot2’ for 15 September 2021.

filteredDownloads(package = "ggplot2", date = "2021-09-15")
>         date package downloads filtered.downloads delta inflation
> 1 2021-09-15 ggplot2    113842             111662  2180    1.95 %

While there were 113,842 nominal downloads, applying all the filters reduced that number to 111,662, an inflation of 1.95%.
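For reference, the delta and inflation columns can be reproduced by hand from the two counts. In the sketch below, dividing by the filtered count is what matches the reported 1.95%; that choice of denominator is inferred from the output above rather than taken from the package’s source.

```r
downloads <- 113842  # nominal count
filtered <- 111662   # count after all filters

delta <- downloads - filtered
inflation <- round(100 * delta / filtered, 2)

delta      # 2180
inflation  # 1.95
```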

Excluding the time it takes to download the log file (typically the bulk of the computation time), the above example takes approximately 15 additional seconds to run on a single core on a 3.1 GHz Dual-Core Intel Core i5 processor.

There are 4 filters. You can control them using the following arguments (listed in order of application):

For filteredDownloads(), they are all on by default. For packageLog() and packageRank(), they are off by default. To apply them, simply set the argument for the filter you want to TRUE:

packageRank(package = "cholera", small.filter = TRUE)

Alternatively, for packageLog() and packageRank() you can simply set all.filters = TRUE.

packageRank(package = "cholera", all.filters = TRUE)

Note that all.filters = TRUE is contextual. Depending on the function used, you’ll either get the CRAN-specific or the package-specific set of filters. The former sets ip.filter = TRUE and small.filter = TRUE; it works independently of packages at the level of the entire log. The latter sets sequence.filter = TRUE and size.filter = TRUE; it relies on package-specific information (e.g., size of source or binary file).

Ideally, we’d like to use both sets. However, the package-specific set is computationally expensive because its filters need to be applied individually to all packages in the log, which can involve tens of thousands of packages. While not unfeasible, currently this takes a long time. For this reason, when all.filters = TRUE, packageRank(), ipPackage(), countryPackage(), countryDistribution() and packageDistribution() use only CRAN-specific filters while packageLog(), packageCountry(), and filteredDownloads() use both CRAN-specific and package-specific filters.

IV - availability of results

To understand when results become available, you need to be aware that ‘packageRank’ has two upstream, online dependencies. The first is Posit/RStudio’s CRAN package download logs, which record traffic to the “0-Cloud” mirror at cloud.r-project.org (formerly Posit/RStudio’s CRAN mirror). The second is Gábor Csárdi’s ‘cranlogs’ R package, which uses those logs to compute the download counts of both the R application and R packages.

The CRAN package download logs for the previous day are typically posted by 17:00 UTC. The results for ‘cranlogs’ usually become available soon thereafter (sometimes as much as a day later).

why aren’t today’s logs and results available?

Occasionally problems with “today’s” data can emerge due to the upstream dependencies (illustrated below).

If there’s a problem with the logs (e.g., they’re not posted on time), both ‘cranlogs’ and ‘packageRank’ will be affected. If this happens, you’ll see things like an unexpected zero count(s) for your package(s) (actually, you’ll see a zero download count for both your package and for all of CRAN), data from “yesterday”, or a “Log is not (yet) on the server” error message.

If there’s a problem with ‘cranlogs’ but not with the logs, only packageRank::cranDownloads() will be affected. In that case, you might get a warning that only “previous” results will be used. All other ‘packageRank’ functions should work since they either directly access the logs or use some other source. Usually, these errors resolve themselves the next time the underlying scripts are run (“tomorrow”, if not sooner).

logInfo()

To check the status of the download logs and ‘cranlogs’, use logInfo(). This function checks whether 1) “today’s” log is posted on Posit/RStudio’s server and 2) “today’s” results have been computed by ‘cranlogs’.

logInfo()

time zones

Because you’re typically interested in today’s log file, another thing that affects availability is your time zone. For example, let’s say that it’s 09:01 on 01 January 2021 and you want to compute the percentile rank for ‘ergm’ for the last day of 2020. You might be tempted to use the following:

packageRank(packages = "ergm")

However, depending on where you make this request, you may not get the data you expect. In Honolulu, USA, you will. In Sydney, Australia you won’t. The reason is that you’ve somehow forgotten a key piece of trivia: Posit/RStudio typically posts yesterday’s log around 17:00 UTC the following day.

The expression works in Honolulu because 09:01 HST on 01 January 2021 is 19:01 UTC on 01 January 2021, so the log you want has been available for 2 hours. The expression fails in Sydney because 09:01 AEDT on 01 January 2021 is 22:01 UTC on 31 December 2020, so the log you want won’t actually be available for another 19 hours.
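You can check these conversions yourself with base R’s date-time functions. The sketch below converts both local times to UTC and measures the gap to the nominal 17:00 UTC posting time; it assumes your R installation has the standard time zone database (the “Pacific/Honolulu” and “Australia/Sydney” names come from it).

```r
honolulu <- as.POSIXct("2021-01-01 09:01", tz = "Pacific/Honolulu")
sydney <- as.POSIXct("2021-01-01 09:01", tz = "Australia/Sydney")

# The same instants expressed in UTC.
honolulu_utc <- format(honolulu, "%Y-%m-%d %H:%M", tz = "UTC")
sydney_utc <- format(sydney, "%Y-%m-%d %H:%M", tz = "UTC")

# Nominal posting time for the 31 December 2020 log.
posted <- as.POSIXct("2021-01-01 17:00", tz = "UTC")

hours_since_posting <- as.numeric(difftime(honolulu, posted, units = "hours"))
hours_until_posting <- as.numeric(difftime(posted, sydney, units = "hours"))

honolulu_utc                # "2021-01-01 19:01"
sydney_utc                  # "2020-12-31 22:01"
round(hours_since_posting)  # 2
round(hours_until_posting)  # 19
```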

To make life a little easier, ‘packageRank’ does two things. First, when the log for the date you want is not available (due to time zone rather than server issues), you’ll just get the last available log. If you specified a date in the future, you’ll either get an error message or a warning with an estimate of when the log you want should be available.

Using the Sydney example and the expression above, you’d get the results for 30 December 2020:

packageRank(packages = "ergm")

packageRank(packages = "ergm", date = "2021-01-01")

Keep in mind that 17:00 UTC is not a hard deadline. Barring server issues, the logs are usually posted a little before that time. I don’t know when the script starts but the posting time seems to be a function of the number of entries: closer to 17:00 UTC when there are more entries (e.g., weekdays); earlier than 17:00 UTC when there are fewer entries (e.g., weekends). Again, barring server issues, the ‘cranlogs’ results are usually available before 18:00 UTC.

logInfo(details = TRUE)

The function uses your local time zone, which depends on R’s ability to compute your local time and time zone (e.g., Sys.time() and Sys.timezone()). My understanding is that there may be operating system or platform specific issues that could undermine this.

V - Reverse lookup of counts, ranks and percentiles

To query the log for a specific count, rank or percentile rank, use the functions below:

queryCount()

To find the packages that had 100 downloads (the default is 1, the lowest number of observable downloads):

queryCount(100)

queryRank()

To find the package that was ranked 20th in downloads (the default is 1st, the most downloaded package):

queryRank(20)

queryPercentile()

If you want the packages with a particular percentile rank, use queryPercentile(). Note that due to the discrete nature of counts, your choice of percentile may not be available because it may fall in one of the vertical gaps in the observed data:

For this reason, queryPercentile() rounds your selection to whole numbers. Also, the default value, which is set to 50, uses median() to guarantee a result.

# head() is used because there will be many observations with median count.
head(queryPercentile())

You can also set a range of percentile ranks using the ‘lo’ and/or ‘hi’ arguments. If you get an error message, you may need to widen your interval:

head(queryPercentile(lo = 95, hi = 96), 3)
tail(queryPercentile(lo = 95, hi = 96), 3)

cranDistribution()

The above functions leverage cranDistribution(), which computes the ranks and the distribution of download counts for a given day’s log.

Its print method provides the date, the number of unique packages downloaded, the total number of downloads (the total number of rows/observations in the log) and the count and rank data for the top 20 packages:

cranDistribution()

Note that if you want to specify the number of top N packages, you’ll have to explicitly use print() and its ‘top.n’ argument:

print(cranDistribution(), top.n = 7)

queryRank(1:7)

The summary method provides the number of unique packages downloaded, the total number of downloads and the five number summary (plus the arithmetic mean):

summary(cranDistribution())

The plot method graphs the distribution of base 10 logarithm of download counts. Each plot is annotated with the median, mean and maximum download counts, as well as the total number of downloads and the total number of unique packages observed.

plot(cranDistribution())

VI - data fixes

The first data problem involves logs collected between late 2012 and the beginning of 2013. It’s a bit complicated. To understand it, we need to know that the Posit/RStudio download logs are stored as separate files with a name/URL that embeds the log’s date:

For the logs in question, this convention was broken in three ways: i) some logs are effectively duplicated (same log, multiple names), ii) at least one is mislabeled and iii) the logs from 13 October through 28 December are offset by +3 days (e.g., the file with the name/URL “2012-12-01” contains the log for “2012-11-28”). As a result, we get erroneous download counts and we actually lose the last three logs of 2012. Details are available here.
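The offset itself is simple date arithmetic. The sketch below maps a log’s actual date to the filename date it was stored under, assuming that the 13 October through 28 December range refers to the logs’ actual dates (which is what would push the last three logs of 2012 past the end of the year). file_date_for_log() is a hypothetical helper; it ignores the duplicated and mislabeled logs and is not the code used by fixDate_2012().

```r
# Hypothetical helper: filename date under which a given day's log was stored.
file_date_for_log <- function(log_date) {
  log_date <- as.Date(log_date)
  # Assumption: the +3 day offset applies to logs dated 2012-10-13 to 2012-12-28.
  offset <- log_date >= as.Date("2012-10-13") & log_date <= as.Date("2012-12-28")
  log_date + ifelse(offset, 3, 0)
}

file_date_for_log("2012-11-28")  # stored as "2012-12-01"
file_date_for_log("2012-10-01")  # unaffected
```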

Functions that rely on cranlogs::cran_downloads() (e.g., ‘packageRank::cranDownloads()’, ‘adjustedcranlogs’ and ‘dlstats’) are susceptible to the first error - duplicate names. My understanding is that this is because ‘cranlogs’ uses the date in a log rather than the filename/URL to retrieve logs. To put it differently, ‘cranlogs’ can’t detect multiple instances of logs with the same date. I found 3 logs with duplicate filename/URLs, and 5 additional instances of overcounting (including one of tripling). ‘fixCranlogs()’ addresses this overcounting problem behind the scenes by recomputing the download counts using the actual log(s) when any of the eight problematic dates are requested. Details about the 8 days and fixCranlogs() can be found here.

Functions that access logs via their filename/URL, e.g., packageRank() and packageLog(), are affected by the second and third defects - mislabeled and offset logs. fixDate_2012() addresses this, in the background, by re-mapping problematic logs so you get the log you expect.
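The +3 day offset described above can be sketched as a simple date re-mapping. This is only an illustration of the idea, not the code in fixDate_2012(). The function name offsetFileDate() is hypothetical, and the sketch assumes that the range 13 October through 28 December refers to the dates of the logs themselves, so that the file named with the date three days later holds the requested log. It ignores the duplicated and mislabeled logs.

```r
# Hypothetical helper: given the date of the log you want, return the date
# embedded in the name/URL of the file that actually contains it.
offsetFileDate <- function(date) {
  date <- as.Date(date)
  first <- as.Date("2012-10-13")
  last <- as.Date("2012-12-28")
  # Only logs inside the affected range are shifted.
  shifted <- date >= first & date <= last
  date[shifted] <- date[shifted] + 3
  date
}

# The log for 2012-11-28 sits in the file named 2012-12-01.
offsetFileDate("2012-11-28")
```

Under this mapping the files named 2012-12-29 through 2012-12-31 hold the logs for 26 through 28 December, which is why the last three logs of 2012 are lost.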

The second data problem is of more recent vintage. From 2023-09-13 through 2023-10-02, the download counts for the R application returned by cranlogs::cran_downloads(packages = "R") are, with two exceptions, twice what one would expect when looking at the actual log(s). The two exceptions are: 1) 2023-09-28, where the counts are identical but for a “rounding error” possibly due to NAs, and 2) 2023-09-30, where there is actually a three-fold difference.

Here are the relevant ratios of counts comparing ‘cranlogs’ results with counts based on the underlying logs:

Details and code for replication can be found in issue #69. fixRCranlogs() corrects the problem. Note that there was a similar issue for package download counts around the same period, but that is now fixed in ‘cranlogs’. For details, see issue #68.

The third problem is the apparent loss (i.e., zero downloads for the R application and for all packages) of 7 logs in 2025: 8/25-8/26 and 8/29-9/02. For what it’s worth, both gaps were preceded by unusually large sets of downloads: Sun 8/24 (14,521,256) and Wed 8/27 & Thu 8/28 (16,860,505 and 16,477,023). These outliers are approximately twice the size of “typical” counts (see graph below).

As a “fix” for the missing data, which are stored as a vector of dates in cholera::missing.date, I did the following. First, when a missing date is included, cranDownloads() prints a message in the console:

Second, when you plot cranDownloads(), two gray polygons are added to the graph to highlight those dates; they are labeled with a “⌀” (empty set) on the top axis. Third, smoothers ignore these missing dates.

The graph below, which plots the total number of downloads recorded by the Posit/RStudio mirror from Sat 7/05 through Sun 9/14 (weekends are represented by open circles), shows the magnitude of the outliers and the two graphical fixes.

plot(cranDownloads(from = "2025-07-05", to = "2025-09-10"), smooth = TRUE, weekend = TRUE)
> Missing: 2025-08-25, 2025-08-26, 2025-08-29, 2025-08-30, 2025-08-31, 2025-09-01, 2025-09-02
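The check behind the console message can be sketched in base R. This is a rough illustration under stated assumptions rather than the package’s own code: the helper missingInRange() and the object name `missing.dates` are hypothetical, and the dates are copied from the message above.

```r
# The seven missing dates listed in the console message.
missing.dates <- as.Date(c("2025-08-25", "2025-08-26", "2025-08-29",
  "2025-08-30", "2025-08-31", "2025-09-01", "2025-09-02"))

# Hypothetical helper: which missing dates fall inside a requested range?
missingInRange <- function(from, to, missing = missing.dates) {
  from <- as.Date(from)
  to <- as.Date(to)
  missing[missing >= from & missing <= to]
}

# Print a message only when the requested range includes a missing date.
hits <- missingInRange("2025-07-05", "2025-09-10")
if (length(hits) > 0) {
  message("Missing: ", paste(hits, collapse = ", "))
}
```

A smoother that ignores these dates could, in the same spirit, drop the rows whose date appears in `hits` before fitting.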

VII - data note

R Windows Sunday and Wednesday download spikes (06 Nov 2022 - 19 March 2023)

The graph above for R downloads shows the daily downloads of the R application broken down by platform (Mac, Source, Windows). In it, you can see the typical pattern of mid-week peaks and weekend troughs.

Between 06 November 2022 and 19 March 2023, this pattern was broken. On Sundays (06 November 2022 - 19 March 2023) and Wednesdays (18 January 2023 - 15 March 2023), there were noticeable, repeated orders-of-magnitude spikes in the daily downloads of the Windows version of R.

plot(cranDownloads("R", from = "2022-10-06", to = "2023-04-14"))
> Missing: 2025-08-25, 2025-08-26, 2025-08-29, 2025-08-30, 2025-08-31, 2025-09-01, 2025-09-02
axis(3, at = as.Date("2022-11-06"), labels = "2022-11-06", cex.axis = 2/3, 
  padj = 0.9)
axis(3, at = as.Date("2023-03-19"), labels = "2023-03-19", cex.axis = 2/3, 
  padj = 0.9)
abline(v = as.Date("2022-11-06"), col = "gray", lty = "dotted")
abline(v = as.Date("2023-03-19"), col = "gray", lty = "dotted")

These download spikes did not seem to affect either the Mac or Source versions. I show this in the graphs below. Each plot, which is individually scaled, breaks down the data in the graph above by day (Sunday or Wednesday) and platform.

The key thing is to compare the data in the period bounded by vertical dotted lines with the data before and after. If a Sunday or Wednesday is orders-of-magnitude unusual, I plot that day with a filled rather than an empty circle. Only Windows (the final two graphs below) earns this distinction.

VIII - et cetera

For those interested in directly using the Posit/RStudio download logs, this section describes some issues that may be of use.

country codes (top level domains)

While the IP addresses in the Posit/RStudio logs are anonymized, packageCountry() and countryPackage() make use of the fact that the logs include ISO country codes or top level domains (e.g., AT, JP, US).

Note that coverage extends to only about 85% of observations (approximately 15% of country codes are NA), and that there seem to be a couple of typos for country codes: “A1” (A + number one) and “A2” (A + number two). According to Posit/RStudio’s documentation, this coding was done using MaxMind’s free database, which no longer seems to be available and may be a bit out of date.
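If you work with the country codes directly, a quick check along the following lines may help. The `country` vector is made-up illustration data, not taken from an actual log, and the regular expression is just one way to flag codes that are not two capital letters.

```r
# Hypothetical country code column from a log (illustration only).
country <- c("US", "JP", NA, "AT", "A1", "US", NA, "A2", "DE", "US")

# Share of observations without a country code.
na.share <- mean(is.na(country))

# Codes that are not two capital letters, such as "A1" and "A2".
suspect <- unique(country[!is.na(country) & !grepl("^[A-Z]{2}$", country)])

# Tabulate only the observations with well-formed codes.
ok <- !is.na(country) & !(country %in% suspect)
code.counts <- sort(table(country[ok]), decreasing = TRUE)
```

Whether to drop, keep or recode the suspect values depends on your analysis; the sketch only makes them visible.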

memoization

To avoid the bottleneck of downloading multiple log files, packageRank() is currently limited to individual calendar dates. To reduce the bottleneck of re-downloading logs, which can approach 100 MB, ‘packageRank’ makes use of memoization via the ‘memoise’ package.

fetchLog <- function(url) data.table::fread(url)

mfetchLog <- memoise::memoise(fetchLog)

if (RCurl::url.exists(url)) {
  cran_log <- mfetchLog(url)
}

# Note that data.table::fread() relies on R.utils::decompressFile().

This means that logs are intelligently cached; those that have already been downloaded in your current R session will not be downloaded again.
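To see the caching behavior without downloading anything, here is a minimal base R sketch of the idea. It is not how ‘memoise’ is implemented: memoize() and slowFetch() are hypothetical stand-ins, the file name used as the key is illustrative, and the cache lives only in the current R session.

```r
# Counter to show how often the "expensive" function really runs.
calls <- 0

# Stand-in for a slow download; hypothetical, no network access.
slowFetch <- function(url) {
  calls <<- calls + 1
  paste("contents of", url)
}

# Minimal memoizer: results are cached in an environment, keyed by argument.
memoize <- function(f) {
  cache <- new.env()
  function(key) {
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(key), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

mSlowFetch <- memoize(slowFetch)

first <- mSlowFetch("2020-02-01.csv.gz")   # runs slowFetch()
second <- mSlowFetch("2020-02-01.csv.gz")  # served from the cache
```

After both calls, `calls` is 1: the second request never reaches slowFetch().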

timeout

With R 4.0.3, the timeout value for internet connections became more explicit. Here are the relevant details from that release’s “New features”:

This change can affect functions that download logs. This is especially true over slower internet connections or when you’re dealing with large log files. To fix this, fetchCranLog() will, if needed, temporarily set the timeout to 600 seconds.
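The general pattern of temporarily raising the timeout can be sketched in base R as follows. This is an illustration, not the code inside fetchCranLog(); withLongTimeout() is a hypothetical helper.

```r
# Hypothetical helper: evaluate 'expr' with a timeout of at least 'seconds',
# then restore whatever the user had set before.
withLongTimeout <- function(expr, seconds = 600) {
  old <- getOption("timeout")
  if (old < seconds) {
    options(timeout = seconds)
    # Restore the user's setting when the function exits, even on error.
    on.exit(options(timeout = old), add = TRUE)
  }
  # 'expr' is evaluated lazily, so it runs only now, under the new timeout.
  force(expr)
}

# Inspect the timeout in effect inside the helper.
current <- withLongTimeout(getOption("timeout"))
```

Only raising the value when it is lower than 600 seconds leaves a user’s own, longer timeout untouched.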

Specifically, within each 5% interval of percentile ranks (e.g., 0 to 5, 5 to 10, 95 to 100, etc.), a random sample of 5% of packages is selected and tracked.
