To compute the number of package downloads for a given day, cranlogs::cran_downloads()

Option to not count downloads < 1000 bytes

r-hub commented on July 17, 2024

Option to not count downloads < 1000 bytes

from cranlogs.

Comments (9)

gaborcsardi commented on July 17, 2024

It turns out that these are HTTP HEAD requests, from an lftp client. I guess it is some feature of lftp. I'll try to update the original script that creates the log files on cran-logs.rstudio.com, to filter these out.

We can think about cleaning up the current DB retrospectively. I guess it would make sense, but it will also change the reports generated in the past, etc.

gaborcsardi commented on July 17, 2024

It seems that I cannot fix this in the original script, so I'll filter them out in my DB update script instead. The smallest CRAN package tarball is currently 1189 bytes, and I doubt that it is practically possible to create a smaller one, so 1000 bytes seems like a good limit. The filter must go somewhere here: https://github.com/r-hub/cranlogs.app/blob/4ee2355d0739abb4c4d273e4e891a3a01a6165bf/db/update.sh#L30
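As a rough illustration of the 1000-byte cutoff described above, here is a minimal R sketch. It is not the change to `update.sh` (that script is not shown in this thread); the toy data frame and the `min_size` variable are assumptions made up for the example, with column names borrowed from the log output posted later in the thread.

```r
# Toy log: made-up rows, using column names seen elsewhere in this thread.
log <- data.frame(
  package = c("cranlogs", "cranlogs", "cranlogs"),
  version = c("2.1.1", "2.1.0", "2.1.1"),
  size    = c(27060, 528, 0),
  stringsAsFactors = FALSE
)

# Hypothetical name for the proposed threshold, in bytes.
min_size <- 1000

# Keep only entries at least as large as the threshold;
# smaller entries are treated as non-downloads (e.g. HEAD requests).
downloads <- log[log$size >= min_size, ]
n_downloads <- nrow(downloads)
```

With this toy input only the 27060-byte entry survives the filter.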

lindbrook commented on July 17, 2024

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested, and just update/amend the DB. Since I've put in some time on this, if you want, I'm willing to write up my findings as a note/vignette to explain the discrepancy. Essentially, what's happening is that beyond the random HTTP requests to individual packages, each Wednesday (+ additional days) someone is making requests to all packages on CRAN (all active current versions and all archived inactive past versions).

Let me just check if I need the original DB data before you update it.
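One way to check a weekly pattern like the Wednesday sweep described above is to tabulate the small entries by day of the week. The following is only a sketch on made-up rows, not the analysis behind this comment; the data frame and the variable names are assumptions.

```r
# Toy log: made-up rows spanning three consecutive days.
# 2019-11-13 was a Wednesday.
log <- data.frame(
  date = as.Date(c("2019-11-12", "2019-11-13", "2019-11-13",
                   "2019-11-13", "2019-11-14")),
  size = c(24417, 527, 528, 27060, 24287)
)

# Entries below the 1000-byte cutoff discussed in this thread.
small <- log[log$size < 1000, ]

# "%u" is the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which avoids locale-dependent weekday names.
weekday <- factor(format(small$date, "%u"), levels = as.character(1:7))
small_by_weekday <- table(weekday)

# Count of small entries that fall on a Wednesday (weekday 3).
n_wednesday <- as.integer(small_by_weekday["3"])
```

On real logs, a spike in one weekday's count relative to the others would be consistent with a scheduled sweep.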

gaborcsardi commented on July 17, 2024

> If possible, it might be better to leave the logs as "raw" as possible for those who might be interested

However, without the additional information that these are HEAD requests and not downloads, those log entries are misleading.

I don't think this needs any writeup, especially because the situation may change at any time. I think this issue is a good place to record it.

> Let me just check if I need the original DB data before you update it.

It will not happen overnight; I'll probably migrate the service before this, so no worries just yet.

lindbrook commented on July 17, 2024

I'm fine with filtering out these records. But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

For what it's worth, since yesterday was a Wednesday, here's the top of the filtered log for cranlogs:

vars <- c("date", "time", "size", "package", "version", "country", "ip_id")

log.filtered <- packageRank::packageLog("cranlogs", filter = TRUE)
head(log.filtered[, vars])

             date     time  size  package version country ip_id
246036 2019-11-13 13:26:43 27060 cranlogs   2.1.1      US 14760
325235 2019-11-13 19:29:32 24417 cranlogs   2.1.1      US 18769
340204 2019-11-13 18:08:12 24417 cranlogs   2.1.1      US   572
872939 2019-11-13 20:31:34 26972 cranlogs   2.1.1      FR 34238
928117 2019-11-13 14:00:44 24287 cranlogs   2.1.1      TR  3298
963961 2019-11-13 09:09:16 27163 cranlogs   2.1.1      JP 42282

And all 7 of the records that are filtered out:

log.audit <- packageRank::packageLog("cranlogs", filter = -1000)
log.audit[, vars]

              date     time size  package version country ip_id
 714794 2019-11-13 20:22:52  527 cranlogs   2.0.0      US 18559
 714795 2019-11-13 20:22:52  528 cranlogs   2.1.0      US 18559
1459096 2019-11-13 19:59:32  527 cranlogs   2.0.0      US 18559
1459097 2019-11-13 19:59:32  528 cranlogs   2.1.0      US 18559
2123148 2019-11-13 08:39:09    0 cranlogs   2.1.1      US  2499
3988585 2019-11-13 07:41:52  529 cranlogs   2.1.1      DE 27828
4143146 2019-11-13 19:22:07  528 cranlogs   2.1.1      US 18559

gaborcsardi commented on July 17, 2024

> But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

Every request is in the log, I think, so yes. But aborted downloads are rare. The reasons I can think of:

1. HEAD request
2. request with If-modified-*, that results in a 304 response
3. aborted downloads, but even these probably send more than ~500 bytes, as far as the web server is concerned.

I think 3. is probably very rare. 1. should be filtered out, ideally, but 2. is a proper download (attempt).

lindbrook commented on July 17, 2024

Would you happen to know if the lftp client is part of R and/or RStudio?

gaborcsardi commented on July 17, 2024

I don't think so.

lindbrook commented on July 17, 2024

A little more digging...

~500 byte downloads manifest themselves in three ways.

as standalone entries:

log <- packageRank::packageLog("cholera", "2020-06-03")
log[log$time == "18:48:17", ]

         date     time size r_version r_arch r_os package version country ip_id
24 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.2.1      US  4724
25 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.3.0      US  4724
26 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.4.0      US  4724
27 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.5.0      US  4724
28 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.5.1      US  4724
29 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.6.0      US  4724
30 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.6.5      US  4724

as part of a pair (same machine, same time stamp): a ~500 byte entry plus a full download.

log <- packageRank::packageLog("sp", "2020-05-06")
log[log$time == "08:14:34", ]

           date     time    size r_version r_arch r_os package version country ip_id
4263 2020-05-06 08:14:34     367      <NA>   <NA> <NA>      sp   1.4-1    <NA>     6
4264 2020-05-06 08:14:34 1134627      <NA>   <NA> <NA>      sp   1.4-1    <NA>     6

as part of a triplet (same machine, same time stamp). For packages >= 1 MB in size, it seems you'll see a ~500 byte download, a full download, and an intermediate-sized download:

log <- packageRank::packageLog("cholera", "2020-02-02")
log[log$time == "10:19:17", ]

        date     time    size r_version r_arch r_os package version country ip_id
5 2020-02-02 10:19:17 4147305      <NA>   <NA> <NA> cholera   0.7.0      US  1047
6 2020-02-02 10:19:17   34821      <NA>   <NA> <NA> cholera   0.7.0      US  1047
7 2020-02-02 10:19:17     539      <NA>   <NA> <NA> cholera   0.7.0      US  1047

For packages < 1 MB, it seems you'll see a ~500 byte download plus two "full" downloads:

log <- packageRank::packageLog("cranlogs", "2020-05-06")
log[log$time == "09:03:46", ]

         date     time  size r_version r_arch r_os  package version country ip_id
19 2020-05-06 09:03:46   529      <NA>   <NA> <NA> cranlogs   2.1.1      US  3678
20 2020-05-06 09:03:46 24273      <NA>   <NA> <NA> cranlogs   2.1.1      US  3678
21 2020-05-06 09:03:46 24280      <NA>   <NA> <NA> cranlogs   2.1.1      US  3678

While we can easily correct for the first two cases by simply filtering out entries smaller than, say, 1000 bytes, the variability in size of the intermediate download makes triplets harder to deal with. That said, I'm wondering if you'd consider implementing a fix for them.

Here are three reasons why. First, we're essentially counting one download as three. Second, their frequency in the logs seems to be increasing. Third, because it is a somewhat computationally intensive task, it makes more sense to deal with it on the back-end than as part of a function on the user end.

For what it's worth, to get a ballpark empirical sense of the problem, I wrote some prototype R code and did some preliminary analysis. If interested, I can post it.
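To make the triplet problem concrete, here is one possible heuristic sketched in base R. It is not the prototype mentioned above, which is not shown in this thread. It assumes that entries sharing the same date, time, ip_id, package, and version belong to one request burst, and that a burst containing a sub-1000-byte entry should count as a single download, represented by its largest entry. The toy rows, the `key` grouping, and the `collapse_group` helper are all made up for illustration.

```r
# Toy log: the first three rows mimic the triplet pattern described above;
# the fourth is an unrelated, made-up full download.
log <- data.frame(
  date    = "2020-02-02",
  time    = c("10:19:17", "10:19:17", "10:19:17", "11:00:00"),
  size    = c(4147305, 34821, 539, 4147310),
  package = "cholera",
  version = "0.7.0",
  ip_id   = c(1047, 1047, 1047, 2000),
  stringsAsFactors = FALSE
)

# Assumed grouping key: same machine, same time stamp, same package version.
key <- paste(log$date, log$time, log$ip_id, log$package, log$version,
             sep = "|")

# Hypothetical helper: if a group contains a sub-1000-byte entry,
# keep only its largest entry; otherwise keep every entry in the group.
collapse_group <- function(rows) {
  g <- log[rows, ]
  if (any(g$size < 1000)) g[which.max(g$size), ] else g
}

groups  <- split(seq_len(nrow(log)), key)
deduped <- do.call(rbind, lapply(groups, collapse_group))
rownames(deduped) <- NULL
n_deduped <- nrow(deduped)
```

On this toy input the triplet collapses to its 4147305-byte entry and the unrelated download is kept. The heuristic would also collapse the pair case, but it says nothing about bursts that lack a small entry, so it should be checked against real logs before being relied on.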
