Comments (9)
It turns out that these are HTTP HEAD requests, from an lftp client. I guess it is some feature of lftp. I'll try to update the original script that creates the log files on cran-logs.rstudio.com, to filter these out.
We can think about cleaning up the current DB retrospectively. I guess it would make sense, but it will also change the reports generated in the past, etc.
from cranlogs.
It seems that I cannot fix this in the original script. I'll filter them out in my DB update script then. The smallest CRAN package tarball is 1189 bytes currently, and I doubt that it is practically possible to create a smaller one, so 1000 bytes seems like a good limit. Filter must come somewhere here: https://github.com/r-hub/cranlogs.app/blob/4ee2355d0739abb4c4d273e4e891a3a01a6165bf/db/update.sh#L30
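For illustration, a minimal sketch of what the proposed size filter could look like in R (a sketch only — the real fix would live in the shell update script linked above; the `size` column name follows the packageRank output shown later in this thread, and the 1000-byte threshold is the one proposed here):

```r
# Sketch of the proposed size filter, assuming the log has already been
# read into a data frame with a numeric `size` column. The 1000-byte
# cutoff is below the smallest CRAN tarball (currently 1189 bytes).
min_tarball_bytes <- 1000

filter_small_entries <- function(log) {
  log[log$size >= min_tarball_bytes, ]
}

# Toy example: two realistic downloads and one HEAD-sized entry.
toy <- data.frame(package = c("cranlogs", "cranlogs", "cranlogs"),
                  size    = c(24417, 527, 27060))
filter_small_entries(toy)
```

This keeps the two plausible downloads and drops the 527-byte entry.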
If possible, it might be better to leave the logs as "raw" as possible for those who might be interested, and just update/amend the DB. Since I've put in some time on this, if you want, I'm willing to write up my findings as a note/vignette to explain the discrepancy. Essentially, what's happening is that beyond the random HTTP requests to individual packages, each Wednesday (+ additional days) someone is making requests to all packages on CRAN (all active current versions and all archived inactive past versions).
Let me just check if I need the original DB data before you update it.
If possible, it might be better to leave the logs as "raw" as possible for those who might be interested
However, without the additional information that these are HEAD requests and not downloads, those log entries are misleading.
I don't think this needs a write-up, especially because the situation may change at any time. I think this GitHub issue is a good place to record it.
Let me just check if I need the original DB data before you update it.
It will not happen overnight; I'll probably migrate the service before this, so no worries just yet.
I'm fine with filtering out these records. But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?
For what it's worth, since yesterday was a Wednesday, here's the top of the filtered log for cranlogs:
vars <- c("date", "time", "size", "package", "version", "country", "ip_id")
log.filtered <- packageRank::packageLog("cranlogs", filter = TRUE)
head(log.filtered[, vars])
date time size package version country ip_id
246036 2019-11-13 13:26:43 27060 cranlogs 2.1.1 US 14760
325235 2019-11-13 19:29:32 24417 cranlogs 2.1.1 US 18769
340204 2019-11-13 18:08:12 24417 cranlogs 2.1.1 US 572
872939 2019-11-13 20:31:34 26972 cranlogs 2.1.1 FR 34238
928117 2019-11-13 14:00:44 24287 cranlogs 2.1.1 TR 3298
963961 2019-11-13 09:09:16 27163 cranlogs 2.1.1 JP 42282
And all 7 of the records that are filtered out:
log.audit <- packageRank::packageLog("cranlogs", filter = -1000)
log.audit[, vars]
date time size package version country ip_id
714794 2019-11-13 20:22:52 527 cranlogs 2.0.0 US 18559
714795 2019-11-13 20:22:52 528 cranlogs 2.1.0 US 18559
1459096 2019-11-13 19:59:32 527 cranlogs 2.0.0 US 18559
1459097 2019-11-13 19:59:32 528 cranlogs 2.1.0 US 18559
2123148 2019-11-13 08:39:09 0 cranlogs 2.1.1 US 2499
3988585 2019-11-13 07:41:52 529 cranlogs 2.1.1 DE 27828
4143146 2019-11-13 19:22:07 528 cranlogs 2.1.1 US 18559
But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?
Every request is in the log I think, so yes. But aborted downloads are rare. The reasons I can think of:
1. a HEAD request
2. a request with If-modified-*, which results in a 304 response
3. aborted downloads, but even these probably send more than ~500 bytes, as far as the web server is concerned.
I think 3. is probably very rare. 1. should be filtered out, ideally, but 2. is a proper download (attempt).
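To make the three cases concrete, here is a rough heuristic sketch. The labels are my own guesses, not anything recorded in the logs — the log stores only the transfer size, not the HTTP method or status code:

```r
# Rough heuristic (assumption: only bytes transferred are available, so
# these labels are guesses, not facts). A size of 0 is consistent with a
# HEAD request; a few hundred bytes with a 304 or an early abort.
classify_entry <- function(size) {
  if (size == 0) "likely HEAD request"
  else if (size < 1000) "likely 304 or aborted transfer"
  else "plausible download"
}

# Sizes taken from the audit output above.
sapply(c(0, 529, 24417), classify_entry)
```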
Would you happen to know if the lftp client is part of R and/or RStudio?
I don't think so.
A little more digging...
These ~500-byte downloads manifest themselves in three ways:
- as standalone entries:
log <- packageRank::packageLog("cholera", "2020-06-03")
log[log$time == "18:48:17", ]
date time size r_version r_arch r_os package version country
24 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.2.1 US
25 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.3.0 US
26 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.4.0 US
27 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.0 US
28 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.5.1 US
29 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.0 US
30 2020-06-03 18:48:17 533 <NA> <NA> <NA> cholera 0.6.5 US
ip_id
24 4724
25 4724
26 4724
27 4724
28 4724
29 4724
30 4724
- as part of a pair (same machine, same time stamp): a ~500 byte entry plus a full download.
log <- packageRank::packageLog("sp", "2020-05-06")
log[log$time == "08:14:34", ]
date time size r_version r_arch r_os package version country
4263 2020-05-06 08:14:34 367 <NA> <NA> <NA> sp 1.4-1 <NA>
4264 2020-05-06 08:14:34 1134627 <NA> <NA> <NA> sp 1.4-1 <NA>
ip_id
4263 6
4264 6
- as part of a triplet (same machine, same time stamp). For packages >= 1 MB in size, it seems you'll see a ~500 byte download, a full download, and an intermediate-sized download:
log <- packageRank::packageLog("cholera", "2020-02-02")
log[log$time == "10:19:17", ]
date time size r_version r_arch r_os package version country
5 2020-02-02 10:19:17 4147305 <NA> <NA> <NA> cholera 0.7.0 US
6 2020-02-02 10:19:17 34821 <NA> <NA> <NA> cholera 0.7.0 US
7 2020-02-02 10:19:17 539 <NA> <NA> <NA> cholera 0.7.0 US
ip_id
5 1047
6 1047
7 1047
For packages < 1 MB, it seems you'll see a ~500 byte download plus two "full" downloads:
log <- packageRank::packageLog("cranlogs", "2020-05-06")
log[log$time == "09:03:46", ]
date time size r_version r_arch r_os package version country
19 2020-05-06 09:03:46 529 <NA> <NA> <NA> cranlogs 2.1.1 US
20 2020-05-06 09:03:46 24273 <NA> <NA> <NA> cranlogs 2.1.1 US
21 2020-05-06 09:03:46 24280 <NA> <NA> <NA> cranlogs 2.1.1 US
ip_id
19 3678
20 3678
21 3678
While we can easily correct for the first two cases by simply filtering out entries smaller than say 1000 bytes, the variability in size of the intermediate download makes triplets harder to deal with. That said, I'm wondering if you'd consider implementing a fix for them.
Here are three reasons why. First, we're essentially counting one download as three. Second, their frequency in the logs seems to be increasing. Third, because it is a somewhat computationally intensive task, it makes more sense to deal with it on the back-end than as part of a function on the user end.
For what it's worth, to get a ballpark empirical sense of the problem, I wrote some prototype R code and did some preliminary analysis. If interested, I can post it.
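In the meantime, here is a minimal sketch of one possible de-duplication (my own illustration, not the prototype itself), assuming the ip_id/time/package/version/size columns from the packageRank output above. It collapses same-machine, same-timestamp multiples to a single row, keeping the largest transfer as the "real" download:

```r
# Collapse pairs/triplets: group rows by (ip_id, time, package, version)
# and keep only the largest transfer in each group. Assumes the column
# names shown in the packageRank::packageLog output above.
dedup_downloads <- function(log) {
  key <- interaction(log$ip_id, log$time, log$package, log$version,
                     drop = TRUE)
  keep <- unlist(lapply(split(seq_len(nrow(log)), key), function(idx) {
    idx[which.max(log$size[idx])]  # index of the biggest transfer
  }))
  log[sort(keep), ]
}

# Toy triplet modeled on the cholera example: three rows collapse to one.
toy <- data.frame(ip_id = 1047, time = "10:19:17", package = "cholera",
                  version = "0.7.0", size = c(4147305, 34821, 539))
dedup_downloads(toy)
```

Keeping the maximum size per group is itself an assumption — it treats the largest transfer as the completed download, which matches the pair and triplet patterns shown above but would need validation against more of the logs.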