Comments (9)

gaborcsardi commented on July 17, 2024

It turns out that these are HTTP HEAD requests, from an lftp client. I guess it is some feature of lftp. I'll try to update the original script that creates the log files on cran-logs.rstudio.com, to filter these out.

We can think about cleaning up the current DB retrospectively. I guess it would make sense, but it will also change the reports generated in the past, etc.

from cranlogs.

gaborcsardi commented on July 17, 2024

It seems that I cannot fix this in the original script. I'll filter them out in my DB update script then. The smallest CRAN package tarball is 1189 bytes currently, and I doubt that it is practically possible to create a smaller one, so 1000 bytes seems like a good limit. The filter must go somewhere around here: https://github.com/r-hub/cranlogs.app/blob/4ee2355d0739abb4c4d273e4e891a3a01a6165bf/db/update.sh#L30
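As a rough illustration of that size limit (not the actual change to `db/update.sh`), here is a minimal R sketch. It assumes the log has been read into a data frame with a `size` column in bytes; the helper name `drop_small_entries` and the example rows are made up for this sketch.

```r
# Hypothetical helper: keep only entries at or above a size threshold (bytes).
drop_small_entries <- function(log, min_size = 1000) {
  log[!is.na(log$size) & log$size >= min_size, , drop = FALSE]
}

# Made-up example rows: one ~500 byte entry and one full-sized download.
log <- data.frame(
  package = c("cranlogs", "cranlogs"),
  size    = c(528, 24417)
)

kept <- drop_small_entries(log)
nrow(kept)
```

Only the full-sized row survives the filter; the threshold is a parameter so a different limit can be tried without touching the function body.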

lindbrook commented on July 17, 2024

If possible, it might be better to leave the logs as "raw" as possible for those who might be interested, and just update/amend the DB. Since I've put in some time on this, if you want, I'm willing to write up my findings as a note/vignette to explain the discrepancy. Essentially, what's happening is that beyond the random HTTP requests to individual packages, each Wednesday (+ additional days) someone is making requests to all packages on CRAN (all active current versions and all archived inactive past versions).
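One way to check the weekday pattern would be to tally the small entries by day of the week. The sketch below is only an illustration on made-up rows: the `date` and `size` column names follow the log excerpts in this thread, and the 1000 byte cutoff is the limit proposed above.

```r
# Made-up entries for illustration only: dates and sizes in bytes.
log <- data.frame(
  date = as.Date(c("2019-11-12", "2019-11-13", "2019-11-13", "2019-11-14")),
  size = c(24417, 527, 528, 27060)
)

# ISO weekday number, 1 = Monday ... 7 = Sunday (locale-independent).
log$weekday <- as.integer(format(log$date, "%u"))

# Count sub-1000-byte entries per weekday.
small <- log[log$size < 1000, ]
small_by_weekday <- table(factor(small$weekday, levels = 1:7))
small_by_weekday
```

Using the numeric `%u` weekday rather than `weekdays()` keeps the result independent of the session locale.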

Let me just check if I need the original DB data before you update it.

gaborcsardi commented on July 17, 2024

> If possible, it might be better to leave the logs as "raw" as possible for those who might be interested

However, without the additional information that these are HEAD requests and not downloads, those log entries are misleading.

I don't think this needs any writeup, especially because the situation may change at any time. I think this issue is a good place to record it.

> Let me just check if I need the original DB data before you update it.

It will not happen overnight, I'll probably migrate the service before this, so no worries just yet.

lindbrook commented on July 17, 2024

I'm fine with filtering out these records. But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

For what it's worth, since yesterday was a Wednesday, here's the top of the filtered log for cranlogs:

vars <- c("date", "time", "size", "package", "version", "country", "ip_id")

log.filtered <- packageRank::packageLog("cranlogs", filter = TRUE)
head(log.filtered[, vars])

             date     time  size  package version country ip_id
246036 2019-11-13 13:26:43 27060 cranlogs   2.1.1      US 14760
325235 2019-11-13 19:29:32 24417 cranlogs   2.1.1      US 18769
340204 2019-11-13 18:08:12 24417 cranlogs   2.1.1      US   572
872939 2019-11-13 20:31:34 26972 cranlogs   2.1.1      FR 34238
928117 2019-11-13 14:00:44 24287 cranlogs   2.1.1      TR  3298
963961 2019-11-13 09:09:16 27163 cranlogs   2.1.1      JP 42282

And all 7 of the records that are filtered out:

log.audit <- packageRank::packageLog("cranlogs", filter = -1000)
log.audit[, vars]

              date     time size  package version country ip_id
714794  2019-11-13 20:22:52  527 cranlogs   2.0.0      US 18559
714795  2019-11-13 20:22:52  528 cranlogs   2.1.0      US 18559
1459096 2019-11-13 19:59:32  527 cranlogs   2.0.0      US 18559
1459097 2019-11-13 19:59:32  528 cranlogs   2.1.0      US 18559
2123148 2019-11-13 08:39:09    0 cranlogs   2.1.1      US  2499
3988585 2019-11-13 07:41:52  529 cranlogs   2.1.1      DE 27828
4143146 2019-11-13 19:22:07  528 cranlogs   2.1.1      US 18559

gaborcsardi commented on July 17, 2024

> But I am curious. Besides HEAD requests, why else would we see these small "downloads"? Are aborted "downloads" in the log?

Every request is in the log I think, so yes. But aborted downloads are rare. The reasons I can think of:

  1. HEAD request
  2. a request with an If-Modified-* header that results in a 304 response
  3. aborted downloads, but even these probably send more than ~500 bytes, as far as the web server is concerned.

I think 3. is probably very rare. 1. should be filtered out, ideally, but 2. is a proper download (attempt).

lindbrook commented on July 17, 2024

Would you happen to know if the lftp client is part of R and/or RStudio?

gaborcsardi commented on July 17, 2024

I don't think so.

lindbrook commented on July 17, 2024

A little more digging...

~500 byte downloads manifest themselves in three ways.

  1. as standalone entries:

log <- packageRank::packageLog("cholera", "2020-06-03")
log[log$time == "18:48:17", ]

         date     time size r_version r_arch r_os package version country
24 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.2.1      US
25 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.3.0      US
26 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.4.0      US
27 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.5.0      US
28 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.5.1      US
29 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.6.0      US
30 2020-06-03 18:48:17  533      <NA>   <NA> <NA> cholera   0.6.5      US
   ip_id
24  4724
25  4724
26  4724
27  4724
28  4724
29  4724
30  4724

  2. as part of a pair (same machine, same time stamp): a ~500 byte entry plus a full download.

log <- packageRank::packageLog("sp", "2020-05-06")
log[log$time == "08:14:34", ]

           date     time    size r_version r_arch r_os package version country
4263 2020-05-06 08:14:34     367      <NA>   <NA> <NA>      sp   1.4-1    <NA>
4264 2020-05-06 08:14:34 1134627      <NA>   <NA> <NA>      sp   1.4-1    <NA>
     ip_id
4263     6
4264     6

  3. as part of a triplet (same machine, same time stamp). For packages >= 1 MB in size, it seems you'll see a ~500 byte download, a full download, and an intermediate-sized download:

log <- packageRank::packageLog("cholera", "2020-02-02")
log[log$time == "10:19:17", ]

        date     time    size r_version r_arch r_os package version country
5 2020-02-02 10:19:17 4147305      <NA>   <NA> <NA> cholera   0.7.0      US
6 2020-02-02 10:19:17   34821      <NA>   <NA> <NA> cholera   0.7.0      US
7 2020-02-02 10:19:17     539      <NA>   <NA> <NA> cholera   0.7.0      US
  ip_id
5  1047
6  1047
7  1047

For packages < 1 MB, it seems you'll see a ~500 byte download plus two "full" downloads:

log <- packageRank::packageLog("cranlogs", "2020-05-06")
log[log$time == "09:03:46", ]

         date     time  size r_version r_arch r_os  package version country
19 2020-05-06 09:03:46   529      <NA>   <NA> <NA> cranlogs   2.1.1      US
20 2020-05-06 09:03:46 24273      <NA>   <NA> <NA> cranlogs   2.1.1      US
21 2020-05-06 09:03:46 24280      <NA>   <NA> <NA> cranlogs   2.1.1      US
   ip_id
19  3678
20  3678
21  3678
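To make the three patterns concrete, here is a minimal base R sketch that groups entries sharing the same date, time, `ip_id`, package and version, then labels each group. The rows are copied from the sp and cranlogs examples above (columns trimmed); the grouping key, the 1000 byte cutoff and the labels are assumptions of this sketch, not something the logs define.

```r
# Entries taken from the sp and cranlogs examples (columns trimmed for brevity).
log <- data.frame(
  date    = "2020-05-06",
  time    = c("09:03:46", "09:03:46", "09:03:46", "08:14:34", "08:14:34"),
  size    = c(529, 24273, 24280, 367, 1134627),
  package = c("cranlogs", "cranlogs", "cranlogs", "sp", "sp"),
  version = c("2.1.1", "2.1.1", "2.1.1", "1.4-1", "1.4-1"),
  ip_id   = c(3678, 3678, 3678, 6, 6)
)

# One key per (date, time, ip_id, package, version) combination.
key <- with(log, paste(date, time, ip_id, package, version))

# Per group: number of entries, and whether any entry is under 1000 bytes.
n_entries <- tapply(log$size, key, length)
has_small <- tapply(log$size < 1000, key, any)

groups <- data.frame(
  key       = names(n_entries),
  n_entries = as.vector(n_entries),
  has_small = as.vector(has_small)
)

# Label each group by the pattern it matches.
groups$pattern <- with(groups, ifelse(
  !has_small, "no small entry",
  ifelse(n_entries == 1, "standalone",
         ifelse(n_entries == 2, "pair", "triplet or larger"))
))

groups[, c("key", "n_entries", "pattern")]
```

Because the key includes the version, the seven same-second cholera entries from the first example would each count as a separate standalone group.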

While we can easily correct for the first two cases by simply filtering out entries smaller than, say, 1000 bytes, the variability in size of the intermediate download makes triplets harder to deal with. That said, I'm wondering if you'd consider implementing a fix for them.

Here are three reasons why. First, we're essentially counting one download as three. Second, their frequency in the logs seems to be increasing. Third, because it is a somewhat computationally intensive task, it makes more sense to deal with it on the back-end than as part of a function on the user end.
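For a sense of what such a fix might involve, here is one possible heuristic sketched in base R. It is an illustration only, not the prototype code referred to in this comment and not anything cranlogs implements: it drops entries under 1000 bytes, then keeps only the largest entry among those sharing the same date, time, `ip_id`, package and version. The function name `collapse_same_second` is made up, and the example rows are the cholera triplet shown earlier.

```r
# Hypothetical heuristic: treat same-second entries from one ip_id for the
# same package version as a single download.
collapse_same_second <- function(log, min_size = 1000) {
  # Drop the ~500 byte entries first.
  log <- log[log$size >= min_size, , drop = FALSE]
  # Sort so the largest entry of each group comes first.
  log <- log[order(-log$size), , drop = FALSE]
  # Keep the first (largest) entry per group.
  key <- with(log, paste(date, time, ip_id, package, version))
  log[!duplicated(key), , drop = FALSE]
}

# The cholera triplet from 2020-02-02 (columns trimmed for brevity).
log <- data.frame(
  date = "2020-02-02", time = "10:19:17",
  size = c(4147305, 34821, 539),
  package = "cholera", version = "0.7.0", ip_id = 1047
)

result <- collapse_same_second(log)
result$size
```

The obvious caveat is that two genuine downloads of the same version by the same `ip_id` within the same second would also be collapsed into one.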

For what it's worth, to get a ballpark empirical sense of the problem, I wrote some prototype R code and did some preliminary analysis. If interested, I can post it.
