zkat / pacote
programmatic npm package and metadata downloader (moved!)
Home Page: https://github.com/npm/pacote
License: MIT License
The cache can get into various states where cached data will no longer pass integrity checks. When this happens, the prefetch or extract call fails because the local data is bad. In that case, pacote should evict the corrupt entry and just try again.
There's cool stuff happening over in node-tar land. Once it's solid and released, let's shove it in pacote and bask in its absurdly performant glory.
So `lib/registry/request` has very little about it that is actually registry-specific. It's probably a good idea to fork it out as a standalone, server-agnostic caching client (as opposed to the current arrangement, which mixes in a bit of registry-specific work). That would make the client usable for other things, like remote tarballs and grabbing git stuff directly.
Honestly, everything in that cache wrapper is a simple call straight to cacache. The main addition of importance here is memoization, and that can reasonably be moved straight into cacache for convenience.
Once the http client has been refactored, it'll probably account for most of the cache-related calls. That client, then, can just call cacache instead of maintaining even more code.
All source files should have `'use strict'` added as their first line. Ditto for test code.
npm currently returns an `ETARGET` error whenever it tries and fails to fetch package metadata: https://github.com/npm/npm/blob/1067febf1875c92d6498ede7c0b20012a0c33d30/lib/fetch-package-metadata.js#L154-L162

For the sake of better general compatibility, should pacote return the same type of error? I've been thinking that the current style of just chucking out `ENOENT` is bound to cause problems if there are different types of `ENOENT`s coming from different parts of pacote or cacache.
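Matching the CLI's convention would just mean attaching a distinct `code` to the thrown error. A minimal sketch -- the helper name and message format are illustrative assumptions, not pacote's actual API:

```javascript
// Hypothetical helper: attach a distinct `code` so callers can tell
// "no matching version" apart from filesystem ENOENTs.
function noMatchingVersion (name, wanted) {
  const err = new Error(`No compatible version found: ${name}@${wanted}`)
  err.code = 'ETARGET'
  return err
}
```

Callers could then branch on `err.code === 'ETARGET'` instead of guessing from the message or colliding with `fs`-originated `ENOENT`s.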
Support an enforced offline mode that errors if any network request is attempted, and that otherwise uses the cache as much as possible.
Once the codebase stabilizes a bit, I want to write an ARCHITECTURE.md
document that gives an overview of how the project is structured, the purpose of various components, overall design concepts that are good to remember which might not be too obvious, etc.
There's basically no coverage on the git code paths, but tests might be a bit tricky to write because they have to set up and launch git daemons. Once the mocking utility is done, though, things should go much faster.
This really took a bite out of general coverage for the library, so it's best to get this done sooner rather than later :\
The npm installer currently has 3 major stages relevant to pacote -- but only uses two of them right now.

While it's nice to stream end-to-end in a single step with `pacote.extract`, the whole point of having a multi-stage installer is to take advantage of the predictability and isolation that come from having discrete steps. `pacote.extract`, as it turns out, does a lot -- and it's used in the `extract` step of the installer.

There's another step that's been commented out for a while now: `fetch`, which is intended to be the stage where npm actually goes out into the network to grab any tarballs.

So, the proposal here is to add another toplevel API function to pacote: `pacote.prefetch`. Its job should be purely to warm up pacote's cache, allowing `pacote.extract` to bypass the cache index and always extract tarballs by digest, since it'll know those tarballs are present.

The nice thing is that all the code `pacote.prefetch` would need is already there: simply calling the appropriate `tarball.js` handler and draining the stream into the void (`stream.on('data', function () {})`) is good enough to put it together. All of that code is already in `pacote.extract`.

It's just kinda hanging out there on `latest` right now, and it needs some test coverage :<
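Under that description, the whole function is little more than a drain. A sketch -- the tarball-stream handler is passed in explicitly here, since the real wiring lives inside pacote:

```javascript
// Hedged sketch of pacote.prefetch: warm the cache by pulling the tarball
// stream to completion and discarding the bytes. `tarballStream` stands in
// for the appropriate tarball.js handler, which is assumed to tee data into
// the cache as it flows.
function prefetch (spec, opts, tarballStream) {
  return new Promise((resolve, reject) => {
    const stream = tarballStream(spec, opts)
    stream.on('error', reject)
    stream.on('end', resolve)
    stream.on('data', () => {}) // drain into the void
  })
}
```

The resolved value carries no data on purpose: the side effect of a warmed cache is the entire point.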
pacote should normalize metadata fields down to only the things the CLI might need, and standardize those across the different sources. It should also run the manifest through `normalize-package-data`.
The current auth stuff is kinda janky. Figure out an auth mechanism that generalizes well and make configuring it more straightforward. :waves-hands:

Besides bearer tokens, pacote needs to support basic HTTP auth. It should also obey `auth.alwaysAuth`.
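For the basic-auth half, the header construction itself is tiny. A sketch -- the `{ token, username, password }` shape is an assumption for illustration, not pacote's actual config format:

```javascript
// Hedged sketch: pick an Authorization header value from an auth config.
// The config object shape here is assumed, not pacote's real one.
function authHeader (auth) {
  if (auth.token) {
    return `Bearer ${auth.token}`
  } else if (auth.username && auth.password) {
    const creds = Buffer.from(`${auth.username}:${auth.password}`).toString('base64')
    return `Basic ${creds}`
  }
  return null // no credentials configured
}
```

`auth.alwaysAuth` would then control whether the resulting header gets attached to every request or only to requests hitting the registry host.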
Right now, there are a bunch of errors that get spit out by the various deps and such that we use.
The CLI eventually has to make its own decisions about these, but it would be super handy to distinguish between three main error categories:
This might be a really great first step towards having much richer error reporting data for CLI users to consume -- especially user-level conditions that we might be able to be very specific about.
Add various bits of noteworthy performance analytics to be collected on the fly, and log them out as things complete.
Some ideas:
- `finalize-manifest`
- tarball extractions

http://npm.im/request-capture-har might be of use for the network part of this.
This one's a tricky one. There are many ways to get the contents of a git repo. The best way for pacote to handle git is by judiciously picking which of these to use depending on what's being requested, and trying to avoid a full git clone at all costs.
Semver range dependencies should be resolved according to npm/npm#15308. These can be resolved with a `git ls-remote`.
These are the possible ways I've found so far to get either full package data, or subsets, and associated caveats:
1. Full clone:

   `$ git clone https://github.com/npm/npm`

   The slowest option (`npm/npm`: 13s, `zkat/cacache`: 1.7s -- so it won't be so bad on small repos).

2. Shallow clone:

   `$ git clone https://github.com/npm/npm --depth=1 -b <named-ref>`

   Only works for `HEAD` or named refs; any commit hash that has an existing remote ref will work, too. Faster than a full `git clone` for larger repos but still fairly heavy. (`npm/npm`: 4.35s, `zkat/cacache`: 1.52s)

3. `git archive`:

   `$ git archive --format=tar.gz --prefix=package/ --remote=https://github.com/npm/npm <committish>`

   The `package/` prefix can be added with the `--prefix=` option.

4. Hosted tarball download:

   `$ curl -LO https://github.com/npm/npm/archive/<committish>.tgz`

   Fast (`npm/npm`: 0.6s, `zkat/cacache`: 0.39s), but relies on `hosted-git-info`. The archive isn't rooted at `package/` by default, so we need to manually add a level.

5. Raw file download:

   `$ curl -LO https://raw.githubusercontent.com/npm/npm/<committish>/package.json`

   Works for `package.json` and `npm-shrinkwrap.json`, but doesn't help with `bin` when `directories.bin` is there.

6. Ref listing:

   `$ git ls-remote https://github.com/npm/npm -t -h '*'`

   Fast (`npm/npm`: 0.83s, `zkat/cacache`: 0.37s). There's a choice between `-t -h` and plain `-t`: the former will use more RAM but increases the chances of a non-semver committish matching; the latter will be smaller and is all that's needed for a semver ref.

Bulk stuff is way faster than stream-based stuff. Should do #25 before doing this just so we have solid numbers on the difference, and stream stuff needs to still exist because we want to be able to handle multi-gig files.
Ideally, `pacote.extract` and `pacote.prefetch` would only use streams for particularly big packages. `pacote.manifest` should always use bulk requests.
The benchmark results in this tweet (https://twitter.com/fold_left/status/860239327229607937) seem to hint that there's a significant slowdown when tarballs are already gzipped:

```
npm5: 15.561s
npm5-cached: 16.351s
npm5-shrinkpack: 1.216s
npm5-shrinkpack-compressed: 8.817s
```
I'm not terribly surprised by a slowdown, but the specific slowdown seems pretty big. I wonder if there's something along the way slowing things down more than expected.
We just kinda grab the tarball every time right now. We can cache it separately.
This one should be super simple! Add the tarball to the cache pretty much directly! Manifests, again, will need to get picked up during extraction, though :(
While the CLI probably doesn't need to worry about this much except in case of catastrophe, there's some user tooling that could really benefit from pacote's default mode refusing to overwrite directory contents on extract.
`opts.extractOverwrite` should be required for anyone who targets a directory that: A. exists, B. has any contents in it.
It's ok to be racy about this. If two processes shove things in one dir at the same time, so be it. This feature is primarily to protect against what is bound to be a common footgun for users using straight-up pacote (it literally just bit me and I don't wanna even).
If you think this is an interesting bug to pick up, this is what I think is generally the direction to go in:
1. Add the `extractOverwrite` option to `lib/util/opt-check` -- you won't be able to read it otherwise.
2. Add a conditional `readdirAsync()` call early on in `extract.js`, before most other work is done. (Note: `readdirAsync` is basically `const readdirAsync = BB.promisify(fs.readdir)`, by convention.)
3. If the resulting listing has any items in it, throw an error with a useful code and an explanation of what the user tried to do -- include a mention of `opts.extractOverwrite` in the error message so it's discoverable.
4. If `opts.extractOverwrite` is true, bypass the `readdirAsync` call entirely with a `BB.resolve()`.
Feel free to do it your way, too, if you find a better alternative. The goal is to prevent users from accidentally writing into things they didn't intend to write to.
Deprecation warnings in npm are weird: right now, it's the CLI's cache that takes care of them. It's probably not a good idea for the cache itself to be responsible for this. At the same time, deprecation information is passed to the CLI in headers -- so pacote would have to know about them already.
So, do the following: add a `_deprecated: Bool` field to the finalized manifest based on that registry header, and let the npm installer take care of the rest.
If a package manifest has no `bundleDependencies`, we should filter out any files contained within `node_modules` for it.
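The filtering rule itself is a small predicate. A sketch -- the entry-path convention is an assumption (real tarball entries are prefixed with `package/`, assumed already stripped here):

```javascript
// Hedged sketch: decide whether a file entry should survive extraction.
// Drops anything under node_modules/ unless its top-level package is a
// bundled dependency. Entry paths are assumed relative to the package root.
function keepEntry (entryPath, manifest) {
  const parts = entryPath.split('/')
  if (parts[0] !== 'node_modules') return true
  const bundled = manifest.bundleDependencies || manifest.bundledDependencies || []
  // account for scoped packages like node_modules/@scope/pkg/...
  const name = parts[1] && parts[1][0] === '@' ? `${parts[1]}/${parts[2]}` : parts[1]
  return bundled.includes(name)
}
```

An empty or missing `bundleDependencies` therefore drops the whole `node_modules` tree, which is the behavior the issue asks for.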
`preferOffline` and `offline` both change the fetch mechanism to, in one way or another, lean towards maximizing use of the local caches.
There is an issue, though, where package metadata may have been fetched into the cache, but a corresponding matching latest package may not have been downloaded. In these cases, it turns out, we may in fact have a semver-compatible tarball available in the cache that at least `offline` could fall back to.
This case, though, is probably pretty rare. I think. Just an idea out there that's pretty low-priority. And I might be wrong about how rare that situation really is.
There's currently a finalized-manifest caching scheme that is keyed off `pkg._resolved`, assuming it's a globally-unique, immutable key: this is not the case.
Perhaps it would be good to add the `_shasum` for a tarball to that key, and skip cache reads if we don't have a shasum either in `pkg._shasum` or `opts.digest`. Shove the `hashAlgorithm` in there too for good measure.
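A sketch of what a safer key might look like -- the exact format string and `sha1` default are hypothetical:

```javascript
// Hedged sketch: build a finalized-manifest cache key that includes the
// content digest, and refuse to produce one (i.e. skip the cache read)
// when no digest is available. The 'pacote:...' prefix is made up.
function finalizedManifestKey (pkg, opts) {
  const digest = pkg._shasum || opts.digest
  if (!digest) return null // no digest: caller should skip the cache read
  const algo = opts.hashAlgorithm || 'sha1'
  return `pacote:finalized-manifest:${pkg._resolved}:${algo}:${digest}`
}
```

Keying on `_resolved` plus digest means two different tarballs published behind the same URL can no longer collide in the cache.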
pacote should have a step-by-step guide on how to use it. This is probably pretty straightforward, since the API surface is relatively small. Still, it's good to have this.
npm/npm-package-arg#21 will eventually get merged, and it involves a pretty major API change for that library. pacote should switch to using that instead when parsing non-object specs, and expect npa output objects as the values passed in as spec objects.
There is a placeholder in CONTRIBUTING.md to instruct contributors on the steps they should follow to tag a new release of Pacote. Let's fill it!
In CONTRIBUTING.md the second paragraph reads:

> Please make sure to read the relevant section before making your contribution! It will make it a lot easier for us maintainers to make the most of it and smooth out the experience fo all involved.

"fo all involved" should be changed to "of all involved"!
Check out the section on Contributing Documentation to discover how to make this contribution!
In `registry-key.js`, there's currently the following call to `url.format`, which is dropping the path:

```js
const formatted = url.format({
  host: parsed.host,
  path: parsed.path,
  slashes: parsed.slashes
})
```

It looks like (presumably after nodejs/node#303) `path` is no longer supported, and we need to use `pathname` instead.
https://nodejs.org/api/url.html#url_url_format_urlobject
Some days ago on twitter: https://twitter.com/serapath/status/856908380731916288
Now I just stumbled upon the module.
It seems it currently does not work in the browser, but if it did, that would be awesome, because I would love to use it.
Other than that, one feature I'd love is prompting a user for a token so that it's possible to actually publish data to npm from the browser (think: in-browser JavaScript IDE).
I would also try to implement it myself, but I don't know what kind of requests I would need to make, or how I can learn about that, and on top of that whether it's even possible given CORS settings.
This one should definitely use git proper for any requests. And probably some sort of specialized caching technique (since it's not just gonna go through the http client)
Related to #17, another thing that we need to do in order to have a complete manifest is to fill in the `bin` field if there is a `directories` entry with `bin` in it, excluding anything that starts with a `.` in the `bin` dir. This isn't needed at all if there's already a `bin` field, or if there's no `directories` field.
npm-registry-client shoots out all these special headers itself. A lot of them have more to do with specific npm features than with just fetching packages.
Move these headers out of pacote and into a single `opts.extraHeaders` opt in npm.
Examples:

- `npm-scope`
- `npm-in-ci`
- `referer`
- `user-agent` (?)

pacote is built for performance. Performance is meaningless without benchmarks and profiling. So. We need benchmarks.
There should be benchmarks for each of the supported types (note: only registry ones are needed for `0.2.0`), with small, medium, and large packages (including some variation in number of files vs size of individual files). All of these for both manifest-fetching and extraction.
We should make sure the benchmark runs hit the following cases too, for each of the groups described above:

- `pkg._hasShrinkwrap === false` (so no extract)
- a cleared memoization cache (`lib/cache` exports a `_clearMemoized()` fn for this purpose)

https://npm.im/benchmark does support async stuff and seems like a reasonable base to build this suite upon.
Marking this as `starter` because, while it's likely to take some time to write, you need relatively little context to be able to write some baseline benchmarks for the above cases. The actual calls are literally all variations of `pacote.manifest()` and `pacote.extract()` calls: that's the level these benchmarks should run at, rather than any internals. At least for now.
I would also say that automatically comparing benchmark results across different versions is just a stretch goal, because the most important bit is to be able to run these benchmarks at all.
@simonua @zkat
Some npm registries redirect to another host for package tarball downloads. For example, Microsoft VSTS redirects to Azure Blob.
pacote (or possibly make-fetch-happen/node-fetch) appears to forward authorization headers on a redirect to another host, unlike previous versions of npm. In the specific case of Azure Blob, these credentials are invalid (the correct token is provided in the URI querystring), and an Authorization header must not be present.
This results in an error like:

```
npm ERR! 400 Authentication information is not given in the correct format. Check the value of Authorization header.
```
More generally, from a security perspective forwarding credentials (by default, at least) to another host isn't great.
Right now, we call `realize-package-specifier` whenever we call `pacote.tarball`, `pacote.manifest`, or `pacote.prefetch`. While this is convenient for our own testing, and potentially for standalone (read: non-npm) uses of the library, the CLI will basically always have a `Result` object to pass in -- then we can skip the whole parsing process, and we don't make the CLI construct a bullshit pretend-specifier-string like it does in a bunch of places right now.
`opt-check` was a pretty basic option handling mechanism, but I'm not feeling great about it: it silently fails if it gets unexpected options (and those options are later requested), it doesn't support types or any sort of verification for options, and it just assumes that everything is gonna want all the options.
But, as it turns out, as we call into individual things, they expect their own subsets of options. It might be nice to have an opts mechanism where every layer can specify exactly what it wants and needs, so it's easy to see what's using what, and at what level -- especially stuff we're passing to dependencies like `cacache`, which have a bunch of their own opts!
Currently, `lib/finalize-manifest` not only "fills out" and standardizes the manifest format, but might also potentially request and extract a tarball to make sure `_shasum`, `_shrinkwrap`, and `bin` have the right data in them.
All that heavy lifting of extracting package tarballs during the `manifest` stage, though, isn't cached at all.
A custom cache key of some sort should be added so that we can cache the results of `completeFromTarball` when a tarball extraction is needed. Don't risk hitting the disk unless we really have to. The results of that function can also be memoized, in case we have multiple requests for it.
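The memoization half could be as small as this sketch (the function shape and digest-based key are assumptions):

```javascript
// Hedged sketch: memoize completeFromTarball results by a digest-based key,
// so concurrent requests for the same tarball share a single extraction.
const completionMemo = new Map()

function memoizedCompleteFromTarball (key, completeFn) {
  if (!completionMemo.has(key)) {
    // store the promise itself, so in-flight work is shared, not repeated
    completionMemo.set(key, Promise.resolve().then(completeFn))
  }
  return completionMemo.get(key)
}
```

Keying on a digest (rather than `_resolved` alone) keeps this consistent with the cache-key fix discussed elsewhere in this tracker.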
This one should be super straightforward on the tarball side, but will probably need some munging on the manifest front, because we need to grab the manifests (probably mid-stream!) from the tarball.
Note: this would allow cached tarball downloads, and it can probably lean right on `registry/request.js`. `manifest` can probably be done with a dummy manifest containing a `_resolved` field, then expanding `finalize-manifest` to fill in the rest of the manifest from the `package.json` in the tarball :)
There is a placeholder in CONTRIBUTING.md to instruct contributors on the steps they should follow to merge a pull request. Let's fill it!
Both Pacote and Cacache are using this label, but it's not documented in either CONTRIBUTING guide.
Steps to reproduce: `npm install` with the following line in `package.json` (`[REDACTED]` is a Bitbucket Server host; this dependency works in npm 4):

```json
"circular-list": "git+ssh://git@[REDACTED]/circular-list.git#v1.0.2",
```
Output:

```
npm ERR! code 128
npm ERR! Command failed: /usr/local/bin/git clone --depth=1 -q -b v1.0.2^{} ssh://git@[REDACTED]/circular-list.git /Users/matthew.brennan/.npm/_cacache/tmp/git-clone-d9115048
npm ERR! warning: templates not found /var/folders/0m/smmrszcj367g1ds3nkjrv2y42l6kl3/T/pacote-git-template-tmp/git-clone-410ac485
npm ERR! fatal: Remote branch v1.0.2^{} not found in upstream origin
npm ERR!
```

`npm version`:

```json
{ npm: '5.0.0',
  ares: '1.10.1-DEV',
  http_parser: '2.7.0',
  icu: '57.1',
  modules: '48',
  node: '6.9.2',
  openssl: '1.0.2j',
  uv: '1.9.1',
  v8: '5.1.281.88',
  zlib: '1.2.8' }
```
cloned from npm/npm#16789
If npm has a packument cached, but fails to find a matching version in it, it will assume a cache miss and try a full request. See https://github.com/npm/npm/blob/1067febf1875c92d6498ede7c0b20012a0c33d30/lib/fetch-package-metadata.js#L146-L152
This can cause some annoying issues when, for example, someone tries to bump their local version shortly after publishing -- their next install will take some period of time (depending on `opts.maxAge`) before the manifest request expires and gets re-requested.
This can probably be implemented right into https://github.com/zkat/pacote/blob/latest/lib/registry/manifest.js. The general idea would be to have `pacote.manifest()` try the usual case of a requested version being found, and if `pickManifest` fails, try the request + manifest picking just one more time, after busting the cache.
To cache bust, two things will be needed: a way for the cache to invalidate a specific key and nothing else (on disk), and a way to bust the memoized version of that key. That can be added to `lib/cache/index.js`.
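The retry flow might look roughly like this. All three collaborator functions are hypothetical stand-ins for the real pieces in `lib/registry/manifest.js` and `lib/cache/index.js`, passed in so the sketch stays self-contained:

```javascript
// Hedged sketch of the pick-retry-with-cache-bust flow. The ETARGET code
// and the maxAge:0 bust are illustrative assumptions.
function manifestWithRetry (spec, opts, { fetchPackument, pickManifest, bustCache }) {
  return fetchPackument(spec, opts)
    .then(packument => pickManifest(packument, spec))
    .catch(err => {
      if (err.code !== 'ETARGET') throw err
      // assume a stale cache rather than a truly missing version: bust the
      // cached packument (disk + memoized copy) and try exactly once more
      return bustCache(spec)
        .then(() => fetchPackument(spec, Object.assign({}, opts, { maxAge: 0 })))
        .then(packument => pickManifest(packument, spec))
    })
}
```

Retrying exactly once keeps the genuinely-missing-version case from turning into a request loop.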
So, with the http client, it's possible for a request to die mid-stream. Right now, that just kinda implodes and starts the process over. Instead, we should emit `reset` events on retries. For bonus points, the client should handle HTTP Range requests, which would avoid that reset on http retries -- the stream could resume exactly where it left off!
Range requests are often supported OOTB by various http servers, and we can just check whether our Range was accepted (by looking for `Content-Range`) and otherwise do the full `reset`. This should be cool!
Implementing this, though, very likely requires ripping open `npm-registry-client`, which I guess we should be doing anyway.
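The accepted-or-not check amounts to inspecting `Content-Range` on the retry response. A sketch (header syntax per RFC 7233; the surrounding retry machinery and the 206-status check a real implementation would also want are omitted):

```javascript
// Hedged sketch: given the response headers of a retried request that asked
// for `Range: bytes=<offset>-`, decide whether we can resume the stream at
// that offset or must fall back to a full `reset`.
function canResumeFrom (headers, offset) {
  const contentRange = headers['content-range']
  if (!contentRange) return false // server ignored the Range header: full reset
  // e.g. "bytes 12345-99999/100000" (total may also be "*")
  const match = /^bytes (\d+)-\d+\/(?:\d+|\*)$/.exec(contentRange)
  return Boolean(match) && Number(match[1]) === offset
}
```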
cacache 6 involves some big changes! Most notably, changing a bunch of stuff to be Promise-based, a new on-disk format, and moving all the memoization code out of pacote and back into cacache itself.
As part of this integration, pacote itself should be updated to use Promises, the lib/cache code should be torn out, and cacache should be used directly.
This is gonna have to start before cacache@6 itself is tagged, because I really wanna make sure the API changes are good and we don't need to move anything else in there.
I would really like to have a straightforward CONTRIBUTING.md file folks can check out when they open up this repo -- hacking on pacote is a fairly streamlined thing, and it shouldn't need much explaining. This, combined with the `starter` tag I'm slapping on stuff, should be a huge help in getting outside contributions <3
Things like `cacheKey` and the various calls to `cacache` are probably best put under a small wrapper module for `cacache` that exposes a more focused, dedicated API containing some of the npm-specific business logic (formatting keys, limited API surface, etc.).