zkat / pacote

programmatic npm package and metadata downloader (moved!)

Home Page: https://github.com/npm/pacote

License: MIT License

JavaScript 100.00%
npm package-management

pacote's Issues

revalidate tarballs on checksum failure

The cache can get into various states where cached data will no longer pass integrity checks. When this happens, the prefetch or extract call fails because local data's bad.

So. Try again.
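A minimal sketch of that retry, assuming cacache's real `rm.entry` API; `readCached` and `refetch` are hypothetical stand-ins for pacote's internal cache read and network fetch:

```js
const cacache = require('cacache')

// On an integrity failure, drop the bad index entry and hit the network again.
// readCached/refetch are hypothetical; cacache.rm.entry is cacache's real API.
function fetchWithRevalidation (cache, key, readCached, refetch) {
  return readCached(cache, key).catch(err => {
    if (err.code !== 'EINTEGRITY') { throw err }
    // Cached data no longer matches its checksum -- remove and refetch.
    return cacache.rm.entry(cache, key).then(() => refetch(key))
  })
}
```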

feature: switch to node-tar@2

There's cool stuff happening over in node-tar land. Once it's solid and released, let's shove it in pacote and bask in its absurdly performant glory.

Split caching request client into a separate module or package

So lib/registry/request has very little about it that is actually registry-specific. It's probably a good idea to fork this out as a standalone, server-agnostic caching client, instead of the current setup, which mixes in a bit of registry-specific work. This should make the client usable for other things like remote tarballs and grabbing git stuff directly.

Move caching code straight into cacache

Honestly, everything in that cache wrapper is a simple call straight to cacache. The main addition of importance here is memoization, and that can reasonably be moved straight into cacache for convenience.

Once the http client has been refactored, it'll probably account for most of the cache-related calls. That client, then, can just call cacache instead of maintaining even more code.

Return ETARGET on missing manifest?

npm currently returns an ETARGET error whenever it tries and fails to fetch package metadata: https://github.com/npm/npm/blob/1067febf1875c92d6498ede7c0b20012a0c33d30/lib/fetch-package-metadata.js#L154-L162

For the sake of better general compatibility, should pacote return the same type of error? I've been thinking that the current style of just chucking out ENOENT is bound to cause problems if there are different types of ENOENTs coming from different parts of pacote or cacache.

offline mode

support an enforced offline mode which errors if any network requests are attempted and tries to use the cache as much as possible
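A sketch of what that enforcement could look like -- `cacheLookup` and `networkFetch` are hypothetical stand-ins, and the error code is illustrative:

```js
// Try the cache first; only touch the network when offline mode is off.
function offlineAwareFetch (spec, opts, cacheLookup, networkFetch) {
  return cacheLookup(spec, opts).catch(() => {
    if (opts.offline) {
      // Enforced offline mode: a cache miss is a hard error.
      const err = new Error(`no cached data for ${spec} and offline mode is on`)
      err.code = 'ENOTCACHED' // illustrative error code
      throw err
    }
    return networkFetch(spec, opts)
  })
}
```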

Add ARCHITECTURE.md

Once the codebase stabilizes a bit, I want to write an ARCHITECTURE.md document that gives an overview of how the project is structured, the purpose of various components, overall design concepts that are good to remember which might not be too obvious, etc.

Write tests for git deps

There's basically no coverage on them, but they might be a bit tricky to write due to having to set up and launch git daemons. Once the mocking utility's done, though, things should go much faster.

This really took a bite out of general coverage for the library so it's best to try and get this done sooner rather than later :\

pacote.prefetch

The npm installer currently has 3 major stages relevant to pacote -- but only uses two, right now.

While it's nice to stream end-to-end in a single step with pacote.extract, the whole point of having a multi-stage installer is to take advantage of the predictability and isolation that come from having discrete steps. pacote.extract, as it turns out, does a lot -- and it's used in the extract step of the installer.

There's another step that's been commented out for a while now: fetch, which is intended to be the stage where npm will actually go out into the network to grab any tarballs.

So, the proposal here is to add another toplevel API function to pacote: pacote.prefetch. Its job should be purely to warm up pacote's cache and allow pacote.extract to bypass the cache index and always extract tarballs by digest, since it'll know those tarballs are present.

The nice thing is all the code that pacote.prefetch would need is already there: simply calling the appropriate tarball.js handler and draining the stream into the void (stream.on('data', function () {})) is good enough to put it together. All of that code is already in pacote.extract. 🎉
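A sketch of that glue -- `tarballStreamFor` is a hypothetical stand-in for dispatching to the right tarball.js handler:

```js
// Warm the cache by pulling the tarball through and discarding the bytes;
// the caching layer tees off its own copy as the stream flows.
function prefetch (spec, opts, tarballStreamFor) {
  return new Promise((resolve, reject) => {
    const stream = tarballStreamFor(spec, opts)
    stream.on('data', () => {})      // drain into the void
    stream.on('error', reject)
    stream.on('end', resolve)
  })
}
```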

Normalize and standardize manifests

pacote should normalize metadata fields to only the things the CLI might need, and standardize those across the different sources. Should also run the manifest through normalize-package-data.
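For the normalize-package-data half, a minimal sketch -- normalize-package-data's real API mutates in place; the field whitelist here is illustrative, not pacote's actual one:

```js
const normalize = require('normalize-package-data')

// Normalize a raw manifest, then keep only the fields the CLI might need.
function finalizeManifest (raw) {
  normalize(raw)
  return {   // illustrative whitelist
    name: raw.name,
    version: raw.version,
    dependencies: raw.dependencies || {},
    optionalDependencies: raw.optionalDependencies || {},
    devDependencies: raw.devDependencies || {},
    bin: raw.bin,
    _resolved: raw._resolved,
    _shasum: raw._shasum
  }
}
```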

Better auth support

The current auth stuff is kinda janky. Figure out an auth mechanism that generalizes well and make configuring it more straightforward. :waves-hands:

Distinguish between pacote-bugs, user errors, and server issues

Right now, there's a bunch of errors that get spit out by various deps and such that we use.

The CLI eventually has to make its own decisions about these, but it would be super handy to distinguish between three main error categories:

  • Things the user did wrong (and can fix): auth errors, bad arg or opt syntax, 404s, etc.
  • Things pacote code fucked up with: basically any unexpected conditions
  • Abnormal conditions: bad data, missing content, network timeouts, filesystem failures

This might be a really great first step towards having much richer error reporting data for CLI users to consume -- especially user-level conditions that we might be able to be very specific about.

performance instrumentation

Add various bits of noteworthy performance analytics to be collected on the fly, and log them out as things complete.

Some ideas:

  • manifest fetch time
  • tarball fetch time
  • extract time
  • number of finalize-manifest tarball extractions

http://npm.im/request-capture-har might be of use for the network part of this

Implement git handlers

This one's a tricky one. There are many ways to get the contents of a git repo. The best way for pacote to handle git is by judiciously picking which of these to use depending on what's being requested, and trying to avoid a full git clone at all costs.

Semver range dependencies should be resolved according to npm/npm#15308. These can be resolved with a git ls-remote.
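A sketch of that ls-remote resolution, using the real semver package; the parsing details are assumptions:

```js
const cp = require('child_process')
const semver = require('semver')

// Resolve a semver range against a repo's remote tags, per npm/npm#15308.
function resolveRange (repo, range) {
  return new Promise((resolve, reject) => {
    cp.execFile('git', ['ls-remote', '-t', repo], (err, stdout) => {
      if (err) { return reject(err) }
      // Lines look like: <sha>\trefs/tags/v1.2.3
      const tags = stdout.trim().split('\n')
        .filter(line => line.includes('\t'))
        .map(line => {
          const parts = line.split('\t')
          return { sha: parts[0], version: semver.clean(parts[1].replace('refs/tags/', '')) }
        })
        .filter(t => t.version)   // drops non-semver tags and ^{} peel lines
      const wanted = semver.maxSatisfying(tags.map(t => t.version), range)
      resolve(wanted && tags.find(t => t.version === wanted))
    })
  })
}
```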

These are the possible ways I've found so far to get either full package data, or subsets, and associated caveats:

fullfat clone

$ git clone https://github.com/npm/npm

  • will work pretty much universally
  • may need to try fallback protocols for hosted git
  • is bloody slow (npm/npm: 13s, zkat/cacache: 1.7s (so it won't be so bad on small repos))
  • needs its own entire caching scheme to retain repos, if desired (can be postponed?)

shallow clone

$ git clone https://github.com/npm/npm --depth=1 -b <named-ref>

  • Only works for HEAD or named refs. Any commit hash that has an existing remote ref will work, too.
  • Will always work for semver-range git, because all tags are named refs. 🎉
  • much much faster than git clone for larger repos but still fairly heavy. (npm/npm: 4.35s, zkat/cacache: 1.52s)
  • github folks might not like us very much if we do this a lot?
  • can tar + cache directly and remove cloned dir from tmp
  • I thought I could fall back to a HEAD clone + git fetch but alas that is also not possible without a named ref.

git archive

$ git archive --format=tar.gz --prefix=package/ --remote=https://github.com/npm/npm <committish>

  • direct tarball download
  • BIG CAVEAT: must be manually enabled server-side. github does not enable it. Seems to fail very quickly when not enabled, though.
  • works on non-hosted git when enabled
  • uses regular git authentication mechanism
  • Probably a really fantastic idea for private corporate git servers
  • package/ prefix can be added with --prefix= option.
  • There is also a terrifying monstrosity that lets you fetch individual files, but I document this here purely for the horror value. It's not worth it.

hosted git tarballs

$ curl -LO https://github.com/npm/npm/archive/<committish>.tgz

  • github caches these! they can be pretty fast! (npm/npm: 0.6s, zkat/cacache: 0.39s)
  • I have no idea right now how to authenticate these for private repos
  • can target any committish (not just named refs)
  • only available on hosted git types supported by hosted-git-info
  • can lean on pacote's existing http caching mechanisms, transparently
  • contents will not be inside package/ by default, so we need to manually add a level

individual direct file download

$ curl -LO https://raw.githubusercontent.com/npm/npm/<committish>/package.json

  • only for fetching package.json and npm-shrinkwrap.json
  • can't even fill out bin when directories.bin is there.
  • only works on hosted gits that support it.
  • allows fast filling out of some manifests without having to fetch a full tarball.

remote ref lookup

$ git ls-remote https://github.com/npm/npm -t -h '*'

  • Fetches a full ref list from a remote
  • Not as fast as you'd think (npm/npm: 0.83s, zkat/cacache: 0.37s)
  • Useful for finding named refs (for semver support or to possibly avoid a full clone)
  • No speed difference between -t -h and -t. The former will use more RAM but increase chances of a non-semver committish matching. The latter will be smaller and be all that's needed for a semver ref.

When I think about implementing this

everybody panic

Bulk request fetching

bulk stuff is way faster than stream-based stuff. Should do #25 before doing this just so we have solid numbers on the difference, and stream stuff needs to still exist because we want to be able to handle multi-gig files.

Ideally, pacote.extract and pacote.prefetch would only use streams for particularly big packages. pacote.manifest should always use bulk requests.

Implement local tarball handler

This one should be super simple! Add the tarball to the cache pretty much directly! Manifests, again, will need to get picked up during extraction, though :(

feature: "safe mode" for extraction

While the CLI probably doesn't need to worry about this much except in case of catastrophe, there's some user tooling that could really benefit from pacote's default mode refusing to overwrite directory contents on extract.

opts.extractOverwrite should be required for anyone who targets a directory that: A. exists, B. has any contents in it.

It's ok to be racy about this. If two processes shove things in one dir at the same time, so be it. This feature is primarily to protect against what is bound to be a common footgun for users using straight-up pacote (it literally just bit me and I don't wanna even).

If you think this is an interesting bug to pick up, this is what I think is generally the direction to go in:

  • add the extractOverwrite option to lib/util/opt-check -- you won't be able to read it otherwise.

  • add a conditional readdirAsync() call early on in extract.js, before most other work is done. (note: readdirAsync is basically const readdirAsync = BB.promisify(fs.readdir), by convention).

  • If the resulting listing has any items in it, throw an error with a useful code and an explanation of what the user tried to do -- include a mention of opts.extractOverwrite in the error message so it's discoverable.

  • If opts.extractOverwrite is true, bypass the fs.readdirAsync call entirely with a BB.resolve().

Feel free to do it your way, too, if you find a better alternative. The goal is to prevent users from accidentally writing into things they didn't intend to write to.
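For reference, a minimal sketch of the checks described above (the error code and message are illustrative):

```js
const BB = require('bluebird')
const fs = require('fs')
const readdirAsync = BB.promisify(fs.readdir)

// Refuse to extract into a non-empty directory unless opts.extractOverwrite.
function checkDest (dest, opts) {
  if (opts.extractOverwrite) { return BB.resolve() }
  return readdirAsync(dest).then(contents => {
    if (contents.length) {
      const err = new Error(`${dest} already has contents; pass opts.extractOverwrite to extract anyway`)
      err.code = 'EEXIST'   // illustrative code
      throw err
    }
  }, err => {
    if (err.code !== 'ENOENT') { throw err }   // a missing dir is fine
  })
}
```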

Include deprecation information

Deprecation warnings in npm are weird: right now, it's the CLI's cache that takes care of this. It's probably not a good idea for the cache itself to be responsible for this. At the same time, deprecation information is passed to the CLI in headers -- so pacote would have to know about them already.

So, do the following: Add a _deprecated: Bool field to the finalized manifest based on that registry header. Let the npm installer take care of the rest.
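A sketch of where that flag would get attached -- the header name is hypothetical, and `res` is assumed to be a fetch-style response:

```js
// Copy deprecation info from the registry response onto the finalized manifest.
function annotateDeprecation (manifest, res) {
  manifest._deprecated = Boolean(res.headers.get('npm-deprecated'))   // hypothetical header
  return manifest
}
```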

cache fallback for offline modes

preferOffline and offline both change the fetch mechanism to, in one way or another, lean towards maximizing use of the local caches.

There is an issue, though, where package metadata may have been fetched into the cache while a matching package tarball was never downloaded. In these cases, it turns out, we may in fact have a semver-compatible tarball available in the cache that at least offline could fall back to in this particular corner case.

This case, though, is probably pretty rare. I think. Just an idea out there that's pretty low-priority. And I might be wrong about how rare that situation really is.

cache invalidation for finalized manifests

there's currently a finalized manifest caching scheme that is keyed off pkg._resolved, assuming it's a globally-unique, immutable key: this is not the case.

Perhaps it would be good to add the _shasum for a tarball to that name, and skip cache reads if we don't have a shasum either in pkg._shasum or opts.digest. Shove the hashAlgorithm in there too for good measure.
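A sketch of what that key could look like (the format is illustrative):

```js
// Build a cache key that includes the content hash, so mutable _resolved
// values can't collide. No hash available? Skip cache reads entirely.
function manifestCacheKey (pkg, opts) {
  const shasum = pkg._shasum || opts.digest
  if (!shasum) { return null }
  const algo = opts.hashAlgorithm || 'sha1'
  return `pacote:finalized-manifest:${pkg._resolved}:${algo}:${shasum}`
}
```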

Write user guide

pacote should have a step-by-step guide on how to use it. This is probably pretty straightforward, since the API surface is relatively small. Still, it's good to have this.

Switch to new npm-package-arg

npm/npm-package-arg#21 will eventually get merged, and it involves a pretty major API change for that library. pacote should switch to using that instead when parsing non-object specs, and expect npa output objects to be the values passed in as spec objects.

Correct minor spelling mistake in CONTRIBUTING

In CONTRIBUTING.md the second paragraph reads

Please make sure to read the relevant section before making your contribution! It will make it a lot easier for us maintainers to make the most of it and smooth out the experience fo all involved. 💚

"fo all involved" should be changed to "of all involved"!

Check out the section on Contributing Documentation to discover how to make this contribution!

🌞

Can this be made to work in the browser please? :-)

Some days ago on twitter https://twitter.com/serapath/status/856908380731916288

Now I just stumbled upon the module.

It seems it currently does not work in the browser, but if it did, that would be awesome, because I would love to use it.

Other than that -- one feature I'd love is to use it to prompt a user for a token so that it's possible to actually publish data to npm from the browser (think: in-browser JavaScript IDE).

I would also try to implement it myself, but I don't know what kind of requests I would need to make, how I can learn about that, or whether it's even possible given CORS settings.

Implement generic git handler

This one should definitely use git proper for any requests. And probably some sort of specialized caching technique (since it's not just gonna go through the http client)

Fill in `bin` directories for manifests

Related to #17, another thing we need to do in order to have a complete manifest is to fill in the bin field if there is a directories entry with bin in it, excluding anything that starts with a . in the bin dir. This isn't needed at all if there's already a bin field, or if there's no directories field.
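A sketch, where `files` is a hypothetical listing of the bin directory's contents:

```js
const path = require('path')

// Derive `bin` from `directories.bin`, skipping dotfiles.
function fillBin (manifest, files) {
  if (manifest.bin || !manifest.directories || !manifest.directories.bin) {
    return manifest   // nothing to do
  }
  manifest.bin = {}
  files.filter(f => !path.basename(f).startsWith('.')).forEach(f => {
    manifest.bin[path.basename(f)] = path.join(manifest.directories.bin, f)
  })
  return manifest
}
```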

opts.extraHeaders for extensible header-passing

npm-registry-client shoots out all these special headers itself. A lot of them have to do more with specific npm features than just fetching packages.

Move these headers out of pacote and into a single opts.extraHeaders opt in npm.

Examples:

  • npm-scope
  • npm-in-ci
  • referer
  • user-agent (?)

Start benchmark suite

pacote is built for performance. Performance is meaningless without benchmarks and profiling. So. We need benchmarks.

There should be benchmarks for each of the supported types (note: only registry ones are needed for 0.2.0), with small, medium, and large packages (including some variation for number of files vs size of individual files). All of these for both manifest-fetching and extraction.

We should make sure all the benchmarks run hit the following cases too, for each of the groups described above:

  • no shrinkwrap, tarball extract required
  • no shrinkwrap, but with pkg._hasShrinkwrap === false (so no extract)
  • has shrinkwrap, with alternative fetch (so, an endpoint, git-host, etc)
  • has shrinkwrap, tarball extract required
  • cached data, no memoization (lib/cache exports a _clearMemoized() fn for this purpose)
  • memoized manifest data (tarballs are not memoized)
  • cached data for package needing shrinkwrap fetch
  • memoized data for package needing shrinkwrap fetch
  • stale cache data (so, 304s)
  • concurrency of 50-100 for all of the above, to check for contention and starvation (this is usually what the CLI will set its concurrency to).

https://npm.im/benchmark does support async stuff and seems like a reasonable base to build this suite upon.
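A minimal sketch of one deferred benchmark at that level, using benchmark.js's real deferred API (the target package is arbitrary):

```js
const Benchmark = require('benchmark')
const pacote = require('./')   // assumes this runs from the repo root

new Benchmark.Suite()
  .add('manifest: small registry pkg', {
    defer: true,
    fn: function (deferred) {
      // Resolve either way so a failed fetch doesn't hang the suite.
      pacote.manifest('mkdirp@^0.5.0')
        .then(() => deferred.resolve(), () => deferred.resolve())
    }
  })
  .on('cycle', event => console.log(String(event.target)))
  .run({ async: true })
```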

Marking this as starter because while it's likely to take some time to write, you need relatively little context to be able to write some baseline benchmarks for the above cases. The actual calls are literally all variations of pacote.manifest() and pacote.extract() calls: that's the level these benchmarks should run at, rather than any internals. At least for now.

I would also say that comparing benchmark results across different versions automatically is just a stretch goal, because the most important bit is to be able to run these benchmarks at all.

Authorization header is forwarded on redirect to different host

@simonua @zkat
Some npm registries redirect to another host for package tarball downloads. For example, Microsoft VSTS redirects to Azure Blob.

pacote (or possibly make-fetch-happen/node-fetch) appears to forward authorization headers on a redirect to another host, unlike previous versions of npm. In the specific case of Azure Blob, these credentials are invalid (the correct token is provided in the URI querystring), and an Authorization header must not be present.
This results in an error like:

npm ERR! 400 Authentication information is not given in the correct format. Check the value of Authorization header.

More generally, from a security perspective forwarding credentials (by default, at least) to another host isn't great.

Skip specifier parsing if we already got a specifier object

Right now, we call realize-package-specifier whenever we call pacote.tarball, pacote.manifest, or pacote.prefetch. While this is convenient for our own testing, and potentially standalone (read: non-npm) uses of the library, the CLI will basically always have a Result object to pass in -- then we can skip the whole parsing process, and we don't make the CLI construct a bullshit pretend-specifier-string like it does in a bunch of places right now.
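A sketch of the dispatch, assuming realize-package-specifier's callback API:

```js
const BB = require('bluebird')
const rps = BB.promisify(require('realize-package-specifier'))

// Only parse strings; assume anything else is already a parsed Result object.
function optSpec (spec, where) {
  return typeof spec === 'string' ? rps(spec, where) : BB.resolve(spec)
}
```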

Better config system

opt-check is a pretty basic option handling mechanism, but I'm not feeling great about it: it silently fails if it gets unexpected options (which are then unavailable when later requested), it doesn't support types or any sort of verification for options, and it just assumes that everything is gonna want all the options.

But, as it turns out, as we call individual things, they expect other subsets of options. It might be nice to have an opts mechanism where every layer can specify exactly what it wants and needs, so it's easy to see what's using what, and at what level -- especially stuff we're passing to dependencies like cacache, which have a bunch of their own opts!

Cache the work in `finalize-manifest`

Currently, lib/finalize-manifest not only "fills out" and standardizes the manifest format, but might also request and extract a tarball to make sure _shasum, _shrinkwrap, and bin have the right data in them.

All that heavy lifting of extracting package tarballs during the manifest stage, though, isn't cached at all.

A custom cache key of some sort should be added such that we can cache the results of completeFromTarball only when a tarball extraction is needed. Don't risk hitting the disk unless we really have to. The results of that function can also be memoized, in case we have multiple requests for it.
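A sketch of the memoization half -- `completeFromTarball` is the function named above, passed in here to keep the sketch self-contained, and the key format is illustrative:

```js
// Share one in-flight tarball completion between concurrent requests for the
// same content.
const inFlight = new Map()

function completeMemoized (manifest, opts, completeFromTarball) {
  const shasum = manifest._shasum || opts.digest
  if (!shasum) { return completeFromTarball(manifest, opts) }   // no safe key -- don't memoize
  const key = `${manifest._resolved}:${shasum}`
  if (!inFlight.has(key)) {
    inFlight.set(key, completeFromTarball(manifest, opts))
  }
  return inFlight.get(key)
}
```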

Implement remote tarball handler

This one should be super straightforward on the tarball side, but will probably need some munging on the manifest front because we need to grab the manifests (probably mid-stream!) from the tarball.

Note: this would allow cached tarball downloads, and it can probably lean right on registry/request.js. manifest can probably be done with a dummy manifest containing a _resolved field, then expanding finalize-manifest to fill in the rest of the manifest from the package.json in the tarball :)

Document 'refacotr' tag

Both pacote and cacache are using this label, but it's not documented in either CONTRIBUTING guide.

Cannot install git dependency from Bitbucket Server

Steps to reproduce: npm install with the following line in package.json ([REDACTED] is a Bitbucket Server host; this dependency works in npm 4)

"circular-list": "git+ssh://git@[REDACTED]/circular-list.git#v1.0.2",

output:

npm ERR! code 128
npm ERR! Command failed: /usr/local/bin/git clone --depth=1 -q -b v1.0.2^{} ssh://git@[REDACTED]/circular-list.git /Users/matthew.brennan/.npm/_cacache/tmp/git-clone-d9115048
npm ERR! warning: templates not found /var/folders/0m/smmrszcj367g1ds3nkjrv2y42l6kl3/T/pacote-git-template-tmp/git-clone-410ac485
npm ERR! fatal: Remote branch v1.0.2^{} not found in upstream origin
npm ERR!


`npm version`:

```
{ npm: '5.0.0',
  ares: '1.10.1-DEV',
  http_parser: '2.7.0',
  icu: '57.1',
  modules: '48',
  node: '6.9.2',
  openssl: '1.0.2j',
  uv: '1.9.1',
  v8: '5.1.281.88',
  zlib: '1.2.8' }
```

cloned from npm/npm#16789

Manifest cache should be skipped if compatible version not found

If npm has a manifest cached, but fails to find a matching version in a given manifest, it will assume a cache miss and try a full request. See https://github.com/npm/npm/blob/1067febf1875c92d6498ede7c0b20012a0c33d30/lib/fetch-package-metadata.js#L146-L152

This can cause some annoying issues when, for example, someone tries to bump their local version shortly after publishing -- their next install will take some period of time (depending on opts.maxAge) before the manifest request expires and gets re-requested.

This can probably be implemented right into https://github.com/zkat/pacote/blob/latest/lib/registry/manifest.js. The general idea would be to have pacote.manifest() try the usual case of a requested version being found, and after pickManifest, try the request + manifest picking just one more time, after busting the cache.

To cache bust, two things will be needed: one, a way for a cache to invalidate a specific key and nothing else (on disk), and another to bust the memoized version of that key. That can be added to lib/cache/index.js.
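A sketch of the retry flow -- `fetchPackument` and `clearKey` are hypothetical stand-ins for the registry request and the two-level invalidation described above, and pickManifest is the version-picking step named earlier:

```js
// Pick from the cached packument first; on ETARGET, bust the cache and try
// exactly once more.
function manifestWithCacheBust (spec, opts, deps) {
  return deps.fetchPackument(spec, opts).then(packument => {
    try {
      return deps.pickManifest(packument, spec.spec)
    } catch (err) {
      if (err.code !== 'ETARGET') { throw err }
      return deps.clearKey(spec, opts)   // drop disk entry + memoized copy
        .then(() => deps.fetchPackument(spec, opts))
        .then(fresh => deps.pickManifest(fresh, spec.spec))
    }
  })
}
```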

Support ECONNRESET recovery

So with the http client, it's possible for a request to die mid-stream. Right now, that just kinda implodes and starts the process over. Instead, we should emit reset events on retries. For bonus points, the client should handle http Range requests, which would avoid that reset on http retries -- so the stream can start over exactly where it left off!

Range requests are often supported OOTB by various http servers, and we can just check if our Range was accepted (by looking for Content-Range) and otherwise do the full reset. This should be cool!
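A sketch of that resume check, using node-fetch-style calls; the details are illustrative rather than make-fetch-happen's actual logic:

```js
const fetch = require('node-fetch')

// Ask to resume from `bytesReceived`; if the server ignores the Range header,
// signal that a full reset is needed.
function resumeDownload (url, bytesReceived, opts) {
  const headers = Object.assign({}, opts.headers, {
    Range: `bytes=${bytesReceived}-`
  })
  return fetch(url, Object.assign({}, opts, { headers })).then(res => {
    const resumed = res.status === 206 && Boolean(res.headers.get('Content-Range'))
    return { stream: res.body, resumed }
  })
}
```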

Implementing this, though, very likely requires ripping open npm-registry-client, which I guess we should be doing anyway.

Integrate cacache@6

cacache 6 involves some big changes! Most notably, changing a bunch of stuff to be Promise-based, a new on-disk format, and moving all the memoization code out of pacote and back into cacache itself.

As part of this integration, pacote itself should be updated to use Promise, the lib/cache code should be torn out, and cacache should be used directly.

This is gonna have to start before cacache@6 itself is tagged because I really wanna know the API changes are good and we don't need to move anything else in there.

Add CONTRIBUTING.md

I would really like to have a straightforward CONTRIBUTING.md file folks can check out when they open up this repo -- hacking on pacote is a fairly streamlined thing, and it shouldn't need much explaining. This, combined with the starter tag I'm slapping on stuff, should be a huge help in getting outside contributions <3
