This project forked from algolia/npm-search

License: MIT License

🗿 npm-search ⛷ 🐌 🛰

npm ↔️ Algolia replication tool.

This is a failure-resilient process that replicates the npm registry to an Algolia index. It replicates all npm packages to the index and keeps it up to date.

The state of the replication is saved in Algolia index settings.

The replication should always be running, with no more than one instance per Algolia index at any time. If the process fails, restart it and replication will resume from the last point it saved.
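The resume behaviour described above can be sketched as a checkpoint that is saved after every processed change. This is a minimal illustration, not the project's code: `InMemoryStateStore`, `ReplicationState`, `Change`, `runReplication` and the `failAtSeq` parameter are hypothetical names, and the in-memory store stands in for the Algolia index settings where the real state lives.

```typescript
// Hypothetical state shape; the real project keeps its state in Algolia index settings.
interface ReplicationState {
  seq: number; // last registry change sequence that was fully processed
}

// In-memory stand-in for the settings-backed state store (an assumption of this sketch).
class InMemoryStateStore {
  private state: ReplicationState = { seq: 0 };

  get(): ReplicationState {
    return { ...this.state };
  }

  save(next: ReplicationState): void {
    this.state = { ...next };
  }
}

interface Change {
  seq: number;
  id: string;
}

// Process every change newer than the checkpoint and save the checkpoint after each one.
// `failAtSeq` simulates a crash so that the resume behaviour can be observed.
function runReplication(
  store: InMemoryStateStore,
  changes: Change[],
  indexed: string[],
  failAtSeq?: number
): void {
  const lastSeq = store.get().seq;
  for (const change of changes) {
    if (change.seq <= lastSeq) continue; // already replicated before the restart
    if (change.seq === failAtSeq) {
      throw new Error(`simulated crash at seq ${change.seq}`);
    }
    indexed.push(change.id);
    store.save({ seq: change.seq });
  }
}
```

Calling `runReplication` again with the same store after a failure skips everything up to the saved `seq`, so no change is indexed twice in this sketch.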

Algolia Index

Using the public index

The Algolia index is currently used, for free, by a few selected projects (e.g., yarnpkg.com, codesandbox.io, jsdelivr.com).

If you want to use this index in your project, please reach out to [email protected].

To be eligible your project must meet these requirements:

  • Publicly available: The project must be publicly usable and, if applicable, include documentation or instructions on how the community can use it.
  • Non-commercial: The project cannot be used to promote a product or service; it has to provide something of value to the community at no cost. Applications for non-commercial projects backed by commercial entities will be reviewed on a case-by-case basis.

You can also use the code or the public Docker image to run your own (as of September 2021 it will create ~3M records x2).

Schema

For every npm package, we create a record in the Algolia index. The resulting records have the following schema:

{
  name: 'babel-core',
  downloadsLast30Days: 10978749,
  downloadsRatio: 0.08310651682685861,
  humanDownloadsLast30Days: '11m',
  jsDelivrHits: 11684192,
  popular: true,
  version: '6.26.0',
  versions: {
    // [...]
    '7.0.0-beta.3': '2017-10-15T13:12:35.166Z',
  },
  tags: {
    latest: '6.26.0',
    old: '5.8.38',
    next: '7.0.0-beta.3',
  },
  description: 'Babel compiler core.',
  dependencies: {
    'babel-code-frame': '^6.26.0',
    // [...]
  },
  devDependencies: {
    'babel-helper-fixtures': '^6.26.0',
    // [...]
  },
  repository: {
    url: 'https://github.com/babel/babel/tree/master/packages/babel-core',
    host: 'github.com',
    user: 'babel',
    project: 'babel',
    path: '/tree/master/packages/babel-core',
    branch: 'master',
  },
  readme: '# babel-core\n\n> Babel compiler core.\n\n\n [... truncated at 200kb]',
  owner: {
    // either GitHub owner or npm owner
    name: 'babel',
    avatar: 'https://github.com/babel.png',
    link: 'https://github.com/babel',
  },
  deprecated: 'Deprecated', // This field will be removed, please use `isDeprecated` instead
  isDeprecated: true,
  deprecatedReason: 'Deprecated',
  badPackage: false,
  homepage: 'https://babeljs.io/',
  license: 'MIT',
  keywords: [
    '6to5',
    'babel',
    'classes',
    'const',
    'es6',
    'harmony',
    'let',
    'modules',
    'transpile',
    'transpiler',
    'var',
    'babel-core',
    'compiler',
  ],
  created: 1424009748555,
  modified: 1508833762239,
  lastPublisher: {
    name: 'hzoo',
    email: '[email protected]',
    avatar: 'https://gravatar.com/avatar/851fb4fa7ca479bce1ae0cdf80d6e042',
    link: 'https://www.npmjs.com/~hzoo',
  },
  owners: [
    {
      email: '[email protected]',
      name: 'thejameskyle',
      avatar: 'https://gravatar.com/avatar/8a00efb48d632ae449794c094f7d5c38',
      link: 'https://www.npmjs.com/~thejameskyle',
    },
    // [...]
  ],
  lastCrawl: '2017-10-24T08:29:24.672Z',
  dependents: 3321,
  types: {
    ts: 'definitely-typed', // definitely-typed | included | false
    definitelyTyped: '@types/babel__core',
  },
  moduleTypes: ['unknown'], // esm | cjs | none | unknown
  styleTypes: ['none'], // file extensions like css, less, scss or none if no style files present
  humanDependents: '3.3k',
  changelogFilename: null, // if babel-core had a changelog, it would be the raw GitHub url here
  objectID: 'babel-core',
  _searchInternal: {
    popularName: 'babel-core',
    downloadsMagnitude: 8,
    jsDelivrPopularity: 5,
    alternativeNames: [
      // alternative versions of this name, to show up on confused searches
    ],
  },
}
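
For consumers of the index, part of the record above could be typed as follows. This is a partial sketch: the interface name `PackageRecordSketch` and the helper `describeRecord` are hypothetical, the field types are inferred from the example values only, and the nullability of `description` and `deprecatedReason` is an assumption.

```typescript
// Partial, illustrative typing of a record; only a handful of fields are covered.
interface PackageRecordSketch {
  objectID: string;
  name: string;
  version: string;
  description: string | null; // nullability is an assumption
  downloadsLast30Days: number;
  humanDownloadsLast30Days: string;
  popular: boolean;
  isDeprecated: boolean; // preferred over the legacy `deprecated` field
  deprecatedReason: string | null; // nullability is an assumption
  keywords: string[];
}

// Build a one-line summary, relying on `isDeprecated` rather than `deprecated`.
function describeRecord(record: PackageRecordSketch): string {
  const base =
    `${record.name}@${record.version} ` +
    `(${record.humanDownloadsLast30Days} downloads in 30 days)`;
  if (!record.isDeprecated) return base;
  return `${base} [deprecated: ${record.deprecatedReason ?? 'no reason given'}]`;
}
```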

Ranking

If you want to learn more about how Algolia's ranking algorithm works, you can read this blog post.

Textual relevance

Searchable Attributes

We restrict the search to a subset of the attributes:

  • _searchInternal.popularName
  • name
  • description
  • keywords
  • owner.name
  • owners.name

Prefix Search

Algolia provides prefix search by default (matching a word from its beginning only). This is disabled for the owner.name and owners.name attributes.
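
The two subsections above map onto Algolia index settings roughly as sketched below. `searchableAttributes` and `disablePrefixOnAttributes` are Algolia setting names; the interface name `IndexSettingsSketch` is hypothetical, and the project's actual settings may order or modify the attributes differently.

```typescript
// Illustrative subset of Algolia index settings, derived from the lists above.
interface IndexSettingsSketch {
  searchableAttributes: string[];
  disablePrefixOnAttributes: string[];
}

const settings: IndexSettingsSketch = {
  // Only these attributes are searched.
  searchableAttributes: [
    '_searchInternal.popularName',
    'name',
    'description',
    'keywords',
    'owner.name',
    'owners.name',
  ],
  // Owner names must match whole words, not prefixes.
  disablePrefixOnAttributes: ['owner.name', 'owners.name'],
};
```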

Typo-tolerance

Algolia provides default typo-tolerance.

Exact Boosting

Using Algolia's optionalFacetFilters feature, we boost exact matches on a package's name so that they always appear at the top of the results.
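
On the query side, exact boosting could look like the sketch below. `optionalFacetFilters` is an Algolia search parameter that takes `attribute:value` strings; the helper `buildSearchParams`, the `SearchParamsSketch` type and the choice of `name` as the faceted attribute are assumptions, and the project may filter on a different attribute. The attribute also has to be declared for faceting in the index settings for the filter to take effect.

```typescript
// Subset of Algolia search parameters used in this sketch.
interface SearchParamsSketch {
  query: string;
  optionalFacetFilters: string[];
}

// Build parameters that boost records whose facet value equals the query.
// ASSUMPTION: `name` is the faceted attribute; the real index may use another one.
function buildSearchParams(
  query: string,
  facetAttribute: string = 'name'
): SearchParamsSketch {
  const trimmed = query.trim();
  return {
    query: trimmed,
    // Optional filters do not exclude records; matching records score higher.
    optionalFacetFilters: trimmed === '' ? [] : [`${facetAttribute}:${trimmed}`],
  };
}
```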

Custom/Business relevance

Number of downloads

For each package, we use the number of downloads in the last 30 days as Algolia's customRanking setting. This is used to sort results that have the same textual relevance against each other.

For instance, a search for babel will match both babel-core and babel-messages. From a textual-relevance point of view, those two packages match in exactly the same way. In such a case, Algolia relies on the customRanking setting and therefore puts the package with the highest number of downloads in the past 30 days first.
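
The tie-breaking idea can be simulated locally. In the sketch below, `textualScore` is a hypothetical stand-in for Algolia's textual relevance criteria (the engine does not expose a single number like this), and `rankHits` is an illustrative helper, not part of any library.

```typescript
interface RankedHit {
  name: string;
  textualScore: number; // hypothetical stand-in for textual relevance
  downloadsLast30Days: number;
}

// Sort by textual relevance first; break ties with downloads, highest first.
function rankHits(hits: RankedHit[]): RankedHit[] {
  return [...hits].sort(
    (a, b) =>
      b.textualScore - a.textualScore ||
      b.downloadsLast30Days - a.downloadsLast30Days
  );
}
```

Downloads only matter when two hits tie on textual relevance, so a better textual match with fewer downloads still ranks first.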

Popular packages

A package is considered popular if it has been downloaded markedly more than others: we currently flag packages that account for more than 0.005% of all downloads on the whole registry. The popular flag is also used to boost those records over non-popular ones.
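
As a worked example of the threshold, the sketch below computes a package's share of all downloads as a percentage and compares it with 0.005. The function and constant names are hypothetical, and the sketch assumes both download counts cover the same 30-day window.

```typescript
// Threshold from the text: more than 0.005% of all registry downloads.
const POPULAR_DOWNLOADS_PERCENT = 0.005;

// Share of total downloads, expressed as a percentage.
function downloadsPercent(packageDownloads: number, totalDownloads: number): number {
  if (totalDownloads <= 0) return 0; // avoid dividing by zero on an empty registry
  return (packageDownloads / totalDownloads) * 100;
}

// A package is flagged as popular when its share exceeds the threshold.
function isPopular(packageDownloads: number, totalDownloads: number): boolean {
  return downloadsPercent(packageDownloads, totalDownloads) > POPULAR_DOWNLOADS_PERCENT;
}
```

With a hypothetical total of 1,000,000,000 downloads, a package needs more than 50,000 downloads to be flagged.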

Usage

Production

yarn
apiKey=... yarn start

Restart

To restart from a particular point (or from the beginning):

seq=0 apiKey=... yarn start

This is useful when you want to completely resync the npm registry because:

  • you changed the way you format packages
  • you added more metadata (like GitHub stars)
  • you are in an unsure state and you just want to restart everything

seq represents a change sequence in CouchDB lingo.

How does it work?

Our goal with this project is to:

  • be able to quickly do a complete rebuild
  • be resilient to failures
  • clean the package data

When the process starts with seq=0, it will:

  • save the current sequence of the npm registry in the state (Algolia settings)
  • bootstrap the initial index content by using /_all_docs
  • replicate registry changes since the current sequence
  • watch for registry changes continuously and replicate them
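
The order of these steps matters: the sequence is recorded before the bootstrap, so packages published while the bootstrap runs are picked up by the replicate step. The sketch below walks through the steps with an in-memory stand-in for the registry; `FakeRegistry`, `startFromScratch` and the `publishedDuringBootstrap` parameter are hypothetical, and the real process talks to the registry over HTTP.

```typescript
interface RegistryChange {
  seq: number;
  id: string;
}

// In-memory stand-in for the npm registry (an assumption of this sketch).
class FakeRegistry {
  docs: string[];
  changes: RegistryChange[];

  constructor(docs: string[], changes: RegistryChange[]) {
    this.docs = docs;
    this.changes = changes;
  }

  currentSeq(): number {
    const last = this.changes[this.changes.length - 1];
    return last ? last.seq : 0;
  }

  allDocs(): string[] {
    return [...this.docs];
  }

  changesSince(seq: number): RegistryChange[] {
    return this.changes.filter((c) => c.seq > seq);
  }
}

interface StartResult {
  index: Set<string>;
  watchFromSeq: number;
}

function startFromScratch(
  registry: FakeRegistry,
  publishedDuringBootstrap: RegistryChange[] = []
): StartResult {
  // 1. Save the registry's current sequence before reading any documents.
  const savedSeq = registry.currentSeq();

  // 2. Bootstrap the index from the full document list.
  const index = new Set<string>(registry.allDocs());

  // Simulate packages that are published while the bootstrap is running.
  registry.changes.push(...publishedDuringBootstrap);

  // 3. Replicate every change made since the saved sequence.
  let lastSeq = savedSeq;
  for (const change of registry.changesSince(savedSeq)) {
    index.add(change.id);
    lastSeq = change.seq;
  }

  // 4. A watcher would now follow the registry starting from `lastSeq`.
  return { index, watchFromSeq: lastSeq };
}
```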

Replicate and watch are separated because:

  1. In replicate, we want to copy a batch of documents quickly.
  2. In watch, we want new changes as fast as possible, one by one. If watch asked for batches of 100, new packages would reach the index too late.
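
The trade-off between the two modes can be seen with a small buffer that only releases changes once a batch is full. `Batcher` is a hypothetical helper written for this illustration: with a batch size of 1 every change is released immediately, while a larger batch size holds changes back until enough of them have arrived.

```typescript
// Collects items and releases them only when a full batch is available.
class Batcher<T> {
  private buffer: T[] = [];
  private readonly batchSize: number;

  constructor(batchSize: number) {
    if (batchSize < 1) throw new RangeError('batchSize must be at least 1');
    this.batchSize = batchSize;
  }

  // Returns a full batch to write to the index, or null while still buffering.
  add(item: T): T[] | null {
    this.buffer.push(item);
    if (this.buffer.length < this.batchSize) return null;
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }
}
```

A watcher built on `new Batcher<string>(1)` would forward each new package as soon as it appears, whereas `new Batcher<string>(100)` would keep the first package waiting until 99 more changes had arrived.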

Contributing

See CONTRIBUTING.md
