endiliey / docsearch-configs

This project forked from algolia/docsearch-configs

DocSearch - Configurations

Home Page: https://community.algolia.com/docsearch/

DocSearch configurations

This is the repository hosting the public DocSearch configurations.

DocSearch is composed of 3 different projects, of which this repository holds the configurations.

If you want to run your own DocSearch instance with these configuration files, please get familiar with the scraper setup guidelines first.

Introduction

The DocSearch scraper will use a configuration file specifying:

  • the name of the Algolia index that will store the records produced by the crawl
  • the URLs it needs to crawl
  • the URLs it shouldn't crawl
  • the (hierarchical) CSS selectors to use to extract the relevant content from your webpages
  • the CSS selectors to skip
  • an optional sitemap URL that will be crawled and then scraped
  • additional options you might provide to fine-tune the scraping
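To make the list above concrete, here is a minimal sketch of what such a configuration might look like, built in Python and serialized to JSON. The site (example.com), the index name, and every selector are hypothetical. The key names (`index_name`, `start_urls`, `stop_urls`, `sitemap_urls`, `selectors`, `selectors_exclude`) follow the format commonly used by the DocSearch scraper, but treat them as assumptions and check them against the DocSearch documentation before relying on them.

```python
import json

# Hypothetical configuration for an imaginary documentation site.
# Key names are assumed from the scraper's usual format; values are made up.
config = {
    # Algolia index that will store the resulting records
    "index_name": "example_docs",
    # URLs the scraper needs to crawl
    "start_urls": ["https://example.com/docs/"],
    # URLs the scraper should not crawl
    "stop_urls": ["https://example.com/docs/changelog"],
    # Optional sitemap that will be crawled and then scraped
    "sitemap_urls": ["https://example.com/sitemap.xml"],
    # Hierarchical CSS selectors used to extract content
    "selectors": {
        "lvl0": "article h1",
        "lvl1": "article h2",
        "lvl2": "article h3",
        "text": "article p, article li",
    },
    # CSS selectors to skip
    "selectors_exclude": [".edit-link"],
}

# Serialize to the JSON form a configuration file would contain.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Each key maps to one bullet in the list above, so a real configuration can be read the same way.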

How it works

Once you run the DocSearch scraper on a specific configuration, it will:

  • crawl all the URLs you specified (from the start_urls or the sitemap)
  • follow every hyperlink found on each page and continue crawling from there
  • stop following links once it reaches a URL that is neither specified in your configuration nor affiliated with a start URL
  • extract the content of every single crawled page following the logic you defined using the CSS selectors
  • push the resulting records to the Algolia index you configured
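The crawl steps above can be sketched as a small breadth-first traversal. This is an illustration, not the scraper's actual implementation: the `PAGES` link graph, the `is_allowed` and `crawl` helpers, and the simple prefix matching against start and stop URLs are all assumptions made for the example. The real scraper may match URLs differently, and it also extracts records and pushes them to Algolia, which is omitted here.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
PAGES = {
    "https://example.com/docs/": [
        "https://example.com/docs/install",
        "https://example.com/blog/",
        "https://other.org/",
    ],
    "https://example.com/docs/install": [
        "https://example.com/docs/",
        "https://example.com/docs/changelog",
    ],
    "https://example.com/docs/changelog": [],
    "https://example.com/blog/": [],
}


def is_allowed(url, start_urls, stop_urls):
    """Assumed rule: a URL is crawlable if it extends a start URL
    and does not extend any stop URL (plain prefix matching)."""
    if any(url.startswith(stop) for stop in stop_urls):
        return False
    return any(url.startswith(start) for start in start_urls)


def crawl(start_urls, stop_urls, pages):
    """Breadth-first crawl over the link graph; returns visited URLs in order."""
    queue = deque(start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real scraper would extract records here
        for link in pages.get(url, []):
            if link not in seen and is_allowed(link, start_urls, stop_urls):
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl(
    ["https://example.com/docs/"],
    ["https://example.com/docs/changelog"],
    PAGES,
)
print(visited)
```

In this toy graph the blog page and the external site are never visited because they do not extend a start URL, and the changelog is skipped because it matches a stop URL.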

Update: You can check the dedicated DocSearch documentation website if you need more details on how to fine-tune your configuration.
