
DocSearch scraper

This repository contains the scraper for the DocSearch project. You can run it on your own, or submit a request to have your documentation crawled.

DocSearch is composed of 3 different projects, of which this scraper is one.

This project is a collection of submodules, each one in its own directory:

  • cli: A command-line tool to manage DocSearch. Run ./docsearch and follow the steps.
  • deployer: Tool used by Algolia to deploy the configurations on our Mesos infrastructure.
  • playground: An HTML page to easily test DocSearch indices.
  • scraper: The core of the scraper. It reads the configuration file, fetches the web pages, and indexes them in Algolia.

Update: you can check the dedicated DocSearch documentation website at https://community.algolia.com/docsearch/

Getting started

Install DocSearch

The DocSearch scraper is based on Scrapy, a popular Python web-crawling framework. Because some of the pages it crawls need JavaScript to render, the scraper also depends on Selenium.

To ease the setup process, a Docker container is provided to help you run the scraper.
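
For illustration only, a bare-bones Scrapy spider looks like the sketch below. The class name, start URL, and CSS selector are hypothetical, and the real scraper is much more involved (configuration-driven selectors, Selenium rendering, Algolia indexing):

import scrapy

class DocsSpider(scrapy.Spider):
    # Hypothetical example spider, not the scraper's actual code.
    name = "docs"
    start_urls = ["https://www.example.com/docs/"]  # placeholder URL

    def parse(self, response):
        # Yield one record per section heading; the real scraper reads
        # its CSS selectors from the crawler config file instead.
        for heading in response.css("h2::text").getall():
            yield {"title": heading, "url": response.url}

You could run such a spider directly with scrapy runspider, but for DocSearch you only ever interact with the ./docsearch CLI described below.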

Environment

  • Install Python & pip
    • brew install python # will install pip
    • apt-get install python
    • Or any other way
  • git clone git@github.com:algolia/docsearch-scraper.git
  • cd docsearch-scraper
  • pip install --user -r requirements.txt

With Docker

  • Build the underlying Docker image: ./docsearch docker:build

Configure DocSearch

You need to create an Algolia account to get the APPLICATION_ID and admin API_KEY credentials the scraper will use to create the underlying indices.

Create a file named .env at the root of the project containing the following keys:

APPLICATION_ID=
API_KEY=

And run the CLI to see the available commands:

$ ./docsearch
DocSearch CLI

Usage:
  ./docsearch command [options] [arguments]

Options:
  --help    Display help message

Available commands:
 bootstrap              Bootstrap a docsearch config
 run                    Run a config
 playground             Launch the playground
 docker
  docker:build          Build the scraper images (dev, prod, test)
  docker:run            Run a config using docker
 test                   Run tests

Use DocSearch

Create a config

To use DocSearch, the first thing you need is to create a crawler config. For more details about configs, check out the dedicated configurations repo: it lists the options you can use and contains a lot of live, working examples.
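
As a sketch, a crawler config is a JSON file along these lines; the index name, start URL, and selectors below are placeholders, and the configurations repo documents the actual options:

{
  "index_name": "example",
  "start_urls": ["https://www.example.com/docs/"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p"
  }
}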

Crawl the website

With Docker:

$ ./docsearch docker:run /path/to/your/config

Without Docker:

$ ./docsearch run /path/to/your/config

Try it with our playground

You can open the included Playground to test your DocSearch index.

$ ./docsearch playground

Enter your credentials and the index name mentioned in the crawler config file, and try the search!

Integrate DocSearch into your website

To add the DocSearch dropdown menu to your website, include the following snippet:

<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.css" />
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/docsearch.js@2/dist/cdn/docsearch.min.js"></script>
<script>
  var search = docsearch({
    apiKey: '<API_KEY>', // use a SEARCH-ONLY api key here
    indexName: '<INDEX_NAME>',
    inputSelector: '<YOUR_INPUT_DOM_SELECTOR>',
    debug: false // set to `true` if you want to inspect the dropdown menu's CSS
  });
</script>

And you are good to go!
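
Note that inputSelector must match a search input that already exists in your page's DOM. For example, with the hypothetical markup below, you would pass inputSelector: '#search-input':

<input type="search" id="search-input" placeholder="Search the docs" />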

Specify appId

If you are running the scraper on your own, you will need to tell the widget about your Algolia application ID via the appId parameter.

  var search = docsearch({
    appId: '<APP_ID>', // the application ID containing your DocSearch data
    ... // other parameters as above
  });

If Algolia is handling the crawling of your site, you do not need to specify appId.
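
Putting the pieces together, a self-hosted integration might look like this sketch, where every value is a placeholder you need to replace:

  var search = docsearch({
    appId: '<APP_ID>',
    apiKey: '<SEARCH_ONLY_API_KEY>', // search-only key, never the admin key
    indexName: '<INDEX_NAME>',
    inputSelector: '#search-input'
  });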

Admin tasks

If you are an Algolia employee and want to manage a DocSearch account, you'll need to add the following variables to your .env file:

WEBSITE_USERNAME=
WEBSITE_PASSWORD=
SLACK_HOOK=
SCHEDULER_USERNAME=
SCHEDULER_PASSWORD=
DEPLOY_KEY=

The CLI will then expose more commands for you to run.

For some actions, like deploying, you might need to use credentials different from the ones in the .env file. To do so, override them when running the CLI tool:

APPLICATION_ID= API_KEY= ./docsearch deploy:configs

Run the tests

With Docker

$ ./docsearch test

Without Docker

$ pip install pytest 
$ API_KEY='test' APPLICATION_ID='test' python -m pytest
