dragondave / basiccrawler

This project is a fork of learningequality/basiccrawler.

Basic web crawler that automates website exploration and produces web resource trees.

License: MIT License

Python 21.20% Jupyter Notebook 78.60% CSS 0.20%

BasicCrawler

Basic web crawler that automates website exploration and produces web resource trees.

Usage

The goal of the BasicCrawler class is to help with the initial exploration of the source website. It is your responsibility to write a subclass that uses the HTML, URL structure, and content to guide the crawling and produce the web resource tree.
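
For instance, a minimal subclass might be sketched as follows. The import path and the class name `MySiteCrawler` are assumptions about the package layout, not confirmed by this README; the two attributes are the ones described in the workflow below.

```python
# Minimal sketch of a BasicCrawler subclass. The import path below is an
# assumption about the package layout; we fall back to a plain base class
# so the sketch stays importable even without basiccrawler installed.
try:
    from basiccrawler.crawler import BasicCrawler
except ImportError:
    BasicCrawler = object

class MySiteCrawler(BasicCrawler):
    # Scope the crawl to one domain and give it a starting URL:
    MAIN_SOURCE_DOMAIN = 'https://learningequality.org'
    START_PAGE = 'https://learningequality.org/'
```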

The workflow is as follows:

  1. Create your subclass

    • set the following attributes
      • MAIN_SOURCE_DOMAIN e.g. 'https://learningequality.org'
      • START_PAGE e.g. 'https://learningequality.org/'
  2. Run the crawler for the first time by calling crawler.crawl() or running it as a command-line script

  • The BasicCrawler has basic logic for visiting pages and will print out a summary of the auto-inferred site structure findings and recommendations based on the URL structure observed during the initial crawl.
  • Based on the number of times a link appears on different pages of the site, the crawler will suggest candidates for global navigation links. Most websites have an /about page, a /contact page, and other such non-content pages, which we do not want to include in the web resource tree. You should inspect these suggestions and decide which URLs should be ignored (i.e., not crawled or included in the web_resource_tree output). To ignore URLs, edit the attributes:
    • IGNORE_URLS (list of strings): the crawler will ignore these URLs
    • IGNORE_URL_PATTERNS (list of RE objects): regular expressions that do the same thing. Edit your crawler subclass' code and append to IGNORE_URLS and IGNORE_URL_PATTERNS the URLs you want to skip (anything that is not likely to contain content).
  3. Run the crawler again; this time there should be less noise in the output.
  • Note the suggestions for different paths that you might want to handle specially (e.g. /course, /lesson, /content). You can define class methods to handle each of these URL types:

     def on_course(self, url, page, context):
         # What you want the crawler to do when it visits the course at `url`
         # in the given `context` (used for extra metadata; contains a
         # reference to the parent node). The BeautifulSoup-parsed contents
         # of `url` are provided as `page`.
         pass

     def on_lesson(self, url, page, context):
         # What you want the crawler to do when it visits the lesson.
         pass

     def on_content(self, url, page, context):
         # What you want the crawler to do when it visits the content URL.
         pass
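
Putting the steps above together, a complete subclass might be sketched like this. The import path, class name, URL lists, ignore pattern, and handler bodies are illustrative assumptions, not the project's actual configuration; only the attribute names and the on_* hook signatures come from the workflow described above.

```python
import re

# Sketch only: the import path is an assumption; fall back to a plain base
# class so the example stays importable without the basiccrawler package.
try:
    from basiccrawler.crawler import BasicCrawler
except ImportError:
    BasicCrawler = object

class LearningEqualityCrawler(BasicCrawler):
    MAIN_SOURCE_DOMAIN = 'https://learningequality.org'
    START_PAGE = 'https://learningequality.org/'

    # Skip the non-content pages suggested during the first crawl
    # (these specific URLs and the pattern below are hypothetical):
    IGNORE_URLS = [
        'https://learningequality.org/about/',
        'https://learningequality.org/contact/',
    ]
    IGNORE_URL_PATTERNS = [
        re.compile(r'.*/login.*'),
    ]

    def on_course(self, url, page, context):
        # `page` is the BeautifulSoup-parsed content of `url`; `context`
        # carries extra metadata, including a reference to the parent node.
        title = page.find('h1')  # hypothetical: extract a course title
        print('course:', url, title.get_text(strip=True) if title else '')

    def on_lesson(self, url, page, context):
        print('lesson:', url)

    def on_content(self, url, page, context):
        print('content:', url)

# After editing the subclass, re-run the crawl:
#   crawler = LearningEqualityCrawler()
#   crawler.crawl()
```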
    

basiccrawler's People

Contributors: ivanistheone
