Myscrape

Implementation of a simple site-specific command-line scraper.

All the real work is delegated to the MetaInspector gem.

Given a URL, it fetches that page and every page it links to directly on the same site. The output is a JSON-formatted summary of each page's internal links and the assets (images and stylesheets, whether internal or external) that the page uses.
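As a rough illustration of the kind of summary described above, here is a minimal sketch that uses only the Ruby standard library. The helper names (internal_link?, page_summary) and the JSON key names are assumptions made for this example, not the gem's actual API or output format; in the real tool the link and asset lists come from MetaInspector.

```ruby
require "json"
require "uri"

# Hypothetical helper: true when `link` resolves to the same host as `base`.
def internal_link?(base, link)
  URI.join(base, link).host == URI.parse(base).host
rescue URI::Error
  false
end

# Hypothetical helper: build a per-page summary of internal links and assets.
# Assets are listed as-is, whether they are internal or external.
def page_summary(url, links:, images:, stylesheets:)
  {
    "url" => url,
    "internal_links" => links.select { |l| internal_link?(url, l) }
                             .map { |l| URI.join(url, l).to_s }
                             .uniq,
    "assets" => {
      "images" => images.uniq,
      "stylesheets" => stylesheets.uniq
    }
  }
end

# Made-up input standing in for what a page fetch would return.
summary = page_summary(
  "http://example.com/",
  links: ["/about", "http://example.com/contact", "http://other.example.org/"],
  images: ["http://cdn.example.net/logo.png"],
  stylesheets: ["/site.css"]
)

puts JSON.pretty_generate(summary)
```

Comparing hosts is only one plausible definition of "internal"; it treats www.example.com and example.com as different sites, which may or may not match what the gem does.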

Installation

git clone http://github.com/jmay/myscrape

Usage

myscrape/bin/scrape http://example.com/

To avoid pounding sites that have lots of links, the scraper pulls only 3 sub-pages by default.
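The sub-page cap could work along the lines of the sketch below. Everything here is hypothetical: crawl_one_level, the fetch callable, and the choice to keep the first links encountered are assumptions for illustration, and the stub site replaces real HTTP requests.

```ruby
DEFAULT_SUBPAGE_LIMIT = 3 # the default mentioned above

# `fetch` is a hypothetical callable: URL -> array of internal link URLs
# found on that page. The real tool would get these from MetaInspector.
def crawl_one_level(start_url, fetch:, limit: DEFAULT_SUBPAGE_LIMIT)
  links = fetch.call(start_url).uniq - [start_url]
  pages = { start_url => links }
  # Only the first `limit` sub-pages are fetched; the rest are listed
  # as links of the start page but never visited.
  links.first(limit).each do |link|
    pages[link] = fetch.call(link).uniq
  end
  pages
end

# Stub site so the example runs without network access.
site = {
  "http://example.com/" => %w[
    http://example.com/a http://example.com/b
    http://example.com/c http://example.com/d
  ],
  "http://example.com/a" => %w[http://example.com/]
}
fetch = ->(url) { site.fetch(url, []) }

pages = crawl_one_level("http://example.com/", fetch: fetch)
puts pages.keys
```

With the stub above, the start page and three of its four sub-pages are visited; passing a different limit: changes how many sub-pages are pulled.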

Comments

The prompt was as follows:

Write a web crawler in a language of your choice. It should be limited to one domain - so when crawling opusforwork.com it would crawl all pages within the opusforwork.com domain, but not follow any outside links. Given a URL, it should output a site map, showing which static assets each page depends on, and the links between pages. Choose the most appropriate data structure to store & display this site map e.g. printing it to stdout or writing it to a file.

Build this as you would something for production. Focus on code quality and write tests as appropriate. Make sure to include a README documenting how you laid out the code and why you designed it the way you did.

Web crawling/scraping is well-known territory. There's rarely a reason to build a new crawler from scratch, so I looked for a decent open-source implementation to work from. There's an excellent Python package called Scrapy, but that looked like overkill, and I'm more familiar with Ruby.

MetaInspector is much lighter than Scrapy, and it is under active development with some outside contributions.

Using an existing gem meant I could rely on the gem's own unit tests and worry only about integration testing for the specific use case described here. See spec/myscrape_spec.rb for test cases.

Some of the more recent frameworks can conceal content (such as images) from scrapers by using JavaScript lazy-loading or incremental page-loading techniques. I've not attempted to deal with these. See http://brettterpstra.com or https://meta.discourse.org for examples: images that appear on the rendered page do not appear in the HTML source retrieved by the scraper.

So the result is a simple gem that wraps MetaInspector, plus a command-line executable in bin/scrape that runs the most common case.

myscrape's People

Contributors

jmay
