
hhy5277 / awesome-web-scraper


This project forked from duyet/awesome-web-scraper


A collection of awesome web scrapers and crawlers.

License: MIT License

awesome-web-scraper's Introduction

Awesome Web Scraper


Java

  • Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
  • websphinx - Website-Specific Processors for HTML INformation eXtraction.
  • Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • crawler4j - Open source web crawler for Java that provides a simple interface for crawling the Web. With it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

  • HTTrack - Free offline browser utility (website copier) that downloads a website from the Internet to a local directory.

C#

  • ccrawler - Built with C# 3.5. It contains a simple extension, a web content categorizer, which can separate web pages according to their content.

Erlang

  • ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), Webmachine and MochiWeb.

Python

  • scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
  • gdom - gdom, DOM Traversing and Scraping using GraphQL.
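
The core step that scraping tools like these automate is pulling links out of fetched HTML. A minimal sketch of that step, using only the Python standard library, might look like the following. It is not Scrapy's or gdom's API; the names `LinkExtractor`, `extract_links`, and `SAMPLE_HTML`, and the example URLs, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links we want to follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Inlined sample page so the sketch runs without network access.
SAMPLE_HTML = (
    '<p><a href="/docs">Docs</a> '
    '<a href="https://example.org/x">X</a> '
    '<a name="top">no href</a></p>'
)

links = extract_links(SAMPLE_HTML, "https://example.com/start")
print(links)
```

A real crawler would feed the extracted links back into a queue of pages to visit; the frameworks listed here handle that scheduling, along with retries and deduplication.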

PHP

  • Goutte - Goutte, a simple PHP Web Scraper.
  • DiDOM - Simple and fast HTML parser.
  • simple_html_dom - Just a Simple HTML DOM library fork.
  • PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.

Node.js

  • puppeteer - Headless Chrome Node API https://pptr.dev.
  • Phantomjs - Scriptable Headless WebKit.
  • node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
  • node-simplecrawler - Flexible event driven crawler for node.
  • spider - Programmable spidering of web sites with node.js and jQuery.
  • slimerjs - A PhantomJS-like tool running Gecko.
  • casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
  • zombie - Insanely fast, full-stack, headless browser testing using node.js.
  • nightmare - Nightmare is a high-level browser automation library that lets you automate browser tasks. Current versions are built on Electron; early versions wrapped PhantomJS.
  • jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with node.js.
  • xray - The next web scraper. See through the <html> noise.
  • lightcrawler - Crawl a website and run it through Google Lighthouse.

Ruby

  • wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

  • gocrawl - Polite, slim and concurrent web crawler.
  • fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
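
"Polite" crawling, as mentioned for the Go crawlers above, generally means honoring a site's robots.txt rules and crawl delay. A minimal sketch of that check, using Python's standard `urllib.robotparser`, could look like this. It is not how gocrawl or fetchbot are implemented; the robots.txt content, the `example-bot` agent name, and the `build_policy` helper are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


AGENT = "example-bot"  # assumed user-agent name
policy = build_policy(ROBOTS_TXT)

# A polite crawler checks each URL before fetching it...
allowed_public = policy.can_fetch(AGENT, "https://example.com/index.html")
allowed_private = policy.can_fetch(AGENT, "https://example.com/private/data.html")

# ...and waits at least this many seconds between requests to the same host.
delay = policy.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

In a real crawler, the robots.txt text would be fetched once per host and cached, and the delay would be enforced by sleeping between requests to that host.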

License

MIT

Contributing

Please, read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.

awesome-web-scraper's People

Contributors

duyet, joehua87, vsemozhetbyt
