
Web Crawler

An exercise in web crawling with Java: the program loads request URLs from a URL list file, crawls each page, and detects whether the search term exists in its content.

Requirements and Implementation

  • Search is case-insensitive
  • Crawling should be concurrent, AND there should be no more than 20 HTTP requests in flight at any given time
  • Results should be written out to a file named result.txt
  • Avoid thread-pooling libraries. For example, in Java, don't use Executor, ThreadPoolExecutor, Celluloid, or parallel streams.
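Since thread-pool libraries are off the table, the 20-request cap can be enforced with plain threads and a counting semaphore. The sketch below is illustrative only (the class and method names are not from this project); it tracks the peak number of simultaneous workers to show that the cap holds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrencyLimitSketch {
    static final int MAX_CONCURRENT = 20;

    // Starts one plain Thread per URL, but a semaphore keeps at most
    // MAX_CONCURRENT of them doing work at the same time.
    public static int crawlAll(List<String> urls) throws InterruptedException {
        Semaphore permits = new Semaphore(MAX_CONCURRENT);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (String url : urls) {
            Thread t = new Thread(() -> {
                try {
                    permits.acquire(); // blocks while 20 workers are already running
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        // a real downloader would issue the HTTP GET for `url` here
                        inFlight.decrementAndGet();
                    } finally {
                        permits.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
        return peak.get(); // highest number of workers observed at once
    }
}
```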

The implementation is organized into the following source directories:

├─config        Configuration for the crawling job, including defaults for the maximum number of concurrent requests and the input/output files.
├─download      An HTTP downloader based on the OkHttp client. The maximum number of concurrently running requests is controlled by the HTTP client's dispatcher; it shares a blocking queue of crawled results with Crawler.java.
├─entity        Entities used by this project.
├─io            Code for loading and parsing URLs from a text file.
├─pipeline      A pipeline combining the filtering and storing steps.
│  └─handler   TermSearchHandler.java performs the case-insensitive term search; ResultStoreHandler.java writes the results to the result file.
└─worker        The job dispatcher Crawler.java controls the overall workflow.
Main.java       The main entry point of the program, including argument parsing and invocation of the crawling job.
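The dispatcher-based throttling described for the download package can be sketched as follows. This is an assumption-laden illustration (the class names and the queue wiring are hypothetical, not the project's actual code); only the OkHttp Dispatcher calls are real API:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

public class DownloaderSketch {
    // OkHttp's own dispatcher caps the number of calls running at once,
    // so no extra thread-pool library is needed on the application side.
    public static OkHttpClient buildClient(int maxConcurrentRequests) {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequests(maxConcurrentRequests);
        dispatcher.setMaxRequestsPerHost(maxConcurrentRequests);
        return new OkHttpClient.Builder().dispatcher(dispatcher).build();
    }

    // A downloader could hand each fetched body to the crawler
    // through a shared blocking queue like this one.
    public static BlockingQueue<String> sharedResults() {
        return new LinkedBlockingQueue<>();
    }
}
```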

How to run

To build an executable jar with dependencies, use the following command:

$ mvn clean assembly:assembly

Run with the --help flag, or with no arguments, to print the help message.

$ java -jar target/crawler-1.0-SNAPSHOT-jar-with-dependencies.jar
web crawler.
<term> [maxRequest] [input] [output]
term		the search term.
maxRequest	the max number of concurrent requests, default is 20.
input		the urls to be search, default is web_crawler_url_list.txt.
output		the output filename, default is result.txt.
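The term argument is treated as a regular expression and matched case-insensitively. A minimal sketch of such a check (the class and method names here are illustrative, not the project's actual code):

```java
import java.util.regex.Pattern;

public class TermSearch {
    // CASE_INSENSITIVE satisfies the case-insensitivity requirement;
    // find() returns true if the pattern occurs anywhere in the body.
    public static boolean termExists(String term, String body) {
        Pattern p = Pattern.compile(term, Pattern.CASE_INSENSITIVE);
        return p.matcher(body).find();
    }
}
```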

For example, the following command searches the URL list for a term such as link\d.*:

$ java -jar target/crawler-1.0-SNAPSHOT-jar-with-dependencies.jar 'link\d.*'
01:18:54.749 [main] INFO  c.e.c.download.HttpGetDownloader - downloading 1  http://facebook.com/
01:18:54.763 [main] INFO  c.e.c.download.HttpGetDownloader - downloading 2  http://twitter.com/
01:18:54.764 [main] INFO  c.e.c.download.HttpGetDownloader - downloading 3  http://google.com/
01:18:54.765 [main] INFO  c.e.c.download.HttpGetDownloader - downloading 4  http://youtube.com/
01:18:54.767 [main] INFO  c.e.c.download.HttpGetDownloader - downloading 5  http://wordpress.org/

...

01:20:51.916 [OkHttp http://bloglovin.com/...] WARN  c.e.c.download.HttpGetDownloader - call job 497 for http://bloglovin.com/ with failure connect timed out
01:20:54.064 [OkHttp http://globo.com/...] WARN  c.e.c.download.HttpGetDownloader - on retrieving body content, the job 387, url http://globo.com/, failed message: timeout
01:21:05.107 [OkHttp http://illinois.edu/...] WARN  c.e.c.download.HttpGetDownloader - call job 496 for http://illinois.edu/ with failure Read timed out
01:21:05.107 [main] INFO  com.example.crawler.worker.Crawler - Jobs are done.
01:21:05.107 [main] INFO  com.example.crawler.Main - Jobs done with 131197 ms.

When the program has exited, you can find the results in result.txt:

$  cat -n result.txt
     1  "Rank","URL","Term Exists"
     2  10,"http://wordpress.com/",0
     3  9,"http://linkedin.com/",0
     4  22,"http://qq.com/",0
     5  16,"http://w3.org/",0
     
     ... omitted for better readability
     
    239  245,"http://about.me/",0
    240  257,"http://ifeng.com/",0
    241  252,"http://cbc.ca/",0
    242  255,"http://shinystat.com/",1
    243  261,"http://topsy.com/",0
    244  254,"http://jugem.jp/",0
    ...
    496  409,"http://zimbio.com/",0
    497  460,"http://netlog.com/",0
    498  493,"http://unc.edu/",0
    499  466,"http://nymag.com/",0
    500  484,"http://shutterfly.com/",0
    501  496,"http://illinois.edu/",0

Contributors

jackeylu
