derrick0714 / web_search_engine

web search engine projects

Home Page: dengxu.me

C++ 20.78% C 10.62% Python 11.14% PHP 19.10% JavaScript 38.02% Objective-C 0.34%

web_search_engine's Introduction

Simple Crawler

The first project for Web Search Engine

Filelist:

readme.txt
explain.txt
config.ini								program parameters' configuration
crawler.py								main program
	
core.engine.py							load config file and manage the parser queue and download queue
core.downloader.py						downloader implementation; assigns download tasks to the thread pool
core.parser.py							parser implementation; assigns parse tasks to the thread pool
core.searchgoogle.py					Google search API implementation

models.configuration.py					load all configurations from local file and remote mysql
models.html.py							the data structure maintaining the crawled page information
models.safe_dic.py						implementation of a dictionary with a lock (see the sketch below)
models.safe_queue.py					implementation of a queue with a lock
models.safe_loop_array.py				implementation of an array with a lock
models.status.py						system global variables
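
A minimal sketch of the idea behind the safe_* modules: wrap a standard container and guard every access with a threading lock. The class and method names here are illustrative, not the repository's actual API.

    # Hypothetical lock-protected dictionary, similar in spirit to models/safe_dic.py
    import threading

    class SafeDict:
        def __init__(self):
            self._lock = threading.Lock()
            self._data = {}

        def put(self, key, value):
            with self._lock:
                self._data[key] = value

        def get(self, key, default=None):
            with self._lock:
                return self._data.get(key, default)

        def contains(self, key):
            with self._lock:
                return key in self._data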

include.database_manager.py				interact with remote mysql
include.database.py						SQL executor
include.log.py							implementation of the logger
include.setting.py						read program parameters from the local configuration file
include.thread_pool.py					implementation of a thread pool (see the sketch below)
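
A rough sketch of a fixed-size thread pool such as include/thread_pool.py might provide; the actual class and method names in the repository may differ.

    # Hypothetical thread pool: worker threads pull callables from a shared queue
    import queue
    import threading

    class ThreadPool:
        def __init__(self, num_threads):
            self._tasks = queue.Queue()
            for _ in range(num_threads):
                threading.Thread(target=self._worker, daemon=True).start()

        def _worker(self):
            while True:
                func, args = self._tasks.get()   # blocks until a task is available
                try:
                    func(*args)
                finally:
                    self._tasks.task_done()

        def add_task(self, func, *args):
            self._tasks.put((func, args))

        def wait_completion(self):
            self._tasks.join()                   # block until all queued tasks finish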

strategies.bookmarkhandler.py			handle page anchors
strategies.cgihandler.py				block URLs with CGI in them
strategies.earlyvisithandler.py			block pages visited before
strategies.filetypehandler.py			decide whether a page is crawlable according to its MIME type
strategies.linksextractor.py			extract links from a downloaded page
strategies.nestlevelhandler.py			block pages that exceed a certain depth in a site
strategies.omitindex.py					omit the 'index.htm', 'main.htm', etc. part within a URL
strategies.robotexclusionrulesparser.py	a robot exclusion rules parser
strategies.robothandler.py				decide whether a page is crawlable according to robots.txt
strategies.schemehandler.py				scheme whitelist
strategies.urlextender.py				extend partial URLs

www.

Program parameters:

The config.ini file contains the runtime parameters (an example of reading them follows the list):

Downloader.Threadnum					The number of threads for downloading
Downloader.SavePath						The directory that stores the downloaded pages

Parser.Threadnum						The number of threads for parsing
Parser.Nestlevel						The maximum depth of a page in a website

seed.keywords							The search keywords
seed.result_num							The number of results returned from the Google API

Mysql.host								mysql hostname
Mysql.user								mysql username
Mysql.passwd							mysql password
Mysql.db								mysql database name
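
Assuming each dotted prefix above is an INI section (e.g. [Downloader] with a Threadnum option), these parameters could be read with Python's configparser roughly as follows; this is an illustrative sketch, not the repository's include/setting.py.

    # Hypothetical reading of config.ini
    from configparser import ConfigParser

    config = ConfigParser()
    config.read('config.ini')

    thread_num = config.getint('Downloader', 'Threadnum')   # download threads
    save_path  = config.get('Downloader', 'SavePath')       # where pages are stored
    nest_level = config.getint('Parser', 'Nestlevel')       # maximum crawl depth
    keywords   = config.get('seed', 'keywords')             # Google API search terms
    mysql_host = config.get('Mysql', 'host')                # remote mysql settings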

Design:

# Engine 
----------------------
	Engine is the main controller in the crawler.
	
		It has:
		-a downloader object and a parser object,
		-two safe queues, one each for the download and parse tasks,
		-a status object holding the global status variables for the crawler,
		-a mysql manager,
		-objects for every filter strategy; the safe dictionary inside the earlyvisithandler object actually maintains the tree structure of the visited URLs
		
	
	Once the Engine starts up:
		-it first applies the filter rules to the seeds from the Google API, and then loads the valid ones into the download queue;
		-it starts two threads, each of which keeps checking the download queue or the parse queue; once an html task is found in a queue, it is assigned to the downloader or the parser (see the sketch below);
		-the downloader returns html tasks to be parsed and pushes them into the parse queue, and the parser returns html tasks to be downloaded and pushes them into the download queue;
		-it starts a thread that keeps checking the status of the crawler and posts runtime info to the remote mysql;
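
A simplified sketch of the two dispatch threads described above; the names here are illustrative, and the real core/engine.py is organized differently.

    # Hypothetical dispatch loop: move html tasks from the safe queues to the workers
    import queue
    import threading

    download_queue = queue.Queue()   # pages waiting to be downloaded
    parse_queue = queue.Queue()      # downloaded pages waiting to be parsed

    def dispatch(task_queue, worker):
        # Keep pulling tasks and handing each one to the downloader or parser pool.
        while True:
            html_task = task_queue.get()   # blocks until a task appears
            worker(html_task)

    def start_engine(downloader, parser):
        threading.Thread(target=dispatch, args=(download_queue, downloader), daemon=True).start()
        threading.Thread(target=dispatch, args=(parse_queue, parser), daemon=True).start()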
	

# Parser
----------------------
	Parser assigns the parsing tasks passed from the engine to the thread pool object it maintains
		
	Apart from the filtering applied when loading seeds from the Google API into the download queue in the engine, we mainly apply all the filtering rules in the parser (a sketch of the filter chain follows the list):
		-robothandler				check against robot exclusion rules through robots.txt; it maintains a dictionary
									whose keys are the URLs' home sites and whose values are objects of robotexclusionrulesparser.
		-earlyvisithandler			check against URLs visited before; it maintains a dictionary
									whose keys are the md5 hash codes of the URLs and whose values are the URLs' corresponding html objects.
		-cgihandler					block URLs with CGI in them
		-bookmarkhandler			block links that are page anchors
		-filetypehandler			block URLs according to their MIME type
		-nestlevelhandler			block URLs that exceed a certain depth in a site
		-omitindex					omit the 'index.htm', 'main.htm', etc. part within a URL
		-schemehandler				block schemes outside the scheme whitelist
		-urlextender				return the complete URL
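
An illustrative sketch of running an extracted URL through such a filter chain; the handler class and method names are placeholders, not the repository's actual API.

    # Hypothetical filter chain: a URL survives only if every handler accepts it
    from urllib.parse import urlparse

    class SchemeHandler:
        # One example strategy: only whitelisted schemes pass.
        WHITELIST = ('http', 'https')

        def accept(self, url):
            return urlparse(url).scheme in self.WHITELIST

    def filter_links(links, handlers):
        return [url for url in links if all(h.accept(url) for h in handlers)]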

# Downloader
----------------------
	Downloader assigns the download tasks passed from the engine to the thread pool object it maintains;

	The timeout is set to 2 seconds; when it expires, the request is aborted with an exception;

	It saves the downloaded files to a local directory (a rough sketch of a single download task follows)
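
A rough sketch of a single download task with the 2-second timeout described above, using urllib; the repository's downloader may use a different HTTP library and file-naming scheme.

    # Hypothetical download task: fetch a page, save it, report the status code
    import hashlib
    import os
    import urllib.request

    def download(url, save_path, timeout=2):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                data = response.read()
                code = response.getcode()
        except OSError:
            return None, None   # timeout or network error: give up on this page
        filename = hashlib.md5(url.encode('utf-8')).hexdigest() + '.html'
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(data)
        return code, data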

# Html
----------------------
	It stores the information of a given URL and its corresponding page data (a sketch follows the field list):
		_url 						initial url and the extended url
		_scheme						scheme of the url
		_hostname					hostname of the url
		_md5						md5 hash code of the url
		_id							download sequence
		_depth						distance to the initial seed
		_parent						parent Html object
		_return_code				200, 404, etc
		_data						text within the page
		_data_size					size of data
		_crawled_time				download time
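
A sketch of an Html record following the field list above; the actual models/html.py may define and populate these fields differently.

    # Hypothetical Html object holding a URL and its crawled page data
    import hashlib
    from urllib.parse import urlparse

    class Html:
        def __init__(self, url, parent=None):
            parts = urlparse(url)
            self._url = url                       # the (extended) url
            self._scheme = parts.scheme
            self._hostname = parts.hostname
            self._md5 = hashlib.md5(url.encode('utf-8')).hexdigest()
            self._id = None                       # download sequence number, set later
            self._depth = 0 if parent is None else parent._depth + 1
            self._parent = parent                 # parent Html object
            self._return_code = None              # 200, 404, ...
            self._data = None                     # text within the page
            self._data_size = 0
            self._crawled_time = None             # set when the page is downloaded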

web_search_engine's People

Contributors

adam57, derrick0714


web_search_engine's Issues

multi-thread issue

Under multi-threading, there are some conditions that can block a thread, so the crawler works very inefficiently.
