diastro / zeek
Python distributed web scraper and dynamic crawler
License: MIT License
Dynamic reload of imported module from server
Server first sends a configuration packet to all of its clients so that they share the same crawling settings.
Server will be able to specify one of three crawling types: crawl a specific list of URLs (static), crawl indefinitely (dynamic), or crawl a specific domain.
Server will inform node of the crawling type when sending the config info.
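A minimal sketch of what such a configuration packet could look like, assuming JSON encoding. The class name `ConfigPacket`, its fields, and the three type strings are hypothetical and are not taken from the Zeek source.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

# Hypothetical names for the three crawling types described above.
CRAWLING_TYPES = ("static", "dynamic", "domain")


@dataclass
class ConfigPacket:
    """Settings the server would send to every client before crawling starts."""
    crawling_type: str
    urls: List[str] = field(default_factory=list)  # only meaningful for "static"
    request_timeout: float = 10.0

    def __post_init__(self):
        if self.crawling_type not in CRAWLING_TYPES:
            raise ValueError(f"unknown crawling type: {self.crawling_type!r}")


def encode_config(config: ConfigPacket) -> bytes:
    """Serialize the packet to bytes for sending over a socket."""
    return json.dumps(asdict(config)).encode("utf-8")


def decode_config(raw: bytes) -> ConfigPacket:
    """Rebuild the packet on the client side."""
    return ConfigPacket(**json.loads(raw.decode("utf-8")))
```

Because the client rebuilds the same object, every node validates the crawling type in the same way before it starts working.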
Deployment of rule.py and scrapping.py to working nodes
URL dispatching and sorting (server)
Software requirements specification (SRS)
Handle :
MongoDB
User manual - Deployment guide
Class structure containing all relevant data for a visited site:
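One possible shape for such a class, sketched as a dataclass. The name `VisitedSite` and all of its fields are assumptions made for illustration; the original issue does not list the fields.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class VisitedSite:
    """Data a working node might report back for one visited URL."""
    url: str
    status_code: Optional[int] = None               # HTTP status, if a response arrived
    links: List[str] = field(default_factory=list)  # URLs found on the page
    error: Optional[str] = None                     # error text when the request failed

    @property
    def ok(self) -> bool:
        """True when the request succeeded with a 2xx status."""
        return (
            self.error is None
            and self.status_code is not None
            and 200 <= self.status_code < 300
        )
```

`asdict(site)` turns an instance into a plain dictionary, which is convenient for serialization or for storing the record in a database.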
Socket communication wasn't using delimiters, which caused the ServerSideClient to block on a read even though it had received more than a full packet. This happened only in rare cases, since I test Zeek by running multiple Python processes on the same server.
Fix:
Added a delimiter. After receiving data, we check whether there is a delimiter in the buffer. If so, we take the complete packet that precedes it.
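The fix can be sketched roughly as follows. The delimiter value and the function name `extract_packets` are hypothetical; the actual delimiter used by Zeek is not stated here.

```python
from typing import List, Tuple

# Hypothetical delimiter; it must never appear inside a packet body.
DELIMITER = b"\n\n"


def extract_packets(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split every complete packet off the front of the receive buffer.

    Returns the complete packets and the leftover bytes, which belong to a
    packet that has not fully arrived yet.
    """
    packets = []
    while True:
        index = buffer.find(DELIMITER)
        if index == -1:
            break  # no complete packet left in the buffer
        packets.append(buffer[:index])
        buffer = buffer[index + len(DELIMITER):]
    return packets, buffer
```

A receive loop would append each `recv()` result to the leftover bytes and call `extract_packets` again, so a read that delivers one and a half packets no longer leaves the reader waiting for data it already has.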
Dispatching of URLs:
ServerSide
Client:
rule.py and scrapping.py are not being reloaded from client.py but from the original imported file.
Enhancement needed: reload them from client.py.
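One way a client could load a fresh copy of a module from source text it has received is sketched below. This is an assumption about how the enhancement might work, not the project's implementation. The function name `load_module_from_source` is hypothetical.

```python
import sys
import types


def load_module_from_source(name: str, source: str) -> types.ModuleType:
    """Build a module object from source text and register it under `name`.

    Calling this again with new source replaces the previous version, so
    later lookups through sys.modules see the updated code.
    WARNING: this executes arbitrary code; only use it with a trusted server.
    """
    module = types.ModuleType(name)
    code = compile(source, f"<{name} from server>", "exec")
    exec(code, module.__dict__)
    sys.modules[name] = module
    return module
```

Code that holds a reference to the old module object keeps the old behaviour, so callers should use the returned module or look it up again after each reload.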
Read all the URLs in a webpage
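A minimal sketch of link extraction using only the standard library. The names `LinkExtractor` and `extract_links` are hypothetical, and only `<a href>` links are collected.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> List[str]:
    """Return all links found in the given HTML text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```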
Review of all the documentation :
Requirements traceability
Communication protocol
Hello
Please, I am lost on how to view the output of the scraped URLs.
Architecture overview
Collect stats
Repro :
Fix :
Handle Ctrl-C and force close all sockets before exiting
exe_type needs to be formatted to remove < >
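A sketch of one way to do this, assuming the main loop is a blocking callable and the open sockets are tracked in a list. The names `close_all` and `run` are hypothetical.

```python
import socket
from typing import Callable, Iterable


def close_all(sockets: Iterable[socket.socket]) -> None:
    """Shut down and close every socket, ignoring ones that are already dead."""
    for sock in sockets:
        try:
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # socket was never connected or is already closed
        sock.close()


def run(serve_forever: Callable[[], None], sockets: Iterable[socket.socket]) -> None:
    """Run the main loop and always release the sockets on the way out."""
    try:
        serve_forever()
    except KeyboardInterrupt:
        pass  # Ctrl-C: fall through to the cleanup below
    finally:
        close_all(sockets)
```

Putting the cleanup in `finally` means the sockets are also released when the loop ends for any other reason, not only on Ctrl-C.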
Add proper error handling throughout the project
Make the size of the URL queue a setting in the configuration file
Message targeted to a specific client
Creation of the object class that will be passed between the server node and the working nodes
Each connection has a different Thread (Server-side).
Fix queue size.
Each ServerSide client will have a fixed number of URLs dispatched to it (e.g. 20). Every time a client replies after visiting a site, a new URL will be sent.
Protocol needs to be able to send a list of URLs and client needs to be able to interpret this list.
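The fixed-size dispatching described above can be sketched as follows. The class `UrlDispatcher` and its method names are hypothetical; the window of 20 comes from the example in the text.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class UrlDispatcher:
    """Keeps at most `window` URLs in flight per client."""

    def __init__(self, urls: Iterable[str], window: int = 20):
        self.pending = deque(urls)
        self.window = window
        self.in_flight: Dict[str, Set[str]] = {}

    def register(self, client_id: str) -> List[str]:
        """Called when a client connects; returns its first batch of URLs."""
        self.in_flight[client_id] = set()
        return self._refill(client_id)

    def on_reply(self, client_id: str, visited_url: str,
                 discovered: Iterable[str] = ()) -> List[str]:
        """Called when a client reports a visited URL; returns new URLs to send."""
        self.in_flight[client_id].discard(visited_url)
        self.pending.extend(discovered)
        return self._refill(client_id)

    def _refill(self, client_id: str) -> List[str]:
        assigned = self.in_flight[client_id]
        batch = []
        while self.pending and len(assigned) < self.window:
            url = self.pending.popleft()
            if url in assigned:
                continue  # already in flight for this client
            assigned.add(url)
            batch.append(url)
        return batch
```

Each reply frees one slot, so a client normally receives one new URL per reply and the per-client queue never grows beyond the configured size.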
Project "retrospective" report
Create rules to prevent the scraper from visiting certain URLs:
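A sketch of what such rules could look like. The blocked domains and file extensions below are made-up examples, and `can_visit` is a hypothetical name, not the interface of rule.py.

```python
from urllib.parse import urlparse

# Hypothetical rule sets, for illustration only.
BLOCKED_DOMAINS = {"example.org"}
BLOCKED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip")


def can_visit(url: str) -> bool:
    """Return False for URLs the crawler should skip."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, ftp:, ...
    host = (parsed.hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False  # blocked domain or one of its subdomains
    if parsed.path.lower().endswith(BLOCKED_EXTENSIONS):
        return False  # binary content that is not worth parsing
    return True
```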
Working node needs to be able to scrape URLs from a web page and return the list of scraped URLs to the server.
In the event that a URL request isn't successful, the working node needs to inform the server.
Not only read back the error code, but catch the HTTPError exception too
Example:
HTTPError: HTTP Error 404: Not Found
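A sketch of a fetch helper that reports failures instead of raising, using `urllib` from the standard library. The function name `fetch`, the result dictionary layout, and the injectable `opener` parameter are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch a URL and return a result dict that can be reported to the server."""
    try:
        with opener(url, timeout=timeout) as response:
            return {"url": url, "ok": True, "status": response.status,
                    "body": response.read(), "error": None}
    except urllib.error.HTTPError as exc:
        # The server answered with an error status such as 404.
        # HTTPError must be caught first because it subclasses URLError.
        return {"url": url, "ok": False, "status": exc.code,
                "body": b"", "error": str(exc)}
    except urllib.error.URLError as exc:
        # No HTTP response at all: DNS failure, refused connection, ...
        return {"url": url, "ok": False, "status": None,
                "body": b"", "error": str(exc.reason)}
```

With this shape the working node always has something to send back: a status code when there is one, and the exception text in either failure case.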
A different action will need to be taken depending on the type of packet received
Output results to a file (CSV)
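A minimal sketch of CSV output with the standard `csv` module. The column names and the helper names `write_results` and `results_to_csv_text` are hypothetical.

```python
import csv
import io

# Hypothetical column layout for one row per visited URL.
COLUMNS = ["url", "status", "links_found"]


def write_results(rows, destination):
    """Write result dicts as CSV rows to any file-like object."""
    writer = csv.DictWriter(destination, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def results_to_csv_text(rows):
    """Convenience wrapper that returns the CSV output as a string."""
    buffer = io.StringIO()
    write_results(rows, buffer)
    return buffer.getvalue()
```

To write to disk, open the file with `open(path, "w", newline="")` and pass it to `write_results`; the `newline=""` argument is what the `csv` module documentation recommends to avoid doubled line endings.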
Application will be able to parse a configuration file and use the needed values from it.
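A sketch of configuration parsing with `configparser`, assuming an INI-style file. The section names, option names, and defaults below are all hypothetical.

```python
import configparser


def load_settings(text: str) -> dict:
    """Parse INI-style configuration text and return the values the app needs.

    Missing sections or options fall back to the defaults given here.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "host": parser.get("server", "host", fallback="127.0.0.1"),
        "port": parser.getint("server", "port", fallback=5050),
        "crawling_type": parser.get("crawl", "type", fallback="dynamic"),
        "queue_size": parser.getint("crawl", "queue_size", fallback=20),
    }
```

For a file on disk, read its contents and pass them to `load_settings`. A per-client queue size could be exposed the same way as `queue_size` here.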