Scalable backend system designed to collect and manage unique URLs and recursively discover their corresponding sub-URL links.
The system stores each unique URL together with its raw HTML content and can handle multiple requests concurrently.
With a focus on efficiency, the system checks whether a URL address has already been crawled and skips it if so.
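Conceptually, the crawl-and-deduplicate loop looks roughly like the sketch below. This is an illustrative outline only, not the project's actual code: the collection name, document shape, and crawler options are assumptions.

```js
// Illustrative sketch of the recursive crawl with a duplicate check (not the project's actual code).
const Crawler = require('crawler'); // the node-crawler package

function startCrawl(startUrl, pagesCollection) {
  const crawler = new Crawler({
    maxConnections: 10,
    // node-crawler invokes this callback for every fetched page
    callback: async (error, res, done) => {
      if (!error) {
        const url = res.options.uri;
        // Store the unique URL together with its raw HTML content
        await pagesCollection.updateOne(
          { url },
          { $setOnInsert: { url, html: res.body } },
          { upsert: true }
        );
        // Recursively queue every sub-link found on the page
        // (relative links would need to be resolved against the page URL first)
        res.$('a[href]').each((_, a) => enqueue(res.$(a).attr('href')));
      }
      done();
    },
  });

  // Skip URLs that were already crawled, so each address is fetched only once
  async function enqueue(url) {
    const alreadyCrawled = await pagesCollection.findOne({ url });
    if (!alreadyCrawled) crawler.queue(url);
  }

  enqueue(startUrl);
}
```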
- Node.js environment with the Express library for receiving HTTP requests.
- MongoDB for storing all the data.
- Node-Crawler library for crawling (includes a built-in in-memory cache).
- Jest for automated tests.
- Swagger documentation.
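As a rough illustration of how this stack fits together, the server might be bootstrapped along the following lines. The connection URI, database name, and collection name are assumptions, not the project's actual configuration:

```js
// Minimal Express + MongoDB bootstrap sketch (all names below are assumed).
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
app.use(express.json()); // parse JSON request bodies such as { "url": "..." }

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  app.locals.pages = client.db('web-crawler').collection('pages');
  app.listen(3000, () => console.log('Listening on http://localhost:3000'));
}

main().catch(console.error);
```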
The project includes Swagger documentation covering all the API endpoints, with examples.
To view it, run the project, open a browser, and navigate to http://localhost:3000/api-docs.
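Serving the docs at that route is typically a one-liner; the sketch below assumes the swagger-ui-express package and a swagger.json spec file, which may differ from the project's actual setup:

```js
// Sketch: expose Swagger UI at /api-docs (assumes swagger-ui-express and a ./swagger.json spec).
const express = require('express');
const swaggerUi = require('swagger-ui-express');
const swaggerDocument = require('./swagger.json');

const app = express();
app.use('/api-docs', swaggerUi.serve, swaggerUi.setup(swaggerDocument));
app.listen(3000);
```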
There are 3 API endpoints (usage examples follow the list):

- POST / :
  - Description: Accepts an incoming URL and triggers the fetching, parsing, and storage process.
  - Request Body:
    { "url": "some website address" }

- GET / :
  - Description: Retrieves the list of stored URLs.

- GET /getHtmlByUrl?url=some website address :
  - Description: Retrieves the stored HTML for the given URL address.
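For example, with the server running locally on port 3000, the endpoints can be called from Node (18+) using the built-in fetch. The target URL and the response formats shown are assumptions:

```js
// Example requests against a locally running instance (http://localhost:3000 assumed).
const base = 'http://localhost:3000';

(async () => {
  // POST / : trigger crawling of a URL
  await fetch(base + '/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com' }),
  });

  // GET / : retrieve the list of stored URLs
  const urls = await (await fetch(base + '/')).json();
  console.log(urls);

  // GET /getHtmlByUrl : retrieve the stored HTML for a specific URL
  const htmlRes = await fetch(
    base + '/getHtmlByUrl?url=' + encodeURIComponent('https://example.com')
  );
  console.log(await htmlRes.text());
})();
```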
Clone the repository:
git clone https://github.com/MaorCaspi/Web-Crawler.git
Then, install all the required packages:
npm install
To start the server, run:
npm start