Scalable backend system designed to collect and manage unique URLs and recursively discover their corresponding sub-URL links.
The system stores each unique URL together with its raw HTML content and can handle multiple requests concurrently.
With a focus on efficiency, the system checks whether a URL address has already been crawled and skips it if so.
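Conceptually, the crawl-and-deduplicate loop looks roughly like the sketch below. This is an illustrative outline only, not the project's actual code: the collection name, document shape, and crawler options are assumptions.

```js
// Illustrative sketch of the recursive crawl with a duplicate check (not the project's actual code).
const Crawler = require('crawler'); // the node-crawler package

function startCrawl(startUrl, pagesCollection) {
  const crawler = new Crawler({
    maxConnections: 10,
    // node-crawler invokes this callback for every fetched page
    callback: async (error, res, done) => {
      if (!error) {
        const url = res.options.uri;
        // Store the unique URL together with its raw HTML content
        await pagesCollection.updateOne(
          { url },
          { $setOnInsert: { url, html: res.body } },
          { upsert: true }
        );
        // Recursively queue every sub-link found on the page
        // (relative links would need to be resolved against the page URL first)
        res.$('a[href]').each((_, a) => enqueue(res.$(a).attr('href')));
      }
      done();
    },
  });

  // Skip URLs that were already crawled, so each address is fetched only once
  async function enqueue(url) {
    const alreadyCrawled = await pagesCollection.findOne({ url });
    if (!alreadyCrawled) crawler.queue(url);
  }

  enqueue(startUrl);
}
```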
- Node.js environment with the Express library for receiving HTTP requests.
- MongoDB for storing all the data.
- Node-Crawler library for crawling (includes a built-in in-memory cache).
- Jest for automated tests.
- Swagger documentation.
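As a rough illustration of how this stack fits together, the server might be bootstrapped along the following lines. The connection URI, database name, and collection name are assumptions, not the project's actual configuration:

```js
// Minimal Express + MongoDB bootstrap sketch (all names below are assumed).
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
app.use(express.json()); // parse JSON request bodies such as { "url": "..." }

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  app.locals.pages = client.db('web-crawler').collection('pages');
  app.listen(3000, () => console.log('Listening on http://localhost:3000'));
}

main().catch(console.error);
```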
The project includes Swagger documentation covering all the API endpoints, with examples.
To view it, run the project, open a browser, and navigate to http://localhost:3000/api-docs.
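Serving the docs at that route is typically a one-liner; the sketch below assumes the swagger-ui-express package and a swagger.json spec file, which may differ from the project's actual setup:

```js
// Sketch: expose Swagger UI at /api-docs (assumes swagger-ui-express and a ./swagger.json spec).
const express = require('express');
const swaggerUi = require('swagger-ui-express');
const swaggerDocument = require('./swagger.json');

const app = express();
app.use('/api-docs', swaggerUi.serve, swaggerUi.setup(swaggerDocument));
app.listen(3000);
```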
There are 3 API endpoints (usage examples follow the list):

- POST / :
  - Description: Accepts an incoming URL and triggers the fetching, parsing, and storage process.
  - Request Body:
    { "url": "some website address" }

- GET / :
  - Description: Retrieves the list of stored URLs.

- GET /getHtmlByUrl?url=some website address :
  - Description: Retrieves the stored HTML for the given URL address.
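For example, with the server running locally on port 3000, the endpoints can be called from Node (18+) using the built-in fetch. The target URL and the response formats shown are assumptions:

```js
// Example requests against a locally running instance (http://localhost:3000 assumed).
const base = 'http://localhost:3000';

(async () => {
  // POST / : trigger crawling of a URL
  await fetch(base + '/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com' }),
  });

  // GET / : retrieve the list of stored URLs
  const urls = await (await fetch(base + '/')).json();
  console.log(urls);

  // GET /getHtmlByUrl : retrieve the stored HTML for a specific URL
  const htmlRes = await fetch(
    base + '/getHtmlByUrl?url=' + encodeURIComponent('https://example.com')
  );
  console.log(await htmlRes.text());
})();
```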
Clone the repository:
git clone https://github.com/MaorCaspi/Web-Crawler.git
Then, install all the required packages:
npm install
To start the server, run:
npm start