leopardslab / crawlerx

CrawlerX - An extensible, distributed, scalable crawler system: a web platform that can be used to crawl URLs over different kinds of protocols in a distributed way.

License: Apache License 2.0

JavaScript 2.84% HTML 0.21% Vue 34.47% Python 8.51% SCSS 52.31% Dockerfile 0.15% Shell 0.14% Mustache 0.17% CSS 1.20%

django-backend web-crawling mongodb-server vuejs elasticsearch message-broker firebase-auth

crawlerx's People

poornimarangoda codetheorem vinayaksh42 moulik-deepsource 2knal shivam-iitkgp yasirunet sandagomipieris ffalpha beshiniii dqsdatalabs drifterkaru dizzysilva loic-binet lapnd

crawlerx's Issues

Create Projects section in front-end to manage multiple projects per user

[Epic for GSoC 2021] Complete CrawlerX Kubernetes Deployment with Helm

Please note: View this issue after enabling ZenHub

Project - [GSoC 2021] CrawlerX - On-demand auto-scaling platform on Kubernetes
Description - This epic relates to the deployment pattern of CrawlerX on Kubernetes.
Student: sandagomipieris
Mentors: sajithaliyanage / prabushitha

Feature: Add Kubernetes artifacts for the CrawlerX project

Is your feature request related to a problem? Please describe.
This issue introduces Kubernetes artifacts for the CrawlerX project. Currently, we can only deploy the platform with Docker orchestration frameworks.

Connect ELK capability for data search function

Is your feature request related to a problem? Please describe.
$subject

Type/Improvement

[GSoC 2021] K8s artifact for Django based backend server

We need to create the following k8s artifacts for the $subject:

Deployment
Service
ConfigMap
Secret
HPA
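As a rough illustration of the first two artifacts in this list, the sketch below builds a Deployment and a Service as plain Python dictionaries and prints them as a JSON `List`, which `kubectl apply -f` accepts. The resource name `crawlerx-backend`, the image tag, port 8000, and the `-config` / `-secret` suffixes are all hypothetical placeholders, not values taken from the CrawlerX repository.

```python
import json


def build_backend_manifests(name="crawlerx-backend",
                            image="crawlerx/backend:latest",
                            port=8000,
                            replicas=1):
    """Return [Deployment, Service] manifests as dictionaries.

    Every default here (name, image, port) is an assumption for illustration.
    """
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        # Settings come from a ConfigMap and a Secret
                        # (hypothetical names) instead of the image.
                        "envFrom": [
                            {"configMapRef": {"name": f"{name}-config"}},
                            {"secretRef": {"name": f"{name}-secret"}},
                        ],
                    }]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        # Route traffic to the pods created by the Deployment above.
        "spec": {
            "selector": labels,
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    return [deployment, service]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": build_backend_manifests()}
    print(json.dumps(bundle, indent=2))
```

The ConfigMap, Secret, and HPA would be further entries in the same list; they are left out here to keep the sketch short.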

Feature: Add some new crawlers for popular web pages

Add crawl spiders for the following or other popular websites.

YouTube
Quora
Facebook
Reddit
GitHub

The currently implemented spiders can be found at https://github.com/leopardslab/CrawlerX/tree/master/scrapy_app/scrapy_app/spiders

Feature: Improve the JSON viewer inside the Job data

We can improve the JSON viewer by adding the following:

Add a syntax highlighter
Add a Copy to Clipboard button
Add some readability improvements

Create Dockerfiles for CrawlerX modules

Create Dockerfiles for the following modules:

VueJs web application
Django server
Scrapy server

[GSoC 2021] Integrate Apache Airflow to manage Crawler jobs

Is your feature request related to a problem? Please describe.

Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. It provides the following advantages:

Easy to manage workflows
Guaranteed delivery
Easy to monitor workflows
Easy to schedule Crawler jobs
Supports many message broker implementations
Container native support
Dashboard support

Missing forgot password functionality

Is your feature request related to a problem? Please describe.
The current crawlerX_app doesn't have forgot-password functionality.

Exposing web app's firebase configuration

Describe the bug
Even though Firebase is used only for authentication in CrawlerX, it is not good practice to put its configuration in a public repo. Instead, we can add an env file to the project and put all the environment variables there.

[GSoC 2021] Data export module for CrawlerX projects

Is your feature request related to a problem? Please describe.
Currently it is only possible to view data via the embedded JSON viewer. It would be great if we could export this data as a JSON or CSV file in each project.
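A minimal sketch of what such an export step could look like, using only the Python standard library. The function name `export_records` and the assumption that crawled items arrive as a list of flat dictionaries are hypothetical; the real CrawlerX data model may be nested and would then need flattening before CSV export.

```python
import csv
import io
import json


def export_records(records, fmt="json"):
    """Serialize a list of flat dicts to a JSON or CSV string."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        # Use the union of all keys, in first-seen order, as the header,
        # because crawled items do not necessarily share the same fields.
        fieldnames = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                                restval="", lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported export format: {fmt!r}")


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com", "title": "Example"},
        {"url": "https://example.org", "status": 200},
    ]
    print(export_records(sample, "csv"))
```

Fields missing from a record are written as empty CSV cells, so rows stay aligned with the header.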

[GSoC 2021] K8s artifacts for VueJS based crawlerx web application

We need to create the following k8s artifacts for the $subject:

Deployment
Service
Ingress
ConfigMap
Secret

.gitignore problem

Describe the bug
.gitignore doesn't contain the line needed to ignore the node_modules directory.

[GSoC 2021] Enable tor web URL support for CrawlerX

Is your feature request related to a problem? Please describe.
As per the current implementation, CrawlerX supports only HTTP and HTTPS URLs. This needs to be extended to support Tor browser URLs.
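One possible starting point, sketched under stated assumptions: detect Tor hidden-service URLs by their `.onion` host and pick a proxy for them. Tor itself exposes a SOCKS interface, so this sketch assumes an HTTP-to-SOCKS forwarder (for example Privoxy on its default port 8118) is running locally. The helper names and the proxy address are hypothetical; in a Scrapy spider the returned value could be assigned to `request.meta["proxy"]`, which Scrapy's built-in HttpProxyMiddleware reads.

```python
from urllib.parse import urlsplit

# Assumption: a local HTTP proxy that forwards traffic into the Tor network.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"


def is_onion_url(url):
    """Return True when the URL's host is a Tor hidden service (.onion)."""
    host = urlsplit(url).hostname or ""
    return host.lower().endswith(".onion")


def proxy_for(url):
    """Return the proxy to use for this URL, or None for a direct request."""
    return TOR_HTTP_PROXY if is_onion_url(url) else None


if __name__ == "__main__":
    for candidate in ("http://exampleonionaddress.onion/page",
                      "https://example.com/"):
        print(candidate, "->", proxy_for(candidate))
```

Checking the parsed hostname rather than the raw string avoids misclassifying ordinary URLs whose path merely ends in `.onion`.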

Create a backend REST service server with Django.

Create MongoDB based authentication mechanism

[GSoC 2021] K8s artifacts for Scrapy application

We need to create the following k8s artifacts for the $subject:

Deployment
Service
Ingress
ConfigMap
Secret
NFS provisioner

[GSoC 2021] Document the progress of the project

This issue tracks the documentation progress of the project.

[GSoC 2021] K8s artifacts for ElasticSearch server and dashboard

We need to create the following k8s artifacts for the $subject:

Deployment
Service
Ingress
ConfigMap
Secret

Create crawler Job section in the Project section to schedule multiple jobs per project

Cannot build docker - Failed to build Twisted

Hi.
I ran docker-compose up --build and got an error:

I tried installing Twisted's dependencies in the Dockerfile (with the line below) and changing the Twisted version in requirements.txt, but neither solved the problem.
RUN apt-get update && apt-get install -y gcc libc6-dev

Can you help me?
Thank you!

Create a front-end using VueJs for user managements

[GSoC 2021] Update Readme.md file with the progress

At the end of the project, the Readme.md file needs to be updated with the relevant step-by-step user guide.

[GSoC 2021] Integrate Django Celery Beat to manage and schedule Crawler jobs

Is your feature request related to a problem? Please describe.
$subject

Create a set of pre-defined Crawlers in CrawlerX

Feature: Create a new logo for the CrawlerX platform

Create a new logo for the CrawlerX platform and place it in a folder called logo in the root directory. Also, update the favicon of the web app.

Console warnings in Crawlerx App

Describe the bug
There are a few console warnings related to Firebase and Vue Router.

To Reproduce
Steps to reproduce the behavior:

Go to 'crawlerx_app'
Run the project by typing 'npm run serve'
Open Console
See warnings

Screenshots

Desktop (please complete the following information):

OS: Windows
Browser: Chrome

[GSoC 2021] K8s artifacts for Apache Airflow server

We need to create the following k8s artifacts for the $subject:

Deployment
Service
HPA
ConfigMap
Secret

New section for scheduled jobs
New section for file upload URLs
New design for cron schedule jobs

[Epic for GSoC 2021] Complete Improvements of CrawlerX web application

Please note: View this issue after enabling ZenHub

Project - [GSoC 2021] Improve CrawlerX web application
Description - This epic relates to the project to improve the CrawlerX web application.
Student: beshiniii
Mentors: sajithaliyanage / prabushitha

[GSoC 2021] K8s artifacts for MongoDB server

We need to create the following k8s artifacts for the $subject:

Deployment
Service
ConfigMap
Secret

Create Docker composer.yml files for CrawlerX platform

Is your feature request related to a problem? Please describe.
$subject. We need to create the relevant composer.yml files for the platform.

Describe the solution you'd like
Type/Improvement
