
Scalable and powerful subdomain-monitoring data pipeline.

License: MIT License


watchdog's Introduction


WatchDog

A powerful ETL subdomain-tracking pipeline

About The Project

Watchdog is an ETL pipeline that tracks the subdomains of specified domains in real time. Its goal is to identify new subdomains as soon as they appear and alert the user immediately. It achieves this through fast subdomain generation using multiprocessing, reliable data streaming with Kafka, flexible and scalable subdomain storage in MongoDB, advanced subdomain processing with PySpark, and workflow management and task coordination with Airflow. With the Telegram notification feature, Watchdog delivers real-time alerts, enabling a quick response to potential security threats. The project is aimed at security professionals, system administrators, and anyone who needs to monitor the subdomains of specified domains in real time.

Features

  • Efficient Subdomain Generation: Watchdog leverages multiprocessing to generate subdomains quickly and accurately, optimizing performance.
  • Real-time Streaming: The pipeline integrates Kafka to provide seamless and reliable data streaming, ensuring up-to-date information.
  • Scalable Storage: Watchdog utilizes MongoDB as its storage solution, enabling flexible and scalable management of subdomains.
  • Advanced Subdomain Processing and Security Scanning: With the power of PySpark, Watchdog efficiently processes and analyzes subdomains, allowing for sophisticated data manipulation. It also scans subdomains and resolves their associated IP addresses, giving a more complete picture of the attack surface and helping to identify potential security threats.
  • Robust Orchestration: Watchdog employs Airflow for effective workflow management and task coordination, ensuring smooth execution.
  • Telegram Notification: Watchdog can send a notification to a Telegram channel or group whenever a new subdomain is found, enabling real-time alerts and a quick response to potential security threats (a sketch of such an alert follows this list).
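
A minimal sketch of such an alert via the Telegram Bot API, using the requests library; the token, chat id, and helper name here are placeholders, not the project's actual notification code:

    import requests

    BOT_TOKEN = "<your-bot-token>"          # placeholder: token from @BotFather
    CHAT_ID = "<your-chat-or-channel-id>"   # placeholder: target channel/group

    def notify_new_subdomain(subdomain):
        # Telegram Bot API sendMessage endpoint
        url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
        requests.post(
            url,
            data={"chat_id": CHAT_ID, "text": f"New subdomain found: {subdomain}"},
            timeout=10,
        )

    notify_new_subdomain("staging.example.com")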

Built With

  • Python
  • Apache Kafka
  • Apache Airflow
  • PySpark
  • MongoDB
  • Docker & Docker Compose

(back to top)

Screenshots

  • Kafka producer: sends generated subdomains to the configured Kafka topic (a producer sketch follows this list).
    (screenshot)
  • Kafka consumer: a Spark Streaming consumer reads subdomains from the topic and stores them in MongoDB.
    (screenshot)
  • MongoDB: a snapshot of the MongoDB collection showing the subdomains tracked so far.
    (screenshot)
  • Airflow: DAG logs showing the status and progress of the ETL pipeline.
    (screenshot)
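
For reference, a minimal sketch of what the producer side might look like, using the kafka-python client with an assumed broker address and topic name (the project's actual producer may differ):

    from kafka import KafkaProducer

    # assumed broker and topic; adjust to match your docker-compose setup
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for sub in ["api.example.com", "mail.example.com"]:
        # Kafka messages are raw bytes, so encode each subdomain string
        producer.send("subdomains", sub.encode("utf-8"))

    # block until all buffered messages are actually delivered
    producer.flush()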

(back to top)

Getting Started

Follow these simple steps to get a local copy of WatchDog up and running.

Prerequisites

Before you can use this project, you'll need to have the following installed on your machine:

  • Python 3.10 or higher
  • Docker
  • Docker Compose
  • Airflow

If you don't have these installed, follow the official installation instructions for each tool.

Once you have these tools installed, you'll be ready to use this project.

Installation & Usage

  1. Clone the repo
    git clone https://github.com/AmirAflak/WatchDog.git
  2. Navigate to the project directory:
    cd WatchDog/
  3. Set targets in configs.py:
    TARGETS=['caterpillar.com', 'url.com']
  4. Install the required packages:
    make install
  5. Initialize Docker Compose:
    make docker
  6. Initialize the Spark streaming consumer:
    make consumer
  7. Initialize the Airflow scheduler:
    make scheduler
  8. Initialize the Airflow webserver GUI:
    make webserver
  9. To stop the Docker Compose containers, run:
    make stop

That's it! You should now be able to use the project.

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

watchdog's People

Contributors

  • amiraflak
  • nimafaghih

watchdog's Issues

Add timing for each task

  • Passive subdomain enumeration ⇒ every 4h
  • Name resolution ⇒ every 6h
  • DNS bruteforce ⇒ every 24h
  • Service discovery ⇒ every 8h
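
One way to schedule these intervals, sketched here under the assumption of Airflow 2.x with one DAG per task so each runs on its own cadence; the task names and callables are hypothetical, not the project's actual DAG code:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # intervals taken from the issue description
    INTERVALS = {
        "passive_subdomain_enumeration": timedelta(hours=4),
        "name_resolution": timedelta(hours=6),
        "dns_bruteforce": timedelta(hours=24),
        "service_discovery": timedelta(hours=8),
    }

    def make_dag(task_name, interval):
        # factory function so each DAG captures its own task_name
        with DAG(
            dag_id=f"watchdog_{task_name}",
            start_date=datetime(2023, 1, 1),
            schedule_interval=interval,
            catchup=False,
        ) as dag:
            PythonOperator(
                task_id=task_name,
                python_callable=lambda: print(f"running {task_name}"),  # placeholder
            )
        return dag

    # register one DAG per task at module level so Airflow discovers them
    for name, every in INTERVALS.items():
        globals()[f"dag_{name}"] = make_dag(name, every)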

Utilizing Async Instead of Multiprocessing for Subdomain Generation from Multiple Sources

The subdomain generation in the "core.py" file of the "subfinder" module currently uses multiprocessing to speed up the process. Multiprocessing is resource-intensive, however, and is not an obvious fit for what is largely I/O-bound work. This issue proposes exploring asynchronous programming instead, with the goal of reducing resource usage and improving the efficiency of subdomain generation from multiple sources. The work involves refactoring the existing code to use async and benchmarking it against the current implementation; a sketch of the proposed approach follows.
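
A minimal sketch of the async approach using asyncio and aiohttp; the source URL and function names are illustrative, not the actual subfinder internals:

    import asyncio

    import aiohttp

    # example passive source; the real module queries multiple sources
    SOURCES = [
        "https://crt.sh/?q=%25.{domain}&output=json",
    ]

    async def fetch(session, url):
        # one non-blocking HTTP request; many can be in flight at once
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return await resp.text()

    async def enumerate_domain(domain):
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, u.format(domain=domain)) for u in SOURCES]
            # gather runs all source queries concurrently on a single thread,
            # avoiding the per-process overhead of multiprocessing
            return await asyncio.gather(*tasks, return_exceptions=True)

    if __name__ == "__main__":
        results = asyncio.run(enumerate_domain("example.com"))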

Optimize Kafka Streaming Code Using More Efficient PySpark Methods

The current implementation of the Kafka streaming code in this file could be optimized to make use of more efficient PySpark methods. This issue proposes to optimize the code to improve its performance and efficiency. Specifically, the following changes could be made:

  • Use PySpark's built-in filter method instead of an if statement to remove unnecessary records from the DataFrame.
  • Use PySpark's select method to select only the necessary columns from the DataFrame, instead of converting the DataFrame to an RDD and then iterating over each row.
  • Use PySpark's withColumn method to add new columns to the DataFrame, instead of creating a new dictionary and appending it to a list.
  • Use PySpark's foreach sink instead of foreachBatch to write the processed data directly to MongoDB, rather than going through a separate batch-writing method.

These changes will make the code more efficient and easier to read and maintain.
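
A rough sketch of the suggested refactor, assuming a plain-string Kafka payload and placeholder broker, topic, and column names (the project's actual schema may differ):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # requires the spark-sql-kafka connector package on the Spark classpath
    spark = SparkSession.builder.appName("watchdog-consumer").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
        .option("subscribe", "subdomains")                    # assumed topic
        .load()
    )

    processed = (
        # select only the needed column instead of iterating over an RDD
        stream.select(F.col("value").cast("string").alias("subdomain"))
        # filter instead of per-row if statements
        .filter(F.col("subdomain").isNotNull())
        # withColumn instead of building dictionaries and appending to a list
        .withColumn("discovered_at", F.current_timestamp())
    )

    def write_row(row):
        # placeholder sink: in the real pipeline this would insert into MongoDB
        print(row.asDict())

    # foreach writes each processed row directly, replacing foreachBatch
    query = processed.writeStream.foreach(write_row).start()
    query.awaitTermination()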

Unable to Run Airflow with Docker Compose Instead of Local Environment

I am facing difficulties while attempting to run Airflow using Docker Compose instead of my local environment. I have followed the official documentation and various online resources, but I am encountering issues that prevent me from successfully setting up Airflow with Docker Compose.

Expected Behavior:
I expect Airflow to be successfully deployed and running within the Docker containers, allowing me to manage and execute my workflows.

Actual Behavior:
Instead, I am encountering the following issues:
🐛 Airflow containers fail to start or crash unexpectedly.
🐛 Airflow web server or scheduler fails to connect to the database.
🐛 Airflow UI is inaccessible or shows errors.
🐛 Airflow tasks fail to execute or get stuck in a pending state.
