- Introduction
- System Overview
- Architecture and Design Decisions
- Project Structure
- Key Packages and Dependencies
- Tooling and Development Practices
- Setup and Build Instructions
- Architectural Decisions and Justifications
- Next Steps
- Contributing
- License
## Introduction

This project is a scalable system designed to handle the execution of 1 billion graphs distributed among 1 million users. The system supports graph executions triggered by webhooks, timers, or manual inputs and provides live monitoring via WebSocket connections when users manually initiate executions.
The project is developed in Rust using the Tokio ecosystem and is deployed using Kubernetes for orchestration and scalability. The system is composed of multiple microservices, each responsible for a specific functionality, ensuring clear separation of concerns and maintainability.
## System Overview

The system consists of the following components:

- Trigger Services
  - Webhook Service
  - Timer Service
  - REST Service
- Task Queue (Message Broker)
- Graph Processor Workers
- Live Output Streaming Mechanism
- WebSocket Service
- Data Storage
- Monitoring and Autoscaling
- Security and Authentication Mechanisms
## Architecture and Design Decisions

### Webhook Service

- Functionality: Receives incoming webhook requests, validates them, and enqueues tasks for graph execution.
- Design Decisions:
  - Separation from REST Service: The webhook service is separated from the REST service because the two have different responsibilities, security requirements, and load characteristics. This separation allows for independent scaling, better security isolation, and maintainability.
  - Stateless Microservice: Enables horizontal scaling to handle high volumes of webhook triggers.
  - Decoupling via Task Queue: Ensures that incoming webhooks do not directly interact with the graph processor, preventing overload.
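The validation step can be sketched as follows. This is a minimal illustration assuming a shared-secret token carried in a request header; a production setup would more likely verify an HMAC signature over the request body, and the function name here is hypothetical:

```rust
/// Compares the presented token against the expected secret in constant
/// time, so the comparison does not leak how many leading bytes matched.
fn validate_webhook_token(presented: &str, expected: &str) -> bool {
    if presented.len() != expected.len() {
        return false;
    }
    presented
        .bytes()
        .zip(expected.bytes())
        .fold(0u8, |acc, (a, b)| acc | (a ^ b))
        == 0
}

fn main() {
    let secret = "s3cr3t-webhook-token";
    // Same length, same bytes: accepted.
    assert!(validate_webhook_token("s3cr3t-webhook-token", secret));
    // Same length, different bytes: rejected without early exit.
    assert!(!validate_webhook_token("wrong-token-attempt!", secret));
}
```

Only requests that pass this check would be serialized into a task and handed to the queue.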
### Timer Service

- Functionality: Schedules graph executions based on predefined time triggers.
- Design Decisions:
  - Distributed Scheduler: Utilizes a cron-like scheduler for reliability and scalability.
  - Enqueuing Tasks: Enqueues tasks into the task queue at the appropriate times.
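The real service would use a cron-capable scheduler such as `tokio-cron-scheduler`; the sketch below only illustrates the underlying arithmetic for the simple fixed-interval case, with an illustrative function name:

```rust
/// Seconds until the next tick of a schedule that fires every
/// `period_secs`, given `elapsed_secs` since the schedule started.
fn secs_until_next_tick(elapsed_secs: u64, period_secs: u64) -> u64 {
    assert!(period_secs > 0, "period must be positive");
    let into_period = elapsed_secs % period_secs;
    if into_period == 0 {
        0 // exactly on a boundary: fire now
    } else {
        period_secs - into_period
    }
}

fn main() {
    // 25 s into a 60 s schedule: the next tick is 35 s away.
    assert_eq!(secs_until_next_tick(25, 60), 35);
    // Exactly on a boundary: fire immediately.
    assert_eq!(secs_until_next_tick(120, 60), 0);
}
```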
### REST Service

- Functionality: Allows users to manually initiate graph executions from the frontend.
- Design Decisions:
  - Separate from Webhook Service: Due to different responsibilities and security considerations.
  - Serving Frontend Assets: Serves the frontend application and provides API endpoints.
  - Live Monitoring Flag: Enqueues tasks with a flag indicating if live monitoring is required.
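A sketch of the task message this flag lives on. The field and constructor names are illustrative; the real message would be serialized with `serde` before being published to the task queue:

```rust
#[derive(Debug)]
struct ExecutionTask {
    user_id: u64,
    graph_id: u64,
    /// Set only for manually initiated runs that need WebSocket updates.
    live_monitoring: bool,
}

impl ExecutionTask {
    /// Manual runs from the frontend always request live monitoring.
    fn manual(user_id: u64, graph_id: u64) -> Self {
        Self { user_id, graph_id, live_monitoring: true }
    }

    /// Webhook- and timer-triggered runs do not stream output.
    fn background(user_id: u64, graph_id: u64) -> Self {
        Self { user_id, graph_id, live_monitoring: false }
    }
}

fn main() {
    assert!(ExecutionTask::manual(1, 42).live_monitoring);
    assert!(!ExecutionTask::background(1, 42).live_monitoring);
}
```

Workers read this flag to decide whether to publish progress updates to the live streaming mechanism.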
### Task Queue (Message Broker)

- Purpose: Decouples trigger services from the graph processor workers, enabling scalability and load balancing.
- Design Decisions:
  - Use of Apache Kafka or RabbitMQ:
    - High Throughput: Handles a massive number of tasks efficiently.
    - Topic-Based Messaging: Organizes tasks and supports scalable consumption.
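Assuming Kafka, one way to keep a single user's executions ordered is to key messages by user id so they land on the same partition. The hash-then-modulo sketch below is illustrative; `rdkafka` performs an equivalent step internally when a message key is supplied:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a user id onto one of `partition_count` partitions, so all of a
/// user's tasks are consumed in order by the same worker stream.
fn partition_for_user(user_id: u64, partition_count: u32) -> u32 {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    (hasher.finish() % partition_count as u64) as u32
}

fn main() {
    let p = partition_for_user(12_345, 64);
    assert!(p < 64);
    // The same user always maps to the same partition.
    assert_eq!(p, partition_for_user(12_345, 64));
}
```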
### Graph Processor Workers

- Functionality: Executes graphs using all available processors on the machine.
- Design Decisions:
  - Stateless Design: Workers do not maintain internal state, enabling horizontal scaling.
  - Resource Optimization:
    - Multithreading/Multiprocessing: Utilizes all CPU cores for efficient execution.
    - Algorithm Optimization: Ensures graph processing is as efficient as possible.
  - Autoscaling Policies: Scales based on CPU usage, memory usage, and task queue length.
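The workers must execute each graph's nodes in dependency order. A minimal sketch of that step using Kahn's algorithm over an adjacency list; the real processor would use `petgraph` and fan independent nodes out across cores with `rayon`:

```rust
use std::collections::VecDeque;

/// Returns a valid execution order for `n` nodes and directed `edges`
/// (from, to), or `None` if the graph contains a cycle.
fn execution_order(n: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut indegree = vec![0usize; n];
    let mut adj = vec![Vec::new(); n];
    for &(from, to) in edges {
        adj[from].push(to);
        indegree[to] += 1;
    }
    // Start from nodes with no unmet dependencies.
    let mut queue: VecDeque<usize> = (0..n).filter(|&v| indegree[v] == 0).collect();
    let mut order = Vec::with_capacity(n);
    while let Some(v) = queue.pop_front() {
        order.push(v);
        for &next in &adj[v] {
            indegree[next] -= 1;
            if indegree[next] == 0 {
                queue.push_back(next);
            }
        }
    }
    // Fewer than n nodes emitted means a cycle blocked progress.
    if order.len() == n { Some(order) } else { None }
}

fn main() {
    // 0 -> 1 -> 2, with 0 -> 2 as a shortcut edge.
    let order = execution_order(3, &[(0, 1), (1, 2), (0, 2)]).unwrap();
    assert_eq!(order, vec![0, 1, 2]);
    // A 2-cycle has no valid order.
    assert!(execution_order(2, &[(0, 1), (1, 0)]).is_none());
}
```

Nodes that become ready at the same time are independent and can be dispatched to separate threads.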
### Live Output Streaming Mechanism

- Purpose: Provides real-time updates to users who manually initiate graph executions.
- Design Decisions:
  - Topic-Based Pub/Sub System:
    - Per-Execution Topics/Channels: Each graph execution requiring live updates has a unique topic or channel.
    - Efficient Message Routing: Ensures only relevant messages are delivered to the appropriate WebSocket consumers.
  - Implementation with Redis Pub/Sub:
    - Low Latency Messaging: Supports real-time communication.
    - Scalability: Handles a large number of topics/channels efficiently.
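The per-execution channels need a naming scheme that both the workers (publishers) and the WebSocket service (subscribers) derive the same way. The `exec:{id}:updates` format below is an assumption, not a fixed convention of the system; any unique, derivable name works:

```rust
/// Builds the Redis Pub/Sub channel name for one execution's live updates.
fn live_update_channel(execution_id: u64) -> String {
    format!("exec:{}:updates", execution_id)
}

fn main() {
    assert_eq!(live_update_channel(42), "exec:42:updates");
    // Distinct executions never share a channel.
    assert_ne!(live_update_channel(42), live_update_channel(43));
}
```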
### WebSocket Service

- Functionality: Manages WebSocket connections and forwards live updates to users.
- Design Decisions:
  - Stateless Service Instances:
    - Shared Session Storage: Uses Redis to store session data, avoiding the need for sticky sessions.
    - Horizontal Scalability: Allows any instance to handle any user connection.
  - Dynamic Subscription Management:
    - Subscribes to Relevant Topics: Only subscribes to topics/channels relevant to connected users.
    - Load Balancing Without Sticky Sessions: Enables efficient distribution of connections based on current load.
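A sketch of the in-memory bookkeeping dynamic subscription management implies: each instance tracks which of its local connections care about which channel, and subscribes on Redis only while at least one local connection wants the channel. The types and names are illustrative:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct SubscriptionRegistry {
    /// channel name -> ids of local WebSocket connections watching it
    watchers: HashMap<String, HashSet<u64>>,
}

impl SubscriptionRegistry {
    /// Returns true if this is the first local watcher, i.e. the instance
    /// should now SUBSCRIBE to the channel on Redis.
    fn add(&mut self, channel: &str, conn_id: u64) -> bool {
        let set = self.watchers.entry(channel.to_string()).or_default();
        set.insert(conn_id);
        set.len() == 1
    }

    /// Returns true if that was the last local watcher, i.e. the instance
    /// should now UNSUBSCRIBE from the channel on Redis.
    fn remove(&mut self, channel: &str, conn_id: u64) -> bool {
        if let Some(set) = self.watchers.get_mut(channel) {
            set.remove(&conn_id);
            if set.is_empty() {
                self.watchers.remove(channel);
                return true;
            }
        }
        false
    }
}

fn main() {
    let mut reg = SubscriptionRegistry::default();
    assert!(reg.add("exec:42:updates", 1)); // first watcher: subscribe
    assert!(!reg.add("exec:42:updates", 2)); // already subscribed
    assert!(!reg.remove("exec:42:updates", 1)); // one watcher left
    assert!(reg.remove("exec:42:updates", 2)); // last watcher: unsubscribe
}
```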
### Data Storage

- Functionality: Stores graph definitions, execution results, and logs.
- Design Decisions:
  - Scalable Databases:
    - Graph Definitions: Stored in databases like PostgreSQL.
    - Execution Results: Stored in databases or object storage systems (e.g., AWS S3).
  - Caching with Redis: Improves retrieval times for frequently accessed data.
  - Data Partitioning and Sharding: Distributes data across multiple databases for scalability.
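The caching decision above follows the read-through pattern. In this sketch a plain `HashMap` stands in for Redis and the database fetch is simulated; the struct and its counter exist only to make the cache-hit behavior visible:

```rust
use std::collections::HashMap;

struct GraphCache {
    cache: HashMap<u64, String>,
    db_reads: u32, // counts simulated database round-trips
}

impl GraphCache {
    fn new() -> Self {
        Self { cache: HashMap::new(), db_reads: 0 }
    }

    fn get_graph(&mut self, graph_id: u64) -> String {
        if let Some(def) = self.cache.get(&graph_id) {
            return def.clone(); // cache hit: no database access
        }
        // Cache miss: pretend to fetch the definition from PostgreSQL,
        // then populate the cache for subsequent readers.
        self.db_reads += 1;
        let def = format!("graph-definition-{}", graph_id);
        self.cache.insert(graph_id, def.clone());
        def
    }
}

fn main() {
    let mut cache = GraphCache::new();
    cache.get_graph(7);
    cache.get_graph(7);
    assert_eq!(cache.db_reads, 1); // second read served from cache
}
```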
### Monitoring and Autoscaling

- Functionality: Monitors system performance and automatically adjusts resources.
- Design Decisions:
  - Metrics Collection with Prometheus:
    - System Metrics: CPU usage, memory usage, response times.
    - Custom Metrics: Task queue length, number of active WebSocket connections.
  - Visualization with Grafana: Provides dashboards for real-time monitoring.
  - Autoscaling Policies:
    - Horizontal Pod Autoscaler (HPA): Scales microservices based on metrics.
    - Cluster Autoscaler: Adjusts the number of nodes in the Kubernetes cluster.
  - Alerting Mechanisms: Uses tools like Alertmanager to notify operators of critical issues.
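A scaling rule on the queue-length metric can be sketched as: target a fixed number of pending tasks per worker, clamped to a configured replica range. The target and bounds below are illustrative, not tuned values:

```rust
/// Desired worker replica count given the current queue length, a target
/// of `tasks_per_worker` pending tasks per replica, and min/max bounds.
fn desired_replicas(queue_len: u64, tasks_per_worker: u64, min: u32, max: u32) -> u32 {
    // Ceiling division: 1001 tasks at 100 per worker needs 11 workers.
    let wanted = (queue_len + tasks_per_worker - 1) / tasks_per_worker;
    (wanted as u32).clamp(min, max)
}

fn main() {
    assert_eq!(desired_replicas(1001, 100, 2, 50), 11);
    assert_eq!(desired_replicas(0, 100, 2, 50), 2); // never below min
    assert_eq!(desired_replicas(100_000, 100, 2, 50), 50); // capped at max
}
```

In Kubernetes this logic would live in an HPA configured against the queue-length custom metric rather than in application code.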
### Security and Authentication Mechanisms

- Functionality: Secures communication and ensures only authorized users can trigger and monitor graphs.
- Design Decisions:
  - TLS Encryption: Secures all inter-service communication.
  - Authentication Protocols:
    - JWT Tokens: Authenticates API requests.
    - Webhook Validation: Verifies signatures or tokens on incoming webhooks.
  - Authorization Controls: Ensures users can only access their own graphs and data.
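The authorization control reduces to an ownership check after the JWT has been verified: the user id from the token's claims must match the owner of the requested graph. The claim and field names here are illustrative:

```rust
struct Claims {
    /// Authenticated user id, taken from the verified JWT's subject claim.
    sub: u64,
}

struct Graph {
    owner_id: u64,
}

/// Users may only trigger and monitor their own graphs.
fn can_access(claims: &Claims, graph: &Graph) -> bool {
    claims.sub == graph.owner_id
}

fn main() {
    let graph = Graph { owner_id: 7 };
    assert!(can_access(&Claims { sub: 7 }, &graph));
    assert!(!can_access(&Claims { sub: 8 }, &graph));
}
```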
## Project Structure

The project is organized as a Rust workspace to manage multiple services (crates) efficiently, ensuring scalability, maintainability, and clear separation of concerns. Each service corresponds to a microservice in the system architecture and can be developed, tested, and deployed independently.

```toml
[workspace]
members = [
    "common",
    "services/webhook_service",
    "services/timer_service",
    "services/rest_service",
    "services/graph_processor_worker",
    "services/websocket_service",
]
```
- Location: `common/`
- Purpose: Contains shared code used across multiple services, such as data models, utilities, configuration management, and messaging helpers.
- Structure:

  ```
  common/
  ├── Cargo.toml
  └── src/
      ├── lib.rs
      ├── models/
      │   ├── mod.rs
      │   ├── user.rs
      │   ├── graph.rs
      │   └── execution.rs
      ├── utils/
      │   ├── mod.rs
      │   ├── config.rs
      │   ├── error.rs
      │   └── logging.rs
      ├── messaging/
      │   ├── mod.rs
      │   ├── kafka.rs
      │   └── redis.rs
      └── db/
          ├── mod.rs
          ├── migrations/
          │   ├── V1__init.sql
          │   ├── V2__add_indexes.sql
          │   └── ...
          ├── schema.rs
          └── models.rs
  ```
Each service is a separate binary crate within the workspace.
- Location: `services/webhook_service/`
- Purpose: Handles incoming webhook requests, validates them, and enqueues tasks.
- Structure:

  ```
  webhook_service/
  ├── Cargo.toml
  └── src/
      ├── main.rs
      ├── handlers.rs
      ├── routes.rs
      ├── models.rs
      └── utils.rs
  ```
- Location: `services/timer_service/`
- Purpose: Schedules graph executions based on predefined time triggers.
- Structure:

  ```
  timer_service/
  ├── Cargo.toml
  └── src/
      ├── main.rs
      ├── scheduler.rs
      ├── models.rs
      └── utils.rs
  ```
- Location: `services/rest_service/`
- Purpose: Provides a RESTful API for users and serves the frontend application.
- Structure:

  ```
  rest_service/
  ├── Cargo.toml
  └── src/
      ├── main.rs
      ├── handlers.rs
      ├── routes.rs
      ├── models.rs
      ├── utils.rs
      └── tests/
          ├── integration_tests.rs
          └── ...
  ```
- Location: `services/graph_processor_worker/`
- Purpose: Executes graphs and publishes live updates if required.
- Structure:

  ```
  graph_processor_worker/
  ├── Cargo.toml
  └── src/
      ├── main.rs
      ├── processor.rs
      ├── models.rs
      ├── utils.rs
      └── tests/
          ├── unit_tests.rs
          └── ...
  ```
- Location: `services/websocket_service/`
- Purpose: Manages WebSocket connections and forwards live updates.
- Structure:

  ```
  websocket_service/
  ├── Cargo.toml
  └── src/
      ├── main.rs
      ├── websocket_handlers.rs
      ├── models.rs
      └── utils.rs
  ```
- Location: `frontend/`
- Purpose: Contains the frontend application built with React and HTMX.
- Structure:

  ```
  frontend/
  ├── package.json
  ├── webpack.config.js
  ├── public/
  │   ├── index.html
  │   └── ...
  ├── src/
  │   ├── index.jsx
  │   ├── components/
  │   ├── styles/
  │   └── ...
  └── ...
  ```
- Scripts: Located in the `scripts/` directory.
  - `build.sh`: Script to build all services.
  - `deploy.sh`: Script to deploy services to Kubernetes.
- Configurations: Global configurations are stored in the `configs/` directory.
  - `config.toml`: Configuration file for the application.
- Dockerfiles: Located in the `docker/` directory.
  - `Dockerfile.service_name`: Dockerfile for each service.
- Kubernetes Manifests: Located in the `k8s/` directory.
  - `service_name_deployment.yaml`: Deployment manifest for each service.
## Key Packages and Dependencies

The project utilizes a range of Rust crates, selected based on community recommendations to ensure robustness, performance, and maintainability.
### Shared Dependencies

- Serialization and Deserialization: `serde`, `serde_json`
- Asynchronous Runtime: `tokio`
- Logging and Diagnostics: `tracing`
- Configuration Management: `config`
- Error Handling:
  - Libraries: `thiserror` (for the common crate)
  - Applications: `anyhow` (for application crates)
- Utilities:
  - Lazy Initialization: `once_cell`
  - Regular Expressions: `regex`
  - UUID Generation: `uuid`
- Database Interaction: `sqlx` (supports PostgreSQL)
- Testing:
  - Asynchronous Testing: `tokio-test`
  - Snapshot Testing: `insta`

### Webhook Service

- Web Framework: `axum`
- Kafka Client: `rdkafka`
- JWT Authentication: `jsonwebtoken`

### Timer Service

- Scheduling: `tokio-cron-scheduler`
- Date and Time Handling: `chrono`

### REST Service

- Web Framework: `axum`
- Static File Serving: `tower-http::services::ServeDir`
- CORS Handling: `tower-http::cors::CorsLayer`
- Session Management: `redis`

### Graph Processor Worker

- Graph Algorithms: `petgraph`
- Parallelism: `rayon`
- AWS S3 Interaction: `aws-sdk-s3`

### WebSocket Service

- WebSocket Support: `tokio-tungstenite`
- Session Management and Pub/Sub: `redis`
## Tooling and Development Practices

- Linting: `clippy` is integrated into the development workflow for linting and catching common mistakes.
- Formatting: `rustfmt` is used to enforce code style consistency across the project.
- Dependency Management: `cargo-edit` is used for efficient management of dependencies.
- Security Auditing: `cargo-audit` is run regularly to check for vulnerabilities in dependencies.
- Testing Framework: `cargo-nextest` is used for faster and more efficient test execution.
- Benchmarking: `criterion` is used for performance benchmarking to monitor and optimize performance.
- Continuous Integration: Incorporates the above tools into the CI/CD pipeline to ensure code quality and security.
## Setup and Build Instructions

### Prerequisites

- Rust and Cargo: Install from rustup.rs
- Node.js and npm: Install from nodejs.org
### Steps

1. Clone the Repository

   ```bash
   git clone https://github.com/yourusername/yourrepository.git
   cd yourrepository
   ```

2. Install Rust Dependencies

   Navigate to the project root and build the workspace:

   ```bash
   cargo build
   ```

3. Set Up the Frontend

   Navigate to the `frontend/` directory and install the frontend dependencies, including React, HTMX, and the other required packages:

   ```bash
   cd frontend
   npm install
   ```

4. Configure Environment Variables

   Copy `configs/config.toml.example` to `configs/config.toml` and modify it according to your environment.
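As a rough illustration, the copied `configs/config.toml` might contain sections like the following; every key, section name, and value here is a hypothetical placeholder rather than the project's committed schema, and should match whatever the `config` crate actually loads:

```toml
# Illustrative keys only -- adjust to your environment.
[database]
url = "postgres://user:password@localhost:5432/graphs"

[kafka]
brokers = "localhost:9092"
task_topic = "graph-execution-tasks"

[redis]
url = "redis://localhost:6379"
```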
5. Run Services

   Each service can be run individually. For example:

   ```bash
   cargo run --bin rest_service
   ```
## Architectural Decisions and Justifications

1. Separation of Services
   - Decision: Separate the webhook service from the REST service.
   - Justification:
     - Scalability: Allows independent scaling based on specific load patterns.
     - Security: Isolates external webhook handling from user-facing APIs.
     - Maintainability: Clear separation of concerns.

2. Use of Message Queues
   - Decision: Use message queues (e.g., Kafka) to decouple trigger services from graph processor workers.
   - Justification:
     - Scalability: Enables independent scaling of producers and consumers.
     - Load Balancing: Distributes tasks evenly among workers.
     - Resilience: Buffers tasks during spikes, preventing overload.

3. Stateless Microservices
   - Decision: Design services to be stateless where possible.
   - Justification:
     - Scalability: Easier to scale horizontally.
     - Fault Tolerance: Failure of one instance doesn't affect others.

4. Efficient Live Update Mechanisms
   - Decision: Implement topic-based Pub/Sub for live updates.
   - Justification:
     - Relevance: Users receive only pertinent updates.
     - Performance: Reduces unnecessary processing.

5. Database Interaction
   - Decision: Use `sqlx` for database interactions with compile-time query checking.
   - Justification:
     - Safety: Prevents SQL injection and runtime errors.
     - Asynchronous Support: Integrates well with Tokio.

6. Frontend Integration
   - Decision: Use React and HTMX for the frontend, served by the REST service.
   - Justification:
     - Modern UI/UX: Leverages popular frontend technologies.
     - Simplified Deployment: Serving frontend assets from the REST service simplifies the architecture.

7. Security Practices
   - Decision: Implement JWT authentication and secure communication.
   - Justification:
     - Data Protection: Ensures only authorized access.
     - Compliance: Meets industry standards.

8. Tooling and Best Practices
   - Decision: Use community-recommended tooling for development and CI/CD.
   - Justification:
     - Code Quality: Tools like `clippy` and `rustfmt` ensure high code quality.
     - Security: `cargo-audit` helps identify vulnerabilities.
## Next Steps

1. Development:
   - Implement core functionalities for each service.
   - Develop shared models and utilities in the common crate.
   - Build out the frontend application.

2. Testing:
   - Write unit and integration tests using `cargo-nextest` and `insta`.
   - Set up test databases for testing database interactions.

3. Deployment:
   - Configure Dockerfiles and Kubernetes manifests with environment-specific details.
   - Implement the `build.sh` and `deploy.sh` scripts.

4. Monitoring and Optimization:
   - Integrate monitoring tools like Prometheus and Grafana.
   - Optimize performance based on collected metrics.

5. Documentation:
   - Document APIs and services.
   - Provide usage examples and API references.