noorts / dlsa

🧬 Distributing Local Sequence Alignment using Volunteer Computing

License: Apache License 2.0

Go 13.68% Python 35.25% Shell 17.73% Makefile 0.39% Rust 31.44% C 1.50%
bioinformatics distributed-systems sequence-alignment smith-waterman-algorithm crowdsourcing

dlsa's Introduction

Distributing Local Sequence Alignment using Volunteer Computing

A coordinator-worker based distributed system for crowdsourced local sequence alignment.

It was developed as a lab project for the 2023/2024 Distributed Systems course at the Vrije Universiteit Amsterdam.

The key idea of the project is to enable crowdsourced local sequence alignment. This allows heterogeneous computers (e.g., a laptop or a compute cluster node) to work together to perform sequence alignment jobs for scientists (a similar idea to Folding@Home).

The project report can be found here, and the experiment archive is over at DLSA-Experiments.

Overview

The project consists of two main parts: 1) an implementation of the Smith-Waterman algorithm, and 2) a coordinator-worker architecture that is able to "intelligently" schedule and distribute the sequence alignment jobs across the pool of workers. Each heterogeneous worker individually runs a compute capacity estimation benchmark (using synthetic sequences), whose result is communicated to and used by the scheduler to distribute the work.
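For context, Smith-Waterman computes local alignments by filling a scoring matrix in which every cell extends an alignment by a match/mismatch or a gap, and the best local alignment ends at the highest-scoring cell. A minimal Python sketch of the scoring phase (linear gap penalty; it illustrates the recurrence only, not the project's optimized implementation):

def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = 1, gap: int = 1) -> int:
    """Best local alignment score (scoring phase only, no backtrace)."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]  # zero border row/column
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == target[j - 1] else -mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] - gap, H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best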

The diagram below depicts the coordinator-worker architecture. The project requires one master node and one or more worker nodes to be spun up (see instructions below). A command-line tool (see CLI below) can be used by the "User" (i.e., scientists) to submit sequence alignment jobs to the master node. The master will subsequently schedule and distribute the work across the pool of worker nodes, returning the result to the user when the work is finished.

The Architecture

Prerequisites

The project uses Python, Golang, and (nightly) Rust.

We've used the following versions in our testing. Nightly Rust is currently required for the std::simd module; once the module is stabilized, stable Rust can be used.

Dependency Version
Python 3.11.5
poetry 1.7.1
Go go1.21.4 linux/amd64
rustc rustc 1.76.0-nightly (eeff92ad3 2023-12-13)
cargo cargo 1.76.0-nightly (1aa9df1a5 2023-12-12)

Python dependencies are managed by Poetry (installation instructions). After installing Poetry, you can install the project's dependencies from the root folder using poetry install.

Note: Specific instructions for running this project on the DAS5 compute cluster can be found here.

Master

Usage

Execute poetry run python3 master/run.py to start the master node (see here for more details about poetry and virtual environments).

Optionally, navigate to http://localhost:8000/docs for the API documentation.

Testing

Run poetry run pytest master inside the root directory.

Worker

Usage

Execute go run cmd/worker/main.go "0.0.0.0:8000" to start the worker node. Golang should automatically install the required dependencies.

To limit the worker to a specific number of cores, run go run cmd/worker/main.go "0.0.0.0:8000" 4 (here, 4 cores). This is intended for experimentation only; by default, the worker uses all available cores.

If the "master node IP and port" argument is not supplied, then the worker will connect to a default master node hosted locally at 0.0.0.0:8000.

Testing

Run go test ./... inside the root directory.

Inner Workings

The worker runs in an infinite loop that tries to register with the master node every X seconds. Once registration succeeds, the worker starts sending a pulse every Y seconds to show the master it is alive, and enters an inner loop in which it asks for work every Z seconds. When it receives work from the master, it iterates through every query-target pair it was tasked to compute and runs the Smith-Waterman algorithm on each. The worker sends each pair's result to the master immediately after computing it, so that if the worker shuts down in the midst of calculations, the rest of the work can be delegated to another worker.
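In rough pseudocode, the control flow looks like the sketch below (the actual worker is written in Go; the names, intervals, and message shapes here are illustrative only):

import time

REGISTER_INTERVAL = 5   # "X" above; illustrative values
WORK_POLL_INTERVAL = 1  # "Z" above

def worker_loop(master):
    while True:                                  # keep trying to register
        if not master.register():
            time.sleep(REGISTER_INTERVAL)
            continue
        # Once registered: a pulse is sent every Y seconds (concurrently in
        # the real worker) while this loop polls for work every Z seconds.
        while master.send_pulse():
            work = master.request_work()
            if work is not None:
                for pair in work.sequences:      # each query-target pair
                    result = smith_waterman(work.queries[pair.query],
                                            work.targets[pair.target])
                    # Send each result immediately, so a crash mid-job loses
                    # at most the pair currently being computed.
                    master.send_result(pair, result)
            time.sleep(WORK_POLL_INTERVAL)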

CLI

A command-line tool has been developed that allows one to submit sequence alignment jobs.

Run poetry run python3 cli [params] to submit a job. Run it without any parameters for help.

An example use: poetry run python3 cli --query datasets/query_sequences.fasta --database datasets/target_sequences.fasta --server-url http://0.0.0.0:8000 --match-score 2 --mismatch-penalty 1 --gap-penalty 1 --top-k 5
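Both --query and --database expect standard FASTA files, e.g. (hypothetical contents of query_sequences.fasta):

>query_1
ACGTACGTGACCTGA
>query_2
TTGACCGTAAACGGT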

The alignment results are saved to the results directory: for every query sequence, a file is generated containing the best result for every target in the database file, using the same IDs as in the original files.

Synthetic Dataset Generation

To generate a synthetic query file and a target/database file, you can use the generate_synthetic_data.py script. First adjust the configuration in the script, then execute python3 ./utils/generate_synthetic_data.py; the query and target files will be saved to the current working directory.
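Under the hood, generation boils down to sampling random nucleotide strings and writing them out as FASTA; a minimal self-contained sketch of the idea (the script's actual configuration options may differ):

import random

ALPHABET = "ACGT"

def write_fasta(path: str, n_sequences: int, length: int) -> None:
    """Write n random DNA sequences of the given length to a FASTA file."""
    with open(path, "w") as f:
        for i in range(n_sequences):
            seq = "".join(random.choices(ALPHABET, k=length))
            f.write(f">synthetic_{i}\n{seq}\n")

write_fasta("query_sequences.fasta", n_sequences=5, length=100)
write_fasta("target_sequences.fasta", n_sequences=50, length=300)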

Experiments

To effortlessly run experiments on the DAS5 cluster, the run_das5_experiments.py script was created. This quick-and-dirty script automates starting the master and workers, submits a job using the CLI, and collects all results into a JSON file. See the script and DAS5.md for more information.

For detailed experiment setups, results, and plotting, see the DLSA-Experiments repository.

dlsa's People

Contributors

noorts, haraldurbjarni, niclashaderer, danielvoogsgerd, enricozeilmaker


dlsa's Issues

Tracking Issue: Interfaces

We have quite a few entities that have to communicate. Let this serve as a tracking issue for general discussion of the interfaces.

Overview of interface issues:

Configuration option passing

Adjust the client (TUI), master, and worker node such that the following configuration options are passed from the client all the way to the worker node.

Configuration options:

  • match score
  • mismatch score
  • gap (usually split into extension penalty δ and gap open penalty Δ, but fine to keep them the same in this PR)

Defaults for these could be as stated in the competition document: Match = +2, Mismatch = -1, Gap = 1. It might make sense to define these defaults only in the client (the relevant functions in the master and worker will parameterize these options). We might want to group them into a configuration object (for maintainability’s sake).
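For illustration, such a configuration object could look like the sketch below (hypothetical model and field names; the defaults are the ones stated above):

from pydantic import BaseModel

class AlignmentConfig(BaseModel):
    """Scoring options passed from the client through the master to the worker."""
    match_score: int = 2        # reward for a matching pair
    mismatch_penalty: int = 1   # penalty subtracted on a mismatch
    gap_penalty: int = 1        # single penalty covering gap open + extension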

Requirements

  • The client (TUI)
    • allows the configuration options to be passed as arguments
    • falls back to defaults in the case that some options are not specified
  • The master
    • passes the configuration options to the worker, taking into account potential job-to-work splitting
  • The worker node
    • parses and uses the configuration options inside the Smith-Waterman algorithm
    • its tests have been updated to use a default set of configuration options

Write SIMD algorithm

To Do:

  • Don't overcompute diagonal parts of the Matrix
  • Use SIMD in the diagonal parts

About using SIMD in the diagonals: SIMD can be used as soon as the matrix is LANES + 1 wide (the + 1 accounts for the leftmost zero column). This should speed up all cases where the query length is fairly high compared to the target length.
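The reason the diagonal order matters: every cell on an anti-diagonal depends only on cells from the two previous anti-diagonals, so the cells of one anti-diagonal can be computed independently and packed into SIMD lanes. A small Python sketch of the iteration order (illustration only; the actual implementation uses Rust's std::simd):

# Visit an n x m score region anti-diagonal by anti-diagonal (1-based cells;
# row/column 0 is the zero border). Cells on the same anti-diagonal have no
# dependencies on each other, which is what makes them SIMD-friendly.
n, m = 8, 8
for d in range(2, n + m + 1):                       # anti-diagonal: i + j == d
    cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
    # a SIMD kernel would process `cells` in chunks of LANES at a time
    print(d, cells)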

Master

  • Add parameters #12
  • Add logging (able to view what is happening)
  • Add new scheduler
    • Simple work split (for one-to-many and for many-to-many jobs)
    • Fancy heuristics once we have the benchmarking numbers sent by the worker

Sequences to compute

from typing import Dict, List
from pydantic import BaseModel
# SequenceId, Sequence, and TargetQueryCombination are project types defined elsewhere.

class WorkPackage(BaseModel):
    # work package id
    id: str
    # all sequences shipped with this package, keyed by id
    targets: Dict[SequenceId, Sequence]
    queries: Dict[SequenceId, Sequence]

    # the specific target-query pairs the worker should compute
    sequences: List[TargetQueryCombination]

Is the point of the sequences field only to specify which of the sequences sent to the worker it should compute?

External REST API

The component that serves a REST API. This is the component that clients (e.g., a user on a laptop) interact with from the outside. It allows a client to 1) submit a sequence alignment job request, and 2) poll for the results.

The provided job request will be parsed and converted into our internal format.
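As an illustration of that flow, a hypothetical client could look as follows (all endpoint paths and payload fields here are invented for illustration; the real routes are listed in the generated API docs at /docs):

import time
import requests

BASE = "http://0.0.0.0:8000"

# 1) Submit a job (hypothetical path and payload).
job = requests.post(f"{BASE}/job", json={"queries": "...", "targets": "..."}).json()

# 2) Poll for the result (hypothetical path) until it is ready.
while True:
    resp = requests.get(f"{BASE}/job/{job['id']}/result")
    if resp.status_code == 200:
        print(resp.json())
        break
    time.sleep(1)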

Worker - Tracker interface

The communication between the worker nodes and the tracker is the most critical, I think. Here is a thread to discuss the development and design decisions for the interface.

Interface - master

This is just a rough draft (hopefully enough of the design to show Tiziano). Feel free to comment; obviously, more specifications are needed:

The master node is responsible for receiving jobs from the job scheduler and assigning them to workers. The master keeps a list of workers, tracking whether they are available, which jobs they hold, and how long they have taken. The master node also keeps track of the health of the workers and provides recovery in case of failures. Once the master has received confirmation that there are no more jobs to be submitted and all jobs have been processed, the node sends the results to an AWS bucket.

Jobs: queue-like data structure
Results: data structure to store the results
Workers: list of available workers

Methods:

ReceiveTask()
Receives a task from the scheduler and adds it to the task queue.

RegisterWorker(workerId: String or int, metadata: Custom class)
Allows a worker to register with the master under its id.

DelegateTask(workerId: String)
Delegates a task to the worker with the given workerId.

ReportStatus(workerId: String, status: Custom class)
Receives and processes status updates from a worker node (completion, error, etc.).

SubmitResult()
Submits a result to the bucket or database.
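Rendered as a Python skeleton (a sketch of this draft only, not the project's actual implementation):

class Master:
    def __init__(self):
        self.jobs = []       # queue-like structure of pending jobs
        self.results = {}    # results, keyed by job id
        self.workers = {}    # available workers and their metadata

    def receive_task(self, task) -> None:
        """Receive a task from the scheduler and add it to the queue."""
        self.jobs.append(task)

    def register_worker(self, worker_id, metadata) -> None:
        """Register a worker with the master under its id."""
        self.workers[worker_id] = metadata

    def delegate_task(self, worker_id) -> None:
        """Hand the next queued task to the given worker."""

    def report_status(self, worker_id, status) -> None:
        """Process a status update (completion, error, etc.) from a worker."""

    def submit_result(self) -> None:
        """Push a finished result to the bucket or database."""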

Tracking: Rust

A tracking issue about all problems related to the Rust implementation

To Do

  • Convert unit tests to Rust
  • Set up benchmarking using Criterion
  • Create complete FFI bindings for Go
  • #27
  • Write a low-memory variant of the algorithm
  • Use the Rust version in the benchmarker @haraldurbjarni
  • Make Vec allocation fallible
  • Catch all panics at the FFI boundary @haraldurbjarni

Worker - Metric computation and Benchmarking

  • Metric computation
    • Time
    • (Opt) CUPS
  • Benchmarking (for performance insight and experiments)

Metric granularity can go down to the following four metrics inside the worker node:

So computation consists of: | BUILD MATRIX | BACKTRACE |
                            | 1GCUPS,2TIME |   3TIME   |
                            |       4COMBINED TIME     |

Benchmarking has two purposes:

  1. performance benchmarking for the experiments and testing in general (allows us to see which code improvements deliver practical results)
  2. compute capacity estimation, for the “intelligent” scheduler.
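For reference, CUPS (cell updates per second) measures how many matrix cells are filled per second: aligning a length-m query against a length-n target updates m × n cells. A minimal sketch (the align argument stands in for the Smith-Waterman kernel; names are illustrative):

import time

def measure_gcups(align, query: str, target: str) -> float:
    """Time one alignment and return its throughput in GCUPS."""
    start = time.perf_counter()
    align(query, target)                  # the alignment kernel under test
    elapsed = time.perf_counter() - start
    cells = len(query) * len(target)      # matrix cells updated
    return cells / elapsed / 1e9          # giga cell updates per second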

Project structure

I created a proposed project structure branch, which also includes a gRPC config. Feel free to leave a comment or change things around.

Roadmap

  • Architecture.md that can be converted into something for the report (e.g., rust analyzer)
  • #2
  • Worker node
    • Algo implementation (Daniel)
    • Node registrar interface (Daniel)
  • Registrar + Register Interface for Node (Halli & Enrico)
  • Job Scheduler (Niclas & Paul)
    • Job Data Format
    • Job interface for internal format
    • Node interface for Job assignment?
  • #4
    • Support File Types to Internal Format
  • Persistence Layer setup
