
cazystack's Introduction

Welcome to CazyStack

The stackable CAZyme analysis toolkit

Created By TurboDog


**Objective:** Create a comprehensive toolkit for scalable, reproducible, and efficient analysis of CAZyme data that can later be plugged into a larger network of data tools.


**Problem Statement:** Current tools are built on the analytic framework of input --\> [computations] --\> output, referred to here as the ICO framework. Log files are often created to track which computations were performed on which files and what the output represents. This collection of flat input, log, and output files remains connected only through the files' relative paths to one another and the textual descriptions in the log files. Files are often moved across a filesystem, which breaks the fragile link between the files in the collection. When files in the collection are moved, modified, or deleted, data provenance is lost and the computation can no longer be fully validated. When users modify files, the logging mechanisms of tools built on this framework cannot detect the changes, so data provenance is lost immediately at the **tool** level and the user must track the changes and amend the log file or electronic notebook.

Further, tools built on this framework produce disconnected output when an analysis is replicated or new data is run through the tool. To compare data run separately through these tools, researchers often need custom scripts and data analysis workflows to compile the data into a comparable form.

**Solution:** Create an interconnected data network across similar collections of data types and compute processes. The implementation uses a database management system (DBMS) plus an API to i) structure output in a scalable and efficient way, ii) track all analysis events internally through the database to ensure data provenance, and iii) provide a data schema and storage platform that allows 'stackable data analysis'. All communication with input, program, and output data goes through a wrapper API connected to the MongoDB-hosted database (which can live on any local machine) with internal event logging. The wrapper also contains built-in functionality for compiling data for analysis at various levels based on metadata attributes. This allows data to be stored and analyzed beyond the level of a single researcher or project: individuals within a group can share the same database and either query and analyze only their own data or have it pre-compiled based on any provided metadata attribute(s).
Note: Users can still export all flat files and data, and can perform all database interaction steps through MongoDB's native support.
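
The wrapper idea described above can be illustrated with a minimal sketch. This is not the CazyStack API: the class `StackSketch`, its methods, and the record fields (`gene`, `family`) are hypothetical names chosen for illustration, and plain Python lists stand in for the MongoDB collections so the example runs without a database server. It shows the two behaviors the solution relies on: every interaction is logged next to the data, and stored output can be compiled by any metadata attribute.

```python
from collections import defaultdict
from datetime import datetime, timezone


class StackSketch:
    """In-memory stand-in for the database wrapper (hypothetical, not the CazyStack API)."""

    def __init__(self):
        self.results = []  # one document per analysis output record
        self.events = []   # internal event log, stored alongside the data

    def _log(self, action, detail):
        # Every interaction is recorded with a UTC timestamp.
        self.events.append({
            "action": action,
            "detail": detail,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def add_result(self, record, metadata):
        """Store an output record together with its metadata and log the event."""
        doc = {"record": dict(record), "metadata": dict(metadata)}
        self.results.append(doc)
        self._log("insert", {"metadata": dict(metadata)})
        return doc

    def compile_by(self, attribute):
        """Group all stored records by one metadata attribute."""
        groups = defaultdict(list)
        for doc in self.results:
            groups[doc["metadata"].get(attribute)].append(doc["record"])
        self._log("compile", {"attribute": attribute})
        return dict(groups)


# Example records and metadata values are made up for illustration.
stack = StackSketch()
stack.add_result({"gene": "g1", "family": "GH5"}, {"project": "soil", "scientist": "A"})
stack.add_result({"gene": "g2", "family": "GT2"}, {"project": "soil", "scientist": "B"})
stack.add_result({"gene": "g3", "family": "GH5"}, {"project": "gut", "scientist": "A"})

by_project = stack.compile_by("project")      # records grouped per project
by_scientist = stack.compile_by("scientist")  # same data, grouped per scientist
```

Because the metadata travels with each record, the same stored output can be regrouped by project, by scientist, or by any other attribute without custom compilation scripts.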


Vocabulary

  1. Input-Computation-Output (ICO) framework
  2. Stackable data structure
  3. Data provenance
  4. Analysis workflow
  5. MongoDB
  6. DAta NEtwork (DANE)
  7. Pass

**Graphical Representation** ![Graphical Representation](./cazystack/static/CAZYstack-Graph.png)

Figure 1
Above, three depictions of running a computation whose results all end up connected and can be analyzed together. (A) In addition to the normal input for run_dbcan, the user is prompted to add custom metadata attributes such as project, scientist, or any other user-defined fields. The dbCAN program runs as usual, with an additional computational step that organizes the output into a semi-structured format in a relational document database. This local, temporary instance of the database is then merged into a larger database that holds all original CAZyme output plus the extra metadata and internal logging.
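
The merge step in (A) can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: `fingerprint` and `merge_run` are hypothetical helpers, the record fields are made up, a dict stands in for the larger database, and skipping exact duplicates by content hash is an assumption of this sketch, not something the description above specifies.

```python
import hashlib
import json


def fingerprint(doc):
    """Stable content hash of a document, used here as its identifier."""
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merge_run(main_db, run_docs, metadata, log):
    """Merge one run's records into main_db, tagging each with the run metadata.

    Exact duplicates (same record and same metadata) are skipped; this
    deduplication rule is an assumption of the sketch.
    """
    added = 0
    for record in run_docs:
        doc = {"record": record, "metadata": metadata}
        key = fingerprint(doc)
        if key not in main_db:
            main_db[key] = doc
            added += 1
    # The merge itself is logged, so provenance lives with the data.
    log.append({"event": "merge", "metadata": metadata,
                "added": added, "seen": len(run_docs)})
    return added


main_db, log = {}, []
# Hypothetical output records from one temporary, per-run database.
run_a = [{"gene": "g1", "family": "GH5"}, {"gene": "g2", "family": "GT2"}]

first = merge_run(main_db, run_a, {"project": "soil", "scientist": "A"}, log)
repeat = merge_run(main_db, run_a, {"project": "soil", "scientist": "A"}, log)
other = merge_run(main_db, run_a, {"project": "gut", "scientist": "B"}, log)
```

Re-merging the same run adds nothing, while the same records under different metadata are kept as separate documents, so runs from different projects or scientists remain distinguishable inside the shared database.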

Contributors

ddeemerpurdue
