
This project forked from alliance-genome/agr_loader


Data loader for the Alliance of Genome Resources website.

Home Page: http://www.alliancegenome.org/

License: MIT License



Alliance of Genome Resources Loader

ETL pipeline for Alliance of Genome Resources

Requirements

  • Docker
  • Docker-compose

Installation

  • Build the local image with make build.

  • Start the Neo4j database with make startdb. Allow ~10 seconds for Neo4j to initialize.

    • To initialize an empty database after previously using the loader, be sure to run make removedb before running make startdb.
  • Ensure that your local Docker installation has access to at least 5 GB (preferably 8 GB) of memory; otherwise the run_test target will fail with the non-intuitive error "Cannot resolve address 'neo4j'". The memory limit can be raised in the Docker preferences.
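
Rather than waiting a fixed ~10 seconds after make startdb, a script can poll until Neo4j accepts connections. The following is a minimal sketch, not part of the loader: the helper name wait_for_port is hypothetical, and it assumes the container's Bolt port is published on localhost:7687 (Neo4j's default Bolt port), which depends on your docker-compose setup.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example (assumes Bolt is reachable on localhost:7687):
# if not wait_for_port("localhost", 7687):
#     raise SystemExit("Timed out waiting for Neo4j")
```

An open port only shows that the server is listening; Neo4j may still need a moment before it answers queries, so a short extra pause or a retry around the first query can still be useful.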

Running the Loader

  • Initialize a full load with make run.
  • Alternatively, make run_test will launch a much smaller test load; this is useful for development and testing.

Running Unit Tests

  • Once the loader has been run (either test load or full load), unit tests can be executed via make unit_tests.

Accessing the Neo4j Shell

  • From your command line: docker exec -ti neo4j bin/cypher-shell
    • A quick command to count the number of nodes in your database: MATCH (n) RETURN count(n);
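
For scripted checks, the same kind of query can be piped into cypher-shell instead of typed interactively. This is a sketch under assumptions: the helper names are hypothetical, the container is named neo4j as in the command above, and authentication is assumed to be unnecessary or already configured. The -t flag is dropped because docker exec cannot allocate a TTY when stdin is a pipe.

```python
import subprocess
from typing import List


def cypher_shell_command(container: str = "neo4j") -> List[str]:
    """Build the docker exec argument list for a non-interactive cypher-shell."""
    # -i keeps stdin open so the query can be piped in; -t is omitted on purpose.
    return ["docker", "exec", "-i", container, "bin/cypher-shell"]


def run_cypher(query: str, container: str = "neo4j") -> str:
    """Send one Cypher statement to cypher-shell and return its raw stdout."""
    result = subprocess.run(
        cypher_shell_command(container),
        input=query,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker or cypher-shell fails
    )
    return result.stdout


# Example (requires the running container):
# print(run_cypher("MATCH (n) RETURN count(n);"))
```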

Stopping and Removing the Database

  • Remove the database with make removedb.

Shortcut Commands

  • make reload will re-run the Installation and Running the Loader steps from above.
  • make reload_test will re-run the same steps using a test subset of data.
  • Note: reload_test will not re-download the file bolus.

Config

  • Three loader configurations come with the system (in src/config): default.yml, develop.yml, and test.yml. Each targets a particular environment and differs in the default number of threads used for downloading files and for loading the database.
    • test.yml is used when running the load with the test data set.
    • default.yml is used on all shared systems and on production.
    • develop.yml is used for the full data set on a development system.
  • Each configuration can be modified to add or remove data types (e.g. Allele, BGI, Expression) and subtypes (e.g. ZFIN, SGD, RGD) as needed for development.
  • When adding a new data load, be sure to add to validation.yml as well so the system knows the expected data types and subtypes.
  • local_submission_system.json is consumed in addition to the submission system data (from the submission system API); it is used to customize files that do not come from the submission system, such as ontology files.
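
The role of validation.yml can be illustrated with a small check that every configured data type and subtype is among the expected ones. This is only a sketch: the real layout of the YAML files is not shown here, so the dictionaries below use a hypothetical shape (data type mapped to a list of subtypes), the sample values are made up, and find_unknown_entries is an illustrative name rather than a loader function.

```python
from typing import Dict, List, Tuple

Config = Dict[str, List[str]]


def find_unknown_entries(config: Config, expected: Config) -> List[Tuple[str, str]]:
    """Return (data_type, subtype) pairs in config that expected does not list."""
    unknown = []
    for data_type, subtypes in config.items():
        allowed = set(expected.get(data_type, []))
        for subtype in subtypes:
            if subtype not in allowed:
                unknown.append((data_type, subtype))
    return unknown


# Hypothetical stand-ins for the parsed contents of validation.yml and a loader config.
expected = {"Allele": ["ZFIN", "SGD", "RGD"], "BGI": ["ZFIN", "SGD", "RGD"]}
config = {"Allele": ["ZFIN"], "Expression": ["SGD"]}

print(find_unknown_entries(config, expected))  # [('Expression', 'SGD')]
```

Here the Expression entry is reported because the hypothetical expected mapping does not mention it, which mirrors the situation of adding a new data load without registering it for validation.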

ENV Variables

  • DOWNLOAD_HOST - the s3 bucket from which files are pulled.
  • ALLIANCE_RELEASE - the release version that this code acts on.
  • FMS_API_URL - the host from which this code pulls its available file paths (the submission system host). Note: the submission system host relies on the ferret file grabber; that pipeline is responsible for keeping ontology files and GAF files up to date. The submission system also requires a snapshot to be taken before 'latest' files can be fetched.
  • TEST_SCHEMA_BRANCH - if set, that branch of agr_schema will be used instead of master.
  • If the site is built with docker-compose, these will be set automatically to the 'dev' versions of all these variables.
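
A sketch of how a script might read these variables follows. The function name load_settings and the decision to treat the first three variables as required are assumptions made for illustration; only the fallback of TEST_SCHEMA_BRANCH to master comes from the description above.

```python
import os
from typing import Dict, Mapping, Optional

# Assumption: these three are treated as mandatory in this sketch.
REQUIRED = ("DOWNLOAD_HOST", "ALLIANCE_RELEASE", "FMS_API_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect loader settings from env (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    # An unset or empty TEST_SCHEMA_BRANCH falls back to master.
    settings["TEST_SCHEMA_BRANCH"] = env.get("TEST_SCHEMA_BRANCH") or "master"
    return settings
```

Passing a plain dictionary instead of relying on os.environ keeps the function easy to exercise without touching the real environment.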
