
wfau / gaia-dmp

Gaia data analysis platform

License: GNU General Public License v3.0

Python 0.94% Shell 62.95% HCL 6.86% Dockerfile 6.20% Java 1.39% Smarty 3.95% Jinja 17.72%

gaia-dmp's People

Contributors: akrause2014, dependabot[bot], millingw, nigelhambly, stvoutsin, zarquan


gaia-dmp's Issues

Java client example

Examples of how to run tasks on our Spark system using a Java program launched from a command line rather than inside a notebook.

Allow users to store data using Zeppelin (Persistent or Temporary Storage)

We need to allow Zeppelin users to store data somewhere.

Do we look into using HDFS for this as a first step?
The Zeppelin node in the "production" service is very small, so most likely we can't use that.

If we move to a service similar to what we have for lsst, with Jupyter on K8s, and spawn new Zeppelin containers per user, we can allocate storage for those containers. Making it persistent is probably a longer term task though, as it would require us to set up user management, quotas, etc.

There is a requirement for persistent storage, but temporary storage would work as a start to help our initial users develop notebooks. That said, persistent storage with HDFS might be easier to set up.

Proposals for CI platform

We need a Continuous Integration system that runs automated tests each time we create a new deployment. This is the first step along that route, to propose solutions and describe how they would work.

Given that we are putting our code into GitHub, the obvious choice is the GitHub Actions CI system, but there is a whole range of others to choose from as well.

The question is: how do we integrate the CI system with a deployment on an external OpenStack system that isn't part of the GitHub universe?
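As a starting point for discussion, a minimal sketch of what a GitHub Actions workflow for this could look like. The secret name `OPENSTACK_CLOUDS_YAML`, the playbook path, and the test script are all placeholders, not anything that exists in the repo yet; the external OpenStack deployment would be driven by credentials stored as repository secrets.

```yaml
name: deploy-test
on:
  push:
    branches: [ master ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Write OpenStack credentials
        # OPENSTACK_CLOUDS_YAML is a hypothetical repository secret
        # holding a clouds.yaml for the external OpenStack system.
        run: |
          mkdir -p ~/.config/openstack
          echo "${{ secrets.OPENSTACK_CLOUDS_YAML }}" > ~/.config/openstack/clouds.yaml
      - name: Run deployment
        # placeholder playbook path
        run: ansible-playbook deploy/create-all.yml
      - name: Run tests
        # placeholder test entry point
        run: ./run-tests.sh
```

The open problem from above remains: the runner needs network access to the OpenStack API endpoint, which a GitHub-hosted runner may not have; a self-hosted runner inside the OpenStack project would be one way around that.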

Make config-dns use discovery

Make config-dns use the same instance discovery process as config-ssh to get the lists of virtual machines and their IP addresses.

Remove the reliance on the partial files generated during the create process.
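A sketch of what the shared discovery step might look like, assuming discovery is built on the JSON output of `openstack server list -f json` (the exact "Networks" formatting varies between CLI versions; the string form is handled here, and the server names are illustrative):

```python
import json

def servers_to_addresses(server_list_json):
    """Map server name -> first IP address, from `openstack server list -f json` output.

    In the string form of the output, "Networks" looks like
    "net-name=10.0.0.5, 192.168.1.7"; we take the first address
    listed for each server.
    """
    addresses = {}
    for server in json.loads(server_list_json):
        networks = server.get("Networks", "")
        # Strip the "network-name=" prefix and drop any extra addresses.
        first = networks.split("=", 1)[-1].split(",")[0].strip()
        addresses[server["Name"]] = first
    return addresses

# Example with the kind of JSON the CLI emits (names are illustrative):
sample = '[{"Name": "zeppelin", "Networks": "internal=10.10.0.12"}]'
addresses = servers_to_addresses(sample)
```

Both config-ssh and config-dns could call this one function, removing the dependence on the partial files written during create.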

Splitting the project into layers

I think we need to split this project into three separate GitHub projects to reflect the different layers in the system. Hopefully this will make it easier to manage the source code, issues and testing for each of the layers.

  1. Deploy Kubernetes into an empty OpenStack project space.
  2. Deploy Spark and Zeppelin into an empty Kubernetes cluster.
  3. Develop a set of Spark and Zeppelin tools for analysing the Gaia data.
  • Each layer is deployable and testable as a separate component in its own right.
  • Each layer has a separate GitHub repo, including code base, issues, and CI testing process.

This way we can be working on development versions of the Kubernetes and Spark layers, while at the same time using the current release versions of them as a stable platform to develop the Spark and Zeppelin analysis tools.

This has implications for issue #33. Each layer would have its own CI integration, using the other layers as components to build the CI test environment.

Python client example

Examples of how to run tasks on our Spark system using a Python client launched from a command line rather than inside a notebook.
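One plausible shape for such a client is a thin wrapper around `spark-submit`. The sketch below only builds the command line; the master URL, deploy mode, and script paths are placeholder defaults, not values taken from our deployment:

```python
import subprocess

def spark_submit_command(script, master="yarn", deploy_mode="client", args=()):
    """Build a spark-submit command line for a PySpark program.

    `script` is the path to the PySpark job; extra `args` are passed
    through to it.  Defaults here are placeholders for illustration.
    """
    return ["spark-submit",
            "--master", master,
            "--deploy-mode", deploy_mode,
            script, *args]

# To actually launch the job (requires a Spark installation):
# subprocess.run(spark_submit_command("examples/pi.py", args=["100"]), check=True)
command = spark_submit_command("examples/pi.py", args=["100"])
```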

Template form for benchmarks

As we add more benchmark tests we need to be sure we are being consistent.
This task is to create some kind of standard layout that we use to state the conditions of each test that we run.

  • Test date : 2020-01-22
  • Cloud name : cumulus, gaia-dev
  • Number of workers: ...
  • Cores per worker: ...
  • Memory per worker: ...
    ... etc
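To keep the records consistent, the template could be generated rather than hand-written. A minimal sketch (the field names simply mirror the list above and are not fixed anywhere yet):

```python
def benchmark_record(**fields):
    """Render a benchmark test description as a markdown bullet list.

    Field names mirror the template above; anything passed in is
    emitted in order, so new test conditions can be added without
    changing this code.
    """
    labels = {
        "test_date": "Test date",
        "cloud_name": "Cloud name",
        "workers": "Number of workers",
        "cores": "Cores per worker",
        "memory": "Memory per worker",
    }
    lines = ["* %s : %s" % (labels.get(key, key), value)
             for key, value in fields.items()]
    return "\n".join(lines)

record = benchmark_record(test_date="2020-01-22",
                          cloud_name="cumulus, gaia-dev",
                          workers=4)
```

A generated record could then be pasted into the issue or results page for each benchmark run.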

Export a notebook as an executable Python program

Future issues will address where the program can be run from, how to import all the required libraries, and what happens to the original notebook text.
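Since Zeppelin stores each note as JSON with a "paragraphs" list, the core of the export could be as simple as filtering paragraphs by their interpreter directive. A sketch, assuming `%pyspark`/`%python` directives on the first line of each paragraph (the sample note is synthetic):

```python
import json

def notebook_to_python(note_json):
    """Extract the Python paragraphs of a Zeppelin note as one script.

    Zeppelin notes are JSON with a "paragraphs" list; each paragraph's
    "text" starts with an interpreter directive such as %pyspark.
    We keep only the python paragraphs and strip the directive line.
    """
    note = json.loads(note_json)
    chunks = []
    for para in note.get("paragraphs", []):
        lines = (para.get("text") or "").splitlines()
        if lines and lines[0].strip() in ("%pyspark", "%python"):
            chunks.append("\n".join(lines[1:]))
    return "\n\n".join(chunks)

# A synthetic two-paragraph note: one markdown, one pyspark.
sample_note = json.dumps({"paragraphs": [
    {"text": "%md\n# A heading"},
    {"text": "%pyspark\nprint('hello')"},
]})
script = notebook_to_python(sample_note)
```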

Integration with Gaia svn

Create a github/aglais directory in the Gaia svn repository and link it to this GitHub project.

This includes a shell script suitable for adding as a cron job that pulls the latest changes from the default branch of the GitHub project and pushes them to the svn repository.

Using a top level github directory in svn leaves room to add links to more GitHub projects if we split the project into smaller parts later on (see #34).
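The cron job boils down to a short, fixed sequence of commands. A sketch of that sequence, built as Python command lists rather than run directly (the repository paths are placeholders, and `svn add --force` is needed so that files new to the git side get scheduled for commit):

```python
def sync_commands(git_dir, svn_dir):
    """Commands a cron job might run to mirror a git checkout into an svn working copy.

    Paths are placeholders.  The rsync excludes each tool's metadata
    directory so neither repository sees the other's bookkeeping files.
    """
    return [
        ["git", "-C", git_dir, "pull", "--ff-only"],
        ["rsync", "-a", "--delete",
         "--exclude", ".git", "--exclude", ".svn",
         git_dir + "/", svn_dir + "/"],
        ["svn", "add", "--force", svn_dir],
        ["svn", "commit", svn_dir, "-m", "Sync from GitHub"],
    ]

# The cron script would run these in order with subprocess.run(cmd, check=True).
commands = sync_commands("/repos/git/aglais", "/repos/svn/github/aglais")
```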

Scala client example

Examples of how to run tasks on our Spark system using a Scala program launched from a command line rather than inside a inside a notebook.

Export notebooks to GitHub

Integration with the GitHub system to be able to load and save notebooks to/from the user's GitHub account.
Adding a 'Save to GitHub' button to the Zeppelin interface could become the primary route for sharing notebooks with colleagues.

Make Ansible os_server idempotent

There is a bug in the Ansible os_server module that means create can't be applied twice. It should detect that there is already an instance with the same name and update it rather than create a new one.

From what I can tell, it does detect an existing instance, but it doesn't do the update properly. If the server definition contains a security group, then it tries to apply the same group again and fails with a duplicate group error message from the server.

The first part of this task is to identify where it is going wrong and create a patch fix for us to use. Then, if we can fix it properly, we can propose the fix to the upstream project.

Source code is here:
https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/cloud/openstack/os_server.py

Logrotate needed on Hadoop workers for /logs directory

The Hadoop logs directory seems to grow quite fast, which we discovered recently after out-of-space errors started appearing when trying to run Spark jobs. The cluster recovered after clearing the logs from each worker.
We need to set up logrotate on the Hadoop workers for the /logs directory.
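A sketch of what the logrotate rule might look like; the path is a placeholder for wherever the workers' /logs directory actually lives, and the retention numbers are suggestions, not measured requirements:

```
# Placeholder path: adjust to the workers' Hadoop log directory.
/opt/hadoop/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # Hadoop daemons keep their log files open, so truncate in place
    # rather than moving the file out from under them.
    copytruncate
}
```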

Track the Ansible/OpenStack network issue

The Ansible/OpenStack os_network module has an issue with an extra param added to the call to the OpenStack client component:

We have a patch in our Docker container that implements a short term fix to make this go away:

Longer term, we need to track this:

  • It may just be incompatible versions of Ansible and the OpenStack client.
  • If Ansible/OpenStack resolve the issue, we can remove our patch.
  • If Ansible/OpenStack don't resolve the issue, we need to work out a better fix.

Parquet analyser (notebook)

A Zeppelin notebook that demonstrates how to unpack a Parquet data set and extract the metadata, displaying things like number of files, file format, compression, number of columns, block size, indexing etc.

Three use cases:

  1. Point this notebook at an unknown dataset to learn how it is formatted.
  2. Each individual step in the notebook can be used as an example of how to extract metadata properties from a dataset.
  3. A derivative of this notebook could be used to check that our datasets are formatted as expected.
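Full metadata decoding needs a library such as pyarrow, but the first step of the notebook, verifying that a file really is Parquet, can be done from the file framing alone. A stdlib-only sketch (the `fake` byte string is synthetic, with valid framing but meaningless contents):

```python
import struct

def parquet_footer_info(data):
    """Check the Parquet magic bytes and return the footer length.

    A Parquet file starts and ends with the 4-byte magic "PAR1"; the
    4 bytes before the trailing magic hold the little-endian length of
    the Thrift-encoded footer metadata.  Decoding that footer (schema,
    row groups, compression, ...) needs a library such as pyarrow;
    this function only reads the framing.
    """
    if data[:4] != b"PAR1" or data[-4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return footer_len

# A synthetic byte string with valid framing and a 3-byte "footer":
fake = b"PAR1" + b"rowdata" + b"xyz" + struct.pack("<I", 3) + b"PAR1"
footer_len = parquet_footer_info(fake)
```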

Transform Gaia DR2 csv into Parquet

A Spark job to transform the Gaia DR2 csv files into Parquet.
Written as source code for a Spark job that we can submit automatically as part of a Continuous Integration test suite.
Including an initial set of pass/fail tests to count the output files, check the data types, count the rows etc.
Include a parameter to select how much of the data to process, e.g. 10 csv files, 100 csv files, etc., enabling us to run quick tests and build small deployments with partial data.
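The partial-data parameter is mostly about choosing a deterministic subset of inputs before handing them to Spark. A sketch of that selection step (the actual transform would pass the result to `spark.read.csv(...)`, which is not shown here):

```python
def select_input_files(csv_files, limit=None):
    """Pick a deterministic subset of the DR2 csv files for a partial run.

    Sorting first means repeated test runs use the same files;
    limit=None processes the whole dataset.  The real job would pass
    the returned list to spark.read.csv(...).
    """
    chosen = sorted(csv_files)
    return chosen if limit is None else chosen[:limit]

subset = select_input_files(["GaiaSource_2.csv", "GaiaSource_1.csv"], limit=1)
```

The same `limit` could drive the pass/fail checks, e.g. the expected output row count scales with the number of input files selected.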

Library of data analysis tasks

A set of notebooks that demonstrate how to do basic analysis steps on the DR2 dataset, including things like calculating the mean, creating histograms etc.

Each example written in each of the three main languages used for analysis: Python, Java and Scala. The aim is to build up a teaching resource like the Apache Spark documentation, with tabs showing the Python, Java and Scala version of each example.
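For the Python tab, the core of a mean-and-histogram example fits in a few lines. A plain-Python stand-in for what the notebook would do with a Spark column (the magnitude values below are made-up example numbers):

```python
from statistics import mean
from collections import Counter

def magnitude_histogram(values, bin_width=1.0):
    """Bin values into fixed-width bins, returning {bin start: count}.

    A stdlib stand-in for the equivalent Spark aggregation; in the
    notebook the same binning would run over a DataFrame column.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up magnitude values for illustration:
mags = [14.2, 14.8, 15.1, 15.9, 16.0]
average = mean(mags)
hist = magnitude_histogram(mags)
```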

Multi-user SSH key

Create an SSH key file that contains keys for each of the users and install it in our OpenStack projects.
https://help.switch.ch/engines/faq/how-to-use-multiple-ssh-keys/

  • Check to see if the group key is accessible by different OpenStack users.
  • Check to see if this enables us to create VMs with multiple keys in them.

If not, find an alternative method to enable developers to log in to virtual machines created by another member of the team.
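Building the combined key file is a simple merge of the team's public keys into one authorized_keys-style file. A sketch (the key strings below are illustrative placeholders, not real keys):

```python
def merge_public_keys(key_texts):
    """Combine several users' public keys into one authorized_keys-style file.

    Blank lines are dropped and duplicates removed while preserving
    order, so the same key pasted twice only appears once.
    """
    seen = []
    for text in key_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.append(line)
    return "\n".join(seen) + "\n"

# Illustrative placeholder keys, with one duplicate:
combined = merge_public_keys([
    "ssh-rsa AAAA... alice@example\n",
    "ssh-rsa BBBB... bob@example\n",
    "ssh-rsa AAAA... alice@example\n",
])
```

The resulting file could then be uploaded as a single keypair in each OpenStack project for the first check above.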
