
wfau / gaia-dmp

Gaia data analysis platform

License: GNU General Public License v3.0

Python 0.94% Shell 62.95% HCL 6.86% Dockerfile 6.20% Java 1.39% Smarty 3.95% Jinja 17.72%

gaia-dmp's People

Contributors: akrause2014, dependabot[bot], millingw, nigelhambly, stvoutsin, zarquan


gaia-dmp's Issues

Java client example

Examples of how to run tasks on our Spark system using a Java program launched from a command line rather than inside a notebook.

Allow users to store data using Zeppelin (Persistent or Temporary Storage)

We need to allow Zeppelin users to store data somewhere.

Do we look into using HDFS for this as a first step?
The Zeppelin node in the "production" service is very small, so most likely we can't use that.

If we move to a service similar to what we have for lsst, with Jupyter on K8s, and spawn new Zeppelin containers per user, we can allocate storage for those containers. Making it persistent is probably a longer term task though, as it would require us to set up user management, quotas, etc.

There is a requirement for persistent storage, but temporary storage would work as a start to help our initial users develop notebooks. That said, persistent storage with HDFS might be easier to set up.

Proposals for CI platform

We need a Continuous Integration system that runs automated tests each time we create a new deployment. This is the first step along that route, to propose solutions and describe how they would work.

Given that we are putting our code into GitHub, the obvious choice is the GitHub Actions CI system, but there is a whole range of others to choose from as well.

The question is: how do we integrate the CI system with a deployment on an external OpenStack system that isn't part of the GitHub universe?
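As a starting point for discussion, a minimal sketch of what a GitHub Actions workflow for this could look like. The secret name `OPENSTACK_CLOUDS_YAML`, the playbook path, and the test script are all placeholders, not anything that exists in the repo yet; the external OpenStack deployment would be driven by credentials stored as repository secrets.

```yaml
name: deploy-test
on:
  push:
    branches: [ master ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Write OpenStack credentials
        # OPENSTACK_CLOUDS_YAML is a hypothetical repository secret
        # holding a clouds.yaml for the external OpenStack system.
        run: |
          mkdir -p ~/.config/openstack
          echo "${{ secrets.OPENSTACK_CLOUDS_YAML }}" > ~/.config/openstack/clouds.yaml
      - name: Run deployment
        # placeholder playbook path
        run: ansible-playbook deploy/create-all.yml
      - name: Run tests
        # placeholder test entry point
        run: ./run-tests.sh
```

The open problem from above remains: the runner needs network access to the OpenStack API endpoint, which a GitHub-hosted runner may not have; a self-hosted runner inside the OpenStack project would be one way around that.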

Make config-dns use discovery

Make config-dns use the same instance discovery process as config-ssh to get the lists of virtual machines and their IP addresses.

Remove the reliance on the partial files generated during the create process.
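A sketch of what the shared discovery step might look like, assuming discovery is built on the JSON output of `openstack server list -f json` (the exact "Networks" formatting varies between CLI versions; the string form is handled here, and the server names are illustrative):

```python
import json

def servers_to_addresses(server_list_json):
    """Map server name -> first IP address, from `openstack server list -f json` output.

    In the string form of the output, "Networks" looks like
    "net-name=10.0.0.5, 192.168.1.7"; we take the first address
    listed for each server.
    """
    addresses = {}
    for server in json.loads(server_list_json):
        networks = server.get("Networks", "")
        # Strip the "network-name=" prefix and drop any extra addresses.
        first = networks.split("=", 1)[-1].split(",")[0].strip()
        addresses[server["Name"]] = first
    return addresses

# Example with the kind of JSON the CLI emits (names are illustrative):
sample = '[{"Name": "zeppelin", "Networks": "internal=10.10.0.12"}]'
addresses = servers_to_addresses(sample)
```

Both config-ssh and config-dns could call this one function, removing the dependence on the partial files written during create.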

Splitting the project into layers

I think we need to split this project into three separate GitHub projects to reflect the different layers in the system. Hopefully this will make it easier to manage the source code, issues and testing for each of the layers.

  1. Deploy Kubernetes into an empty OpenStack project space.
  2. Deploy Spark and Zeppelin into an empty Kubernetes cluster.
  3. Develop a set of Spark and Zeppelin tools for analysing the Gaia data.
  • Each layer is deployable and testable as a separate component in its own right.
  • Each layer has a separate GitHub repo, including code base, issues, and CI testing process.

This way we can be working on development versions of the Kubernetes and Spark layers, while at the same time using the current release versions of them as a stable platform to develop the Spark and Zeppelin analysis tools.

This has implications for issue #33. Each layer would have its own CI integration, using the other layers as components to build the CI test environment.

Python client example

Examples of how to run tasks on our Spark system using a Python client launched from a command line rather than inside a notebook.
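One plausible shape for such a client is a thin wrapper around `spark-submit`. The sketch below only builds the command line; the master URL, deploy mode, and script paths are placeholder defaults, not values taken from our deployment:

```python
import subprocess

def spark_submit_command(script, master="yarn", deploy_mode="client", args=()):
    """Build a spark-submit command line for a PySpark program.

    `script` is the path to the PySpark job; extra `args` are passed
    through to it.  Defaults here are placeholders for illustration.
    """
    return ["spark-submit",
            "--master", master,
            "--deploy-mode", deploy_mode,
            script, *args]

# To actually launch the job (requires a Spark installation):
# subprocess.run(spark_submit_command("examples/pi.py", args=["100"]), check=True)
command = spark_submit_command("examples/pi.py", args=["100"])
```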

Template form for benchmarks

As we add more benchmark tests we need to be sure we are being consistent.
This task is to create some kind of standard layout that we use to state the conditions of each test that we run.

  • Test date : 2020-01-22
  • Cloud name : cumulus, gaia-dev
  • Number of workers: ...
  • Cores per worker: ...
  • Memory per worker: ...
    ... etc
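To keep the records consistent, the template could be generated rather than hand-written. A minimal sketch (the field names simply mirror the list above and are not fixed anywhere yet):

```python
def benchmark_record(**fields):
    """Render a benchmark test description as a markdown bullet list.

    Field names mirror the template above; anything passed in is
    emitted in order, so new test conditions can be added without
    changing this code.
    """
    labels = {
        "test_date": "Test date",
        "cloud_name": "Cloud name",
        "workers": "Number of workers",
        "cores": "Cores per worker",
        "memory": "Memory per worker",
    }
    lines = ["* %s : %s" % (labels.get(key, key), value)
             for key, value in fields.items()]
    return "\n".join(lines)

record = benchmark_record(test_date="2020-01-22",
                          cloud_name="cumulus, gaia-dev",
                          workers=4)
```

A generated record could then be pasted into the issue or results page for each benchmark run.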

Export a notebook as an executable Python program

Future issues will address where the program can be run from, how to import all the required libraries, and what happens to the original notebook text.
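Since Zeppelin stores each note as JSON with a "paragraphs" list, the core of the export could be as simple as filtering paragraphs by their interpreter directive. A sketch, assuming `%pyspark`/`%python` directives on the first line of each paragraph (the sample note is synthetic):

```python
import json

def notebook_to_python(note_json):
    """Extract the Python paragraphs of a Zeppelin note as one script.

    Zeppelin notes are JSON with a "paragraphs" list; each paragraph's
    "text" starts with an interpreter directive such as %pyspark.
    We keep only the python paragraphs and strip the directive line.
    """
    note = json.loads(note_json)
    chunks = []
    for para in note.get("paragraphs", []):
        lines = (para.get("text") or "").splitlines()
        if lines and lines[0].strip() in ("%pyspark", "%python"):
            chunks.append("\n".join(lines[1:]))
    return "\n\n".join(chunks)

# A synthetic two-paragraph note: one markdown, one pyspark.
sample_note = json.dumps({"paragraphs": [
    {"text": "%md\n# A heading"},
    {"text": "%pyspark\nprint('hello')"},
]})
script = notebook_to_python(sample_note)
```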

Integration with Gaia svn

Create a github/aglais directory in the Gaia svn repository and link it to this GitHub project.

This includes a shell script suitable for adding as a cron job that pulls the latest changes from the default branch of the GitHub project and pushes them to the svn repository.

Using a top level github directory in svn leaves room to add links to more GitHub projects if we split the project into smaller parts later on (see #34).
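The cron job boils down to a short, fixed sequence of commands. A sketch of that sequence, built as Python command lists rather than run directly (the repository paths are placeholders, and `svn add --force` is needed so that files new to the git side get scheduled for commit):

```python
def sync_commands(git_dir, svn_dir):
    """Commands a cron job might run to mirror a git checkout into an svn working copy.

    Paths are placeholders.  The rsync excludes each tool's metadata
    directory so neither repository sees the other's bookkeeping files.
    """
    return [
        ["git", "-C", git_dir, "pull", "--ff-only"],
        ["rsync", "-a", "--delete",
         "--exclude", ".git", "--exclude", ".svn",
         git_dir + "/", svn_dir + "/"],
        ["svn", "add", "--force", svn_dir],
        ["svn", "commit", svn_dir, "-m", "Sync from GitHub"],
    ]

# The cron script would run these in order with subprocess.run(cmd, check=True).
commands = sync_commands("/repos/git/aglais", "/repos/svn/github/aglais")
```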

Scala client example

Examples of how to run tasks on our Spark system using a Scala program launched from a command line rather than inside a inside a notebook.

Export notebooks to GitHub

Integration with the GitHub system to be able to load and save notebooks to/from the user's GitHub account.
Adding a 'Save to GitHub' button to the Zeppelin interface could become the primary route for sharing notebooks with colleagues.

Make Ansible os_server idempotent

There is a bug in the Ansible os_server module that means create can't be applied twice. It should detect that there is already an instance with the same name and update it rather than create a new one.

From what I can tell, it does detect an existing instance, but it doesn't do the update properly. If the server definition contains a security group, then it tries to apply the same group again and fails with a duplicate group error message from the server.

The first part of this task is to identify where it is going wrong and create a patch fix for us to use. Then, if we can fix it properly, we can propose the fix to the upstream project.

Source code is here:
https://github.com/ansible/ansible/blob/devel/lib/ansible/modules/cloud/openstack/os_server.py

Logrotate needed on Hadoop workers for /logs directory

The Hadoop logs directory seems to grow quite fast, which we discovered recently after out-of-space errors started appearing when trying to run Spark jobs. The cluster recovered after clearing the logs from each worker.
We need to set up logrotate on the Hadoop workers for the /logs directory.
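A sketch of what the logrotate rule might look like; the path is a placeholder for wherever the workers' /logs directory actually lives, and the retention numbers are suggestions, not measured requirements:

```
# Placeholder path: adjust to the workers' Hadoop log directory.
/opt/hadoop/logs/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    # Hadoop daemons keep their log files open, so truncate in place
    # rather than moving the file out from under them.
    copytruncate
}
```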

Track the Ansible/OpenStack network issue

The Ansible/OpenStack os_network module has an issue with an extra param added to the call to the OpenStack client component:

We have a patch in our Docker container that implements a short term fix to make this go away:

Longer term, we need to track this:

  • It may just be incompatible versions of Ansible and the OpenStack client.
  • If Ansible/OpenStack resolve the issue, we can remove our patch.
  • If Ansible/OpenStack don't resolve the issue, we need to work out a better fix.

Parquet analyser (notebook)

A Zeppelin notebook that demonstrates how to unpack a Parquet data set and extract the metadata, displaying things like number of files, file format, compression, number of columns, block size, indexing etc.

Three use cases:

  1. Point this notebook at an unknown dataset to learn how it is formatted.
  2. Each individual step in the notebook can be used as an example of how to extract metadata properties from a dataset.
  3. A derivative of this notebook could be used to check that our datasets are formatted as expected.
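Full metadata decoding needs a library such as pyarrow, but the first step of the notebook, verifying that a file really is Parquet, can be done from the file framing alone. A stdlib-only sketch (the `fake` byte string is synthetic, with valid framing but meaningless contents):

```python
import struct

def parquet_footer_info(data):
    """Check the Parquet magic bytes and return the footer length.

    A Parquet file starts and ends with the 4-byte magic "PAR1"; the
    4 bytes before the trailing magic hold the little-endian length of
    the Thrift-encoded footer metadata.  Decoding that footer (schema,
    row groups, compression, ...) needs a library such as pyarrow;
    this function only reads the framing.
    """
    if data[:4] != b"PAR1" or data[-4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return footer_len

# A synthetic byte string with valid framing and a 3-byte "footer":
fake = b"PAR1" + b"rowdata" + b"xyz" + struct.pack("<I", 3) + b"PAR1"
footer_len = parquet_footer_info(fake)
```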

Transform Gaia DR2 csv into Parquet

A Spark job to transform the Gaia DR2 csv files into Parquet.
Written as source code for a Spark job that we can submit automatically as part of a Continuous Integration test suite.
Including an initial set of pass/fail tests to count the output files, check the data types, count the rows etc.
Include a parameter to select how much of the data to process, e.g. 10 csv files, 100 csv files, etc., enabling us to run quick tests and build small deployments with partial data.
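The partial-data parameter is mostly about choosing a deterministic subset of inputs before handing them to Spark. A sketch of that selection step (the actual transform would pass the result to `spark.read.csv(...)`, which is not shown here):

```python
def select_input_files(csv_files, limit=None):
    """Pick a deterministic subset of the DR2 csv files for a partial run.

    Sorting first means repeated test runs use the same files;
    limit=None processes the whole dataset.  The real job would pass
    the returned list to spark.read.csv(...).
    """
    chosen = sorted(csv_files)
    return chosen if limit is None else chosen[:limit]

subset = select_input_files(["GaiaSource_2.csv", "GaiaSource_1.csv"], limit=1)
```

The same `limit` could drive the pass/fail checks, e.g. the expected output row count scales with the number of input files selected.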

Library of data analysis tasks

A set of notebooks that demonstrate how to do basic analysis steps on the DR2 dataset, including things like calculating the mean, creating histograms etc.

Each example written in each of the three main languages used for analysis: Python, Java and Scala. The aim is to build up a teaching resource like the Apache Spark documentation, with tabs showing the Python, Java and Scala version of each example.
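For the Python tab, the core of a mean-and-histogram example fits in a few lines. A plain-Python stand-in for what the notebook would do with a Spark column (the magnitude values below are made-up example numbers):

```python
from statistics import mean
from collections import Counter

def magnitude_histogram(values, bin_width=1.0):
    """Bin values into fixed-width bins, returning {bin start: count}.

    A stdlib stand-in for the equivalent Spark aggregation; in the
    notebook the same binning would run over a DataFrame column.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up magnitude values for illustration:
mags = [14.2, 14.8, 15.1, 15.9, 16.0]
average = mean(mags)
hist = magnitude_histogram(mags)
```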

Multi-user SSH key

Create an SSH key file that contains keys for each of the users and install it in our OpenStack projects.
https://help.switch.ch/engines/faq/how-to-use-multiple-ssh-keys/

  • Check to see if the group key is accessible by different OpenStack users.
  • Check to see if this enables us to create VMs with multiple keys in them.

If not, find an alternative method to enable developers to log in to virtual machines created by another member of the team.
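Building the combined key file is a simple merge of the team's public keys into one authorized_keys-style file. A sketch (the key strings below are illustrative placeholders, not real keys):

```python
def merge_public_keys(key_texts):
    """Combine several users' public keys into one authorized_keys-style file.

    Blank lines are dropped and duplicates removed while preserving
    order, so the same key pasted twice only appears once.
    """
    seen = []
    for text in key_texts:
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.append(line)
    return "\n".join(seen) + "\n"

# Illustrative placeholder keys, with one duplicate:
combined = merge_public_keys([
    "ssh-rsa AAAA... alice@example\n",
    "ssh-rsa BBBB... bob@example\n",
    "ssh-rsa AAAA... alice@example\n",
])
```

The resulting file could then be uploaded as a single keypair in each OpenStack project for the first check above.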
