
gns-science / nshm-toshi-api


An extensible API where task metadata, and the important input and output files relating to data-intensive science processes, are retained. Custom task schemas can be defined to support task-specific metadata needs.

License: GNU Affero General Public License v3.0

Python 100.00%

nshm-toshi-api's People

Contributors

chrisbc, chrisdicaprio


nshm-toshi-api's Issues

Create skeleton API for demo

As NSHM testers

We want to store and catalogue the test results from test runs on local machines, servers, clusters, etc.

So that we can compare historic outputs, etc

Done when

  • secure with APIKey (for demo only)
  • serverless deploy
  • store and get a tuple of (binary, json_meta) (see the request sketch below)
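A minimal sketch of the demo interaction, assuming an API Gateway endpoint secured with an API key; the endpoint URL, key, and the create_file mutation are placeholders, not the final schema:

import requests

API_URL = "https://example.execute-api.ap-southeast-2.amazonaws.com/demo/graphql"  # placeholder URL
HEADERS = {"x-api-key": "DEMO-ONLY-KEY"}  # API Gateway API key (demo only)

# hypothetical mutation recording the (binary, json_meta) tuple as a file record;
# the binary payload itself would be uploaded separately (e.g. via a presigned POST)
query = """
mutation {
  create_file(file_name: "results.zip", meta: "{}") {
    file_result { id }
  }
}
"""
resp = requests.post(API_URL, headers=HEADERS, json={"query": query})
resp.raise_for_status()
print(resp.json())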

Feature: Inversion Solution support for labelled table relations

The SRM team want to run different hazard analyses and visualise these using maps and plots; eventually these will come from openquake (ref #68). Some analyses will produce quite large tables (especially gridded ones) and will be produced independently; we want to link them as they're produced and retain maximum flexibility/scalability.

Done When:

  • API user can link a table (with type, created, table_id) using a standard Mutation query (see the sketch below)
  • API table-link accepts metadata, so additional data describing the table properties may be collected
  • Add meta-data to Table
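A hedged sketch of the kind of mutation an API user might send; the create_table_relation name, argument names and enum value are illustrative, not the final schema:

# illustrative only: mutation and field names are not the final schema
LINK_TABLE_MUTATION = """
mutation {
  create_table_relation(
    thing_id: "THING_NODE_ID"
    table_id: "TABLE_NODE_ID"
    table_type: HAZARD_GRIDDED
    created: "2021-06-01T00:00:00+00:00"
    meta: [{k: "grid_spacing", v: "0.1"}]
  ) { ok }
}
"""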

Fix node ID uniqueness bug

We expected this, but it wasn't found until we tested in anger on the beavan cluster.

The current approach using S3 object counts is not bulletproof under high load.

Setup CI/CD pipeline using serverless stack

AS NSHM team

we want to configure CI/CD on this project

so that deployments to test and prod environments are robust and environments are stable

Done when

  • we define a branch user-test that is linked to serverless (sls) stage test
  • main branch is linked to sls stage prod
  • typical CI/CD behaviour on each (PR -> test -> merge -> deploy)

possible guide:
deploy using github actions (https://medium.com/better-programming/set-up-a-ci-cd-pipeline-for-aws-lambda-with-github-actions-and-serverless-in-under-5-minutes-fd070da9d143)

Feature: RuptureSet subclass of file

As API users we want to capture the specifics of RuptureSet files

so that they're more easily used in UI & client code.

example http://simple-toshi-ui.s3-website-ap-southeast-2.amazonaws.com/FileDetail/RmlsZToxMjkwOTg0

Done When:

  • API client can create a new RuptureSet
  • API client can fetch an existing RuptureSet
  • API clients should be able to access these objects as Files, using either the File NodeID or the RuptureSet NodeID
  • API search supports RuptureSet
  • Same fields as File, and also...
    • created (date/time)
    • Producer ID (link to writer task)
    • metrics kv list (from the task output)
    • fault_model field (from arguments)
  • same upload/download features as File

Add GeneralTask as new schema type

We want to capture metadata and related inputs/outputs for arbitrary tasks that may not happen often enough to justify automation and/or a custom schema type. We'll call this a GeneralTask, as it may be used for many purposes.

NB: we briefly considered calling this type VersatileEvent.

Attributes are (type sketch below):

  • related files (with reader/writer role)
  • agent_name: the name of the person or process responsible for the task
  • title
  • description
  • created
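Assuming the schema is built with graphene (Python/GraphQL), the type might look roughly like this; a sketch only, with the file relation simplified:

import graphene
from graphene import relay

class GeneralTask(graphene.ObjectType):
    """An arbitrary task with metadata and related input/output files (sketch)."""
    class Meta:
        interfaces = (relay.Node,)

    agent_name = graphene.String(description="the person or process responsible")
    title = graphene.String()
    description = graphene.String()
    created = graphene.DateTime()
    # related files (with reader/writer role) would use a relation/connection
    # type in the real schema; simplified here to a list of file IDs
    files = graphene.List(graphene.ID)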

API errors

Inversion runs:
R2VuZXJhbFRhc2s6ODA0NlFNVTc0 and R2VuZXJhbFRhc2s6ODExOVI0VHhB

In the Batch/Fargate log ...

2022-02-07T18:04:29.516+13:00 self._toshi_api.automation_task.upload_task_file(task_id, java_log_file, 'WRITE')
...
2022-02-07T18:04:29.517+13:00 requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: https://aihssdkef5.execute-api.ap-southeast-2.amazonaws.com/prod/graphql

and in the Toshi API log

Previous request

2022-02-07T18:03:59.483+13:00 SearchManager.index_document https://search-nzshm22-toshi-api-es-prod-cj4taqcgnefophpxzan55xeswa.ap-southeast-2.es.amazonaws.com/toshi_index/_doc/ThingData_81865pDzH_object.json
...
2022-02-07T18:03:59.537+13:00 b'{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse [files]"}],"type":"mapper_parsing_exception","reason":"failed to parse [files]","caused_by":{"type":"illegal_state_exception","reason":"Can\'t get text on a START_OBJECT at 1:78"}},"status":400}'

...
and then

2022-02-07T18:03:59.954+13:00 START RequestId: a9ff3ded-cde6-4ebc-93ce-803742f23d35 Version: $LATEST
2022-02-07T05:04:29.981Z a9ff3ded-cde6-4ebc-93ce-803742f23d35 Task timed out after 30.02 seconds

Feature: Table Schema type so we can collect MFDs etc

We want to collect spreadsheet-like tables

so that they can be associated with tasks or files and consumed easily by UI etc. (A type sketch follows the checklist.)

Done When:

  • table column names are configurable
  • table column types are configurable
  • read / write entire table + row as object
  • Inversion Task accepts mfd_table property
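A sketch of what a configurable Table type could look like, assuming a graphene schema; column names/types as parallel string lists is just one possible design:

import graphene

class Table(graphene.ObjectType):
    """A spreadsheet-like table with configurable columns (sketch)."""
    name = graphene.String()
    column_headers = graphene.List(graphene.String)  # configurable column names
    column_types = graphene.List(graphene.String)    # e.g. "string", "double", "integer"
    rows = graphene.List(graphene.List(graphene.String))  # whole table, row-major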

Slow search with total_count field

e.g. ...


query q1 {
  search(search_term: "inversion") {
    search_result {
      total_count
      edges {
        node {
          __typename
          ... on Node {
            __isNode: __typename
            id
          }
          ... on RuptureGenerationTask {
            created
            id
            duration
            state
            result
            
          }
          ... on GeneralTask {
            description
            title
            created
            children {
              total_count
            }
          }
          ... on File {
            id
            file_name
            file_size
          }
        }
      }
    }
  }
}

is too slow with the total_count field. Shouldn't this be as fast with the field as without it?

Initial beavan cluster rupture generation with API (smoke test)

As SRM Team,

we want to run some real-world smoke testing on beavan,

so we can see how things behave

Done when

  • finalise API schema for opensha Rupture Generation
  • publish python API client in nshm-toshi-client
  • modify client-side automation to support API interface
  • configured on beavan
  • running tests

Feature: dynamodb migration

We want to use dynamodb for a more responsive user experience and better integrity on mutations

  • add a pynamodb model for each data class (File, Thing, etc)
  • on read, try pynamodb first; if not there, read from S3 (see the sketch below)
  • on write, write just to pynamodb
  • show a pattern for adding test coverage
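A sketch of the read path under these assumptions; the model and table names are illustrative:

from pynamodb.models import Model
from pynamodb.attributes import JSONAttribute, UnicodeAttribute

class ToshiObjectModel(Model):  # illustrative: one model per data class (File, Thing, etc)
    class Meta:
        table_name = "ToshiObject"
        region = "ap-southeast-2"
    object_id = UnicodeAttribute(hash_key=True)
    object_content = JSONAttribute()

def get_object(object_id, s3_fallback):
    """Try pynamodb first; fall back to S3 for objects not yet migrated."""
    try:
        return ToshiObjectModel.get(object_id).object_content
    except ToshiObjectModel.DoesNotExist:
        return s3_fallback(object_id)  # legacy S3 read path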

BUG: produced_by_id incorrect on sub-query

Find why this produces two different node IDs, in TEST:

query {
  node(id: "SW52ZXJzaW9uU29sdXRpb246MTU0NC4wdlA4QmQ=") {
    ... on InversionSolution {
      produced_by_id
      produced_by {
        id
      }
    }
  }
}

Incomplete set of objects returned by get_all()

It seems that not all the objects are returned by this S3 API call, in data_s3.base_s3_data.py:

def get_all(self):
    """
    Returns:
        list: a list containing all the objects materialised from the S3 bucket
    """
    task_results = []
    for obj_summary in self._bucket.objects.filter(Prefix='%s/' % self._prefix):
        prefix, task_result_id, _ = obj_summary.key.split('/')
        assert prefix == self._prefix
        task_results.append(self.get_one(task_result_id))
    return task_results

In S3 docs on filtering we see:

The response might contain fewer keys but will never contain more

Note we'll be adding proper relay pagination support soon, but for now let's do this so we can see all the contents.
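If truncation is the cause, explicit pagination with the low-level client would rule it out. A diagnostic sketch (bucket and prefix names are placeholders), noting that boto3 resource collections are supposed to paginate automatically:

import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(Bucket="nshm-toshi-bucket", Prefix="ToshiObject/"):  # placeholders
    keys.extend(obj["Key"] for obj in page.get("Contents", []))
print(len(keys))  # compare against the count returned by get_all()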

BUG: GeneralTask does not support the 'parents' relation

This mutation will fail when the child_id refers to a GeneralTask, as GeneralTask does not support the 'parents' attribute:

mutation new_gt_link {
  create_task_relation(
    parent_id: "R2VuZXJhbFRhc2s6Mg=="
    child_id: "R2VuZXJhbFRhc2s6NA=="
  )    
....

The relation is created, but any query on it via taskrelation will fail.

Suggest we add this attribute so that GeneralTasks can also have parent tasks.

Add HazardAnalysisTask

As SRM Team

we want to configure a HazardAnalysisTask with inputs & outputs, input arguments and run metrics

so the task results can be recorded and available for further analysis

  • Input file (opensha solution)
  • Output (hazard curve(s))
  • task arguments (from NSHMInversionRunner)
  • Metrics - similarity

S3 data read/write consistency

Related to #41: it looks like rapid-fire bursts of updates to a single object, as will occur at the beginning of a cluster job with many sub-tasks, can cause the read consistency to fail. It seems likely that writes are buffered at S3 and reads in close proximity will not immediately 'see' the updated status.

This was found on the beavan cluster with 40 rupture build tasks submitted via run_rupture_sets.py to the TEST API. In this case, while no errors were reported client-side, the parent general task R2VuZXJhbFRhc2s6MjA4ckx0Y3M= has just 22 children instead of the expected 40. All the child tasks, files and relationships have been written correctly.

Done when:

  • confirm the issue is 'avoidable' by inserting small start offsets between tasks, giving S3 enough time to reach consistency. Suggest 200-500ms per task should be plenty.
  • test a write-through cache between the data manager and the S3 read/write operations. This will need to manage its memory footprint (expiring cache objects based on last-access time? see the sketch below).
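A minimal sketch of the second option, assuming a single-process data manager; eviction is by last-access time:

import time

class WriteThroughCache:
    """Expiring write-through cache between the data manager and S3 (sketch)."""

    def __init__(self, s3_read, s3_write, ttl=300.0, max_items=1000):
        self._read, self._write = s3_read, s3_write
        self._ttl, self._max = ttl, max_items
        self._items = {}  # key -> (value, last_access_time)

    def get(self, key):
        hit = self._items.get(key)
        if hit and time.time() - hit[1] < self._ttl:
            self._items[key] = (hit[0], time.time())  # refresh access time
            return hit[0]
        value = self._read(key)  # miss or expired: fall through to S3
        self._store(key, value)
        return value

    def put(self, key, value):
        self._write(key, value)  # write-through: S3 remains authoritative
        self._store(key, value)

    def _store(self, key, value):
        if len(self._items) >= self._max:  # evict the least-recently-accessed entry
            oldest = min(self._items, key=lambda k: self._items[k][1])
            del self._items[oldest]
        self._items[key] = (value, time.time())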

Migration: convert File objects to RuptureSet or InversionSolution objects

We want to migrate File objects to the new types as appropriate, so that users can benefit from the new features on historic objects. (A sketch of the per-object conversion follows the checklist.)

NB: API clients should be able to access these objects as Files using either File NodeID or Subclass NodeID

Done When

  • convert to RuptureSet object when (define criteria [max_jump_dist, fault_model])
  • convert to InversionSolution object when (define criteria [completion_energy, ])
  • for each conversion....
    • delete old ES index entry (until
    • add new ES index entry
    • log object ID and other info for data audit
  • pre conversion
    • copy PROD data to test bucket and validate entire process there, including UI
    • make backup Bucket
    • user comms
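A sketch of the per-object conversion step; the object.json layout and the clazz_name field are assumptions based on how the subclassing appears to work:

import json
import logging

import boto3

def convert_file_object(bucket_name, key, new_clazz):
    """Rewrite one stored File object as a subclass, logging for the data audit."""
    obj_ref = boto3.resource("s3").Bucket(bucket_name).Object(key)
    body = json.loads(obj_ref.get()["Body"].read())
    old_id = body.get("id")
    body["clazz_name"] = new_clazz  # assumed field: "RuptureSet" or "InversionSolution"
    obj_ref.put(Body=json.dumps(body).encode())
    logging.info("converted %s -> %s (%s)", old_id, new_clazz, key)
    # ES re-indexing (delete the old entry, add the new one) not shown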

Feature: Add InversionSolution to schema as File sub-class

As API users we want to capture the specifics of InversionSolution files

so that they're more easily used in UI & client code.

Done When:

  • API client can create a new InversionSolution
  • API client can fetch an existing InversionSolution
  • API clients should be able to access these objects using either the File NodeID or the InversionSolution NodeID
    NOTE that both File.ID and InversionSolution.ID will resolve to the same FileData object, which will be returned cast to the clazzname = InversionSolution.
  • API search supports InversionSolution
  • Same fields as File, and also...
    • created (date/time)
    • Hazard Table ID
    • MFD table ID
    • Producer ID (link to writer task)
    • metrics kv list (from the task output)
  • same upload/download features as File
  • A field to get the object ID as its File superclass (maybe helpful in UI during migrations)

Add InversionTask

As SRM Team

we want to configure an InversionTask with inputs & outputs, input arguments and run metrics

so that these task results are recorded and available for further analysis

  • Input file (opensha ruptureset)
  • Output file (opensha solution)
  • task arguments (from NSHMInversionRunner):
        double totalRateM5 = 5d; // expected number of M>=5's per year TODO: OK? ref David Rhodes/Chris Roland? [KKS, CBC]
        double bValue = 1d; // G-R b-value
        // magnitude to switch from MFD equality to MFD inequality
        double mfdTransitionMag = 7.85; // TODO: how to validate this number for NZ? (ref Morgan Page in USGS/UCERF3) [KKS, CBC]
        double mfdEqualityConstraintWt = 10;
        double mfdInequalityConstraintWt = 1000;
        int mfdNum = 40;
        double mfdMin = 5.05d;
        double mfdMax = 8.95;
        GutenbergRichterMagFreqDist mfd = new GutenbergRichterMagFreqDist(
                bValue, totalRateM5, mfdMin, mfdMax, mfdNum);
        int transitionIndex = mfd.getClosestXIndex(mfdTransitionMag);
        // snap it to the discretization if it wasn't already
        mfdTransitionMag = mfd.getX(transitionIndex);
        Preconditions.checkState(transitionIndex >= 0);
        GutenbergRichterMagFreqDist equalityMFD = new GutenbergRichterMagFreqDist(
                bValue, totalRateM5, mfdMin, mfdTransitionMag, transitionIndex);
        MFD_InversionConstraint equalityConstr = new MFD_InversionConstraint(equalityMFD, null);
        GutenbergRichterMagFreqDist inequalityMFD = new GutenbergRichterMagFreqDist(
                bValue, totalRateM5, mfdTransitionMag, mfdMax, mfd.size() - equalityMFD.size());
        MFD_InversionConstraint inequalityConstr = new MFD_InversionConstraint(inequalityMFD, null);

        constraints.add(new MFDEqualityInversionConstraint(rupSet, mfdEqualityConstraintWt,
                Lists.newArrayList(equalityConstr), null));
        constraints.add(new MFDInequalityInversionConstraint(rupSet, mfdInequalityConstraintWt,
                Lists.newArrayList(inequalityConstr)));

        // weight of entropy-maximization constraint (not used in UCERF3)
        double smoothnessWt = 0;

ISSUE: Elastic Search offline in PROD - toshi_index index is down

Looks like something's happened to our ES search index in PROD. Here's the health panel...
Showing searchable documents = 0, and the main index ('toshi_index') does not exist...

Client error (from API lambda logs):

2021-11-19T10:05:19.436+13:00 {'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'toshi_index', 'index_uuid': '_na_', 'index': 'toshi_index'}], 'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'toshi_index', 'index_uuid': '_na_', 'index': 'toshi_index'}, 'status': 404}


bug: inversion_solution query for old sub-solution returns wrong ID.

re http://simple-toshi-ui.s3-website-ap-southeast-2.amazonaws.com/GeneralTask/R2VuZXJhbFRhc2s6NjAzOWI5TUNV/Details

To reproduce: go to the above view and open Show Reports. All the pages show the same (incorrect) inversion solution SW52ZXJzaW9uU29sdXRpb246MTY3NDIuMFVrbXJl.

Done when

  • create a test fixture
  • create a test that reproduces the issue - it will fail
  • make the fix and see the new test pass. In the comments, link to this ticket URL.

Add Report Task

We want to capture the publication of various analysis reports (e.g. RupSetDiagnostics) so that the team has immediate/reliable access to these.

Notes:

Attributes

  • report has a type (RupSetDiag, InvSolDiag; later, named-fault-*)
  • zip file of report (so we can re-publish)
  • published location URI
  • created
  • meta {k v}

Optimise: simplify file structure used for file_relation_data

The current design saves these relations independently to unique object.json files. So, for every link we have:

  ObjectA.json   <-> Relationship.json <-> ObjectB.json
   - other props      - Role               - other props

This is a very flexible design, supporting many-to-many relationships and also relationship properties (e.g. ROLE). The price is that the extra file reads/writes make certain API operations overly slow and IO intensive.

Proposed (JSON sketch after the snags list):

  ObjectA.json                      <-> ObjectB.json
   - relatedTo [(objectID, Role)]       - relatedTo [(objectID, Role)]
   - other props                        - other props

Snags:

  • currently the Relationship object is a graphql Node, so what do we have that relies on the ability to resolve its Node ID? If nothing, then there's no client impact
  • old data will need to be migrated
  • need to test the stability and actual performance impacts of this change
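For concreteness, the before/after storage layout might look like this; field names are illustrative:

# current: three objects per link
relationship_json = {"id": "REL_ID", "file_id": "OBJ_A", "thing_id": "OBJ_B", "role": "READ"}

# proposed: relations embedded in each endpoint object, no Relationship.json
object_a_json = {"id": "OBJ_A", "relatedTo": [("OBJ_B", "READ")]}  # plus other props
object_b_json = {"id": "OBJ_B", "relatedTo": [("OBJ_A", "READ")]}  # plus other props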

Feature: Query for InversionSolution for AutomationTask (AT) where AT.id_in: [is0, id1...]

We want users to select given IDs client-side, so we need a way to retrieve just those efficiently. (A resolver sketch follows the queries.)

fragment AT on AutomationTask {
  files {
    edges {
      node {
        file {
          __typename
          ... on InversionSolution {
            id #etc
          }
        }
      }
    }
  }
}

#current option (not really useable in standard relay clients...)
query multi_at {
  node0: node(id: "QXV0b21hdGlvblRhc2s6Mjk1OVZmTlpj") {
    id
    __typename
    ...AT
  }
  #http://simple-toshi-ui.s3-website-ap-southeast-2.amazonaws.com/AutomationTask/QXV0b21hdGlvblRhc2s6Mjk0MmhWck13
  node1: node(id: "QXV0b21hdGlvblRhc2s6Mjk0MmhWck13") {
    id
    __typename
    ...AT
  }
}

#proposed approach
query new_AT_id_in_demo {
  automation_task(id_in: ["QXV0b21hdGlvblRhc2s6Mjk1OVZmTlpj", "QXV0b21hdGlvblRhc2s6Mjk0MmhWck13"]) {
    edges {
      node {
        id
        ...AT
      }
    }
  }
}
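A sketch of how the proposed id_in filter could be resolved server-side, assuming a graphene schema; simplified to return a plain list rather than the relay connection shown above, and the loader is hypothetical:

import graphene
from graphql_relay import from_global_id

class AutomationTask(graphene.ObjectType):  # stub; the real type lives in the API schema
    id = graphene.ID()

def get_automation_task(object_id):  # hypothetical loader over the data layer
    return AutomationTask(id=object_id)

class QueryRoot(graphene.ObjectType):
    automation_tasks = graphene.List(
        AutomationTask, id_in=graphene.List(graphene.ID, required=True))

    def resolve_automation_tasks(self, info, id_in):
        # decode relay global IDs, e.g. "QXV0b21hdGlvblRhc2s6..." -> ("AutomationTask", object_id)
        return [get_automation_task(from_global_id(gid)[1]) for gid in id_in]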

Fully implement simplified Rupture Generation task

We want to complete the work to remove brittle schema fields and replace them with KV objects

Done when:

  • KV attributes for environment (rupt_gen_task)
  • KV attributes for metrics (rupt_gen_task)
  • KV attributes for file_metadata (file)
  • remove old-style rupt_gen_task and rename new implementation

NB: this work is based on the simplified_rupture_gen_schema branch

Add RupturesetDiagnosticsTask

As SRM Team

we want to configure a RupturesetDiagnosticsTask with inputs & outputs, input arguments and run metrics

so the task results can be recorded and available for further analysis

properties:

  • Input file (opensha solution)
  • Output report
  • task arguments (from NSHMInversionRunner)
  • metrics

Feature: Add cloudwatch instrumentation to measure performance

We want to record some metrics to get performance baselines in preparation for dynamodb and other performance related improvements.

Done when:

  • add cloudwatch module and configure ACL in serverless
  • identify and instrument the main API performance points (metrics) - see the sketch below
  • configure a Toshi-API Cloudwatch dashboard to monitor the metrics
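A minimal sketch of emitting one such metric with boto3; the namespace and metric name are placeholders:

import time

import boto3

cloudwatch = boto3.client("cloudwatch")

def timed(metric_name, fn, *args, **kwargs):
    """Run fn and record its latency as a custom CloudWatch metric."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    cloudwatch.put_metric_data(
        Namespace="ToshiAPI",  # placeholder namespace
        MetricData=[{
            "MetricName": metric_name,  # e.g. "get_object_ms"
            "Value": (time.perf_counter() - start) * 1000.0,
            "Unit": "Milliseconds",
        }],
    )
    return result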
