skytruth / cerulean-cloud

All cloud services including inference and database structure

License: Apache License 2.0

Python 50.06% HTML 7.87% Jupyter Notebook 41.97% Mako 0.08% Dockerfile 0.03%

cerulean-cloud's People

Contributors

aemonm, ancientpi3, cjthomas730, jonaraphael, leothomas, rbavery, vincentsarago


cerulean-cloud's Issues

Error: Pytest not discovering tests

This is happening locally on Jona's computer, but does not happen during deployment via Github Actions.

Seems possibly related to pydantic?

```
pydantic/fields.py:550: in pydantic.fields.ModelField._type_analysis
    ???
../../mambaforge/lib/python3.9/typing.py:852: in __subclasscheck__
    return issubclass(cls, self.__origin__)
E   TypeError: issubclass() arg 1 must be a class
```

Error: Cloud Run out of memory

```
Memory limit of 5120 MiB exceeded with 5261 MiB used.
```

Happens about 25 times/day with the orchestrator (at 5 GB) and about 10 times/day with the offset tiles (at 5 GB).

This can be mitigated by increasing the memory value in the following locations (a Pulumi sketch follows the list). Note that if you increase it substantially, you will also have to increase the number of vCPUs, per https://cloud.google.com/run/docs/configuring/memory-limits

• cerulean-cloud/stack/cloud_run_orchestrator.py
• cerulean-cloud/stack/cloud_run_offset_tile.py
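For reference, a minimal sketch of where those limits live, assuming the stack files above create the services with pulumi_gcp's cloudrun.Service resource; the resource name, region, image, and limit values are illustrative, not the actual stack code:

```python
# Sketch only: assumes the stack files above use pulumi_gcp's cloudrun.Service.
# Names, region, image, and limit values are illustrative.
import pulumi_gcp as gcp

orchestrator_service = gcp.cloudrun.Service(
    "cloud-run-orchestrator",
    location="europe-west1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image="gcr.io/example-project/orchestrator:latest",
                    resources=gcp.cloudrun.ServiceTemplateSpecContainerResourcesArgs(
                        # Raising memory past 4 GiB also requires more vCPUs
                        # (see the memory-limits doc linked above).
                        limits={"memory": "8Gi", "cpu": "2"},
                    ),
                )
            ],
        ),
    ),
)
```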

Optimize resources for orchestrator and offset tile cloud run

The Cloud Run services we are using are charged per vCPU-second (compute) and per GB-second (memory). Currently I have set the orchestrator to 4 GB / 4 vCPU and the offset tile service to 4 GB / 8 vCPU.

Looking at the metrics for the orchestrator, there is room for optimisation. Average run time is quite low at the moment because of the dry runs (average billed time is 2.63 s). The same can be said of the offset tile metrics: average billed time is around 2.02 s.

I've limited the concurrency for the cloud run offset tile to 20 per orchestrator (so at most 20 instances run per orchestrator at the same time). I think we could reduce this further without significant time increases. Our queue is defined so that we send a small number of tasks at a time, the idea being to reuse the started-up containers as much as possible and avoid startup costs. I've created this issue for further reference.

Speed up deployment flow

Currently a single deployment takes around 15 minutes. This is in part due to our double-stack configuration with Pulumi, caused by concurrency-management issues in pulumi-docker.

Ideally these would be merged into a single stack.

In addition, multiple deployments occur even when no code change has occurred. This can also be further optimized.

inference options discussion

There are significant trade-offs between torch scripting, torch tracing, and a torch state_dict. The differences between scripting and tracing are described here: https://paulbridger.com/posts/mastering-torchscript/

Findings from inspecting with @rodrigoalmeida94:

  • Pickling with fastai is not an option: it requires too many extra dependencies and is brittle. Ideally we depend only on torch for inference, not fastai and not icevision: https://docs.fast.ai/learner.html#Learner.export
  • Exporting with torch scripting fails with an obscure error. It's very possible the fastai unet learner violates one of these rules: https://pytorch.org/docs/stable/jit_unsupported.html. Fixing that would require in-the-weeds modifications that I don't think we should focus on now if there are easier options (see below).
  • Saving a torch state_dict is no problem. Loading it with only torch as a dependency is difficult, however, because we'd need to pull the fastai internals that construct the model into our inference code, create the base pytorch model class, and then load the state_dict into that model class.
  • So the simplest solution seems to be torch tracing to get inference running. However, this requires us to:
    • fix the batch size for inference to the one used for training (I think ideally we would use a batch size of 1 with the largest tile we can, and accumulate gradients over a larger step)
    • only run inference on a single GPU, since we train on a single GPU and torch tracing bakes the model parameters and the specific name of the CUDA device into the traced model

I don't think the caveats mentioned above for torch tracing are a problem, since it's OK if inference takes a while to run. It also sounds like torch tracing will make the best use of GPU resources, since the model gets JIT-compiled.
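As a reference, a minimal tracing sketch, assuming the underlying torch.nn.Module can be pulled off the fastai learner (learner.model); the tiny build_model() below is a stand-in for that unet, and the 1×3×512×512 input shape is illustrative:

```python
# Minimal sketch of torch tracing for torch-only inference. build_model() is a
# stand-in for the fastai learner's underlying torch.nn.Module; shapes are
# illustrative.
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=1),
    )


model = build_model().eval()
example = torch.rand(1, 3, 512, 512)  # batch size and device get baked into the trace

traced = torch.jit.trace(model, example)
traced.save("model_traced.pt")

# Inference side: torch is the only dependency, no fastai or icevision imports.
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    out = loaded(torch.rand(1, 3, 512, 512))  # inputs must match the traced shape/device
```

The baked-in batch size and device are exactly the two caveats listed above.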

model performance on cerulean cloud deployment differs from local inference

Once @jonaraphael returns from leave, we should review the model inference steps in the cerulean deployment with the new Mask R-CNN model and compare them against local inference for the same test images we've been looking at locally. Jona noted that after resolving deployment and predict-endpoint issues there are no tracebacks, but performance is low, with 0 out of 100 slicks detected. I think it's possible we are missing the multi-class NMS step (and potentially other steps) on cerulean cloud, since Jona added it recently.

Error: Merge concatenation across UTM

```
ValueError: Cannot determine common CRS for concatenation inputs, got
['WGS 84 / UTM zone XXX', 'WGS 84 / UTM zone YYY']
Use "to_crs()" to transform geometries to the same CRS before merging.
  at .concat_grids_adjust_conf ( /app/cerulean_cloud/cloud_run_orchestrator/merging.py:34 )
```

Happens about 15 times/day on the orchestrator.

Seems to happen when a scene crosses a UTM boundary, and the merging of the resulting inference fails to find a common UTM.

Options:
• Identify when the CRSs differ and reproject them all into the same CRS before merging (sketched below).
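A minimal sketch of that option, assuming the per-grid results arrive as GeoDataFrames; the function name and the choice of EPSG:4326 as the common CRS are illustrative, not the actual merging.py code:

```python
# Sketch of the proposed fix: reproject every per-grid GeoDataFrame into one
# common CRS before concatenating. Names and the CRS choice are illustrative.
from typing import List

import geopandas as gpd
import pandas as pd


def concat_in_common_crs(grids: List[gpd.GeoDataFrame],
                         crs: str = "EPSG:4326") -> gpd.GeoDataFrame:
    """Reproject each grid to `crs`, then concatenate into one GeoDataFrame."""
    reprojected = [g.to_crs(crs) for g in grids]
    return pd.concat(reprojected, ignore_index=True)
```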

Set up alembic to manage database migrations

The toy dataset potentially does not have an instance table.

In the current schema, instances can point to the scenes they came from and the inferences they came from.

There should be a single column for slick type, with values for natural slicks and vessel slicks.

@rodrigoalmeida94 and @jonaraphael can meet to figure out whether the database structure needs to be updated.
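As a reference for the setup, a minimal sketch of driving Alembic from Python once `alembic init` has generated alembic.ini and the migration environment; the revision message is illustrative, and the usual workflow is the equivalent `alembic revision --autogenerate` / `alembic upgrade head` CLI calls:

```python
# Sketch of driving Alembic programmatically; assumes alembic.ini and the
# migration environment already exist. The message string is illustrative.
from alembic import command
from alembic.config import Config

cfg = Config("alembic.ini")  # points at the migration environment

# Autogenerate a migration by diffing the SQLAlchemy models against the DB.
command.revision(cfg, message="add instance table", autogenerate=True)

# Apply all pending migrations.
command.upgrade(cfg, "head")
```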

Setup Cloud Run for merge inference

1 batch: the output is an np.array of shape (1, 7, 512, 512); include the bbox.

  • implement a mechanism to save confidences
  • apply a confidence threshold

If the model is Mask R-CNN, it returns a different shape (object detection: bounding boxes).

Merging strategy: should take category differences into account and be rule based. Jona to create pseudo-code for merging.

How do we represent the tile grid along with the inference result in the overall image? (bbox) Build the full image. (See the shape sketch after these notes.)
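For orientation, an illustrative sketch of the two output shapes being discussed; the dict keys follow the typical torchvision Mask R-CNN output format and are an assumption based on these notes, not the deployed contract:

```python
# Illustrative shapes only; keys and dtypes are assumptions, not the deployed contract.
import numpy as np

# Semantic-segmentation style output: one confidence map per class per tile.
seg_output = np.zeros((1, 7, 512, 512), dtype=np.float32)  # (batch, classes, H, W)

# Mask R-CNN style output: a variable number of detections per tile.
maskrcnn_output = {
    "boxes": np.zeros((0, 4), dtype=np.float32),    # (n, [xmin, ymin, xmax, ymax])
    "scores": np.zeros((0,), dtype=np.float32),     # per-detection confidence
    "labels": np.zeros((0,), dtype=np.int64),       # per-detection class
    "masks": np.zeros((0, 1, 512, 512), dtype=np.float32),
}
```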

Error: Scene B3DA causing error with no /bounds

```
KeyError: 'bounds'
  at .get_bounds ( /app/cerulean_cloud/titiler_client.py:43 )
```

S1A_S3_GRDH_1SDV_20230807T171941_20230807T172010_049772_05FC2A_B3DA

Seems like this one image is repeatedly tossed on the queue, and keeps breaking the container because it does not have any bounds associated with it.

Looking at it, it seems quite strange:
[scene thumbnail image]
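A possible guard, sketched here with illustrative names rather than the actual titiler_client code, so that a scene with no bounds is reported and skipped instead of repeatedly crashing the container:

```python
# Sketch with illustrative names: skip a scene whose /bounds response is
# missing instead of raising KeyError and letting the task retry forever.
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def get_bounds(response_json: dict, scene_id: str) -> Optional[List[float]]:
    """Return the scene bounds, or None (with a warning) if titiler has none."""
    bounds = response_json.get("bounds")
    if bounds is None:
        logger.warning("Scene %s returned no bounds; skipping.", scene_id)
        return None
    return bounds
```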

Error: Scene tiles intersect Caspian Sea

```
IndexError: list index out of range
  at ._orchestrate ( /app/cerulean_cloud/cloud_run_orchestrator/handler.py:349 )
```

Happens about 8 times/day.

Our custom code in cerulean-cloud/cerulean_cloud/cloud_function_scene_relevancy/main.py considers the Caspian Sea as a body of water that we want to predict on. That means that Scenes touching the Caspian are passed into the orchestrator.

However, the code that determines which tiles to pass into the predict endpoint uses globe.is_ocean(), which is 3rd party code that excludes the Caspian. This results in a zero-length list of tiles, which is a case that is not handled gracefully, causing this error.

Options for resolution:

  1. Handle the error gracefully. This means the Caspian Sea is not processed.
  2. Remove the Caspian Sea from our scene_relevancy function. This means the Caspian Sea is not processed.
  3. Edit the globe.is_ocean() code to return True for tiles over the Caspian.
  4. Replace the globe.is_ocean() code with our scene_relevancy function logic, making them consistent.

I prefer option 4.
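Whichever option is chosen, the immediate crash comes from assuming a non-empty tile list. A minimal sketch of guarding that, with illustrative types (the Tile class and its attributes are not the actual orchestrator code):

```python
# Sketch with illustrative types: filter tiles with global_land_mask and bail
# out cleanly when nothing passes, rather than indexing into an empty list.
from dataclasses import dataclass
from typing import List

from global_land_mask import globe  # the third-party mask that excludes the Caspian


@dataclass
class Tile:
    center_lat: float
    center_lon: float


def select_ocean_tiles(tiles: List[Tile]) -> List[Tile]:
    ocean_tiles = [t for t in tiles if globe.is_ocean(t.center_lat, t.center_lon)]
    if not ocean_tiles:
        # Option 1 from the list above: the scene is skipped gracefully.
        print("No ocean tiles for this scene; skipping inference.")
    return ocean_tiles
```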

Optimize concurrency limits of the cloud run inference function

Follows #18

In the current implementation we're launching all the tasks at once to run inference. I have set the concurrency limit in the function to 3 to avoid memory overrun.

Ideally, we would launch the inference tasks in such a way that we avoid spinning up lots of containers (incurring startup costs) and use the already-running containers as efficiently as possible. One approach is sketched below.
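A minimal sketch of bounding concurrency with an asyncio semaphore, mirroring the limit of 3 described above; run_inference and the tile objects are illustrative stand-ins for the real HTTP calls:

```python
# Sketch: bound concurrent inference requests so warm containers are reused
# instead of spinning up many cold ones. Names are illustrative.
import asyncio

CONCURRENCY_LIMIT = 3  # matches the limit described above


async def run_inference(tile):
    """Stand-in for POSTing one tile to the Cloud Run inference endpoint."""
    await asyncio.sleep(0)  # placeholder for the real HTTP call


async def run_all(tiles):
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

    async def run_with_limit(tile):
        async with semaphore:
            return await run_inference(tile)

    return await asyncio.gather(*(run_with_limit(t) for t in tiles))
```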

land mask is not being applied properly

Slicks are showing up inland, on rivers, etc. Either the land-masking code has a bug, or (what I think is more likely) its resolution is not high enough or it isn't accurate enough.
