skytruth / cerulean-cloud

All cloud services including inference and database structure

License: Apache License 2.0

Python 50.06% HTML 7.87% Jupyter Notebook 41.97% Mako 0.08% Dockerfile 0.03%

cerulean-cloud's People

Contributors

aemonm, ancientpi3, cjthomas730, jonaraphael, leothomas, rbavery, vincentsarago


cerulean-cloud's Issues

Error: Pytest not discovering tests

This is happening locally on Jona's computer, but does not happen during deployment via Github Actions.

Seems possibly related to pydantic?

```
pydantic/fields.py:550: in pydantic.fields.ModelField._type_analysis
    ???
../../mambaforge/lib/python3.9/typing.py:852: in __subclasscheck__
    return issubclass(cls, self.__origin__)
E   TypeError: issubclass() arg 1 must be a class
```

Error: Cloud Run out of memory

```
Memory limit of 5120 MiB exceeded with 5261 MiB used.
```

Happens about 25 times/day with the orchestrator (at 5 GB) and about 10 times/day with the offset tiles (at 5 GB).

This can be mitigated by increasing the memory value in the following locations (a Pulumi sketch follows the list). Note that if you increase it substantially, you will also have to increase the number of vCPUs, per https://cloud.google.com/run/docs/configuring/memory-limits

• cerulean-cloud/stack/cloud_run_orchestrator.py
• cerulean-cloud/stack/cloud_run_offset_tile.py
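For reference, a minimal sketch of where those limits live, assuming the stack files above create the services with pulumi_gcp's cloudrun.Service resource; the resource name, region, image, and limit values are illustrative, not the actual stack code:

```python
# Sketch only: assumes the stack files above use pulumi_gcp's cloudrun.Service.
# Names, region, image, and limit values are illustrative.
import pulumi_gcp as gcp

orchestrator_service = gcp.cloudrun.Service(
    "cloud-run-orchestrator",
    location="europe-west1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image="gcr.io/example-project/orchestrator:latest",
                    resources=gcp.cloudrun.ServiceTemplateSpecContainerResourcesArgs(
                        # Raising memory past 4 GiB also requires more vCPUs
                        # (see the memory-limits doc linked above).
                        limits={"memory": "8Gi", "cpu": "2"},
                    ),
                )
            ],
        ),
    ),
)
```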

Optimize resources for orchestrator and offset tile cloud run

The Cloud Run services we are using are charged per vCPU-second (compute) and per GB-second (memory). Currently I have set the orchestrator to 4 GB / 4 vCPU and the offset tile service to 4 GB / 8 vCPU.

Looking at the metrics for the orchestrator, there is room for optimisation. Average run time is quite low at the moment because of the dry runs (average billed time is 2.63 s). The same can be said of the offset tile metrics: average billed time is around 2.02 s.

I've limited the concurrency for the cloud run offset tile to 20 per orchestrator (so at most 20 instances run per orchestrator at the same time). I think we could reduce this further without significant time increases. Our queue is defined so that we send a small number of tasks at a time, the idea being to reuse the started-up containers as much as possible and avoid startup costs. I've created this issue for further reference.

Speed up deployment flow

Currently a single deployment takes around 15 minutes. This is in part due to our double-stack configuration with Pulumi, caused by concurrency-management issues in pulumi-docker.

Ideally these would be merged into a single stack.

In addition, multiple deployments occur even when no code change has occurred. This can also be further optimized.

inference options discussion

There are significant trade-offs between torch scripting, torch tracing, and a torch state_dict. The differences between scripting and tracing are described here: https://paulbridger.com/posts/mastering-torchscript/

Findings from inspecting with @rodrigoalmeida94:

  • Pickling with fastai is not an option: it requires too many extra dependencies and is brittle. Ideally we depend only on torch for inference, not fastai and not icevision: https://docs.fast.ai/learner.html#Learner.export
  • Exporting with torch scripting fails with an obscure error. It's very possible the fastai unet learner violates one of these rules: https://pytorch.org/docs/stable/jit_unsupported.html. Fixing that would require in-the-weeds modifications that I don't think we should focus on now if there are easier options (see below).
  • Saving a torch state_dict is no problem. Loading it with only torch as a dependency is difficult, however, because we'd need to pull the fastai internals that construct the model into our inference code, create the base pytorch model class, and then load the state_dict into that model class.
  • So the simplest solution seems to be torch tracing to get inference running. However, this requires us to:
    • fix the batch size for inference to the one used for training (I think ideally we would use a batch size of 1 with the largest tile we can, and accumulate gradients over a larger step)
    • only run inference on a single GPU, since we train on a single GPU and torch tracing bakes the model parameters and the specific name of the CUDA device into the traced model

I don't think the caveats mentioned above for torch tracing are a problem, since it's OK if inference takes a while to run. It also sounds like torch tracing will make the best use of GPU resources, since the model gets JIT-compiled.
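As a reference, a minimal tracing sketch, assuming the underlying torch.nn.Module can be pulled off the fastai learner (learner.model); the tiny build_model() below is a stand-in for that unet, and the 1×3×512×512 input shape is illustrative:

```python
# Minimal sketch of torch tracing for torch-only inference. build_model() is a
# stand-in for the fastai learner's underlying torch.nn.Module; shapes are
# illustrative.
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 1, kernel_size=1),
    )


model = build_model().eval()
example = torch.rand(1, 3, 512, 512)  # batch size and device get baked into the trace

traced = torch.jit.trace(model, example)
traced.save("model_traced.pt")

# Inference side: torch is the only dependency, no fastai or icevision imports.
loaded = torch.jit.load("model_traced.pt")
with torch.no_grad():
    out = loaded(torch.rand(1, 3, 512, 512))  # inputs must match the traced shape/device
```

The baked-in batch size and device are exactly the two caveats listed above.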

model performance on cerulean cloud deployment differs from local inference

Once @jonaraphael returns from leave, we should review the model inference steps in the cerulean deployment with the new Mask R-CNN model and compare them against local inference for the same test images we've been looking at locally. Jona noted that after resolving deployment and predict-endpoint issues there are no tracebacks, but performance is low, with 0 out of 100 slicks detected. I think it's possible we are missing the multi-class NMS step (and potentially other steps) on cerulean cloud, since Jona added it recently.

Error: Merge concatenation across UTM

```
ValueError: Cannot determine common CRS for concatenation inputs, got
['WGS 84 / UTM zone XXX', 'WGS 84 / UTM zone YYY']
Use "to_crs()" to transform geometries to the same CRS before merging.
  at .concat_grids_adjust_conf ( /app/cerulean_cloud/cloud_run_orchestrator/merging.py:34 )
```

Happens about 15 times/day on the orchestrator.

Seems to happen when a scene crosses a UTM boundary, and the merging of the resulting inference fails to find a common UTM.

Options:
• Identify when the CRSs differ and reproject them all into the same CRS before merging (sketched below).
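A minimal sketch of that option, assuming the per-grid results arrive as GeoDataFrames; the function name and the choice of EPSG:4326 as the common CRS are illustrative, not the actual merging.py code:

```python
# Sketch of the proposed fix: reproject every per-grid GeoDataFrame into one
# common CRS before concatenating. Names and the CRS choice are illustrative.
from typing import List

import geopandas as gpd
import pandas as pd


def concat_in_common_crs(grids: List[gpd.GeoDataFrame],
                         crs: str = "EPSG:4326") -> gpd.GeoDataFrame:
    """Reproject each grid to `crs`, then concatenate into one GeoDataFrame."""
    reprojected = [g.to_crs(crs) for g in grids]
    return pd.concat(reprojected, ignore_index=True)
```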

Set up alembic to manage database migrations

The toy dataset potentially does not have an instance table.

In the current schema, instances can point to the scenes they came from and the inferences they came from.

There should be a single column for slick type, with values for natural slicks and vessel slicks.

@rodrigoalmeida94 and @jonaraphael can meet to figure out whether the database structure needs to be updated.
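As a reference for the setup, a minimal sketch of driving Alembic from Python once `alembic init` has generated alembic.ini and the migration environment; the revision message is illustrative, and the usual workflow is the equivalent `alembic revision --autogenerate` / `alembic upgrade head` CLI calls:

```python
# Sketch of driving Alembic programmatically; assumes alembic.ini and the
# migration environment already exist. The message string is illustrative.
from alembic import command
from alembic.config import Config

cfg = Config("alembic.ini")  # points at the migration environment

# Autogenerate a migration by diffing the SQLAlchemy models against the DB.
command.revision(cfg, message="add instance table", autogenerate=True)

# Apply all pending migrations.
command.upgrade(cfg, "head")
```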

Setup Cloud Run for merge inference

1 batch: the output is an np.array of shape (1, 7, 512, 512); include the bbox.

  • implement a mechanism to save confidences
  • apply a confidence threshold

If the model is Mask R-CNN, it returns a different shape (object detection: bounding boxes).

Merging strategy: should take category differences into account and be rule based. Jona to create pseudo-code for merging.

How do we represent the tile grid along with the inference result in the overall image? (bbox) Build the full image. (See the shape sketch after these notes.)
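For orientation, an illustrative sketch of the two output shapes being discussed; the dict keys follow the typical torchvision Mask R-CNN output format and are an assumption based on these notes, not the deployed contract:

```python
# Illustrative shapes only; keys and dtypes are assumptions, not the deployed contract.
import numpy as np

# Semantic-segmentation style output: one confidence map per class per tile.
seg_output = np.zeros((1, 7, 512, 512), dtype=np.float32)  # (batch, classes, H, W)

# Mask R-CNN style output: a variable number of detections per tile.
maskrcnn_output = {
    "boxes": np.zeros((0, 4), dtype=np.float32),    # (n, [xmin, ymin, xmax, ymax])
    "scores": np.zeros((0,), dtype=np.float32),     # per-detection confidence
    "labels": np.zeros((0,), dtype=np.int64),       # per-detection class
    "masks": np.zeros((0, 1, 512, 512), dtype=np.float32),
}
```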

Error: Scene B3DA causing error with no /bounds

```
KeyError: 'bounds'
  at .get_bounds ( /app/cerulean_cloud/titiler_client.py:43 )
```

S1A_S3_GRDH_1SDV_20230807T171941_20230807T172010_049772_05FC2A_B3DA

Seems like this one image is repeatedly tossed on the queue, and keeps breaking the container because it does not have any bounds associated with it.

Looking at it, it seems quite strange:
[scene thumbnail image]
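A possible guard, sketched here with illustrative names rather than the actual titiler_client code, so that a scene with no bounds is reported and skipped instead of repeatedly crashing the container:

```python
# Sketch with illustrative names: skip a scene whose /bounds response is
# missing instead of raising KeyError and letting the task retry forever.
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def get_bounds(response_json: dict, scene_id: str) -> Optional[List[float]]:
    """Return the scene bounds, or None (with a warning) if titiler has none."""
    bounds = response_json.get("bounds")
    if bounds is None:
        logger.warning("Scene %s returned no bounds; skipping.", scene_id)
        return None
    return bounds
```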

Error: Scene tiles intersect Caspian Sea

```
IndexError: list index out of range
  at ._orchestrate ( /app/cerulean_cloud/cloud_run_orchestrator/handler.py:349 )
```

Happens about 8 times/day.

Our custom code in cerulean-cloud/cerulean_cloud/cloud_function_scene_relevancy/main.py considers the Caspian Sea as a body of water that we want to predict on. That means that Scenes touching the Caspian are passed into the orchestrator.

However, the code that determines which tiles to pass into the predict endpoint uses globe.is_ocean(), which is 3rd party code that excludes the Caspian. This results in a zero-length list of tiles, which is a case that is not handled gracefully, causing this error.

Options for resolution:

  1. Handle the error gracefully. This means the Caspian Sea is not processed.
  2. Remove the Caspian Sea from our scene_relevancy function. This means the Caspian Sea is not processed.
  3. Edit the globe.is_ocean() code to return True for tiles over the Caspian.
  4. Replace the globe.is_ocean() code with our scene_relevancy function logic, making them consistent.

I prefer option 4.
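Whichever option is chosen, the immediate crash comes from assuming a non-empty tile list. A minimal sketch of guarding that, with illustrative types (the Tile class and its attributes are not the actual orchestrator code):

```python
# Sketch with illustrative types: filter tiles with global_land_mask and bail
# out cleanly when nothing passes, rather than indexing into an empty list.
from dataclasses import dataclass
from typing import List

from global_land_mask import globe  # the third-party mask that excludes the Caspian


@dataclass
class Tile:
    center_lat: float
    center_lon: float


def select_ocean_tiles(tiles: List[Tile]) -> List[Tile]:
    ocean_tiles = [t for t in tiles if globe.is_ocean(t.center_lat, t.center_lon)]
    if not ocean_tiles:
        # Option 1 from the list above: the scene is skipped gracefully.
        print("No ocean tiles for this scene; skipping inference.")
    return ocean_tiles
```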

Optimize concurrency limits of the cloud run inference function

Follows #18

In the current implementation we're launching all the tasks at once to run inference. I have set the concurrency limit in the function to 3 to avoid memory overrun.

Ideally, we would launch the inference tasks in such a way that we avoid spinning up lots of containers (incurring startup costs) and use the already-running containers as efficiently as possible. One approach is sketched below.
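A minimal sketch of bounding concurrency with an asyncio semaphore, mirroring the limit of 3 described above; run_inference and the tile objects are illustrative stand-ins for the real HTTP calls:

```python
# Sketch: bound concurrent inference requests so warm containers are reused
# instead of spinning up many cold ones. Names are illustrative.
import asyncio

CONCURRENCY_LIMIT = 3  # matches the limit described above


async def run_inference(tile):
    """Stand-in for POSTing one tile to the Cloud Run inference endpoint."""
    await asyncio.sleep(0)  # placeholder for the real HTTP call


async def run_all(tiles):
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

    async def run_with_limit(tile):
        async with semaphore:
            return await run_inference(tile)

    return await asyncio.gather(*(run_with_limit(t) for t in tiles))
```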

land mask is not being applied properly

Slicks are showing up inland, on rivers, etc. Either the land-masking code has a bug, or (what I think is more likely) its resolution is not high enough or it isn't accurate enough.
