skytruth / cerulean-cloud
All cloud services including inference and database structure
License: Apache License 2.0
This will serve Sentinel-1 scenes to the Cerulean infra.
This error happens locally on Jona's computer, but does not happen during deployment via GitHub Actions.
It seems possibly related to pydantic:
pydantic/fields.py:550: in pydantic.fields.ModelField._type_analysis ???
../../mambaforge/lib/python3.9/typing.py:852: in __subclasscheck__ return issubclass(cls, self.__origin__)
E TypeError: issubclass() arg 1 must be a class
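For reference, the underlying TypeError is easy to reproduce outside pydantic: on Python 3.9, passing a generic alias rather than a plain class as the first argument to issubclass() fails the same way. A minimal sketch:

```python
import typing

# issubclass() requires an actual class as its first argument; a generic
# alias such as typing.List[int] is not a class, so this raises
# "TypeError: issubclass() arg 1 must be a class" -- the same error
# pydantic hits inside ModelField._type_analysis.
try:
    issubclass(typing.List[int], list)
    raised = False
except TypeError as exc:
    raised = True
    print(exc)  # issubclass() arg 1 must be a class
```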
Memory limit of 5120 MiB exceeded with 5261 MiB used.
Happens about 25 times / day with the orchestrator (@5gb).
Happens about 10 times / day with the offset tiles (@5gb).
Can be mitigated by increasing the memory value in the following locations. Note that if you increase it substantially, you will also have to increase the number of vCPUs, per https://cloud.google.com/run/docs/configuring/memory-limits:
cerulean-cloud/stack/cloud_run_orchestrator.py
cerulean-cloud/stack/cloud_run_offset_tile.py
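As a sketch of where that change would go (resource names, region, and image here are illustrative placeholders, not the actual values in the stack files), bumping the limits on a pulumi-gcp Cloud Run service looks roughly like this:

```python
import pulumi_gcp as gcp

# Illustrative only: name, location, and image are placeholders.
# Per the Cloud Run docs linked above, larger memory limits also
# require more vCPUs, so "cpu" is raised alongside "memory".
orchestrator = gcp.cloudrun.Service(
    "cloud-run-orchestrator",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image="gcr.io/my-project/orchestrator:latest",
                    resources=gcp.cloudrun.ServiceTemplateSpecContainerResourcesArgs(
                        limits={"memory": "8Gi", "cpu": "4"},
                    ),
                )
            ],
        ),
    ),
)
```

This is an infrastructure declaration and needs a Pulumi engine to run, so it is a sketch of the shape of the change rather than something executable in isolation.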
The Cloud Run functions we are using are charged per vCPU/s (compute) and GB/s (memory). Currently I have set the orchestrator to 4 GB / 4 vCPU, and the offset tile to 4 GB / 8 vCPU.

Looking at the metrics for the orchestrator, there is room for optimisation. Average run time is quite low at the moment because of the dry runs (avg billed time is 2.63s). The same can be said of the offset tile metrics, in terms of optimising resources: average billed time is around 2.02s.

I’ve limited the concurrency for the cloud run offset tile to 20 per orchestrator (so there will be at most 20 instances running per orchestrator at the same time). I think we could reduce this further without facing significant time increases. Our queue is defined so that we send a small number of tasks at a time, the idea being that we reuse the started-up containers as much as possible, avoiding the startup cost. I’ve created this issue for further reference.
Sensitive values should be passed to infrastructure (i.e. Cloud Run functions) as secrets. The same goes for Cloud Functions.
Currently a single deployment takes around 15 min to process. This is in part due to our double-stack configuration with Pulumi, caused by concurrency-management issues in pulumi-docker.
Ideally these would be merged into a single stack.
In addition, multiple deployments occur even when no code change has occurred. This can also be further optimized, for instance by setting up a cron job in the database.
There are significant trade-offs between using torch scripting, torch tracing, or a torch state_dict. The differences between scripting and tracing are described here: https://paulbridger.com/posts/mastering-torchscript/
Findings from inspecting with @rodrigoalmeida94:
I don't think the caveats mentioned above for torch tracing are a problem, since it's OK if inference takes a while to run. It sounds like torch tracing will make the best use of GPU resources, since the model gets JIT-compiled to machine code.
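The tracing path can be sketched with a toy model (this is not the actual Cerulean network): torch.jit.trace records the ops executed for an example input and compiles them, which is where the GPU-efficiency benefit comes from, at the cost of the data-dependent-control-flow caveats from the linked post.

```python
import torch

# Toy stand-in for the real model; tracing records the ops run on the
# example input, so any data-dependent branching would be "frozen in".
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
).eval()
example = torch.randn(1, 4)
traced = torch.jit.trace(model, example)

with torch.no_grad():
    out = traced(example)
print(out.shape)  # torch.Size([1, 2])
```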
Save stack passphrase as secret in GCP, fetch it at deployment time.
Once @jonaraphael returns from leave we should review the model inferencing steps within the cerulean deployment and new Mask R-CNN model and compare how they match with local inference for the same test images we've been looking at locally. Jona noted that after resolving deployment and predict endpoint issues, there are no tracebacks but performance is low, with 0 out of 100 slicks inferenced. I think it's possible we are missing the multi class NMS step (and potentially other steps) on cerulean cloud since Jona recently added this.
ValueError: Cannot determine common CRS for concatenation inputs, got ['WGS 84 / UTM zone XXX', 'WGS 84 / UTM zone YYY'] Use "to_crs()" to transform geometries to the same CRS before merging.
at .concat_grids_adjust_conf ( /app/cerulean_cloud/cloud_run_orchestrator/merging.py:34 )
Happens about 15 times / day on orchestrator.
Seems to happen when a scene crosses a UTM zone boundary, and the merging of the resulting inferences fails to find a common UTM CRS.
Options:
• We should identify when the CRSs differ and reproject them all into the same CRS first.
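A minimal sketch of that option, assuming the grids are geopandas GeoDataFrames (the function name here is illustrative, not the actual merging.py code): pick one target CRS, e.g. WGS 84, and call to_crs() on every grid before concatenating.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

def concat_in_common_crs(grids, target_crs="EPSG:4326"):
    """Reproject every grid to one CRS, then concatenate safely."""
    reprojected = [g.to_crs(target_crs) for g in grids]
    return pd.concat(reprojected, ignore_index=True)

# Two grids in different UTM zones, as happens when a scene
# crosses a zone boundary:
a = gpd.GeoDataFrame(geometry=[Point(500000, 4649776)], crs="EPSG:32633")
b = gpd.GeoDataFrame(geometry=[Point(500000, 4649776)], crs="EPSG:32634")
merged = concat_in_common_crs([a, b])
print(merged.crs)
```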
ValueError: longitude must be >= -180
Caused by a bug in rio-tiler-pds. See cogeotiff/rio-tiler-pds#77 for description of the issue.
Once that issue is resolved, there may be a further issue with globe.is_ocean() and ±180º, so check that before closing this issue.
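Until the upstream fix lands, one workaround on our side is to wrap longitudes into [-180, 180) before they reach the tiler or globe.is_ocean(). A minimal sketch (the helper name is my own, not existing code):

```python
def wrap_longitude(lon: float) -> float:
    """Wrap any longitude into the [-180, 180) interval."""
    return ((lon + 180.0) % 360.0) - 180.0

print(wrap_longitude(181.0))   # -179.0
print(wrap_longitude(-180.5))  # 179.5
print(wrap_longitude(45.0))    # 45.0
```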
Not urgent, but Apache 2.0 is a bit annoying because anyone who modifies the software needs to include a notice within every single file that is modified: https://github.com/CosmiQ/solaris/blob/main/LICENSE.txt#L97
MIT and BSD are common alternatives that do not have this restriction.
potentially related to #82
Follows #33
• The toy dataset may not have an instance table.
• In the current table, instances can point to the scenes and the inferences they came from.
• A single column for slick type, with values for natural slicks and vessel slicks.
• @rodrigoalmeida94 and @jonaraphael can meet to figure out whether the database structure needs to be updated.
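To make the discussion concrete, here is a hedged SQLAlchemy sketch of what such an instance table could look like (table and column names are my guesses, not the actual schema): one row per detected instance, foreign keys back to the scene and inference it came from, and a single slick_type column distinguishing natural from vessel slicks.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Scene(Base):
    __tablename__ = "scene"
    id = Column(Integer, primary_key=True)

class Inference(Base):
    __tablename__ = "inference"
    id = Column(Integer, primary_key=True)

class Instance(Base):
    """Hypothetical instance table: one row per detected slick instance."""
    __tablename__ = "instance"
    id = Column(Integer, primary_key=True)
    scene_id = Column(Integer, ForeignKey("scene.id"))          # scene it came from
    inference_id = Column(Integer, ForeignKey("inference.id"))  # inference it came from
    slick_type = Column(String)  # e.g. "natural" or "vessel"

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Scene(id=1))
    session.add(Inference(id=1))
    session.add(Instance(scene_id=1, inference_id=1, slick_type="vessel"))
    session.commit()
```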
I'm counting these as cases where no slicks were detected even though there is an obvious, large, linear slick.
• 1 batch = output is an np.array(1, 7, 512, 512); include bbox.
• If Mask R-CNN: returns a different shape (object detection, bounding box).
• Merging strategy takes into account category differences; rule-based. Jona to create pseudocode for merging.
• How to represent the tile grid along with the inference result in the overall image? (bbox); make full image.
• Save multiple confidence geometries in the DB.
• Output of classification is 2 rasters (confidence + class).
• Output of instance segmentation is (?).
• The merging procedure for offset tiles should also be adapted.
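The two-raster classification output above can be sketched in numpy, assuming the 7 channels hold per-class scores: argmax over the class axis gives the class raster, max gives the confidence raster.

```python
import numpy as np

# One batch of model output: (batch, classes, height, width)
logits = np.random.rand(1, 7, 512, 512)

class_raster = logits.argmax(axis=1)  # (1, 512, 512), winning class per pixel
conf_raster = logits.max(axis=1)      # (1, 512, 512), confidence of that class

print(class_raster.shape, conf_raster.shape)  # (1, 512, 512) (1, 512, 512)
```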
KeyError: 'bounds'
at .get_bounds ( /app/cerulean_cloud/titiler_client.py:43 )
S1A_S3_GRDH_1SDV_20230807T171941_20230807T172010_049772_05FC2A_B3DA
Seems like this one image is repeatedly tossed on the queue, and keeps breaking the container because it does not have any bounds associated with it.
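A hedged sketch of handling this more gracefully in the client (the function and the response shape are simplified, not the actual titiler_client.py code): validate the key and raise a clear, catchable error so one bad scene can be marked failed and dropped instead of crashing the container and poisoning the queue.

```python
class MissingBoundsError(Exception):
    """Raised when a scene's metadata has no 'bounds' entry."""

def get_bounds(scene_metadata: dict, scene_id: str):
    # Fail loudly but catchably, so the task can be marked failed and
    # dropped rather than retried forever by the queue.
    if "bounds" not in scene_metadata:
        raise MissingBoundsError(f"No bounds for scene {scene_id}")
    return scene_metadata["bounds"]

print(get_bounds({"bounds": [0, 0, 1, 1]}, "ok"))  # [0, 0, 1, 1]
```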
Operations are listed in #47 .
Idea for this interface could be a streamlit app (for visualization) or a FastAPI based interface. This could be deployed using cloud run https://www.artefact.com/blog/how-to-deploy-and-secure-your-streamlit-app-on-gcp/
IndexError: list index out of range at ._orchestrate ( /app/cerulean_cloud/cloud_run_orchestrator/handler.py:349 )
Happens about 8 times / day.
Our custom code in cerulean-cloud/cerulean_cloud/cloud_function_scene_relevancy/main.py considers the Caspian Sea as a body of water that we want to predict on. That means that Scenes touching the Caspian are passed into the orchestrator.
However, the code that determines which tiles to pass into the predict endpoint uses globe.is_ocean(), which is 3rd-party code that excludes the Caspian. This results in a zero-length list of tiles, a case that is not handled gracefully, causing this error.
Options for resolution:
• Modify the globe.is_ocean() code to return True for tiles over the Caspian.
• Modify the globe.is_ocean() code to use our scene_relevancy function logic, making them consistent.
I prefer option 4.
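A sketch of the Caspian override (the bounding box values are approximate and illustrative; the real predicate would be global_land_mask's globe.is_ocean, injected here so the logic is testable without the dependency):

```python
# Rough lat/lon box around the Caspian Sea -- approximate, illustrative values.
CASPIAN_BBOX = (36.0, 47.5, 46.0, 55.0)  # lat_min, lat_max, lon_min, lon_max

def is_water(lat: float, lon: float, is_ocean) -> bool:
    """Treat the Caspian as water even though globe.is_ocean() excludes it.

    `is_ocean` is the injected third-party predicate
    (e.g. global_land_mask.globe.is_ocean).
    """
    lat_min, lat_max, lon_min, lon_max = CASPIAN_BBOX
    if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
        return True
    return is_ocean(lat, lon)

# With a stub that always says "not ocean", the Caspian still counts as water:
print(is_water(42.0, 50.0, lambda lat, lon: False))  # True
```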
Follows #18
In the current implementation we're launching all the inference tasks at once. I have set the concurrency limit in the function to 3 to avoid memory overruns.
Ideally we would launch the inference tasks so as to avoid spinning up lots of containers (incurring startup costs), and make use of the already-running containers as efficiently as possible.
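One way to get that pattern in Python is a semaphore-bounded gather, so at most N tasks are in flight at once instead of everything launching simultaneously (the limit and the task body here are placeholders, not the actual queue code):

```python
import asyncio

async def run_with_limit(tasks, limit: int = 3):
    """Run coroutines with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(t) for t in tasks))

async def fake_inference(i):
    await asyncio.sleep(0.01)  # stand-in for a call to the predict endpoint
    return i * 2

results = asyncio.run(run_with_limit([fake_inference(i) for i in range(10)]))
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```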
Add API key auth to cloud run functions and titiler sentinel
This will scaffold out this Miro Board: https://miro.com/app/board/o9J_lsnfnfY=/?moveToWidget=3074457365501516133&cot=14&fromRedirect=1
Slicks are showing up inland, on rivers, etc. Either the land-masking code has an issue, or (what I think is more likely) its resolution or accuracy is not high enough.