
microsoft / planetarycomputerdatacatalog

Data catalog for the Microsoft Planetary Computer

Home Page: https://planetarycomputer.microsoft.com

License: MIT License

Python 1.44% JavaScript 14.61% Dockerfile 0.15% Shell 0.92% HTML 0.66% CSS 5.23% TypeScript 77.00%
Topics: aiforearth

planetarycomputerdatacatalog's People

Contributors

agentmorris, brunosan, delgadom, dependabot[bot], gadomski, ghidalgo3, gregoriomartin, lossyrob, microsoft-github-operations[bot], microsoftopensource, mmcfarland, pholleway, pjhartzell, tomaugspurger


planetarycomputerdatacatalog's Issues

Offer to help with documentation

As @TomAugspurger and I discussed in closed issues and PRs, I would like to learn about and help better document the datasets and their usage. I am actively working on my own project and spending significant time organizing data pipelines, so I thought I would open this issue to offer help and see if we could push some documentation forward along the way.

I am particularly interested in the GOES datasets for now, and I can be verbose (good for documenting things, I guess!), so I don't mind helping write something if I am pointed at the resources and told how to go about it.

My current "best" workflows:

  1. On local disk, simply use netCDF4 alone
  2. On abfs (or gs, s3, etc.), use fsspec coupled with h5py

(Note: netCDF4 seems to outperform h5py by almost an order of magnitude in my local testing, hence the decision to use netCDF4 locally instead of unifying both workflows on approach 2 --- one could easily use fsspec with h5py for local files too, so that only a single keyword changes between workflows rather than significant portions of them.)

(Interesting caveat: netCDF4 returns the scaled data directly, whereas with h5py one has to apply the scale factor and offset manually, so it is actually more work 😅)
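For concreteness, the two workflows above can be sketched roughly as follows. This is an illustrative sketch, not code from the catalog: the function names and the storage_options plumbing are made up, and the deferred imports are just so the snippet loads without the optional dependencies.

```python
import numpy as np


def read_local(path, var):
    # Workflow 1: on local disk, netCDF4 applies scale_factor/add_offset
    # automatically when reading a variable.
    import netCDF4  # deferred so the sketch loads without a local file/library
    with netCDF4.Dataset(path) as ds:
        return ds[var][:]


def apply_cf_scaling(raw, scale_factor=1.0, add_offset=0.0):
    # With h5py the CF scaling has to be applied by hand.
    return raw * scale_factor + add_offset


def read_remote(url, var, storage_options=None):
    # Workflow 2: on abfs (or gs, s3, ...), open the file through fsspec,
    # read it with h5py, then scale manually.
    import fsspec
    import h5py
    with fsspec.open(url, **(storage_options or {})) as f:
        with h5py.File(f) as h5:
            dset = h5[var]
            return apply_cf_scaling(
                np.asarray(dset[:]),
                dset.attrs.get("scale_factor", 1.0),
                dset.attrs.get("add_offset", 0.0),
            )
```

Switching the local workflow onto fsspec + h5py would indeed only mean passing a `file://` URL to `read_remote`, which is the unification trade-off mentioned above.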

Metadata issues with the cil-gdpcir-cc0 collection

The geoparquet-items asset in the cil-gdpcir-cc0 collection looks off:

  1. uses media_type instead of type
  2. has an "extra_fields" key which should probably be removed, with its children promoted to the asset level (looks like a PySTAC serialization issue?)

Top-level:
  3. Also, cube:dimensions has lat and lon dimensions, but both have the y axis assigned, which seems wrong to me.
  4. The description contains a table, but CommonMark doesn't support tables.
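A minimal sketch of what a client-side fix-up for points 1 and 2 would look like, operating on the raw asset dict. The input below only mirrors the structure described above; the href and field names are illustrative, not the actual collection JSON.

```python
def normalize_asset(asset: dict) -> dict:
    """Return a copy of the asset dict with the two problems fixed."""
    fixed = dict(asset)
    # Point 1: STAC asset objects use "type", not "media_type".
    if "media_type" in fixed:
        fixed["type"] = fixed.pop("media_type")
    # Point 2: promote the children of "extra_fields" to the asset level.
    fixed.update(fixed.pop("extra_fields", {}))
    return fixed


# Illustrative input shaped like the broken asset described above.
broken = {
    "href": "abfs://items/cil-gdpcir-cc0.parquet",
    "media_type": "application/x-parquet",
    "extra_fields": {"table:storage_options": {"account_name": "pcstacitems"}},
}
fixed = normalize_asset(broken)
```

After the fix-up, `fixed` has a `type` key and the former `extra_fields` children sit directly on the asset.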

Invalid or corrupt Landsat file?

I'm not sure whether this repository is the right place for this issue, but here is the description:

Description

We read Landsat data for a study region in central Africa. While this works fine in most cases, we get errors when attempting to open a Landsat 8 OLI scene acquired in 2022. The error only occurs when trying to load band 4 (B4).

How to reproduce

import rasterio as rio
import planetary_computer

# URL to dataset (we load band 4)
url = 'https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2022/173/059/LC08_L2SP_173059_20220706_20220722_02_T1/LC08_L2SP_173059_20220706_20220722_02_T1_SR_B4.TIF'

url_signed = planetary_computer.sign_url(url)

ds = rio.open(url_signed)

This gives the following error message

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 310, in rasterio._base.DatasetBase.__init__
  File "rasterio/_base.pyx", line 221, in rasterio._base.open_dataset
  File "rasterio/_err.pyx", line 221, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsicurl/https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2022/173/059/LC08_L2SP_173059_20220706_20220722_02_T1/LC08_L2SP_173059_20220706_20220722_02_T1_SR_B4.TIF?st=2023-07-13T09%3A03%3A45Z&se=2023-07-14T09%3A48%3A45Z&sp=rl&sv=2021-06-08&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2023-07-14T07%3A40%3A28Z&ske=2023-07-21T07%3A40%3A28Z&sks=b&skv=2021-06-08&sig=rpO6Ia5Y8JQvnxLCUdC8Bv%2BNjzTbEEbdnItAAizP/Lg%3D' not recognized as a supported file format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/ides/Lukas/venvs/GeoPython/lib64/python3.11/site-packages/rasterio/env.py", line 451, in wrapper
    return f(*args, **kwds)
           ^^^^^^^^^^^^^^^^
  File "/mnt/ides/Lukas/venvs/GeoPython/lib64/python3.11/site-packages/rasterio/__init__.py", line 304, in open
    dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "rasterio/_base.pyx", line 312, in rasterio._base.DatasetBase.__init__
rasterio.errors.RasterioIOError: '/vsicurl/https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2022/173/059/LC08_L2SP_173059_20220706_20220722_02_T1/LC08_L2SP_173059_20220706_20220722_02_T1_SR_B4.TIF?st=2023-07-13T09%3A03%3A45Z&se=2023-07-14T09%3A48%3A45Z&sp=rl&sv=2021-06-08&sr=c&skoid=c85c15d6-d1ae-42d4-af60-e2ca0f81359b&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2023-07-14T07%3A40%3A28Z&ske=2023-07-21T07%3A40%3A28Z&sks=b&skv=2021-06-08&sig=rpO6Ia5Y8JQvnxLCUdC8Bv%2BNjzTbEEbdnItAAizP/Lg%3D' not recognized as a supported file format.

Expected behavior

When we do the same for, e.g., band 3 (url = https://landsateuwest.blob.core.windows.net/landsat-c2/level-2/standard/oli-tirs/2022/173/059/LC08_L2SP_173059_20220706_20220722_02_T1/LC08_L2SP_173059_20220706_20220722_02_T1_SR_B3.TIF), the rio.open call works without any problems:

src = rio.open(url_signed)  # now set to B3 instead of B4
src.meta

outputs

{'driver': 'GTiff', 'dtype': 'uint16', 'nodata': 0.0, 'width': 7591, 'height': 7741, 'count': 1, 'crs': CRS.from_epsg(32635), 'transform': Affine(30.0, 0.0, 716385.0,
       0.0, -30.0, 275715.0)}

as expected.

Any hint why the B4 dataset cannot be read?
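Not an answer, but one way to narrow this down is to check whether the blob is even a valid TIFF at all, independent of GDAL, by fetching its first four bytes and testing them against the TIFF magic number. A sketch, assuming a freshly signed `url_signed` as in the snippet above:

```python
import requests

# Little- and big-endian TIFF file headers.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(first_bytes: bytes) -> bool:
    """True if the byte prefix matches a TIFF header."""
    return first_bytes[:4] in TIFF_MAGIC


def head_bytes(url: str, n: int = 4) -> bytes:
    # Fetch only the first n bytes with an HTTP Range request.
    resp = requests.get(url, headers={"Range": f"bytes=0-{n - 1}"}, timeout=30)
    resp.raise_for_status()
    return resp.content


# Usage (network access, needs a valid signed URL):
# print(looks_like_tiff(head_bytes(url_signed)))
```

If this returns False for B4 but True for B3, the blob itself is corrupt or truncated rather than anything being wrong on the rasterio side.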

Layer panel unexpectedly closes on scroll

When scrolling through the layer panel, I've noticed that as soon as I reach the top or bottom, the panel closes. Because the panel is positioned over the map, the continued scroll then suddenly zooms the map in or out.

scroll-break.mov

Context

OS: macOS 12.0.1
Browser: Chrome 97.0.4692.71
Input: Trackpad

Tooltip on disabled filter inconsistently closes

The tooltip on disabled filters does not always close as expected. This can be annoying because the tooltip covers useful information and actions and feels "stuck".

Adding the medium (300 ms) delay to the tooltip would also help reduce how often it triggers by accident: https://developer.microsoft.com/en-us/fluentui#/controls/web/tooltip#TooltipDelay

Chrome

Tooltip persists when the cursor moves slowly, or when the cursor hovers over the tooltip

tooltip-doesnt-get-hint.mov

Firefox

Tooltip persists when the cursor hovers over the tooltip

tooltip-firefox.mov

Enhance header for mobile

I recommend making improvements to the header on mobile. Ideally, it should match the proposed designs down to a 320px viewport width.

Current site

Screen Shot 2022-01-06 at 10 24 16 AM

Proposed designs

mobile-default

mobile-expanded

Sentinel L2A product has incorrect bounds

For example, this is the bounding box stored in bbox for tile 10SGG at timestamp 2016-05-05T19:04:02.027000Z and id S2A_MSIL2A_20160505T190402_R113_T10SGG_20210211T085211.

(screenshot: the stored bbox)

But this is what the tile actually contains:

(screenshot: actual tile coverage)

For tile 10SGG there are several more of these cases. More examples are:

S2A_MSIL2A_20160714T184922_R113_T10SGG_20210212T043726
S2A_MSIL2A_20160624T184922_R113_T10SGG_20210211T224812
S2A_MSIL2A_20160614T190352_R113_T10SGG_20210211T193819
S2A_MSIL2A_20160604T184922_R113_T10SGG_20210211T163042
S2A_MSIL2A_20160525T190352_R113_T10SGG_20210211T141725
S2A_MSIL2A_20160515T184922_R113_T10SGG_20210211T113732
S2A_MSIL2A_20160505T190402_R113_T10SGG_20210211T085211
S2A_MSIL2A_20160415T190352_R113_T10SGG_20210211T040941
S2A_MSIL2A_20160405T190352_R113_T10SGG_20210211T015225
S2A_MSIL2A_20160326T185252_R113_T10SGG_20210528T193911
S2A_MSIL2A_20160306T190352_R113_T10SGG_20210528T112222
S2A_MSIL2A_20160215T190352_R113_T10SGG_20210528T070200
S2A_MSIL2A_20160205T190342_R113_T10SGG_20210528T030140
S2A_MSIL2A_20160126T185642_R113_T10SGG_20210527T232223
S2A_MSIL2A_20160116T190342_R113_T10SGG_20210527T184113
S2A_MSIL2A_20160106T190352_R113_T10SGG_20210527T104143
S2A_MSIL2A_20151227T185812_R113_T10SGG_20210526T202139
S2A_MSIL2A_20151217T190352_R113_T10SGG_20210526T062535

This is also not the only Sentinel tile affected.

ERA5 dataset is three years outdated?

Hi,

I am trying to access the ERA5 dataset from the Planetary Computer Data Catalog using Python.

https://planetarycomputer.microsoft.com/dataset/era5-pds

On the page, it says that the temporal extent for this dataset is 01/01/1979 – Present, but if I follow the tutorial in the example notebook and change the code to

search = catalog.search(
    collections=["era5-pds"], datetime="2021-01", query={"era5:kind": {"eq": "an"}}
)

for example, the search returns no items.

In fact, https://planetarycomputer.microsoft.com/api/stac/v1/collections/era5-pds/items makes it clear that "end_datetime":"2020-12-31T23:00:00Z".

Is this correct? This is frustrating because the service has worked very well for my research, but I need data up to the present.
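For reference, a collection's declared temporal extent can be read straight out of its STAC JSON rather than inferred from searches. A sketch; the `sample` dict below only illustrates the STAC extent structure, its values are not the live collection's:

```python
import json
from urllib.request import urlopen


def temporal_extent(collection: dict):
    # STAC collections declare extent.temporal.interval as [[start, end], ...];
    # the first interval is the overall extent.
    start, end = collection["extent"]["temporal"]["interval"][0]
    return start, end


# Illustrative structure only; fetch the real document to check.
sample = {
    "extent": {
        "temporal": {"interval": [["1979-01-01T00:00:00Z", "2020-12-31T23:00:00Z"]]}
    }
}

if __name__ == "__main__":
    url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/era5-pds"
    with urlopen(url) as r:
        print(temporal_extent(json.load(r)))
```

If the collection-level extent also ends in 2020, the "Present" on the dataset page is simply out of date.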

Thanks for the service - it accesses the data faster than I expected.

Best,

Access datasets through `fsspec` and `adlfs`

Hi, thanks for providing these datasets. I would like to access the goes-cmi dataset through fsspec and adlfs (i.e. pip install fsspec adlfs), but I cannot seem to figure it out.

It seems the account_name associated with these datasets is pcstacitems? That doesn't seem to be documented. In any case, I cannot get straightforward anonymous access to any dataset. For example, this code

import fsspec
fs = fsspec.filesystem("abfs", account_name="pcstacitems")
fs.ls("/")  # or fs.ls("abfs://items/"), and so on

errors with

ErrorCode:NoAuthenticationInformation
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>NoAuthenticationInformation</Code><Message>Server failed to authenticate the request. Please refer to the information in the www-authenticate header.

Or sometimes with things like "not found". Is there an easy way to access these datasets through fsspec? I also tried generating a SAS token, but to no avail.

I also tried to work out the details from this item, but to no avail: https://planetarycomputer.microsoft.com/api/stac/v1/collections/goes-cmi/items/OR_ABI-L2-F-M6_G17_s20222200300319

For example, this also didn't work.

import fsspec
fs = fsspec.filesystem("abfs", account_name="goeseuwest")
fs.ls("noaa-goes-cogs/goes-17")

It would be nice if these datasets were exposed more transparently. I understand the desire to make them easy to use via Jupyter notebooks (like the examples), but I found that approach very hard to carry over into a proper application/deployment beyond the simple examples. For comparison, the GOES datasets can be used on AWS and GCP without any issue:

import fsspec
gcp = fsspec.filesystem("gs", token="anon")
gcp.ls("gcp-public-data-goes-17/")  # prints ['gcp-public-data-goes-17/ABI-L1b-RadC'...
aws = fsspec.filesystem("s3", anon=True)
aws.ls("noaa-goes17/")  # prints ['noaa-goes17/ABI-L1b-RadC', 'no...

Now trying:

az = fsspec.filesystem("abfs", token="anon", anon=True)

results in this error:

ValueError: Must provide either a connection_string or account_name with credentials!!

Thank you!
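For what it's worth, a hedged sketch of token-based access that may be what's missing here: the planetary-computer package exposes a SAS endpoint (planetary_computer.sas.get_token), and the resulting token can be passed to adlfs as a credential. The helper below and the account/container split are illustrative, and the Planetary Computer SAS service only covers the accounts it manages:

```python
from urllib.parse import urlparse


def account_and_container(blob_url: str):
    # abfs wants the storage account and container split out of a blob URL,
    # e.g. https://goeseuwest.blob.core.windows.net/noaa-goes-cogs/goes-17 .
    parsed = urlparse(blob_url)
    account = parsed.netloc.split(".")[0]
    container = parsed.path.lstrip("/").split("/")[0]
    return account, container


if __name__ == "__main__":
    # Network access; requires the fsspec, adlfs and planetary-computer packages.
    import fsspec
    import planetary_computer

    account, container = account_and_container(
        "https://goeseuwest.blob.core.windows.net/noaa-goes-cogs/goes-17"
    )
    token = planetary_computer.sas.get_token(account, container).token
    fs = fsspec.filesystem("abfs", account_name=account, credential=token)
    print(fs.ls(container)[:5])
```

This keeps the fsspec-style workflow while still going through the catalog's signing service instead of anonymous credentials.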

Sentinel 2A: datastrip in item.id does not match datastrip in item.properties.s2:datastrip_id

I was just debugging why I had duplicate tiles in a certain pipeline. It turned out this was related to the datastrip_id property, which depends on the downlink station - see the S2 specification. The data seems to be the same for both items, although I only checked this for one band. Why do you keep duplicate data when multiple downlink stations are used?

Anyways... I then noticed that the datastrip id included in item.id does not match the one provided in item.properties.s2:datastrip_id. Not sure if this is important, but I thought it was worth mentioning. Please see the example below.

from copy import deepcopy

import pandas as pd
import planetary_computer
import pystac_client

def items_to_dataframe(items):
    _items = []
    for i in items:
        _i = deepcopy(i)
        _items.append(_i)
    df = pd.DataFrame(pd.json_normalize(_items))
    for field in ["properties.datetime"]:
        if field in df:
            df[field] = pd.to_datetime(df[field])
    df = df.sort_values("properties.datetime")
    return df


catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

roi = {
    "type": "Polygon",
    "coordinates": [
        [
            [146.0678527, -15.3746464],
            [147.0909455, -15.3765786],
            [147.0913918, -16.369226],
            [146.0632786, -16.3671625],
            [146.0678527, -15.3746464],
        ]
    ],
}


search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects=roi,
    datetime="2022-01-01/2022-11-01",
)

items = search.item_collection()

items_ = [i.to_dict() for i in items]
df = items_to_dataframe(items_)


def split_id(x):
    return pd.Series(x.id.split("_"))


df[
    [
        "mission_id",
        "product_level",
        "datetake_start_time",
        "relative_orbit_number",
        "tilenumber",
        "id_datastrip",
    ]
] = df.apply(split_id, axis=1)

# two examples for which I found the same data, but different datastrips
SAME_DATA_DIFFERENT_DATASTRIP = [
    "S2A_MSIL2A_20220128T002711_R016_T55LDC_20220227T190716",
    "S2A_MSIL2A_20220128T002711_R016_T55LDC_20220212T221526",
]

df_ = df.loc[df["id"].isin(SAME_DATA_DIFFERENT_DATASTRIP)].copy()

# makes it a bit easier to see the difference
def split_s2_datastrip(x):
    return x["properties.s2:datastrip_id"].split("_")[6]


df_["s2_datastrip"] = df_.apply(split_s2_datastrip, axis=1)
df_[["id_datastrip", "s2_datastrip"]]
  id_datastrip     s2_datastrip
  20220227T190716  20220227T190717
  20220212T221526  20220212T221527

Copy editing

  • Refer to "private preview" on account request page
  • "Data Catalog" title casing throughout
  • Extraneous "j" after "Azure Storage"
  • "contact us ." contains extra space on DC page

Swagger UI React warnings

After adding Swagger UI React, the app generates two types of warnings.

On build:

No parser and no filepath given, using 'babel' the parser now but this will throw an error in the future. Please specify a parser or a filepath so one can be inferred.

At runtime:

react_devtools_backend.js:2430 Warning: componentWillReceiveProps has been renamed, and is not recommended for use. See https://reactjs.org/link/unsafe-component-lifecycles for details.

This is at least partly tracked in swagger-api/swagger-ui#5729, though it hasn't gotten much attention.
