i-guide / catalog Goto Github PK

The I-GUIDE Catalog is part of the I-GUIDE Platform and provides search, discovery, and dynamic interaction with resources created by or used by I-GUIDE researchers.

Home Page: https://i-guide.io/platform/

License: BSD 3-Clause "New" or "Revised" License

Python 49.69% Dockerfile 0.26% Makefile 0.15% JavaScript 0.03% Shell 0.02% HTML 0.13% Vue 36.39% SCSS 0.34% CSS 0.88% TypeScript 12.11%

catalog discovery interactive-content search i-guide actionable-data


catalog's Issues

Create SchemaOrg Example: Web Application

Explore how SchemaOrg properties can be used to describe a Web Application. The outcomes of this task will be:

A document containing:

A JSON-LD example containing required metadata
A JSON-LD example containing recommended metadata
A JSON-LD example for a HydroShare Web Application

Release the new user interface

Collaborators at USU will work on this. Updates include the new interface of the data catalog, mainly the landing page of the submitted items.

#JIRA=CAM-54

Update Pydantic models

We need to update the pydantic models to match the recent changes to the data catalog schema specification requirements.

Spatial Coverage map not showing in Search results

Add additionalType attribute to core pydantic schema

The additionalType (https://schema.org/additionalType) will be used to store aggregation type (Multidimensional, Raster, etc.)

Implement isPartOf

The metadata extractor needs to be updated to include isPartOf for each associatedMedia entry that is part of an aggregation within the resource. https://github.com/CUAHSI/metadata-extractor/blob/agu_demo/hsextract/utils.py#L88

#JIRA=CAM-54

Need a way to add properties to primitive array items in pydantic

We have subschemas consisting of arrays of primitive items for which there is no way to tap into the primitive items and add properties to the resulting schema.

For example, take the field below:

identifier: Optional[List[str]] = Field(
        title="Identifiers",
        description="Any kind of identifier for the resource. Identifiers may be DOIs or unique strings "
                    "assigned by a repository. Multiple identifiers can be entered. Where identifiers can be "
                    "encoded as URLs, enter URLs here."
    )

Generated schema:

"identifier": {
      "title": "Identifiers",
      "description": "Any kind of identifier for the resource. Identifiers may be DOIs or unique strings assigned by a repository. Multiple identifiers can be entered. Where identifiers can be encoded as URLs, enter URLs here.",
      "type": "array",
      "items": {
        "type": "string"
      }
    },

We would like to add a title to the schema generated for the str type so that the resulting schema looks like this:

"identifier": {
      "title": "Identifiers",
      "description": "Any kind of identifier for the resource. Identifiers may be DOIs or unique strings assigned by a repository. Multiple identifiers can be entered. Where identifiers can be encoded as URLs, enter URLs here.",
      "type": "array",
      "items": {
        "title": "Identifier",  // <----------
        "type": "string"
      }
    },

This will allow the renderers to show a title for each item in array fields like this one.
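Until the models can express this directly, one possible workaround is to post-process the generated schema. The sketch below is plain Python operating on the schema dictionary. The helper name `add_item_title` is hypothetical, and wiring it into pydantic (for example through a `schema_extra` hook) is an assumption that has not been tested against this code base.

```python
from typing import Any, Dict


def add_item_title(schema: Dict[str, Any], field: str, title: str) -> Dict[str, Any]:
    """Set a title on the items subschema of an array property, in place."""
    prop = schema.get("properties", {}).get(field)
    if prop is None or prop.get("type") != "array":
        raise KeyError(f"{field!r} is not an array property in this schema")
    # setdefault keeps any title that is already present on the items subschema
    prop.setdefault("items", {}).setdefault("title", title)
    return schema


# Trimmed-down version of the generated schema shown above
schema = {
    "properties": {
        "identifier": {
            "title": "Identifiers",
            "type": "array",
            "items": {"type": "string"},
        }
    }
}
add_item_title(schema, "identifier", "Identifier")
```

The function only touches the items subschema, so the array-level title and description stay as generated.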

Create dedicated HTML page for Dataset landing page

The renderers 'view' mode does not provide an elegant enough way to display the information in the Dataset landing page. Now that the Dataset JSON schema has settled we should create a dedicated page with HTML and CSS that elegantly displays the information.

Add Collections example.

Add an example usage for collections of datasets.

Temporal Coverage end date validation fails

Error relates to overlap detection.

Event for metadata extraction

The simplest would be to just fire it after a workflow runs and after files finish uploading in the browser.

#JIRA=CAM-54

Login notification is wrong

When logging in, I can log in successfully, but the bottom of the page shows a red box saying "Not logged in".

Clicking on a featured data only takes us to the data repository landing page

We need to have an option that shows the metadata landing page as well.

The error related to incorrect spatial coverage information is not shown explicitly on the page

When entering values for a bounding box that are not within the appropriate ranges (-90 to 90 for latitude and -180 to 180 for longitude), the 'Save' button does not function. Currently, the only message displayed indicates that the submission has failed. To enhance user experience, it would be beneficial to direct the user to the spatial coverage field or provide a specific error message within the same pop-up box where the failure message appears.
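A specific error message could be produced by a check along these lines. This is a minimal sketch, not the catalog's validation code: the edge names (`north`, `south`, `east`, `west`) and the message wording are assumptions made for illustration.

```python
from typing import Dict, List

# Maximum absolute value allowed for each edge of the box
LIMITS = {"north": 90.0, "south": 90.0, "east": 180.0, "west": 180.0}


def bounding_box_errors(box: Dict[str, float]) -> List[str]:
    """Return one message per out-of-range edge; an empty list means the box is valid."""
    errors = []
    for edge, limit in LIMITS.items():
        value = box[edge]
        if not -limit <= value <= limit:
            kind = "latitude" if limit == 90.0 else "longitude"
            errors.append(
                f"Spatial coverage: {edge} {kind} {value} must be between {-limit:g} and {limit:g}."
            )
    return errors


# Two edges are out of range here, so two messages are returned
errors = bounding_box_errors({"north": 95, "south": 10, "east": 200, "west": -20})
```

Returning a list of messages rather than raising on the first problem lets the pop-up report every invalid edge at once.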

Pin Pydantic to version 1

Pydantic v2 has breaking changes. We need to pin it to v1 to keep the current code base working. The upgrade to v2 will be done separately (Issue #38).

List of specific types of CreativeWork for the I-GUIDE Data Catalog

We need to identify a list of specific types of CreativeWork and then create a schema example for each. This will help us complete the "What is a record" and "What is NOT a record" sections in the README.md file.

CreativeWork > TextDigitalDocument
CreativeWork > DigitalDocument
CreativeWork > MediaObject > ImageObject | DataDownload | VideoObject
CreativeWork > DataCatalog
CreativeWork > Dataset
CreativeWork > Course
CreativeWork > SoftwareSourceCode
CreativeWork > SoftwareApplication

We also need to identify a list of specific types of Thing.

Thing > Person
Thing > Organization
Thing > Place
Thing > Intangible > Grant
Thing > Intangible > Language
Thing > StructuredValue > PropertyValue

Clarify Catalog/Schema Terminology

Make sure that the terms we're using are consistent throughout our documentation.

#JIRA=CAM-47

Export user and resource access control to mongo collections

A branch exists in HydroShare that exports user/resource access control to a mongo collection and is deployed to beta.hydroshare.org. Set up listeners on the mongo collection change stream to:

Add/remove discoverable resources from discovery
Map user access control with resources that externally reference S3 resources to console.minio.cuahsi.io

That mongo database can be found on atlas at CUAHSI->CZNET->Cluster0->hydroshare_beta. The two collections to listen to are resourceaccess and userprivileges

Add/remove discoverable resources from discovery
Documents in the resourceaccess collection look like:

{
    "resource_id": "8bb057d9653c4abba8bb2e48fe3642ce",
    "is_public": true,
    "show_in_discover": true,
    "minio_resource_url": "some url"
}

Only listen for documents that have a non-null "minio_resource_url" value. Add or remove documents from discovery based on show_in_discover.
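The rule above can be sketched as a small decision function. This is an illustration only: the function name `discovery_action` and the "add"/"remove" return values are hypothetical, and the change stream wiring is left out.

```python
from typing import Any, Dict, Optional


def discovery_action(doc: Dict[str, Any]) -> Optional[str]:
    """Decide what to do with a resourceaccess document.

    Returns "add", "remove", or None when the document should be ignored.
    """
    # Assumption: a missing or empty minio_resource_url is treated like null
    if not doc.get("minio_resource_url"):
        return None
    return "add" if doc.get("show_in_discover") else "remove"


example = {
    "resource_id": "8bb057d9653c4abba8bb2e48fe3642ce",
    "is_public": True,
    "show_in_discover": True,
    "minio_resource_url": "some url",
}
action = discovery_action(example)
```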

Map user access control with resources that externally reference S3 resources to console.minio.cuahsi.io
Documents in the userprivileges collection look like:

{
    "username": "sblack",
    "all": {},
    "minio": {
        "owner": [
            {
                "owners": [
                    "sblack"
                ],
                "resource_id": "8bb057d9653c4abba8bb2e48fe3642ce",
                "minio_resource_url": "https://console.minio.cuahsi.io/browser/sblack/YXJnb193b3JrZmxvd3MvcGFyZmxvdy9kYzRlYWZkNi0yNTM0LTQwMjEtODNiZS1iZjM2YWNhNDhhMjIv"
            }
        ],
        "edit": [
            {
                "owners": [
                    "sblack-admin"
                ],
                "resource_id": "b9ac783296cc4a93b8996247e120aa61",
                "minio_resource_url": "https://console.minio.cuahsi.io/browser/sblack-admin/editable"
            }
        ],
        "view": [
            {
                "owners": [
                    "sblack-admin"
                ],
                "resource_id": "b4c9b612f157452dbb6826aabeb15b0e",
                "minio_resource_url": "https://console.minio.cuahsi.io/browser/sblack-admin/viewable"
            }
        ]
    }
}

The username is the user that the access control applies to. The all property is a complete dump of all hydroshare resource privileges for the user; ignore it. The minio property contains user privileges for resources that have an additional metadata key of minio_resource_url, whose value is copied to the mongo document. There are three lists: view, edit, and owner. Each item in those lists has an owners property, and the first owner maps to the bucket name. resource_id is the hydroshare resource id. minio_resource_url is the value in additional_metadata that points to a path on minio.
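To make the document layout concrete, the sketch below flattens a userprivileges document into one record per privilege. The `MinioPrivilege` type and the `minio_privileges` function are hypothetical names introduced for illustration; the sample document is a shortened version of the one above.

```python
from typing import Any, Dict, Iterator, NamedTuple


class MinioPrivilege(NamedTuple):
    username: str
    level: str  # "owner", "edit", or "view"
    bucket: str  # first entry of the item's owners list
    resource_id: str
    minio_resource_url: str


def minio_privileges(doc: Dict[str, Any]) -> Iterator[MinioPrivilege]:
    """Yield one record per entry in the minio property; the all property is ignored."""
    for level in ("owner", "edit", "view"):
        for item in doc.get("minio", {}).get(level, []):
            yield MinioPrivilege(
                username=doc["username"],
                level=level,
                bucket=item["owners"][0],
                resource_id=item["resource_id"],
                minio_resource_url=item["minio_resource_url"],
            )


sample = {
    "username": "sblack",
    "all": {},
    "minio": {
        "edit": [
            {
                "owners": ["sblack-admin"],
                "resource_id": "b9ac783296cc4a93b8996247e120aa61",
                "minio_resource_url": "https://console.minio.cuahsi.io/browser/sblack-admin/editable",
            }
        ],
    },
}
privileges = list(minio_privileges(sample))
```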

#JIRA=CAM-54

Add Contributor and Role

Navigating from Edit Dataset page to Contribute page does not reset component

We reuse the CdContribute component in both of these pages, but it does not reset when navigating as described. We should create a dedicated component for viewing datasets and stop reusing this one.

Catalog Hydroshare resource

Using hydroshare resource identifier/url as an input, user should be able to catalog hydroshare resource metadata. This functionality is similar to hydroshare resource registration in DSP.

Only public hydroshare resource can be cataloged.

Modification to software source code schema

Below are comments from an outstanding PR on software source code. Moving these to an issue so we can merge the PR and consolidate repositories.

Consider changing to something like the following:

To classify a record as a computer programming source code, "@type": "SoftwareSourceCode" should be used in the json schema. This will classify the record such as compile ready solutions, code snippet samples, scripts, etc. as a specific Schema.Org type called SoftwareSourceCode for which the metadata should be described using the core metadata, as well as the software-source-code-specific properties for the Schema:SoftwareSourceCode class. The following table outlines the required and optional properties selected from Schema.Org vocabulary to design the I-GUIDE software source code metadata schema. These properties are encoded as 1 or 1+ for required and 0,1 or 0+ for optional in the Cardinality column of the table below.

To classify a record as computer programming source code, use "@type": "SoftwareSourceCode" in the json schema. This is appropriate for records including code snippet samples, scripts, notebooks, etc. The following table outlines the required and optional properties to sufficiently describe software source code objects. Required properties have a cardinality of 1 or 1+ and optional properties have a cardinality of 0,1 or 0+.

I'm not sure that we want to embed source code within the schema. I wonder if this could have security implications. See:

| [text](https://schema.org/text) | CreativeWork | Text | 0,1 | The textual content of the source code. |

Upgrade to pydantic v2

pydantic v2 provides performance improvements.

Register Public S3 datasets

Assuming the extracted metadata files are present and valid, pick up the root metadata file and register it in the catalog.

Sync the Discovery database

Listen to changes in the resourceaccess collection and sync any resource changes to the discovery collection. When a minio resource is made discoverable, retrieve the extracted metadata from S3 and place it in the discovery collection. When a minio resource is made NOT discoverable, delete the entry from the discovery collection.
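The sync logic described above might look roughly like the following. This is a sketch under stated assumptions: a plain dictionary stands in for the discovery collection, and `fetch_metadata` is a hypothetical callable that would retrieve the extracted metadata from S3.

```python
from typing import Any, Callable, Dict


def sync_discovery(
    access_doc: Dict[str, Any],
    discovery: Dict[str, Dict[str, Any]],
    fetch_metadata: Callable[[str], Dict[str, Any]],
) -> None:
    """Apply one resourceaccess change to the discovery store (keyed by resource_id)."""
    url = access_doc.get("minio_resource_url")
    if url is None:
        return  # not a minio resource, nothing to sync
    resource_id = access_doc["resource_id"]
    if access_doc.get("show_in_discover"):
        discovery[resource_id] = fetch_metadata(url)
    else:
        # pop with a default so removing an absent entry is not an error
        discovery.pop(resource_id, None)


discovery: Dict[str, Dict[str, Any]] = {}
fake_fetch = lambda url: {"url": url, "name": "extracted metadata"}  # stand-in for S3
doc = {"resource_id": "abc", "show_in_discover": True, "minio_resource_url": "some url"}
sync_discovery(doc, discovery, fake_fetch)
```

Passing the fetch function in as an argument keeps the sync rule testable without an S3 connection.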

#JIRA=CAM-54

Rollback changes implemented in PR #79

#79

Registering data from S3 into catalog and preview the metadata

Collaborators at CUAHSI will work on this to extend the record registration to the data catalog.

#JIRA=CAM-54

Scheduler to keep the catalog up-to-date with registered hydroshare resources

A scheduler to fetch metadata for all registered hydroshare resources from hydroshare repository on a regular interval (once a day) and update the catalog as needed.

Create a Jupyter Notebook to show how to perform an action on data registered to the data catalog

Collaborators at CUAHSI will work on this. To accomplish this, the following sub-tasks are required:

The Catalog API only uses ORCID right now. An upgrade to CUAHSI SSO is needed to store the results of an action (e.g., model domain subsetter) on S3.
Finalize the software source code (model Program and Instance) Schema.

#JIRA=CAM-54

Add support for file level metadata

Enhance associatedMedia schema model to support file level metadata as specified in the following two documents:
https://github.com/I-GUIDE/data-catalog/blob/main/schema/dataset-filetype-geotiff.md
https://github.com/I-GUIDE/data-catalog/blob/main/schema/dataset-filetype-shapefile.md

Add "Open With" button to resource page

Open with functionality

When rendering the landing page, check to see if the resource is a HydroShare resource
If yes, add a “Launch on I-GUIDE Platform” button at the top right of the form
Construct a URL for the button from the resource metadata - this URL will launch the whole resource into JupyterHub
When the user clicks this button, it will launch the URL
The user will be taken to the I-GUIDE Platform JupyterHub instance, which will download the resource using NBFetch
The user can then select a notebook from the folder to run

Create Documentation for Adding Data to Catalog

https://docs.google.com/document/d/1PypBY7sDBYP4jvofM8sckugxrBnBZyIdPPGBsm0hAj4/edit

Make encodingFormat optional - MediaObject model

Develop initial OpenWith requirements document

Add new attributes to MediaObject pydantic model

The MediaObject schema, as part of associatedMedia, needs to include the following two additional attributes:

sha256
isPartOf

Using isPartOf we will be able to associate a content file to its metadata file.
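As an illustration of the two attributes, the sketch below builds a MediaObject entry with a SHA-256 checksum computed by the standard library. The function name `media_object` and the shape of the isPartOf value (a nested object naming the metadata file) are assumptions, not the agreed schema.

```python
import hashlib
from typing import Any, Dict


def media_object(name: str, content: bytes, metadata_file: str) -> Dict[str, Any]:
    """Build a MediaObject dictionary for one content file."""
    return {
        "@type": "MediaObject",
        "name": name,
        # Hex digest of the file content, 64 characters long
        "sha256": hashlib.sha256(content).hexdigest(),
        # Assumed shape: points at the metadata file describing this content file
        "isPartOf": {"@type": "CreativeWork", "name": metadata_file},
    }


entry = media_object("logan.tif", b"example bytes", "logan.tif.json")
```

For large files the digest would normally be computed in chunks from a file handle rather than from bytes held in memory.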

What units should contentSize be expressed as?

According to SchemaOrg contentSize is:

"File size in (mega/kilo) bytes."

This will be strange for small or very large files:

256 TB ~= 2.56e+8 MB

256 B ~= 0.256 KB

Moreover, if we round to the nearest integer this becomes even worse: 256 B = 0 KB or 1 KB

One solution is to recommend that all contentSize values simply contain units, e.g. Bytes (B), Kilobytes (KB), Megabytes (MB), Gigabytes (GB), Terabytes (TB).
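A formatter for such values could look like this. It is a minimal sketch that assumes decimal units (1 KB = 1000 B), consistent with the conversions above; the function name `format_content_size` is hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB"]


def format_content_size(num_bytes: int) -> str:
    """Express a byte count with the largest unit that keeps the number below 1000."""
    size = float(num_bytes)
    unit = UNITS[0]
    for unit in UNITS:
        # Stop at the first unit that fits, or at TB if nothing larger is available
        if size < 1000 or unit == UNITS[-1]:
            break
        size /= 1000
    return f"{size:g} {unit}"


small = format_content_size(256)
large = format_content_size(256 * 10**12)
```

With this approach both extremes from the example stay readable, as "256 B" and "256 TB".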

Metadata extractor changes needed

Needs to set the name property of the MediaObject. The name should be set to the file name. The 'hasPart' object name property needs to be set to the name of the metadata file?

Create a working S3 object that uses our schema

We lack a real-world example of how our schema can be used for data in object stores. Create an example and document it in the repository.

#JIRA=CAM-75

Store temporal coverage as date type in discovery

Temporal coverage in catalog is of type string. In order to search records based on temporal coverage date range, the string type date values need to be converted to date type and stored in discovery collection.
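The conversion could be sketched as follows. This assumes the temporal coverage string is an ISO 8601 interval of the form "start/end" with full dates or date-times; the actual string format used by the catalog may differ, and `parse_temporal_coverage` is a hypothetical name.

```python
from datetime import datetime
from typing import Tuple


def parse_temporal_coverage(value: str) -> Tuple[datetime, datetime]:
    """Split a 'start/end' string and convert both parts to datetime objects."""
    start_text, separator, end_text = value.partition("/")
    if not separator:
        raise ValueError(f"expected 'start/end', got {value!r}")
    start = datetime.fromisoformat(start_text)
    end = datetime.fromisoformat(end_text)
    if end < start:
        raise ValueError("end date is before start date")
    return start, end


start, end = parse_temporal_coverage("2007-03-01T13:00:00/2008-05-11T15:30:00")
```

Stored as date values, the two ends can then be compared directly in a date range query.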

Turn off fetching of hydroshare resource files metadata

As part of the registration of a hydroshare resource, we currently fetch metadata for each of the files in that resource. If a resource has thousands of files, fetching metadata for all of them can cause a timeout error. Displaying metadata for a large number of files probably also needs some changes to the UI. Because of these issues, for now we need to turn off fetching of file metadata as part of hydroshare resource registration.

PropertyValue should have a property called "unitText"

The unitText property is an optional feature when using "PropertyValue" type. For example, if the PropertyValue is used to express a measured variable, the unitText should contain the unit of the measured variable. See the example below:

{
"variableMeasured": {
"@type": "PropertyValue",
"name": "Streambed interface temperature values",
"unitText": "degC"
}
}

Clicking on the "contribute data" button takes users to the CZ Hub web page.

To register a record to the data catalog, a user has two options:

Click on the "contribute data" button in the middle of the home page. This will redirect the user to the CZ Hub website. Please use the correct link.
Click on the "contribute" button found on the top right toolbar. This will take the user directly to the I-GUIDE data catalog submission page.

Create and deploy development test environment

Remove Citation Property from core metadata

We may want to use subjectOf wherever we want to cite another creative work.

API for resource creation and metadata update

Add a POST endpoint that creates a hydroshare resource with the additional metadata key minio_resource_url and a value that points to an S3 path. Also add a PUT endpoint for updating metadata that updates the metadata files on S3.

#JIRA=CAM-54

MediaObject schema changes needed

The following attributes shouldn't be part of the MediaObject. Rather these attributes should be part of the root CreativeWork that represents a dataset as described here for raster dataset: https://github.com/I-GUIDE/data-catalog/blob/main/schema/dataset-filetype-geotiff.md.

additionalProperty
variableMeasured
sourceOrganization

Create Example for Schema with Graph Diagram

Create an example of the schema that illustrates the relationships between schema.org properties. Create a graphical visualization of this example using json-ld.org/playground.

Map user/resource privileges to S3 JSON policies and save to CUAHSI MinIO

The CUAHSI Subsetter application has a router that maps the resourceaccess documents to json policies along with the ability to save the policies on a S3 server, here. Copy this router to the catalogapi and wire it up to events that get the JSON policies saved to MinIO for the user. Below is a proposal to use Mongo changestream but it could instead be accomplished with an alternate solution.

Listen to the Mongo change stream for the userprivileges collection, map each document that has entries in the minio property to S3 JSON policies, and save them to console.minio.cuahsi.io. The catalog already uses change streams; an example of usage can be found at https://github.com/I-GUIDE/catalogapi/blob/develop/triggers/update_catalog.py#L32

The Minio client needs to be installed and configured with the image. Installation of the client is found here https://github.com/CUAHSI/domain-subsetter/blob/subsetter_argo/app/api/Dockerfile#L10

A minio client alias needs to be set up for the cuahsi server. This is done on the fastapi startup event, https://github.com/CUAHSI/domain-subsetter/blob/subsetter_argo/app/api/subsetter/main.py#L103
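The mapping from privileges to S3 JSON policies might be sketched as below. This is not the Subsetter router's code: the `ACTIONS` table (which S3 actions each privilege level grants) and the `build_policy` function are assumptions for illustration, and the bucket and prefix are taken as already extracted from the document.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed mapping from privilege level to S3 actions
ACTIONS = {
    "view": ["s3:GetObject"],
    "edit": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "owner": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
}


def build_policy(grants: List[Tuple[str, str, str]]) -> Dict[str, Any]:
    """Build one policy document from (level, bucket, prefix) grants."""
    statements = []
    for level, bucket, prefix in grants:
        statements.append(
            {
                "Effect": "Allow",
                "Action": ACTIONS[level],
                # Grants access to every object under the prefix
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_policy(
    [("edit", "sblack-admin", "editable/"), ("view", "sblack-admin", "viewable/")]
)
policy_json = json.dumps(policy, indent=2)
```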

#JIRA=CAM-54

Resource Landing Page

Using the i-guide resource landing page, with the CUAHSI MinIO server, update it to:

read/write metadata files
read metadata extracted files (includes aggregation metadata)
Upload files
Download files
Create resource (create a resource in hydroshare with the minio_resource_url key)
Bonus: a file viewer

The subsetter application has a router for creating presigned urls for GET/PUT. Copy this router to the catalog and use the endpoints to generate the urls. Use the urls to GET or PUT files directly from the browser to the CUAHSI S3 server.

https://github.com/CUAHSI/domain-subsetter/blob/subsetter_argo/app/api/subsetter/app/routers/storage/router.py#L10

@Maurier - I'm happy to help you break this up into smaller chunks as you prepare issues around the resource landing page.

#JIRA=CAM-54

Setup authentication refresh tokens

The site will log users out as soon as the authentication token expires, even if they are interacting with the site. We should set up authlib refresh token endpoints.
