pepkit / pephub

A web API and database for biological sample metadata

Home Page: https://pephub.databio.org

License: BSD 2-Clause "Simplified" License

Dockerfile 0.12% Python 23.96% Shell 0.22% CSS 2.45% HTML 4.89% JavaScript 0.92% TypeScript 65.47% MDX 1.96%
bioinformatics data-sharing metadata-management sample-metadata

pephub's Introduction

Pepkit

PEP compatible

Install

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/master.zip

Install dev version

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/dev.zip

pephub's People

Contributors

aliparslan, ayobi, khoroshevskyi, nleroy917, nsheff, rafalstepien, sanghoonio


Forkers

shbrief nleroy917

pephub's Issues

Bug with 307 temporary redirect in FastAPI leading to custom validator not working

The custom validator gives an error for any input:

TypeError: NetworkError when attempting to fetch resource.

The original validator still works. This only happens in the deployed instance, not when running locally. After spending several hours troubleshooting, I realized:

  1. The Network tab in the browser console mentioned a redirect, and "Mixed content" being returned.
  2. In the server logs, I saw "POST /eido/validate/pep HTTP/1.1" 307 Temporary Redirect whenever I tried to use the new validator.

This led me to some searching about FastAPI and 307 redirects, and I realized that a trailing-slash mismatch triggers this. Sure enough:

Trailing slash:

@router.post("/validate/pep/")

NO trailing slash:
@router.post("/validate")

It may have something to do with the redirect not being HTTPS-aware? Anyway, I think the fix is just to remove that slash in the endpoint definition.
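
For reference, a minimal sketch of two ways to avoid the redirect, assuming a standard FastAPI setup (the handler body is a placeholder):

from fastapi import APIRouter

# Option 1: define the route exactly as clients call it (no trailing slash).
router = APIRouter()

@router.post("/eido/validate/pep")
async def validate_pep():
    return {"status": "ok"}  # placeholder handler

# Option 2: disable the automatic slash redirect, so a mismatched trailing
# slash returns 404 instead of a scheme-unaware 307.
strict_router = APIRouter(redirect_slashes=False)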

Standardize return package schema/format

@nsheff As I work on these endpoints, I am just returning raw data with nothing else. For example this is the API response for /pep/demo/subsample1/config:

pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/example_results

It's the config file as YAML, and nothing else. Should we standardize the return data packages? Something like...

{
  "status": "ok",
  "message": "success",
  "data": {
    "pep_version": "2.0.0",
    "sample_table": "sample_table.csv",
    "subsample_table": "subsample_table.csv",
    "looper": {
      "output_dir": "$HOME/example_results"
    }
  }
}

Does FastAPI have a way to standardize responses, sort of like how it standardizes request handling (with dependencies, route prefixes, etc.)?
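
FastAPI doesn't enforce a response envelope on its own, but response_model can get most of the way there. A minimal sketch using Pydantic v1 generics, with the field names taken from the proposal above:

from typing import Generic, Optional, TypeVar

from pydantic.generics import GenericModel

T = TypeVar("T")

class Envelope(GenericModel, Generic[T]):
    status: str = "ok"
    message: str = "success"
    data: Optional[T] = None

# Applied to a route via response_model:
#
# @router.get("/pep/{namespace}/{project}/config", response_model=Envelope[dict])
# async def get_config(namespace: str, project: str):
#     return Envelope(data=config_as_dict)  # config_as_dict is hypothetical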

Pinned version requirements

Is there a reason you're pinning the version of every dependency here? https://github.com/pepkit/pephub/blob/master/requirements/requirements-all.txt

I am not a fan of pinning versions in requirements files -- I generally prefer minimum bounds: they stay flexible when dependencies get upgraded, and pip doesn't yell at me.

Also, this dependency list is excessive; many of these are already covered as sub-dependencies. See, for example, how we usually do it:

https://github.com/refgenie/refgenie/blob/master/requirements/requirements-all.txt
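
For comparison, a sketch of what minimum bounds could look like here; the package names appear elsewhere in this repo's stack, and the version floors are purely illustrative:

# version floors are illustrative, not tested
fastapi>=0.70
peppy>=0.31
uvicorn>=0.15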

Handling large lists

If there are 10k entries in, say, the geofetch namespace, these pages take a long time to load because there are so many entries.

We need some way to avoid retrieving all of them at once.
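
One option is plain limit/offset pagination at the API layer. A hedged sketch; the in-memory store and the parameter names are assumptions, just to show the shape:

from fastapi import APIRouter, Query

router = APIRouter()

# Stand-in for the real project store, just for illustration.
_FAKE_STORE = {"geofetch": [f"project_{i}" for i in range(10_000)]}

@router.get("/pep/{namespace}/projects")
async def list_projects(
    namespace: str,
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
):
    projects = _FAKE_STORE.get(namespace, [])
    return {
        "count": len(projects),
        "limit": limit,
        "offset": offset,
        "items": projects[offset : offset + limit],
    }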

Is this the best list output?

Right now the namespace output is an object whose keys correspond to projects in that namespace, and whose values are, I guess, the filepath on the local server to the project config file?

[screenshot]

I see how the keys are useful, because they are the identifiers that can be used to get further information about the project. But what's the point of those values?

I think this endpoint should be serving an overview about the projects. What more useful information could be served here? Maybe number of samples in the project?

Or maybe it should just be a list of the keys. The local paths only confuse things.
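
For concreteness, one possible shape for that overview (the field names and counts here are illustrative, not a settled schema):

{
  "namespace": "demo",
  "projects": [
    { "name": "subsample1", "number_of_samples": 4 },
    { "name": "basic", "number_of_samples": 2 }
  ]
}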

Adding example request data to path parameters defined through dependency injection

Overview

As heavily discussed in #14, I can't seem to add example request data to /docs when the path parameters are defined through dependency injection. I'm pulling this out into its own issue to track progress instead of leaving the PR in purgatory.

Issue

I am having trouble declaring request data examples for specific endpoint/route parameters when those parameters are defined through dependency injection. In the example below, I need to verify that the namespace actually exists prior to returning data about it. To do that, I followed this example in the FastAPI docs. It works great.

However, in /docs, I'd love to give examples for namespaces. To try and do that, I followed this example in the FastAPI docs.

Visiting /docs, however, provides the user with no example request data:
[screenshot]

If I comment out the global dependencies like so:

router = APIRouter(
    prefix="/pep/{namespace}",
    # dependencies=[Depends(verify_namespace)],
)

The example is shown. I can also verify that requirements declared in the Path() instance, like min_length or max_length, are honored by FastAPI (even though I can't see them in /docs).

Code

# main.py
from fastapi import FastAPI  # JSONResponse is imported where it is used, in the router module
from .routers import namespace

app = FastAPI()

app.include_router(
    namespace.router
)

# routers/namespace.py
from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse  # used below; was missing

from ..dependencies import *
from ..main import _PEP_STORES
from ..route_examples import example_namespace

router = APIRouter(
    prefix="/pep/{namespace}",
    dependencies=[Depends(verify_namespace)],
)

@router.get("/", summary="Fetch details about a particular namespace.")
async def get_namespace(namespace: str = example_namespace):
    """Fetch namespace. Returns a JSON representation of the namespace and the projects inside it."""
    return JSONResponse(content=_PEP_STORES[namespace])

# dependencies.py
from fastapi import HTTPException  # was missing

from .main import _PEP_STORES
from .route_examples import example_namespace  # was missing

def verify_namespace(namespace: str = example_namespace) -> None:
    if namespace not in _PEP_STORES:
        raise HTTPException(status_code=404, detail=f"namespace '{namespace}' not found.")

# route_examples.py
from fastapi import Path

example_namespace = Path(
    ...,
    description="A namespace that holds projects.",
    example="demo",
)

Splash page example styling

It's strange to me that the heading styles are smaller and less pronounced than the items underneath them:

[screenshot]

See how "GET /pep/demo" is larger than the heading explaining what it is? To my eye that makes it hard to parse at a glance what is going on here. Maybe it's just me.

Sample names with slashes break endpoints

Not sure what to do about this. It is probably low priority, but sample names from GEO PEPs sometimes contain slashes (/), and that breaks the sample endpoints.

e.g.
[screenshot]
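
One hedged workaround: Starlette's :path converter lets a single path parameter match across slashes, as long as it is the final segment of the route. A sketch:

from fastapi import APIRouter

router = APIRouter()

# {sample_name:path} will match a name like "GSM123/rep1" instead of 404ing on the slash.
@router.get("/pep/{namespace}/{project}/samples/{sample_name:path}")
async def get_sample(namespace: str, project: str, sample_name: str):
    return {"sample_name": sample_name}

URL-encoding the slash as %2F is the other option, though some proxies decode or reject it before it reaches the app.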

Make pephub API accept query parameters

Example request:

GET https://pephub.databio.org/pep/geo/GSE124224/convert?filter=csv&DATA=some/local/path&IMPORTANT_PARAMETER=value_of_this_parameter&VARIABLE=test

Improve logging

Please make pepdbagent and pephub use consistent loggers. Probably logmuse will help with this.

  • use a consistent logging format
  • no print messages; always use a logger
  • use logmuse if it makes sense
  • improve the "Invalid token type" message
  • track incoming IP addresses in logs if possible
  • fix the issue with peppy's BEDbase log messages (below)

I see these messages littered all over the logs:

Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/bed_files/{file_name}
Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs/submission/{sample_name}_sample.yaml
Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/openSignalMatrix_{genome}_percentile99_01_quantNormalized_round4d.txt.gz

So basically the point here is: run this service, look at the logs, make them consistent, useful, and understandable, and correct any errors.
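
A minimal sketch of one shared logger setup (stdlib logging shown here; logmuse, as suggested above, could replace this boilerplate):

import logging

def init_logger(name: str = "pephub") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(name)s | %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Then, instead of print(...):
# logger = init_logger()
# logger.info("request from %s", client_ip)  # client_ip is hypothetical here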

Why are links using external window targets?

It's strange to me that the links on the pephub landing page all open targets in new windows. Is there a reason you did it that way?

To me it's frustrating, as I generally don't expect links to do that. I don't see a reason for doing it here.

`pepdb` Migration

I need a place to keep track of endpoints that are working with the new PepAgent class.

PEP:

  • /pep/view

Namespace:

  • /pep/{namespace}
  • /pep/{namespace}/view
  • /pep/{namespace}/projects

Project:

  • /pep/{namespace}/{project}
  • /pep/{namespace}/{project}/view
  • /pep/{namespace}/{project}/zip
  • /pep/{namespace}/{project}/config
  • /pep/{namespace}/{project}/samples
  • /pep/{namespace}/{project}/samples/{sample_name}
  • /pep/{namespace}/{project}/samples/{sample_name}/view
  • /pep/{namespace}/{project}/subsamples
  • /pep/{namespace}/{project}/convert

Eido:
Complete

The PepAgent class is nice and mature for project-level data, but less so for metadata about PEPs and for namespaces.

Add more endpoints

I thought it would be efficient to have a list of all endpoints we should have for pephub in an issue. This is what I have so far:

  • /<namespaceid>/<projectid>
  • /<namespaceid>/<projectid>/config
  • /<namespaceid>/<projectid>/samples
  • /<namespaceid>/<projectid>/samples/<sampleid>
  • /<namespaceid>/<projectid>/subsamples/<sampleid>
  • /<namespaceid>/<projectid>/zip

@nsheff Feel free to add more that you see as necessary.

PEP v1 support

@nsheff I noticed that when hitting /v1/ChangLab/PEP_1 you get an internal server error as the PEP isn't compatible with PEP 2.0.0:

NotImplementedError: The attribute implications section (implied_columns) follows the old format. 
Reformatting is not implemented. 
Edit the config file manually (add 'sample_modifiers.imply') to comply with PEP 2.0.0 specification:
http://pep.databio.org/en/latest/specification/#sample-modifier-imply

Is there an incentive to support both versions of PEP? If so, how would that be implemented?

Performance

Just so I don't forget --

We need to look into performance of peppy. It takes too long to process/return a large project. It needs to be very fast.

  • @nleroy917 said it's processing twice; this needs to be reduced to once
  • might need to explore peppy code a bit to see if we can optimize.

Next Steps

Next Steps for PEPHub

  1. Format conversions <-- Highest Priority
  2. Integrate with bedhost
  3. Add more PEPs

Pephub version missing

Right now, pephub shows the version of peppy and Python used, but not the version of pephub itself.

Project splash page: Add links to API endpoints

I think the splash page for a project should have links to show how to actually get some stuff.

for example, this page:
http://pephub.databio.org/pep/nfcore/demo_rna_pep/view

should have a link to:

Part of the point of this is to make it easy to access the particular endpoints for a given project: browse to the project page, and the links are there, already populated for you.

Providing file formats with eido conversions

Eido provides a CLI to convert a PEP into different output formats: http://eido.databio.org/en/latest/filters/

pephub should provide an API to retrieve sample metadata in various formats by specifying an eido filter.

Thus we should provide:

  1. an endpoint to list available filters. That basically runs eido filter (obviously from within Python, not on the CLI).
  2. an endpoint that takes a PEP identifier (registry path) plus a filter name, and returns the data in that format. That's basically running eido convert config.yaml -f <filter> (see the sketch below).
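
A hedged sketch of item 2, shelling out to the eido CLI exactly as described above (eido's Python API may offer a cleaner route; the config-path resolver here is hypothetical):

import subprocess

from fastapi import APIRouter, HTTPException

router = APIRouter()

def get_config_path(namespace: str, project: str) -> str:
    # Hypothetical resolver; assumes configs are reachable on disk.
    return f"/data/{namespace}/{project}/config.yaml"

@router.get("/pep/{namespace}/{project}/convert")
async def convert(namespace: str, project: str, filter: str = "basic"):
    result = subprocess.run(
        ["eido", "convert", get_config_path(namespace, project), "-f", filter],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise HTTPException(status_code=400, detail=result.stderr)
    return {"filter": filter, "output": result.stdout}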

Redesign the landing page

With user authentication and more features, we decided it would make more sense to have a better, more stylish landing page.

The original idea was to emulate docker hub, but this is flexible.

New `PepAgent` removes ability to serve configuration files

Maybe this is an issue for pepdb, but originally it was quite easy to serve a config file for any PEP since the actual files were stored next to the server. Now that we are migrating to the database representation of PEPs, it is not so trivial to serve config files through the API.

What can we do to retain the file-serving capability of pephub? Some ideas:

  1. Store binaries in the postgres database
  2. Reconstruct the config file on pephub using peppy
  3. Store files somewhere else?

Project endpoints are case-sensitive

Currently, accessing a /namespace/project endpoint only works if the case matches, e.g. "biocproject" != "BiocProject". I think the project endpoints (at least for identifiers) should be case-insensitive; one approach is sketched below.
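
A small sketch of one approach: normalize identifiers in a shared dependency. This assumes stored identifiers are lowercase, which may not hold:

from fastapi import APIRouter, Depends

def normalized_ids(namespace: str, project: str) -> tuple:
    # FastAPI fills these from the path parameters of the calling route.
    return namespace.lower(), project.lower()

router = APIRouter()

@router.get("/pep/{namespace}/{project}")
async def get_project(ids: tuple = Depends(normalized_ids)):
    namespace, project = ids
    return {"namespace": namespace, "project": project}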

API endpoint semantics for PEP tags?

With the new tag schema changes, how do we want to handle separate tags for PEPs, say demo/basic:latest? I was thinking maybe /pep/demo/basic/latest, but I am not really sure. Colons (:) are problematic characters in URIs. I'm not sure how dockerhub handles this... two options are sketched below.

@nsheff thoughts?
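
Two obvious shapes, sketched below. Note that a trailing {tag} path segment would collide with sibling routes like /config or /samples unless route ordering or reserved names handle it:

from fastapi import APIRouter

router = APIRouter()

# Option 1: tag as a query parameter, so "latest" can be the default.
@router.get("/pep/{namespace}/{project}")
async def get_project(namespace: str, project: str, tag: str = "latest"):
    return {"registry": f"{namespace}/{project}:{tag}"}

# Option 2: tag as an extra path segment, e.g. /pep/demo/basic/latest.
@router.get("/pep/{namespace}/{project}/{tag}")
async def get_project_tag(namespace: str, project: str, tag: str):
    return {"registry": f"{namespace}/{project}:{tag}"}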

Add authentication to submission endpoints

PEP Submission

It has been discussed a lot that we should open an endpoint for submitting PEPs to PEPhub. However, it needs to involve some sort of authentication. I believe the plan is to use Auth0 to do this. We should protect the submission endpoint to only allow authenticated users to submit a new PEP.

What's Been Done

There are two endpoints now to help with PEP submission:

  1. GET /pep/{namespace}/submit:
    This returns a webpage with a form that users can use to name their PEP, upload files, and add a tag. I believe the idea is that the user's username will be used to populate the namespace.

  2. POST /pep/{namespace}/submit:
    This accepts a multipart/form-data POST request and uses the submission to insert a new PEP into the database using pepdbagent. At present, this is only available if there is a SERVER_ENV=development environment variable set when running the server.

What Needs To Be Done:

  1. Use authentication to only permit the use of these endpoints for authenticated users (through GitHub).
  2. Utilize the authentication information to populate the submission form.

I am assigning this to @nsheff since he has experience with Auth0 and next.js deployment. @rafalstepien since you are familiar with Django, I thought I'd assign you too as you may have insight here.

I am close to another release of pephub, so once that is done, I think we can start to work on this.

Passing variables with query params

Wouldn't it be cool if you could pass variables through query params to adjust attributes in the PEP?

For example, say a PEP has a derived column that uses an environment variable like $DATA, because the files are stored in $DATA/subfolder/{sample_name}.fq, or something.

Then, when you hit the endpoint to get some file attribute, you could pass /endpoint?DATA=/my/local/path, and the server could return a path that is useful for your local environment.
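
A hedged sketch of the mechanics; how the overrides would actually feed into peppy's derived-attribute rendering is left open, and the per-request environment mutation below is deliberately naive:

import os

from fastapi import APIRouter, Request

router = APIRouter()

@router.get("/pep/{namespace}/{project}/samples")
async def get_samples(namespace: str, project: str, request: Request):
    # e.g. ?DATA=/my/local/path arrives here as {"DATA": "/my/local/path"}
    overrides = dict(request.query_params)
    os.environ.update(overrides)  # naive: real code should scope this per request
    # ... re-render derived attributes with the overrides applied ...
    return {"applied_overrides": overrides}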

Catch and report errors

I just added some new PEPs to the pephub.github.org repo. I did it blind without testing. When I try to access those endpoints, I get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 375, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 250, in app
    response = actual_response_class(response_data, **response_args)
  File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 49, in __init__
    self.body = self.render(content)
  File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 157, in render
    return json.dumps(
  File "/usr/local/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/usr/local/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant

In addition to the validation on pephub submission (a GitHub action in pephub, see #15), we should also do some defensive programming here to catch these errors and return an informative message to the user, in case something sneaks past the validation script.
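
A sketch of that defensive layer: a FastAPI exception handler that turns the serialization failure above (NaN/inf floats that json.dumps rejects) into an informative response. Catching ValueError this broadly is a judgment call:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(ValueError)
async def value_error_handler(request: Request, exc: ValueError):
    return JSONResponse(
        status_code=500,
        content={
            "status": "error",
            "message": f"Could not serve this PEP: {exc}. The project may "
                       "contain values that are not JSON-serializable.",
        },
    )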

Bug in renewed login on PEPhub

After one day of using PEPhub as a logged-in user, the token expires. After that, the page stops working and an error pops up:

{"detail":"The token has expired, please log in again."}

validating multi-file peps

This function assumes that every file uploaded is a PEP:

for pep in peps:
    contents = await pep.read()
    pep_path = f"{tmpdirname}/{pep.filename}"
    async with aiofiles.open(pep_path, mode="wb") as f:
        await f.write(contents)
    pep_project = peppy.Project(pep_path)

It creates a Project object from each file. That assumption is wrong: many PEPs are made up of multiple files.

This means the validation will only work for single-file PEPs.
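
A hedged sketch of a fix: write every upload into the temp directory first, then instantiate a single Project from the config file alone, so the sample tables land next to it as ordinary supporting files. The YAML-suffix config detection is a naive assumption:

import aiofiles
import peppy

async def save_and_load(peps, tmpdirname: str) -> peppy.Project:
    config_path = None
    for pep in peps:
        contents = await pep.read()
        pep_path = f"{tmpdirname}/{pep.filename}"
        async with aiofiles.open(pep_path, mode="wb") as f:
            await f.write(contents)
        if pep.filename.endswith((".yaml", ".yml")):  # naive config detection
            config_path = pep_path
    # Only now instantiate, once, from the config; the other uploads are
    # sample/subsample tables referenced by it.
    return peppy.Project(config_path)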

pephub dotfile standardization

Right now, pephub looks for a dotfile called ".pephub.yaml" in a folder to identify the config file, and expects it to look like:

config_file: path/to/config.yaml

Interestingly, looper does something very similar... it wants a file called .looper.yaml in the folder, and wants it to look like:

config_file_path: path/to/config.yaml

We should standardize.

  1. To start, let's just use config_file_path for .pephub.yaml
  2. Is it worth merging .looper.yaml and .pephub.yaml into a single file?

Links to validators

Right now, there's no way to get to the new validators. Can you please add in some interface whereby users can discover the new validators from the home page? I believe there are multiple types of new validators, so there needs to be some way to access these various tools.

Missing sample matches

@nsheff Another thought while loading the /v1/ChangLab/PEP_2 endpoint... I see this in the server output logs:

Couldn't find matching sample for subsample: BRCA-6F22B7DA-85CA-4E9C-93A3-859878775DDB-X005-S06-L012-B1-T1-PMRG
Couldn't find matching sample for subsample: BRCA-6F22B7DA-85CA-4E9C-93A3-859878775DDB-X005-S06-L012-B1-T1-PMRG
Couldn't find matching sample for subsample: LGGx-A50A1CE2-549C-4BCF-8E04-F846B09BEA95-X006-S06-L040-B1-T1-PMRG
Couldn't find matching sample for subsample: LGGx-A50A1CE2-549C-4BCF-8E04-F846B09BEA95-X006-S06-L040-B1-T1-PMRG
Couldn't find matching sample for subsample: GBMx-CBA5FDBB-E848-4B2D-82D5-8A33D7A3D205-X005-S11-L025-B1-T2-PMRG
Couldn't find matching sample for subsample: GBMx-CBA5FDBB-E848-4B2D-82D5-8A33D7A3D205-X005-S11-L025-B1-T2-PMRG

Should this be returned to the user requesting the PEP via the API?

Namespace splash page

Piggybacking off #31

It might be nice to have an HTML, user-friendly page that shows the same info you're putting into this endpoint. So, it would be a list of the projects for a namespace, with some metadata (number of samples), and a link to the project, some API endpoint links, etc. For browsability.

Disable hard querying requests

I was checking how pephub works, and I think we should disable heavy querying requests, e.g. getting all projects, or similar queries. We will get a lot of errors there, and it can cause server overload.
One cause of this issue is loading projects into peppy itself.

I would suggest:

  1. https://github.com/pepkit/pepdbagent/blob/5d60659c4f7cfca0034aac29fe8a0c4ad708f841/pepdbagent/pepdbagent.py#L404-L421 -- disable this function forever, LOL. pepdbagent won't handle that many projects.
  2. https://github.com/pepkit/pepdbagent/blob/5d60659c4f7cfca0034aac29fe8a0c4ad708f841/pepdbagent/pepdbagent.py#L317-L402 -- for these two functions we should set limits.

Looking for discussion on this issue
@nsheff @nleroy917

missing .env file?

Following instructions for local development:

docker compose up --build
open /home/nsheff/code/pephub/.env: no such file or directory

`app.mount` breaks static file serving

It seems that this code:

app.mount(
    "/",
    StaticFiles(directory=STATICS_PATH),
    name="static",
)

breaks the routing. When this code is run, I can no longer access endpoints like /pep/{namespace} or /pep/{namespace}/{project}. Base routes remain intact, but the routers no longer function. It seems like they are mutually exclusive, as in you can mount or serve routers, but not both? I am not familiar enough with FastAPI to know. @nsheff any ideas?
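
A hedged sketch of two common fixes. Starlette matches routes in registration order, so a catch-all "/" mount registered before the routers shadows them; either register the routers first and mount last, or mount statics at a dedicated subpath:

from fastapi import APIRouter, FastAPI
from fastapi.staticfiles import StaticFiles

STATICS_PATH = "statics"  # stand-in for the real path
router = APIRouter(prefix="/pep")  # stand-in for the app's real routers

app = FastAPI()
app.include_router(router)  # routers first...
app.mount("/", StaticFiles(directory=STATICS_PATH), name="static")  # ...catch-all mount last

# Or avoid the catch-all entirely:
# app.mount("/static", StaticFiles(directory=STATICS_PATH), name="static")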

Fix link integrity and missing attributes

With the rapid changes to pephub and pepdbagent, some features and APIs got misaligned, and when navigating the UI it seems that features of PEPs are sometimes missing. As such, we need to go through each page and ensure that there are no missing values and that all links work.

Allow PEPs to be marked private

Right now, any uploaded PEP is publicly visible. Instead, a user should be able to mark PEPs as public or private.

As a simple permission system to start:

  • A private PEP in an organization namespace should be visible to members of that organization
  • A private PEP in a personal namespace should be visible to only that person

To do this will require:

  • Database flag to store public or private status of each PEP (a column in the DB?)
  • Endpoints to set/get this status
  • Update public interface to only show public PEPs
  • PEP list endpoints should only show public PEPs for unauthenticated users
  • All data retrieval endpoints should only allow public PEPs for unauthenticated users (using access control decorators; see the sketch below), but allow logged in users access to their own
  • UI to set this status on an edit page of some kind.
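
A hedged sketch of the access-control piece; the user model, the is_private flag, and both accessors are assumptions:

from dataclasses import dataclass

from fastapi import Depends, HTTPException

@dataclass
class StoredPep:
    namespace: str
    is_private: bool = False

def get_current_user():
    # Hypothetical auth dependency; returns None for anonymous requests.
    return None

def fetch_pep(namespace: str, project: str) -> StoredPep:
    # Hypothetical accessor into pepdbagent; stubbed for illustration.
    return StoredPep(namespace=namespace)

def can_view(namespace: str, project: str, user=Depends(get_current_user)):
    pep = fetch_pep(namespace, project)
    if pep.is_private and (user is None or user.namespace != namespace):
        # 404 rather than 403, to avoid leaking that a private PEP exists.
        raise HTTPException(status_code=404, detail="Project not found.")
    return pep

# Routes would then take: pep = Depends(can_view)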
