pepkit / pephub

A web API and database for biological sample metadata

Home Page: https://pephub.databio.org

License: BSD 2-Clause "Simplified" License

Dockerfile 0.12% Python 23.96% Shell 0.22% CSS 2.45% HTML 4.89% JavaScript 0.92% TypeScript 65.47% MDX 1.96%
bioinformatics data-sharing metadata-management sample-metadata

pephub's Introduction

Pepkit

PEP compatible

Install

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/master.zip

Install dev version

pip install --user --upgrade https://github.com/pepkit/pepkit/archive/dev.zip

pephub's People

Contributors

aliparslan, ayobi, khoroshevskyi, nleroy917, nsheff, rafalstepien, sanghoonio


Forkers

shbrief nleroy917

pephub's Issues

Bug with 307 temporary redirect in FastAPI leading to custom validator not working

The custom validator gives an error for any input:

TypeError: NetworkError when attempting to fetch resource.

The original validator still works. This only happens in the deployed instance, not when running locally. After spending several hours troubleshooting, I realized:

  1. The Network tab in the browser console mentioned a redirect, and "Mixed content" being returned.
  2. In the server logs, I saw "POST /eido/validate/pep HTTP/1.1" 307 Temporary Redirect whenever I tried to use the new validator.

This led me to some searching about FastAPI and 307 redirects, and I realized that a trailing-slash mismatch triggers this. Sure enough:

Trailing slash:

@router.post("/validate/pep/")

NO trailing slash:
@router.post("/validate")

It may have something to do with the redirect not being HTTPS-aware? Anyway, I think the fix is just to remove that slash in the endpoint definition.
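
For reference, a minimal sketch of two ways to avoid the redirect, assuming a standard FastAPI setup (the handler body is a placeholder):

from fastapi import APIRouter

# Option 1: define the route exactly as clients call it (no trailing slash).
router = APIRouter()

@router.post("/eido/validate/pep")
async def validate_pep():
    return {"status": "ok"}  # placeholder handler

# Option 2: disable the automatic slash redirect, so a mismatched trailing
# slash returns 404 instead of a scheme-unaware 307.
strict_router = APIRouter(redirect_slashes=False)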

Standardize return package schema/format

@nsheff As I work on these endpoints, I am just returning raw data with nothing else. For example this is the API response for /pep/demo/subsample1/config:

pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
looper:
  output_dir: $HOME/example_results

It's the config file as YAML, and nothing else. Should we standardize the return data packages? Something like...

{
  "status": "ok",
  "message": "success",
  "data": {
    "pep_version": "2.0.0",
    "sample_table": "sample_table.csv",
    "subsample_table": "subsample_table.csv",
    "looper": {
      "output_dir": "$HOME/example_results"
    }
  }
}

Does FastAPI have a way to standardize responses, sort of like how it standardizes request handling (with dependencies, route prefixes, etc.)?
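
FastAPI doesn't enforce a response envelope on its own, but response_model can get most of the way there. A minimal sketch using Pydantic v1 generics, with the field names taken from the proposal above:

from typing import Generic, Optional, TypeVar

from pydantic.generics import GenericModel

T = TypeVar("T")

class Envelope(GenericModel, Generic[T]):
    status: str = "ok"
    message: str = "success"
    data: Optional[T] = None

# Applied to a route via response_model:
#
# @router.get("/pep/{namespace}/{project}/config", response_model=Envelope[dict])
# async def get_config(namespace: str, project: str):
#     return Envelope(data=config_as_dict)  # config_as_dict is hypothetical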

Pinned version requirements

Is there a reason you're pinning the version of every dependency here? https://github.com/pepkit/pephub/blob/master/requirements/requirements-all.txt

I am not a fan of pinning versions in requirements files -- I generally prefer minimum bounds: they stay flexible when dependencies get upgraded, and pip doesn't yell at me.

Also, this dependency list is excessive; many of these are already covered as sub-dependencies. See, for example, how we usually do it:

https://github.com/refgenie/refgenie/blob/master/requirements/requirements-all.txt
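
For comparison, a sketch of what minimum bounds could look like here; the package names appear elsewhere in this repo's stack, and the version floors are purely illustrative:

# version floors are illustrative, not tested
fastapi>=0.70
peppy>=0.31
uvicorn>=0.15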

Handling large lists

If there are 10k entries in, say, the geofetch namespace, these pages take a long time to load because there are so many entries.

We need some way to avoid retrieving all of them at once.
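
One option is plain limit/offset pagination at the API layer. A hedged sketch; the in-memory store and the parameter names are assumptions, just to show the shape:

from fastapi import APIRouter, Query

router = APIRouter()

# Stand-in for the real project store, just for illustration.
_FAKE_STORE = {"geofetch": [f"project_{i}" for i in range(10_000)]}

@router.get("/pep/{namespace}/projects")
async def list_projects(
    namespace: str,
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
):
    projects = _FAKE_STORE.get(namespace, [])
    return {
        "count": len(projects),
        "limit": limit,
        "offset": offset,
        "items": projects[offset : offset + limit],
    }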

Is this the best list output?

Right now the namespace output is an object whose keys correspond to projects in that namespace, and whose values are, I guess, the filepath on the local server to the project config file?

[screenshot]

I see how the keys are useful, because they are the identifiers that can be used to get further information about the project. But what's the point of those values?

I think this endpoint should be serving an overview about the projects. What more useful information could be served here? Maybe number of samples in the project?

Or maybe it should just be a list of the keys. The local paths only confuse things.
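
For concreteness, one possible shape for that overview (the field names and counts here are illustrative, not a settled schema):

{
  "namespace": "demo",
  "projects": [
    { "name": "subsample1", "number_of_samples": 4 },
    { "name": "basic", "number_of_samples": 2 }
  ]
}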

Adding example request data to path parameters defined through dependency injection

Overview

As heavily discussed in #14, I can't seem to add example request data to /docs when the path parameters are defined through dependency injection. I'm pulling this out into its own issue to track progress instead of leaving the PR in purgatory.

Issue

I am having trouble declaring request data examples for specific endpoint/route parameters when those parameters are defined through dependency injection. In the example below, I need to verify that the namespace actually exists prior to returning data about it. To do that, I followed this example in the FastAPI docs. It works great.

However, in /docs, I'd love to give examples for namespaces. To try and do that, I followed this example in the FastAPI docs.

Visiting /docs, however, provides the user with no example request data:
[screenshot]

If I comment out the global dependencies like so:

router = APIRouter(
    prefix="/pep/{namespace}",
    # dependencies=[Depends(verify_namespace)],
)

The example is shown. I can also verify that requirements declared in the Path() instance, like min_length or max_length, are honored by FastAPI (even though I can't see them in /docs).

Code

# main.py
from fastapi import FastAPI  # JSONResponse is imported where it is used, in the router module
from .routers import namespace

app = FastAPI()

app.include_router(
    namespace.router
)

# routers/namespace.py
from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse  # used below; was missing

from ..dependencies import *
from ..main import _PEP_STORES
from ..route_examples import example_namespace

router = APIRouter(
    prefix="/pep/{namespace}",
    dependencies=[Depends(verify_namespace)],
)

@router.get("/", summary="Fetch details about a particular namespace.")
async def get_namespace(namespace: str = example_namespace):
    """Fetch namespace. Returns a JSON representation of the namespace and the projects inside it."""
    return JSONResponse(content=_PEP_STORES[namespace])

# dependencies.py
from fastapi import HTTPException  # was missing

from .main import _PEP_STORES
from .route_examples import example_namespace  # was missing

def verify_namespace(namespace: str = example_namespace) -> None:
    if namespace not in _PEP_STORES:
        raise HTTPException(status_code=404, detail=f"namespace '{namespace}' not found.")

# route_examples.py
from fastapi import Path

example_namespace = Path(
    ...,
    description="A namespace that holds projects.",
    example="demo",
)

Splash page example styling

It's strange to me that the heading styles are smaller and less pronounced than the items underneath them:

[screenshot]

See how "GET /pep/demo" is larger than the heading explaining what it is? To my eye that makes it hard to parse at a glance what is going on here. Maybe it's just me.

Sample names with slashes break endpoints

Not sure what to do about this. It is probably low priority, but sample names from GEO PEPs sometimes contain slashes (/), and that breaks the sample endpoints.

e.g.
[screenshot]
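
One hedged workaround: Starlette's :path converter lets a single path parameter match across slashes, as long as it is the final segment of the route. A sketch:

from fastapi import APIRouter

router = APIRouter()

# {sample_name:path} will match a name like "GSM123/rep1" instead of 404ing on the slash.
@router.get("/pep/{namespace}/{project}/samples/{sample_name:path}")
async def get_sample(namespace: str, project: str, sample_name: str):
    return {"sample_name": sample_name}

URL-encoding the slash as %2F is the other option, though some proxies decode or reject it before it reaches the app.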

Make pephub API accept query parameters

Example request:

GET https://pephub.databio.org/pep/geo/GSE124224/convert?filter=csv&DATA=some/local/path&IMPORTANT_PARAMETER=value_of_this_parameter&VARIABLE=test

Improve logging

Please make pepdbagent and pephub use consistent loggers. Probably logmuse will help with this.

  • use a consistent logging format
  • no print messages; always use a logger
  • use logmuse if it makes sense
  • improve the "Invalid token type" message
  • track incoming IP addresses in logs if possible
  • fix the issue with peppy's BEDbase log messages (below)

I see these messages littered all over the logs:

Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/bed_files/{file_name}
Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/outputs/bedstat_output/bedstat_pipeline_logs/submission/{sample_name}_sample.yaml
Not all environment variables were populated in derived attribute source: $BEDBASE_DATA_PATH_HOST/openSignalMatrix_{genome}_percentile99_01_quantNormalized_round4d.txt.gz

So basically the point here is: run this service, look at the logs, make them consistent, useful, and understandable, and correct any errors.
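
A minimal sketch of one shared logger setup (stdlib logging shown here; logmuse, as suggested above, could replace this boilerplate):

import logging

def init_logger(name: str = "pephub") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s | %(name)s | %(levelname)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

# Then, instead of print(...):
# logger = init_logger()
# logger.info("request from %s", client_ip)  # client_ip is hypothetical here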

Why are links using external window targets?

It's strange to me that the links on the pephub landing page all open targets in new windows. Is there a reason you did it that way?

To me it's frustrating, as I generally don't expect links to do that. I don't see a reason for doing it here.

`pepdb` Migration

I need a place to keep track of endpoints that are working with the new PepAgent class.

PEP:

  • /pep/view

Namespace:

  • /pep/{namespace}
  • /pep/{namespace}/view
  • /pep/{namespace}/projects

Project:

  • /pep/{namespace}/{project}
  • /pep/{namespace}/{project}/view
  • /pep/{namespace}/{project}/zip
  • /pep/{namespace}/{project}/config
  • /pep/{namespace}/{project}/samples
  • /pep/{namespace}/{project}/samples/{sample_name}
  • /pep/{namespace}/{project}/samples/{sample_name}/view
  • /pep/{namespace}/{project}/subsamples
  • /pep/{namespace}/{project}/convert

Eido:
Complete

The PepAgent class is nice and mature for project-level data, but less so for metadata about PEPs and for namespaces.

Add more endpoints

I thought it would be efficient to have a list of all endpoints we should have for pephub in an issue. This is what I have so far:

  • /<namespaceid>/<projectid>
  • /<namespaceid>/<projectid>/config
  • /<namespaceid>/<projectid>/samples
  • /<namespaceid>/<projectid>/samples/<sampleid>
  • /<namespaceid>/<projectid>/subsamples/<sampleid>
  • /<namespaceid>/<projectid>/zip

@nsheff Feel free to add more that you see as necessary.

PEP v1 support

@nsheff I noticed that when hitting /v1/ChangLab/PEP_1 you get an internal server error as the PEP isn't compatible with PEP 2.0.0:

NotImplementedError: The attribute implications section (implied_columns) follows the old format. 
Reformatting is not implemented. 
Edit the config file manually (add 'sample_modifiers.imply') to comply with PEP 2.0.0 specification:
http://pep.databio.org/en/latest/specification/#sample-modifier-imply

Is there an incentive to support both versions of PEP? If so, how would that be implemented?

Performance

Just so I don't forget --

We need to look into performance of peppy. It takes too long to process/return a large project. It needs to be very fast.

  • @nleroy917 said it's processing twice; this needs to be reduced to once
  • might need to explore peppy code a bit to see if we can optimize.

Next Steps

Next Steps for PEPHub

  1. Format conversions <-- Highest Priority
  2. Integrate with bedhost
  3. Add more PEPs

Pephub version missing

Right now, pephub shows the version of peppy and Python used, but not the version of pephub itself.

Project splash page: Add links to API endpoints

I think the splash page for a project should have links to show how to actually get some stuff.

for example, this page:
http://pephub.databio.org/pep/nfcore/demo_rna_pep/view

should have a link to:

Part of the point of this is to make it easy to access the particular endpoints for a given project: browse to the project page, and the links are there, already populated for you.

Providing file formats with eido conversions

Eido provides a CLI to convert a PEP into different output formats: http://eido.databio.org/en/latest/filters/

pephub should provide an API to retrieve sample metadata in various formats by specifying an eido filter.

Thus we should provide:

  1. an endpoint to list available filters. That basically runs eido filter (obviously from within Python, not on the CLI).
  2. an endpoint that takes a PEP identifier (registry path) plus a filter name, and returns the data in that format. That's basically running eido convert config.yaml -f <filter> (see the sketch below).
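
A hedged sketch of item 2, shelling out to the eido CLI exactly as described above (eido's Python API may offer a cleaner route; the config-path resolver here is hypothetical):

import subprocess

from fastapi import APIRouter, HTTPException

router = APIRouter()

def get_config_path(namespace: str, project: str) -> str:
    # Hypothetical resolver; assumes configs are reachable on disk.
    return f"/data/{namespace}/{project}/config.yaml"

@router.get("/pep/{namespace}/{project}/convert")
async def convert(namespace: str, project: str, filter: str = "basic"):
    result = subprocess.run(
        ["eido", "convert", get_config_path(namespace, project), "-f", filter],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise HTTPException(status_code=400, detail=result.stderr)
    return {"filter": filter, "output": result.stdout}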

Redesign the landing page

With user authentication and more features, we decided it would make more sense to have a better, more stylish landing page.

The original idea was to emulate docker hub, but this is flexible.

New `PepAgent` removes ability to serve configuration files

Maybe this is an issue for pepdb, but originally it was quite easy to serve a config file for any PEP since the actual files were stored next to the server. Now that we are migrating to the database representation of PEPs, it is not so trivial to serve config files through the API.

What can we do to retain the file-serving capability of pephub? Some ideas:

  1. Store binaries in the postgres database
  2. Reconstruct the config file on pephub using peppy
  3. Store files somewhere else?

Project endpoints are case-sensitive

Currently, accessing a /namespace/project endpoint only works if the case matches, e.g. "biocproject" != "BiocProject". I think the project endpoints (at least for identifiers) should be case-insensitive; one approach is sketched below.
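
A small sketch of one approach: normalize identifiers in a shared dependency. This assumes stored identifiers are lowercase, which may not hold:

from fastapi import APIRouter, Depends

def normalized_ids(namespace: str, project: str) -> tuple:
    # FastAPI fills these from the path parameters of the calling route.
    return namespace.lower(), project.lower()

router = APIRouter()

@router.get("/pep/{namespace}/{project}")
async def get_project(ids: tuple = Depends(normalized_ids)):
    namespace, project = ids
    return {"namespace": namespace, "project": project}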

API endpoint semantics for PEP tags?

With the new tag schema changes, how do we want to handle separate tags for PEPs, say demo/basic:latest? I was thinking maybe /pep/demo/basic/latest, but I am not really sure. Colons (:) are problematic characters in URIs. I'm not sure how dockerhub handles this... two options are sketched below.

@nsheff thoughts?
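
Two obvious shapes, sketched below. Note that a trailing {tag} path segment would collide with sibling routes like /config or /samples unless route ordering or reserved names handle it:

from fastapi import APIRouter

router = APIRouter()

# Option 1: tag as a query parameter, so "latest" can be the default.
@router.get("/pep/{namespace}/{project}")
async def get_project(namespace: str, project: str, tag: str = "latest"):
    return {"registry": f"{namespace}/{project}:{tag}"}

# Option 2: tag as an extra path segment, e.g. /pep/demo/basic/latest.
@router.get("/pep/{namespace}/{project}/{tag}")
async def get_project_tag(namespace: str, project: str, tag: str):
    return {"registry": f"{namespace}/{project}:{tag}"}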

Add authentication to submission endpoints

PEP Submission

It has been discussed a lot that we should open an endpoint for submitting PEPs to PEPhub. However, it needs to involve some sort of authentication. I believe the plan is to use Auth0 to do this. We should protect the submission endpoint to only allow authenticated users to submit a new PEP.

What's Been Done

There are two endpoints now to help with PEP submission:

  1. GET /pep/{namespace}/submit:
    This returns a webpage with a form that users can use to name their PEP, upload files, and add a tag. I believe the idea is that the user's username will be used to populate the namespace.

  2. POST /pep/{namespace}/submit:
    This accepts a multipart/form-data POST request and uses the submission to insert a new PEP into the database using pepdbagent. At present, this is only available if there is a SERVER_ENV=development environment variable set when running the server.

What Needs To Be Done:

  1. Use authentication to only permit the use of these endpoints for authenticated users (through GitHub).
  2. Utilize the authentication information to populate the submission form.

I am assigning this to @nsheff since he has experience with Auth0 and next.js deployment. @rafalstepien since you are familiar with Django, I thought I'd assign you too as you may have insight here.

I am close to another release of pephub, so once that is done, I think we can start to work on this.

Passing variables with query params

Wouldn't it be cool if you could pass variables through query params to adjust attributes in the PEP?

For example, say a PEP has a derived column that uses an environment variable like $DATA, because the files are stored in $DATA/subfolder/{sample_name}.fq, or something.

Then, when you hit the endpoint to get some file attribute, you could pass /endpoint?DATA=/my/local/path, and the server could return a path that is useful for your local environment.
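
A hedged sketch of the mechanics; how the overrides would actually feed into peppy's derived-attribute rendering is left open, and the per-request environment mutation below is deliberately naive:

import os

from fastapi import APIRouter, Request

router = APIRouter()

@router.get("/pep/{namespace}/{project}/samples")
async def get_samples(namespace: str, project: str, request: Request):
    # e.g. ?DATA=/my/local/path arrives here as {"DATA": "/my/local/path"}
    overrides = dict(request.query_params)
    os.environ.update(overrides)  # naive: real code should scope this per request
    # ... re-render derived attributes with the overrides applied ...
    return {"applied_overrides": overrides}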

Catch and report errors

I just added some new PEPs to the pephub.github.org repo. I did it blind without testing. When I try to access those endpoints, I get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/http/httptools_impl.py", line 375, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 250, in app
    response = actual_response_class(response_data, **response_args)
  File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 49, in __init__
    self.body = self.render(content)
  File "/usr/local/lib/python3.8/site-packages/starlette/responses.py", line 157, in render
    return json.dumps(
  File "/usr/local/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/usr/local/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant

In addition to the validation on pephub submission (a GitHub action in pephub, see #15), we should also do some defensive programming here to catch these errors and return an informative message to the user, in case something sneaks past the validation script.
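
A sketch of that defensive layer: a FastAPI exception handler that turns the serialization failure above (NaN/inf floats that json.dumps rejects) into an informative response. Catching ValueError this broadly is a judgment call:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(ValueError)
async def value_error_handler(request: Request, exc: ValueError):
    return JSONResponse(
        status_code=500,
        content={
            "status": "error",
            "message": f"Could not serve this PEP: {exc}. The project may "
                       "contain values that are not JSON-serializable.",
        },
    )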

Bug in renewed login on PEPhub

After one day of using PEPhub as a logged-in user, the token expires. After that, the page stops working and an error pops up:

{"detail":"The token has expired, please log in again."}

validating multi-file peps

This function assumes that every file uploaded is a PEP:

for pep in peps:
    contents = await pep.read()
    pep_path = f"{tmpdirname}/{pep.filename}"
    async with aiofiles.open(pep_path, mode="wb") as f:
        await f.write(contents)
    pep_project = peppy.Project(pep_path)

It creates a Project object from each file. That assumption is wrong: many PEPs are made up of multiple files.

This means the validation will only work for single-file PEPs.
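
A hedged sketch of a fix: write every upload into the temp directory first, then instantiate a single Project from the config file alone, so the sample tables land next to it as ordinary supporting files. The YAML-suffix config detection is a naive assumption:

import aiofiles
import peppy

async def save_and_load(peps, tmpdirname: str) -> peppy.Project:
    config_path = None
    for pep in peps:
        contents = await pep.read()
        pep_path = f"{tmpdirname}/{pep.filename}"
        async with aiofiles.open(pep_path, mode="wb") as f:
            await f.write(contents)
        if pep.filename.endswith((".yaml", ".yml")):  # naive config detection
            config_path = pep_path
    # Only now instantiate, once, from the config; the other uploads are
    # sample/subsample tables referenced by it.
    return peppy.Project(config_path)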

pephub dotfile standardization

Right now, pephub looks for a dotfile called ".pephub.yaml" in a folder to identify the config file, and expects it to look like:

config_file: path/to/config.yaml

Interestingly, looper does something very similar... it wants a file called .looper.yaml in the folder, and wants it to look like:

config_file_path: path/to/config.yaml

We should standardize.

  1. To start, let's just use config_file_path for .pephub.yaml
  2. Is it worth merging .looper.yaml and .pephub.yaml into a single file?

Links to validators

Right now, there's no way to get to the new validators. Can you please add in some interface whereby users can discover the new validators from the home page? I believe there are multiple types of new validators, so there needs to be some way to access these various tools.

Missing sample matches

@nsheff Another thought while loading the /v1/ChangLab/PEP_2 endpoint... I see this in the server output logs:

Couldn't find matching sample for subsample: BRCA-6F22B7DA-85CA-4E9C-93A3-859878775DDB-X005-S06-L012-B1-T1-PMRG
Couldn't find matching sample for subsample: BRCA-6F22B7DA-85CA-4E9C-93A3-859878775DDB-X005-S06-L012-B1-T1-PMRG
Couldn't find matching sample for subsample: LGGx-A50A1CE2-549C-4BCF-8E04-F846B09BEA95-X006-S06-L040-B1-T1-PMRG
Couldn't find matching sample for subsample: LGGx-A50A1CE2-549C-4BCF-8E04-F846B09BEA95-X006-S06-L040-B1-T1-PMRG
Couldn't find matching sample for subsample: GBMx-CBA5FDBB-E848-4B2D-82D5-8A33D7A3D205-X005-S11-L025-B1-T2-PMRG
Couldn't find matching sample for subsample: GBMx-CBA5FDBB-E848-4B2D-82D5-8A33D7A3D205-X005-S11-L025-B1-T2-PMRG

Should this be returned to the user requesting the PEP via the API?

Namespace splash page

Piggybacking off #31

It might be nice to have an HTML, user-friendly page that shows the same info you're putting into this endpoint. So, it would be a list of the projects for a namespace, with some metadata (number of samples), and a link to the project, some API endpoint links, etc. For browsability.

Disable hard querying requests

I was checking how pephub works, and I think we should disable heavy querying requests, e.g. getting all projects, or similar queries. We will get a lot of errors there, and it can cause server overload.
One cause of this issue is loading projects into peppy itself.

I would suggest:

  1. https://github.com/pepkit/pepdbagent/blob/5d60659c4f7cfca0034aac29fe8a0c4ad708f841/pepdbagent/pepdbagent.py#L404-L421 -- disable this function forever, LOL. pepdbagent won't handle that many projects.
  2. https://github.com/pepkit/pepdbagent/blob/5d60659c4f7cfca0034aac29fe8a0c4ad708f841/pepdbagent/pepdbagent.py#L317-L402 -- for these two functions we should set limits.

Looking for discussion on this issue
@nsheff @nleroy917

missing .env file?

Following instructions for local development:

docker compose up --build
open /home/nsheff/code/pephub/.env: no such file or directory

`app.mount` breaks static file serving

It seems that this code:

app.mount(
    "/",
    StaticFiles(directory=STATICS_PATH),
    name="static",
)

breaks the routing. When this code is run, I can no longer access endpoints like /pep/{namespace} or /pep/{namespace}/{project}. Base routes remain intact, but the routers no longer function. It seems like they are mutually exclusive, as in you can mount or serve routers, but not both? I am not familiar enough with FastAPI to know. @nsheff any ideas?
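
A hedged sketch of two common fixes. Starlette matches routes in registration order, so a catch-all "/" mount registered before the routers shadows them; either register the routers first and mount last, or mount statics at a dedicated subpath:

from fastapi import APIRouter, FastAPI
from fastapi.staticfiles import StaticFiles

STATICS_PATH = "statics"  # stand-in for the real path
router = APIRouter(prefix="/pep")  # stand-in for the app's real routers

app = FastAPI()
app.include_router(router)  # routers first...
app.mount("/", StaticFiles(directory=STATICS_PATH), name="static")  # ...catch-all mount last

# Or avoid the catch-all entirely:
# app.mount("/static", StaticFiles(directory=STATICS_PATH), name="static")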

Fix link integrity and missing attributes

With the rapid changes to pephub and pepdbagent, some features and APIs got misaligned, and when navigating the UI it seems that features of PEPs are sometimes missing. As such, we need to go through each page and ensure that there are no missing values and that all links work.

Allow PEPs to be marked private

Right now, any uploaded PEP is publicly visible. Instead, a user should be able to mark PEPs as public or private.

As a simple permission system to start:

  • A private PEP in an organization namespace should be visible to members of that organization
  • A private PEP in a personal namespace should be visible to only that person

To do this will require:

  • Database flag to store public or private status of each PEP (a column in the DB?)
  • Endpoints to set/get this status
  • Update public interface to only show public PEPs
  • PEP list endpoints should only show public PEPs for unauthenticated users
  • All data retrieval endpoints should only allow public PEPs for unauthenticated users (using access control decorators; see the sketch below), but allow logged in users access to their own
  • UI to set this status on an edit page of some kind.
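
A hedged sketch of the access-control piece; the user model, the is_private flag, and both accessors are assumptions:

from dataclasses import dataclass

from fastapi import Depends, HTTPException

@dataclass
class StoredPep:
    namespace: str
    is_private: bool = False

def get_current_user():
    # Hypothetical auth dependency; returns None for anonymous requests.
    return None

def fetch_pep(namespace: str, project: str) -> StoredPep:
    # Hypothetical accessor into pepdbagent; stubbed for illustration.
    return StoredPep(namespace=namespace)

def can_view(namespace: str, project: str, user=Depends(get_current_user)):
    pep = fetch_pep(namespace, project)
    if pep.is_private and (user is None or user.namespace != namespace):
        # 404 rather than 403, to avoid leaking that a private PEP exists.
        raise HTTPException(status_code=404, detail="Project not found.")
    return pep

# Routes would then take: pep = Depends(can_view)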
