
Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

Home Page: https://huggingface.co/docs/datasets-server

License: Apache License 2.0



Dataset viewer

Integrate over 10,000 datasets into your apps via simple HTTP requests, with pre-processed responses and built-in scalability.

Documentation: https://huggingface.co/docs/datasets-server

Ask for a new feature 🎁

The dataset viewer pre-processes the Hugging Face Hub datasets so that they are ready to use in your apps through the API: list of the splits, first rows.

We plan to add more features to the server. Please comment on the existing feature requests and upvote your favorites.

If you have an idea for a new feature, please open a new issue.

Contribute 🤝

You can help by giving ideas, answering questions, reporting bugs, proposing enhancements, improving the documentation, and fixing bugs. See CONTRIBUTING.md for more details.

To install the server and start contributing to the code, see DEVELOPER_GUIDE.md.

Community 🤗

You can star and watch this GitHub repository to follow the updates.

You can ask for help or answer questions on the Forum and Discord.

You can also report bugs and propose enhancements for the code or the documentation in the GitHub issues.

datasets-server's People

Contributors

albertvillanova, andreafrancis, baskrahmer, ccl-core, coyotte508, dependabot[bot], egndz, geethika-123, glegendre01, jatinkumar001, julien-c, keleffew, lhoestq, lysandrejik, marcenacp, mariosasko, mishig25, n1t0, polinaeterna, rtrompier, severo, stevhliu, xcid


datasets-server's Issues

exception seen during `make benchmark`

I'm not sure which dataset threw this exception, which is why I included the preceding rows for further investigation.

poetry run python ../scripts/get_rows_report.py wikiann___CONFIG___or___SPLIT___test ../tmp/get_rows_reports/wikiann___CONFIG___or___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py csebuetnlp___SLASH___xlsum___CONFIG___uzbek___SPLIT___test ../tmp/get_rows_reports/csebuetnlp___SLASH___xlsum___CONFIG___uzbek___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py clips___SLASH___mfaq___CONFIG___no___SPLIT___train ../tmp/get_rows_reports/clips___SLASH___mfaq___CONFIG___no___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py common_voice___CONFIG___rm-vallader___SPLIT___train ../tmp/get_rows_reports/common_voice___CONFIG___rm-vallader___SPLIT___train.json
https://media.githubusercontent.com/media/persiannlp/parsinlu/master/data/translation/translation_combined_fa_en/test.tsv
poetry run python ../scripts/get_rows_report.py pasinit___SLASH___xlwic___CONFIG___xlwic_en_da___SPLIT___train ../tmp/get_rows_reports/pasinit___SLASH___xlwic___CONFIG___xlwic_en_da___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py indic_glue___CONFIG___wstp.mr___SPLIT___validation ../tmp/get_rows_reports/indic_glue___CONFIG___wstp.mr___SPLIT___validation.json
poetry run python ../scripts/get_rows_report.py banking77___CONFIG___default___SPLIT___test ../tmp/get_rows_reports/banking77___CONFIG___default___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py gem___CONFIG___xsum___SPLIT___challenge_test_bfp_05 ../tmp/get_rows_reports/gem___CONFIG___xsum___SPLIT___challenge_test_bfp_05.json
poetry run python ../scripts/get_rows_report.py turingbench___SLASH___TuringBench___CONFIG___TT_fair_wmt19___SPLIT___validation ../tmp/get_rows_reports/turingbench___SLASH___TuringBench___CONFIG___TT_fair_wmt19___SPLIT___validation.json
poetry run python ../scripts/get_rows_report.py igbo_monolingual___CONFIG___eze_goes_to_school___SPLIT___train ../tmp/get_rows_reports/igbo_monolingual___CONFIG___eze_goes_to_school___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py flax-sentence-embeddings___SLASH___stackexchange_titlebody_best_voted_answer_jsonl___CONFIG___gamedev___SPLIT___train ../tmp/get_rows_reports/flax-sentence-embeddings___SLASH___stackexchange_titlebody_best_voted_answer_jsonl___CONFIG___gamedev___SPLIT___train.json
 * skipping . . .
Exception ignored in: <generator object ParsinluReadingComprehension._generate_examples at 0x7f094caa6dd0>
Traceback (most recent call last):
  File "/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 79, in __iter__
    yield key, example
RuntimeError: generator ignored GeneratorExit

Add a parameter to specify the number of rows

It's a problem for the cache, so until we manage random access, we can:

  • fill the cache with a large (maximum) number of rows, i.e. up to 1000
  • also cache the default request (N = 100) -> set to the parameter used in moon-landing
  • if a request comes with N = 247, for example, generate the response on the fly, from the large cache (1000), and don't cache that response

Expand the purpose of this backend?

Depending on the evolution of https://github.com/huggingface/datasets, this project might disappear, or its features might be reduced, in particular if, one day, that library allows caching the data by self-generating:

  • an arrow or a parquet data file (maybe with sharding and compression for the largest datasets)
  • or a SQL database
  • or precompute and store a partial list of known offsets (every 10MB for example)

That would allow random access to the data.

Get random access to the rows

Currently, only the first rows can be obtained with /rows. We want access to slices of the rows through pagination, e.g. /rows?from=40000&rows=10
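Until true random access exists, a paginated response can only be produced by iterating the stream up to the offset. A minimal sketch (the function name is an assumption) of serving /rows?from=40000&rows=10 from a streamed iterable of rows:

```python
from itertools import islice

def get_rows_slice(rows, offset: int, length: int) -> list:
    """Return rows [offset, offset + length) from an iterable of rows.

    Without a precomputed index this still costs O(offset) iteration,
    which is why genuine random access to the data is needed.
    """
    return list(islice(rows, offset, offset + length))
```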

warm the cache

Warm the cache at application startup. We want:

  • to avoid blocking the application, so: run asynchronously, and without hammering the server

  • to have a warm cache as fast as possible (persisting the previous cache, then refreshing it at startup? - related: #35 )

  • create a function to list all the datasets and fill the cache for all the possible requests for it. It might be make benchmark, or a specific function -> make warm

  • persist the cache? or start with an empty cache when the application is restarted? -> yes, persisted

  • launch it at application startup -> it's done at startup, see INSTALL.md.

cache both the functions returns and the endpoints results

Currently, only the endpoint results are cached. We use them inside the code to get quick results by taking advantage of the cache, but that is not their purpose, and we have to parse/decode them.

It would be better to directly cache the results of the functions (memoize).

Also: we could cache the raised exceptions as here:

https://github.com/peterbe/django-cache-memoize/blob/4da1ba4639774426fa928d4a461626e6f841b4f3/src/cache_memoize/__init__.py#L153L157
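A minimal sketch of such a memoizer, caching raised exceptions as well as return values, similar in spirit to the django-cache-memoize link above. The decorator name and the in-memory dict are assumptions; the real implementation would use the application cache.

```python
import functools

def memoize_with_exceptions(func):
    """Cache both return values and raised exceptions."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args in cache:
            outcome, value = cache[args]
            if outcome == "raised":
                raise value  # replay the cached exception
            return value
        try:
            result = func(*args)
        except Exception as err:
            cache[args] = ("raised", err)
            raise
        cache[args] = ("ok", result)
        return result

    return wrapper
```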

Provide the ETag header

  • set and manage the ETag header to save bandwidth when the client (browser) revalidates. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching and https://gist.github.com/timheap/1f4d9284e4f4d4f545439577c0ca6300
        # TODO: use key for ETag? It will need to be serialized
        # key = get_rows_json.__cache_key__(
        #     dataset=dataset, config=config, split=split, num_rows=num_rows, token=request.user.token
        # )
        # print(f"key={key} in cache: {cache.__contains__(key)}")
  • ETag: add an ETag header in the response (hash of the response)
  • ETag: if the request contains the If-None-Match, parse its ETag (beware the "weak" ETags), compare to the cache, and return an empty 304 response if the cache is fresh (with or without changing the TTL), or 200 with content if it has changed

CI: how to acknowledge a "safety" warning?

We use safety to check for vulnerabilities in the dependencies. But in the case below, tensorflow is marked as insecure while the latest version published on PyPI is still 2.6.0. What should we do in this case?

+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 137 packages, using free DB (updated once a month)                   |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| tensorflow                 | 2.6.0     | ==2.6.0                  | 41161    |
+==============================================================================+

`make benchmark` is very long and blocks

Sometimes make benchmark blocks (nothing happens, only one process is running, and the load is low). Ideally it would not block, and other processes would be launched anyway so that the full capacity of the CPUs is used (the -j and -l 7 parameters of make).

To unblock, I have to kill and relaunch make benchmark manually.

Add CI

Check types and code quality

/splits does not error when no config exists and a wrong config is passed

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config=doesnotexist

returns:

{

    "splits": [
        {
            "dataset": "sent_comp",
            "config": "doesnotexist",
            "split": "validation"
        },
        {
            "dataset": "sent_comp",
            "config": "doesnotexist",
            "split": "train"
        }
    ]

}

instead of giving an error.

As https://datasets-preview.huggingface.tech/configs?dataset=sent_comp returns

{
    "configs": [
        {
            "dataset": "sent_comp",
            "config": "default"
        }
    ]
}

the only allowed config parameter should be default.

Regenerate dataset-info instead of loading it?

Currently, getting the rows with /rows requires a previous (internal) call to /infos to get the features (the column types). But sometimes the dataset-info.json file is missing, or not consistent with the dataset script (for example: https://huggingface.co/datasets/lhoestq/custom_squad/tree/main), while we use datasets.get_dataset_infos(), which only loads the exported dataset-info.json files:

https://github.com/huggingface/datasets-preview-backend/blob/c2a78e7ce8e36cdf579fea805535fa9ef84a2027/src/datasets_preview_backend/queries/infos.py#L45

https://github.com/huggingface/datasets/blob/26ff41aa3a642e46489db9e95be1e9a8c4e64bea/src/datasets/inspect.py#L115

We might want to call ._info() on the builder to get the info and features, instead of relying on the dataset-info.json file.

Scale the application

Both uvicorn and pm2 allow specifying the number of workers. pm2 seems interesting since it provides a way to increase or decrease the number of workers without a restart.

But before using multiple workers, it's important to instrument the app in order to detect whether we need them (e.g., by monitoring the response time).

Properly manage the case config is None

For example:

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config=null returns

{
    "dataset": "sent_comp",
    "config": "null",
    "splits": [
        "validation",
        "train"
    ]
}

this should have errored since there is no "null" config (it's null).

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config= returns

The split names could not be parsed from the dataset config.

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "config": null,
    "splits": [
        "validation",
        "train"
    ]
}

As a reference for the same dataset https://datasets-preview.huggingface.tech/configs?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "configs": [
        null
    ]
}

and https://datasets-preview.huggingface.tech/info?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "info": {
        "default": {
            ...
            "builder_name": "sent_comp",
            "config_name": "default",
            ...
            "splits": {
                "validation": {
                    "name": "validation",
                    "num_bytes": 55823979,
                    "num_examples": 10000,
                    "dataset_name": "sent_comp"
                },
                "train": {
                    "name": "train",
                    "num_bytes": 1135684803,
                    "num_examples": 200000,
                    "dataset_name": "sent_comp"
                }
            },
            ...
        }
    }
}

Increase the proportion of hf.co datasets that can be previewed

For different reasons, some datasets cannot be previewed. It might be because the loading script is buggy, because the data is in a format that cannot be streamed, etc.

The script https://github.com/huggingface/datasets-preview-backend/blob/master/quality/test_datasets.py tests the three endpoints on all the datasets in hf.co and outputs a data file that can be analysed in https://observablehq.com/@huggingface/quality-assessment-of-datasets-loading.

The goal is to understand which problems arise most and try to fix the ones that can be addressed (here or in datasets) so that the largest part of the hf.co datasets can be previewed.

Enable the private datasets

The code is already present to pass the token, but it's disabled in the code (hardcoded):

https://github.com/huggingface/datasets-preview-backend/blob/df04ffba9ca1a432ed65e220cf7722e518e0d4f8/src/datasets_preview_backend/cache.py#L119-L120

  • enable private datasets and manage their cache adequately
  • separate private caches from public caches: for authenticated requests, we need to check every time, or at least use a much lower TTL, because an access can be removed. Also: since a hub dataset can be turned private, how should we manage them?
  • add doc. See f6576d5

Cache the responses

The datasets generally don't change often, so it's surely worth caching the responses.

Three levels of cache are involved:

  • client (browser, moon-landing): use Response headers (cache-control, ETag, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching)
  • application: serve the cached responses. Invalidation of the cache for a given request:
    • when the request arrives, if the TTL has finished
    • when a webhook has been received for the dataset (see below)
    • (needed?) when a dedicated background process is launched to refresh the cache (cron)
  • datasets: the library manages its own cache to avoid unneeded downloads

Here we will implement the application cache, and provide the headers for the client cache.

  • cache the responses (content and status code) during a TTL
    • select a cache library:
    • check the size of the cache and allocate sufficient resources. Note that every request generates a very small JSON (in the worst case, it's the dataset-info.json file, for ~3,000 datasets, else it's a JSON with at most some strings). The only problem would be if we're flooded by random requests (which generate 404 errors and are cached). Anyway, there is a limit to 1GB (the default in diskcache)
    • generate a key from the request
    • store and retrieve the response
    • specify the TTL
    • configure the TTL as an option
  • set the cache-control header so that the client (browser) doesn't retry during some time. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching
    • TTL: return "max-age" in the Cache-Control header (computed based on the server TTL to match the same date? or set Expires instead?)
    • Manage If-Modified-Since header in request?: no, this header works with Last-Modified, not Cache-Control/ max-age
  • manage concurrency
    • allow launching various workers. Done with WEB_CONCURRENCY - currently hardcoded to 1
    • migrate to Redis - the cache service will be separated from the application -> moved to a dedicated issue: #31
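The key generation, storage, and TTL steps above can be sketched with a minimal in-memory cache (the class and method names are illustrative; diskcache or Redis would replace the dict in practice):

```python
import time

class TTLCache:
    """Store the (status, body) of a response under a key derived from the
    request, and expire entries after `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict = {}

    def key(self, endpoint: str, **params) -> tuple:
        # deterministic key from the endpoint and sorted query parameters
        return (endpoint, tuple(sorted(params.items())))

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # stale: invalidate on access
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

The TTL would come from configuration, and the stored value would include the status code so that errors (e.g. 404 responses) are cached too.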

The scope of this issue has been reduced. See the next issues:

And maybe:

Update canonical datasets using a webhook

Webhook invalidation of canonical datasets (GitHub):

  • setup the revision argument to download datasets from the master branch - #119
  • set up a webhook on datasets library on every push to the master branch - see https://github.com/huggingface/moon-landing/issues/1345 - not needed anymore because the canonical datasets are mirrored to the hub.
  • add an endpoint to listen to the webhook
  • parse the webhook to find which caches should be invalidated (creation, update, deletion)
  • refresh these caches

Establish and meet SLO

https://en.wikipedia.org/wiki/Service-level_objective

as stated in #1 (comment):

we need to "guarantee" that row fetches from moon-landing will be under a specified latency (to be discussed), even in the case of cache misses in datasets-preview-backend

because the data will be needed at server-rendering time, for content to be parsed by Google

What's a reasonable latency you think you can achieve?

If it's too long we might want to pre-warm the cache for all (streamable) datasets, using a system based on webhooks from moon-landing for instance

See also #3 for the cache.

Instrument the application

Measure the response time, status codes, RAM usage, etc. to be able to take decisions (see #1). Also gather statistics about the most common requests (endpoint, dataset, parameters).
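A minimal sketch of in-process instrumentation (all names are assumptions; a real deployment would export these metrics to a monitoring system): wrap a handler to record its response time and status code per endpoint.

```python
import time
from collections import defaultdict

# per-endpoint metrics: response times and (endpoint, status) counts
response_times: dict = defaultdict(list)
status_counts: dict = defaultdict(int)

def instrument(endpoint: str, handler):
    """Wrap a handler returning (status, body) and record its metrics."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        status, body = handler(*args, **kwargs)
        response_times[endpoint].append(time.perf_counter() - start)
        status_counts[(endpoint, status)] += 1
        return status, body
    return wrapped
```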

Prevent DoS when accessing some datasets

For example, the https://huggingface.co/datasets/allenai/c4 script makes 69,219 outgoing requests for every incoming request, which occupies all the CPUs.

pm2 logs
0|datasets | INFO:     3.238.194.17:0 - "GET /configs?dataset=allenai/c4 HTTP/1.1" 200 OK
Check remote data files:  78%|███████▊  | 54330/69219 [14:13<03:05, 80.10it/s]
Check remote data files:  79%|███████▊  | 54349/69219 [14:13<03:51, 64.14it/s]
Check remote data files:  79%|███████▊  | 54364/69219 [14:14<04:44, 52.14it/s]
Check remote data files:  79%|███████▊  | 54375/69219 [14:14<04:48, 51.38it/s]
Check remote data files:  79%|███████▊  | 54448/69219 [14:15<02:37, 93.81it/s]
Check remote data files:  79%|███████▉  | 54543/69219 [14:15<01:56, 125.60it/s]
Check remote data files:  79%|███████▉  | 54564/69219 [14:16<03:22, 72.33it/s]

Manage concurrency

Currently (in the cache branch), only one worker is allowed.

We want to have multiple workers, but for that we need to have a shared cache:

  • migrate from diskcache to redis
  • remove the hardcoded limit of 1 worker
