
Lightweight web API for visualizing and exploring any dataset - computer vision, speech, text, and tabular - stored on the Hugging Face Hub

Home Page: https://huggingface.co/docs/datasets-server

License: Apache License 2.0



Dataset viewer

Integrate over 10,000 datasets into your apps via simple HTTP requests, with pre-processed responses and built-in scalability.

Documentation: https://huggingface.co/docs/datasets-server

Ask for a new feature 🎁

The dataset viewer pre-processes the Hugging Face Hub datasets so that they are ready to use in your apps through the API: list of the splits, first rows.

We plan to add more features to the server. Please comment on the existing feature requests and upvote your favorites.

If you have an idea for a new feature, please open a new issue.

Contribute 🤝

You can help by giving ideas, answering questions, reporting bugs, proposing enhancements, improving the documentation, and fixing bugs. See CONTRIBUTING.md for more details.

To install the server and start contributing to the code, see DEVELOPER_GUIDE.md.

Community 🤗

You can star and watch this GitHub repository to follow the updates.

You can ask for help or answer questions on the Forum and Discord.

You can also report bugs and propose enhancements for the code or the documentation in the GitHub issues.

datasets-server's People

Contributors

albertvillanova, andreafrancis, baskrahmer, ccl-core, coyotte508, dependabot[bot], egndz, geethika-123, glegendre01, jatinkumar001, julien-c, keleffew, lhoestq, lysandrejik, marcenacp, mariosasko, mishig25, n1t0, polinaeterna, rtrompier, severo, stevhliu, xcid


datasets-server's Issues

exception seen during `make benchmark`

I'm not sure which dataset threw this exception, which is why I included the preceding rows for further investigation.

poetry run python ../scripts/get_rows_report.py wikiann___CONFIG___or___SPLIT___test ../tmp/get_rows_reports/wikiann___CONFIG___or___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py csebuetnlp___SLASH___xlsum___CONFIG___uzbek___SPLIT___test ../tmp/get_rows_reports/csebuetnlp___SLASH___xlsum___CONFIG___uzbek___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py clips___SLASH___mfaq___CONFIG___no___SPLIT___train ../tmp/get_rows_reports/clips___SLASH___mfaq___CONFIG___no___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py common_voice___CONFIG___rm-vallader___SPLIT___train ../tmp/get_rows_reports/common_voice___CONFIG___rm-vallader___SPLIT___train.json
https://media.githubusercontent.com/media/persiannlp/parsinlu/master/data/translation/translation_combined_fa_en/test.tsv
poetry run python ../scripts/get_rows_report.py pasinit___SLASH___xlwic___CONFIG___xlwic_en_da___SPLIT___train ../tmp/get_rows_reports/pasinit___SLASH___xlwic___CONFIG___xlwic_en_da___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py indic_glue___CONFIG___wstp.mr___SPLIT___validation ../tmp/get_rows_reports/indic_glue___CONFIG___wstp.mr___SPLIT___validation.json
poetry run python ../scripts/get_rows_report.py banking77___CONFIG___default___SPLIT___test ../tmp/get_rows_reports/banking77___CONFIG___default___SPLIT___test.json
poetry run python ../scripts/get_rows_report.py gem___CONFIG___xsum___SPLIT___challenge_test_bfp_05 ../tmp/get_rows_reports/gem___CONFIG___xsum___SPLIT___challenge_test_bfp_05.json
poetry run python ../scripts/get_rows_report.py turingbench___SLASH___TuringBench___CONFIG___TT_fair_wmt19___SPLIT___validation ../tmp/get_rows_reports/turingbench___SLASH___TuringBench___CONFIG___TT_fair_wmt19___SPLIT___validation.json
poetry run python ../scripts/get_rows_report.py igbo_monolingual___CONFIG___eze_goes_to_school___SPLIT___train ../tmp/get_rows_reports/igbo_monolingual___CONFIG___eze_goes_to_school___SPLIT___train.json
poetry run python ../scripts/get_rows_report.py flax-sentence-embeddings___SLASH___stackexchange_titlebody_best_voted_answer_jsonl___CONFIG___gamedev___SPLIT___train ../tmp/get_rows_reports/flax-sentence-embeddings___SLASH___stackexchange_titlebody_best_voted_answer_jsonl___CONFIG___gamedev___SPLIT___train.json
 * skipping . . .
Exception ignored in: <generator object ParsinluReadingComprehension._generate_examples at 0x7f094caa6dd0>
Traceback (most recent call last):
  File "/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 79, in __iter__
    yield key, example
RuntimeError: generator ignored GeneratorExit

Add a parameter to specify the number of rows

It's a problem for the cache, so until we manage random access, we can:

  • fill the cache with a large (maximum) number of rows, i.e. up to 1000
  • also cache the default request (N = 100) -> set to the parameter used in moon-landing
  • if a request comes with N = 247, for example, generate the response on the fly, from the large cache (1000), and don't cache that response

Expand the purpose of this backend?

Depending on the evolution of https://github.com/huggingface/datasets, this project might disappear, or its features might be reduced, in particular if, one day, that library allows caching the data by self-generating:

  • an arrow or a parquet data file (maybe with sharding and compression for the largest datasets)
  • or a SQL database
  • or precompute and store a partial list of known offsets (every 10MB for example)

That would allow random access to the data.

Get random access to the rows

Currently, only the first rows can be obtained with /rows. We want access to slices of the rows through pagination, e.g. /rows?from=40000&rows=10
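Until true random access exists, a paginated response can only be produced by iterating the stream up to the offset. A minimal sketch (the function name is an assumption) of serving /rows?from=40000&rows=10 from a streamed iterable of rows:

```python
from itertools import islice

def get_rows_slice(rows, offset: int, length: int) -> list:
    """Return rows [offset, offset + length) from an iterable of rows.

    Without a precomputed index this still costs O(offset) iteration,
    which is why genuine random access to the data is needed.
    """
    return list(islice(rows, offset, offset + length))
```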

warm the cache

Warm the cache at application startup. We want:

  • to avoid blocking the application, so: run asynchronously, and without hammering the server

  • to have a warm cache as fast as possible (persisting the previous cache, then refreshing it at startup? - related: #35 )

  • create a function to list all the datasets and fill the cache for all the possible requests for it. It might be make benchmark, or a specific function -> make warm

  • persist the cache? or start with an empty cache when the application is restarted? -> yes, persisted

  • launch it at application startup -> it's done at startup, see INSTALL.md.

cache both the functions returns and the endpoints results

Currently, only the endpoint results are cached. We use them inside the code to get quick results by taking advantage of the cache, but that is not their purpose, and we have to parse/decode them.

It would be better to directly cache the results of the functions (memoize).

Also: we could cache the raised exceptions as here:

https://github.com/peterbe/django-cache-memoize/blob/4da1ba4639774426fa928d4a461626e6f841b4f3/src/cache_memoize/__init__.py#L153L157
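A minimal sketch of such a memoizer, caching raised exceptions as well as return values, similar in spirit to the django-cache-memoize link above. The decorator name and the in-memory dict are assumptions; the real implementation would use the application cache.

```python
import functools

def memoize_with_exceptions(func):
    """Cache both return values and raised exceptions."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args in cache:
            outcome, value = cache[args]
            if outcome == "raised":
                raise value  # replay the cached exception
            return value
        try:
            result = func(*args)
        except Exception as err:
            cache[args] = ("raised", err)
            raise
        cache[args] = ("ok", result)
        return result

    return wrapper
```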

Provide the ETag header

  • set and manage the ETag header to save bandwidth when the client (browser) revalidates. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching and https://gist.github.com/timheap/1f4d9284e4f4d4f545439577c0ca6300
        # TODO: use key for ETag? It will need to be serialized
        # key = get_rows_json.__cache_key__(
        #     dataset=dataset, config=config, split=split, num_rows=num_rows, token=request.user.token
        # )
        # print(f"key={key} in cache: {cache.__contains__(key)}")
  • ETag: add an ETag header in the response (hash of the response)
  • ETag: if the request contains the If-None-Match, parse its ETag (beware the "weak" ETags), compare to the cache, and return an empty 304 response if the cache is fresh (with or without changing the TTL), or 200 with content if it has changed

CI: how to acknowledge a "safety" warning?

We use safety to check for vulnerabilities in the dependencies. But in the case below, tensorflow is marked as insecure while the latest version published on PyPI is still 2.6.0. What should we do in this case?

+==============================================================================+
|                                                                              |
|                               /$$$$$$            /$$                         |
|                              /$$__  $$          | $$                         |
|           /$$$$$$$  /$$$$$$ | $$  \__//$$$$$$  /$$$$$$   /$$   /$$           |
|          /$$_____/ |____  $$| $$$$   /$$__  $$|_  $$_/  | $$  | $$           |
|         |  $$$$$$   /$$$$$$$| $$_/  | $$$$$$$$  | $$    | $$  | $$           |
|          \____  $$ /$$__  $$| $$    | $$_____/  | $$ /$$| $$  | $$           |
|          /$$$$$$$/|  $$$$$$$| $$    |  $$$$$$$  |  $$$$/|  $$$$$$$           |
|         |_______/  \_______/|__/     \_______/   \___/   \____  $$           |
|                                                          /$$  | $$           |
|                                                         |  $$$$$$/           |
|  by pyup.io                                              \______/            |
|                                                                              |
+==============================================================================+
| REPORT                                                                       |
| checked 137 packages, using free DB (updated once a month)                   |
+============================+===========+==========================+==========+
| package                    | installed | affected                 | ID       |
+============================+===========+==========================+==========+
| tensorflow                 | 2.6.0     | ==2.6.0                  | 41161    |
+==============================================================================+

`make benchmark` is very long and blocks

Sometimes make benchmark blocks (nothing happens, only one process is running, and the load is low). Ideally it would not block, and other processes would be launched anyway so that the full capacity of the CPUs is used (the -j and -l 7 parameters of make).

To unblock, I have to kill and relaunch make benchmark manually.

Add CI

Check types and code quality

/splits does not error when no config exists and a wrong config is passed

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config=doesnotexist

returns:

{

    "splits": [
        {
            "dataset": "sent_comp",
            "config": "doesnotexist",
            "split": "validation"
        },
        {
            "dataset": "sent_comp",
            "config": "doesnotexist",
            "split": "train"
        }
    ]

}

instead of giving an error.

As https://datasets-preview.huggingface.tech/configs?dataset=sent_comp returns

{
    "configs": [
        {
            "dataset": "sent_comp",
            "config": "default"
        }
    ]
}

the only allowed config parameter should be default.

Regenerate dataset-info instead of loading it?

Currently, getting the rows with /rows requires a previous (internal) call to /infos to get the features (the column types). But sometimes the dataset-info.json file is missing, or not consistent with the dataset script (for example: https://huggingface.co/datasets/lhoestq/custom_squad/tree/main), while we use datasets.get_dataset_infos(), which only loads the exported dataset-info.json files:

https://github.com/huggingface/datasets-preview-backend/blob/c2a78e7ce8e36cdf579fea805535fa9ef84a2027/src/datasets_preview_backend/queries/infos.py#L45

https://github.com/huggingface/datasets/blob/26ff41aa3a642e46489db9e95be1e9a8c4e64bea/src/datasets/inspect.py#L115

We might want to call ._info() on the builder to get the info and features, instead of relying on the dataset-info.json file.

Scale the application

Both uvicorn and pm2 allow specifying the number of workers. pm2 seems interesting since it provides a way to increase or decrease the number of workers without a restart.

But before using multiple workers, it's important to instrument the app in order to detect whether we need them (e.g., by monitoring the response time).

Properly manage the case config is None

For example:

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config=null returns

{
    "dataset": "sent_comp",
    "config": "null",
    "splits": [
        "validation",
        "train"
    ]
}

this should have errored since there is no "null" config (it's null).

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp&config= returns

The split names could not be parsed from the dataset config.

https://datasets-preview.huggingface.tech/splits?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "config": null,
    "splits": [
        "validation",
        "train"
    ]
}

As a reference for the same dataset https://datasets-preview.huggingface.tech/configs?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "configs": [
        null
    ]
}

and https://datasets-preview.huggingface.tech/info?dataset=sent_comp returns

{
    "dataset": "sent_comp",
    "info": {
        "default": {
            ...
            "builder_name": "sent_comp",
            "config_name": "default",
            ...
            "splits": {
                "validation": {
                    "name": "validation",
                    "num_bytes": 55823979,
                    "num_examples": 10000,
                    "dataset_name": "sent_comp"
                },
                "train": {
                    "name": "train",
                    "num_bytes": 1135684803,
                    "num_examples": 200000,
                    "dataset_name": "sent_comp"
                }
            },
            ...
        }
    }
}

Increase the proportion of hf.co datasets that can be previewed

For different reasons, some datasets cannot be previewed. It might be because the loading script is buggy, because the data is in a format that cannot be streamed, etc.

The script https://github.com/huggingface/datasets-preview-backend/blob/master/quality/test_datasets.py tests the three endpoints on all the datasets in hf.co and outputs a data file that can be analysed in https://observablehq.com/@huggingface/quality-assessment-of-datasets-loading.

The goal is to understand which problems arise most and try to fix the ones that can be addressed (here or in datasets) so that the largest part of the hf.co datasets can be previewed.

Enable the private datasets

The code is already present to pass the token, but it's disabled in the code (hardcoded):

https://github.com/huggingface/datasets-preview-backend/blob/df04ffba9ca1a432ed65e220cf7722e518e0d4f8/src/datasets_preview_backend/cache.py#L119-L120

  • enable private datasets and manage their cache adequately
  • separate private caches from public caches: for authenticated requests, we need to check every time, or at least use a much lower TTL, because an access can be removed. Also: since a hub dataset can be turned private, how should we manage them?
  • add doc. See f6576d5

Cache the responses

The datasets generally don't change often, so it's surely worth caching the responses.

Three levels of cache are involved:

  • client (browser, moon-landing): use Response headers (cache-control, ETag, see https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching)
  • application: serve the cached responses. Invalidation of the cache for a given request:
    • when the request arrives, if the TTL has finished
    • when a webhook has been received for the dataset (see below)
    • (needed?) when a dedicated background process is launched to refresh the cache (cron)
  • datasets: the library manages its own cache to avoid unneeded downloads

Here we will implement the application cache, and provide the headers for the client cache.

  • cache the responses (content and status code) during a TTL
    • select a cache library:
    • check the size of the cache and allocate sufficient resources. Note that every request generates a very small JSON (in the worst case, it's the dataset-info.json file, for ~3,000 datasets, else it's a JSON with at most some strings). The only problem would be if we're flooded by random requests (which generate 404 errors and are cached). Anyway, there is a limit to 1GB (the default in diskcache)
    • generate a key from the request
    • store and retrieve the response
    • specify the TTL
    • configure the TTL as an option
  • set the cache-control header so that the client (browser) doesn't retry during some time. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching
    • TTL: return "max-age" in the Cache-Control header (computed based on the server TTL to match the same date? or set Expires instead?)
    • Manage If-Modified-Since header in request?: no, this header works with Last-Modified, not Cache-Control/ max-age
  • manage concurrency
    • allow launching various workers. Done with WEB_CONCURRENCY - currently hardcoded to 1
    • migrate to Redis - the cache service will be separated from the application -> moved to a dedicated issue: #31
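The key generation, storage, and TTL steps above can be sketched with a minimal in-memory cache (the class and method names are illustrative; diskcache or Redis would replace the dict in practice):

```python
import time

class TTLCache:
    """Store the (status, body) of a response under a key derived from the
    request, and expire entries after `ttl` seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict = {}

    def key(self, endpoint: str, **params) -> tuple:
        # deterministic key from the endpoint and sorted query parameters
        return (endpoint, tuple(sorted(params.items())))

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # stale: invalidate on access
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

The TTL would come from configuration, and the stored value would include the status code so that errors (e.g. 404 responses) are cached too.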

The scope of this issue has been reduced. See the next issues:

And maybe:

Update canonical datasets using a webhook

Webhook invalidation of canonical datasets (GitHub):

  • setup the revision argument to download datasets from the master branch - #119
  • set up a webhook on datasets library on every push to the master branch - see https://github.com/huggingface/moon-landing/issues/1345 - not needed anymore because the canonical datasets are mirrored to the hub.
  • add an endpoint to listen to the webhook
  • parse the webhook to find which caches should be invalidated (creation, update, deletion)
  • refresh these caches

Establish and meet SLO

https://en.wikipedia.org/wiki/Service-level_objective

as stated in #1 (comment):

we need to "guarantee" that row fetches from moon-landing will be under a specified latency (to be discussed), even in the case of cache misses in datasets-preview-backend

because the data will be needed at server-rendering time, for content to be parsed by Google

What's a reasonable latency you think you can achieve?

If it's too long we might want to pre-warm the cache for all (streamable) datasets, using a system based on webhooks from moon-landing for instance

See also #3 for the cache.

Instrument the application

Measure the response time, status codes, RAM usage, etc. to be able to take decisions (see #1). Also gather statistics about the most common requests (endpoint, dataset, parameters).
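A minimal sketch of in-process instrumentation (all names are assumptions; a real deployment would export these metrics to a monitoring system): wrap a handler to record its response time and status code per endpoint.

```python
import time
from collections import defaultdict

# per-endpoint metrics: response times and (endpoint, status) counts
response_times: dict = defaultdict(list)
status_counts: dict = defaultdict(int)

def instrument(endpoint: str, handler):
    """Wrap a handler returning (status, body) and record its metrics."""
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        status, body = handler(*args, **kwargs)
        response_times[endpoint].append(time.perf_counter() - start)
        status_counts[(endpoint, status)] += 1
        return status, body
    return wrapped
```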

Prevent DoS when accessing some datasets

For example, the https://huggingface.co/datasets/allenai/c4 script makes 69,219 outgoing requests for every incoming request, which occupies all the CPUs.

pm2 logs
0|datasets | INFO:     3.238.194.17:0 - "GET /configs?dataset=allenai/c4 HTTP/1.1" 200 OK
Check remote data files:  78%|███████▊  | 54330/69219 [14:13<03:05, 80.10it/s]
Check remote data files:  79%|███████▊  | 54349/69219 [14:13<03:51, 64.14it/s]
Check remote data files:  79%|███████▊  | 54364/69219 [14:14<04:44, 52.14it/s]
Check remote data files:  79%|███████▊  | 54375/69219 [14:14<04:48, 51.38it/s]
Check remote data files:  79%|███████▊  | 54448/69219 [14:15<02:37, 93.81it/s]
Check remote data files:  79%|███████▉  | 54543/69219 [14:15<01:56, 125.60it/s]
Check remote data files:  79%|███████▉  | 54564/69219 [14:16<03:22, 72.33it/s]

Manage concurrency

Currently (in the cache branch), only one worker is allowed.

We want to have multiple workers, but for that we need to have a shared cache:

  • migrate from diskcache to redis
  • remove the hardcoded limit of 1 worker
