mike100101100011 / emu-gpt4all-datalake-fk

This project forked from nomic-ai/gpt4all-datalake

API to the GPT4All Datalake

License: Apache License 2.0

gpt4all-datalake

An open-source datalake to ingest, organize and efficiently store all data contributions made to gpt4all.

Hosted version: https://api.gpt4all.io

Architecture

The core datalake architecture is a simple HTTP API (written in FastAPI) that ingests JSON in a fixed schema, performs some integrity checking and stores it. The JSON is then transformed into storage-efficient Arrow/Parquet files and written to a target filesystem.
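
The integrity-checking step can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the `Contribution` class, its field names and the `parse_contribution` helper are all hypothetical, and the actual schema is whatever the repository's input data model defines.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Contribution:
    # Hypothetical fields for illustration only; the real input data
    # model lives in the repository and may look quite different.
    source: str
    prompt: str
    response: str
    submitter_id: str = "anonymous"


def parse_contribution(raw: str) -> Contribution:
    """Parse a JSON string and reject payloads that do not fit the fixed schema."""
    payload = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")

    # Reject fields the schema does not know about.
    allowed = {f.name for f in fields(Contribution)}
    unknown = set(payload) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    # In this sketch every field is a string.
    for name, value in payload.items():
        if not isinstance(value, str):
            raise ValueError(f"field {name!r} must be a string")

    try:
        return Contribution(**payload)
    except TypeError as exc:  # raised when a required field is missing
        raise ValueError(str(exc)) from exc
```

In a FastAPI application this kind of check is typically delegated to the request model rather than written by hand; the sketch only shows what "a fixed schema plus integrity checking" amounts to.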

Data formats

  • Data is stored on disk or S3 in Parquet files, in subdirectories organized by day. These Parquet files share a standardized schema, allowing for easy manipulation in any programming language.
  • The input data model can be found here.
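
As a rough sketch of the day-based layout described above, the helper below builds a storage key for one Parquet file. The `YYYY-MM-DD` directory name, the use of UTC and the `partition_key` function itself are assumptions made for illustration; the text does not specify the exact naming scheme the datalake uses.

```python
from datetime import datetime, timezone


def partition_key(root: str, received_at: datetime, file_id: str) -> str:
    """Build a key of the form <root>/<day>/<file_id>.parquet.

    `received_at` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    # Assumed convention: one subdirectory per UTC calendar day.
    day = received_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"{root.rstrip('/')}/{day}/{file_id}.parquet"
```

Because the key is a plain string, the same function works for a local directory or an S3 prefix, and all files for a given day can be found by listing a single subdirectory.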

Open sourcing the data

Nomic AI will provide automatic snapshots of this raw parquet data. You will be able to interact with the snapshots:

  • In their raw exported form.
  • In automatic Atlas maps over their raw, cleaned and curated forms.
  • Through downloads where the data has been curated, de-duplicated and cleaned for LLM training/finetuning.

Data Privacy

By sending data to the GPT4All-Datalake you agree to the following.

Data sent to this datalake will be used to train open-source large language models and will be released to the public. There is no expectation of privacy for any data entering this datalake. You can, however, expect attribution. If you attach a unique identifier that associates you as the data contributor, Nomic will retain that identifier in any LLM training runs that it conducts. You will receive credit and public attribution if Nomic releases any model trained on your submitted data. You can also submit data anonymously.

Where does the gpt4all-datalake run?

While open-sourced under the Apache License 2.0, this datalake runs on infrastructure managed and paid for by Nomic AI. You are welcome to run this datalake on your own infrastructure! We just ask that you also release the underlying data that gets sent into it under the same attribution terms.

Development

  1. Clone the repository.
  2. Run make testenv to build all docker images and launch the HTTP server.
  3. Go to 'http://localhost/docs' to view the API documentation.
  4. You can run the unit tests with make test. Any edits made to the FastAPI app will hot reload.
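
Once the server from step 2 is up, its routes can also be listed from a script. The sketch below assumes the app keeps FastAPI's default OpenAPI location, `/openapi.json`, on the same host as the docs page; `fetch_openapi` and `list_routes` are hypothetical helper names, not part of this repository.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_openapi(base_url: str = "http://localhost") -> dict:
    """Download the OpenAPI document from a running server.

    Assumes the default FastAPI path /openapi.json has not been changed.
    """
    with urlopen(f"{base_url}/openapi.json", timeout=5) as resp:
        return json.load(resp)


def list_routes(spec: dict) -> list[str]:
    """Return 'METHOD /path' strings for every operation in an OpenAPI document."""
    routes = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for key in sorted(item):
            # Path items can also hold non-operation keys such as "parameters".
            if key in HTTP_METHODS:
                routes.append(f"{key.upper()} {path}")
    return routes
```

With the test environment running, `print("\n".join(list_routes(fetch_openapi())))` prints one line per endpoint.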

emu-gpt4all-datalake-fk's People

Contributors

andriymulyar, yarikoptic
