nevakrien / contrastors_xpu

This project forked from nomic-ai/contrastors

Train Models Contrastively in Pytorch

License: Apache License 2.0

Shell 0.49% Python 99.45% Makefile 0.06%

contrastors

contrastors is a contrastive learning toolkit that enables researchers and engineers to train and evaluate contrastive models efficiently.

Features

  • Built on top of Flash Attention for fast and efficient training
  • Support for training on multiple GPUs
  • GradCache support for training with large batch sizes in constrained memory environments
  • Huggingface Support for easy loading of common models (Pythia/GPTNeoX, BERT, etc.)
  • Masked Language Modeling (MLM) Pretraining
  • Matryoshka Representation Learning for flexible embedding sizes

Research

Getting Started and Requirements

The contrastors library relies on custom kernels from the Flash Attention repository. To set up your environment, follow the steps below.

Make sure that you have CUDA 11.8+. You can check this by running nvcc --version or, if you already have torch installed, by running python -c "import torch; print(torch.version.cuda)"

Create a Python venv and activate it

python3 -m venv env
source env/bin/activate

Install torch. See the torch docs for specific instructions for your system (e.g. the default CUDA version that torch supports is 12.1 as of 12/12/2023).

pip3 install torch torchvision torchaudio

Install wheel, packaging, and ninja for Flash Attention (so the builds don't take too long)

pip install wheel packaging ninja

Install Flash Attention and the custom kernels

pip install --no-cache-dir flash-attn --no-build-isolation \
  git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary \
  git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/layer_norm \
  git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/fused_dense_lib \
  git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/xentropy

Install the rest of the requirements and the package

pip install -e . 

Data Access

We provide access to the nomic-embed-text-v1 dataset via the nomic package. To access the data, you will need to create an account and log in through the nomic package. First create an account at atlas.nomic.ai, download the nomic Python client, and run the following commands:

pip install nomic
nomic login # follow prompts to log in
python -c "from nomic import atlas; print(atlas._get_datastream_credentials(name='contrastors'))"

which will print out your access keys. You can then configure them by using aws configure or setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

If you do not have the AWS CLI installed, you can install it here.

To verify your access, you can run the following command to list the contents of the bucket:

aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive
aws s3 ls --endpoint-url=https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/ s3://contrastive-index-filtered

You should be able to see the contents of the bucket and download the data.

If you intend to train using our data and the contrastors repo, you will need to set up fsspec support for Cloudflare R2. To do so, create a file ~/.config/fsspec/s3.json with the following contents:

{
  "s3": {
    "client_kwargs": {
      "endpoint_url": "https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/",
      "aws_access_key_id": "<ACCESS_KEY_ID>",
      "aws_secret_access_key": "<SECRET_KEY_ID>"
    }
  }
}
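
As a minimal sketch, the same file could also be generated from Python. The helper names build_s3_config and write_s3_config are hypothetical and not part of contrastors or fsspec; the demo writes to a temporary directory so it does not touch an existing configuration.

```python
import json
import tempfile
from pathlib import Path

# Endpoint copied from the instructions above.
R2_ENDPOINT = "https://9fa58365a1a3d032127970d0bd9a1290.r2.cloudflarestorage.com/"


def build_s3_config(access_key_id: str, secret_access_key: str) -> dict:
    """Return a dict with the same shape as the s3.json shown above."""
    return {
        "s3": {
            "client_kwargs": {
                "endpoint_url": R2_ENDPOINT,
                "aws_access_key_id": access_key_id,
                "aws_secret_access_key": secret_access_key,
            }
        }
    }


def write_s3_config(config: dict, path: Path) -> Path:
    """Write the config as JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return path


if __name__ == "__main__":
    # Demo only: for real use, point the path at
    # Path.home() / ".config" / "fsspec" / "s3.json" and pass your own keys.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "fsspec" / "s3.json"
        write_s3_config(build_s3_config("<ACCESS_KEY_ID>", "<SECRET_KEY_ID>"), target)
        print(target.read_text())
```

Generating the file with json.dumps guarantees that the key values are quoted strings, which a hand-edited file can easily get wrong.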

Nomic Data Format

Our text data is stored in gzipped JSONL files, alongside which we store a counts.json file and an offsets.json.gzip file.

The counts.json file is a dictionary mapping the file name to the number of examples in the file. The offsets.json.gz file is a dictionary mapping the file name to a dictionary where each key is the index of the example and the value is a tuple of the start and end byte offset of the example in the file. We do this to allow for streaming of data in from R2, especially when the data is larger than the buffer size.
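
To illustrate the idea, here is a minimal in-memory sketch of a counts mapping and an offsets mapping for one shard. It is not the contrastors implementation: the helper names are hypothetical, the shard name is made up, and it assumes that offsets are (start, end) byte positions with an exclusive end, measured on the uncompressed JSONL bytes. The real files may differ on all of these points.

```python
import json


def build_index(file_name, examples):
    """Serialize examples to JSONL bytes and build counts/offsets mappings.

    Assumption: offsets are (start, end) byte positions, end exclusive.
    """
    buffer = bytearray()
    per_example = {}
    for index, example in enumerate(examples):
        line = (json.dumps(example) + "\n").encode("utf-8")
        start = len(buffer)
        buffer.extend(line)
        # JSON object keys are strings, so the example index is stored as one.
        per_example[str(index)] = (start, len(buffer))
    counts = {file_name: len(examples)}
    offsets = {file_name: per_example}
    return bytes(buffer), counts, offsets


def read_example(data, offsets, file_name, index):
    """Fetch a single example by slicing the byte range recorded for it."""
    start, end = offsets[file_name][str(index)]
    return json.loads(data[start:end])


if __name__ == "__main__":
    examples = [
        {"query": "q0", "document": "d0"},
        {"query": "q1", "document": "d1"},
        {"query": "q2", "document": "d2"},
    ]
    data, counts, offsets = build_index("shard-00000.jsonl.gz", examples)
    print(counts)
    print(read_example(data, offsets, "shard-00000.jsonl.gz", 1))
```

With an index like this, a reader only needs the byte range of the example it wants, so it can request that range from remote storage instead of downloading the whole shard.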

Here's a small example of what a dataset configuration might look like:

datasets:
  - name: "paq"
    bucket: "s3://contrastive-index-filtered/paq_full/shard-{00000..00538}.jsonl.gz"
    query_prefix: "search_query"
    document_prefix: "search_document"
    objective: 
        type: "paired"
        columns: ["query", "document"]

The objective field specifies whether the dataset uses a paired or a triplet objective. In both cases, the columns field lists the columns to use for each example.
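
The bucket value in the configuration above uses a brace range, {00000..00538}, to name many shards at once. As a rough sketch of what such a pattern stands for, the hypothetical helper expand_shards below expands a single zero-padded range into explicit paths. How contrastors itself resolves the pattern is not shown here, so treat this as an illustration only.

```python
import re

# Matches a single numeric brace range such as {00000..00538}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern: str) -> list[str]:
    """Expand the first {start..end} range in the pattern, inclusive on both ends.

    Zero padding follows the width of the start value. Patterns without a
    range are returned unchanged as a one-element list.
    """
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    start_text, end_text = match.groups()
    width = len(start_text)
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start_text), int(end_text) + 1)
    ]


if __name__ == "__main__":
    shards = expand_shards(
        "s3://contrastive-index-filtered/paq_full/shard-{00000..00538}.jsonl.gz"
    )
    print(len(shards))
    print(shards[0])
    print(shards[-1])
```

Because the range is inclusive, the pattern in the example configuration covers 539 shard files, from shard-00000 through shard-00538.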

Training nomic-embed-text-v1

Masked Language Modeling Pretraining

To train your own BERT from scratch (with all the optimizations) run

cd src/contrastors
deepspeed --num_gpus=8 train.py --config=configs/train/mlm.yaml --deepspeed_config_file=configs/deepspeed/ds_config.json --dtype=bf16

Contrastive Pretraining and Finetuning

To launch an experiment run

cd src/contrastors
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16

This will train a BERT model on all ~200M examples. To change the dataset, modify data_args.input_shards in the config.

To finetune nomic-bert-embed-v1-unsupervised, update the config to configs/train/contrastive_finetune.yaml.

Generating Your Own Data

To generate your own data for any step of the pipeline, you can use the provided scripts in scripts/text.

See the README in scripts/text for more information.

Pretrained Models

We provide pretrained models for nomic-embed-text-v1 at the following locations:

Join the Nomic Community

License

This project and its models are licensed under the Apache 2.0 License.

Acknowledgements

We thank Tri Dao for his work on Flash Attention and the custom kernels that make this project possible, the OpenCLIP team for their great repository, on which much of this work is based, and the Huggingface team for their great work on the transformers library.

Citation

If you find the model, dataset, or training code useful, please cite our work:

@misc{nussbaum2024nomic,
      title={Nomic Embed: Training a Reproducible Long Context Text Embedder}, 
      author={Zach Nussbaum and John X. Morris and Brandon Duderstadt and Andriy Mulyar},
      year={2024},
      eprint={2402.01613},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributors

zanussbaum, jxmorris12, nevakrien
