
LLMindCraft


Shaping Language Models with Cognitive Insights

LLMindCraft is licensed under CC BY-NC-ND 4.0.

Docker environment

docker pull tothemoon/llm

This image packages the complete environment required by LLMindCraft.

Fine-tuning in Docker environment

For a single node:

docker run --gpus all \
    -d --rm \
    --name llm \
    [-v host_path:container_path] \
    [-w workdir] \
    --entrypoint "/bin/bash -c" \
    tothemoon/llm \
    --cmd "sleep infinity"

For multiple nodes:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    --privileged \
    --network host \
    [--env env_variable=value] \
    -d --rm \
    --name llm \
    [-v host_path:container_path] \
    [-v ssh_pub_key:/root/.ssh/authorized_keys] \
    [-w workdir] \
    tothemoon/llm \
    --sshd_port [any_port] --cmd "sleep infinity"

You can then enter the running container with:

docker exec -it llm /bin/bash

Create New Dataset

To add a new dataset, create a data class in preprocess.py in the following format:

class MedMCQA(InstructionDataset):
    dataset = "MedMCQA"
    task_type = "classification"
    choices = ["A", "B", "C", "D"]
    prompt = """Given a medical context and a multiple choice question related to it, select the correct answer from the four options.
Question: {text}
Options: {options}.
Please answer with A, B, C, or D only.
Answer:
"""

    def fetch_data(self, datum):
        return {
            "text": datum["question"],
            "options": ", ".join(
                op + ": " + datum[k]
                for k, op in zip(["opa", "opb", "opc", "opd"], self.choices)
            ),
            "answer": self.choices[datum["cop"] - 1],
        }

In this format:

  • dataset: The dataset name
  • task_type: The task type, currently classification or abstractivesummarization (TODO: more task types)
  • prompt: The prompt template for the task, whose placeholders are later filled with the actual data

For classification tasks, an additional key must be defined:

  • choices: The set of candidate labels

fetch_data is the interface that extracts the required fields from a raw datum.
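To illustrate the flow, the sketch below mirrors MedMCQA.fetch_data as a standalone function and fills the prompt template with its output. The sample record is illustrative, not real MedMCQA data:

```python
# Standalone sketch of how fetch_data feeds the prompt template.
# The sample record below is made up for illustration.

PROMPT = """Given a medical context and a multiple choice question related to it, select the correct answer from the four options.
Question: {text}
Options: {options}.
Please answer with A, B, C, or D only.
Answer:
"""

CHOICES = ["A", "B", "C", "D"]

def fetch_data(datum):
    # Mirror of MedMCQA.fetch_data: join the four option fields and
    # map the 1-based "cop" (correct option) index to a letter label.
    return {
        "text": datum["question"],
        "options": ", ".join(
            op + ": " + datum[k]
            for k, op in zip(["opa", "opb", "opc", "opd"], CHOICES)
        ),
        "answer": CHOICES[datum["cop"] - 1],
    }

sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A", "opb": "Vitamin B12",
    "opc": "Vitamin C", "opd": "Vitamin D",
    "cop": 3,  # 1-based index of the correct option
}

fields = fetch_data(sample)
print(PROMPT.format(text=fields["text"], options=fields["options"]))
print("Expected answer:", fields["answer"])
```

Note that "cop" is 1-based in the raw data, which is why fetch_data subtracts 1 before indexing into choices.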

Then register your class in the DATASETS dictionary:

DATASETS = {
    "MedMCQA": MedMCQA,
}
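The registry lets the pipeline dispatch on a dataset name. A minimal sketch, assuming a hypothetical get_dataset helper (the real lookup lives in preprocess.py):

```python
# Sketch of name-keyed dispatch to dataset classes. MedMCQA here is a
# bare stand-in for the real InstructionDataset subclass.

class MedMCQA:
    dataset = "MedMCQA"
    task_type = "classification"

DATASETS = {
    "MedMCQA": MedMCQA,
}

def get_dataset(name):
    # Fail early with the list of known names on a typo.
    try:
        return DATASETS[name]()
    except KeyError:
        raise ValueError(f"Unknown dataset {name!r}; known: {sorted(DATASETS)}")

ds = get_dataset("MedMCQA")
print(ds.dataset, ds.task_type)
```

Centralizing the mapping in one dictionary means adding a dataset touches exactly two places: the class definition and this registry entry.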

Finally, you can build and upload the dataset by:

bash preprocess.sh

Note that the parameters in preprocess.sh should be adjusted accordingly. For evaluation datasets, pass -for_eval; for instruction-tuning datasets, omit it.

