alx-ai / sciphi

This project forked from sciphi-ai/synthesizer

SciPhi is a simple framework for generating synthetic / fine-tuning data, and for robust evaluation of LLMs.

License: Apache License 2.0

Python 100.00%

SciPhi [ΨΦ]: A Framework for Data Creation

SciPhi is a configurable Python framework designed to tackle the challenges of efficiently training large language models (LLMs) through synthetic data. At its core, SciPhi offers:

  • Configurable Data Generation: Efficiently produce LLM-mediated synthetic training and tuning datasets tailored to your specific needs.
  • The Library of Phi: An initiative to leverage AI-driven techniques to craft high-quality open source textbooks.

Getting Started & Support

  • Engage with our active Discord community for discussions, troubleshooting, and collaboration.

  • For specialized support or collaboration inquiries, feel free to reach out directly.

Library of Phi Generation

Introduction:
The Library of Phi is an initiative sponsored by SciPhi. Its primary goal is to democratize access to high-quality textbooks. The project utilizes AI-driven techniques to generate textbooks by processing information from the MIT OCW course webpages.

Workflow:
The workflow encompasses data scraping, data processing, YAML configuration creation, and RAG over all of Wikipedia, with intermittent work done by LLMs.

  1. Scrape MIT OCW Course Webpages.
  2. Extract Syllabi.
  3. Formulate Table of Contents.
  4. Craft Textbooks.

Generating the default Textbook:

poetry run python sciphi/examples/library_of_phi/generate_textbook.py run --do-wiki=False --textbook=Aerodynamics_of_Viscous_Fluids --log-level=DEBUG

See the example output here

Using a Custom Table of Contents:

  1. Draft a table of contents and save as textbook_name.yaml.
  2. Place it in [Your Working Directory]/sciphi/data/library_of_phi/table_of_contents.
  3. Format similarly to Aerodynamics_of_Viscous_Fluids.yaml.
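As a minimal sketch of steps 1 and 2, the helper below computes where a custom table of contents is expected to live. The function name toc_path and the textbook name My_Custom_Textbook are hypothetical; only the directory layout is taken from the steps above.

```python
from pathlib import Path


def toc_path(working_dir: str, textbook_name: str) -> Path:
    """Return the expected location of a table-of-contents YAML file."""
    return (
        Path(working_dir)
        / "sciphi"
        / "data"
        / "library_of_phi"
        / "table_of_contents"
        / f"{textbook_name}.yaml"
    )


if __name__ == "__main__":
    # "My_Custom_Textbook" is a placeholder name, not a file shipped with SciPhi.
    path = toc_path(".", "My_Custom_Textbook")
    print(f"Save your table of contents as: {path}")
    print("Exists already:", path.exists())
```

Judging by the default command, where --textbook=Aerodynamics_of_Viscous_Fluids matches Aerodynamics_of_Viscous_Fluids.yaml, the file name without its extension is presumably the value to pass to --textbook.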

Incorporating RAG over Wikipedia:

  1. Set the --do-wiki flag to True (--do-wiki=True).
  2. In .env, set:
    • WIKI_SERVER_URL
    • WIKI_SERVER_USERNAME
    • WIKI_SERVER_PASSWORD
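Before enabling --do-wiki, it can help to confirm that all three settings are visible to the process. The sketch below is not part of SciPhi; the helper name missing_wiki_settings is hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os

# Variable names are taken from the list above.
REQUIRED_WIKI_VARS = (
    "WIKI_SERVER_URL",
    "WIKI_SERVER_USERNAME",
    "WIKI_SERVER_PASSWORD",
)


def missing_wiki_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_WIKI_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_wiki_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All Wikipedia RAG settings are present.")
```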

Output:
Generated textbooks reside in:
[Your Working Directory]/sciphi/data/library_of_phi

Note: The Wikipedia embeddings server is not yet public. In the meantime, ensure your configuration aligns with our specifications if you wish to use Wikipedia for RAG. If you would like to peruse more example textbooks, go here.

Installation

# Clone the repository
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi

# Install dependencies
# If you don't have poetry installed: pip3 install poetry
poetry install -E all

# Set up your environment
# Note: Modify the .env file as needed after copying
cp .env.example .env && vim .env

Requirements

  • Python: >= 3.11 and < 3.12
  • Poetry: For package management
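The Python constraint can be checked before installing. This is a small illustrative sketch rather than anything SciPhi provides; the function name python_version_supported is hypothetical.

```python
import sys


def python_version_supported(version=None):
    """Return True when the interpreter is >= 3.11 and < 3.12."""
    major, minor = (version or sys.version_info)[:2]
    return (3, 11) <= (major, minor) < (3, 12)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_supported():
        print(f"Python {current} satisfies the requirement.")
    else:
        print(f"Python {current} is outside the supported range (>= 3.11, < 3.12).")
```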

Optional Features

Install optional dependencies for enhanced features:

poetry install -E <extra_name>

Options include:

  • anthropic_support: For Anthropic models.
  • hf_support: For diverse model access with the HuggingFace package.
  • openai_support: For OpenAI models.
  • vllm_support: For VLLM, aiding fast inference.
  • llama_index_support: For LlamaIndex, enhancing grounded synthesis.
  • chroma_support: For Chroma support in large vector databases.
  • all: Includes all dependencies (excluding vllm, which needs separate installation).
  • all_with_cuda: Everything.

Customizable Data Generation

For fully configurable and flexible data generation, execute the relevant runner.py with various command-line arguments.

poetry run python sciphi/examples/basic_data_gen/runner.py --provider_name=openai --model_name=gpt-4 --log_level=INFO --batch_size=1 --num_samples=1 --output_file_name=example_output.jsonl --example_config=textbooks_are_all_you_need_basic_split

The above command will generate a single sample from GPT-4. This sample is generated using the textbooks_are_all_you_need_basic_split configuration, and the output is appended to example_output.jsonl.
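To inspect the result, the output can be read back line by line. The sketch below assumes only what the .jsonl extension implies, namely one JSON value per line; the field names inside each record are not documented here, so the code does not rely on them. It also assumes example_output.jsonl is in the current directory, which may not match where the runner writes it.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    output = Path("example_output.jsonl")
    if output.exists():
        samples = read_jsonl(output)
        print(f"Loaded {len(samples)} sample(s)")
    else:
        print(f"{output} not found; run the generation command first.")
```

Because the runner appends to the file, repeated runs accumulate samples, and the list returned here grows accordingly.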

The long-term view of the SciPhi framework is to provide a training-feedback loop as shown below:

[Figure: SciPhi training-feedback loop]

Command-Line Arguments

See the README for the full list of arguments and their default values. Notable ones include --provider_name, --model_name, and --temperature.

Replicating Full Table of Contents Generation

Step 0: Scrape MIT OCW for course details.

poetry run python sciphi/examples/library_of_phi/raw_data/ocw_scraper.py scrape

Step 1: Convert scraped data into 'draft' syllabi YAMLs.

poetry run python sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py run

Step 2: Refine the draft YAML into the finalized syllabi.

poetry run python sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py run

Step 3: Transition the syllabi to a 'draft' table of contents.

poetry run python sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py run

Step 4: Produce clean table of contents YAML files.

poetry run python sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py run
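The five steps above can also be chained from a single script. This is a sketch, not a tool shipped with SciPhi: the names STEPS, build_commands, and run_pipeline are hypothetical, and it assumes the script is launched from the repository root and that each step should stop the chain if it fails.

```python
import subprocess
import sys

# Script paths and trailing subcommands are copied from Steps 0 to 4 above.
STEPS = [
    ("sciphi/examples/library_of_phi/raw_data/ocw_scraper.py", "scrape"),
    ("sciphi/examples/library_of_phi/gen_step_1_draft_syllabi.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_2_clean_syllabi.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_3_draft_table_of_contents.py", "run"),
    ("sciphi/examples/library_of_phi/gen_step_4_clean_table_of_contents.py", "run"),
]


def build_commands():
    """Return one argument list per step, in execution order."""
    return [["poetry", "run", "python", script, action] for script, action in STEPS]


def run_pipeline(execute=False):
    """Print each command; run it only when execute is True."""
    for command in build_commands():
        print(" ".join(command))
        if execute:
            # check=True raises CalledProcessError on the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default; pass --execute to actually run the steps.
    run_pipeline(execute="--execute" in sys.argv[1:])
```

The dry-run default is a deliberate choice here: several steps call LLMs, so printing the commands first avoids spending API credits by accident.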

License

Licensed under the Apache-2.0 License.

Citations

  1. Textbooks Are All You Need
  2. Textbooks Are All You Need II: Phi-1.5 Technical Report

Citation

If using SciPhi in academic work, please cite:

@software{Emergent_AGI_SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi}},
url = {https://github.com/emrgnt-cmplxty/sciphi},
year = {2023}
}

Contributors

emrgnt-cmplxty, krrishdholakia
