datadreamer-dev / datadreamer

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

Home Page: https://datadreamer.dev

License: MIT License

Languages: Python 96.52%, Shell 3.48%

Topics: alignment, deep-learning, fine-tuning, gpt, instruction-tuning, llm, llmops, llms, machine-learning, natural-language-processing, nlp, nlp-library, openai, python, pytorch, synthetic-data, synthetic-dataset-generation, transformers

datadreamer's People

Contributors: ajayp13, eltociear, preemware, younesbelkada

datadreamer's Issues

Together 524 Server Error Exception

Uncaught exception when using the Together API. I'm using the model Phind/Phind-CodeLlama-34B-v2.

   return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/datadreamer/llms/together.py", line 116, in _retry_wrapper
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/together/complete.py", line 48, in create
    response = create_post_request(
               ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 119, in create_post_request
    response_status_exception(response)
  File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 87, in response_status_exception
    response.raise_for_status()
  File "/root/venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 524 Server Error:  for url: https://api.together.xyz/api/inference
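
Until the Together wrapper catches this class of error itself, a hedged, user-level workaround is to retry the whole step with backoff; since DataDreamer caches completed work, re-running a step resumes rather than recomputes. The sketch below uses only documented tenacity APIs, and build_and_run_step is a hypothetical placeholder:

    # Retry on transient HTTP errors (like this 524) with exponential backoff.
    from requests.exceptions import HTTPError
    from tenacity import (
        retry,
        retry_if_exception_type,
        stop_after_attempt,
        wait_exponential,
    )

    @retry(
        retry=retry_if_exception_type(HTTPError),
        wait=wait_exponential(multiplier=2, max=60),
        stop=stop_after_attempt(5),
    )
    def build_and_run_step():
        ...  # placeholder: construct and run the step that calls Together here

    build_and_run_step()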

Anthropic Claude 3 invalid_request_error Exception

The Anthropic integration goes through an outdated litellm code path when using the newer Claude 3 models with DataDreamer: the API rejects Claude 3 on the legacy completion endpoint and asks for the Messages API instead.

Traceback (most recent call last):
  File "/root/venv/lib/python3.11/site-packages/litellm/main.py", line 987, in completion
    response = anthropic.completion(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/litellm/llms/anthropic.py", line 170, in completion
    raise AnthropicError(
litellm.llms.anthropic.AnthropicError: {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-sonnet-20240229\" is not supported on this API. Please use the Messages API instead."}}
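
A quick way to confirm the failure is on the litellm side is to call Claude 3 through litellm directly; recent litellm versions route claude-3-* models via the Messages API (this sketch assumes a recent litellm release and an ANTHROPIC_API_KEY in the environment):

    # Requires: a recent litellm release and ANTHROPIC_API_KEY set.
    from litellm import completion

    response = completion(
        model="claude-3-sonnet-20240229",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(response.choices[0].message.content)

If this works while DataDreamer fails, pinning a newer litellm (or updating DataDreamer's pinned version) is the likely fix.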

Question about the speed

I run two vLLM instances on two A100-40G GPUs to do inference. However, I found that the throughput is less than twice that of vLLM on a single card.
I am not sure whether batching matters: vLLM uses continuous batching, but DataDreamer seems to submit inference in fixed-size batches. My setup looks roughly like the sketch below.
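
For reference (whether VLLM takes a per-instance device argument is my reading of the docs, and the model name is just an example):

    from datadreamer import DataDreamer
    from datadreamer.llms import VLLM, ParallelLLM

    with DataDreamer("./output"):
        # ParallelLLM splits each batch of prompts across the wrapped LLMs
        # (as I understand the docs).
        llm = ParallelLLM(
            VLLM("mistralai/Mistral-7B-Instruct-v0.2", device=0),  # GPU 0
            VLLM("mistralai/Mistral-7B-Instruct-v0.2", device=1),  # GPU 1
        )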

How to install from the source code?

I have found that I can change the source code and test it via the test suite.

How can I install from the source code and use it in other directories, like:

from datadreamer.llms import VLLM, ParallelLLM
from datadreamer import DataDreamer
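
For reference, the standard editable-install flow would be git clone https://github.com/datadreamer-dev/DataDreamer.git, then pip3 install -e . from the repo root (this assumes the repo ships the usual packaging metadata, which I have not verified). With an editable install, those imports resolve to the checked-out sources from any directory, so edits under src/ take effect without reinstalling.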

Guidance on supporting TGI-based LLM

Great work on the design and documentation of the repo!

I want to introduce a new LLM class to work with TGI (Text Generation Inference) servers. I did not find any detailed documentation on how to go about it. I referred to the "Creating a new LLM" guide and also looked at other LLM implementations (MistralAI) within the repo, but I was not able to get it to work as I had hoped.

The flow completes successfully, but I can see that the endpoint is called multiple times per input. I have attached my test script below. Temporarily, I have replaced the TGI call with a dummy response ("test response"). When I execute the script, I see "get_generated_texts called" printed 6 times (as opposed to the expected 2). Either this is a bug in the implementation or there is a gap in my understanding. Can you please help clarify?

Test script:

    # Imports added for completeness; DEFAULT_BATCH_SIZE is defined here as a
    # stand-in, since the real constant lives inside the library.
    from concurrent.futures import ThreadPoolExecutor
    from functools import partial
    from typing import Callable

    from datadreamer import DataDreamer
    from datadreamer.llms import MistralAI
    from datadreamer.steps import HFHubDataSource, Prompt

    DEFAULT_BATCH_SIZE = 10  # assumed value

    class TGI(MistralAI):

        def _run_batch(
            self,
            max_length_func: Callable[[list[str]], int],
            inputs: list[str],
            max_new_tokens: None | int = None,
            temperature: float = 1.0,
            top_p: float = 0.0,
            n: int = 1,
            stop: None | str | list[str] = None,
            repetition_penalty: None | float = None,
            logit_bias: None | dict[int, float] = None,
            batch_size: int = DEFAULT_BATCH_SIZE,
            seed: None | int = None,
            **kwargs,
        ) -> list[str] | list[list[str]]:
            prompts = inputs
            assert (
                stop is None or stop == []
            ), f"`stop` is not supported for {type(self).__name__}"
            assert (
                repetition_penalty is None
            ), f"`repetition_penalty` is not supported for {type(self).__name__}"
            assert (
                logit_bias is None
            ), f"`logit_bias` is not supported for {type(self).__name__}"
            assert n == 1, f"Only `n` = 1 is supported for {type(self).__name__}"
        
            # Run the model
            def get_generated_texts(self, kwargs, prompt) -> list[str]:                
                print("get_generated_texts called")
                return ["test response"]

            if batch_size not in self.executor_pools:
                self.executor_pools[batch_size] = ThreadPoolExecutor(max_workers=batch_size)
            generated_texts_batch = list(
                self.executor_pools[batch_size].map(
                    partial(get_generated_texts, self, kwargs), prompts
                )
            )

            if n == 1:
                return [batch[0] for batch in generated_texts_batch]
            else:  # pragma: no cover
                return generated_texts_batch

    with DataDreamer(":memory:"):
        tgi_model = TGI(model_name="tgi_model")

        eli5_dataset = HFHubDataSource(
            "Get ELI5 Questions",
            "eli5_category",
            split="train",
            trust_remote_code=True,
        ).select_columns(["title"])

        # Keep only 2 examples as a quick demo
        eli5_dataset = eli5_dataset.take(2, lazy=False)

        # Ask llm to ELI5
        questions_and_answers = Prompt(
            "Generate Explanations",
            inputs={"prompts": eli5_dataset.output["title"]},
            args={
                "llm": tgi_model,
                "instruction": (
                    'Given the question, give an "Explain it like I\'m 5" answer.'
                ),
                "lazy": False,
                "top_p": 1.0,
            },
            outputs={"prompts": "questions", "generations": "answers"},
        )

        print(f"{questions_and_answers.head()}")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 100: ordinal not in range(128)

Hi,

I am running into an "ascii error" whenever I use the library with models loaded via HFTransformers. The code runs perfectly with other LLMs like OpenAI or MistralAI.

Example code:

    # Imports added for completeness
    from datasets import Dataset

    from datadreamer import DataDreamer
    from datadreamer.llms import HFTransformers
    from datadreamer.steps import DataSource, ProcessWithPrompt

    with DataDreamer("./test"):
        hrs_dataset = DataSource(
            "hrs_documents",
            Dataset.from_list(
                [{"text": "article 1 text"}, {"text": "article 2 text"}]
            ),
        )
        model = HFTransformers("tiiuae/falcon-40b-instruct", device_map="auto")

        hrs_dataset = hrs_dataset.take(10)
        output_dataset = ProcessWithPrompt(
            "Mine arguments from texts",
            inputs={"inputs": hrs_dataset.output["text"]},
            args={
                "llm": model,
                "temperature": 1.2,  # was "Temperature"; generation args are lowercase
                "instruction": "summarize the text",
            },
            outputs={"inputs": "fullText", "generations": "summaries"},
        )

        output_dataset.save()
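
One thing I have not ruled out: a non-UTF-8 default locale on the machine. Byte 0xe2 begins a UTF-8 multi-byte sequence, which the ascii codec cannot decode, and local model loading (unlike the API-backed LLMs) reads files from disk. A standard-library diagnostic sketch:

    # If either of these prints "ascii" (or "ANSI_X3.4-1968"), the host locale
    # is the likely culprit; launching with PYTHONUTF8=1 or LANG=C.UTF-8 set in
    # the environment is a common workaround.
    import locale
    import sys

    print(sys.getdefaultencoding())       # "utf-8" on any modern Python 3
    print(locale.getpreferredencoding())  # locale-dependent; "ascii" signals trouble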

Too many open files in system

When using the DataDreamer library to interact with Cohere, the process hits an OSError from exceeding the system-wide limit on open files.

    yield llm.format_prompt(
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 231, in format_prompt
    required_token_count = self.final_count_tokens(construct_final_prompt([]))
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 92, in final_count_tokens
    return self.count_tokens(value)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 816, in __call__
    return self.run(self._rope.config.default_action, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 671, in run
    return attr(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 697, in impl_f
    return attr(self, *fargs, pargs=pargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/sync.py", line 54, in get_or_update
    result = self.execute(wire, pargs=pargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 380, in execute
    return wire.__func__(*pargs.args, **pargs.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/_litellm.py", line 146, in count_tokens
    return token_counter(
  File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2851, in token_counter
    tokenizer_json = _select_tokenizer(model=model)
  File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2582, in _select_tokenizer
    tokenizer = Tokenizer.from_pretrained("Cohere/command-nightly")
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1201, in hf_hub_download
    os.makedirs(storage_folder, exist_ok=True)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 23] Too many open files in system: '/root/.cache/huggingface/hub/models--Cohere--command-nightly'
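
Note that Errno 23 (ENFILE) is the kernel's system-wide file-table limit, unlike the per-process Errno 24 (EMFILE), so raising ulimit -n alone may not help. A Linux-only diagnostic sketch to see whether this process is the one leaking descriptors:

    import os

    # File descriptors currently open in this process (via /proc).
    fd_count = len(os.listdir(f"/proc/{os.getpid()}/fd"))
    print(f"open fds in this process: {fd_count}")

    # System-wide view: allocated handles vs. the kernel maximum.
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = f.read().split()
    print(f"system-wide: {allocated} allocated of {maximum} max")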

Add support for "guidance" and "outlines"

I find guidance useful in my own dataset generation, just for adding certain constraints to outputs.
Could support for guidance and outlines be added to DataDreamer? I'd be happy to contribute the code and a few examples myself, if there's interest from the DataDreamer maintainers.

gpt-4-turbo-preview max_tokens error

When attempting to use gpt-4-turbo-preview with llms.OpenAI() I get the following error:

[OpenAI (gpt-4-turbo-preview)] Retrying datadreamer.llms.openai.OpenAI.retry_wrapper.<locals>._retry_wrapper in 3.0 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'max_tokens is too large: 8153. This model supports at most 4096 completion tokens, whereas you provided 8153

This error can be fixed by passing max_new_tokens=4096. I'm attempting to fork datadreamer and fix get_max_context_length() in src/llms/openai.py, but I think there's general confusion between GPT-4's advertised "context length" and its maximum number of completion/output tokens.

From src/llms/openai.py

    def get_max_context_length(self, max_new_tokens: int) -> int:  # pragma: no cover
        """Gets the maximum context length for the model. When ``max_new_tokens`` is
        greater than 0, the maximum number of tokens that can be used for the prompt
        context is returned.

        Args:
            max_new_tokens: The maximum number of tokens that can be generated.

        Returns:
            The maximum context length.
        """  # pragma: no cover
        model_name = _normalize_model_name(self.model_name)
        format_tokens = 0
        if _is_chat_model(model_name):
            # Each message is up to 4 tokens and there are 3 messages
            # (system prompt, user prompt, assistant response)
            # and then we have to account for the system prompt
            format_tokens = 4 * 3 + self.count_tokens(cast(str, self.system_prompt))
        if "-preview" in model_name:
            max_context_length = 128000

This code is clearly trying to derive GPT-4's context length from the given model name. But the error produced later stems from confusing context length with completion/output tokens, and so requesting more than the model will return (in this case, 4,096):

[Screenshot taken 2024-03-07 at 2:49 PM]

Just wanted to document the issue as I see it before I change the entire function around (in a PR) to reduce confusion.
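
For anyone hitting this in the meantime, a sketch of the workaround above (capping max_new_tokens explicitly); the step and argument names follow the other examples on this page, and passing a plain dict to DataSource is my assumption:

    from datadreamer import DataDreamer
    from datadreamer.llms import OpenAI
    from datadreamer.steps import DataSource, ProcessWithPrompt

    with DataDreamer("./output"):
        llm = OpenAI("gpt-4-turbo-preview")
        docs = DataSource("Docs", {"text": ["example document"]})
        summaries = ProcessWithPrompt(
            "Summarize",
            inputs={"inputs": docs.output["text"]},
            args={
                "llm": llm,
                "instruction": "Summarize the text.",
                # gpt-4-turbo-preview advertises a 128k context window but at
                # most 4,096 completion tokens; cap generation explicitly.
                "max_new_tokens": 4096,
            },
            outputs={"inputs": "text", "generations": "summary"},
        )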

Issues with installation

While trying to run pip3 install datadreamer.dev, I get the following errors:

ERROR: Could not find a version that satisfies the requirement datadreamer.dev (from versions: none)
ERROR: No matching distribution found for datadreamer.dev

Am I doing something wrong?
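
One general pip behavior worth ruling out: "from versions: none" usually means no release is compatible with the running interpreter, and DataDreamer requires a recent Python (3.10+, per my reading of the docs). Checking python3 --version, and python3 -m pip --version to confirm which interpreter pip is bound to, would narrow this down; that said, this is a generic diagnosis, not a confirmed cause for this report.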

MistralAI Read Timeout

The problem tends to happen when using the Mistral API on a larger number of entries. I'm using mistral-large-latest.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/data/filter.py", line 24, in <module>
    filtered = FilterWithPrompt(
               ^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
                     ^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/filter_with_prompt.py", line 84, in run
    process_with_prompt = ProcessWithPrompt(
                          ^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 894, in _set_output
    self.__output = _output_to_dataset(
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 862, in _output_to_dataset
    output = __output_to_dataset(
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 559, in __output_to_dataset
    first_row = next(
                ^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/_prompt_base.py", line 105, in get_generations
    for input, prompt, generation, get_extra_columns in zip(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 797, in _run_over_batches
    yield from self._run_over_batches_locked(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 763, in _run_over_batches_locked
    results = self._run_over_sorted_batches(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 585, in _run_over_sorted_batches
    run_batch(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 162, in _run_batch
    generated_texts_batch = list(
                            ^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 143, in get_generated_texts
    response = self.retry_wrapper(
               ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 81, in _retry_wrapper
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 160, in chat
    for response in single_response:
  File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 98, in _request
    raise MistralException(
mistralai.exceptions.MistralException: Unexpected exception (ReadTimeout): The read operation timed out
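
To narrow this down, I plan to reproduce the failing call with the mistralai client directly using a longer read timeout; the 0.x client's constructor takes timeout and max_retries arguments (this sketch assumes that signature). If a longer timeout fixes it, the feature request would be for DataDreamer's MistralAI wrapper to expose that knob:

    from mistralai.client import MistralClient
    from mistralai.models.chat_completion import ChatMessage

    # Raise the read timeout well above the default to rule out slow responses.
    client = MistralClient(api_key="...", timeout=300, max_retries=5)
    response = client.chat(
        model="mistral-large-latest",
        messages=[ChatMessage(role="user", content="Hello")],
    )
    print(response.choices[0].message.content)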
