datadreamer-dev / datadreamer Goto Github PK

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤

License: MIT License

Shell 3.48% Python 96.52%

alignment deep-learning fine-tuning gpt instruction-tuning llm llmops llms machine-learning natural-language-processing nlp nlp-library openai python pytorch synthetic-data synthetic-dataset-generation transformers

datadreamer's People

Contributors

Stargazers

Watchers

datadreamer's Issues

TRL version requirement

Is there a need for trl to be locked to 0.7.6 here? Can it be relaxed to at least allow 0.8.1 which is compatible with transformers >= 4.39?

There's an import that fails otherwise. See huggingface/trl#1415.

Does this work deal with the workload balance when scheduing?

Together 524 Server Error Exception

Uncaught exception when using Together API. I'm using the model Phind/Phind-CodeLlama-34B-v2.

   return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/datadreamer/llms/together.py", line 116, in _retry_wrapper
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/together/complete.py", line 48, in create
    response = create_post_request(
               ^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 119, in create_post_request
    response_status_exception(response)
  File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 87, in response_status_exception
    response.raise_for_status()
  File "/root/venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 524 Server Error:  for url: https://api.together.xyz/api/inference

Anthropic Claude 3 invalid_request_error Exception

Outdated Anthropic implementation with litellm when using the newer Claude 3 models with DataDreamer.

Traceback (most recent call last):                                                  File "/root/venv/lib/python3.11/site-packages/litellm/main.py", line 987, in completion                                                                               response = anthropic.completion(                                                             ^^^^^^^^^^^^^^^^^^^^^                                                File "/root/venv/lib/python3.11/site-packages/litellm/llms/anthropic.py", line 170, in completion                                                                     raise AnthropicError(                                                         litellm.llms.anthropic.AnthropicError: {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-sonnet-20240229\" is not supported on this API. Please use the Messages API instead."}}

I run two vLLM instances on 2 A100-40G to do inference. However, I found that the throughput is less than 2 times of the vLLM on one card.
I am not sure whether the batch matters. As in vLLM, it uses the continuous batching, but datadreamer seems to do inference in batches.

How to install from the souce code?

I have found that I can change the src code and test it in tests.

How can I install from the source code and use in other directories like:

from datadreamer.llms import VLLM, ParallelLLM
from datadreamer import DataDreamer

Can we use the azure gpt key?

As the title said, How can I use the azure gpt key in DataDreamer?

Guidance on supporting TGI-based LLM

Great work on the design and documentation of the repo!

I want to introduce a new LLM class to work with TGI servers. I did not find any detailed documentation on how to go about it. I referred to Creating a new LLM and also looked at other LLM implementations (MistralAI) within the repo but I was not able to get it to work as I had hoped.

The flow works successfully but I can see that the endpoint is getting called multiple times per input. I have attached my test script below. Temporarily, I have replaced the TGI call with a dummy response (test response). When I execute my test script, I see get_generated_texts called printed 6 times (as opposed to 2). It either looks like a bug in the implementation or some gap in my understanding. Can you please help clarify?

Test Script -

    class TGI(MistralAI):

        def _run_batch(
            self,
            max_length_func: Callable[[list[str]], int],
            inputs: list[str],
            max_new_tokens: None | int = None,
            temperature: float = 1.0,
            top_p: float = 0.0,
            n: int = 1,
            stop: None | str | list[str] = None,
            repetition_penalty: None | float = None,
            logit_bias: None | dict[int, float] = None,
            batch_size: int = DEFAULT_BATCH_SIZE,
            seed: None | int = None,
            **kwargs,
        ) -> list[str] | list[list[str]]:
            prompts = inputs
            assert (
                stop is None or stop == []
            ), f"`stop` is not supported for {type(self).__name__}"
            assert (
                repetition_penalty is None
            ), f"`repetition_penalty` is not supported for {type(self).__name__}"
            assert (
                logit_bias is None
            ), f"`logit_bias` is not supported for {type(self).__name__}"
            assert n == 1, f"Only `n` = 1 is supported for {type(self).__name__}"
        
            # Run the model
            def get_generated_texts(self, kwargs, prompt) -> list[str]:                
                print("get_generated_texts called")
                return ["test response"]

            if batch_size not in self.executor_pools:
                self.executor_pools[batch_size] = ThreadPoolExecutor(max_workers=batch_size)
            generated_texts_batch = list(
                self.executor_pools[batch_size].map(
                    partial(get_generated_texts, self, kwargs), prompts
                )
            )

            if n == 1:
                return [batch[0] for batch in generated_texts_batch]
            else:  # pragma: no cover
                return generated_texts_batch           
                
    

    with DataDreamer(":memory:"):
        tgi_model = TGI(model_name="tgi_model")

        eli5_dataset = HFHubDataSource(
            "Get ELI5 Questions",
            "eli5_category",
            split="train",
            trust_remote_code=True,
        ).select_columns(["title"])

        # Keep only 2 examples as a quick demo
        eli5_dataset = eli5_dataset.take(2, lazy=False)

        # Ask llm to ELI5
        questions_and_answers = Prompt(
            "Generate Explanations",
            inputs={"prompts": eli5_dataset.output["title"]},
            args={
                "llm": tgi_model,
                "instruction": (
                    'Given the question, give an "Explain it like I\'m 5" answer.'
                ),
                "lazy": False,
                "top_p": 1.0,
            },
            outputs={"prompts": "questions", "generations": "answers"},
        )

        print(f"{questions_and_answers.head()}")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 100: ordinal not in range(128)

Hi,

I am running to an "ascii error" whenever I use the library with models loaded from HFTransformers. The code runs perfectly with other llms like OpenAI or MistralAI.

Example code:

with DataDreamer("./test"):
  hrs_dataset = DataSource('hrs_documents', Dataset.from_list([{'text':'article 1 text'}, {'text':'article 2 text'}]))
  model  = HFTransformers("tiiuae/falcon-40b-instruct", device_map="auto")
  
  hrs_dataset = hrs_dataset.take(10)
  output_dataset = ProcessWithPrompt(
      "Mine arguments from texts",
      inputs={"inputs": hrs_dataset.output['text']},
      args={
          "llm": model,
          "Temperature": 1.2,
          "instruction": (
              "summarize the text"
          ),
      },
      outputs={"inputs": "fullText", "generations": "summaries"},
  )
  
  output_dataset.save()

Too many open files in system

When using the DataDreamer library to interact with Cohere, the system encounters an OSError related to exceeding the maximum number of open files.

    yield llm.format_prompt(
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 231, in format_prompt
    required_token_count = self.final_count_tokens(construct_final_prompt([]))
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 92, in final_count_tokens
    return self.count_tokens(value)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 816, in __call__
    return self.run(self._rope.config.default_action, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 671, in run
    return attr(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 697, in impl_f
    return attr(self, *fargs, pargs=pargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/sync.py", line 54, in get_or_update
    result = self.execute(wire, pargs=pargs)
  File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 380, in execute
    return wire.__func__(*pargs.args, **pargs.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/_litellm.py", line 146, in count_tokens
    return token_counter(
  File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2851, in token_counter
    tokenizer_json = _select_tokenizer(model=model)
  File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2582, in _select_tokenizer
    tokenizer = Tokenizer.from_pretrained("Cohere/command-nightly")
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1201, in hf_hub_download
    os.makedirs(storage_folder, exist_ok=True)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 23] Too many open files in system: '/root/.cache/huggingface/hub/models--Cohere--command-nightly'

Add support for "guidance" and "outlines"

I find guidance useful in my own dataset generation, just to add certain constraints to outputs.
Can we add support for these in DataDreamer? I'd be happy to contribute code for this and a few examples myself, if there's interest from the DataDreamer owners.

How to parse hugging face datasets or dataframe to ProcessWithPrompt input?

I wonder how to parse an existing hugging face dataset column as input of ProcessWithPrompt?

gpt-4-turbo-preview max_tokens error

When attempting to use gpt-4-turbo-preview with llms.OpenAI() I get the following error:

OpenAI (gpt-4-turbo-preview)] Retrying datadreamer.llms.openai.OpenAI.retry_wrapper.<locals>._retry_wrapper in 3.0 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'max_tokens is too large: 8153. This model supports at most 4096 completion tokens, whereas you provided 8153

This error can be fixed by passing max_new_tokens=4096. I'm attempting to fork datadreamer and fix get_max_content_length() in src/llms/openai.py, but I think there's general confusion between GPT-4's advertised "context length" and it's max number of completion/output tokens.

From src/llms/openai.py

    def get_max_context_length(self, max_new_tokens: int) -> int:  # pragma: no cover
        """Gets the maximum context length for the model. When ``max_new_tokens`` is
        greater than 0, the maximum number of tokens that can be used for the prompt
        context is returned.

        Args:
            max_new_tokens: The maximum number of tokens that can be generated.

        Returns:
            The maximum context length.
        """  # pragma: no cover
        model_name = _normalize_model_name(self.model_name)
        format_tokens = 0
        if _is_chat_model(model_name):
            # Each message is up to 4 tokens and there are 3 messages
            # (system prompt, user prompt, assistant response)
            # and then we have to account for the system prompt
            format_tokens = 4 * 3 + self.count_tokens(cast(str, self.system_prompt))
        if "-preview" in model_name:
            max_context_length = 128000

This code is obviously trying to calculate GPT-4's context length from the model name given. But the error produced later has to do with confusing context length with completion/output tokens and asking for more than the model will give (in this case, 4,096):

Just wanted to document the issue as I see it before I change the entire function around (in a PR) to reduce confusion.

Issues with installation

While trying to do pip3 install datadreamer.dev
I get the following issues
ERROR: Could not find a version that satisfies the requirement datadreamer.dev (from versions: none)
ERROR: No matching distribution found for datadreamer.dev

Am I doing something wrong ?

MistralAI Read Timeout

Problem tends to happens when using Mistral API on larger amount of entries. I'm using mistral-large-latest.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/data/filter.py", line 24, in <module>
    filtered = FilterWithPrompt(
               ^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
                     ^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/filter_with_prompt.py", line 84, in run
    process_with_prompt = ProcessWithPrompt(
                          ^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
    self.__setup_folder_and_resume()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
    self.__start()
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
    self._set_output(self.run())
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 894, in _set_output
    self.__output = _output_to_dataset(
                    ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 862, in _output_to_dataset
    output = __output_to_dataset(
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 559, in __output_to_dataset
    first_row = next(
                ^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/_prompt_base.py", line 105, in get_generations
    for input, prompt, generation, get_extra_columns in zip(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 797, in _run_over_batches
    yield from self._run_over_batches_locked(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 763, in _run_over_batches_locked
    results = self._run_over_sorted_batches(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 585, in _run_over_sorted_batches
    run_batch(
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 162, in _run_batch
    generated_texts_batch = list(
                            ^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 143, in get_generated_texts
    response = self.retry_wrapper(
               ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 81, in _retry_wrapper
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 160, in chat
    for response in single_response:
  File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 98, in _request
    raise MistralException(
mistralai.exceptions.MistralException: Unexpected exception (ReadTimeout): The read operation timed out

Is there a specific reason to not let run the trainer in memory?

DataDreamer/src/trainers/trainer.py

Line 96 in ad3dd9c

if not DataDreamer.initialized() or DataDreamer.is_running_in_memory():

datadreamer-dev / datadreamer Goto Github PK

datadreamer's People

Contributors

Stargazers

Watchers

Forkers

datadreamer's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs