datadreamer-dev / datadreamer
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models.
Home Page: https://datadreamer.dev
License: MIT License
Is there a need for trl to be locked to 0.7.6 here? Can it be relaxed to at least allow 0.8.1, which is compatible with transformers >= 4.39?
There's an import that fails otherwise. See huggingface/trl#1415.
Uncaught exception when using the Together API. I'm using the model Phind/Phind-CodeLlama-34B-v2.
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/root/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/datadreamer/llms/together.py", line 116, in _retry_wrapper
return func(**kwargs)
^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/together/complete.py", line 48, in create
response = create_post_request(
^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 119, in create_post_request
response_status_exception(response)
File "/root/venv/lib/python3.11/site-packages/together/utils.py", line 87, in response_status_exception
response.raise_for_status()
File "/root/venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 524 Server Error: for url: https://api.together.xyz/api/inference
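The traceback above ends with an HTTP 524 escaping the retry wrappers. As a rough illustration of treating such 5xx responses as retryable, here is a minimal standard-library sketch. It is not DataDreamer's or Together's code: `TransientServerError` and `call_with_retries` are hypothetical names, and the backoff values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServerError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"server error {status_code}")
        self.status_code = status_code


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying 5xx errors with exponential backoff.

    max_attempts must be at least 1. Non-5xx errors are re-raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientServerError as exc:
            retryable = 500 <= exc.status_code < 600
            if not retryable or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; in practice the caller would map the client library's own exception type onto the retry condition.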
Outdated Anthropic implementation with litellm when using the newer Claude 3 models with DataDreamer.
Traceback (most recent call last):
File "/root/venv/lib/python3.11/site-packages/litellm/main.py", line 987, in completion
response = anthropic.completion(
^^^^^^^^^^^^^^^^^^^^^
File "/root/venv/lib/python3.11/site-packages/litellm/llms/anthropic.py", line 170, in completion
raise AnthropicError(
litellm.llms.anthropic.AnthropicError: {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-sonnet-20240229\" is not supported on this API. Please use the Messages API instead."}}
I run two vLLM instances on two A100-40G GPUs to do inference. However, I found that the throughput is less than twice that of a single vLLM instance on one card.
I am not sure whether batching matters here: vLLM uses continuous batching, but DataDreamer seems to do inference in batches.
I have found that I can change the src code and test it in tests.
How can I install from the source code and use it from other directories, like:
from datadreamer.llms import VLLM, ParallelLLM
from datadreamer import DataDreamer
As the title says, how can I use an Azure GPT key in DataDreamer?
Great work on the design and documentation of the repo!
I want to introduce a new LLM class to work with TGI servers. I did not find any detailed documentation on how to go about it. I referred to Creating a new LLM and also looked at other LLM implementations (MistralAI) within the repo, but I was not able to get it to work as I had hoped.
The flow works successfully, but I can see that the endpoint is getting called multiple times per input. I have attached my test script below. Temporarily, I have replaced the TGI call with a dummy response (test response). When I execute my test script, I see "get_generated_texts called" printed 6 times (as opposed to 2). It looks like either a bug in the implementation or some gap in my understanding. Can you please help clarify?
Test Script -
class TGI(MistralAI):
    def _run_batch(
        self,
        max_length_func: Callable[[list[str]], int],
        inputs: list[str],
        max_new_tokens: None | int = None,
        temperature: float = 1.0,
        top_p: float = 0.0,
        n: int = 1,
        stop: None | str | list[str] = None,
        repetition_penalty: None | float = None,
        logit_bias: None | dict[int, float] = None,
        batch_size: int = DEFAULT_BATCH_SIZE,
        seed: None | int = None,
        **kwargs,
    ) -> list[str] | list[list[str]]:
        prompts = inputs
        assert (
            stop is None or stop == []
        ), f"`stop` is not supported for {type(self).__name__}"
        assert (
            repetition_penalty is None
        ), f"`repetition_penalty` is not supported for {type(self).__name__}"
        assert (
            logit_bias is None
        ), f"`logit_bias` is not supported for {type(self).__name__}"
        assert n == 1, f"Only `n` = 1 is supported for {type(self).__name__}"

        # Run the model
        def get_generated_texts(self, kwargs, prompt) -> list[str]:
            print("get_generated_texts called")
            return ["test response"]

        if batch_size not in self.executor_pools:
            self.executor_pools[batch_size] = ThreadPoolExecutor(max_workers=batch_size)
        generated_texts_batch = list(
            self.executor_pools[batch_size].map(
                partial(get_generated_texts, self, kwargs), prompts
            )
        )
        if n == 1:
            return [batch[0] for batch in generated_texts_batch]
        else:  # pragma: no cover
            return generated_texts_batch


with DataDreamer(":memory:"):
    tgi_model = TGI(model_name="tgi_model")
    eli5_dataset = HFHubDataSource(
        "Get ELI5 Questions",
        "eli5_category",
        split="train",
        trust_remote_code=True,
    ).select_columns(["title"])

    # Keep only 2 examples as a quick demo
    eli5_dataset = eli5_dataset.take(2, lazy=False)

    # Ask llm to ELI5
    questions_and_answers = Prompt(
        "Generate Explanations",
        inputs={"prompts": eli5_dataset.output["title"]},
        args={
            "llm": tgi_model,
            "instruction": (
                'Given the question, give an "Explain it like I\'m 5" answer.'
            ),
            "lazy": False,
            "top_p": 1.0,
        },
        outputs={"prompts": "questions", "generations": "answers"},
    )
    print(f"{questions_and_answers.head()}")
Hi,
I am running into an "ascii error" whenever I use the library with models loaded through HFTransformers. The code runs perfectly with other LLMs like OpenAI or MistralAI.
Example code:
with DataDreamer("./test"):
    hrs_dataset = DataSource('hrs_documents', Dataset.from_list([{'text':'article 1 text'}, {'text':'article 2 text'}]))
    model = HFTransformers("tiiuae/falcon-40b-instruct", device_map="auto")
    hrs_dataset = hrs_dataset.take(10)
    output_dataset = ProcessWithPrompt(
        "Mine arguments from texts",
        inputs={"inputs": hrs_dataset.output['text']},
        args={
            "llm": model,
            "Temperature": 1.2,
            "instruction": (
                "summarize the text"
            ),
        },
        outputs={"inputs": "fullText", "generations": "summaries"},
    )
    output_dataset.save()
When using the DataDreamer library to interact with Cohere, the system encounters an OSError related to exceeding the maximum number of open files.
yield llm.format_prompt(
File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 231, in format_prompt
required_token_count = self.final_count_tokens(construct_final_prompt([]))
File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/llm.py", line 92, in final_count_tokens
return self.count_tokens(value)
File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 816, in __call__
return self.run(self._rope.config.default_action, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 671, in run
return attr(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 697, in impl_f
return attr(self, *fargs, pargs=pargs)
File "/usr/local/lib/python3.10/dist-packages/ring/func/sync.py", line 54, in get_or_update
result = self.execute(wire, pargs=pargs)
File "/usr/local/lib/python3.10/dist-packages/ring/func/base.py", line 380, in execute
return wire.__func__(*pargs.args, **pargs.kwargs)
File "/usr/local/lib/python3.10/dist-packages/datadreamer/llms/_litellm.py", line 146, in count_tokens
return token_counter(
File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2851, in token_counter
tokenizer_json = _select_tokenizer(model=model)
File "/usr/local/lib/python3.10/dist-packages/litellm/utils.py", line 2582, in _select_tokenizer
tokenizer = Tokenizer.from_pretrained("Cohere/command-nightly")
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1201, in hf_hub_download
os.makedirs(storage_folder, exist_ok=True)
File "/usr/lib/python3.10/os.py", line 225, in makedirs
mkdir(name, mode)
OSError: [Errno 23] Too many open files in system: '/root/.cache/huggingface/hub/models--Cohere--command-nightly'
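One detail in the error above may help narrow the cause: "Too many open files in system" is the message for ENFILE (the system-wide file table is full), which differs from EMFILE (the per-process descriptor limit). The following standard-library sketch tells the two apart. The function name is hypothetical, and this is a diagnostic aid rather than a fix for the underlying issue.

```python
import errno
import os


def describe_open_file_error(exc: OSError) -> str:
    """Classify an OSError as a per-process or system-wide open-file limit."""
    if exc.errno == errno.EMFILE:
        return "per-process descriptor limit reached (EMFILE)"
    if exc.errno == errno.ENFILE:
        return "system-wide file table is full (ENFILE)"
    # Anything else is not an open-file limit; report the OS description.
    detail = os.strerror(exc.errno) if exc.errno is not None else str(exc)
    return f"unrelated OSError: {detail}"
```

If the result points at EMFILE, the process itself is holding too many descriptors; ENFILE suggests the pressure comes from the whole machine, possibly including other processes.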
I find guidance useful in my own dataset generation, just to add certain constraints to outputs.
Can we add support for these in DataDreamer? I'd be happy to contribute code for this and a few examples myself, if there's interest from the DataDreamer owners.
How can I pass a column of an existing Hugging Face dataset as the input of ProcessWithPrompt?
When attempting to use gpt-4-turbo-preview with llms.OpenAI(), I get the following error:
OpenAI (gpt-4-turbo-preview)] Retrying datadreamer.llms.openai.OpenAI.retry_wrapper.<locals>._retry_wrapper in 3.0 seconds as it raised BadRequestError: Error code: 400 - {'error': {'message': 'max_tokens is too large: 8153. This model supports at most 4096 completion tokens, whereas you provided 8153
This error can be fixed by passing max_new_tokens=4096. I'm attempting to fork datadreamer and fix get_max_context_length() in src/llms/openai.py, but I think there's general confusion between GPT-4's advertised "context length" and its maximum number of completion/output tokens.
From src/llms/openai.py
def get_max_context_length(self, max_new_tokens: int) -> int:  # pragma: no cover
    """Gets the maximum context length for the model. When ``max_new_tokens`` is
    greater than 0, the maximum number of tokens that can be used for the prompt
    context is returned.

    Args:
        max_new_tokens: The maximum number of tokens that can be generated.

    Returns:
        The maximum context length.
    """  # pragma: no cover
    model_name = _normalize_model_name(self.model_name)
    format_tokens = 0
    if _is_chat_model(model_name):
        # Each message is up to 4 tokens and there are 3 messages
        # (system prompt, user prompt, assistant response)
        # and then we have to account for the system prompt
        format_tokens = 4 * 3 + self.count_tokens(cast(str, self.system_prompt))
    if "-preview" in model_name:
        max_context_length = 128000
This code is clearly trying to calculate GPT-4's context length from the given model name. But the error produced later comes from confusing context length with completion/output tokens and asking for more than the model will give (in this case, 4,096).
Just wanted to document the issue as I see it before I change the entire function around (in a PR) to reduce confusion.
While trying to run pip3 install datadreamer.dev, I get the following errors:
ERROR: Could not find a version that satisfies the requirement datadreamer.dev (from versions: none)
ERROR: No matching distribution found for datadreamer.dev
Am I doing something wrong?
The problem tends to happen when using the Mistral API on a larger number of entries. I'm using mistral-large-latest.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/data/filter.py", line 24, in <module>
filtered = FilterWithPrompt(
^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
self.__setup_folder_and_resume()
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
self.__start()
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
self._set_output(self.run())
^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/filter_with_prompt.py", line 84, in run
process_with_prompt = ProcessWithPrompt(
^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 337, in __init__
self.__setup_folder_and_resume()
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 442, in __setup_folder_and_resume
self.__start()
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 451, in __start
self._set_output(self.run())
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step.py", line 894, in _set_output
self.__output = _output_to_dataset(
^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 862, in _output_to_dataset
output = __output_to_dataset(
^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/step_output.py", line 559, in __output_to_dataset
first_row = next(
^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/steps/prompt/_prompt_base.py", line 105, in get_generations
for input, prompt, generation, get_extra_columns in zip(
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 797, in _run_over_batches
yield from self._run_over_batches_locked(
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 763, in _run_over_batches_locked
results = self._run_over_sorted_batches(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/_cachable/_cachable.py", line 585, in _run_over_sorted_batches
run_batch(
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 162, in _run_batch
generated_texts_batch = list(
^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 143, in get_generated_texts
response = self.retry_wrapper(
^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/root/data/venv/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/datadreamer/llms/mistral_ai.py", line 81, in _retry_wrapper
return func(**kwargs)
^^^^^^^^^^^^^^
File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 160, in chat
for response in single_response:
File "/root/data/venv/lib/python3.11/site-packages/mistralai/client.py", line 98, in _request
raise MistralException(
mistralai.exceptions.MistralException: Unexpected exception (ReadTimeout): The read operation timed out
DataDreamer/src/trainers/trainer.py
Line 96 in ad3dd9c