
Comments (9)

guillaume-be commented on May 11, 2024

Hello @QuantumEntangledAndy

In general I would like to keep the pipelines at a higher level of abstraction, so maybe for this case there is a solution that does not require manipulating the models and tokenizers directly. Note that

conversation.history = conversation_manager.model.get_tokenizer().convert_tokens_to_ids(
    conversation_manager.model.get_tokenizer().tokenize_list(texts.to_vec())
);

would not generate a properly formed history, which expects EOS tokens and actual token ids (`tokenize_list` returns a list of `String`).

Would it be more convenient if the pipeline stored the history as a sequence (i.e. kept the sequence of prompts/responses separate as a `Vec<Vec<i64>>`) instead of an aggregated vector? This would probably be sufficient to trim to the last N inputs.
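
For illustration, a minimal sketch of that shape (the names here are hypothetical, not the pipeline's actual fields):

struct Conversation {
  history: Vec<Vec<i64>>, // one token-id sequence per prompt/response
}

impl Conversation {
  // The aggregated form the model consumes can be rebuilt on demand.
  fn flat_history(&self) -> Vec<i64> {
    self.history.iter().flatten().copied().collect()
  }
}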

Would you still require access to the tokenizer? An alternative would be to load a tokenizer manually and use it to re-encode the prompts

let tokenizer = Gpt2Tokenizer::from_file(
    vocab_path.to_str().unwrap(),
    merges_path.to_str().unwrap(),
    false,
)?;
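
Re-encoding a prompt could then look something like this (a sketch; `tokenize` and `convert_tokens_to_ids` are the rust-tokenizers trait methods, and the prompt text is illustrative):

// Requires the `Tokenizer` trait from rust-tokenizers to be in scope.
let tokens = tokenizer.tokenize("Hello, how are you?");
let ids: Vec<i64> = tokenizer.convert_tokens_to_ids(&tokens);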


QuantumEntangledAndy commented on May 11, 2024

I would like a reliable way to read in history from a file too. Currently I save it as text (TOML, actually) in this format:

enum Speaker {
  Bot,
  User,
}

struct Past {
  speaker: Speaker,
  idx: u64,
  message: String,
}

But on load I have no way of getting this data back into the history.
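
If the tokenizer were reachable, loading could look something like this sketch (`rebuild_history` and `eos_id` are hypothetical names; having access to the tokenizer and the EOS id is exactly what is missing):

// Sketch: rebuild a flat token-id history from the saved Past records,
// terminating each message with the EOS id as the pipeline expects.
fn rebuild_history(past: &[Past], tokenizer: &Gpt2Tokenizer, eos_id: i64) -> Vec<i64> {
  let mut history = Vec::new();
  for entry in past {
    let tokens = tokenizer.tokenize(&entry.message);
    history.extend(tokenizer.convert_tokens_to_ids(&tokens));
    history.push(eos_id);
  }
  history
}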


QuantumEntangledAndy commented on May 11, 2024

Also, if we do want to go down the abstraction route and focus on the higher level, it might be better not to expose `history` at all. It is in tokenized id form, which is not reliably readable or writable without the tokenizer.


QuantumEntangledAndy commented on May 11, 2024

What about a setup like this for the read part?

#[derive(PartialEq)]
enum HistoryKind {
  Input,
  Output,
}

struct HistoryItem {
  kind: HistoryKind,
  message: Vec<i64>,
}

struct Conversation {
  history: Vec<HistoryItem>,
}

impl Conversation {
  // `tokens_to_string` stands in for decoding token ids via the tokenizer.
  pub fn get_outputs(&self) -> Vec<String> {
    self.history
      .iter()
      .filter(|i| i.kind == HistoryKind::Output)
      .map(|i| tokens_to_string(&i.message))
      .collect()
  }

  pub fn get_inputs(&self) -> Vec<String> {
    self.history
      .iter()
      .filter(|i| i.kind == HistoryKind::Input)
      .map(|i| tokens_to_string(&i.message))
      .collect()
  }
}

This completely disposes of `generated_responses` and `inputs` and relies on deriving them from the history.

This doesn't solve my issue, but it would change the model to a single source of truth and set it up for easy trimming to the last N inputs.
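
For instance, trimming could look like this (a sketch; the method name is hypothetical):

impl Conversation {
  // Keep only the history from the n-th most recent input onwards.
  pub fn trim_to_last_inputs(&mut self, n: usize) {
    let input_positions: Vec<usize> = self
      .history
      .iter()
      .enumerate()
      .filter(|(_, item)| item.kind == HistoryKind::Input)
      .map(|(idx, _)| idx)
      .collect();
    if input_positions.len() > n {
      let cut = input_positions[input_positions.len() - n];
      self.history.drain(..cut);
    }
  }
}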


QuantumEntangledAndy commented on May 11, 2024

As a less invasive change we could:

  • maintain the current input and response strings as the source of the actual inputs and outputs, regardless of the context that generated them,

  • change `history` to a list of lists of ints, split on the EOS token (a sketch of the split follows this list),

  • add methods to set and get the history as a list of strings.
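
A minimal sketch of that split, assuming `eos_id` holds the EOS token id:

// Split an aggregated token-id history into per-turn sequences on the EOS id.
fn split_history(history: &[i64], eos_id: i64) -> Vec<Vec<i64>> {
  history
    .split(|&id| id == eos_id)
    .filter(|segment| !segment.is_empty())
    .map(|segment| segment.to_vec())
    .collect()
}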


QuantumEntangledAndy commented on May 11, 2024

The issue with getting and setting the history as strings is that it requires the tokenizer, which is available only to the manager and not to `Conversation`, so these two methods would have to be conversation manager methods.
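
Something along these lines, as a sketch (the `tokenizer` field and the `decode` call shown are assumptions about the internals, with the history kept as a list of lists of ids):

impl ConversationManager {
  // Hypothetical manager-level accessor: decode each history segment to a
  // string using the tokenizer the manager owns (`decode` signature assumed).
  pub fn history_as_strings(&self, conversation: &Conversation) -> Vec<String> {
    conversation
      .history
      .iter()
      .map(|segment| self.tokenizer.decode(segment, true, true))
      .collect()
  }
}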


guillaume-be commented on May 11, 2024

@QuantumEntangledAndy thank you for providing more details, I now understand better what you are trying to achieve.

As a less invasive change we could:

  • maintain the current input and response strings as the source of the actual inputs and outputs, regardless of the context that generated them,
  • change `history` to a list of lists of ints, split on the EOS token,
  • add methods to set and get the history as a list of strings.

These were my thoughts as well. I am pushing some changes that should allow you to load conversations from snapshots (see #89).


QuantumEntangledAndy commented on May 11, 2024

Thanks, I have tested #89 and can confirm that it is working as intended :)


guillaume-be commented on May 11, 2024

Thank you, merged #89 to master.

