
cassioml / cassio


A framework-agnostic Python library to seamlessly integrate Cassandra with ML/LLM/genAI workloads

License: Apache License 2.0

Python 99.17% Makefile 0.83%

cassio's People

Contributors

anisharao16, cbornet, dependabot[bot], epinzur, hemidactylus, jbellis, msmygit, nicoloboschi


cassio's Issues

Investigate prompt templates that are known to work well with Vector Search for OpenAI & GCP Vertex

When developing the NoSQL assistant, the prompt template that was used did not work well: the directive (e.g. "Use information from the vector search results to answer the question, otherwise answer 'I don't know'") was not honored, because the template did not set the vector search results apart with ''' delimiters.

This is not a straightforward task, because the prompt template needs to work under various scenarios, not just Q&A (the user might simply be chatting with the bot without asking a question).

The prompt template will also have to take into account items such as chat history and the ability to perform caching.

[LangChain] enrich ChatMessageHistory with on-the-fly session id

Why should the session_id (e.g. the user identity) be specified only at class instantiation time?

Suppose my web app serves thousands of users: then I have to instantiate thousands of these classes.

Consider adding a session_id parameter to the methods (get messages, put, etc.) so that a single instance will "statelessly" work on the whole table and serve all users, no?
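The proposed interface can be sketched as follows; `ChatMessageStore`, `add_message`, and `get_messages` are hypothetical names, and an in-memory dict stands in for the Cassandra table. The point is that one instance serves all sessions, with session_id passed per call:

```python
from collections import defaultdict
from typing import Dict, List

class ChatMessageStore:
    """In-memory stand-in for the Cassandra-backed chat-history table."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = defaultdict(list)

    def add_message(self, session_id: str, message: str) -> None:
        # in the real table this would be an INSERT keyed by session_id
        self._rows[session_id].append(message)

    def get_messages(self, session_id: str) -> List[str]:
        # and this a SELECT restricted to the session_id partition
        return list(self._rows[session_id])

store = ChatMessageStore()
store.add_message("user-1", "hello")
store.add_message("user-2", "hi")
messages = store.get_messages("user-1")
```

With this shape, the web app holds a single store object regardless of how many users it serves.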

Map and prepare for metadata-based hybrid filtering in Vector classes

Map the space of metadata exact search combined with ANN and redesign the vector class (or a variation thereof) so that it will support that kind of search.

Possibly compare with the metadata capabilities other vector DBs offer and try to make them available at cassIO level.

"cassio.init()" to get a DB

Currently: each table abstraction class requires session and keyspace.

Proposal: make them optional and have them default to a cassio-global session & keyspace.
This would be set with cassio.init(DB parameters) - this init method having various forms and essentially being a friction-removal utility function (both for cassandra clusters and cloud connections).
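A minimal sketch of the proposed pattern, with hypothetical names and a string standing in for a real driver session: a module-level init() stores process-wide defaults that table classes fall back to when no session/keyspace is passed.

```python
from typing import Any, Optional

_default_session: Optional[Any] = None
_default_keyspace: Optional[str] = None

def init(session: Any = None, keyspace: Optional[str] = None) -> None:
    """Store process-wide defaults that table classes fall back to."""
    global _default_session, _default_keyspace
    _default_session = session
    _default_keyspace = keyspace

class BaseTable:
    def __init__(self, session: Any = None, keyspace: Optional[str] = None):
        # explicit arguments win; otherwise fall back to the globals
        self.session = session if session is not None else _default_session
        self.keyspace = keyspace if keyspace is not None else _default_keyspace
        if self.session is None or self.keyspace is None:
            raise ValueError("Pass session/keyspace or call init() first")

init(session="fake-session", keyspace="demo_keyspace")
table = BaseTable()  # picks up the globals set by init()
```

The various forms of init (cluster contact points, cloud secure-connect bundles, an already-built session) would all funnel into setting these two defaults.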

CassIO get should default to ConsistencyLevel.LOCAL_ONE

Vectors are large and don't change often; the overhead of doing 2x the work to check whether the replicas agree is not a good tradeoff.

If we can easily make it configurable, great; if not, just make it LOCAL_ONE across the board.
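The configurable option could look like the sketch below; the parameter name is hypothetical, and a string constant stands in for cassandra.ConsistencyLevel.LOCAL_ONE (in the real driver one would assign that value to the prepared statement's consistency_level attribute).

```python
# Stand-ins for the DataStax driver's ConsistencyLevel values.
LOCAL_ONE = "LOCAL_ONE"
LOCAL_QUORUM = "LOCAL_QUORUM"

class VectorTable:
    def __init__(self, read_consistency: str = LOCAL_ONE) -> None:
        # configurable, but defaulting to the cheap single-replica read
        self.read_consistency = read_consistency

    def get_cql(self, row_id: str) -> str:
        # in the real driver: statement.consistency_level = self.read_consistency
        return f"SELECT * FROM v WHERE row_id = {row_id!r}  -- {self.read_consistency}"

default_table = VectorTable()
strict_table = VectorTable(read_consistency=LOCAL_QUORUM)
```

Callers who do care about replica agreement can opt back into a stronger level per instance.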

Redesign table-abstraction class hierarchies

Classes abstracting table access are ad-hoc things now, designed after langchain needs. This task is about capturing generalizations and sharing the code in a system of mixins/subclasses (tbd) with hierarchical responsibilities re: generation of CQL and receiving of method parameters, e.g. vector/nonvector, clustering/nonclustering, etc.
Langchain uses some of these, but there is a "rectangle to complete" conceptually:

classes1

Draft for a class system (conceptually):

classes2

Support bulk delete in VectorTable

A delete_many method accepting a list of IDs, which internally performs the deletes concurrently.

Or (not ideal perhaps) at least make the delete async and have the langchain layer (or equivalent) handle that concurrency (compare #14 for the same discussion). At the moment there's a loop at the langchain level and deletes are serialized.
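A sketch of the first option; `delete_many` and `_delete_one` are hypothetical names, and a thread pool over a stub stands in for the driver-level concurrent execution (in practice one would likely use the driver's concurrent-execution helpers over a prepared DELETE).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

def _delete_one(row_id: str) -> str:
    # placeholder for session.execute(prepared_delete, (row_id,))
    return row_id

def delete_many(ids: Iterable[str], concurrency: int = 16) -> List[str]:
    # issue the per-row deletes concurrently instead of in a serial loop
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(_delete_one, ids))

deleted = delete_many(["doc-a", "doc-b", "doc-c"])
```

The langchain layer would then call delete_many once, instead of looping over serialized single deletes.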

NoSql Assistant Demo

DataStax is currently working on a NoSql Assistant that is representative of the canonical chatbot. DataStax should package up the application into a demo that can demonstrate the power of CassIO. The demo should include the following:

  • simple Colab notebook
  • chat agent runnable in a Flask server
  • sample dataset

Finalize the "key trick" for tables

The key trick fits different-arity choices of the "key" (as abstract concepts) into a single table, i.e.
abstract key = [['name', 'city','age'], ['John', 'Rome', 123]]

becomes the (always 2-)tuple of two strings

cache_key = "['John', 'Rome', 123]"

Possibly cumbersome and/or confusing.
Pro: fits heterogeneous stuff on the same table
Con: essentially repeats what the C* partitioner does

Additional comments (by @jbellis. The first one is not entirely clear to me)

  1. PRIMARY KEY (( key_desc, cache_key ))
    This is fine; however, if you have many smaller caches you’re better off letting key_desc be the only partition key (since then Cassandra can restrict the queries to just the replicas owning that partition). This is fine for 1.0, but we may end up wanting to expose this either directly (a partitioned boolean) or indirectly (an expected-cache-size parameter that lets us make the decision under the hood)

  2. self.keyDesc = '/'.join(self.keys)
    IMO we’d be better served by just providing a cache name parameter and let the caller decide how to build it
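The trick itself can be sketched in a few lines; `pack_key` is a hypothetical helper name, and json.dumps is one possible choice for the stable string form of the values.

```python
import json
from typing import Any, List, Tuple

def pack_key(columns: List[str], values: List[Any]) -> Tuple[str, str]:
    # key_desc: '/'-joined column names (or, per the comment above,
    # a caller-supplied cache name instead)
    key_desc = "/".join(columns)
    # cache_key: stable string form of the value tuple
    cache_key = json.dumps(values)
    return key_desc, cache_key

key_desc, cache_key = pack_key(["name", "city", "age"], ["John", "Rome", 123])
```

Any abstract key, whatever its arity, thus lands in the same two-text-column table.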

Resistance against Sessions with nondefault "row factory"

CassIO expects the Session to have the named-tuple Row factory (i.e. the rows are returned as Row objects from CQL queries).

Sometimes, however, for other reasons users stray off the default and set the row factory to e.g. dict_factory. Then, when passing the session to cassIO, boom.

At least check and give an error, or work around it by either:

  1. spawning a new session with the right factory, or
  2. picking the fields in the way appropriate to the factory in use (dict or named tuple; for other custom factories, give an error).
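The "check and give an error" option could be as simple as the sketch below. `check_row_factory` is a hypothetical name; the comparison assumes the driver's default factory is the one named named_tuple_factory, and a fake session with a dict-style factory stands in for a real one.

```python
def dict_factory(colnames, rows):
    # example of a nondefault factory a user might have set
    return [dict(zip(colnames, row)) for row in rows]

class FakeSession:
    row_factory = dict_factory

def check_row_factory(session) -> None:
    """Fail fast when the session does not use the default row factory."""
    factory_name = getattr(session.row_factory, "__name__", "")
    if factory_name != "named_tuple_factory":
        raise ValueError(
            f"CassIO needs the named-tuple row factory, got {factory_name!r}"
        )

try:
    check_row_factory(FakeSession())
    rejected = False
except ValueError:
    rejected = True
```

This turns the current "boom" into an actionable error at session-intake time.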

Rely on DB-native metrics for ANN search

At the time of implementation, something was not yet on DB-side. Things have changed.

  • no need to sort results by metric on cassio-level anymore
  • no need to even have those metrics for computation (with prefetch + calculation) anymore. Except, maybe not all are on DB. What about L1, L2, L-infinity? To be checked.

Implement MMR retrieval at cassIO level

It might make sense for the cassIO VectorTable class to offer its own MMR implementation.

(Currently for the langchain integration case this is done at langchain level, but arguably the right place is cassIO).

Integration tests don't create cassio_tutorials keyspace

I followed the setup in the readme with a local Cassandra instance and tried to run the integration tests. Creating the keyspace by hand is easy, but with a local Cassandra setup the tests should create the keyspace named in the CASSANDRA_KEYSPACE environment variable themselves. Also, if we use a "CREATE KEYSPACE IF NOT EXISTS", we could simply do that in all cases.

It's not a huge deal, but it would make it simpler for first-time users to not run into errors on the happy path.

Database exception philosophy?

Currently: cassIO does nothing, and DB exceptions bubble up to the caller.

(At the integration level, e.g. the langchain code using cassIO, the same happens).

Is there a change in philosophy needed here? Pro: users who don't want to bother have an easier life (in a sense). Con: error swallowing is generally bad.

Data Tracking in Chat History

For every LLM prompt and response, it would also be useful to track which pieces of context data were used to generate the prompt. For example, if 10 different entities were retrieved from the database, store the keyspace, table, column, and id of each entity in the chat history.

This is useful for data lineage and data tracking. Data tracking can be used to find bad data, or to help find the sources of the data that were used to generate answers.

Investigate how to use filtered Vector Search to improve search results

Vector search - specifically k-nearest-neighbor - is very sensitive to outliers. Filtered vector search gives the ability to reduce the search space prior to performing the vector search. Pinecone advertises this prominently in its marketing materials, so we need to figure out how to support it. Tooling such as langchain currently doesn't make it possible to do filtered vector search.

Makefile, style and linter

Standardize the code and the flow with these elements.

This includes type hints everywhere.
(and will also expose the leaky abstraction around the current vector mixin, eeeh)

Data Extractor, multiple rows and optimizations

Much work needed on the "Data extractor" facility.

Optimize queries (each table queried at most once)

For multiple-row returning, some thinking is needed (perhaps even just another extractor altogether?)

[LangChain] Plans for good implementation of LangChain's "semantic chat memory"

LangChain has no specific "semantic chat memory": that stems, instead, from a certain usage of the VectorStore.

(see here on cassio.org and here for a howto on LangChain site).

Changes needed, the rationale

In practice, once you have a vectorstore, first a "retriever" is created out of it (langchain standard construct) and then the latter is wrapped by a VectorStoreRetrieverMemory class (another langchain standard). Relevant steps:

vectorstore = whatever-your-backend.init(...)
retriever = vectorstore.as_retriever(search_kwargs=dict(k=1))
memory = VectorStoreRetrieverMemory(retriever=retriever)
# now "memory" can be used e.g. in a chat

In realistic usage you don't want to pull relevant chat snippets from the whole store, but only from the conversation with that user. In Cassandra terms, this means clustering rows by user_id.

CassIO

Hence we need a parameter in CassIO's VectorTable init that controls whether we have a primary key (( document_id )) or (( session_id ), document_id) in Cassandra terminology. This is not implemented yet: at the moment we only have the first choice and no control.

(Note: I assume we don't want to have a different table per user id !)
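The init-time switch could drive the CREATE TABLE statement as sketched below; the parameter name and the column names/types are illustrative, not the actual cassIO schema.

```python
def create_table_cql(table_name: str, partition_by_session: bool) -> str:
    # choose between the current single-key layout and the proposed
    # session-partitioned layout
    if partition_by_session:
        primary_key = "(( session_id ), document_id)"
    else:
        primary_key = "(( document_id ))"
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} ("
        "session_id TEXT, document_id TEXT, body_blob TEXT, "
        f"PRIMARY KEY {primary_key})"
    )

clustered = create_table_cql("chat_memory", partition_by_session=True)
flat = create_table_cql("chat_memory", partition_by_session=False)
```

With the partitioned layout, all users share one table and per-user retrieval is a single-partition read.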

LangChain

Once the above is addressed, LangChain also will have to slightly change.

Option 1: new params in the vector store's similarity_search_with_score_id_by_vector

The search_kwargs parameter in spawning the "retriever" will be the place to specify the user_id (i.e. session_id, i.e. partition to use for the subsequent lookup). These end up in the kwargs of the similarity_search_with_score_id_by_vector method of the Cassandra Vector Store, which will be able to pass this partition key to the cassIO search.

Pro: less proliferation of instances of vector store.
Con: might involve more kwargs, as this param reaches the Cassandra vector store through several routes (whether MMR, similarity, etc., different functions get called; see the as_retriever method of the base VectorStore class).

Option 2:

In this case one creates as many VectorStore instances as there are session_ids, each with the partition key as an instance property, which gets injected into each search() call within that instance. Much less intrusive, though perhaps a bit heavier resource-wise.

Cassio Agents Metadata schema for Langchain

There are the following "Objects" in Langchain that can be managed by data stored elsewhere, and not necessarily "hardcoded" in code.

These represent the choices users can make

In the beginning these can just be Python objects, or rather JSON configs; then we can move to tables.

AgentTypes (stores the registry of Agent Types)

  • AgentTypeName
  • AgentTypeDescription
  • AgentTypeClass

LLMTypes

  • LLMTypeID
  • LLMProvider (LLM provider, e.g., OpenAI GPT-3, Google PALM2, Cohere, Anthropic etc.)
  • LLMModels (LLM models available)
  • LLMKeys (LLM access keys if required)
  • LLMEndpoint (LLM API endpoint if applicable)

ToolTypes

  • ToolTypeName
  • ToolTypeDescription
  • ToolTypeKeys (Keys required for the tool, if applicable)
  • ToolTypeLLM (LLMTypeID) (Optional, specify the LLM provider used by the tool to override default one)
  • ToolTypeClass (Internal tool type class)
  • ToolTypeDefaultAction (Internal tool type class method)

DocumentLoaderTypes

  • DocumentLoaderID
  • DocumentLoaderName
  • DocumentLoaderParameters
  • DocumentLoaderClass (internal)

This represents what a user can define and store

Can start with JSON and then move to Tables

LLMConfiguration

  • LLMConfigurationID
  • LLMType
  • LLMParameters
  • LLMModel (One model chosen)
  • LLMKeys (LLM access keys if required, overrides global def in LLMTypes)
  • LLMEndpoint (LLM API endpoint if applicable, overrides global def in LLM Types)

Agent

  • AgentID
  • AgentName
  • AgentDescription
  • AgentType { Zero-Shot-React.., React-Docstore, etc... }
  • AgentIterations ( max iterations)
  • AgentLLM (LLMConfigurationID)

Tools

  • ToolID
  • ToolName (optional override)
  • ToolDescription (optional override)
  • ToolClass (optional override)
  • ToolDirectReturn
  • ToolAction (optional override)
  • ToolParameters (optional)

DocumentLoaderConfiguration

  • DocumentLoaderConfigurationID
  • DocumentLoaderType (DocumentLoaderTypeID)
  • DocumentLoaderName
  • DocumentLoaderParameters

Index

  • IndexID
  • IndexName
  • IndexType (IndexTypeID)
  • IndexLLMConnection (LLMConfigurationID)
  • IndexDescription
  • DocumentLoaders? (DocumentLoaderConfigIDs)

Strategy to implement

This is just the overall spec of what goes into an agent at a high level. Recommendation is to first implement with 100% JSON driven agent -- then implement in Schema -- since it will just be an optimization of where to store the config rather than functionality. This is applicable to future agent frameworks.

  • Define JSON examples of these - one for global config , one for agent config
  • Write an agent executor that uses the global and agent configs and runs as an API - it should offer easy ways to be embedded in Flask, Chainlit, etc., or to expose itself as an API
  • Back the agent executor with CQL Tables
  • Provide a way to export / import configs / merge configs
  • Create API to CRUD the configs
  • Create UI wrapper

Update Langchain docs so that it reflects the updated chathistory capabilities of Cassandra & the ability to use AstraDB

Currently, the docs of the langchain integration look very basic, and don't reflect vector search or CassIO:

https://python.langchain.com/docs/modules/memory/integrations/cassandra_chat_message_history
https://python.langchain.com/docs/ecosystem/integrations/cassandra
https://github.com/hwchase17/langchain/blob/master/langchain/memory/chat_message_histories/cassandra.py

It is also possible to create a version of the chat that discusses the managed version of Cassandra (AstraDB)

https://python.langchain.com/docs/modules/memory/integrations/motorhead_memory_managed

Other places that documentation & integration is missing:
https://python.langchain.com/docs/modules/data_connection/retrievers/
https://python.langchain.com/docs/modules/data_connection/text_embedding/

A suggestion is to drop a link to the CassIO website.

Retry/batching strategy to protect against timeouts

A retry strategy (essentially, it could be three parameters num_retries, retry_timeout, retry_sleep_seconds) for CQL operations.

This might have interplay with a batching strategy at cassIO level. However, given the current design of the cassIO/langchain integration (e.g. the vector store's insert-many), one should move the whole insert-many into cassIO (which could even make sense after all).
cassIO would then expose a put_many method that internally handles batching as well.
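The three-parameter retry wrapper could be sketched as below (with_retries is a hypothetical name; retry_timeout would additionally bound each attempt and is omitted here, and in practice one would catch only the driver's timeout exceptions, not all Exceptions).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def with_retries(
    operation: Callable[[], T],
    num_retries: int = 3,
    retry_sleep_seconds: float = 0.5,
) -> T:
    last_error: Optional[Exception] = None
    for attempt in range(num_retries + 1):
        try:
            return operation()
        except Exception as exc:  # in practice: driver timeout errors only
            last_error = exc
            if attempt < num_retries:
                time.sleep(retry_sleep_seconds)
    raise last_error  # all attempts failed, so last_error is set

calls = []
def flaky_write() -> str:
    # simulate an operation that times out twice, then succeeds
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("simulated write timeout")
    return "ok"

result = with_retries(flaky_write, num_retries=3, retry_sleep_seconds=0.0)
```

A put_many built on this would retry (or re-batch) only the failed chunks rather than the whole insert.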

Create Composable Prompt Templates for RAG

Prompting the LLM with the right prompt template is important for the LLM to make sense to the content from Vector search. Important items include:

  • Appropriate Directives
  • Proper content layout (User Context section, Memory section, Index data section, etc)
  • Tools to ensure that prompt fits within the context window

VectorTable's behaviour against null metadata

Passing None as metadata results in the column containing the literal string null (a valid JSON).

Either allow null metadata (currently the idea is to have an empty dict at least) or forbid it (e.g. normalizing to {} when writing).
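The "normalize to {} when writing" option is a one-liner at serialization time; `serialize_metadata` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Optional

def serialize_metadata(metadata: Optional[Dict[str, Any]]) -> str:
    # json.dumps(None) would store the literal string "null"; normalize first
    return json.dumps(metadata if metadata is not None else {})

stored_none = serialize_metadata(None)
stored_dict = serialize_metadata({"source": "doc1"})
```

This keeps the column a valid JSON object in all cases, at the cost of losing the distinction between "no metadata given" and "empty metadata".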

[LangChain] Wrong handling of ttl_seconds defaults chain in the vector store class

In the add_texts method, the intent is to have an optional TTL which defaults to the class-level one.
This is done via
ttl_seconds = ttl_seconds or self.ttl_seconds

Suppose the class default is 10 seconds and one explicitly passes 0 to the method: the insertions then get a 10-second TTL, contrary to user expectations.

Find a better interface (this seems like it'll be a general problem with TTL, where zero and None have ambiguous meaning).

Suggestion: a symbolic NOT_PASSED default which is not None and not zero, checked for in the code.
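A sketch of that sentinel pattern (class and method names mirror the ones discussed above, but the bodies are illustrative):

```python
NOT_PASSED = object()  # sentinel distinct from both None and 0

class VectorStoreSketch:
    def __init__(self, ttl_seconds=10):
        self.ttl_seconds = ttl_seconds

    def add_texts(self, texts, ttl_seconds=NOT_PASSED):
        # "not given" falls back to the class default;
        # an explicit 0 or None is honored as-is
        if ttl_seconds is NOT_PASSED:
            return self.ttl_seconds
        return ttl_seconds

store = VectorStoreSketch(ttl_seconds=10)
default_ttl = store.add_texts(["a"])                   # class default applies
explicit_zero = store.add_texts(["a"], ttl_seconds=0)  # honored, not replaced
```

Unlike `ttl_seconds or self.ttl_seconds`, the identity check against the sentinel cannot be fooled by falsy-but-meaningful values.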
