cassioml / cassio
A framework-agnostic Python library to seamlessly integrate Cassandra with ML/LLM/genAI workloads
License: Apache License 2.0
Walkthrough of user experience - https://drive.google.com/file/d/1w5otdPItPGA9tDo-j282P1UCdDT6z2og/view
When developing the NoSQL assistant, the prompt template that was used did not work well: directives such as "Use information from the vector search results to answer the question; otherwise answer 'I don't know'" were not followed, because the template did not properly delimit the vector search results with ''' quoting.
This is not a straightforward task, because the prompt template needs to work under various scenarios, not just Q&A (the user might simply be chatting with the bot and not asking a question at all).
The prompt template will also have to take into account items such as chat history and the ability to perform caching.
This should be straightforward: a key-value store backed by Cassandra, added to the kvstore implementations in LangChain.
Why should the session_id (e.g. the user identity) be specified only at class instantiation time?
If my web app serves thousands of users, I have to instantiate thousands of these classes.
Consider adding a session_id parameter to the methods (get_messages, put, etc.) so that a single instance will "statelessly" work on the whole table and serve all users, no?
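A minimal sketch of what such a "stateless" interface could look like (the class name, method names, and table schema below are all hypothetical, assuming a table partitioned by session_id):

```python
from typing import List


class StatelessChatHistory:
    """One instance serves all users: session_id is a method parameter,
    not fixed at construction time (hypothetical sketch)."""

    def __init__(self, session, keyspace: str, table: str) -> None:
        self.session = session
        self.keyspace = keyspace
        self.table = table

    def get_messages(self, session_id: str) -> List[str]:
        # Each call targets one user's partition; the instance holds no user state.
        rows = self.session.execute(
            f"SELECT body FROM {self.keyspace}.{self.table} WHERE session_id = %s",
            (session_id,),
        )
        return [row.body for row in rows]

    def put(self, session_id: str, message: str) -> None:
        # message_id as a timeuuid (via now()) is an assumption of this sketch.
        self.session.execute(
            f"INSERT INTO {self.keyspace}.{self.table} "
            f"(session_id, message_id, body) VALUES (%s, now(), %s)",
            (session_id, message),
        )
```

A single such instance could then back a whole web app, with the per-request session_id threaded through each call.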
Add simple validation code to prevent people from passing arbitrary values there, or even attempting CQL injection!
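One possible shape for such a guard (a sketch, not the actual cassIO code; the 48-character limit matches Cassandra's identifier length limit, and the accepted pattern is an assumption):

```python
import re

# Accept only plain identifiers (letters, digits, underscores, up to 48 chars),
# so that values interpolated into CQL statements cannot smuggle in clauses.
_IDENTIFIER_RE = re.compile(r"^[a-zA-Z0-9_]{1,48}$")


def validate_identifier(value: str, what: str = "identifier") -> str:
    """Raise ValueError unless `value` is a safe, plain identifier."""
    if not _IDENTIFIER_RE.match(value):
        raise ValueError(f"Invalid {what}: {value!r}")
    return value
```

For example, `validate_identifier("chat_history")` passes it through unchanged, while an attempted injection like `"x; DROP TABLE y"` raises.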
Map the space of metadata exact search combined with ANN and redesign the vector class (or a variation thereof) so that it will support that kind of search.
Possibly compare with the metadata capabilities other vector DBs offer and try to make them available at cassIO level.
Currently, LangChain out of the box only allows inserting a single column with a single embedding.
Add the ability to store multiple columns of data into a vector store at the same time.
Currently: each table abstraction class requires session and keyspace.
Proposal: make them optional and have them default to a cassio-global session & keyspace.
This would be set with cassio.init(DB parameters); this init method would have various forms, essentially being a friction-removal utility function (both for Cassandra clusters and cloud connections).
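A sketch of the fallback pattern this proposal describes (function and variable names here are illustrative, not the final API):

```python
# Module-level defaults, set once via init() and used as fallbacks
# by every table abstraction that omits session/keyspace.
_global_session = None
_global_keyspace = None


def init(session=None, keyspace=None):
    """Store a process-wide default session and keyspace."""
    global _global_session, _global_keyspace
    _global_session = session
    _global_keyspace = keyspace


def resolve_session_and_keyspace(session=None, keyspace=None):
    """Explicit arguments win; otherwise fall back to the init() globals."""
    s = session if session is not None else _global_session
    k = keyspace if keyspace is not None else _global_keyspace
    if s is None or k is None:
        raise ValueError(
            "No session/keyspace available: pass them explicitly or call init() first."
        )
    return s, k
```

Table classes would then call `resolve_session_and_keyspace(...)` in their constructors, so `session` and `keyspace` become optional everywhere.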
Vectors are large and don't change often; the overhead of doing 2x the work to see if the replicas agree is not a good tradeoff.
If we can easily make it configurable, great; if not, just use LOCAL_ONE across the board.
Classes abstracting table access are currently ad-hoc, designed after LangChain's needs. This task is about capturing the generalizations and sharing the code in a system of mixins/subclasses (TBD) with hierarchical responsibilities regarding CQL generation and method parameters, e.g. vector/nonvector, clustering/nonclustering, etc.
Langchain uses some of these, but there is a "rectangle to complete" conceptually:
Draft for a class system (conceptually):
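One way the draft could be sketched in code (all class names, column names, and the vector dimension are hypothetical, purely to illustrate the mixin idea):

```python
class BaseTable:
    """Plain key/value table: row_id is the whole primary key."""

    def _columns(self):
        return ["row_id TEXT", "body_blob TEXT"]

    def _primary_key(self):
        return "PRIMARY KEY (row_id)"


class ClusteredMixin:
    """Adds a partition column; row_id becomes a clustering column."""

    def _columns(self):
        return ["partition_id TEXT"] + super()._columns()

    def _primary_key(self):
        return "PRIMARY KEY ((partition_id), row_id)"


class VectorMixin:
    """Adds an embedding column (dimension is an assumption of this sketch)."""

    vector_dimension = 1536

    def _columns(self):
        return super()._columns() + [
            f"embedding VECTOR<FLOAT, {self.vector_dimension}>"
        ]


class ClusteredVectorTable(ClusteredMixin, VectorMixin, BaseTable):
    """Composes mixins along the MRO to generate its CREATE TABLE CQL."""

    def create_cql(self, keyspace, table):
        parts = self._columns() + [self._primary_key()]
        return f"CREATE TABLE {keyspace}.{table} ({', '.join(parts)});"
```

Each cell of the "rectangle" (vector/nonvector × clustering/nonclustering) is then a one-line class composing the right mixins.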
Anant.us and DataStax are hosting LLM bootcamps using CassIO: https://kono.io/bootcamp/ . We should make sure that this is mentioned on the website.
A delete_many method accepting a list of IDs, that internally performs the deletes concurrently.
Or (perhaps not ideal) at least make the delete async and have the langchain layer (or equivalent) handle the concurrency (compare #14 for the same discussion). At the moment there's a loop at the langchain level and deletes are serialized.
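A sketch of the concurrent delete_many (the DataStax driver's cassandra.concurrent.execute_concurrent_with_args would be the idiomatic helper here; a plain thread pool is used below only to show the idea without assuming driver internals, and all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def delete_many(session, keyspace, table, row_ids, concurrency=16):
    """Delete many rows by ID, issuing the per-ID deletes concurrently
    instead of in a serial loop (hypothetical sketch)."""
    query = f"DELETE FROM {keyspace}.{table} WHERE row_id = %s"
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # One DELETE per ID, fanned out over the pool; list() waits for all.
        list(pool.map(lambda rid: session.execute(query, (rid,)), row_ids))
```

The langchain layer would then call this once with the whole ID list rather than looping.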
DataStax is currently working on a NoSql Assistant that is representative of the canonical chatbot. DataStax should package up the application into a demo that can demonstrate the power of CassIO. The demo should include the following:
The key trick fits different-arity choices of the "key" (as abstract concepts) into a single table, i.e.
abstract key = [['name', 'city', 'age'], ['John', 'Rome', 123]]
becomes the (always 2-)tuple of two strings
key_desc = "name/city/age"
cache_key = "['John', 'Rome', 123]"
Possibly cumbersome and/or confusing.
Pro: fits heterogeneous stuff on the same table
Con: essentially repeats what the C* partitioner does
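The flattening described above can be sketched in a few lines (a minimal illustration, not the actual cassIO code):

```python
def flatten_key(names, values):
    """Collapse an abstract key of any arity into the fixed pair
    (key_desc, cache_key) of two strings, so heterogeneous caches
    can share a single table schema."""
    assert len(names) == len(values)
    key_desc = "/".join(names)      # e.g. "name/city/age"
    cache_key = str(list(values))   # e.g. "['John', 'Rome', 123]"
    return key_desc, cache_key
```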
Additional comments (by @jbellis; the first one is not entirely clear to me):
PRIMARY KEY (( key_desc, cache_key ))
This is fine; however, if you have many smaller caches you're better off allowing key_desc to be the only partition key. (Since then Cassandra can restrict the queries to just the replicas owning that partition.) This is fine for 1.0, but we may end up wanting to expose this either directly (a partitioned boolean) or indirectly (an expected-cache-size parameter that lets us make the decision under the hood).
self.keyDesc = '/'.join(self.keys)
IMO we'd be better served by just providing a cache name parameter and letting the caller decide how to build it.
CassIO expects the Session to have the named-tuple row factory (i.e. rows are returned as Row objects from CQL queries).
Sometimes, however, for other reasons, users stray off the default and set the row factory to e.g. dict_factory. Then, when passing the session to cassIO: boom.
At least check and give an error, or work around this by either:
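The "check and give an error" option could look roughly like this (a sketch; the check by factory name is an assumption, and real code would import the factories from cassandra.query):

```python
def check_row_factory(session):
    """Fail fast if the session's row factory is not the named-tuple
    default that cassIO relies on (hypothetical guard)."""
    factory = getattr(session, "row_factory", None)
    factory_name = getattr(factory, "__name__", "")
    if factory_name != "named_tuple_factory":
        raise ValueError(
            f"cassIO requires the named-tuple row factory, got {factory_name!r}. "
            "Set session.row_factory = cassandra.query.named_tuple_factory."
        )
```

Called once at table-class construction, this would turn the deferred "boom" into an immediate, explanatory error.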
We cannot assume that passed iterables have a len(), nor that they are indexable. So, batched iterators to the rescue.
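A batched iterator needing neither len() nor indexing can be written with islice (Python 3.12's itertools.batched does the same thing natively; this sketch works on older versions too):

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable,
    using only iteration (no len(), no indexing)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```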
At the time of implementation, something was not yet on DB-side. Things have changed.
CassIO currently supports Langchain. We should add support for LlamaIndex next.
Integration options:
Vector Store - https://gpt-index.readthedocs.io/en/latest/how_to/integrations/vector_stores.html
Index Stores - https://gpt-index.readthedocs.io/en/latest/how_to/storage/index_stores.html
A data connector (i.e. Reader) ingests data from different data sources and data formats into a simple Document representation (text and simple metadata).
https://gpt-index.readthedocs.io/en/latest/api_reference/readers.html
This would enable a broad class of "reading from Cassandra" use cases.
Currently, there is no easy way to determine the distance thresholds for relevancy in nearest-neighbor search. We need a tool that can at least "visually" help determine a good cutoff for relevancy (see: https://towardsdatascience.com/k-nearest-neighbors-knn-for-anomaly-detection-fdf8ee160d13)
It might make sense for the cassIO VectorTable class to offer its own MMR implementation.
(Currently for the langchain integration case this is done at langchain level, but arguably the right place is cassIO).
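For reference, a compact MMR (maximal marginal relevance) implementation over pre-fetched ANN hits could look like the sketch below: greedily pick items similar to the query but dissimilar to what was already selected, with lambda_mult trading off relevance against diversity. This is a generic illustration, not cassIO's actual code.

```python
import math


def _cos(a, b):
    """Cosine similarity of two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def mmr(query_vec, candidate_vecs, k, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    chosen = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(chosen) < k:
        best_i, best_score = None, None
        for i in remaining:
            relevance = _cos(query_vec, candidate_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (_cos(candidate_vecs[i], candidate_vecs[j]) for j in chosen),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if best_score is None or score > best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen  # indices into candidate_vecs, in selection order
```

With lambda_mult near 1 this degenerates to plain similarity ranking; lower values favor diversity among the returned hits.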
I followed the setup in the readme with a local Cassandra instance and tried to run the integration tests. It's easy to create the keyspace, but I think with a local Cassandra setup we should create the keyspace named in the CASSANDRA_KEYSPACE environment variable. Also, if we do a "CREATE KEYSPACE IF NOT EXISTS", we could just do that in all cases.
It's not a huge deal but would make it simpler for first time users to not run into errors on the happy path.
Dot product is about 40% faster than the default cosine, but we can only use it if the embedding vectors are normalized.
If we know what the embeddings provider is we can make an intelligent default. (OpenAI and Google's are both normalized, for instance. OpenAI's are probably overwhelmingly the most popular.)
Currently: cassIO does nothing, and DB exceptions bubble up to the caller.
(At the integration level, e.g. the langchain code using cassIO, the same happens).
Is there a change in philosophy needed here? Pro: users who don't want to bother have an easier life (in a sense). Con: error swallowing is generally bad.
For every LLM prompt and response, it would also be useful to track which pieces of context data were used for generating the prompt. For example, if 10 different entities were retrieved from the database, store the keyspace, table, column, and id for each entity in the chat history.
This is useful for data lineage and data tracking. Data tracking can be used to find bad data, or to help find the sources of the data that was used to generate answers.
I.e. the insertions normalize all vectors to norm one, and then internally the dot product is used for the cosine.
This saves ~50% cpu time on ANN searches.
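The trick rests on a simple identity: for unit vectors, the dot product equals the cosine similarity, so normalizing once at insertion time removes the per-comparison norm computations. A minimal illustration:

```python
import math


def normalize(vector):
    """Scale a vector to unit (L2) norm; done once at insertion time."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("Cannot normalize the zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    """For unit vectors, this equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

After `normalize` at write time, ANN searches can use the cheaper dot product while returning cosine-equivalent rankings.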
This error is thrown in a call to VectorstoreIndexCreator.from_loaders. It was working a few days ago.
Vector search - specifically k-nearest-neighbor - is very sensitive to outliers. Filtered vector search gives the ability to reduce the search space prior to performing vector search. This is advertised significantly in Pinecone's marketing materials, so we need to figure out how to perform it. Tooling such as langchain actually doesn't make it possible to do filtered vector search.
Standardize the code and the flow with these elements.
This includes type hints everywhere.
(and will also expose the leaky abstraction around the current vector mixin, eeeh)
LangChain:
In the current implementation of the Summary Buffer Memory, the summary is never persisted (always in memory).
Investigate whether it pays off / is feasible to use Cassandra for that.
When retrieving data via ANN from Cassandra, a light-weight re-ranking is necessary for the purpose of determining which vector search results to pass to the LLM.
Much work needed on the "Data extractor" facility.
Optimize queries (each table queried at most once)
For multiple-row returning, some thinking is needed (perhaps even just another extractor altogether?)
LangChain has no specific "semantic chat memory": that stems, instead, from a certain usage of the VectorStore.
(see here on cassio.org and here for a howto on LangChain site).
In practice, once you have a vectorstore, first a "retriever" is created out of it (a langchain standard construct) and then the latter is wrapped by a VectorStoreRetrieverMemory class (another langchain standard). Relevant steps:
vectorstore = whatever-your-backend.init(...)
retriever = vectorstore.as_retriever(search_kwargs=dict(k=1))
memory = VectorStoreRetrieverMemory(retriever=retriever)
# now "memory" can be used e.g. in a chat
So in realistic usage you don't want to pull relevant chat snippets from the whole store, but rather from the conversation with that user, of course. In Cassandra terms, this means clustering rows by user_id.
Hence we need a parameter in CassIO's VectorTable init that controls whether we have a primary key (( document_id )) or (( session_id ), document_id), in Cassandra terminology. This is not implemented yet: at the moment we only have the first choice and no control.
(Note: I assume we don't want to have a different table per user id !)
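In CQL terms, the two primary-key choices could look like this (table and column names are illustrative, and the vector dimension is an assumption):

```sql
-- (a) current: one global partition space, no per-session clustering
CREATE TABLE ks.vector_table_flat (
    document_id TEXT PRIMARY KEY,
    body_blob   TEXT,
    embedding   VECTOR<FLOAT, 1536>
);

-- (b) proposed: rows clustered by session, so ANN lookups can be
-- restricted to a single user's partition
CREATE TABLE ks.vector_table_by_session (
    session_id  TEXT,
    document_id TEXT,
    body_blob   TEXT,
    embedding   VECTOR<FLOAT, 1536>,
    PRIMARY KEY ((session_id), document_id)
);
```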
Once the above is addressed, LangChain also will have to change slightly.
The search_kwargs parameter used when spawning the "retriever" will be the place to specify the user_id (i.e. session_id, i.e. the partition to use for the subsequent lookup). These end up in the kwargs of the similarity_search_with_score_id_by_vector method of the Cassandra vector store, which will be able to pass this partition key to the cassIO search.
Pro: less proliferation of vector store instances.
Con: might involve more kwargs, as this param gets to the Cassandra vector store through several routes (whether it's mmr, similarity, etc., different functions are called; see the as_retriever method of the base VectorStore class).
In this case one creates as many VectorStore instances as there are session_ids, each with the partition key as an instance property, and this gets injected into each search() call within that instance. Much less intrusive, though perhaps a bit heavier resource-wise.
There are the following "objects" in Langchain that can be managed by data stored elsewhere, and not necessarily "hardcoded" in code.
In the beginning these can just be Python objects, or rather JSON configs; then we can move to tables.
AgentTypes (stores the registry of Agent Types)
LLMTypes
ToolTypes
DocumentLoaderTypes
Can start with JSON and then move to Tables
LLMConfiguration
Agent
Tools
DocumentLoaderConfiguration
Index
This is just the overall spec of what goes into an agent at a high level. Recommendation is to first implement with 100% JSON driven agent -- then implement in Schema -- since it will just be an optimization of where to store the config rather than functionality. This is applicable to future agent frameworks.
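Purely to illustrate the "100% JSON driven agent" idea, a config could take a shape like the following (every field name and value below is hypothetical):

```json
{
  "agent": {
    "agent_type": "zero-shot-react-description",
    "llm": {
      "llm_type": "openai",
      "model": "gpt-3.5-turbo",
      "temperature": 0.0
    },
    "tools": [
      {"tool_type": "vector-search", "index": "product_docs", "top_k": 4}
    ],
    "document_loaders": [
      {"loader_type": "web", "urls": ["https://example.com/docs"]}
    ]
  }
}
```

Moving this same structure into tables later is then only a storage change, as the recommendation notes.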
Different chunking (text-splitting) strategies affect how well embedding models are able to embed data into vector spaces. Right now, the out-of-the-box methods being used cut sentences off midway. This task should:
Currently, the docs of the langchain integration look very basic and don't reflect vector search or CassIO:
https://python.langchain.com/docs/modules/memory/integrations/cassandra_chat_message_history
https://python.langchain.com/docs/ecosystem/integrations/cassandra
https://github.com/hwchase17/langchain/blob/master/langchain/memory/chat_message_histories/cassandra.py
It is also possible to create a version of the chat that discusses the managed version of Cassandra (AstraDB)
https://python.langchain.com/docs/modules/memory/integrations/motorhead_memory_managed
Other places that documentation & integration is missing:
https://python.langchain.com/docs/modules/data_connection/retrievers/
https://python.langchain.com/docs/modules/data_connection/text_embedding/
A suggestion is to drop a link to the CassIO website.
A retry strategy (essentially, it could be three parameters: num_retries, retry_timeout, retry_sleep_seconds) for CQL operations.
This might have interplay with a batching strategy at the cassIO level. However, given the current design of the cassIO/langchain integration (e.g. the vector store's insert-many), one should move the whole insert-many into cassIO (which could even make sense after all).
So cassIO would expose a put_many method that internally also handles batching.
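The three-parameter strategy could be sketched as a generic wrapper (the parameter names come from the proposal above; catching bare Exception is a simplification, and real code would catch the driver's timeout/unavailable exceptions specifically):

```python
import time


def execute_with_retry(
    operation,
    num_retries=3,
    retry_timeout=10.0,
    retry_sleep_seconds=0.5,
):
    """Run a zero-argument callable, retrying on failure up to num_retries
    extra times, sleeping between attempts, and giving up once the overall
    retry_timeout deadline has passed (hypothetical sketch)."""
    deadline = time.monotonic() + retry_timeout
    last_error = None
    for attempt in range(num_retries + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt == num_retries or time.monotonic() >= deadline:
                break
            time.sleep(retry_sleep_seconds)
    raise last_error
```

A put_many could then wrap each batch execution in this, making the retry knobs uniform across CQL operations.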
Prompting the LLM with the right prompt template is important for the LLM to make sense to the content from Vector search. Important items include:
Passing None as metadata results in the column containing the literal string null (which is valid JSON).
Either allow null metadata (currently the idea is to have at least an empty dict) or forbid it (e.g. by normalizing to {} when writing).
Ensure that, when reading using the vector index, we don't require ALLOW FILTERING, as that would cause performance degradation.
In the add_texts method, the intent is to have an optional TTL which defaults to the class-level one. This is done via:
ttl_seconds = ttl_seconds or self.ttl_seconds
Suppose the class default is 10 seconds and one explicitly passes 0 to the method: the insertions then get a 10-second TTL, contrary to user expectations.
Find a better interface (this seems like a general problem with TTL, where zero and None have ambiguous meanings).
Suggestion: a symbolic NOT_PASSED default, which is neither None nor zero, checked for in the code.
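The sentinel suggestion could look like this (a sketch; the class here only mimics the described add_texts signature, it is not the real implementation):

```python
# A unique sentinel: distinguishes "argument omitted" from explicit 0 or None.
NOT_PASSED = object()


class VectorTableSketch:
    def __init__(self, ttl_seconds=None):
        self.ttl_seconds = ttl_seconds

    def add_texts(self, texts, ttl_seconds=NOT_PASSED):
        # Fall back to the class default only when the caller passed nothing;
        # explicit 0 and explicit None both survive intact.
        if ttl_seconds is NOT_PASSED:
            ttl_seconds = self.ttl_seconds
        return ttl_seconds  # the TTL that would be applied to the insertions
```

With this, `add_texts(texts)` uses the class default, while `add_texts(texts, ttl_seconds=0)` genuinely means "no TTL fallback to 10 seconds".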
https://gptcache.readthedocs.io/en/latest/index.html#roadmap