Comments (12)
We'll need to cover a lot of things that are missing from this SIP:
• What packages/licenses are needed, and are they compatible?
• What are the security/privacy implications?
• How do we (as an open-source solution) stay vendor-agnostic here? What's the abstraction layer?
This will need to be put up for a DISCUSS thread on the mailing list to move forward, but I think the proposal needs more detail/resolution.
from superset.
We'll need to cover a lot of things that are missing from this SIP:
• What packages/licenses are needed, and are they compatible?
The Python langchain package, or the modules required for making HTTP calls.
• What are the security/privacy implications?
The user configures the necessary API keys. LLM calls happen through the backend, since the schema needs to be passed to the RAG pipeline for quality responses.
Either approach works: support both options, self-hosted (protecting security & privacy) or using the provider of choice.
• How do we (as an open-source solution) stay vendor-agnostic here? What's the abstraction layer?
We can stay vendor-agnostic by leaving the choice to the user: their preferred mode (self-hosted, LLM as a service, etc.) and also their choice of LLM.
What's the abstraction layer?
I have found https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query in LangChain, which we can use directly.
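As a sketch of what such an abstraction layer might look like (every name below is hypothetical, not an agreed Superset API), the vendor choice could hide behind a small interface so Superset code never depends on a specific SDK:

```python
from typing import Protocol


class TextToSQLBackend(Protocol):
    """Hypothetical interface: anything that turns a question plus
    schema context into a SQL string satisfies this protocol."""

    def generate_sql(self, question: str, schema_context: str) -> str: ...


class EchoBackend:
    """Trivial stand-in backend used only to show the shape of the
    interface; a real implementation would call LangChain or a
    user-configured HTTP endpoint here."""

    def generate_sql(self, question: str, schema_context: str) -> str:
        return f"-- would generate SQL for: {question}\nSELECT 1"


def text_to_sql(backend: TextToSQLBackend, question: str, schema_context: str) -> str:
    # Core code depends only on the protocol, never on a vendor SDK.
    return backend.generate_sql(question, schema_context)
```

A LangChain-based backend and an HTTP-based backend would then be two interchangeable implementations of the same protocol.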
from superset.
Two options (draft) - feel free to add what you think (I probably need to find a way to make this collaborative).
If maintainers can create a sheet, that works; otherwise I can create a spreadsheet for evaluating or suggesting the various approaches and implementation ideas.
- Either use HTTP requests, where users can configure the endpoint,
- or use a sophisticated LLM framework like LangChain, which helps when scaling in terms of functionality.
| | LLM access using LangChain | LLM access using HTTP |
|---|---|---|
| Advantages | Scalability while adding/extending features | Little code; leaves the LLM to the end user |
| | Supports a lot of LLMs, but switching might be required | User configures an HTTP endpoint, giving them the choice of either self-hosted or LLM as a service, where they can just configure the endpoint |
| Configuration | User configures the necessary API keys/options | User defines the HTTP endpoint and configures headers |
| | Leverages LangChain | Code solves for one particular use case; extensibility is tough. We might need to add code that frameworks like LangChain already provide |
| | Provider-agnostic, as it supports almost all providers | Provider-agnostic: user configures the endpoint |
| | Fewer changes required, as these are SDKs | Requests need to be changed whenever releases happen, etc. |

The above table is a draft - dumping my thoughts.
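To make the HTTP option concrete, here is a minimal sketch using only the standard library. The endpoint URL, headers, and the JSON request/response shape (`prompt` in, `text` out) are all assumptions the user would configure, not a fixed contract:

```python
import json
import urllib.request


def build_sql_prompt(question: str, schema: str) -> str:
    """Assemble the prompt sent to the user-configured endpoint."""
    return (
        "Given the following database schema, write a single SQL query.\n"
        f"Schema:\n{schema}\n"
        f"Question: {question}\nSQL:"
    )


def query_llm(endpoint: str, headers: dict, question: str, schema: str) -> str:
    """POST the prompt to a user-configured LLM endpoint; the 'text'
    field is an assumption about the endpoint's JSON response shape."""
    payload = json.dumps({"prompt": build_sql_prompt(question, schema)}).encode()
    req = urllib.request.Request(endpoint, data=payload, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

The upside visible here is the small amount of code; the downside is that prompt construction, retries, and schema selection all have to be hand-rolled, which is exactly what a framework like LangChain already provides.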
Based on my evaluation, using LangChain is better:
- It provides a sophisticated framework for working with LLMs (although it might not be required imminently, it will definitely be useful in the future).
- Another concern I had with LangChain was being able to use self-hosted LLM models; it seems LangChain supports that. It has SelfHostedPipeline, or we can write a generic LLM model (just an HTTP wrapper) for LangChain to access.
from superset.
@surapuramakhil thanks. I think it makes sense to update the description with all new info and make sure you are covering all the technical/architectural considerations. First question that comes to mind, how do you intend to pull the right metadata from the database for the LLM to use? There is a limited context window and you just can't pull the whole schema for both context and performance limitations.
from superset.
@geido based on my research, LangChain already solves this.
https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query
They wrote pipelines for generating queries from text, and it works with any LLM model; we can just piggyback on that. All I am planning is to have an llm_provider or llm_factory which creates an LLM based on user needs and hands it to their pipeline.
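A sketch of that llm_factory idea (the registry keys and constructor signatures below are illustrative assumptions; real entries would wrap LangChain classes or a self-hosted HTTP wrapper):

```python
from typing import Callable, Dict

# Maps a user-chosen provider name to a constructor returning an
# LLM-like object usable by a LangChain pipeline.
_LLM_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_llm(name: str):
    """Decorator so new providers can be added without touching core code."""
    def wrap(ctor: Callable[..., object]):
        _LLM_REGISTRY[name] = ctor
        return ctor
    return wrap


def llm_factory(provider: str, **options) -> object:
    """Create the LLM the user configured; fail loudly on unknown providers."""
    try:
        return _LLM_REGISTRY[provider](**options)
    except KeyError:
        raise ValueError(f"Unknown LLM provider: {provider!r}") from None


@register_llm("fake")
class FakeLLM:
    """Stand-in provider used only for illustration."""
    def __init__(self, **options):
        self.options = options
```

The registry pattern keeps the vendor-agnostic promise: adding a provider is one registration, and the rest of the code only ever calls `llm_factory`.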
from superset.
@geido as you said updated description.
First question that comes to mind, how do you intend to pull the right metadata from the database for the LLM to use? There is a limited context window, and you just can't pull the whole schema for both context and performance limitations.
Let's try with langchain and see its results.
from superset.
It looks more like a toy for now: it pulls in definitions for all the available tables.
This won't work for production databases that might have hundreds of tables and columns.
from superset.
I think having langchain in the repo might be a nice thing to have to enable LLM-related capabilities. However, that would be a separate SIP to illustrate how langchain could be leveraged in the repo. It looks like starting from SQL generation is hard.
from superset.
It looks like starting from SQL generation is hard.
Why do you think so? It's the first use case that Apache Superset needs.
from superset.
As someone who has actually implemented this exact idea in superset for a hackathon a few months back, this is a pipe-dream at best (to be fairly blunt). Using RAG to pull relevant table metadata at prompt-time still led to unmanageable levels of LLM hallucination that only grows worse as the size of the warehouse being queried increases.
Something like this may be feasible for a user with a handful of tables, but at-scale it simply doesn't work. And a query that is 99% correct is functionally worthless if this is intended to be utilized by folks who don't have the skills necessary to parse through AI-generated SQL.
from superset.
like this may be feasible for a user with a handful of tables, but at-scale it simply doesn't work
This is a known limitation of language models, and that's exactly why the LLM choice is given to users. If the scale is high, the best they can do is use a high-context-size model like Gemini 1.5 Pro. That's a separate data science problem which Apache Superset doesn't need to solve; we just leverage what is available.
Using RAG to pull relevant table metadata at prompt-time still led to unmanageable levels of LLM hallucination that only grows worse as the size of the warehouse being queried increases.
This is a separate data science problem which Apache Superset doesn't need to solve; the LangChain community (quite popular in data science) is currently working on it. We just leverage it.
This might protect against hallucination: https://python.langchain.com/docs/use_cases/sql/query_checking/
Prompting/RAG strategies for working at scale: https://python.langchain.com/docs/use_cases/sql/large_db/
As both evolve over time, the quality of the generated queries will get better and better.
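One simple version of that large-database strategy is to prompt with only the schema entries that look relevant to the question. The sketch below uses a naive keyword-overlap heuristic, far cruder than LangChain's table-selection chains, purely to show the idea:

```python
def relevant_tables(question: str, schemas: dict) -> dict:
    """Keep only tables whose name or column names appear in the
    question, so the schema passed to the model stays inside its
    context window instead of including every table."""
    words = {w.strip("?,.").lower() for w in question.split()}
    picked = {}
    for table, columns in schemas.items():
        names = {table.lower()} | {c.lower() for c in columns}
        if names & words:
            picked[table] = columns
    return picked
```

A production version would use embeddings or an LLM call to rank tables, but even this shape shows why the whole schema never needs to be in the prompt.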
a query that is 99% correct is functionally worthless if this is intended to be utilized by folks who don't have the skills necessary to parse through AI-generated SQL.
I agree with you on this: it doesn't fully solve the problem for those who don't have the necessary knowledge to understand AI-generated SQL. It's a copilot, not an autopilot.
from superset.
Ah, I have found this.
This is a premium feature of Preset
https://preset.io/blog/preset-ai-assist/
from superset.