
dfsql's Introduction


MindsDB is the platform for building AI from enterprise data. You can create, serve, and fine-tune models in real-time from your database, vector store, and application data.

📖 About us

MindsDB is the platform for building AI from enterprise data.

With MindsDB, you can deploy, serve, and fine-tune models in real-time, utilizing data from databases, vector stores, or applications, to build AI-powered apps - using universal tools developers already know.

MindsDB integrates with numerous data sources, including databases, vector stores, and applications, and popular AI/ML frameworks, including AutoML and LLMs. MindsDB connects data sources with AI/ML frameworks and automates routine workflows between them. By doing so, we bring data and AI together, enabling the intuitive implementation of customized AI systems.

Learn more about features and use cases of MindsDB here.

🚀 Get Started

To get started, install MindsDB locally via Docker or Docker Desktop, following the instructions in the linked docs.

MindsDB enhances SQL syntax to enable seamless development and deployment of AI-powered applications. Furthermore, users can interact with MindsDB not only via SQL API but also via REST APIs, Python SDK, JavaScript SDK, and MongoDB-QL.
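For instance, a minimal Python SDK sketch (a hedged example, not official usage: it assumes the `mindsdb_sdk` package is installed and a local MindsDB instance is listening on the default port; the query reuses the sentiment_classifier example from this page):

```python
# Hedged sketch: querying MindsDB through the Python SDK.
QUERY = (
    "SELECT input.review, output.sentiment "
    "FROM data_source.amazon_reviews_job AS input "
    "JOIN sentiment_classifier AS output;"
)

def run(server_url="http://127.0.0.1:47334"):
    try:
        import mindsdb_sdk  # optional dependency of this sketch
    except ImportError:
        return None  # SDK not installed; nothing to run
    server = mindsdb_sdk.connect(server_url)
    # Fetch the join of input data and model output as a DataFrame.
    return server.get_project("mindsdb").query(QUERY).fetch()
```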

🎯 Solutions ⚙️ SQL Query Examples
🤖 Fine-Tuning: FINETUNE mindsdb.hf_model FROM postgresql.table;
📚 Knowledge Base: CREATE KNOWLEDGE_BASE my_knowledge FROM (SELECT contents FROM drive.files);
🔍 Semantic Search: SELECT * FROM rag_model WHERE question='What product is best for treating a cold?';
⏱️ Real-Time Forecasting: SELECT * FROM binance.trade_data WHERE symbol = 'BTCUSDT';
🕵️ Agents: CREATE AGENT my_agent USING model='chatbot_agent', skills=['knowledge_base'];
💬 Chatbots: CREATE CHATBOT slack_bot USING database='slack', agent='customer_support';
⏲️ Time-Driven Automation: CREATE JOB twitter_bot ( <sql_query1>, <sql_query2> ) START '2023-04-01 00:00:00';
🔔 Event-Driven Automation: CREATE TRIGGER data_updated ON mysql.customers_data (sql_code);

💡 Examples

MindsDB enables you to deploy AI/ML models, send predictions to your application, and automate AI workflows.

Discover more tutorials and use cases here.

AI Workflow Automation

This category of use cases involves tasks that get data from a data source, pass it through an AI/ML model, and write the output to a data destination.

Common use cases are anomaly detection, data indexing/labeling/cleaning, and data transformation.

This example showcases the data enrichment flow, where input data comes from a PostgreSQL database and is passed through an OpenAI model to generate new content which is saved into a data destination.

We take customer reviews from a PostgreSQL database. Then, we deploy an OpenAI model that analyzes all customer reviews and assigns sentiment values. Finally, to automate the workflow for incoming customer reviews, we create a job that generates and saves AI output into a data destination.

-- Step 1. Connect a data source to MindsDB
CREATE DATABASE data_source
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT *
FROM data_source.amazon_reviews_job;

-- Step 2. Deploy an AI model
CREATE ML_ENGINE openai_engine
FROM openai
USING
    openai_api_key = 'your-openai-api-key';

CREATE MODEL sentiment_classifier
PREDICT sentiment
USING
    engine = 'openai_engine',
    model_name = 'gpt-4',
    prompt_template = 'describe the sentiment of the reviews
						strictly as "positive", "neutral", or "negative".
						"I love the product":positive
						"It is a scam":negative
						"{{review}}.":';

DESCRIBE sentiment_classifier;

-- Step 3. Join input data with AI model to get AI output
SELECT input.review, output.sentiment
FROM data_source.amazon_reviews_job AS input
JOIN sentiment_classifier AS output;

-- Step 4. Automate this workflow to accommodate real-time and dynamic data
CREATE DATABASE data_destination
WITH ENGINE = "engine-name",      -- choose the data source you want to connect to save AI output
PARAMETERS = {                    -- list of available data sources: https://docs.mindsdb.com/integrations/data-overview
    "key": "value",
	...
};

CREATE JOB ai_automation_flow (
	INSERT INTO data_destination.ai_output (
		SELECT input.created_at,
			   input.product_name,
			   input.review,
			   output.sentiment
		FROM data_source.amazon_reviews_job AS input
		JOIN sentiment_classifier AS output
		WHERE input.created_at > LAST
	);
);

AI System Deployment

This category of use cases involves creating AI systems composed of multiple connected parts, including various AI/ML models and data sources, and exposing such AI systems via APIs.

Common use cases are agents and assistants, recommender systems, forecasting systems, and semantic search.

This example showcases AI agents, a feature developed by MindsDB. AI agents can be assigned certain skills, including text-to-SQL skills and knowledge bases. Skills provide an AI agent with input data that can be in the form of a database, a file, or a website.

We create a text-to-SQL skill based on the car sales dataset and deploy a conversational model, which are both components of an agent. Then, we create an agent and assign this skill and this model to it. This agent can be queried to ask questions about data stored in assigned skills.

-- Step 1. Connect a data source to MindsDB
CREATE DATABASE data_source
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "samples.mindsdb.com",
    "port": "5432",
    "database": "demo",
    "schema": "demo_data"
};

SELECT *
FROM data_source.car_sales;

-- Step 2. Create a skill
CREATE SKILL my_skill
USING
    type = 'text2sql',
    database = 'data_source',
    tables = ['car_sales'],
    description = 'car sales data of different car types';

SHOW SKILLS;

-- Step 3. Deploy a conversational model
CREATE ML_ENGINE langchain_engine
FROM langchain
USING
      openai_api_key = 'your openai-api-key';
      
CREATE MODEL my_conv_model
PREDICT answer
USING
    engine = 'langchain_engine',
    model_name = 'gpt-4',
    mode = 'conversational',
    user_column = 'question',
    assistant_column = 'answer',
    max_tokens = 100,
    temperature = 0,
    verbose = True,
    prompt_template = 'Answer the user input in a helpful way';

DESCRIBE my_conv_model;

-- Step 4. Create an agent
CREATE AGENT my_agent
USING
    model = 'my_conv_model',
    skills = ['my_skill'];

SHOW AGENTS;

-- Step 5. Query an agent
SELECT *
FROM my_agent
WHERE question = 'what is the average price of cars from 2018?';

SELECT *
FROM my_agent
WHERE question = 'what is the max mileage of cars from 2017?';

SELECT *
FROM my_agent
WHERE question = 'what percentage of sold cars (from 2016) are automatic/semi-automatic/manual cars?';

SELECT *
FROM my_agent
WHERE question = 'is petrol or diesel more common for cars from 2019?';

SELECT *
FROM my_agent
WHERE question = 'what is the most commonly sold model?';

Agents are accessible via API endpoints.
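For illustration, a hedged standard-library sketch of calling such an endpoint (the route and payload shape are assumptions; check the MindsDB REST API docs for the exact contract):

```python
import json
from urllib import request

# Assumed endpoint shape (hypothetical): verify the route and payload
# against the MindsDB REST API documentation before relying on it.
BASE = "http://127.0.0.1:47334"
URL = f"{BASE}/api/projects/mindsdb/agents/my_agent/completions"
PAYLOAD = {"messages": [{"question": "what is the average price of cars from 2018?",
                         "answer": None}]}

def ask():
    # Requires a running MindsDB instance; not executed here.
    req = request.Request(URL, data=json.dumps(PAYLOAD).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```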

🤝 Contribute

If you'd like to contribute to MindsDB, install MindsDB for development following these instructions.

You'll find the contribution guide here.

We are always open to suggestions, so feel free to open new issues with your ideas, and we can guide you!

This project is released with a Contributor Code of Conduct. By participating in this project, you agree to follow its terms.

Also, check out the rewards and community programs here.

🤝 Support

If you find a bug, please submit an issue on GitHub here.

Here is how you can get community support:

If you need commercial support, please contact the MindsDB team.

💚 Current contributors


🔔 Subscribe to updates

Join our Slack community and subscribe to the monthly Developer Newsletter to get product updates, information about MindsDB events and contests, and useful content, like tutorials.

⚖️ License

For detailed licensing information, please refer to the LICENSE file.

dfsql's People

Contributors: btseytlin, stpmax, torrmal

dfsql's Issues

dfsql.exceptions.dfsqlException: Table from_tables found in from_tables, but not in the SQL query.

Hi, I'm trying to run the example in the README, but I'm not able to run it.

import pandas as pd
from dfsql import sql_query

df = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "dog"],
    "height": [23, 100, 25, 71]
})

df.head()

sql_query('SELECT animal, height FROM "animals_df" WHERE height > 50', from_tables={"animals_df": df})

But I'm getting an error with the table name:

File "/home/ray/anaconda3/lib/python3.7/site-packages/dfsql/__init__.py", line 23, in sql_query
    raise dfsqlException(f"Table {table_name} found in from_tables, but not in the SQL query.")
dfsql.exceptions.dfsqlException: Table from_tables found in from_tables, but not in the SQL query.

I don't know if anyone else is having the same issue. Thanks a lot for your help.

Add contributors agreement and try to apply it to post license-switch contributions

We should go through all contributions since we switched from an MIT License to a GPL-3.0 License and either:

a) Have all contributors agree to and sign something like the ASF Contributor License Agreement, or alternatively remove their contributions.

b) In the future, we should have some easy way of allowing anyone who contributes code to sign an agreement, similar to the way the Apache Foundation does it.

This is for {insert legal reasons I would make a mess of explaining}; feel free to ask, send us an email, or ask a question here in case you don't agree with this policy or think it's in some way disadvantageous to MindsDB and/or its open source contributors.

Table alias and groupby produces wrong output

This test fails:

def test_complex_groupby(self, googleplay_csv, data_source_googleplay):
        sql = """SELECT sub.category, avg(reviews) AS avg_reviews
                    FROM (
                        SELECT category, CAST(reviews AS float) AS reviews
                         FROM (
                            SELECT category, reviews
                            FROM googleplaystore
                            LIMIT 100
                         )
                    ) AS sub
                    GROUP BY sub.category
                    HAVING avg_reviews > 0.4
                    LIMIT 10"""

        df = pd.read_csv(googleplay_csv)
        inner = df[['Category', 'Reviews']].iloc[:100]
        out_df = inner.groupby(['Category']).agg({'Reviews': 'mean'}).reset_index()
        out_df.columns = ['sub.category', 'avg_reviews']

        query_result = data_source_googleplay.query(sql)
        assert out_df.shape == query_result.shape
        assert (out_df.dropna().values == query_result.dropna().values).all()

Error if selecting column name is 'status'

query:

result_df = dfsql.sql_query(
                    'select name, status from predictors',
                    ds_kwargs={'case_sensitive': False},
                    reduce_output=False,
                    predictors=predictors_df
                )

error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 30, in sql_query
    result = ds.query(sql, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 171, in query
    query = parse_sql(sql)
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb_sql/mindsdb_sql/__init__.py", line 25, in parse_sql
    ast = parser.parse(tokens)
  File "/home/maxs/dev/mdb/venv38/lib/python3.8/site-packages/sly-0.4-py3.8.egg/sly/yacc.py", line 2119, in parse
    tok = self.error(errtoken)
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb_sql/mindsdb_sql/parser/parser.py", line 544, in error
    raise ParsingException(f"Syntax error at token {p.type}: \"{p.value}\"")
mindsdb_sql.exceptions.ParsingException: Syntax error at token STATUS: "status"
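Until the parser accepts the reserved word, a plain-pandas fallback produces the same projection (the predictors_df contents below are hypothetical stand-ins for the report's data):

```python
import pandas as pd

# Hypothetical stand-in for the predictors_df from the report above.
predictors_df = pd.DataFrame({
    "name": ["home_rentals", "churn"],
    "status": ["complete", "training"],
    "accuracy": [0.97, None],
})

# Plain-pandas projection that sidesteps the STATUS keyword clash.
result_df = predictors_df[["name", "status"]]
```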

Exception if two values are compared in 'WHERE'

dfsql.sql_query(
            'SELECT name, status FROM predictors WHERE 1 = 0',
            ds_kwargs={'case_sensitive': False},
            reduce_output=False,
            predictors=predictors_df
        )
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 30, in sql_query
    result = ds.query(sql, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 170, in query
    return self.execute_query(query, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 543, in execute_query
    return self.execute_select(query, reduce_output=reduce_output)
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/data_sources/base_data_source.py", line 398, in execute_select
    source_df = source_df[index.values]
AttributeError: 'bool' object has no attribute 'values'

Queries fail for column names with dashes

Given seq.csv:

sequence,affinity,seq-len
ARKKERLW,0.13,8
WAKWAKA,0.1,7
EEERRRKK,0.2,8

The code:

import pandas as pd
from dfsql import sql_query

df = pd.read_csv("seq.csv")
res = sql_query("SELECT * FROM df WHERE seq-len > 7", df=df)
print(res)

Fails with the error:

dfsql.exceptions.QueryExecutionException: Column seq not found.

If the column is renamed seq-len -> seqlen and the query is adjusted accordingly, then no error occurs.
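Until dashed identifiers are supported, one workaround sketch is to normalize column names before querying (shown here in plain pandas, using the data from the report above):

```python
import pandas as pd

# Data from the seq.csv example above.
df = pd.DataFrame({
    "sequence": ["ARKKERLW", "WAKWAKA", "EEERRRKK"],
    "affinity": [0.13, 0.1, 0.2],
    "seq-len": [8, 7, 8],
})

# Rename dashed columns to underscores so they parse as single identifiers.
safe = df.rename(columns=lambda c: c.replace("-", "_"))
res = safe[safe["seq_len"] > 7]  # equivalent of WHERE seq_len > 7
```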

No mac (or windows/source) packages means the installation fails.

Hello! I'm getting this error when I try to install via poetry add "modin[sql]" on my Macbook (outside of a Docker image):

$ poetry add "modin[sql]"
Using version ^0.10.1 for modin

Updating dependencies
Resolving dependencies... (0.6s)
Resolving dependencies... (0.3s)

Package operations: 2 installs, 0 updates, 0 removals

  • Installing dfsql (0.3.1): Failed

  RuntimeError

  Unable to find installation candidates for dfsql (0.3.1)

  at ~/.poetry/lib/poetry/installation/chooser.py:72 in choose_for
       68│
       69│             links.append(link)
       70│
       71│         if not links:
    →  72│             raise RuntimeError(
       73│                 "Unable to find installation candidates for {}".format(package)
       74│             )
       75│
       76│         # Get the best link


Failed to add packages, reverting the pyproject.toml file to its original content.

Doing a bit of digging, it looks like this change was made to publish Linux wheels only; I'm not able to understand the context behind that change given the lack of a description. I'm happy to throw together a quick Dockerfile to get my project to compile so I can try this out, but figured I'd create an issue, as likely others in the near future will hit this same problem!
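A minimal Dockerfile along those lines might look like this (hypothetical; it only assumes the Linux wheel resolves, per the dig above):

```dockerfile
# Hypothetical workaround: build on Linux, where dfsql publishes wheels.
FROM python:3.8-slim
RUN pip install "modin[sql]"
```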

Spontaneous error during query

From time to time, on a random query, I get an error like this:

data = dfsql.sql_query(
      str(new_statement),
      ds_kwargs={'case_sensitive': False},
      reduce_output=False,
      **{'dataframe': df}
  )

error:

Traceback (most recent call last):
  File "/home/maxs/dev/mdb/venv38/sources/mindsdb/mindsdb/api/mysql/mysql_proxy/mysql_proxy.py", line 1194, in query_answer
    data = dfsql.sql_query(
  File "/home/maxs/dev/mdb/venv38/sources/dfsql/dfsql/__init__.py", line 36, in sql_query
    shutil.rmtree(tmpdir)
  File "/usr/local/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/local/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/dfsql_temp_1639132502'

[Errno 39] Directory not empty: '/tmp/dfsql_temp_1639132502'
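A hedged sketch of a more tolerant cleanup that could avoid this race (a workaround idea, not dfsql's actual code: retry `shutil.rmtree` with `ignore_errors` so files created mid-removal don't abort the whole cleanup):

```python
import os
import shutil
import tempfile

def safe_rmtree(path, retries=3):
    """Remove a directory tree, retrying to tolerate files created mid-removal."""
    for _ in range(retries):
        shutil.rmtree(path, ignore_errors=True)
        if not os.path.exists(path):
            return True
    return False  # still populated after all retries

# Demo: a populated temp dir is removed cleanly.
demo = tempfile.mkdtemp(prefix="dfsql_temp_")
with open(os.path.join(demo, "part.csv"), "w") as f:
    f.write("x\n")
removed = safe_rmtree(demo)
```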

KeyError if 'ORDER BY' contains table name

This query produces KeyError('COLUMNS.ORDINAL_POSITION'):

dfsql.sql_query(
    "SELECT * FROM COLUMNS WHERE COLUMNS.TABLE_SCHEMA = 'MINDSDB' AND COLUMNS.TABLE_NAME = 'predictors' ORDER BY COLUMNS.ORDINAL_POSITION",
    ds_kwargs={'case_sensitive': False},
    reduce_output=False,
    **{table_name: dataframe}
)

The dataframe has a column 'ORDINAL_POSITION'.
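Until qualified names in ORDER BY are handled, a plain-pandas sketch of the intended ordering is to strip the table qualifier before sorting (the sample frame below is hypothetical):

```python
import pandas as pd

# Hypothetical sample of the COLUMNS dataframe from the report above.
df = pd.DataFrame({"ORDINAL_POSITION": [2, 1, 3],
                   "COLUMN_NAME": ["b", "a", "c"]})

def order_by(frame, qualified_col):
    # Drop the table qualifier: 'COLUMNS.ORDINAL_POSITION' -> 'ORDINAL_POSITION'.
    col = qualified_col.split(".")[-1]
    return frame.sort_values(col).reset_index(drop=True)

out = order_by(df, "COLUMNS.ORDINAL_POSITION")
```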
