pyspark-ai / pyspark-ai

English SDK for Apache Spark

Home Page: https://pyspark.ai/

License: Apache License 2.0

Python 98.75% Makefile 1.02% Shell 0.23%

pyspark-ai's Introduction

English SDK for Apache Spark


Introduction

The English SDK for Apache Spark is an extremely simple yet powerful tool. It takes English instructions and compiles them into PySpark objects like DataFrames. Its goal is to make Spark more user-friendly and accessible, allowing you to focus your efforts on extracting insights from your data.

For a more comprehensive introduction and background to our project, we have the following resources:

  • Blog Post: A detailed walkthrough of our project.
  • Demo Video: 2023 Data + AI Summit announcement video with demo.
  • Breakout Session: A deep dive into the story behind the English SDK, its features, and future work at the Data + AI Summit 2023.

Installation

pyspark-ai can be installed via pip from PyPI:

pip install pyspark-ai

pyspark-ai can also be installed with optional dependencies to enable certain functionality. For example, to install pyspark-ai with the optional dependencies to plot data from a DataFrame:

pip install "pyspark-ai[plot]"

To install all optional dependencies:

pip install "pyspark-ai[all]"

For a full list of optional dependencies, see Installation and Setup.

Configuring OpenAI LLMs

As of July 2023, we have found that GPT-4 works best with the English SDK. This model is readily accessible to all developers through the OpenAI API.

To use OpenAI's large language models (LLMs), set your OpenAI secret key as the OPENAI_API_KEY environment variable. This key can be found in your OpenAI account. Example:

export OPENAI_API_KEY='sk-...'

By default, SparkAI instances will use the GPT-4 model. However, you're encouraged to experiment with creating and implementing other LLMs, which can be passed during the initialization of SparkAI instances for various use cases.

Usage

Initialization

from pyspark_ai import SparkAI

spark_ai = SparkAI()
spark_ai.activate()  # activate partial functions for Spark DataFrame

You can also pass other LLMs to construct the SparkAI instance. For example, to use Azure OpenAI by following this guide:

from langchain.chat_models import AzureChatOpenAI
from pyspark_ai import SparkAI

llm = AzureChatOpenAI(
    deployment_name=...,
    model_name=...
)
spark_ai = SparkAI(llm=llm)
spark_ai.activate()  # activate partial functions for Spark DataFrame

Using the Azure OpenAI service can provide better data privacy and security, as per Microsoft's Data Privacy page.

DataFrame Transformation

Given the following DataFrame df:

df = spark_ai._spark.createDataFrame(
    [
        ("Normal", "Cellphone", 6000),
        ("Normal", "Tablet", 1500),
        ("Mini", "Tablet", 5500),
        ("Mini", "Cellphone", 5000),
        ("Foldable", "Cellphone", 6500),
        ("Foldable", "Tablet", 2500),
        ("Pro", "Cellphone", 3000),
        ("Pro", "Tablet", 4000),
        ("Pro Max", "Cellphone", 4500)
    ],
    ["product", "category", "revenue"]
)

You can write English to perform transformations. For example:

df.ai.transform("What are the best-selling and the second best-selling products in every category?").show()
product   category   revenue
Foldable  Cellphone  6500
Normal    Cellphone  6000
Mini      Tablet     5500
Pro       Tablet     4000
df.ai.transform("Pivot the data by product and the revenue for each product").show()
Category   Normal  Mini  Foldable  Pro   Pro Max
Cellphone  6000    5000  6500      3000  4500
Tablet     1500    5500  2500      4000  null

For a detailed walkthrough of the transformations, please refer to our transform_dataframe.ipynb notebook.

Transform Accuracy Improvement: Vector Similarity Search

To improve the accuracy of transform query generation, you can also optionally enable vector similarity search. This is done by specifying a vector_store_dir location for the vector files when you initialize SparkAI. For example:

from pyspark_ai import SparkAI

spark_ai = SparkAI(vector_store_dir="vector_store/") # vector files will be stored in the dir "vector_store"
spark_ai.activate() 

Now when you call df.ai.transform as before, the agent will use word embeddings to generate accurate query values.
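
For instance, with the vector store enabled, an instruction that refers to a column value loosely can still resolve to the exact stored value. A hedged sketch reusing the product/category/revenue df defined above (the exact matching behavior depends on the LLM and the embeddings):

# Assumes spark_ai was created with vector_store_dir as above and df is the
# DataFrame from the transformation example. With similarity search enabled,
# a loosely phrased value like "foldable cell phones" can be matched to the
# stored values "Foldable" / "Cellphone".
df.ai.transform("total revenue of foldable cell phones").show()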

For a detailed walkthrough, please refer to our vector_similarity_search.ipynb.

Plot

Let's create a DataFrame for car sales in the U.S.

# auto sales data from https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand
data = [('Toyota', 1849751, -9), ('Ford', 1767439, -2), ('Chevrolet', 1502389, 6),
        ('Honda', 881201, -33), ('Hyundai', 724265, -2), ('Kia', 693549, -1),
        ('Jeep', 684612, -12), ('Nissan', 682731, -25), ('Subaru', 556581, -5),
        ('Ram Trucks', 545194, -16), ('GMC', 517649, 7), ('Mercedes-Benz', 350949, 7),
        ('BMW', 332388, -1), ('Volkswagen', 301069, -20), ('Mazda', 294908, -11),
        ('Lexus', 258704, -15), ('Dodge', 190793, -12), ('Audi', 186875, -5),
        ('Cadillac', 134726, 14), ('Chrysler', 112713, -2), ('Buick', 103519, -42),
        ('Acura', 102306, -35), ('Volvo', 102038, -16), ('Mitsubishi', 102037, -16),
        ('Lincoln', 83486, -4), ('Porsche', 70065, 0), ('Genesis', 56410, 14),
        ('INFINITI', 46619, -20), ('MINI', 29504, -1), ('Alfa Romeo', 12845, -30),
        ('Maserati', 6413, -10), ('Bentley', 3975, 0), ('Lamborghini', 3134, 3),
        ('Fiat', 915, -61), ('McLaren', 840, -35), ('Rolls-Royce', 460, 7)]

auto_df = spark_ai._spark.createDataFrame(data, ["Brand", "US_Sales_2022", "Sales_Change_Percentage"])

We can visualize the data with the plot API:

# call plot() with no args for LLM-generated plot
auto_df.ai.plot()

2022 USA national auto sales by brand

To plot with an instruction:

auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")

2022 USA national auto sales market share by brand

Please refer to example.ipynb for more APIs and detailed usage examples.

Contributing

We're delighted that you're considering contributing to the English SDK for Apache Spark project! Whether you're fixing a bug or proposing a new feature, your contribution is highly appreciated.

Before you start, please take a moment to read our Contribution Guide. This guide provides an overview of how you can contribute to our project. We're currently in the early stages of development, and we're working on introducing more comprehensive test cases and GitHub Actions jobs for enhanced testing of each pull request.

If you have any questions or need assistance, feel free to open a new issue in the GitHub repository.

Thank you for helping us improve the English SDK for Apache Spark. We're excited to see your contributions!

License

Licensed under the Apache License 2.0.

pyspark-ai's People

Contributors

allisonwang-db, asl3, bjornjorgensen, dennyglee, gatorsmile, gengliangwang, grundprinzip, laurencewalton, mengxr, pkandarpa-cs, pohlposition, semyonsinchenko, sharshjot, vinodhthiagarajan1309, vjr, xinrong-meng


pyspark-ai's Issues

VertexAI - Error: NameError("name 'spark' is not defined")

Hi,

I'm having an issue using Vertex AI as LLM.

This is the log:

import plotly.express as px

df = spark.sql("SELECT Nationality, count(*) as cnt FROM football_stats GROUP BY Nationality ORDER BY cnt DESC LIMIT 10")
df_pd = df.toPandas()
fig = px.pie(df_pd, values="cnt", names="Nationality", title="Top 10 Nationalities")
fig.show()

INFO:spark_ai:

import plotly.express as px

df = spark.sql("SELECT Nationality, count(*) as cnt FROM football_stats GROUP BY Nationality ORDER BY cnt DESC LIMIT 10")
df_pd = df.toPandas()
fig = px.pie(df_pd, values="cnt", names="Nationality", title="Top 10 Nationalities")
fig.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyspark_ai/pyspark_ai.py](https://localhost:8080/#) in plot_df(self, df, desc, cache)
    412         try:
--> 413             exec(compile(code, "plot_df-CodeGen", "exec"))
    414         except Exception as e:

3 frames
NameError: name 'spark' is not defined

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyspark_ai/pyspark_ai.py](https://localhost:8080/#) in plot_df(self, df, desc, cache)
    413             exec(compile(code, "plot_df-CodeGen", "exec"))
    414         except Exception as e:
--> 415             raise Exception("Could not evaluate Python code", e)
    416 
    417     def verify_df(self, df: DataFrame, desc: str, cache: bool = True) -> None:

Exception: ('Could not evaluate Python code', NameError("name 'spark' is not defined"))

and this is the configuration I have:

# Create the Spark session
spark = SparkSession.builder \
  .appName('PySparkAI_with_BQ')\
  .config('spark.jars', "/content/spark-3.3-bigquery-0.32.0.jar") \
  .getOrCreate()

# Init Google AI platform
aiplatform.init(project=project_id)

llm = VertexAI(temperature=0.9)

# Activate pyspark_ai
spark_ai = SparkAI(llm=llm, spark_session=spark, verbose=True)
spark_ai.activate()  # active partial functions for Spark DataFrame

#Load data from BQ
bq_source = spark.read.format('bigquery') \
    .option('project','project-1') \
    .option('parentProject','project-1') \
    .option('table','dataset_name.football_stats') \
    .load()
 
auto_graph = bq_source.ai.plot("Create a pie chart with the top 10 nationalities")
 

It's curious that in every new run the tool is following a different approach:

import plotly.express as px

df = spark.read.csv('football_stats.csv', header=True, inferSchema=True)

Why is it trying to use a CSV?

Any advice is appreciated.
Thanks!

[Proposal] Issues template

I can create templates for different kinds of issues:

  1. Feature-request or proposal
  2. Databricks-related bug (in such a case we need to ask user about DBR version too)
  3. Bug (in this case we need information about spark version, at least)
  4. Question

For bugs we need to collect a full stack-trace from user too.

Exploring PySpark AI: ValidationError: 1 validation error for PythonExecutor df instance of DataFrame expected (type=type_error.arbitrary_type; expected_arbitrary_type=DataFrame)

Below is the code I am running in Databricks and it's throwing the error below.
CODE:
data = [('Toyota', 1849751, -9), ('Ford', 1767439, -2), ('Chevrolet', 1502389, 6),
('Honda', 881201, -33), ('Hyundai', 724265, -2), ('Kia', 693549, -1),
('Jeep', 684612, -12), ('Nissan', 682731, -25), ('Subaru', 556581, -5),
('Ram Trucks', 545194, -16), ('GMC', 517649, 7), ('Mercedes-Benz', 350949, 7),
('BMW', 332388, -1), ('Volkswagen', 301069, -20), ('Mazda', 294908, -11),
('Lexus', 258704, -15), ('Dodge', 190793, -12), ('Audi', 186875, -5),
('Cadillac', 134726, 14), ('Chrysler', 112713, -2), ('Buick', 103519, -42),
('Acura', 102306, -35), ('Volvo', 102038, -16), ('Mitsubishi', 102037, -16),
('Lincoln', 83486, -4), ('Porsche', 70065, 0), ('Genesis', 56410, 14),
('INFINITI', 46619, -20), ('MINI', 29504, -1), ('Alfa Romeo', 12845, -30),
('Maserati', 6413, -10), ('Bentley', 3975, 0), ('Lamborghini', 3134, 3),
('Fiat', 915, -61), ('McLaren', 840, -35), ('Rolls-Royce', 460, 7)]

auto_df = spark_ai._spark.createDataFrame(data, ["Brand", "US_Sales_2022", "Sales_Change_Percentage"])
auto_df.ai.plot()

ERROR:

ValidationError: 1 validation error for PythonExecutor
df
instance of DataFrame expected (type=type_error.arbitrary_type; expected_arbitrary_type=DataFrame)

Upgrade langchain

pip install jupyter_ai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyspark-ai 0.1.12 requires langchain<0.0.202,>=0.0.201, but you have langchain 0.0.220 which is incompatible.

Rate Limit Reached Error

When using the 0.1.9 version of pyspark-ai, I am consistently hitting the rate limit when using a paid version of GPT 3.5. This happened the first time I attempted to use pyspark-ai with no prior utilization of ChatGPT for the day.

Image of error: (screenshot)

Please let me know if any additional information is needed to troubleshoot the issue.

Add an option to pass the model name as string

The current SparkAI init method receives a BaseLanguageModel object and then creates a ChatOpenAI as the default.

I suggest adding an additional string-type parameter to the init method (say, model_name) and then creating a ChatOpenAI object with that model.

This will make it more streamlined to try out different OpenAI models and compare them.
It also removes the need for the additional import from langchain.

I can create a PR if that sounds helpful.
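
A minimal sketch of what the proposal could look like from the caller's side, assuming a small helper outside the current API (model_name is not an existing SparkAI parameter today):

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI


def spark_ai_from_model_name(model_name: str = "gpt-4", **kwargs) -> SparkAI:
    # Wrap the model name in a ChatOpenAI instance, mirroring what SparkAI does for its default LLM.
    return SparkAI(llm=ChatOpenAI(model_name=model_name, temperature=0), **kwargs)


spark_ai = spark_ai_from_model_name("gpt-3.5-turbo")
spark_ai.activate()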

Create a join method

I'm thinking about how to use this in my business to bring people into databricks instead of using low code platforms that are more difficult to support.

One limitation I see today is that you can only work with a single dataframe. There are a few enhancements that I think help:

  1. A table method. Similar to ingesting data from the web, but search through available schemas/tables to find the one that best fits the users query.
  2. A join method. If you have two spark ai dataframes it would be great to just do df.ai.join(df2) and have the tool figure out the best way to join. You could add the option for English explanation like df.ai.join(df2, "everything from df") for a left inner join.
  3. A code method. This should just provide the code that a user can copy/paste to achieve whatever it is the ai method did.

Obviously these are trivial to do in spark for the average spark user, but I'm thinking about how this could allow nontechnical users to fully interact with a whole catalog of data in English. Then the code method would allow them to make the notebook more deterministic if they wanted it to become a job or something.

Signed cache

It is convenient for a developer to ship code in English and the cache data together to optimize speed, save cost, and more importantly reproduce the result. However, we need a way for end users to trust the cache data, which contains "compiled" code and could be modified by others between the user and the trusted developer. One way to do this is to let the developer sign the cache entries with a private key and share the public key with end users to verify the cached content.
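
A minimal sketch of the signing idea, using Ed25519 keys from the cryptography package; none of these helpers exist in pyspark-ai today and the cache entry format shown is only illustrative:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_cache_entry(private_key: Ed25519PrivateKey, english: str, compiled_code: str) -> bytes:
    # The developer signs each (English instruction, compiled code) pair before shipping the cache.
    return private_key.sign(f"{english}\n{compiled_code}".encode("utf-8"))


def verify_cache_entry(
    public_key: Ed25519PublicKey, english: str, compiled_code: str, signature: bytes
) -> bool:
    # End users verify each entry with the developer's public key before trusting it.
    try:
        public_key.verify(signature, f"{english}\n{compiled_code}".encode("utf-8"))
        return True
    except InvalidSignature:
        return False


# Example: developer side generates the key pair and signs; user side verifies.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()
sig = sign_cache_entry(private_key, "count of rows", "SELECT COUNT(*) FROM df")
assert verify_cache_entry(public_key, "count of rows", "SELECT COUNT(*) FROM df", sig)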

Demo code has error in DBR 13.2

spark = spark_ai._spark

# create a dataframe containing different bike product_category and product_count
df = spark.createDataFrame(
    [
        ("children bike", 20),
        ("comfort bike", 15),
        ("mountain bike", 10),
        ("electric bike", 5),
        ("road bike", 3),
        ("cruisers bike", 8)
    ],
    ["product_category", "product_count"]
)
df.ai.transform("pivot using product_category for product_count").show()

It got this error:
INFO:spark_ai:Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_fb06c3")

Entering new AgentExecutor chain...
OutputParserException: Could not parse LLM output: SELECT * FROM spark_ai_temp_view_fb06c3 PIVOT (SUM(product_count) FOR product_category IN ('category1', 'category2', 'category3', ...))

Could not parse LLM output when using GPT3.5-turbo

When using GPT3.5-turbo with pyspark-ai, I'm getting the error below:

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_875e9a")

Entering new AgentExecutor chain...
Traceback (most recent call last):
File "/home/hadoop/p.py", line 19, in
df.ai.transform("count of shipments by mode")
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/ai_utils.py", line 39, in transform
return self.spark_ai.transform_df(self.df_instance, desc, cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 376, in transform_df
sql_query = self._get_transform_sql_query(df, desc, cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 354, in _get_transform_sql_query
sql_query = self._get_transform_sql_query_from_agent(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 332, in _get_transform_sql_query_from_agent
llm_result = self._sql_agent.run(
^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 480, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 282, in call
raise e
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 1036, in _call
next_step_output = self._take_next_step(
^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 844, in _take_next_step
raise e
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 833, in _take_next_step
output = self.agent.plan(
^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 457, in plan
return self.output_parser.parse(full_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/mrkl/output_parser.py", line 52, in parse
raise OutputParserException(
langchain.schema.output_parser.OutputParserException: Could not parse LLM output: SELECT l_shipmode, COUNT(*) AS shipment_count FROM spark_ai_temp_view_875e9a GROUP BY l_shipmode
23/08/28 17:18:52 INFO SparkContext: Invoking stop() from shutdown hook
23/08/28 17:18:52 INFO SparkContext: SparkContext is stopping with exitCode 0.
23/08/28 17:18:52 INFO SparkUI: Stopped Spark web UI at http://ip-10-0-10-180.ec2.internal:4040
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: Interrupting monitor thread
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: Shutting down all executors
23/08/28 17:18:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
23/08/28 17:18:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/08/28 17:18:52 INFO MemoryStore: MemoryStore cleared
23/08/28 17:18:52 INFO BlockManager: BlockManager stopped
23/08/28 17:18:52 INFO BlockManagerMaster: BlockManagerMaster stopped
23/08/28 17:18:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/08/28 17:18:52 INFO SparkContext: Successfully stopped SparkContext
23/08/28 17:18:52 INFO ShutdownHookManager: Shutdown hook called

From the error logs, the view and query created look correct, but the agent is not able to parse the LLM output.

Here is my code:

from pyspark_ai import SparkAI
from pyspark.sql import SparkSession
from langchain.chat_models import ChatOpenAI

spark = SparkSession.builder \
    .appName("ReadFromHiveTable") \
    .enableHiveSupport() \
    .getOrCreate()

spark_ai = SparkAI(llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))
spark_ai.activate()

df = spark.sql("select * from data_lake.lineitem")
df.show(5)

df.createOrReplaceTempView("spark_ai_temp_view_cb198e")
spark.sql("SELECT l_shipmode, COUNT(*) AS shipment_count FROM spark_ai_temp_view_cb198e GROUP BY l_shipmode").show()

df.ai.transform("count of shipments by mode")
df.show(5)

Remove the prompt about similar_value if vector search is disabled

We shouldn't mention similar_value if vector search is disabled

SPARK_SQL_PREFIX = """You are an assistant for writing professional Spark SQL queries. 
Given a question, you need to write a Spark SQL query to answer the question. The result is ALWAYS a Spark SQL query.
Always use the tool similar_value to find the correct filter value format, unless it's obvious.
Use the COUNT SQL function when the query asks for total number of some non-countable column.
Use the SUM SQL function to accumulate the total number of countable column values."""
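
A minimal sketch of how the prefix could be assembled conditionally; the flag and function names here are illustrative, not the actual pyspark-ai internals:

SIMILAR_VALUE_HINT = (
    "Always use the tool similar_value to find the correct filter value format, "
    "unless it's obvious."
)


def build_sql_prefix(vector_search_enabled: bool) -> str:
    lines = [
        "You are an assistant for writing professional Spark SQL queries.",
        "Given a question, you need to write a Spark SQL query to answer the question. "
        "The result is ALWAYS a Spark SQL query.",
    ]
    if vector_search_enabled:
        # Only mention the similar_value tool when the vector store is actually available.
        lines.append(SIMILAR_VALUE_HINT)
    lines.append(
        "Use the COUNT SQL function when the query asks for total number of some non-countable column."
    )
    lines.append(
        "Use the SUM SQL function to accumulate the total number of countable column values."
    )
    return "\n".join(lines)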

Bug: "cannot import name 'Row' from 'sqlalchemy'" caused by import of old Langchain package version

  • Bug signature: "cannot import name 'Row' from 'sqlalchemy'" caused by import of old Langchain package version
  • Occurs when importing pyspark-ai==0.1.19 on a machine that already has langchain==0.0.314 installed

Recreate the environment:

  • Prepare a machine with an older version of langchain that is within the dependency range pyspark-ai accepts:

    • pip install langchain==0.0.314

    Recreate the issue:

    • pip install pyspark-ai==0.1.19
    • Run:

      from langchain.chat_models import ChatOpenAI
      from pyspark_ai import SparkAI

      chatOpenAI = ChatOpenAI(model = 'gpt-3.5-turbo')
      spark_ai = SparkAI(llm = chatOpenAI)
      spark_ai.activate()

Produces:
"cannot import name 'Row' from 'sqlalchemy' "

Proposed Fix:

  • Modifying pyspark-ai to require langchain>=0.0.353 fixes the issue
  • Modify pyproject.toml line 28
    to langchain = ">=0.0.353,<0.1.0"

How can I use this to connect to an Azure-deployed private OpenAI model?

from pyspark_ai import SparkAI
import os
#from pyspark_ai.openai import OpenAI
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_BASE_URL"] =""
os.environ["OPENAI_API_TYPE"]= ""
os.environ["OPENAI_API_KEY"] = ""

# Initialize SparkAI with the ChatOpenAI model

llm = ChatOpenAI(model_name='gpt-35-turbo')

spark_ai = SparkAI(llm=llm, verbose=True)
#spark_ai = SparkAI(engine='gpt-3.5-turbo')
#spark_ai = SparkAI()

The above code works, but fails when I use explain or transform.

It says "Must provide an 'engine' or 'deployment_id' parameter to create a " but spark_ai = SparkAI(llm=llm, verbose=True) does not accept engine or deployment_id as an argument.

Can you retrieve the SQL query?

We have been playing around with the tool for a bit, super cool!
But is there a way to get the SQL query to help us debug?

We noticed that sometimes it's not using the partitioned columns, which doesn't scale well.
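
One workaround, grounded in the logs shown in other issues on this page: initializing SparkAI with verbose=True appears to log the temp view and the generated SQL for each transform.

from pyspark_ai import SparkAI

# With verbose=True, lines like "INFO: SQL query for the transform: ..." are printed,
# which show the generated SQL you can copy out for debugging.
spark_ai = SparkAI(verbose=True)
spark_ai.activate()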

Error while using ChatGooglePalm

from pyspark.sql import SparkSession
import pyspark_ai
spark = SparkSession.builder.appName('dummy').getOrCreate()

data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns = ["language","user_cnt"]

rdd = spark.sparkContext.parallelize(data)

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("language", StringType(), nullable=False),
    StructField("user_cnt", StringType(), nullable=False)
])
df=spark.createDataFrame(data,schema=schema)
# llm=palm.get_model("models/chat-bison-001")
from langchain.chat_models import ChatGooglePalm
llm = ChatGooglePalm(temperature=0.9,google_api_key="GOOGLE_API_KEY")
spark_ai=pyspark_ai.SparkAI(llm=llm,spark_session=spark)
spark_ai.activate()
df.ai.transform("language with high user count")

ChatGooglePalmError: ChatResponse must have at least one candidate.

Setup github action

Description

This RFC proposes setting up GitHub Actions to run tests on every commit.

Related Issues

None

Questions

None

Using PySpark code instead of SQL

Right now, every DF transformation creates a new temp view and the transformation is applied as a SQL query on top of the temp view. Unfortunately, this creates a lot of state in the Spark session and makes it harder to trace the actual source of the request.

It would be awesome if one could choose between PySpark and SQL transformations. I've had some good success with the following prompt template.

Given a PySpark dataframe with the name `df` and with the following schema:

id: bigint
dropoff_zip: string
pickup_zip: string
fare_amount: double
toll_amount: double
tip: double
passenger_count: int


Write a Python function called `transform_df` that performs the following transformation and returns a new dataframe:  Show only rows with more than 3 passengers.

The answer MUST contain one function only. Ensure your answer is correct.

Ideally, this could be embedded in PySpark AI like this:

transformed = df.ai.transform("Show only rows with more than 3 passengers")
transformed.explain()

shows the full trace of the operation instead of just the read from the temp view.
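
For the example instruction in the prompt above, the kind of function the LLM would be expected to return might look like this (hypothetical output, not something pyspark-ai generates today):

from pyspark.sql import DataFrame


def transform_df(df: DataFrame) -> DataFrame:
    # Show only rows with more than 3 passengers.
    return df.filter(df.passenger_count > 3)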

Resolve deprecated warnings from langchain

Some of the langchain usages are already deprecated:

./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing GoogleSearchAPIWrapper from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing LLMChain from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing FewShotPromptTemplate from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported.
  warnings.warn(

Loading DataFrame from a saved source

Right now, I don't see a way to load data from a saved source (e.g., a table saved in Databricks or Snowflake). It would be helpful if this could be done so we can use existing tables in conjunction with language-based commands.
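
As a hedged workaround under the current API: because activate() patches every DataFrame, a table read with plain Spark can already be used with the English commands. The table name below is a placeholder.

# Read an existing saved table with plain Spark, then use the English SDK on it.
df = spark_ai._spark.table("my_catalog.my_schema.my_table")  # placeholder table name
df.ai.transform("top 10 rows by revenue").show()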

Table name as parameter for function create_df()

Hi,

could you add a parameter "tablename" to the function create_df()? In some cases a table with an automatically detected name cannot be created, but the SQL works with a different table name:

Spark AI call:

Debug output:

INFO: Parsing URL: https://www.procontra-online.de/sach-privat/artikel/softfair-ermittelt-die-besten-wohngebaeude-tarife

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW wohngebäude_tarife AS 
SELECT 'Alte Leipziger' AS Versicherer, 'comfort mit Baustein Haus- und Wohnungsschutzbrief' AS Tarifkombination, 5275 AS Punktzahl
UNION ALL
SELECT 'Janitos' AS Versicherer, 'Best Selection mit Bausteinen Allgefahrendeckung, Hausschutzbrief und Multi-Garantie' AS Tarifkombination, 5275 AS Punktzahl
UNION ALL
SELECT 'Dema' AS Versicherer, 'Immo Protect Top mit Baustein Unbenannte Gefahren/ Marktgarantie' AS Tarifkombination, 5225 AS Punktzahl
UNION ALL
SELECT 'Domcura' AS Versicherer, 'Top mit Baustein Unbenannte Gefahren/ Marktgarantie' AS Tarifkombination, 5225 AS Punktzahl
UNION ALL
SELECT 'Adcuri' AS Versicherer, 'Premium mit Bausteinen Elektronik & Haustechnik, Unbenannte Gefahren' AS Tarifkombination, 5200 AS Punktzahl
UNION ALL
SELECT 'Manufaktur Augsburg' AS Versicherer, 'Premium Plus mit Bausteinen Smart Home, Unbenannte Gefahren/Marktgarantie' AS Tarifkombination, 5125 AS Punktzahl
UNION ALL
SELECT 'Axa' AS Versicherer, 'Komfort mit Bausteinen Optimum, Premium' AS Tarifkombination, 5075 AS Punktzahl
UNION ALL
SELECT 'Grundeigentümer Versicherung' AS Versicherer, 'ProtectPremium mit Baustein Soforthilfe' AS Tarifkombination, 5030 AS Punktzahl
UNION ALL
SELECT 'Rhion' AS Versicherer, 'Premium mit Baustein Best-Leistungs-Garantie' AS Tarifkombination, 4975 AS Punktzahl
UNION ALL
SELECT 'Konzept & Marketing' AS Versicherer, 'Allsafe Domo' AS Tarifkombination, 4830 AS Punktzahl

INFO: Storing data into temp view: wohngebäude_tarife

Spark Parse Exception:

[PARSE_SYNTAX_ERROR] Syntax error at or near 'ä'.(line 1, pos 35)

== SQL ==
CREATE OR REPLACE TEMP VIEW wohngebäude_tarife AS 
...
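
A hypothetical sketch of the requested parameter; tablename does not exist on create_df today, and the ASCII view name is only meant to avoid the 'ä' parse error above:

df = spark_ai.create_df(
    "https://www.procontra-online.de/sach-privat/artikel/softfair-ermittelt-die-besten-wohngebaeude-tarife",
    tablename="wohngebaeude_tarife",  # hypothetical parameter: force an ASCII temp view name
)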

Uses GPT-4 by default, which may be unavailable

It's just a minor issue, but as of today, GPT-4 is unavailable if you are not part of the limited beta. The workaround is passing the LLM to SparkAI:

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
sa = SparkAI(llm=llm)
sa.activate()

I wonder if it makes sense to pass just the name of the model to SparkAI, or make gpt-3.5 the default as it's readily available.

Error in executing spark_ai.activate().. please help


ModuleNotFoundError Traceback (most recent call last)
Cell In[14], line 2
1 # Activate partial functions for Spark DataFrame
----> 2 spark_ai.activate()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/pyspark_ai/pyspark_ai.py:428, in SparkAI.activate(self)
426 DataFrame.ai = AIUtils(self)
427 # Patch the Spark Connect DataFrame as well.
--> 428 from pyspark.sql.connect.dataframe import DataFrame as CDataFrame
429 CDataFrame.ai = AIUtils(self)

ModuleNotFoundError: No module named 'pyspark.sql.connect'

Fix project dependencies

Some project dependencies should be flagged as dev-dependencies. We want the fewest dependencies possible when this project is pip installed.

A quinn user recently informed me that dev-dependencies are being deprecated and poetry is moving towards group dependencies.

This should work better for this project too. We should think about what dependency groups would be ideal for this project.

Add example notebook with Code Llama

Add an example notebook with Code Llama. For instance, we can try Code Llama on the following functions:

  • df.ai.plot()
  • df.ai.verify()
  • Create python UDF via @spark_ai.udf annotation

Project maintenance and further plans of development?

Hello!
The project is very cool, but it looks like it has been facing a lack of maintenance for the last four months. It looks like the latest substantive commit, other than version bumps, was in November. Also, there are no new feature issues. What are the authors' plans for the project and further development?

I have motivation to contribute. As a start I can try to at least update dependencies, like:

  1. Bump Python to ~3.9, because 3.8 is officially legacy and its end of support is October 2024
  2. Bump langchain to ~0.1 (the latest is 0.1.9, but the project uses 0.0.354)
  3. Bump openai to ~1.0 (the latest version of openai is 1.13, but the project uses 0.27.10)
  4. Try to update the overall code to make it work with the latest langchain and openai

Temp View Generation does not properly work with Spark Connect

Since Spark Connect will lazily evaluate the generated code, using the same name for the view for every invocation of ai.transform() will not work.

The current behavior works because the execution plan is eagerly evaluated when spark.sql() is called, but this will not work properly when the analysis of the plan is deferred.

Repro

> env PYSPARK_DRIVER_PYTHON=ipython poetry run pyspark --remote local --packages org.apache.spark:spark-connect_2.12:3.4.1
In [7]: from pyspark_ai import SparkAI
   ...: from langchain.chat_models import ChatOpenAI
   ...:
   ...: llm = ChatOpenAI(model="gpt-3.5-turbo")
   ...: ai = SparkAI(llm=llm, spark_session=spark)
   ...: ai.activate()
   ...:
   ...: df = spark.range(10)
   ...: df2 = spark.range(100)
   ...:
   ...: r = df.ai.transform("count of rows")
   ...: assert(r.collect()[0][0] == 10)
   ...:
   ...: r2 = df2.ai.transform("count of rows")
   ...:
   ...: # Attention
   ...: assert(r2.collect()[0][0] == 100)
   ...: assert(r.collect()[0][0] == 10)
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("temp_view_for_transform")

2023-07-23 11:40:45,277 48026 INFO execute_command Execute command for command create_dataframe_view { input { common { plan_id: 20 } range { start: 0 end: 10 step: 1 } } name: "temp_view_for_transform" replace: true }
2023-07-23 11:40:45,277 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:45,277 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:45,287 48026 INFO schema Schema for plan: root { common { plan_id: 20 } range { start: 0 end: 10 step: 1 } }
INFO: SQL query for the transform:
SELECT COUNT(*) FROM temp_view_for_transform

2023-07-23 11:40:46,064 48026 INFO execute_command Execute command for command sql_command { sql: "SELECT COUNT(*) FROM temp_view_for_transform" }
2023-07-23 11:40:46,064 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,064 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,074 48026 DEBUG _execute_and_fetch_as_iterator Received the SQL command result.
2023-07-23 11:40:46,079 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,080 48026 INFO to_table Executing plan root { common { plan_id: 24 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,080 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,080 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,084 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,108 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,109 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("temp_view_for_transform")

2023-07-23 11:40:46,109 48026 INFO execute_command Execute command for command create_dataframe_view { input { common { plan_id: 21 } range { start: 0 end: 100 step: 1 } } name: "temp_view_for_transform" replace: true }
2023-07-23 11:40:46,110 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,110 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,116 48026 INFO schema Schema for plan: root { common { plan_id: 21 } range { start: 0 end: 100 step: 1 } }
INFO: SQL query for the transform:
SELECT COUNT(*) FROM temp_view_for_transform

2023-07-23 11:40:46,119 48026 INFO execute_command Execute command for command sql_command { sql: "SELECT COUNT(*) FROM temp_view_for_transform" }
2023-07-23 11:40:46,119 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,119 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,123 48026 DEBUG _execute_and_fetch_as_iterator Received the SQL command result.
2023-07-23 11:40:46,128 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,128 48026 INFO to_table Executing plan root { common { plan_id: 27 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,128 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,128 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,132 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,151 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,152 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,152 48026 INFO to_table Executing plan root { common { plan_id: 24 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,152 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,152 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,156 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,174 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,174 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[7], line 18
     16 # Attention
     17 assert(r2.collect()[0][0] == 100)
---> 18 assert(r.collect()[0][0] == 10)

AssertionError:

In [8]:
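
A minimal sketch of one possible fix, independent of the actual pyspark-ai internals: give every transform its own temp view name (as the spark_ai_temp_view_<hash> names in newer logs on this page do), so lazily analyzed Spark Connect plans don't all resolve to the last registered view.

import uuid

from pyspark.sql import DataFrame, SparkSession


def transform_with_unique_view(spark: SparkSession, df: DataFrame, sql_template: str) -> DataFrame:
    # Register the input under a fresh, per-invocation view name.
    view_name = f"spark_ai_temp_view_{uuid.uuid4().hex[:6]}"
    df.createOrReplaceTempView(view_name)
    # The SQL template references the view via a {view} placeholder.
    return spark.sql(sql_template.format(view=view_name))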

Dataframe from webpage doesn't load entire table

When trying to load a dataframe from a webpage, e.g.:

df = spark_ai.create_df('https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita')

It often doesn't load the entire table into the dataframe (the table on the webpage has 195 rows):

INFO: Parsing URL: https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_034e03 AS SELECT * FROM VALUES
('Gibraltar', 1444, 48641, 2022),
('Guernsey', 1365, 86000, 2014),
('San Marino', 1300, 44200, 2022),
('Liechtenstein', 1193, 45800, 2022),
('Andorra', 1050, 81000, 2021),
('Monaco', 910, 35500, 2022),
('United States', 908, 305000000, 2023),
('New Zealand', 884, 4529700, 2022),
('Canada', 790, 30754600, 2022),
('Finland', 790, 4368796, 2022)
AS v1(country_or_region, motor_vehicles_per_1000_people, total, year)

INFO: Storing data into temp view: spark_ai_temp_view_034e03

If I retry several times, passing different column names (or subsets) so that it doesn't just use the cache, only sometimes does it return all rows:

df = spark_ai.create_df('https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita', ['country', 'vehicles'])
INFO: Parsing URL: https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_c345c7 AS 
SELECT * FROM VALUES
('Gibraltar', 1444),
('Guernsey', 1365),
('San Marino', 1300),
('Liechtenstein', 1193),
('Andorra', 1050),
('Monaco', 910),
('United States', 908),
('New Zealand', 884),
('Canada', 790),
('Finland', 790),
('Malta', 786),
('Cyprus', 785),
('Luxembourg', 784),
('Australia', 782),
('Guam', 777),
('Italy', 755),
('Estonia', 715),
('Iceland', 720),
('Poland', 687),
('Jersey', 674),
('France', 668),
('Puerto Rico', 666),
('Japan', 661),
('Slovenia', 660),
('Bahamas', 650),
('Czech Republic', 648),
('Portugal', 639),
('Wales', 637),
('Norway', 635),
('Germany', 628),
('Spain', 627),
('Brunei', 614),
('Slovakia', 611),
('Greece', 606),
('Switzerland', 604),
('United Kingdom', 600),
('Qatar', 591),
('Belgium', 590),
('Netherlands', 588),
('Austria', 572),
('Antigua and Barbuda', 561),
('Scotland', 557),
('Kuwait', 556),
('Dominica', 550),
('Sweden', 545),
('Malaysia', 542),
('Denmark', 540),
('Ireland', 535),
('Lithuania', 507),
('South Korea', 485),
('Bulgaria', 482),
('Croatia', 479),
('Saint Kitts and Nevis', 479),
('Syria', 472),
('Hungary', 463),
('Nauru', 455),
('Suriname', 446),
('Dominican Republic', 442),
('Romania', 441),
('Bahrain', 430),
('Barbados', 417),
('Chile', 416),
('Argentina', 402),
('Lesotho', 400),
('Russia', 395),
('Latvia', 394),
('Mexico', 391),
('Israel', 390),
('Brazil', 386),
('Serbia', 389),
('Georgia', 378),
('Moldova', 367),
('Montenegro', 367),
('Taiwan', 365),
('United Arab Emirates', 354),
('Uruguay', 348),
('Bosnia and Herzegovina', 345),
('Belarus', 343),
('Oman', 335),
('Trinidad and Tobago', 329),
('China', 296),
('Colombia', 296),
('Lebanon', 295),
('Seychelles', 295),
('Costa Rica', 287),
('Guyana', 285),
('Thailand', 280),
('Grenada', 268),
('Botswana', 260),
('Turkey', 254),
('Ukraine', 245),
('Maldives', 241),
('Albania', 238),
('Guatemala', 237),
('Kazakhstan', 226),
('Honduras', 222),
('Belize', 222),
('Panama', 218),
('Mongolia', 217),
('Saint Lucia', 208),
('North Macedonia', 205),
('Saint Vincent and the Grenadines', 204),
('Kyrgyzstan', 201),
('Iran', 183),
('Macau', 180),
('Armenia', 177),
('Tunisia', 177),
('South Africa', 176),
('Bolivia', 174),
('Jordan', 169),
('Tonga', 162),
('Namibia', 161),
('Sri Lanka', 157),
('São Tomé and Príncipe', 157),
('Saudi Arabia', 156),
('Bhutan', 150),
('Singapore', 149),
('Algeria', 149),
('Azerbaijan', 146),
('Fiji', 145),
('Nicaragua', 144),
('Ecuador', 143),
('Venezuela', 140),
('Myanmar', 138),
('Cape Verde', 133),
('Samoa', 130),
('Philippines', 120),
('Peru', 116),
('Nepal', 113),
('Iraq', 111),
('Morocco', 111),
('Hong Kong', 109),
('Turkmenistan', 102),
('Greenland', 100),
('Federated States of Micronesia', 96),
('Kosovo', 94),
('Uzbekistan', 87),
('Indonesia', 82),
('Jamaica', 81),
('Gambia', 80),
('Chad', 77),
('Zimbabwe', 76),
('Vanuatu', 71),
('Egypt', 70),
('Kenya', 69),
('El Salvador', 68),
('Cuba', 67),
('Senegal', 65),
('Nigeria', 61),
('Afghanistan', 61),
('Ivory Coast', 60),
('India', 59),
('Palestine', 58),
('Yemen', 52),
('Tajikistan', 51),
('Madagascar', 48),
('Ghana', 46),
('Comoros', 44),
('Sierra Leone', 40),
('Djibouti', 40),
('Angola', 36),
('Vietnam', 53),
('Guinea-Bissau', 35),
('Kiribati', 34),
('Togo', 33),
('Democratic Republic of the Congo', 32),
('Pakistan', 29),
('Zambia', 29),
('Benin', 27),
('Bangladesh', 27),
('Cambodia', 27),
('Mozambique', 26),
('Gabon', 26),
('Burkina Faso', 22),
('Liberia', 22)
AS v1(country, vehicles)

INFO: Storing data into temp view: spark_ai_temp_view_c345c7

Facing when trying to import SparkAI from pyspark

Facing TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'.
Code used:
from pyspark_ai import SparkAI

spark_ai=SparkAI(verbose=True)
spark_ai.activate()

Also, please clarify whether it is pyspark-ai or pyspark_ai that should be imported (this needs changes to the GitHub README).
