pyspark-ai / pyspark-ai

English SDK for Apache Spark

Home Page: https://pyspark.ai/

License: Apache License 2.0

Python 98.75% Makefile 1.02% Shell 0.23%

pyspark-ai's Introduction

English SDK for Apache Spark


Introduction

The English SDK for Apache Spark is an extremely simple yet powerful tool. It takes English instructions and compiles them into PySpark objects like DataFrames. Its goal is to make Spark more user-friendly and accessible, allowing you to focus your efforts on extracting insights from your data.

For a more comprehensive introduction and background to our project, we have the following resources:

  • Blog Post: A detailed walkthrough of our project.
  • Demo Video: 2023 Data + AI Summit announcement video with demo.
  • Breakout Session: A deep dive into the story behind the English SDK, its features, and future work at the Data + AI Summit 2023.

Installation

pyspark-ai can be installed via pip from PyPI:

pip install pyspark-ai

pyspark-ai can also be installed with optional dependencies to enable certain functionality. For example, to install pyspark-ai with the optional dependencies to plot data from a DataFrame:

pip install "pyspark-ai[plot]"

To install all optional dependencies:

pip install "pyspark-ai[all]"

For a full list of optional dependencies, see Installation and Setup.

Configuring OpenAI LLMs

As of July 2023, we have found that GPT-4 works best with the English SDK. This model is readily accessible to all developers through the OpenAI API.

To use OpenAI's large language models (LLMs), set your OpenAI secret key as the OPENAI_API_KEY environment variable. This key can be found in your OpenAI account. Example:

export OPENAI_API_KEY='sk-...'

By default, SparkAI instances will use the GPT-4 model. However, you're encouraged to experiment with creating and implementing other LLMs, which can be passed during the initialization of SparkAI instances for various use cases.

Usage

Initialization

from pyspark_ai import SparkAI

spark_ai = SparkAI()
spark_ai.activate()  # activate partial functions for Spark DataFrame

You can also pass other LLMs to construct the SparkAI instance. For example, to use Azure OpenAI by following this guide:

from langchain.chat_models import AzureChatOpenAI
from pyspark_ai import SparkAI

llm = AzureChatOpenAI(
    deployment_name=...,
    model_name=...
)
spark_ai = SparkAI(llm=llm)
spark_ai.activate()  # activate partial functions for Spark DataFrame

Using the Azure OpenAI service can provide better data privacy and security, as per Microsoft's Data Privacy page.

DataFrame Transformation

Given the following DataFrame df:

df = spark_ai._spark.createDataFrame(
    [
        ("Normal", "Cellphone", 6000),
        ("Normal", "Tablet", 1500),
        ("Mini", "Tablet", 5500),
        ("Mini", "Cellphone", 5000),
        ("Foldable", "Cellphone", 6500),
        ("Foldable", "Tablet", 2500),
        ("Pro", "Cellphone", 3000),
        ("Pro", "Tablet", 4000),
        ("Pro Max", "Cellphone", 4500)
    ],
    ["product", "category", "revenue"]
)

You can write English to perform transformations. For example:

df.ai.transform("What are the best-selling and the second best-selling products in every category?").show()
product   category   revenue
Foldable  Cellphone  6500
Normal    Cellphone  6000
Mini      Tablet     5500
Pro       Tablet     4000
df.ai.transform("Pivot the data by product and the revenue for each product").show()
Category   Normal  Mini  Foldable  Pro   Pro Max
Cellphone  6000    5000  6500      3000  4500
Tablet     1500    5500  2500      4000  null

For a detailed walkthrough of the transformations, please refer to our transform_dataframe.ipynb notebook.

Transform Accuracy Improvement: Vector Similarity Search

To improve the accuracy of transform query generation, you can also optionally enable vector similarity search. This is done by specifying a vector_store_dir location for the vector files when you initialize SparkAI. For example:

from pyspark_ai import SparkAI

spark_ai = SparkAI(vector_store_dir="vector_store/") # vector files will be stored in the dir "vector_store"
spark_ai.activate() 

Now when you call df.ai.transform as before, the agent will use word embeddings to generate accurate query values.
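
For instance, with the vector store enabled, an instruction that refers to a column value loosely can still resolve to the exact stored value. A hedged sketch reusing the product/category/revenue df defined above (the exact matching behavior depends on the LLM and the embeddings):

# Assumes spark_ai was created with vector_store_dir as above and df is the
# DataFrame from the transformation example. With similarity search enabled,
# a loosely phrased value like "foldable cell phones" can be matched to the
# stored values "Foldable" / "Cellphone".
df.ai.transform("total revenue of foldable cell phones").show()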

For a detailed walkthrough, please refer to our vector_similarity_search.ipynb.

Plot

Let's create a DataFrame for car sales in the U.S.

# auto sales data from https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand
data = [('Toyota', 1849751, -9), ('Ford', 1767439, -2), ('Chevrolet', 1502389, 6),
        ('Honda', 881201, -33), ('Hyundai', 724265, -2), ('Kia', 693549, -1),
        ('Jeep', 684612, -12), ('Nissan', 682731, -25), ('Subaru', 556581, -5),
        ('Ram Trucks', 545194, -16), ('GMC', 517649, 7), ('Mercedes-Benz', 350949, 7),
        ('BMW', 332388, -1), ('Volkswagen', 301069, -20), ('Mazda', 294908, -11),
        ('Lexus', 258704, -15), ('Dodge', 190793, -12), ('Audi', 186875, -5),
        ('Cadillac', 134726, 14), ('Chrysler', 112713, -2), ('Buick', 103519, -42),
        ('Acura', 102306, -35), ('Volvo', 102038, -16), ('Mitsubishi', 102037, -16),
        ('Lincoln', 83486, -4), ('Porsche', 70065, 0), ('Genesis', 56410, 14),
        ('INFINITI', 46619, -20), ('MINI', 29504, -1), ('Alfa Romeo', 12845, -30),
        ('Maserati', 6413, -10), ('Bentley', 3975, 0), ('Lamborghini', 3134, 3),
        ('Fiat', 915, -61), ('McLaren', 840, -35), ('Rolls-Royce', 460, 7)]

auto_df = spark_ai._spark.createDataFrame(data, ["Brand", "US_Sales_2022", "Sales_Change_Percentage"])

We can visualize the data with the plot API:

# call plot() with no args for LLM-generated plot
auto_df.ai.plot()

2022 USA national auto sales by brand

To plot with an instruction:

auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")

2022 USA national auto sales market share by brand

Please refer to example.ipynb for more APIs and detailed usage examples.

Contributing

We're delighted that you're considering contributing to the English SDK for Apache Spark project! Whether you're fixing a bug or proposing a new feature, your contribution is highly appreciated.

Before you start, please take a moment to read our Contribution Guide. This guide provides an overview of how you can contribute to our project. We're currently in the early stages of development, and we're working on introducing more comprehensive test cases and GitHub Actions jobs for enhanced testing of each pull request.

If you have any questions or need assistance, feel free to open a new issue in the GitHub repository.

Thank you for helping us improve the English SDK for Apache Spark. We're excited to see your contributions!

License

Licensed under the Apache License 2.0.

pyspark-ai's People

Contributors

allisonwang-db, asl3, bjornjorgensen, dennyglee, gatorsmile, gengliangwang, grundprinzip, laurencewalton, mengxr, pkandarpa-cs, pohlposition, semyonsinchenko, sharshjot, vinodhthiagarajan1309, vjr, xinrong-meng


pyspark-ai's Issues

VertexAI - Error: NameError("name 'spark' is not defined")

Hi,

I'm having an issue using Vertex AI as LLM.

This is the log:

import plotly.express as px

df = spark.sql("SELECT Nationality, count(*) as cnt FROM football_stats GROUP BY Nationality ORDER BY cnt DESC LIMIT 10")
df_pd = df.toPandas()
fig = px.pie(df_pd, values="cnt", names="Nationality", title="Top 10 Nationalities")
fig.show()

INFO:spark_ai:

import plotly.express as px

df = spark.sql("SELECT Nationality, count(*) as cnt FROM football_stats GROUP BY Nationality ORDER BY cnt DESC LIMIT 10")
df_pd = df.toPandas()
fig = px.pie(df_pd, values="cnt", names="Nationality", title="Top 10 Nationalities")
fig.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyspark_ai/pyspark_ai.py](https://localhost:8080/#) in plot_df(self, df, desc, cache)
    412         try:
--> 413             exec(compile(code, "plot_df-CodeGen", "exec"))
    414         except Exception as e:

3 frames
NameError: name 'spark' is not defined

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/pyspark_ai/pyspark_ai.py](https://localhost:8080/#) in plot_df(self, df, desc, cache)
    413             exec(compile(code, "plot_df-CodeGen", "exec"))
    414         except Exception as e:
--> 415             raise Exception("Could not evaluate Python code", e)
    416 
    417     def verify_df(self, df: DataFrame, desc: str, cache: bool = True) -> None:

Exception: ('Could not evaluate Python code', NameError("name 'spark' is not defined"))

and this is the configuration I have:

# Create the Spark session
spark = SparkSession.builder \
  .appName('PySparkAI_with_BQ')\
  .config('spark.jars', "/content/spark-3.3-bigquery-0.32.0.jar") \
  .getOrCreate()

# Init Google AI platform
aiplatform.init(project=project_id)

llm = VertexAI(temperature=0.9)

# Activate pyspark_ai
spark_ai = SparkAI(llm=llm, spark_session=spark, verbose=True)
spark_ai.activate()  # active partial functions for Spark DataFrame

#Load data from BQ
bq_source = spark.read.format('bigquery') \
    .option('project','project-1') \
    .option('parentProject','project-1') \
    .option('table','dataset_name.football_stats') \
    .load()
 
auto_graph = bq_source.ai.plot("Create a pie chart with the top 10 nationalities")
 

It's curious that in every new run the tool is following a different approach:

import plotly.express as px

df = spark.read.csv('football_stats.csv', header=True, inferSchema=True)

Why is it trying to use a CSV?

Any advice is appreciated.
Thanks!

[Proposal] Issues template

I can create templates for different kinds of issues:

  1. Feature-request or proposal
  2. Databricks-related bug (in such a case we need to ask user about DBR version too)
  3. Bug (in this case we need information about spark version, at least)
  4. Question

For bugs we need to collect a full stack-trace from user too.

Exploring PySpark AI: ValidationError: 1 validation error for PythonExecutor df instance of DataFrame expected (type=type_error.arbitrary_type; expected_arbitrary_type=DataFrame)

Below is the code I am running in Databricks and it's throwing the error below.
CODE:
data = [('Toyota', 1849751, -9), ('Ford', 1767439, -2), ('Chevrolet', 1502389, 6),
('Honda', 881201, -33), ('Hyundai', 724265, -2), ('Kia', 693549, -1),
('Jeep', 684612, -12), ('Nissan', 682731, -25), ('Subaru', 556581, -5),
('Ram Trucks', 545194, -16), ('GMC', 517649, 7), ('Mercedes-Benz', 350949, 7),
('BMW', 332388, -1), ('Volkswagen', 301069, -20), ('Mazda', 294908, -11),
('Lexus', 258704, -15), ('Dodge', 190793, -12), ('Audi', 186875, -5),
('Cadillac', 134726, 14), ('Chrysler', 112713, -2), ('Buick', 103519, -42),
('Acura', 102306, -35), ('Volvo', 102038, -16), ('Mitsubishi', 102037, -16),
('Lincoln', 83486, -4), ('Porsche', 70065, 0), ('Genesis', 56410, 14),
('INFINITI', 46619, -20), ('MINI', 29504, -1), ('Alfa Romeo', 12845, -30),
('Maserati', 6413, -10), ('Bentley', 3975, 0), ('Lamborghini', 3134, 3),
('Fiat', 915, -61), ('McLaren', 840, -35), ('Rolls-Royce', 460, 7)]

auto_df = spark_ai._spark.createDataFrame(data, ["Brand", "US_Sales_2022", "Sales_Change_Percentage"])
auto_df.ai.plot()

ERROR:

ValidationError: 1 validation error for PythonExecutor
df
instance of DataFrame expected (type=type_error.arbitrary_type; expected_arbitrary_type=DataFrame)

Upgrade langchain

pip install jupyter_ai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pyspark-ai 0.1.12 requires langchain<0.0.202,>=0.0.201, but you have langchain 0.0.220 which is incompatible.

Rate Limit Reached Error

When using the 0.1.9 version of pyspark-ai, I am consistently hitting the rate limit when using a paid version of GPT 3.5. This happened the first time I attempted to use pyspark-ai with no prior utilization of ChatGPT for the day.

Image of error: (screenshot)

Please let me know if any additional information is needed to troubleshoot the issue.

Add an option to pass the model name as string

The current SparkAI init method receives a BaseLanguageModel object and then creates a ChatOpenAI as the default.

I suggest adding an additional string-type parameter to the init method (say, model_name) and then creating a ChatOpenAI object with that model.

This will make it more streamlined to try out different OpenAI models and compare them.
It also removes the need for the additional import from langchain.

I can create a PR if that sounds helpful.
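
A minimal sketch of what the proposal could look like from the caller's side, assuming a small helper outside the current API (model_name is not an existing SparkAI parameter today):

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI


def spark_ai_from_model_name(model_name: str = "gpt-4", **kwargs) -> SparkAI:
    # Wrap the model name in a ChatOpenAI instance, mirroring what SparkAI does for its default LLM.
    return SparkAI(llm=ChatOpenAI(model_name=model_name, temperature=0), **kwargs)


spark_ai = spark_ai_from_model_name("gpt-3.5-turbo")
spark_ai.activate()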

Create a join method

I'm thinking about how to use this in my business to bring people into databricks instead of using low code platforms that are more difficult to support.

One limitation I see today is that you can only work with a single dataframe. There are a few enhancements that I think help:

  1. A table method. Similar to ingesting data from the web, but search through available schemas/tables to find the one that best fits the users query.
  2. A join method. If you have two spark ai dataframes it would be great to just do df.ai.join(df2) and have the tool figure out the best way to join. You could add the option for English explanation like df.ai.join(df2, "everything from df") for a left inner join.
  3. A code method. This should just provide the code that a user can copy/paste to achieve whatever it is the ai method did.

Obviously these are trivial to do in spark for the average spark user, but I'm thinking about how this could allow nontechnical users to fully interact with a whole catalog of data in English. Then the code method would allow them to make the notebook more deterministic if they wanted it to become a job or something.

Signed cache

It is convenient for a developer to ship code in English and the cache data together to optimize speed, save cost, and more importantly reproduce the result. However, we need a way for end users to trust the cache data, which contains "compiled" code and could be modified by others between the user and the trusted developer. One way to do this is to let the developer sign the cache entries with a private key and share the public key with end users to verify the cached content.
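
A minimal sketch of the signing idea, using Ed25519 keys from the cryptography package; none of these helpers exist in pyspark-ai today and the cache entry format shown is only illustrative:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def sign_cache_entry(private_key: Ed25519PrivateKey, english: str, compiled_code: str) -> bytes:
    # The developer signs each (English instruction, compiled code) pair before shipping the cache.
    return private_key.sign(f"{english}\n{compiled_code}".encode("utf-8"))


def verify_cache_entry(
    public_key: Ed25519PublicKey, english: str, compiled_code: str, signature: bytes
) -> bool:
    # End users verify each entry with the developer's public key before trusting it.
    try:
        public_key.verify(signature, f"{english}\n{compiled_code}".encode("utf-8"))
        return True
    except InvalidSignature:
        return False


# Example: developer side generates the key pair and signs; user side verifies.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()
sig = sign_cache_entry(private_key, "count of rows", "SELECT COUNT(*) FROM df")
assert verify_cache_entry(public_key, "count of rows", "SELECT COUNT(*) FROM df", sig)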

Demo code has error in DBR 13.2

spark = spark_ai._spark

# create a dataframe containing different bike product_category and product_count
df = spark.createDataFrame(
    [
        ("children bike", 20),
        ("comfort bike", 15),
        ("mountain bike", 10),
        ("electric bike", 5),
        ("road bike", 3),
        ("cruisers bike", 8)
    ],
    ["product_category", "product_count"]
)
df.ai.transform("pivot using product_category for product_count").show()

It got this error:
INFO:spark_ai:Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_fb06c3")

Entering new AgentExecutor chain...
OutputParserException: Could not parse LLM output: SELECT * FROM spark_ai_temp_view_fb06c3 PIVOT (SUM(product_count) FOR product_category IN ('category1', 'category2', 'category3', ...))

Could not parse LLM output when using GPT3.5-turbo

When using GPT3.5-turbo with pyspark-ai, I'm getting the error below:

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_875e9a")

Entering new AgentExecutor chain...
Traceback (most recent call last):
File "/home/hadoop/p.py", line 19, in
df.ai.transform("count of shipments by mode")
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/ai_utils.py", line 39, in transform
return self.spark_ai.transform_df(self.df_instance, desc, cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 376, in transform_df
sql_query = self._get_transform_sql_query(df, desc, cache)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 354, in _get_transform_sql_query
sql_query = self._get_transform_sql_query_from_agent(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/pyspark_ai/pyspark_ai.py", line 332, in _get_transform_sql_query_from_agent
llm_result = self._sql_agent.run(
^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 480, in run
return self(kwargs, callbacks=callbacks, tags=tags, metadata=metadata)[
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 282, in call
raise e
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/chains/base.py", line 276, in call
self._call(inputs, run_manager=run_manager)
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 1036, in _call
next_step_output = self._take_next_step(
^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 844, in _take_next_step
raise e
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 833, in _take_next_step
output = self.agent.plan(
^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/agent.py", line 457, in plan
return self.output_parser.parse(full_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hadoop/.local/lib/python3.11/site-packages/langchain/agents/mrkl/output_parser.py", line 52, in parse
raise OutputParserException(
langchain.schema.output_parser.OutputParserException: Could not parse LLM output: SELECT l_shipmode, COUNT(*) AS shipment_count FROM spark_ai_temp_view_875e9a GROUP BY l_shipmode
23/08/28 17:18:52 INFO SparkContext: Invoking stop() from shutdown hook
23/08/28 17:18:52 INFO SparkContext: SparkContext is stopping with exitCode 0.
23/08/28 17:18:52 INFO SparkUI: Stopped Spark web UI at http://ip-10-0-10-180.ec2.internal:4040
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: Interrupting monitor thread
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: Shutting down all executors
23/08/28 17:18:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
23/08/28 17:18:52 INFO YarnClientSchedulerBackend: YARN client scheduler backend Stopped
23/08/28 17:18:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/08/28 17:18:52 INFO MemoryStore: MemoryStore cleared
23/08/28 17:18:52 INFO BlockManager: BlockManager stopped
23/08/28 17:18:52 INFO BlockManagerMaster: BlockManagerMaster stopped
23/08/28 17:18:52 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/08/28 17:18:52 INFO SparkContext: Successfully stopped SparkContext
23/08/28 17:18:52 INFO ShutdownHookManager: Shutdown hook called

From the error logs, the view and query created look correct, but the agent is not able to parse the LLM output.

Here is my code:

from pyspark_ai import SparkAI
from pyspark.sql import SparkSession
from langchain.chat_models import ChatOpenAI

spark = SparkSession.builder \
    .appName("ReadFromHiveTable") \
    .enableHiveSupport() \
    .getOrCreate()

spark_ai = SparkAI(llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))
spark_ai.activate()

df = spark.sql("select * from data_lake.lineitem")
df.show(5)

df.createOrReplaceTempView("spark_ai_temp_view_cb198e")
spark.sql("SELECT l_shipmode, COUNT(*) AS shipment_count FROM spark_ai_temp_view_cb198e GROUP BY l_shipmode").show()

df.ai.transform("count of shipments by mode")
df.show(5)

Remove the prompt about similar_value if vector search is disabled

We shouldn't mention similar_value if vector search is disabled

SPARK_SQL_PREFIX = """You are an assistant for writing professional Spark SQL queries. 
Given a question, you need to write a Spark SQL query to answer the question. The result is ALWAYS a Spark SQL query.
Always use the tool similar_value to find the correct filter value format, unless it's obvious.
Use the COUNT SQL function when the query asks for total number of some non-countable column.
Use the SUM SQL function to accumulate the total number of countable column values."""
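
A minimal sketch of how the prefix could be assembled conditionally; the flag and function names here are illustrative, not the actual pyspark-ai internals:

SIMILAR_VALUE_HINT = (
    "Always use the tool similar_value to find the correct filter value format, "
    "unless it's obvious."
)


def build_sql_prefix(vector_search_enabled: bool) -> str:
    lines = [
        "You are an assistant for writing professional Spark SQL queries.",
        "Given a question, you need to write a Spark SQL query to answer the question. "
        "The result is ALWAYS a Spark SQL query.",
    ]
    if vector_search_enabled:
        # Only mention the similar_value tool when the vector store is actually available.
        lines.append(SIMILAR_VALUE_HINT)
    lines.append(
        "Use the COUNT SQL function when the query asks for total number of some non-countable column."
    )
    lines.append(
        "Use the SUM SQL function to accumulate the total number of countable column values."
    )
    return "\n".join(lines)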

Bug: "cannot import name 'Row' from 'sqlalchemy'" caused by import of old Langchain package version

  • Bug signature: "cannot import name 'Row' from 'sqlalchemy'" caused by import of old Langchain package version
  • Occurs when importing pyspark-ai==0.1.19 on a machine that already has langchain==0.0.314 installed

Recreate the environment:

  • Prepare a machine with an older version of langchain that is within the dependency range pyspark-ai accepts:

    • pip install langchain==0.0.314

    Recreate the issue:

    • pip install pyspark-ai==0.1.19
    • Run:

      from langchain.chat_models import ChatOpenAI
      from pyspark_ai import SparkAI

      chatOpenAI = ChatOpenAI(model = 'gpt-3.5-turbo')
      spark_ai = SparkAI(llm = chatOpenAI)
      spark_ai.activate()

Produces:
"cannot import name 'Row' from 'sqlalchemy' "

Proposed Fix:

  • Modifying pyspark-ai to require langchain>=0.0.353 fixes the issue
  • Modify pyproject.toml line 28
    to langchain = ">=0.0.353,<0.1.0"

How can I use this to connect to an Azure-deployed private OpenAI model?

from pyspark_ai import SparkAI
import os
#from pyspark_ai.openai import OpenAI
from langchain.chat_models import ChatOpenAI

os.environ["OPENAI_API_BASE_URL"] =""
os.environ["OPENAI_API_TYPE"]= ""
os.environ["OPENAI_API_KEY"] = ""

# Initialize SparkAI with the ChatOpenAI model

llm = ChatOpenAI(model_name='gpt-35-turbo')

spark_ai = SparkAI(llm=llm, verbose=True)
#spark_ai = SparkAI(engine='gpt-3.5-turbo')
#spark_ai = SparkAI()

The above code works, but fails when I use explain or transform.

It says "Must provide an 'engine' or 'deployment_id' parameter to create a " but spark_ai = SparkAI(llm=llm, verbose=True) does not accept engine or deployment_id as an argument.

Can you retrieve the SQL query?

We have been playing around with the tool for a bit, super cool!
But is there a way to get the SQL query to help us debug?

We noticed that sometimes it's not using the partitioned columns, which doesn't scale well.
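
One workaround, grounded in the logs shown in other issues on this page: initializing SparkAI with verbose=True appears to log the temp view and the generated SQL for each transform.

from pyspark_ai import SparkAI

# With verbose=True, lines like "INFO: SQL query for the transform: ..." are printed,
# which show the generated SQL you can copy out for debugging.
spark_ai = SparkAI(verbose=True)
spark_ai.activate()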

Error while using ChatGooglePalm

from pyspark.sql import SparkSession
import pyspark_ai
spark = SparkSession.builder.appName('dummy').getOrCreate()

data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
columns = ["language","user_cnt"]

rdd = spark.sparkContext.parallelize(data)

from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("language", StringType(), nullable=False),
    StructField("user_cnt", StringType(), nullable=False)
])
df=spark.createDataFrame(data,schema=schema)
# llm=palm.get_model("models/chat-bison-001")
from langchain.chat_models import ChatGooglePalm
llm = ChatGooglePalm(temperature=0.9,google_api_key="GOOGLE_API_KEY")
spark_ai=pyspark_ai.SparkAI(llm=llm,spark_session=spark)
spark_ai.activate()
df.ai.transform("language with high user count")

ChatGooglePalmError: ChatResponse must have at least one candidate.

Setup github action

Description

This RFC proposes setting up GitHub Actions to run tests on every commit.

Related Issues

None

Questions

None

Using PySpark code instead of SQL

Right now, every DF transformation creates a new temp view and the transformation is applied as a SQL query on top of the temp view. Unfortunately, this creates a lot of state in the Spark session and makes it harder to trace the actual source of the request.

It would be awesome if one could choose between PySpark and SQL transformations. I've had some good success with the following prompt template.

Given a PySpark dataframe with the name `df` and with the following schema:

id: bigint
dropoff_zip: string
pickup_zip: string
fare_amount: double
toll_amount: double
tip: double
passenger_count: int


Write a Python function called `transform_df` that performs the following transformation and returns a new dataframe:  Show only rows with more than 3 passengers.

The answer MUST contain one function only. Ensure your answer is correct.

Ideally, this could be embedded in PySpark AI like this:

transformed = df.ai.transform("Show only rows with more than 3 passengers")
transformed.explain()

shows the full trace of the operation instead of just the read from the temp view.
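
For the example instruction in the prompt above, the kind of function the LLM would be expected to return might look like this (hypothetical output, not something pyspark-ai generates today):

from pyspark.sql import DataFrame


def transform_df(df: DataFrame) -> DataFrame:
    # Show only rows with more than 3 passengers.
    return df.filter(df.passenger_count > 3)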

Resolve deprecated warnings from langchain

Some of the langchain usages are already deprecated:

./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing GoogleSearchAPIWrapper from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing LLMChain from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing FewShotPromptTemplate from langchain root module is no longer supported.
  warnings.warn(
./.pyenv/versions/3.9.17/lib/python3.9/site-packages/langchain/__init__.py:39: UserWarning: Importing BasePromptTemplate from langchain root module is no longer supported.
  warnings.warn(

Loading DataFrame from a saved source

Right now, I don't see a way to load data from a saved source (e.g., a table saved in Databricks or Snowflake). It would be helpful if this could be done so we can use existing tables in conjunction with language-based commands.
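
As a hedged workaround under the current API: because activate() patches every DataFrame, a table read with plain Spark can already be used with the English commands. The table name below is a placeholder.

# Read an existing saved table with plain Spark, then use the English SDK on it.
df = spark_ai._spark.table("my_catalog.my_schema.my_table")  # placeholder table name
df.ai.transform("top 10 rows by revenue").show()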

Table name as parameter for function create_df()

Hi,

could you add a parameter "tablename" to the function create_df()? In some cases a table with an automatically detected name cannot be created, but the SQL works with a different table name:

Spark AI call:

Debug output:

INFO: Parsing URL: https://www.procontra-online.de/sach-privat/artikel/softfair-ermittelt-die-besten-wohngebaeude-tarife

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW wohngebäude_tarife AS 
SELECT 'Alte Leipziger' AS Versicherer, 'comfort mit Baustein Haus- und Wohnungsschutzbrief' AS Tarifkombination, 5275 AS Punktzahl
UNION ALL
SELECT 'Janitos' AS Versicherer, 'Best Selection mit Bausteinen Allgefahrendeckung, Hausschutzbrief und Multi-Garantie' AS Tarifkombination, 5275 AS Punktzahl
UNION ALL
SELECT 'Dema' AS Versicherer, 'Immo Protect Top mit Baustein Unbenannte Gefahren/ Marktgarantie' AS Tarifkombination, 5225 AS Punktzahl
UNION ALL
SELECT 'Domcura' AS Versicherer, 'Top mit Baustein Unbenannte Gefahren/ Marktgarantie' AS Tarifkombination, 5225 AS Punktzahl
UNION ALL
SELECT 'Adcuri' AS Versicherer, 'Premium mit Bausteinen Elektronik & Haustechnik, Unbenannte Gefahren' AS Tarifkombination, 5200 AS Punktzahl
UNION ALL
SELECT 'Manufaktur Augsburg' AS Versicherer, 'Premium Plus mit Bausteinen Smart Home, Unbenannte Gefahren/Marktgarantie' AS Tarifkombination, 5125 AS Punktzahl
UNION ALL
SELECT 'Axa' AS Versicherer, 'Komfort mit Bausteinen Optimum, Premium' AS Tarifkombination, 5075 AS Punktzahl
UNION ALL
SELECT 'Grundeigentümer Versicherung' AS Versicherer, 'ProtectPremium mit Baustein Soforthilfe' AS Tarifkombination, 5030 AS Punktzahl
UNION ALL
SELECT 'Rhion' AS Versicherer, 'Premium mit Baustein Best-Leistungs-Garantie' AS Tarifkombination, 4975 AS Punktzahl
UNION ALL
SELECT 'Konzept & Marketing' AS Versicherer, 'Allsafe Domo' AS Tarifkombination, 4830 AS Punktzahl

INFO: Storing data into temp view: wohngebäude_tarife

Spark Parse Exception:

[PARSE_SYNTAX_ERROR] Syntax error at or near 'ä'.(line 1, pos 35)

== SQL ==
CREATE OR REPLACE TEMP VIEW wohngebäude_tarife AS 
...
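
A hypothetical sketch of the requested parameter; tablename does not exist on create_df today, and the ASCII view name is only meant to avoid the 'ä' parse error above:

df = spark_ai.create_df(
    "https://www.procontra-online.de/sach-privat/artikel/softfair-ermittelt-die-besten-wohngebaeude-tarife",
    tablename="wohngebaeude_tarife",  # hypothetical parameter: force an ASCII temp view name
)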

Uses GPT-4 by default, which may be unavailable

It's just a minor issue, but as of today, GPT-4 is unavailable if you are not part of the limited beta. The workaround is passing the LLM to SparkAI:

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
sa = SparkAI(llm=llm)
sa.activate()

I wonder if it makes sense to pass just the name of the model to SparkAI, or make gpt-3.5 the default as it's readily available.

Error in executing spark_ai.activate().. please help


ModuleNotFoundError Traceback (most recent call last)
Cell In[14], line 2
1 # Activate partial functions for Spark DataFrame
----> 2 spark_ai.activate()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/pyspark_ai/pyspark_ai.py:428, in SparkAI.activate(self)
426 DataFrame.ai = AIUtils(self)
427 # Patch the Spark Connect DataFrame as well.
--> 428 from pyspark.sql.connect.dataframe import DataFrame as CDataFrame
429 CDataFrame.ai = AIUtils(self)

ModuleNotFoundError: No module named 'pyspark.sql.connect'

Fix project dependencies

Some project dependencies should be flagged as dev-dependencies. We want the fewest dependencies possible when this project is pip installed.

A quinn user recently informed me that dev-dependencies are being deprecated and poetry is moving towards group dependencies.

This should work better for this project too. We should think about what dependency groups would be ideal for this project.

Add example notebook with Code Llama

Add an example notebook with Code Llama. For instance, we can try Code Llama on the following functions:

  • df.ai.plot()
  • df.ai.verify()
  • Create python UDF via @spark_ai.udf annotation

Project maintenance and further plans of development?

Hello!
The project is very cool, but it looks like it has been facing a lack of maintenance for the last four months. It looks like the latest substantive commit, other than version bumps, was in November. Also, there are no new feature issues. What are the authors' plans for the project and further development?

I have motivation to contribute. As a start I can try to at least update dependencies, like:

  1. Bump Python to ~3.9, because 3.8 is officially legacy and its end of support is October 2024
  2. Bump langchain to ~0.1 (the latest is 0.1.9, but the project uses 0.0.354)
  3. Bump openai to ~1.0 (the latest version of openai is 1.13, but the project uses 0.27.10)
  4. Try to update the overall code to make it work with the latest langchain and openai

Temp View Generation does not properly work with Spark Connect

Since Spark Connect will lazily evaluate the generated code, using the same name for the view for every invocation of ai.transform() will not work.

The current behavior works because the execution plan is eagerly evaluated when spark.sql() is called, but this will not work properly when the analysis of the plan is deferred.

Repro

> env PYSPARK_DRIVER_PYTHON=ipython poetry run pyspark --remote local --packages org.apache.spark:spark-connect_2.12:3.4.1
In [7]: from pyspark_ai import SparkAI
   ...: from langchain.chat_models import ChatOpenAI
   ...:
   ...: llm = ChatOpenAI(model="gpt-3.5-turbo")
   ...: ai = SparkAI(llm=llm, spark_session=spark)
   ...: ai.activate()
   ...:
   ...: df = spark.range(10)
   ...: df2 = spark.range(100)
   ...:
   ...: r = df.ai.transform("count of rows")
   ...: assert(r.collect()[0][0] == 10)
   ...:
   ...: r2 = df2.ai.transform("count of rows")
   ...:
   ...: # Attention
   ...: assert(r2.collect()[0][0] == 100)
   ...: assert(r.collect()[0][0] == 10)
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("temp_view_for_transform")

2023-07-23 11:40:45,277 48026 INFO execute_command Execute command for command create_dataframe_view { input { common { plan_id: 20 } range { start: 0 end: 10 step: 1 } } name: "temp_view_for_transform" replace: true }
2023-07-23 11:40:45,277 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:45,277 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:45,287 48026 INFO schema Schema for plan: root { common { plan_id: 20 } range { start: 0 end: 10 step: 1 } }
INFO: SQL query for the transform:
SELECT COUNT(*) FROM temp_view_for_transform

2023-07-23 11:40:46,064 48026 INFO execute_command Execute command for command sql_command { sql: "SELECT COUNT(*) FROM temp_view_for_transform" }
2023-07-23 11:40:46,064 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,064 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,074 48026 DEBUG _execute_and_fetch_as_iterator Received the SQL command result.
2023-07-23 11:40:46,079 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,080 48026 INFO to_table Executing plan root { common { plan_id: 24 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,080 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,080 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,084 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,108 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,109 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("temp_view_for_transform")

2023-07-23 11:40:46,109 48026 INFO execute_command Execute command for command create_dataframe_view { input { common { plan_id: 21 } range { start: 0 end: 100 step: 1 } } name: "temp_view_for_transform" replace: true }
2023-07-23 11:40:46,110 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,110 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,116 48026 INFO schema Schema for plan: root { common { plan_id: 21 } range { start: 0 end: 100 step: 1 } }
INFO: SQL query for the transform:
SELECT COUNT(*) FROM temp_view_for_transform

2023-07-23 11:40:46,119 48026 INFO execute_command Execute command for command sql_command { sql: "SELECT COUNT(*) FROM temp_view_for_transform" }
2023-07-23 11:40:46,119 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,119 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,123 48026 DEBUG _execute_and_fetch_as_iterator Received the SQL command result.
2023-07-23 11:40:46,128 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,128 48026 INFO to_table Executing plan root { common { plan_id: 27 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,128 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,128 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,132 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,151 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,152 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
2023-07-23 11:40:46,152 48026 INFO to_table Executing plan root { common { plan_id: 24 } sql { query: "SELECT COUNT(*) FROM temp_view_for_transform" } }
2023-07-23 11:40:46,152 48026 INFO _execute_and_fetch ExecuteAndFetch
2023-07-23 11:40:46,152 48026 INFO _execute_and_fetch_as_iterator ExecuteAndFetchAsIterator
2023-07-23 11:40:46,156 48026 DEBUG _execute_and_fetch_as_iterator Received the schema.
2023-07-23 11:40:46,174 48026 DEBUG _execute_and_fetch_as_iterator Received arrow batch rows=1 size=328
2023-07-23 11:40:46,174 48026 DEBUG _execute_and_fetch_as_iterator Received metric batch.
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[7], line 18
     16 # Attention
     17 assert(r2.collect()[0][0] == 100)
---> 18 assert(r.collect()[0][0] == 10)

AssertionError:

In [8]:
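
A minimal sketch of one possible fix, independent of the actual pyspark-ai internals: give every transform its own temp view name (as the spark_ai_temp_view_<hash> names in newer logs on this page do), so lazily analyzed Spark Connect plans don't all resolve to the last registered view.

import uuid

from pyspark.sql import DataFrame, SparkSession


def transform_with_unique_view(spark: SparkSession, df: DataFrame, sql_template: str) -> DataFrame:
    # Register the input under a fresh, per-invocation view name.
    view_name = f"spark_ai_temp_view_{uuid.uuid4().hex[:6]}"
    df.createOrReplaceTempView(view_name)
    # The SQL template references the view via a {view} placeholder.
    return spark.sql(sql_template.format(view=view_name))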

Dataframe from webpage doesn't load entire table

When trying to load a dataframe from a webpage, e.g.:

df = spark_ai.create_df('https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita')

It often doesn't load the entire table into the dataframe (the table on the webpage has 195 rows):

INFO: Parsing URL: https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_034e03 AS SELECT * FROM VALUES
('Gibraltar', 1444, 48641, 2022),
('Guernsey', 1365, 86000, 2014),
('San Marino', 1300, 44200, 2022),
('Liechtenstein', 1193, 45800, 2022),
('Andorra', 1050, 81000, 2021),
('Monaco', 910, 35500, 2022),
('United States', 908, 305000000, 2023),
('New Zealand', 884, 4529700, 2022),
('Canada', 790, 30754600, 2022),
('Finland', 790, 4368796, 2022)
AS v1(country_or_region, motor_vehicles_per_1000_people, total, year)

INFO: Storing data into temp view: spark_ai_temp_view_034e03

If I retry several times, passing different column names (or subsets) so that it doesn't just use the cache, only sometimes does it return all rows:

df = spark_ai.create_df('https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita', ['country', 'vehicles'])
INFO: Parsing URL: https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_c345c7 AS 
SELECT * FROM VALUES
('Gibraltar', 1444),
('Guernsey', 1365),
('San Marino', 1300),
('Liechtenstein', 1193),
('Andorra', 1050),
('Monaco', 910),
('United States', 908),
('New Zealand', 884),
('Canada', 790),
('Finland', 790),
('Malta', 786),
('Cyprus', 785),
('Luxembourg', 784),
('Australia', 782),
('Guam', 777),
('Italy', 755),
('Estonia', 715),
('Iceland', 720),
('Poland', 687),
('Jersey', 674),
('France', 668),
('Puerto Rico', 666),
('Japan', 661),
('Slovenia', 660),
('Bahamas', 650),
('Czech Republic', 648),
('Portugal', 639),
('Wales', 637),
('Norway', 635),
('Germany', 628),
('Spain', 627),
('Brunei', 614),
('Slovakia', 611),
('Greece', 606),
('Switzerland', 604),
('United Kingdom', 600),
('Qatar', 591),
('Belgium', 590),
('Netherlands', 588),
('Austria', 572),
('Antigua and Barbuda', 561),
('Scotland', 557),
('Kuwait', 556),
('Dominica', 550),
('Sweden', 545),
('Malaysia', 542),
('Denmark', 540),
('Ireland', 535),
('Lithuania', 507),
('South Korea', 485),
('Bulgaria', 482),
('Croatia', 479),
('Saint Kitts and Nevis', 479),
('Syria', 472),
('Hungary', 463),
('Nauru', 455),
('Suriname', 446),
('Dominican Republic', 442),
('Romania', 441),
('Bahrain', 430),
('Barbados', 417),
('Chile', 416),
('Argentina', 402),
('Lesotho', 400),
('Russia', 395),
('Latvia', 394),
('Mexico', 391),
('Israel', 390),
('Brazil', 386),
('Serbia', 389),
('Georgia', 378),
('Moldova', 367),
('Montenegro', 367),
('Taiwan', 365),
('United Arab Emirates', 354),
('Uruguay', 348),
('Bosnia and Herzegovina', 345),
('Belarus', 343),
('Oman', 335),
('Trinidad and Tobago', 329),
('China', 296),
('Colombia', 296),
('Lebanon', 295),
('Seychelles', 295),
('Costa Rica', 287),
('Guyana', 285),
('Thailand', 280),
('Grenada', 268),
('Botswana', 260),
('Turkey', 254),
('Ukraine', 245),
('Maldives', 241),
('Albania', 238),
('Guatemala', 237),
('Kazakhstan', 226),
('Honduras', 222),
('Belize', 222),
('Panama', 218),
('Mongolia', 217),
('Saint Lucia', 208),
('North Macedonia', 205),
('Saint Vincent and the Grenadines', 204),
('Kyrgyzstan', 201),
('Iran', 183),
('Macau', 180),
('Armenia', 177),
('Tunisia', 177),
('South Africa', 176),
('Bolivia', 174),
('Jordan', 169),
('Tonga', 162),
('Namibia', 161),
('Sri Lanka', 157),
('São Tomé and Príncipe', 157),
('Saudi Arabia', 156),
('Bhutan', 150),
('Singapore', 149),
('Algeria', 149),
('Azerbaijan', 146),
('Fiji', 145),
('Nicaragua', 144),
('Ecuador', 143),
('Venezuela', 140),
('Myanmar', 138),
('Cape Verde', 133),
('Samoa', 130),
('Philippines', 120),
('Peru', 116),
('Nepal', 113),
('Iraq', 111),
('Morocco', 111),
('Hong Kong', 109),
('Turkmenistan', 102),
('Greenland', 100),
('Federated States of Micronesia', 96),
('Kosovo', 94),
('Uzbekistan', 87),
('Indonesia', 82),
('Jamaica', 81),
('Gambia', 80),
('Chad', 77),
('Zimbabwe', 76),
('Vanuatu', 71),
('Egypt', 70),
('Kenya', 69),
('El Salvador', 68),
('Cuba', 67),
('Senegal', 65),
('Nigeria', 61),
('Afghanistan', 61),
('Ivory Coast', 60),
('India', 59),
('Palestine', 58),
('Yemen', 52),
('Tajikistan', 51),
('Madagascar', 48),
('Ghana', 46),
('Comoros', 44),
('Sierra Leone', 40),
('Djibouti', 40),
('Angola', 36),
('Vietnam', 53),
('Guinea-Bissau', 35),
('Kiribati', 34),
('Togo', 33),
('Democratic Republic of the Congo', 32),
('Pakistan', 29),
('Zambia', 29),
('Benin', 27),
('Bangladesh', 27),
('Cambodia', 27),
('Mozambique', 26),
('Gabon', 26),
('Burkina Faso', 22),
('Liberia', 22)
AS v1(country, vehicles)

INFO: Storing data into temp view: spark_ai_temp_view_c345c7

Facing when trying to import SparkAI from pyspark

Facing TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'.
Code used:
from pyspark_ai import SparkAI

spark_ai=SparkAI(verbose=True)
spark_ai.activate()

Also, please clarify whether it is pyspark-ai or pyspark_ai that should be imported (this needs changes to the GitHub README).
