databricks / spark-the-definitive-guide
Spark: The Definitive Guide's Code Repository
Home Page: http://shop.oreilly.com/product/0636920034957.do
License: Other
I enter this PySpark command:
maxSql = spark.sql(""" SELECT DEST_COUNTRY_NAME, sum(count) as destination_total FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME ORDER BY sum(count) DESC LIMIT 5 """)
Error:
19/09/17 13:38:04 WARN MetaData: Metadata has jdbc-type of integer yet this is not valid. Ignored
(the above WARN line is repeated 12 times)
19/09/17 13:38:07 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __ca
ll__
File "/Users/mordekay/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: flight_data_2015; line 1 pos 64'
What is the reason for this error?
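A likely cause, sketched below: spark.sql can only see tables and views registered in the session catalog, so flight_data_2015 must be created as a temp view before the query runs. The CSV path follows the book's repository layout and is an assumption about where you unpacked the data:

# Register the 2015 flight data as a temp view before querying it with SQL.
flightData2015 = spark.read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("/data/flight-data/csv/2015-summary.csv")

flightData2015.createOrReplaceTempView("flight_data_2015")

maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()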
Unable to locate the /data/deep-learning-images/ dataset.
Hi - I am following the Structured Streaming example from the console (using pyspark). I successfully read the JSON files (which I load from S3) and set the files per trigger. However, once I start the 'activityQuery' using the code below, I can't get back into the shell because it just runs continuously, so I cannot execute the 'activityQuery.awaitTermination' command or the spark.sql select command to see the activity_counts table. If I try to create another shell and run pyspark, none of the tables are available or visible.
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("memory").outputMode("complete")\
.start()
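For what it's worth, a minimal sketch of the intended interaction, assuming the same session as the one that started the query: start() returns immediately and the query runs on background threads, so the in-memory table can be polled from the same shell. The table is scoped to that SparkSession, which is why a second pyspark shell cannot see it:

import time

activityQuery = activityCounts.writeStream\
  .queryName("activity_counts")\
  .format("memory")\
  .outputMode("complete")\
  .start()

# Poll the in-memory table a few times from the same shell; a new pyspark
# shell is a different SparkSession, so the table is not visible there.
for _ in range(3):
    spark.sql("SELECT * FROM activity_counts").show(5)
    time.sleep(1)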
Going through the guide, it's a bit tedious to type out all of the code examples. Copy-pasting is a bit tedious as well, due to line endings and curly quotes (“”) appearing instead of straight quotes ("").
Would you accept a PR which includes the code examples from the book? Many textbooks have the code examples available on their website.
Hi, I am trying to use SQLite to create a DataFrame. Below is the code.
driver = "org.sqllite.JDBC"
path = "/dbfs/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqllite:" + path
table_name = "flight_info"
dfDF = spark.read.format("jdbc")\
  .option("url", url)\
  .option("dbtable", table_name)\
  .option("driver", driver)\
  .load()
I am getting a "class not found" error on the line that has the driver option:
java.lang.ClassNotFoundException: org.sqllite.JDBC
Is it possible to run JDBC code on Community Edition? Please let me know what I need to do to sort out this error.
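A hedged guess at the fix, sketched below: the SQLite JDBC driver class is org.sqlite.JDBC (one "l"), and the URL scheme is jdbc:sqlite:, so the extra "l" in "sqllite" would explain the ClassNotFoundException:

# Note the spelling: "sqlite", not "sqllite". Everything else is unchanged
# from the snippet above.
driver = "org.sqlite.JDBC"
path = "/dbfs/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
table_name = "flight_info"

dfDF = spark.read.format("jdbc")\
  .option("url", url)\
  .option("dbtable", table_name)\
  .option("driver", driver)\
  .load()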
It's my first time using Databricks, and I can't find a way to import your code and data into Databricks directly (I only found things like "upload file" and "import by URL"). Could you add instructions to the README, or just link to them?
Hi,
I have downloaded the repository and was able to execute and practice all the examples. But when I try to execute the examples related to the SQL data source from Chapter 9 (Data Sources), I get the following error. I don't have any clue what to do, so please guide me. Thanks in advance.
Error: "java.sql.SQLException: path to '/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db': '/databricks-datasets' does not exist"
The path does exist; when I run %fs ls I get the following:
dbfs:/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db
The parameters are:
driver = "org.sqlite.JDBC"
path = "/databricks-datasets/definitive-guide/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"
It would be great to have Java implementations of all the code in the book as well.
Hi,
I'm really stuck on this section of the Spark book:
staticDataFrame = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/mnt/defg/retail-data/by-day/*.csv")
I'm not able to understand the load("/mnt/...") part.
I have downloaded the data to my local drive, but now the issue is loading it. How do I load the data?
Is the /mnt/defg mount done via S3, or by some other method?
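/mnt/defg is a Databricks mount point used in the book; if the data is on your local drive you don't need a mount at all. A sketch, assuming you point load() at your local copy of the repository's retail-data folder (the path below is a placeholder):

# Replace the path with wherever you cloned/unpacked the repository.
staticDataFrame = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/path/to/Spark-The-Definitive-Guide/data/retail-data/by-day/*.csv")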
Hi
Is it possible to run the JDBC SQLite code in Databricks Community Edition?
If so how would I set this up?
Thanks
Jonathan
In Chapter 2, when loading the flightData2015 CSV, the action commands do not work at the Scala prompt in Spark 2.4.4. You get messages like:
:37 error: value take is not a member of org.apache.spark.sql.SparkSession
:37 error: value sort is not a member of org.apache.spark.sql.SparkSession
:37 error: value show is not a member of org.apache.spark.sql.SparkSession
According to some Stack Overflow posts, it is the wrong version of Maven. However, the README file for the Databricks download says that downloading Maven separately is not needed, because it is included in the pre-built package I downloaded in Chapter 1.
Do you have any suggestions on how to fix this? I don't see any Maven directories or files in the pre-built package spark-2.4.4-bin-hadoop2.7.gz.
In Chapter 5 the following DataFrame is used:
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/mnt/defg/streaming/*.csv")
.coalesce(5)
I cannot find a /streaming subdirectory in the /data subdirectory
When using retail-data/all/online-retail-dataset.csv, I have problems getting the InvoiceDate column recognized as a date.
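A hedged sketch of one workaround: inferSchema leaves InvoiceDate as a string, so convert it explicitly. The timestamp pattern below matches values like "12/1/2010 8:26" and is an assumption about the file's format:

from pyspark.sql.functions import to_timestamp, col

# Pattern "M/d/yyyy H:mm" is assumed from the raw file's contents.
df = df.withColumn("InvoiceDate", to_timestamp(col("InvoiceDate"), "M/d/yyyy H:mm"))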
Is there any specific reason? Can it be updated to run on Python 3, since that's the standard now?
Hello,
I am getting an error while importing seasonal decompose: from statsmodels.tsa.seasonal import seasonal_decompose
List of libraries installed on Databricks:
"azure-storage-blob": 12.9.0
"azure-identity": 1.7.1
"azure-keyvault": 4.2.0
"pandas": 1.3.4
"numpy": 1.20.3
"geopandas": 0.9.0
"shapely": 1.7.1
"dill": (latest available version)
"esda": 2.4.1
"rtree": 0.9.7
"scikit-learn": 1.0.1
"numba": 0.54.1
"setuptools": 58.0.4
"libpysal": 4.5.1
"xgboost": 1.4.2
"catboost": 1.0.3
"pyyaml": 6.0.0
"findspark": 1.3.0
"cvxpy": 1.3.0
"apache-sedona": 1.1.0
"scipy": 1.8.1
"hdbscan": 0.8.32
statsmodels
Please guide me.
On page 371, in the Python example, there's a backslash after "streaming" that should not be there, because there's additional code on the same line. The fix is to either remove the backslash or start ".selectExpr(" on the next line.
When running https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/code/Streaming-Chapter_22_Event-Time_and_Stateful_Processing.scala lines 239-319 (plus a few boilerplate lines, using Spark 2.2.0), "if (oldState.hasTimedOut)" on line 281 is never true for me, i.e. the timeout never fires; I only ever see output from the "if (state.values.length > 1000)" branch on line 287. The only modification I made to the code is to change the writeStream format from "memory" to "json" (and add the path and checkpointLocation), and I'm using the first 6 files/parts of the activity-data, both with and without readStream.option("maxFilesPerTrigger", 1). I also tried changing line 309 to "CURRENT_TIMESTAMP as timestamp", but it didn't help. Do you get any output (besides empty files) if you comment out lines 287-292? Could you please confirm whether this is a bug in the code or in Spark, or whether I am doing something wrong?
// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String,
count: BigInt)
val flightsDF = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val flights = flightsDF.as[Flight]
There are many statements in the book like this:
// in Scala
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()
-- in SQL
SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable
I understand "in Scala " means execution in spark-shell. Does "in SQL" mean short hand for spark.sql("xx")? similar question is how can I execute commands in Spark-The-Definitive-Guide/code/xx.sql file?
When executing the code from lines 2 to 7 of the "code/Advanced_Analytics_and_Machine_Learning-Chapter_25_Preprocessing_and_Feature_Engineering.scala" file, as follows,
val sales = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/by-day/*.csv")
.coalesce(5)
.where("Description IS NOT NULL")
an error occurs:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`Description`' given input columns: [<!DOCTYPE html>]; line 1 pos 0;
'Filter isnotnull('Description)
+- Relation[<!DOCTYPE html>#10] csv
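A hedged diagnosis: the inferred column name [<!DOCTYPE html>] suggests the "CSV" files are actually saved GitHub HTML pages (e.g., downloaded from the file-browser pages rather than via git clone or the raw links). A quick check, with an assumed local path:

# If the first line is "<!DOCTYPE html>" rather than the CSV header, the
# files need to be re-downloaded (git clone or the "raw" links).
with open("/data/retail-data/by-day/2010-12-01.csv") as f:
    print(f.readline())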
Could you please give proper instructions for the path to the data uploaded in databricks-datasets?
Also, I am not able to load the locally saved data folder directly to my Databricks cluster, and uploading each file separately is really cumbersome.
Is there any solution for that?
It seems to be the same for the Scala version, same as #L319.
There is an issue in the last code example for getting the top destination countries: count should be typecast to int before taking the sum, and a syntax error also occurs with the existing code.
Using Python...
Having trouble replicating the use of get_json_object and json_tuple (p. 110). One small issue is the use of "... as 'column'" [easily fixed by using .alias("column") instead], but even when accounting for this I get a "DataFrame object is not callable" error. If I replace col("jsonString") with jsonDF.jsonString (wherever it is used), the error goes away. This roughly replicates what's in the documentation for get_json_object, so I'm guessing it's the right answer. And when I run it I get the results in the book...
Also, the SQL string only calls out one column, "column"... and it doesn't look like the pure SQL code of previous examples, but I was able to reproduce the results.
If you see what I mean and agree, I'll do a PR (the errata page is still down...).
I won't address the Scala code, but it looks as if it would have the same issues...
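For reference, a sketch of the working Python form described above, with the Scala-style as "column" replaced by .alias("column") (whether col("jsonString") or jsonDF.jsonString is needed appears to depend on what is shadowing col in the session):

from pyspark.sql.functions import get_json_object, json_tuple, col

jsonDF.select(
    get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"),
    json_tuple(col("jsonString"), "myJSONKey")).show(2)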
I get the following error when running this code from Chapter 3 (Structured Streaming):
streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("csv")\
    .option("header", "true")\
    .load("/data/retail-data/by-day/*.csv")
NameError Traceback (most recent call last)
in
2 #How many files read together is identified by maxFilesPerTrigger
3 streamingDataFrame = spark.readStream
----> 4 .schema(staticSchema)
5 .option("maxFilesPerTrigger", 1)
6 .format("csv")\
NameError: name 'staticSchema' is not defined
Can anyone guide me on this? I am running the code on a Databricks Community cluster.
Thanks,
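The NameError just means staticSchema hasn't been defined in this notebook yet. A sketch of the definition the chapter uses earlier, reading the static copy of the same data and reusing its inferred schema:

# Define staticSchema (from the static read earlier in the chapter) before
# running the streaming read above.
staticDataFrame = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("/data/retail-data/by-day/*.csv")

staticSchema = staticDataFrame.schema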
In the example:
myRow.getInt(2) // String
It should be
myRow.getInt(2) // Int
right?
Recommend:
from pyspark.sql.functions import locate

def color_locator(column, color_string):
    return locate(color_string.upper(), column)\
        .cast("boolean")\
        .alias("is_" + color_string)  # existing code has 'c' instead of 'color_string' - doesn't work
The end of Chapter 24 is Scala-only. Is this intentional?
The training summary, and persisting and applying models, are Scala-only.
I was able to get the summary to work in Python 3 with Spark 2.3, but not the persist and apply. Can Python be added to the code?
trainedPipeline = tvsFitted.bestModel
lrBestModel = trainedPipeline.stages[1]
lrBestModel.summary.objectiveHistory
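A hedged sketch of the Python persist-and-apply step, mirroring the chapter's Scala; the save path and the name of the held-out DataFrame (test) are assumptions:

from pyspark.ml import PipelineModel

# Persist the best fitted pipeline; the path is a placeholder.
tvsFitted.bestModel.write().overwrite().save("/tmp/modelLocation")

# Load it back and apply it to new data (here an assumed test DataFrame).
loadedModel = PipelineModel.load("/tmp/modelLocation")
loadedModel.transform(test).show()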
Hi,
I'm starting to learn Apache Spark by reading the book. I'm now at Chapter 3, the streaming part.
For the code snippets, I chose Python 3.
My problem is that nothing is displayed in the console, as the book says there should be, from this code:
purchaseByCustomerPerHour.writeStream \
.format("console") \
.queryName("customer_purchases_3") \
.outputMode("complete") \
.start() \
.show()
I don't know if I'm doing something wrong; please tell me how to display the result in the console at start and at every update too.
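One hedged observation, sketched below: .show() is a DataFrame method, not a StreamingQuery method, so it shouldn't be chained after .start(). With the console sink, each trigger prints its own micro-batch, and awaitTermination() keeps a standalone script alive long enough to see it:

query = purchaseByCustomerPerHour.writeStream \
    .format("console") \
    .queryName("customer_purchases_3") \
    .outputMode("complete") \
    .start()

# The console sink prints every micro-batch on its own; just keep the
# driver alive. (In an interactive shell this line can be omitted.)
query.awaitTermination()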
First, the column `Quantity` is parsed as `String`. It should be `IntegerType`.
Second, `orderBy(CustomerId)` should be `orderBy(desc(Quantity))`.
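A minimal sketch of the suggested fix, assuming the DataFrame is named df as in the chapter:

from pyspark.sql.functions import col, desc

df.withColumn("Quantity", col("Quantity").cast("int"))\
  .orderBy(desc("Quantity"))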
Please find below a correction for the book, under the Grouping section.
In the book:
-- in SQL
%sql
SELECT count(*)
FROM dfTable
GROUP BY InvoiceNo, CustomerId
+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
| 536846| 14573| 76|
| 537026| 12395| 12|
| 537883| 14437| 5|
...
| 543188| 12567| 63|
| 543590| 17377| 19|
| C543757| 13115| 1|
| C544318| 12989| 1|
+---------+----------+-----+
The correct one:
-- in SQL
SELECT InvoiceNo, CustomerId, count(*)
FROM dfTable
GROUP BY InvoiceNo, CustomerId
Could you please modify the source accordingly?
The line:
"We use the sort is a transformation"
should read
"We use the sort as a transformation"
Hello,
I tried to follow the transfer learning example in pyspark on pages 534-535. However, when I try to do p_model = p.fit(train_df) I get a 'Number of source Raster bands and source color space components do not match' error. I tried googling it, but wasn't sure what to do. The trace includes references to java.awt.image.ColorConvertOp.filter, com.sun.imageio.plugins.jpeg.JPEGImageReader..., and javax.imageio.ImageIO.read. Note that I used Spark ML's ImageSchema read function, because the sparkdl readImages function seems to be deprecated.
Thanks in advance,
Nick
I'm learning Spark using the book and I ran into this code which returns an error:
from pyspark.sql.functions import expr, locate
simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
return locate(color_string.upper(), column)\
.cast("boolean")\
.alias("is_" + c)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type
df.select(*selectedColumns).where(expr("is_white OR is_red"))\
.select("Description").show(3, False)
Error: NameError: name 'c' is not defined
Even if I define a dummy 'c' within the function, I get the following:
AnalysisException: "cannot resolve '`is_white`' given input columns: [Quantity, StockCode, is_, InvoiceNo, is_, is_, CustomerID, is_, Description, InvoiceDate, UnitPrice, is_, Country]; line 1 pos 0;\n'Filter ('is_white || 'is_red)\n+- Project [cast(locate(BLACK, Description#14, 1) as boolean) AS is_#1632, cast(locate(WHITE, Description#14, 1) as boolean) AS is_#1633, cast(locate(RED, Description#14, 1) as boolean) AS is_#1634, cast(locate(GREEN, Description#14, 1) as boolean) AS is_#1635, cast(locate(BLUE, Description#14, 1) as boolean) AS is_#1636, InvoiceNo#12, StockCode#13, Description#14, Quantity#15, InvoiceDate#16, UnitPrice#17, CustomerID#18, Country#19]\n +- Relation[InvoiceNo#12,StockCode#13,Description#14,Quantity#15,InvoiceDate#16,UnitPrice#17,CustomerID#18,Country#19] csv\n"
Any suggestions?
Thank you!
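For completeness, a sketch of the fixed snippet: the alias must use the color_string parameter, since c only exists inside the list comprehension's scope (this is the same fix recommended earlier in this thread):

from pyspark.sql.functions import expr, locate

simpleColors = ["black", "white", "red", "green", "blue"]

def color_locator(column, color_string):
    return locate(color_string.upper(), column)\
        .cast("boolean")\
        .alias("is_" + color_string)   # was: "is_" + c

selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*"))  # has to be a Column type

df.select(*selectedColumns).where(expr("is_white OR is_red"))\
    .select("Description").show(3, False)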
The below piece of code, when run on a Spark 3 cluster,
colName = "count"
upperBound = 348113L
numPartitions = 10
lowerBound = 0L
fails with:
File "", line 2
upperBound = 348113L
^
SyntaxError: invalid syntax
Below is the correct code; Python 3 dropped the L suffix for long integer literals:
colName = "count"
upperBound = 348113
numPartitions = 10
lowerBound = 0
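These values feed the partitioned JDBC read from the same chapter; a usage sketch, assuming url, tablename, and props are defined as earlier in the chapter:

# Distribute the read across 10 partitions based on the numeric "count"
# column, using the bounds defined above.
spark.read.jdbc(url, tablename, column=colName,
                lowerBound=lowerBound, upperBound=upperBound,
                numPartitions=numPartitions, properties=props).count()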
It would be very helpful and time-saving to have the Databricks or IPython notebooks available to import, rather than the .py code files. Could you upload them separately to a directory?
When introducing the streaming DataFrame, the following code:
from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # a variant of select that accepts SQL expressions
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate") \
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.show(5)
produces output different from the one shown in the chapter, because it is missing a sorting step.
I think the correct code should be:
from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr( # a variant of select that accepts SQL expressions
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate") \
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day")) \
.sum("total_cost") \
.sort(desc("sum(total_cost)")) \
.show(5)
Under the "Working with JSON" section in the book, it is mentioned that
"The equivalent in SQL would be
jsonDF.selectExpr("json_tuple(jsonString, '$.myJSONKey.myJSONValue[1]') as column").show(2)"
but that is not SQL syntax in the first place. Could you please check and let us know the equivalent SQL expression for this JSON transformation?
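A hedged reading: selectExpr takes SQL expressions, so the "SQL equivalent" refers to the expression inside the string. As a standalone SQL statement it would look like the sketch below; note that get_json_object is the function that accepts a JSON path like '$.myJSONKey.myJSONValue[1]', while json_tuple takes plain field names:

# Assumes jsonDF has been registered as a view first.
jsonDF.createOrReplaceTempView("jsonTable")

spark.sql("""
SELECT get_json_object(jsonString, '$.myJSONKey.myJSONValue[1]') AS column
FROM jsonTable
""").show(2)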
Please find below a correction for the book, under the sumDistinct section.
In the book:
-- in SQL
SELECT SUM(Quantity) FROM dfTable -- 29310
This query actually returns 5176450, not 29310.
The correct one:
-- in SQL
SELECT SUM(DISTINCT(Quantity)) FROM dfTable -- 29310
This query returns 29310.
Could you please modify the source accordingly?
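The corresponding DataFrame API call from the same section of the book, for reference:

from pyspark.sql.functions import sumDistinct

df.select(sumDistinct("Quantity")).show()  # 29310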
I'm translating this book myself; the repository link is https://github.com/SnailDove/Translation_Spark-The-Definitive-Guide
At least for spark-shell, the Spark syntax needs this fix. Without the parentheses, the following command on page 421 of the book fails to generate a proper Array[...] in the params DataFrame:
val params = new ParamGridBuilder()
.addGrid(rForm.formula, Array( ...
I'm studying the Spark advanced RDD API and got a little bit confused by one example.
// in Scala
import org.apache.spark.Partitioner

class DomainPartitioner extends Partitioner {
  def numPartitions = 3
  def getPartition(key: Any): Int = {
    val customerId = key.asInstanceOf[Double].toInt
    if (customerId == 17850.0 || customerId == 12583.0) {
      return 0
    } else {
      return new java.util.Random().nextInt(2) + 1
    }
  }
}
As far as I can see in the code documentation, a partitioner must return the same partition ID given the same partition key. That is not true for the example in the code above. Doesn't returning a "random" ID for a key break the Partitioner interface?
When executing the below command to write to the below path, I am facing an issue.
val newPath = "jdbc:sqlite://tmp/my-sqlite.db"
val tablename = "flight_info"
val props = new java.util.Properties
props.setProperty("driver", "org.sqlite.JDBC")
csvFile.write.mode("overwrite").jdbc(newPath, tablename, props)
Driver Stack trace at high level:
Caused by: java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such table: flight_info)
at org.sqlite.core.DB.newSQLException(DB.java:890)
at org.sqlite.core.DB.newSQLException(DB.java:901)
at org.sqlite.core.DB.throwex(DB.java:868)
at org.sqlite.core.NativeDB.prepare(Native Method)
at org.sqlite.core.DB.prepare(DB.java:211)
Could you please review and let us know how to resolve this issue?
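One hedged avenue to investigate, sketched below in PySpark: SQLite permits only a single writer at a time, so a parallel JDBC write from many partitions can fail; writing from a single partition is a common workaround, though not a confirmed fix for this exact trace. Names mirror the Scala snippet above:

# coalesce(1) serializes the write through a single partition.
props = {"driver": "org.sqlite.JDBC"}
newPath = "jdbc:sqlite://tmp/my-sqlite.db"

csvFile.coalesce(1).write.mode("overwrite")\
    .jdbc(newPath, "flight_info", properties=props)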
I created a Scala program and am trying to run the streaming code. It is failing with the below error:
19/11/09 07:25:24 ERROR StreamExecution: Query customer_purchases_2 [id = 6863c8d1-fd1c-49eb-a454-92825f0d2782, runId = 8adbd1e6-04e1-4b79-9fad-05950f359e77] terminated with error
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
spark_guide.chapter3.StructuredStreaming$.delayedEndpoint$spark_guide$chapter3$StructuredStreaming$1(StructuredStreaming.scala:12)
spark_guide.chapter3.StructuredStreaming$delayedInit$body.apply(StructuredStreaming.scala:9)
scala.Function0$class.apply$mcV$sp(Function0.scala:34)
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
scala.App$$anonfun$main$1.apply(App.scala:76)
scala.App$$anonfun$main$1.apply(App.scala:76)
scala.collection.immutable.List.foreach(List.scala:392)
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
scala.App$class.main(App.scala:76)
spark_guide.chapter3.StructuredStreaming$.main(StructuredStreaming.scala:9)
spark_guide.chapter3.StructuredStreaming.main(StructuredStreaming.scala)
If I run the following code from Chapter 3 as two separately pasted snippets, all is OK.
// in Scala - snippet 1
import spark.implicits._

case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)

val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

// COMMAND ----------

// in Scala - snippet 2
flights
  .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
  .map(flight_row => flight_row)
  .take(5)
If I run it as one pasted snippet, the result is a Java exception:
java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
I am using Spark 2.2.2.
Kind regards, Lawrence
Query in the book:
INSERT INTO partitioned_flights
PARTITION (DEST_COUNTRY_NAME="UNITED STATES")
SELECT count, ORIGIN_COUNTRY_NAME FROM flights
WHERE DEST_COUNTRY_NAME='UNITED STATES'
LIMIT 12
In Spark 3.0, the above query returns the below error: AnalysisException: Cannot write incompatible data to table 'default.partitioned_flights':