Material for the 2019 ODSC East Workshop (Realish Time Predictive Analytics with Spark Structured Streaming)
ODSC West - Protobuf on Kafka with Data Sketching
Just download, untar, and move spark. I usually just drop into usr/local
tar -xvzf /path/to/spark-2.4.0-bin-hadoop2.7.tgz && mv /path/to/spark-2.4.0-bin-hadoop2.7/ /usr/local/spark-2.4.0/
export SPARK_HOME=/usr/local/spark-2.4.0
export ZEPPELIN_HOME=/usr/local/zeppelin-0.8.1
export HADOOP_HOME=/usr/local/hadoop-2.7.7
# Zeppelin
alias zeppelin_start="$ZEPPELIN_HOME/bin/zeppelin-daemon.sh --config $ZEPPELIN_HOME/conf/ start"
alias zeppelin_stop="$ZEPPELIN_HOME/bin/zeppelin-daemon.sh --config $ZEPPELIN_HOME/conf/ stop"
# Hadoop
alias start_hdfs="$HADOOP_HOME/sbin/start-dfs.sh"
alias stop_hdfs="$HADOOP_HOME/sbin/stop-dfs.sh"
alias hdfs="$HADOOP_HOME/bin/hdfs"
source ~/.bash_profile
You will need to setup some basic options in the zeppelin-env.sh
vim /usr/local/zeppelin-0.8.1/conf/zeppelin-env.sh
/usr/local/zeppelin-0.8.1/conf/zeppelin-env.sh
#### Spark interpreter configuration ####
## Use provided spark installation ##
## defining SPARK_HOME makes Zeppelin run spark interpreter process using spark-submit
##
export SPARK_HOME=/usr/local/spark-2.4.0 # (required) When it is defined, load it instead of Zeppelin embedded Spark libraries
# export SPARK_SUBMIT_OPTIONS # (optional) extra options to pass to spark submit. eg) "--driver-memory 512M --executor-memory 1G".
# export SPARK_APP_NAME # (optional) The name of spark application.
## Spark interpreter options ##
##
# export ZEPPELIN_SPARK_USEHIVECONTEXT # Use HiveContext instead of SQLContext if set true. true by default.
# export ZEPPELIN_SPARK_CONCURRENTSQL # Execute multiple SQL concurrently if set true. false by default.
# export ZEPPELIN_SPARK_IMPORTIMPLICIT # Import implicits, UDF collection, and sql if set true. true by default.
# export ZEPPELIN_SPARK_MAXRESULT # Max number of Spark SQL result to display. 1000 by default.
export ZEPPELIN_WEBSOCKET_MAX_TEXT_MESSAGE_SIZE=2048000 # Size in characters of the maximum text message to be received by websocket. Defaults to 1024000
export ZEPPELIN_INTERPRETER_OUTPUT_LIMIT=2048000
- in the terminal, if you have added the aliases to your bash,
zeppelin_start
- should emit green[OK]
when running - go to http://localhost:8080/#/interpreter
- under the Spark section, click the edit icon, and add
spark.executor.memory: 6g
,zeppelin.spark.maxResult: 50000
. It is worth noting that withspark.cores.max: 4
you will need24g
of ram to run zeppelin.spark.cores.max * spark.executor.memory = runtime ram dependency
- click
save
and the interpreter will restart with your updated settings.
-
sudo vim /usr/local/zeppelin-0.8.1/conf/interpreter.json
-
Add the workshop jar
{
"spark.jars": {
"name": "spark.jars",
"value": "/path/to/.m2/repository/com/twilio/open/odsc/spark-utilities/0.1.0-SNAPSHOT/spark-utilities-0.1.0-SNAPSHOT.jar",
"type": "string"
}
}
-
restart zeppelin
-
Now
DataFrameUtils
can be imported directly into the session. This is equivelant to--jars /path/to/jar
when submitting a spark application