To spin up our infrastructure, we will use Terraform.
```bash
cd terraform
terraform init
terraform apply
```
This should spin up a VPC network and two n1-standard-2 VM instances: one for Kafka and one for Spark.
Create an SSH key on your local system in the `.ssh` folder:

```bash
ssh-keygen -t rsa -f ~/.ssh/KEY_FILENAME -C USER -b 2048
```
Add the public key to your VM instances using this link.
Create a config file in your `.ssh` folder:

```bash
touch ~/.ssh/config
```
Add the following content to it, replacing the placeholders with your values:
```
Host streamify-kafka
    HostName <External IP Address>
    User <username>
    IdentityFile <path/to/home/.ssh/gcp>

Host streamify-spark
    HostName <External IP Address>
    User <username>
    IdentityFile <path/to/home/.ssh/gcp>
```
SSH into the servers using the commands below, in two separate terminals:

```bash
ssh streamify-kafka
ssh streamify-spark
```
Clone the git repo into your VMs.
Run the scripts on the VMs to install Anaconda, Docker, docker-compose, and Spark:

```bash
bash scripts/vm_setup.sh username
bash scripts/spark_setup.sh username
```
- Open port `9092` on your Kafka VM using these steps.
- Set the environment variable `KAFKA_ADDRESS` to the external IP of the Kafka VM on both the Spark and Kafka machines:

  ```bash
  export KAFKA_ADDRESS=IP.ADD.RE.SS
  ```

- Run `docker-compose up` in the `kafka` folder on the Kafka VM.
- Run the following command in the `kafka/test_connection` folder to start producing messages (a rough producer sketch follows this list):

  ```bash
  python produce_taxi_json.py
  ```

- Move to the Spark VM and run the following command in the `spark_streaming/test_connection` folder (a rough consumer sketch follows this list):

  ```bash
  spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.3 stream_taxi_json.py
  ```
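For reference, here is a minimal sketch of what a producer like `produce_taxi_json.py` could look like. It assumes the `kafka-python` client, a hypothetical `rides.csv` input file, and a `test_topic` topic name; the actual script in the repo may differ:

```python
import csv
import json
import os
import time

from kafka import KafkaProducer  # assumes the kafka-python package is installed

# The broker address comes from the KAFKA_ADDRESS variable exported above.
KAFKA_ADDRESS = os.environ.get("KAFKA_ADDRESS", "localhost")

producer = KafkaProducer(
    bootstrap_servers=f"{KAFKA_ADDRESS}:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical sample file; replace with whatever data the repo ships.
with open("rides.csv") as f:
    for row in csv.DictReader(f):
        producer.send("test_topic", value=row)  # topic name is an assumption
        time.sleep(1)  # throttle to simulate a live stream

producer.flush()
```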
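Likewise, a rough sketch of what the Structured Streaming consumer `stream_taxi_json.py` might do; the topic name and message schema below are assumptions, not taken from the repo:

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

KAFKA_ADDRESS = os.environ.get("KAFKA_ADDRESS", "localhost")

spark = SparkSession.builder.appName("stream_taxi_json").getOrCreate()

# Assumed schema for the taxi ride JSON messages.
schema = StructType([
    StructField("vendor_id", StringType()),
    StructField("pickup_datetime", StringType()),
])

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", f"{KAFKA_ADDRESS}:9092")
    .option("subscribe", "test_topic")  # topic name is an assumption
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; decode and parse the JSON payload.
parsed = (
    df.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Print the parsed stream to the console to verify the connection.
query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```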
- Fix the Spark script with `py4j` eval.
- Make setup easier with a `Makefile`, possibly a one-click setup.