A simple Spark architecture for running batch computations over an HDFS file source.
Flume consumes data from Kafka and feeds it into an HDFS cluster.
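
As a sketch, the Flume agent can be wired with a Kafka source, a memory channel, and an HDFS sink. The broker address, topic name, and HDFS path below are placeholders, not values taken from this repository:

```properties
# Hypothetical Flume agent: Kafka -> memory channel -> HDFS
agent.sources = kafka-source
agent.channels = mem-channel
agent.sinks = hdfs-sink

# Kafka source (broker and topic are placeholders)
agent.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-source.kafka.bootstrap.servers = kafka:9092
agent.sources.kafka-source.kafka.topics = opendata
agent.sources.kafka-source.channels = mem-channel

agent.channels.mem-channel.type = memory

# HDFS sink writing raw JSON events (path is a placeholder)
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/data/opendata
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.channel = mem-channel
```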
HDFS is a distributed file system that stores all the data the system ingests. Data written to it is immutable, which helps avoid data corruption.
Spark computes batches in RAM, which makes it well suited to iterative machine learning algorithms.
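
As an illustration of the batch side, here is a minimal Spark job that reads the ingested JSON from HDFS. The HDFS path and the `theme` column are assumptions for the sketch, not names from this repository:

```scala
import org.apache.spark.sql.SparkSession

object BatchJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dcat-batch")
      .getOrCreate()

    // Placeholder path: wherever Flume lands the JSON events.
    val df = spark.read.json("hdfs://namenode:8020/data/opendata")

    // Example aggregation: count records per DCAT theme (assumed field name).
    df.groupBy("theme").count().show()

    spark.stop()
  }
}
```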
- Package
Automatic packaging isn't available for Spark; build the jar manually:
cd spark && sbt package
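
For reference, a hypothetical `build.sbt` for the spark module might look like the following; the module name, Scala version, and Spark version are assumptions, not taken from this repository:

```scala
// Hypothetical build definition for the spark/ module.
name := "spark-batch"
scalaVersion := "2.12.18"

// Spark is usually marked "provided" since the cluster supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
```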
- Run
You can start the project with:
docker-compose up
Then you can feed Kafka with JSON messages in the DCAT format. Note that DCAT is a W3C standard widely used for open-data catalogs. You'll then be able to classify the data coming from your city API.
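
As a sketch, you can publish a DCAT-style record with the plain Kafka producer client from Scala. The broker address, topic name, and payload below are illustrative assumptions; real city-API payloads will differ:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object FeedKafka {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Minimal DCAT-like dataset description, for illustration only.
    val record =
      """{"@type": "dcat:Dataset",
        | "dct:title": "Air quality sensors",
        | "dcat:theme": "environment"}""".stripMargin

    producer.send(new ProducerRecord[String, String]("opendata", record))
    producer.close()
  }
}
```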