
Supercomputing for Big Data 💻 SPARK KAFKA ✨


Stream Processing with Kafka 🏄

"ETL Is Dead, Long Live Streams" talk by Neha Narkhede

Generate real-time insights from streams of data!

  • Topics
    • Every message fed into Kafka must be a part of a topic.
    • Each topic can have multiple partitions. (So each instance just needs to be able to host a partition instead of the entire topic)
      • Each partition is also replicated for fault tolerance
      • Each partition has a leader (for reads/writes) and a few followers (which passively replicate the leader).
      • Hence, ordering is guaranteed only within a partition.
    • Within a partition, records are assigned sequential IDs (called offsets)
  • Producers
    • These apps publish messages on one of the topics
    • The producer chooses the partition to publish to (or uses round-robin).
  • Consumers
    • These apps subscribe to one or more topics and consume data
  • Broker
    • Every instance of Kafka is a broker (brokers can run on a single machine or across a cluster)
    • If Kafka brokers are the godowns (warehouses) supplying a restaurant, the chefs are the consumers
  • Streams API
    • Reads a stream from one or more topics and produces a stream back to one or more topics (see the sketch after this list)
    • Uses Kafka's stateful storage
    • Aggregations etc. 💣
    • Simple is Beautiful
    • It is the T in ETL
  • Connector API
    • Allows building reusable producers/consumers that connect Kafka to existing data solutions/systems.
    • It is the E and L in ETL
  • Kafka > Traditional Queues
    • It allows both queueing (load shared among consumers in a consumer group) and publish-subscribe (messages broadcast to all consumer groups) instead of choosing one.
  • Kafka relies on ZooKeeper to run.
  • WOW: Treat past data and incoming stream data the same way, and replace everything else used for data processing!
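
A minimal sketch in Scala of the consume-transform-produce loop described above, assuming the kafka-streams-scala DSL (2.7+ import paths); the topic names, application id, and broker address are hypothetical:

import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object WordCountSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch")  // hypothetical app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed local broker

  val builder = new StreamsBuilder
  builder.stream[String, String]("text-input")          // consume a topic (hypothetical name)
    .flatMapValues(_.toLowerCase.split("\\W+").toList)  // the "T" in ETL
    .groupBy((_, word) => word)
    .count()                                            // stateful aggregation (state store backed by a Kafka changelog topic)
    .toStream
    .to("word-counts")                                  // produce back to a topic (hypothetical name)

  new KafkaStreams(builder.build(), props).start()
}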

Lecture 2

  • MapReduce:
    • Map - Loading of data and defining a set of keys
    • Shuffling is automatic
    • Reduce - Collect key based data to process and output
    • For complex work, chain jobs together (each job will have map-reduce)
      • Use a higher-level language (Luigi 🐍)
    • MapReduce is slow and difficult to master
  • Apache Spark to the rescue:
    • Advantages
      • Parallel distribution
      • High level API
      • Fault tolerance
      • In memory computing
    • Spark's cool libraries: (discussed in next lecture)
      • Spark SQL
      • Spark Streaming (real time streaming)
      • MLlib (Machine Learning)
      • GraphX (Graph processing)
    • Runs on the standalone scheduler, YARN, or Mesos
    • Resilient Distributed Datasets (RDDs) (Spark's USP and primary data abstraction)
  • RDD
    • Distributed collection of elements
    • Parallelize data across cluster
    • Enables Caching
      • Data spills to disk if it exceeds memory
    • Tracks the computations so that lost data can be recomputed (called lineage)
    • Types of operations:
      • Transformations - Build up a DAG and are lazily evaluated (map, filter, flatMap)
      • Actions - Actually perform the transformations and return a value (collect, take, count, reduce, saveAsTextFile)
  • Use .toDebugString to view the RDD transformation DAG
  • Use rdd.cache() (equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)) to keep an intermediate transformed result in memory, so that actions can be applied to it repeatedly.
  • Awesome tuple (pair-RDD) transformations, combined in the sketch after this list:
    • groupByKey(), reduceByKey((x,y) => x+y), sortByKey()
    • rdd1.join(rdd2) joins pair RDDs on the key
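
A short spark-shell sketch tying these pieces together; it mirrors the map-shuffle-reduce pattern above (sc is the SparkContext the shell provides, and the input path is hypothetical):

// Word count: lazy transformations build the DAG, actions trigger computation.
val lines  = sc.textFile("./data/notes.txt")      // hypothetical input path
val counts = lines
  .flatMap(_.split("\\s+"))                       // transformation (narrow dependency)
  .map(word => (word, 1))                         // pair RDD of (word, count)
  .reduceByKey(_ + _)                             // transformation (wide dependency: shuffle)
counts.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
println(counts.toDebugString)                     // inspect the lineage / DAG
counts.sortByKey().take(10).foreach(println)      // actions perform the actual work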

Lab 1

manual

Useful Spark commands and notes:

  • Narrow dependency: operations like map; wide dependency: operations like reduceByKey (which require a shuffle)
  • Lazy evaluation: nothing is computed until requested (until then, only the DAG is built)
  • Create an RDD using sc.parallelize to enter Spark's programming paradigm
  • Load CSVs using sc.textFile("<name>")
  • Build filtered representations of the RDD using commands like:
val filterRDD = raw_data.map(_.split(","))                   // split each CSV line into fields
                        .filter(x => x(1) == "3/10/14:1:01") // keep rows whose timestamp matches
filterRDD.foreach(a => println(a.mkString(" ")))             // action: print each matching row
  • Use DataFrames and Datasets (type-checked DataFrames) to define a schema for the data being read, get type checking, etc.:
import java.sql.Timestamp
import org.apache.spark.sql.Encoders
import spark.implicits._  // for .as[SensorData]; spark is the shell's SparkSession

case class SensorData (
    sensorName: String,
    timestamp: Timestamp,
    numA: Double,
    numB: Double,
    numC: Long,
    numD: Double,
    numE: Long,
    numF: Double
)

// Derive the schema from the case class so the CSV columns get the right types.
val schema = Encoders.product[SensorData].schema

val ds = spark.read.schema(schema)
              .option("timestampFormat", "MM/dd/yy:hh:mm")
              .csv("./lab1/sensordata.csv")
              .as[SensorData]
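
With the typed Dataset in hand, queries are checked against the case class at compile time; a small hypothetical example using the fields above:

// Average of numA per sensor, restricted to rows where numC is positive.
ds.filter(_.numC > 0)   // typed lambda over SensorData
  .groupBy("sensorName")
  .avg("numA")
  .show()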

Lecture 1

  • 3V model
    • Volume
    • Velocity
    • Variety
  • 80% of the data generated is unstructured
  • Big data pipeline phases
    • Sense/Acquire
    • Store/Ingest
    • Retrieve/filter
    • Analyze
    • Visualize
  • Spark uses in-memory RDDs, so it is significantly faster than traditional MapReduce (at least for batch processing)

Lab 0

  • Hadoop is not a replacement for a relational database

  • It complements online transaction processing and online analytical processing

  • Used for structured and unstructured data (large quantities)

  • Not good for:

    • Hadoop is not good at processing transactions, due to its lack of random access.
    • It is not good when the work cannot be parallelized or when there are dependencies within the data, i.e., record one must be processed before record two.
    • It is not good for low-latency data access.
    • Not good for processing lots of small files
  • Ambari provides a GUI for managing a Hadoop cluster.
