nikhilmeghnani / spark-tutorials

This project forked from johnsonj561/spark-tutorials

Spark and Python for Big Data with PySpark taken from Udemy

Jupyter Notebook 99.72% Python 0.28%

spark-tutorials's Introduction

PySpark and Spark Tutorials

Following along with Jose Portilla's Udemy course

Hadoop and MapReduce

  • Hadoop is a method for distributing very large files across multiple machines
  • it uses the Hadoop Distributed File System (HDFS) to let users work with large data sets
  • Hadoop uses MapReduce to allow for computation across the distributed data set
  • HDFS splits files into blocks (128 MB by default); each block is replicated 3 times by default and the copies are distributed across machines to support fault tolerance
  • smaller blocks provide more parallelization, multiple copies prevent loss of data
  • MapReduce splits a computational task across a distributed set of files
  • MapReduce consists of a Job Tracker and multiple Task Trackers
  • the Job Tracker sends code to run on the Task Trackers
  • each Task Tracker allocates CPU and memory for its tasks and monitors them on its worker node
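
The bullets above can be sketched in plain Python, without Hadoop: the input is cut into fixed-size blocks, a map step runs on each block independently, and a reduce step merges the per-block results. The function names (`split_into_blocks`, `map_block`, `reduce_counts`) and the tiny block size are assumptions chosen for illustration; real HDFS blocks are 128 MB by default and the map work runs on separate machines.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(words, block_size):
    """Cut the input into fixed-size blocks, roughly as HDFS does with files."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def map_block(block):
    """Map step: count words inside one block, independently of the others."""
    return Counter(block)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


text = "spark hadoop spark hdfs hadoop spark"
words = text.split()

blocks = split_into_blocks(words, block_size=2)   # hypothetical tiny block size
partial_counts = [map_block(b) for b in blocks]   # each call could run on a different machine
word_counts = reduce(reduce_counts, partial_counts, Counter())

print(len(blocks))         # 3
print(dict(word_counts))
```

Because each `map_block` call only sees its own block, smaller blocks mean more independent pieces of work that can run at the same time.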

Spark

  • Spark is one of the latest frameworks to quickly and easily handle big data
  • created at UC Berkeley's AMPLab; open-sourced in 2010 and donated to the Apache Software Foundation in 2013
  • written in Scala, so Scala normally gets the latest features
  • Scala runs on the JVM, so the Java API normally does well too!
  • Python and R APIs are the slowest to catch up
  • flexible alternative to MapReduce - i.e. it handles splitting of computational tasks across nodes

Spark vs MapReduce

  • MapReduce is tied to Hadoop because it expects its input files to be stored in HDFS
  • Spark can perform some in-memory operations up to 100x faster than MapReduce
  • Spark can read from HDFS and from other data sources
  • MapReduce writes most data to disk after each map and reduce step, while Spark keeps it in RAM and spills over to disk only when necessary. This makes Spark faster!
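
The disk-versus-RAM difference can be illustrated with a toy pipeline in plain Python. This is only a sketch of the idea, not how either framework is implemented: `run_on_disk`, `run_in_memory`, and the two step functions are hypothetical names, and JSON files in a temporary directory stand in for HDFS.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def keep_large(values):
    return [v for v in values if v > 4]


def run_on_disk(values, steps, workdir):
    """MapReduce-style: every step reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            current = json.load(f)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(current), f)
    with open(path) as f:
        return json.load(f)


def run_in_memory(values, steps):
    """Spark-style: intermediate results stay in RAM between steps."""
    current = values
    for step in steps:
        current = step(current)
    return current


steps = [double, keep_large]
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_on_disk([1, 2, 3, 4], steps, workdir)
    files_written = len(os.listdir(workdir))
memory_result = run_in_memory([1, 2, 3, 4], steps)

print(disk_result == memory_result)   # True
print(files_written)                  # 3
```

Both versions produce the same answer; the on-disk version pays for a file write and a file read at every step, which is the overhead the last bullet refers to.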

Spark RDDs

  • Resilient Distributed Dataset (RDD)
    • distributed collection of data
    • fault tolerant
    • parallel operation, partitioned
    • ability to use many data sources
  • immutable
  • lazily evaluated
  • cacheable
  • even if working with DataFrames, they are still RDDs under the hood
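
The immutable, lazily evaluated, and cacheable properties can be mimicked with a small toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this sketch and is not Spark's API; it only shows that transformations are recorded first and executed when an action such as `collect` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazily evaluated, cacheable."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)   # immutable copy of the input data
        self._steps = tuple(steps)     # recorded transformations, not yet run
        self._use_cache = False
        self._cache = None

    def map(self, fn):
        # A transformation returns a NEW dataset; nothing is computed here.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def cache(self):
        # Ask for the result to be kept after the first action.
        self._use_cache = True
        return self

    def collect(self):
        # An action: only now are the recorded steps actually executed.
        if self._cache is not None:
            return list(self._cache)
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        result = list(data)
        if self._use_cache:
            self._cache = tuple(result)
        return result


calls = []


def square(x):
    calls.append(x)   # record every time the function really runs
    return x * x


numbers = LazyDataset([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0).cache()

print(len(calls))                # 0: nothing has run yet
print(evens_squared.collect())   # [4, 16]
print(evens_squared.collect())   # [4, 16], served from the cache
print(len(calls))                # 4: square ran once per element, not twice
print(numbers.collect())         # [1, 2, 3, 4]: the original is unchanged
```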

Spark DataFrames

  • Spark 2.0 shifted towards DataFrame syntax
  • DataFrames are now the standard way of using Spark's ML capabilities
  • documentation for the DataFrame syntax is still relatively new
  • Spark DataFrames look very similar to pandas DataFrames
  • Columns = features
  • Rows = records

Local vs Distributed Systems

  • a local process is limited to the computational resources of a single machine
  • a distributed process has access to computational resources across a number of machines connected through a network
  • after a certain point, it is easier to scale out to many low-CPU machines than it is to scale up a single machine
  • a distributed system is fault tolerant: if one machine fails, the rest of the network still runs
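
Scaling out and fault tolerance can be sketched on one machine with a thread pool, where each thread stands in for a worker machine. This is an assumption made for illustration only: `partition`, `flaky_sum`, and `run_with_retry` are hypothetical helpers, and the simulated failure of partition 1 is hard-coded.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts roughly equal chunks, one per worker."""
    size = -(-len(data) // n_parts)   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


failed_once = set()


def flaky_sum(part_id, chunk):
    """Sum one chunk; partition 1 fails on its first attempt to mimic a lost machine."""
    if part_id == 1 and part_id not in failed_once:
        failed_once.add(part_id)
        raise RuntimeError("worker lost")
    return sum(chunk)


def run_with_retry(part_id, chunk, attempts=2):
    """Re-run a failed partition instead of failing the whole job."""
    for attempt in range(attempts):
        try:
            return flaky_sum(part_id, chunk)
        except RuntimeError:
            if attempt == attempts - 1:
                raise


data = list(range(1, 101))
chunks = partition(data, n_parts=4)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(run_with_retry, range(len(chunks)), chunks))

total = sum(partial_sums)
print(total)   # 5050
```

The job still finishes with the right total even though one partition failed once, because only that partition is re-run.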

spark-tutorials's People

Contributors

johnsonj561
