GithubHelp home page GithubHelp logo

allensmile / learningpyspark Goto Github PK

View Code? Open in Web Editor NEW

This project forked from drabastomek/learningpyspark

0.0 3.0 0.0 7.32 MB

Code base for the Learning PySpark book (in preparation)

License: GNU General Public License v3.0

Jupyter Notebook 99.56% Python 0.44% Shell 0.01%

learningpyspark's Introduction

Learning PySpark

Code base for the Learning PySpark book by Tomasz Drabas and Denny Lee.

Book cover

Available from Packt and Amazon.

Introduction

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as a human race) are expected to produce ten times that. With data getting larger literally by the second there is a growing appetite for making sense out of it.

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, how to use some fundamental data types available in PySpark, how to build machine learning models, operate on graphs, read streaming data and deploy your models in the cloud. Each chapter will tackle different problem and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.

Table of contents:

  1. Understanding Spark
  2. Resilient Distributed Dataset
  3. DataFrames
  4. Preparing Data for Modeling
  5. Introducing MLlib
  6. Introducing the ML Package
  7. GraphFrames
  8. TensorFrames
  9. Polyglot Persistence with Blaze
  10. Structured Streaming
  11. Packaging Spark Applications

About authors

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in Seattle area. He has over 13 years of experience in data analytics and data science in numerous elds: advanced technology, airlines, telecommunications, nance and consulting he gained while working on three continents: Europe, Australia and North America. While in Australia, Tomasz has been working on his PhD in Operations Research with focus on choice modeling and revenue management applications in airline industry.

At Microsoft, Tomasz works with big data on a daily basis solving machine learning problems such as anomaly detection, churn prediction or pattern recognition using Spark.

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016; you can purchase that book on Amazon, Packt and O’Reilly.

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team – Microsoft’s blazing fast, planet-scale managed document store service. He is a hands-on distributed systems and data sciences engineer with more than 18 years of experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premise and cloud environments.

He has extensive experience in building green eld teams as well as turnaround / change catalyst. Prior to joining the Azure DocumentDB team, Denny worked as a Technology Evangelist at Databricks; he has been working with Apache Spark since 0.5. He was also the Senior Director of Data Sciences Engineering at Concur, and was on the incubation team that built Microsoft’s Hadoop on Windows and Azure service (currently known as HDInsight). Denny also has a Masters of Biomedical Informatics from Oregon Health and Sciences University and has architected and implemented powerful data solutions for enterprise Healthcare customers for the last fteen years.

learningpyspark's People

Contributors

dennyglee avatar drabastomek avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.