xindaiusu / introdatascience

This project forked from happyrabbit/introdatascience

Book Draft: Introduction to Data Science (https://scientistcafe.com/ids/)

License: Creative Commons Zero v1.0 Universal

TeX 3.72% CSS 0.14% HTML 28.24% Jupyter Notebook 67.87% R 0.04%

introdatascience's Introduction

This is a draft of the book Introduction to Data Science

Please note that this work is being written under a Contributor Code of Conduct and released under a CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits) you agree to abide by its terms.

Goal of the Book

This is a book on data science with a specific focus on industrial experience. Data science is a cross-disciplinary subject that requires hands-on experience and exposure to business problem-solving. Most existing introductory books on data science cover modeling techniques and their implementation in R or Python. However, many of them lack the context of the industrial environment, and a crucial part, the art of data science in practice, is often missing. This book intends to fill that gap.

Some key features of this book are as follows:

  • It covers both technical and soft skills.

  • It has a chapter dedicated to the big data cloud environment. In industry, data science is often practiced in such an environment.

  • It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can repeat the analysis in the book using the data and code provided. We also suggest that readers modify the notebooks to perform their own analyses with their own data and problems whenever possible. The best way to learn data science is to do it!

  • It focuses on the skills needed to solve real-world industrial problems rather than taking an academic approach.

Notebooks

Chapter | R | Python
Ch4: Big Data Cloud Platform | html, rmd | Create Spark Data, pyspark Notebook
Ch5: Data Preprocessing | html, rmd | Notebook
Ch6: Data Wrangling | html, rmd | Notebook
Ch7: Model Tuning Strategy | html, rmd | Notebook
Ch8: Measuring Performance | html, rmd | Notebook
Ch9: Regression Models | html, rmd | Notebook
Ch10: Regularization Methods | html, rmd | Notebook
Ch11: Tree-Based Methods | html, rmd | Notebook
Ch12: Deep Learning | html (DNN, CNN, RNN), rmd (DNN, CNN, RNN) | DNN, CNN, RNN, Tokenizing and Padding, MNIST with one hidden layer: step by step

How to run R and Python code

Use R code. You should be able to run the R code for every chapter except Chapter 4 in your local R console or RStudio. The code in each chapter is self-contained: you don't need to run the code from previous chapters before running the code in the current chapter. Within a chapter, however, you do need to run the code from the beginning. Each chapter starts with a code block that installs and loads all required packages. We also provide the .rmd notebooks that include the code to make it easier for you to repeat it.

To repeat the code on big data and cloud platforms, you need to use Databricks, a cloud data platform. We use Databricks because:

  • It provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts
  • It has a free community edition that is convenient for teaching purposes

Follow the instructions in section 4.3 to set up and use the Spark environment.

Use Python code. We provide Python notebooks for all the chapters on GitHub. As with the R notebooks, you should be able to run all of them on your local machine except for Chapter 4, for the reasons stated above. An easy way to run a notebook is to import it into Google Colab. To use Colab, you only need to log in to your Google account in the Chrome browser. To load a notebook into Colab, you can do either of the following:

  • Click the "Open in Colab" icon at the top of each linked notebook using the Chrome browser. It should load the notebook and open it in Colab.

  • In Colab, choose File -> Upload notebook -> GitHub. Paste the notebook's link into the box, search, and select the notebook to load it.
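As an alternative to the two options above, a notebook hosted on GitHub can usually be opened by rewriting its link into a Colab link. The sketch below assumes the common pattern in which Colab serves GitHub notebooks under `https://colab.research.google.com/github/` followed by the GitHub path. The function name `github_to_colab_url` and the example repository path are hypothetical and used only for illustration; substitute the link of the notebook you want to open.

```python
from urllib.parse import urlparse

# Assumed prefix under which Colab opens notebooks hosted on GitHub.
COLAB_PREFIX = "https://colab.research.google.com/github"


def github_to_colab_url(github_url: str) -> str:
    """Turn a github.com notebook link into the matching Colab link."""
    parsed = urlparse(github_url)
    if parsed.netloc != "github.com":
        raise ValueError(f"not a github.com link: {github_url}")
    path = parsed.path
    # Expect /<user>/<repo>/blob/<branch>/<path to notebook>.ipynb
    if "/blob/" not in path or not path.endswith(".ipynb"):
        raise ValueError(f"not a link to a notebook file: {github_url}")
    return COLAB_PREFIX + path


# Hypothetical notebook link, for illustration only.
example = "https://github.com/some-user/some-repo/blob/master/notebooks/example.ipynb"
print(github_to_colab_url(example))
```

Pasting the printed link into the browser should open the notebook in Colab, provided the repository is public.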

To repeat the code for big data, as with the R notebooks, you need to set up Spark in Databricks. Follow the instructions in section 4.3 to set up and use the Spark environment. Then run the "Create Spark Data" notebook to create Spark data frames. After that, you can run the pyspark notebook to learn how to use pyspark.
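To give a feel for what working with a Spark data frame looks like once the environment is ready, here is a minimal sketch. It is not taken from the book's notebooks: the toy rows, the column names, and the function names are all hypothetical. The Spark part is wrapped in a function so that it only runs where pyspark and a Spark cluster are available (for example, in a Databricks notebook cell); the plain-Python helper computes the same group means so the result can be cross-checked.

```python
# Hypothetical toy data; the book's "Create Spark Data" notebook builds its own data frames.
TOY_COLUMNS = ["segment", "income"]
TOY_ROWS = [("A", 100.0), ("B", 250.0), ("A", 300.0)]


def expected_means(rows):
    """Plain-Python reference: mean income per segment."""
    totals = {}
    for segment, income in rows:
        total, count = totals.get(segment, (0.0, 0))
        totals[segment] = (total + income, count + 1)
    return {segment: total / count for segment, (total, count) in totals.items()}


def mean_income_by_segment():
    """Same computation on a Spark data frame (requires pyspark and a running Spark)."""
    # Imported here so the rest of the file loads without pyspark installed.
    from pyspark.sql import SparkSession, functions as F

    # Returns the existing session if the environment already created one.
    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(TOY_ROWS, schema=TOY_COLUMNS)
    return sdf.groupBy("segment").agg(F.avg("income").alias("mean_income"))


print(expected_means(TOY_ROWS))
```

In a notebook attached to a Spark cluster, calling `mean_income_by_segment().show()` should display one row per segment, matching the values printed by `expected_means`.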

introdatascience's People

Contributors

happyrabbit, alshum
