The introdatascience from xindaiusu

This is a draft of the book Introduction to Data Science.

Please note that this work is being written under a Contributor Code of Conduct and released under a CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits) you agree to abide by its terms.

Goal of the Book

This is a book on data science with a specific focus on industrial experience. Data science is a cross-disciplinary subject that requires hands-on experience and exposure to business problem-solving. Most existing introductory books on data science cover modeling techniques and their implementation in R or Python. However, many of them lack the context of the industrial environment, and a crucial part, the art of data science in practice, is often missing. This book intends to fill that gap.

Some key features of this book are as follows:

It covers both technical and soft skills.
It has a chapter dedicated to the big data cloud environment. In industry, data science is often practiced in such an environment.
It is hands-on. We provide the data and reproducible R and Python code in notebooks, so readers can repeat the analyses in the book. We also encourage readers to modify the notebooks and apply them to their own data and problems whenever possible. The best way to learn data science is to do it!
It focuses on the skills needed to solve real-world industrial problems rather than on academic theory.

Notebooks

Chapter	R	Python
Ch4: Big Data Cloud Platform	html, rmd	Create Spark Data, `pyspark` Notebook
Ch5: Data Preprocessing	html, rmd	Notebook
Ch6: Data Wrangling	html, rmd	Notebook
Ch7: Model Tuning Strategy	html, rmd	Notebook
Ch8: Measuring Performance	html, rmd	Notebook
Ch9: Regression Models	html, rmd	Notebook
Ch10: Regularization Methods	html, rmd	Notebook
Ch11: Tree-Based Methods	html, rmd	Notebook
Ch12: Deep Learning	html (DNN, CNN, RNN), rmd (DNN, CNN, RNN)	DNN, CNN, RNN, Tokenizing and Padding, MNIST with one hidden layer: step by step

How to run R and Python code

Use R code. You should be able to run the R code for every chapter except Chapter 4 in your local R console or RStudio. The code in each chapter is self-contained: you don't need to run the code from previous chapters first. Within a chapter, however, you do need to run the code from the beginning. Each chapter starts with a code block that installs and loads all required packages. We also provide .rmd notebooks containing the code to make it easier to reproduce.

To repeat the code on big data and cloud platforms, you need to use Databricks, a cloud data platform. We use Databricks because:

It provides a user-friendly, web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts.
It has a free community edition that is convenient for teaching purposes.

Follow the instructions in section 4.3 to set up and use the Spark environment.

Use Python code. We provide Python notebooks for all the chapters on GitHub. As with the R notebooks, you should be able to run all of them on your local machine except for Chapter 4, for the reasons stated above. An easy way to run a notebook is to import it into Google Colab. To use Colab, you only need to log in to your Google account in the Chrome browser. To load a notebook into your Colab, you can do either of the following:

Click the "Open in Colab" icon at the top of each linked notebook using the Chrome browser. It should load the notebook and open it in your Colab.
In your Colab, choose File -> Upload notebook -> GitHub. Copy-paste the notebook's link in the box, search, and select the notebook to load it.
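As an alternative to pasting the link by hand, a notebook hosted on GitHub can usually be opened in Colab by rewriting its URL. The sketch below assumes the common Colab URL pattern for GitHub-hosted notebooks. The helper name `github_to_colab` and the example repository path are hypothetical and are not taken from the book.

```python
from urllib.parse import urlparse

# Assumed Colab prefix for notebooks hosted on GitHub.
COLAB_PREFIX = "https://colab.research.google.com/github"


def github_to_colab(notebook_url: str) -> str:
    """Turn a github.com link to an .ipynb file into a Colab link (hypothetical helper)."""
    parsed = urlparse(notebook_url)
    if parsed.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    if not parsed.path.endswith(".ipynb"):
        raise ValueError("expected a link to an .ipynb notebook")
    # The GitHub path (/user/repo/blob/branch/file.ipynb) is appended unchanged.
    return COLAB_PREFIX + parsed.path


# Placeholder link for illustration only; substitute a real notebook link.
example = "https://github.com/some-user/some-repo/blob/main/notebooks/example.ipynb"
print(github_to_colab(example))
```

Opening the printed link in a browser where you are logged in to your Google account should load the notebook in Colab.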

To run the big data code, as with the R notebook, you need to set up Spark in Databricks. Follow the instructions in section 4.3 to set up and use the Spark environment. Then run the "Create Spark Data" notebook to create the Spark data frames. After that, you can run the `pyspark` notebook to learn how to use `pyspark`.
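For orientation, here is a minimal sketch of what creating and querying a Spark data frame with `pyspark` can look like. It is not the content of the "Create Spark Data" notebook: the sample rows, column names, and function names are all hypothetical. It assumes `pyspark` is available, as it is in a Databricks notebook, where a session object named `spark` is already defined.

```python
# Hypothetical sample data; the book's notebook builds its own data frames.
ROWS = [("a", 1.0), ("b", 2.5), ("c", 4.0)]
COLUMNS = ["id", "value"]


def build_demo_frame(spark=None):
    """Create a small Spark data frame from the sample rows."""
    if spark is None:
        # Imported lazily so this file can be loaded without pyspark installed.
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(ROWS, schema=COLUMNS)


def count_large_values(df, threshold=1.5):
    """Count rows whose `value` column exceeds the threshold."""
    return df.filter(df.value > threshold).count()
```

In a Databricks notebook, you could call `count_large_values(build_demo_frame(spark))`, passing the session that the platform provides.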
