gabiviana93 / datascienceforbeginners

This project forked from bodacea/datascienceforbeginners

Slides, notebooks, references etc for Data Science for Beginners sessions

License: Other

Introduction

This was a weekly onsite/remote course, designed to get people who aren’t coders up to speed on data science techniques, and to have them try some of those techniques out for themselves.

Course structure is one topic per week, with:

  • Weekly reading list: blogposts, book chapters etc on that week’s topic, to be read before the session
  • Weekly tool setups: tools that need to be installed on PCs before each session
  • In session: Introduction to techniques on that topic
  • In session: Lab time, trying Python/R code related to those techniques
  • Post-session: “further reading” list of blogposts, books, courses etc, for anyone wanting to dive further into that topic

This gives an overview of 5-7 concepts per week, adding up to a better understanding of how data scientists work, and hopefully also a desire to explore this topic further. Examples are taken from social data science and free/open-source tools; these should be supplemented by guest talks from company data scientists about their projects, work and toolsets. The sessions are based on a Spring 2016 Columbia University course.

Helpful files

Sessions

  • 1: Designing and Scoping a Data Science Project
  • 2: Python basics
  • 3: Acquiring Data
  • 4: Communicating Results
  • 5: Cleaning and Exploring Data
  • 6: Machine Learning
  • 7: Handling Text Data
  • 8: Handling Geospatial Data
  • 9: Learning Relationships from Data
  • 10: Handling Big Data
  • 11: Enterprise Data Tools
  • 12: More machine learning

Session 1: Designing and Scoping a Data Science Project

This session:

  • Introduces students to the content and supporting materials needed for data scientists to work from a problem specification. Students will also comment on existing data science problem specifications.

  • Outcome: students will understand some of the needs and pitfalls in problem specifications, and will have started their own data science project specification.

  • Preparing for this session: Look at the problem statements on Kaggle.com, Drivendata.com and Datakind.org, and think about the types of questions being asked, the datasets being used and who benefits from each problem solution.

Session 2: Python basics

This session:

  • Introduces one of the most-used data science languages: Python.

  • Outcome: students will have set up Python and R on their personal machines, and be able to run basic commands in Python.

  • Preparing for this session: Install instructions are in the reference folder. Get familiar with your terminal window, and install IPython (if not already on your machine) and Git.
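
As a rough idea of what "basic commands in Python" can look like, here is a minimal sketch covering variables, lists, a function, a loop and a dictionary. The helper name `describe` is made up for illustration; the session titles come from the list above.

```python
# Basic Python: a list, a function, a loop and a dictionary.
sessions = [
    "Designing and Scoping a Data Science Project",
    "Python basics",
    "Acquiring Data",
]


def describe(number, title):
    """Return a short label such as 'Session 2: Python basics'."""
    return f"Session {number}: {title}"


# enumerate() pairs each title with its position, counting from 1.
labels = [describe(i, title) for i, title in enumerate(sessions, start=1)]
for label in labels:
    print(label)

# A dictionary mapping each title to the number of words it contains.
word_counts = {title: len(title.split()) for title in sessions}
print(word_counts)
```

Typing these lines one at a time into an IPython prompt is a reasonable way to check that the installation works.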

Session 3: Acquiring Data

This session:

  • Introduces students to the art of finding development data, and to the idea that almost anything can be a dataset if you look hard enough at it. Also introduces the basic concepts of APIs, web-scraping tools (including the Google spreadsheets webpage scraping tool) and PDF conversion tools (e.g. Cometdocs).

  • Preparing for this session: Download the Tabula tool, and think about data relevant to your projects that isn’t in machine-readable form (e.g. xls, pdf, images, maps etc).
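
The two acquisition routes mentioned above, APIs and web scraping, can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the JSON string and the HTML table are made-up stand-ins for what a real API or web page would return over HTTP, and `TableScraper` is a hypothetical helper, not one of the tools named in this session.

```python
import json
from html.parser import HTMLParser

# Made-up API response; a real API would return text like this over HTTP.
api_response = '{"results": [{"town": "A", "wells": 12}, {"town": "B", "wells": 7}]}'
records = json.loads(api_response)["results"]
total_wells = sum(r["wells"] for r in records)


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # tolerate cells with no enclosing row
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up web page fragment containing a small table.
page = (
    "<table><tr><th>town</th><th>wells</th></tr>"
    "<tr><td>A</td><td>12</td></tr><tr><td>B</td><td>7</td></tr></table>"
)
scraper = TableScraper()
scraper.feed(page)
print(records, total_wells)
print(scraper.rows)
```

Both routes end in the same place: a list of rows that can be saved as CSV or loaded into an analysis tool.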

Session 4: Communicating Results

This session:

  • Introduces communication and visualisation ideas and tools (Tableau, Highcharts/D3 etc). Students will also pitch their project ideas to the rest of the class. Before this lab, students will be asked to install Tableau, and download the Highcharts and D3 libraries.

  • Outcome: students will have a basic knowledge of persuasion through data visualisation, and have set up and know basic commands in Tableau.

  • Preparing for this session: Download the Tableau tool.

Session 5: Cleaning and Exploring Data

This session:

  • Introduces students to data munging and manually exploring patterns in data before using algorithms on it. Introduces the tools used for this: OpenRefine, R, Matplotlib etc. Before this lab, students will be asked to install OpenRefine (formerly Google Refine).

  • Outcome: students will have cleaned a ‘dirty’ dataset with OpenRefine, and explored its contents with R.

  • Preparing for this session:
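
The session itself uses OpenRefine and R, but the same kind of clean-up can be sketched in plain Python. The example below is a minimal illustration, assuming a tiny made-up CSV with stray whitespace, inconsistent capitalisation, missing values and a duplicate row; `clean_row` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Made-up 'dirty' data: stray spaces, mixed case, missing and invalid ages.
raw = """name,city,age
 Alice ,new york,34
Bob,New York ,
alice,NEW YORK,34
Carol,Boston,n/a
"""


def clean_row(row):
    """Trim whitespace, normalise capitalisation, and parse age if possible."""
    age_text = row["age"].strip()
    return {
        "name": row["name"].strip().title(),
        "city": row["city"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
    }


cleaned = []
seen = set()
for row in csv.DictReader(io.StringIO(raw)):
    record = clean_row(row)
    key = (record["name"], record["city"], record["age"])
    if key not in seen:                # drop rows that are duplicates once cleaned
        seen.add(key)
        cleaned.append(record)

# Simple exploration: how many records per city, and how many ages are missing.
city_counts = Counter(r["city"] for r in cleaned)
missing_ages = sum(1 for r in cleaned if r["age"] is None)
print(cleaned)
print(city_counts, missing_ages)
```

Note that the duplicate only becomes visible after normalisation, which is why cleaning usually comes before counting.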

Session 6: Machine Learning

This session:

  • Introduces students to machine learning, and the regression and classification algorithms used in machine learning.

  • Outcomes: students will have run a regression algorithm and a classification algorithm on a dataset, using both Python and R.

  • Preparing for this session:
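
To make the difference between regression and classification concrete, here is a minimal sketch in plain Python. It assumes simple least-squares line fitting as the regression example and a one-nearest-neighbour rule as the classification example; the course may well use other algorithms and libraries, and the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def nearest_neighbour(train, point):
    """Return the label of the training example closest to `point`.

    `train` is a list of (features, label) pairs.
    """
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    _, label = min(train, key=lambda item: sq_dist(item[0], point))
    return label


# Regression: predict a number (made-up points lying on y = 2x + 1).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept

# Classification: predict a category (made-up labelled points).
train = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]
label = nearest_neighbour(train, (2.0, 1.5))
print(slope, intercept, prediction, label)
```

Regression returns a number on a continuous scale, while classification returns one of a fixed set of labels; that is the distinction the two lab exercises are meant to show.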

Session 7: Handling Text Data

This session:

  • Introduces students to the idea of text as data, to methods and tools for obtaining text (Twitter API etc), and to methods for finding patterns in text (the NLTK library, Overview etc).

  • Outcome: students will understand the basic concepts of text analysis and language understanding, including issues specific to development data science (multiple languages, missing stopword lists etc).

  • Preparing for this session:
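
A first step in treating text as data is to split it into words, drop stopwords and count what is left. The sketch below does this with the standard library rather than NLTK; the stopword list is a tiny hand-made one and the example sentences are made up, which also illustrates the "missing stopword lists" problem: for many languages you would have to build such a list yourself.

```python
import re
from collections import Counter

# Tiny hand-made stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}


def tokenize(text):
    """Lower-case the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(texts, n=3):
    """Return the n most frequent non-stopword terms across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word for word in tokenize(text) if word not in STOPWORDS)
    return counts.most_common(n)


# Made-up messages standing in for tweets or survey responses.
docs = [
    "The water pump is broken",
    "Water is scarce in the village",
    "The village needs a new pump",
]
terms = top_terms(docs)
print(terms)
```

The regular expression here only matches unaccented Latin letters, so it would need changing for most non-English text.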

Session 8: Handling Geospatial Data

This session:

  • Introduces students to the idea of maps as data, and to visualising and reasoning about data with spatial components. Introduces techniques and tools commonly used in these processes (GDAL, Shapely, QGIS, CartoDB etc).

  • Outcome: students will understand basic concepts of spatial data, including issues specific to development data science (missing maps, satellite datasets etc)

  • Preparing for this session:
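
One basic piece of spatial reasoning is working out how far apart two latitude/longitude points are. The sketch below uses the haversine formula with the standard library only, assuming a spherical Earth with a mean radius of 6371 km; it is an illustration, not a replacement for the tools listed above, which handle projections and shapes properly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2))
```

The same one-degree step covers less ground away from the equator, which is one reason raw latitude/longitude differences should not be treated as distances.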

Session 9: Learning Relationships from Data

This session:

  • Introduces students to the network theory used in machine learning, and often used to understand social relationships. Also introduces some common network visualisation tools (e.g. Gephi, NetworkX).

  • Outcome: students will understand basic network analysis concepts and will have run Python network analysis algorithms and viewed a social dataset in Gephi.

  • Preparing for this session:
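
Two of the simplest network analysis ideas are degree centrality (how connected a node is) and shortest path length (how many steps separate two nodes). The sketch below implements both in plain Python on a small made-up friendship network; in the lab a library such as NetworkX would normally do this work, and the function names here are only illustrative.

```python
from collections import defaultdict, deque

# Made-up undirected friendship network.
edges = [("Ana", "Ben"), ("Ana", "Caz"), ("Ben", "Caz"), ("Caz", "Dee"), ("Dee", "Eli")]


def build_graph(edge_list):
    """Build an adjacency mapping: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_graph(edges)
centrality = degree_centrality(graph)
steps = shortest_path_length(graph, "Ana", "Eli")
print(centrality, steps)
```

Here Caz has the highest degree centrality because every path from Ana or Ben to Dee or Eli passes through her.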

Session 10: Handling Big Data

This session:

  • Introduces students to big data concepts (the three Vs, the other three Vs etc) and commonly used tools (Hadoop etc). Introduces students to the analysis of streaming data. If needed, the class will also spend time talking about any outstanding issues participants ran into during their projects, and potential ways to work around them.

  • Outcome: students will understand basic mechanisms for handling large volume and velocity data (variety is already covered above).

  • Preparing for this session: Download Hadoop.
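
The ideas behind handling volume and velocity can be tried on a laptop before touching Hadoop. The sketch below is a single-machine illustration, not Hadoop code: the first part imitates the map, shuffle and reduce steps of a word count, and the second keeps a running mean over a stream without storing the stream. All function names and data are made up.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def running_mean(stream):
    """Yield the mean of all values seen so far, using constant memory."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean


# Volume: in a real cluster each line could be mapped on a different machine.
lines = ["big data big tools", "data streams"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_totals = reduce_phase(shuffle(pairs))

# Velocity: values are processed one at a time as they arrive.
means = list(running_mean([2, 4, 6]))
print(word_totals, means)
```

The map and reduce functions never need to see the whole dataset at once, which is what lets frameworks such as Hadoop spread the work across many machines.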

Session 11: Enterprise Data Tools

This session:

  • Covers some of the enterprise data science tools out there (IBM Watson, Palantir, Ayasdi, Teradata etc)

  • Preparing for this session:

Session 12: More machine learning

This session:

  • Continues further into machine learning techniques

  • Preparing for this session:

Contributors

  • bodacea
