dlab-berkeley / python-data-wrangling-legacy

D-Lab's 3-hour introduction to data wrangling in Python. Learn how to import and manipulate dataframes using pandas in Python.

License: Other

Jupyter Notebook 100.00%
pandas python data-science

python-data-wrangling-legacy's Issues

Spending more time on `groupby()`

Some participants asked for more time on groupby(), as they need it for their research. I feel it's important to spend at least 20 minutes on this section and its challenge. I don't have a definite plan for how to do this yet; options include adding another challenge after #10, giving more examples of groupby() functionality or scenarios, or introducing it earlier (e.g., before merging).
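As a starting point for extra groupby() material, here is a minimal sketch of three common patterns: a single aggregate, several aggregates over two keys, and transform(). The DataFrame contents and the country, year, and country_mean column names are made up for illustration and are not taken from the workshop data.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
unemployment = pd.DataFrame({
    "country": ["at", "at", "be", "be", "be"],
    "year": [1993, 1994, 1993, 1993, 1994],
    "unemployment_rate": [4.0, 4.2, 8.0, 8.4, 9.6],
})

# One aggregate per group: mean rate for each country.
mean_by_country = unemployment.groupby("country")["unemployment_rate"].mean()

# Several aggregates at once, grouped by two keys.
summary = (
    unemployment
    .groupby(["country", "year"])["unemployment_rate"]
    .agg(["mean", "count"])
)

# transform() returns a result aligned with the original rows,
# which is handy for comparing each row to its group average.
unemployment["country_mean"] = (
    unemployment.groupby("country")["unemployment_rate"].transform("mean")
)

print(mean_by_country)
print(summary)
print(unemployment)
```

A follow-up challenge could ask participants to predict the shape of each result before running it.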

Rename Repo

The name should be changed from "Introduction to Pandas" to "Python Data Wrangling and Manipulation With Pandas".

Incorrect header in the notebook

A minor error: in the introduction-to-pandas.ipynb file, the header

Challenge 7: Another way to get the month

should actually read:

Challenge 7: Another way to get the year

because the challenge asks for the year component of the DataFrame column, not the month.

`unemployment_rate` name

unemployment_rate is a bit of a weird name (it's not actual unemployment rates but a DataFrame with null proportions for unemployment rates).

Challenge 12 solution

The Challenge 12 solution seems too complex. Can't we just call .dropna() on our newly created ps column?

command line error

Part 7 has a line of code:
!head -5 ../data/unemployment_missing.csv

head is a Unix/Linux command, so participants who were using Windows locally cannot run this line.

Explain `.round()`

The use of .round() in Manipulating Columns needs to be explained. Without it, the floating-point result can fall just short of a whole number, and .astype(int) then truncates it. Compare ((unemployment['year_month'] - unemployment['year']) * 100).astype(int) and ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int): the first starts at 0, the second starts at 1.
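A minimal sketch of the comparison, assuming year_month stores the month in its decimal part (for example 1993.01 for January 1993). The twelve-row DataFrame below is a hypothetical stand-in for the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's data: one year, twelve months,
# with the month encoded in the decimal part of `year_month`.
unemployment = pd.DataFrame({
    "year": [1993] * 12,
    "year_month": [round(1993 + m / 100, 2) for m in range(1, 13)],
})

# Floating-point subtraction rarely lands exactly on 0.01, 0.02, ...,
# so the scaled values can sit slightly below or above a whole number.
scaled = (unemployment["year_month"] - unemployment["year"]) * 100

# astype(int) drops the fractional part, so a value like 0.9999... becomes 0.
truncated = scaled.astype(int)

# Rounding first snaps each value to the nearest whole number.
rounded = scaled.round(0).astype(int)

print(pd.DataFrame({"scaled": scaled, "truncated": truncated, "rounded": rounded}))
```

Printing the three columns side by side lets participants see which months, if any, are off by one when the rounding step is skipped.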

Part 7 Let's Look at Our File Depends on OS

Part 7, Exporting A DataFrame to csv, looks at the file using a "!" shell command, but that command only works on Mac, not Windows:
!head -5 data/unemployment_missing.csv
To do: add a Windows version of the "!" terminal command.
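One OS-independent option is to preview the file from Python rather than the shell. The sketch below is only one possible approach: the head() helper and the demo.csv file are hypothetical, and in the notebook the path would be the exported CSV instead.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Stand-in file so the sketch runs anywhere; in the notebook the path
# would point at the exported CSV instead.
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
demo_path.write_text(
    "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10)),
    encoding="utf-8",
)

first_lines = head(demo_path)
for line in first_lines:
    print(line)
```

Because islice() stops after n lines, the helper does not read the whole file, which keeps it usable on large exports.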

Minor typo in Readme

Click the "Launch" button under "Jupyter Notebooks" and navigate through your file system to the Python-Data-Visualization folder you downloaded above.

It should read Python-Data-Wrangling instead of Python-Data-Visualization.

pandas plotting

Add examples and descriptions for how to plot with pandas (e.g., pandas.DataFrame.plot).
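A minimal sketch of what such an example might look like, assuming matplotlib is installed (pandas uses it as its default plotting backend). The data, column names, and labels are invented for illustration.

```python
import matplotlib

# Non-interactive backend so the sketch runs without a display;
# omit this line in a Jupyter notebook, where plots render inline.
matplotlib.use("Agg")

import pandas as pd

# Hypothetical yearly averages, invented for illustration.
df = pd.DataFrame({
    "year": [1993, 1994, 1995, 1996],
    "unemployment_rate": [4.0, 4.2, 4.4, 4.1],
})

# DataFrame.plot returns a matplotlib Axes, which can be customized further.
ax = df.plot(x="year", y="unemployment_rate", kind="line", marker="o")
ax.set_ylabel("Unemployment rate (%)")
ax.set_title("Unemployment rate by year")
```

Changing kind to "bar" or "hist" is a quick way to show other chart types with the same call.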

Curriculum Review: Practice interpreting data

Understanding what we're looking at in the data is another useful skill. I think most social science and STEM grad students already have it, but for people who are new to data analysis, we could include more of that type of practice.

Missing values

Remove some values from the main CSV file

  • replace missing values with .fillna()
    • show method parameter
  • keep rows with no missing values with .dropna()
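The items above could be demonstrated along these lines. This is a sketch on a made-up four-row DataFrame, not the workshop CSV. Note that recent pandas releases deprecate the method parameter of .fillna() in favor of .ffill() and .bfill(), so the forward-fill step uses .ffill() here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing rates, invented for illustration.
df = pd.DataFrame({
    "year": [1993, 1994, 1995, 1996],
    "unemployment_rate": [4.0, np.nan, 4.4, np.nan],
})

# Replace missing values with a constant.
filled_constant = df["unemployment_rate"].fillna(0)

# Replace missing values with the column mean (NaN is skipped by mean()).
filled_mean = df["unemployment_rate"].fillna(df["unemployment_rate"].mean())

# Forward fill: carry the last observed value forward.
filled_forward = df["unemployment_rate"].ffill()

# Keep only rows where the rate is present.
complete_rows = df.dropna(subset=["unemployment_rate"])

print(filled_constant.tolist())
print(filled_mean.tolist())
print(filled_forward.tolist())
print(complete_rows)
```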

Suggestions for improvements

  1. unemployment_rate is a bit of a weird name (it's not actual unemployment rates but a DataFrame with null proportions for unemployment rates).
  2. unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()] -> no need for double subset
  3. The use of round() in Manipulating Columns needs to be explained. Without it, the floating-point result can fall just short of a whole number, and .astype(int) then truncates it. Compare ((unemployment['year_month'] - unemployment['year']) * 100).astype(int) and ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int): the first starts at 0, the second starts at 1.
  4. The Challenge 12 solution seems too complex. Can't we just call .dropna() on our newly created ps column?

Explain DataFrame vs Series

We might want to add a little section at the start of the notebook to explain the difference between DataFrame and Series objects.
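Such a section might open with a short demonstration like the following sketch. The two-row DataFrame is invented for illustration.

```python
import pandas as pd

# Hypothetical two-row table, invented for illustration.
df = pd.DataFrame({
    "country": ["at", "be"],
    "unemployment_rate": [4.0, 8.0],
})

# Single brackets with one column name return a one-dimensional Series.
one_column = df["unemployment_rate"]

# Double brackets (a list of names) return a two-dimensional DataFrame,
# even when the list holds a single column.
sub_frame = df[["unemployment_rate"]]

# Selecting one row by label also returns a Series, indexed by column name.
row = df.loc[0]

print(type(one_column).__name__, one_column.shape)
print(type(sub_frame).__name__, sub_frame.shape)
print(type(row).__name__, row.shape)
```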

double subsetting?

unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()] -> no need for double subset
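Assuming the suggestion is to filter rows in one step with a single boolean mask, the pattern can be sketched as follows. The three-row DataFrame and the is_missing name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing rate, invented for illustration.
unemployment = pd.DataFrame({
    "country": ["at", "be", "bg"],
    "unemployment_rate": [4.0, np.nan, 8.0],
})

# Build the boolean mask once: True where the rate is missing.
is_missing = unemployment["unemployment_rate"].isnull()

# Index the DataFrame with the mask to keep only the matching rows.
unemployment_rate_missing = unemployment[is_missing]

print(unemployment_rate_missing)
```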

Suggested changes to pandas lesson

Possible things to consider for intermediate/advanced pandas class:

  • reading/writing to formats other than csv
  • reading data in by chunks
  • multi-indexing and its methods
  • sparse matrices and their methods
  • ways to speed up mathy things (lots of %timeit)
  • parallelized methods for pandas

We can both also look up other pandas materials and think of other things we might like to add.
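As one illustration of the "reading data in by chunks" item, here is a minimal sketch using the chunksize parameter of pandas.read_csv. The in-memory CSV text and its column names are made up so the example is self-contained; a real lesson would read from a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content, built in memory so the sketch is self-contained.
csv_text = "country,unemployment_rate\n" + "".join(
    f"c{i % 3},{i}.0\n" for i in range(10)
)

total = 0.0
n_rows = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["unemployment_rate"].sum()
    n_rows += len(chunk)

overall_mean = total / n_rows
print(n_rows, overall_mean)
```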
