dlab-berkeley / python-data-wrangling-legacy

D-Lab's 3-hour introduction to data wrangling in Python. Learn how to import and manipulate dataframes using pandas in Python.

License: Other

Jupyter Notebook 100.00%

pandas python data-science

python-data-wrangling-legacy's Issues

Spending more time on `groupby()`

Some participants asked whether more time could be spent on groupby(), since they need it for their research. I think it's important to spend at least 20 minutes on this section and its challenge. I don't have an immediate plan for how to do this; options include adding another challenge after #10, giving more examples of groupby() functionality or scenarios, or introducing it earlier (e.g., before merging).
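As a starting point for the extra examples, here is a minimal sketch of a few common groupby() patterns. The toy DataFrame and its column names are assumptions made for illustration, not the workshop's actual dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the workshop's dataset.
df = pd.DataFrame({
    "country": ["at", "at", "be", "be", "be"],
    "year": [2010, 2011, 2010, 2011, 2011],
    "unemployment_rate": [4.8, 4.6, 8.3, 7.1, 7.3],
})

# One aggregate per group.
mean_by_country = df.groupby("country")["unemployment_rate"].mean()

# Several aggregates at once, using named aggregation for the output columns.
summary = df.groupby("country").agg(
    mean_rate=("unemployment_rate", "mean"),
    max_rate=("unemployment_rate", "max"),
    n_obs=("unemployment_rate", "count"),
)

# Grouping by more than one key yields a result with a MultiIndex.
by_country_year = df.groupby(["country", "year"])["unemployment_rate"].mean()

# transform() returns values aligned with the original rows, which makes it
# easy to compute each row's deviation from its group mean.
df["rate_minus_country_mean"] = (
    df["unemployment_rate"]
    - df.groupby("country")["unemployment_rate"].transform("mean")
)

print(summary)
```

A follow-up challenge could ask participants to reproduce one of these summaries on the real data.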

Rename Repo

Name should be changed from "Introduction to Pandas" to "Python Data Wrangling and Manipulation With Pandas"

Incorrect header in the notebook

A minor error: in the introduction-to-pandas.ipynb file, the header

Challenge 7: Another way to get the month

Should actually read:

Challenge 7: Another way to get the year

because the challenge asks for the year component of the DataFrame column, not the month.

`unemployment_rate` name

unemployment_rate is a somewhat odd name, as it holds not actual unemployment rates but a DataFrame of null proportions for unemployment rates.

Curriculum Review: Include filter functions

Another thing that could be useful is using filtering functions along with groupby().
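One way this might look is groupby() combined with filter(), which keeps or drops whole groups based on a condition. This is only a sketch; the toy data and the thresholds are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
df = pd.DataFrame({
    "country": ["at", "at", "be", "be", "be", "bg"],
    "unemployment_rate": [4.8, 4.6, 8.3, 7.1, 7.3, 10.2],
})

# Keep only countries with at least two observations.
# filter() calls the function once per group and keeps the rows of every
# group for which it returns True.
enough_data = df.groupby("country").filter(lambda g: len(g) >= 2)

# Keep only countries whose mean unemployment rate exceeds 6.
high_mean = df.groupby("country").filter(
    lambda g: g["unemployment_rate"].mean() > 6
)

print(sorted(enough_data["country"].unique()))
print(sorted(high_mean["country"].unique()))
```

Unlike an aggregation, filter() returns rows of the original DataFrame rather than one row per group.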

Challenge 12 solution

The Challenge 12 solution seems too complex. Can't we just call .dropna() on our newly created ps column?

Challenge 7 "Month" Should be "Year"

Challenge 7 says: Another way to get the month
"month" should be "year", since that is what is being extracted.

command line error

Part 7 has a line of code:
!head -5 ../data/unemployment_missing.csv

head is a Unix/Linux command, so participants running Windows locally cannot run this line.
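One possible OS-independent replacement is to read the first lines in Python itself. The sketch below is an assumption about how that could be done: the head() helper is hypothetical, and the demo writes a small temporary file rather than relying on the workshop's data directory.

```python
from itertools import islice
from pathlib import Path
import tempfile


def head(path, n=5):
    """Return the first n lines of a text file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a temporary stand-in file; in the notebook the path would be the
# exported CSV instead.
demo_path = Path(tempfile.mkdtemp()) / "demo.csv"
rows = [f"at,1993.{m:02d},4.{m}" for m in range(1, 9)]
demo_path.write_text(
    "country,year_month,unemployment_rate\n" + "\n".join(rows) + "\n",
    encoding="utf-8",
)

for line in head(demo_path, 5):
    print(line)
```

This behaves the same on Windows, macOS, and Linux because it does not shell out.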

Curriculum Review: Include join and merges in pandas

One thing that seems to be missing is how to join two (or more) tables using pandas:
https://youtu.be/lXPogGKR-AU
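A minimal sketch of what a join section could cover, using DataFrame.merge with different values of how. The two small tables and their column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tables sharing a "country" key; values are made up.
rates = pd.DataFrame({
    "country": ["at", "be", "bg"],
    "unemployment_rate": [4.8, 8.3, 10.2],
})
names = pd.DataFrame({
    "country": ["at", "be", "de"],
    "name_en": ["Austria", "Belgium", "Germany"],
})

# Inner join: only keys present in both tables.
inner = rates.merge(names, on="country", how="inner")

# Left join: every row of `rates`; missing matches become NaN.
# indicator=True adds a "_merge" column showing where each row came from.
left = rates.merge(names, on="country", how="left", indicator=True)

# Outer join: the union of keys from both tables.
outer = pd.merge(rates, names, on="country", how="outer")

print(left)
```

Joining more than two tables can be done by chaining merge calls.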

Explain `.round()`

The use of .round() in Manipulating Columns needs to be explained. Without it, floating-point error can leave the result just short of a whole number, and the integer conversion then truncates it. Compare ((unemployment['year_month'] - unemployment['year']) * 100).astype(int) with ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int): the first starts at 0, the second starts at 1.
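The comparison can be shown side by side on a small stand-in DataFrame. This is a sketch under assumptions: the values below are made up, and whether a given truncated value actually comes out one too low depends on how that particular float is represented.

```python
import pandas as pd

# Hypothetical stand-in for the workshop's unemployment DataFrame.
unemployment = pd.DataFrame({"year_month": [1993.01, 1993.02, 1993.12]})
unemployment["year"] = unemployment["year_month"].astype(int)

# The fractional part scaled to a month number. Because of floating-point
# representation this can land slightly below or above the whole number.
fraction = (unemployment["year_month"] - unemployment["year"]) * 100

# astype(int) truncates toward zero, so a value just under 1 becomes 0.
truncated = fraction.astype(int)

# Rounding first snaps the value to the nearest whole number.
rounded = fraction.round(0).astype(int)

print(pd.DataFrame({
    "fraction": fraction,
    "truncated": truncated,
    "rounded": rounded,
}))
```

Printing the raw fraction next to both results lets participants see why the rounding step is there.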

Part 7 Let's Look at Our File Depends on OS

Part 7, "Exporting A DataFrame to csv", looks at the file using a "!" shell command, but that only works on macOS, not Windows:
!head -5 data/unemployment_missing.csv
To do: add a Windows version of the "!" terminal command.

Minor typo in Readme

Click the "Launch" button under "Jupyter Notebooks" and navigate through your file system to the Python-Data-Visualization folder you downloaded above.

This should read Python-Data-Wrangling instead of Python-Data-Visualization.

pandas plotting

Add examples and descriptions for how to plot with pandas (e.g., pandas.DataFrame.plot).

Curriculum Review: Practice interpreting data

Understanding what we're looking at in the data is another useful skill. I think most social science and STEM grad students already have it, but for people who are new to data analysis we could include more practice of that kind.

Missing values

Remove some values from the main CSV file, then:

- Replace missing values with .fillna()
  - Show the method parameter
- Keep rows with no missing values with .dropna()
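A minimal sketch of what this section might demonstrate, on made-up data. One caveat for the method parameter: recent pandas releases deprecate fillna(method=...) in favor of the dedicated .ffill() and .bfill() methods, so the sketch uses .ffill() for forward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some values removed; names and values are assumptions.
df = pd.DataFrame({
    "country": ["at", "at", "at", "be", "be"],
    "unemployment_rate": [4.8, np.nan, 4.6, np.nan, 7.1],
})

# Replace missing values with a constant.
filled_const = df["unemployment_rate"].fillna(0)

# Replace missing values with the column mean (NaN is skipped by mean()).
filled_mean = df["unemployment_rate"].fillna(df["unemployment_rate"].mean())

# Forward fill: carry the last observed value forward.
filled_ffill = df["unemployment_rate"].ffill()

# Keep only rows with no missing values in any column.
complete = df.dropna()

# Or consider only specific columns when deciding which rows to drop.
complete_subset = df.dropna(subset=["unemployment_rate"])

print(complete)
```

Note that a plain forward fill ignores group boundaries, so here the first "be" row receives a value from "at"; filling within groups would need groupby().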

Feedback requested

@pattyf, @deniederhut, @davclark, @jackspaceBerkeley, @jonstiles, @sbenthall

The notebook for the intro to pandas workshop is almost ready to go, I think. I'd appreciate some feedback when you have time.

One specific thing I need help with is determining when we're dealing with methods, attributes, or properties.

Thanks!
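On the methods-versus-attributes question, one quick check that may help is whether the thing is callable. The sketch below uses a throwaway DataFrame; the names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A method is a function bound to the object, so it is called with parentheses.
first_rows = df.head(2)

# shape is accessed without parentheses and returns a plain tuple.
dims = df.shape

# callable() distinguishes the two cases at runtime.
print(callable(df.head))   # methods are callable
print(callable(df.shape))  # the tuple returned by shape is not

# A property is an attribute whose value is computed by a function defined
# on the class; looking it up on the class rather than the instance shows this.
print(isinstance(type(df).shape, property))
```

From the user's side, properties and plain attributes look the same, so the distinction that matters most in the notebook is "needs parentheses" versus "does not".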

Suggestions for improvements

unemployment_rate is a somewhat odd name, as it holds not actual unemployment rates but a DataFrame of null proportions for unemployment rates.
unemployment_rate_missing = unemployment[unemployment['unemployment_rate'].isnull()] -> no need for double subset
The use of .round() in Manipulating Columns needs to be explained. Without it, floating-point error can leave the result just short of a whole number, and the integer conversion then truncates it. Compare ((unemployment['year_month'] - unemployment['year']) * 100).astype(int) with ((unemployment['year_month'] - unemployment['year']) * 100).round(0).astype(int): the first starts at 0, the second starts at 1.
The Challenge 12 solution seems too complex. Can't we just call .dropna() on our newly created ps column?

reading/writing to formats other than csv
reading data in by chunks
multi-indexing and its methods
sparse matrices and their methods
ways to speed up mathy things (lots of %timeit)
parallelized methods for pandas

We can both also look up other pandas materials and think of other things we might like to add.
