aeturrell / coding-for-economists

This repository hosts the code behind the online book, Coding for Economists.

Home Page: https://aeturrell.github.io/coding-for-economists

License: MIT License

Python 0.61% Jupyter Notebook 92.77% TeX 0.79% HTML 5.71% Dockerfile 0.11%

learning economics econometrics research economics-models jupyter-notebook python vscode book data-science

coding-for-economists's Issues

Fix Binder functionality

An error message appears when using the ' ::rocket:: -> Binder' option on pages with code. This does not appear to be a main/master branch issue; rather, it seems to stem from the URL that JupyterBook uses to load a given Binder page. Rather than (for example)

https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/main?urlpath=tree/code-advanced.ipynb

being loaded, instead

https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/e27d7c0ba0345eeebdeec37909baf54a744e8b76/v2/gh/aeturrell/coding-for-economists/main?urlpath=tree/code-advanced.ipynb

gets loaded (with some parts of the path repeated).
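
One way to spot the malformed links is to count how often the `/v2/gh/` prefix appears in the URL path. The following is a minimal diagnostic sketch that uses only the two URLs quoted above; the helper name `has_repeated_binder_prefix` is hypothetical, and this does not fix how Jupyter Book builds the link.

```python
from urllib.parse import urlparse

# The two URLs quoted in this issue.
GOOD_URL = (
    "https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/"
    "main?urlpath=tree/code-advanced.ipynb"
)
BAD_URL = (
    "https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/"
    "e27d7c0ba0345eeebdeec37909baf54a744e8b76/v2/gh/aeturrell/coding-for-economists/"
    "main?urlpath=tree/code-advanced.ipynb"
)


def has_repeated_binder_prefix(url: str, prefix: str = "/v2/gh/") -> bool:
    """Return True if the Binder prefix occurs more than once in the URL path."""
    return urlparse(url).path.count(prefix) > 1


for url in (GOOD_URL, BAD_URL):
    status = "repeated prefix" if has_repeated_binder_prefix(url) else "looks fine"
    print(f"{status}: {url}")
```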

Set up default google colab env for book

There may be good examples on the Jupyter Book website.

Lets-Plot: geom_contourf() instead of geom_contour()

In section Common Plots / Contour Plot you could replace the code for Lets-Plot with

contour_data = {'x': X.flatten(), 'y': Y.flatten(), 'z': Z.flatten()}
(
    ggplot(contour_data)
    + geom_contourf(aes(x='x', y='y', z='z', fill='..level..')) 
    + scale_fill_viridis(option="plasma")
    + ggtitle("Maths equations don't currently work")
)

This produces a nicer-looking plot.

It's already been replaced in PR #43, if you prefer that way of updating code.

Create setup script for dev

Although the instructions for installing the environment and packages are fairly straightforward, it would be good to have a start-up script that also handled extras such as the installation of nltk and spacy models.
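
A start-up script along these lines might look like the sketch below. The model names (`punkt`, `stopwords`, `en_core_web_sm`) are assumptions, since the issue does not say which nltk or spaCy models the book needs, and the function names are hypothetical. It relies on the command-line entry points `python -m nltk.downloader` and `python -m spacy download`, and defaults to a dry run so nothing is downloaded by accident.

```python
import subprocess
import sys

# Assumed model names: the issue does not list the models the book requires.
NLTK_RESOURCES = ["punkt", "stopwords"]
SPACY_MODELS = ["en_core_web_sm"]


def build_commands() -> list[list[str]]:
    """Build download commands that target the currently active interpreter."""
    commands = [
        [sys.executable, "-m", "nltk.downloader", name] for name in NLTK_RESOURCES
    ]
    commands += [
        [sys.executable, "-m", "spacy", "download", name] for name in SPACY_MODELS
    ]
    return commands


def run_setup(dry_run: bool = True) -> list[list[str]]:
    """Print each command, or execute it when dry_run is False."""
    commands = build_commands()
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the script at the first failed download.
            subprocess.run(command, check=True)
    return commands


planned = run_setup(dry_run=True)
```

Calling `run_setup(dry_run=False)` inside the activated environment would then perform the downloads.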

Review of all chapters with typo and text fixes

Issue on page /vis-common-plots.html

You may be wondering why Lets-Plot isn’t featured here: its functions have almost exactly the same names as those in lets-plot, and we have opted to include the latter as it is currently the more mature plotting package.

Did you mean

You may be wondering why plotnine isn’t featured here..

Lets-Plot: geom_segment() instead of geom_path()

In section Common Plots / Connected scatter plot you could replace the code for Lets-Plot with

path_df = df.iloc[:-1].reset_index(drop=True).join(
    df.iloc[1:].reset_index(drop=True), lsuffix='_from', rsuffix='_to'
)

(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(x="Unemployment_from", y="Vacancies_from", xend="Unemployment_to", yend="Vacancies_to"),
        data=path_df, size=1, color="gray", arrow=arrow(type='closed', length=20, angle=15),
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(aes(label='Year'), data=df[df['Year'].isin([2001, 2021])], position=position_nudge(y=0.3))
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)

This produces a nicer-looking plot.

It's already been replaced in PR #43, if you prefer that way of updating code.

Regression page clean-up

read in "Unnamed: 0" as index.
axes limits adjust
other tidying eg Jabba

Lets-Plot: notes for pyramid

In section Common Plots / Pyramid there are a few issues with the plot:

Clipped labels: unfortunately, the 20-character limit is hardcoded, so y labels are cut off. The full text can still be seen in the axis tooltip.
Weird-looking tooltips on top of the pyramid: to improve how the tooltips are displayed, I suggest not using the identity statistic; instead, you can pass the user counts as a weight, as shown below:
```
g = (
    ggplot(df, aes(x="Stage", y="Users", fill="Gender", weight='Users'))
    + geom_bar(width=0.8)  # baseplot
    + coord_flip()  # flip coordinates
    + theme_minimal()
    + ylab("Users (millions)")
)
g
```
It's already been replaced in PR #43, if you prefer that way of updating code.

Add a diverging stacked bar chart to Common Plots

See this link for a good example in matplotlib.

No R2 values reported with `pyfixest` in chapter 6

Hi @aeturrell, the pyfixest version with which you ran the Coding for Economists regression chapter reported only the RMSE, not R2 values. If you upgrade to pyfixest 0.14.0, this should be fixed =)
Best, Alex

Update regression page to reflect new model comparison from pyfixest

See this issue.

Switch to file loading that works out of the box

Sharper arrowheads in the "Connected scatter plot" section.

In this section: Connected scatter plot

Req: Lets-Plot v4.3.0

Problem: arrowheads are sunk into circles.

Solution: use the "spacer" option with the value 5 (i.e. the point size in this chart) + 1 (to account for the circle stroke):

(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),  # <-- Slightly smaller arrow (was 20)
        spacer=5 + 1,  # <-- The spacer!
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)

Just as an option: geom_curve() often looks nicer :)

(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_curve(  # <-- New!
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),
        spacer=5 + 1,  # <-- The spacer!
        curvature=-0.1,  # <-- Not too curved.
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)

Lets-Plot: add geom_area_ridges()

There are no Lets-Plot examples in section Common Plots / Ridge, or 'joy', plots, but the library does have a suitable function for it: geom_area_ridges(). You can add the following code:

final_year = df["Year"].max()
first_year = df["Year"].min()

breaks = [y for y in list(df.Year.unique()) if y % 10 == 0]
(
    ggplot(df, aes("Anomaly", "Year", fill="Year")) 
    + geom_area_ridges(scale=20, alpha=1, size=.2, trim=True, show_legend=False)
    + scale_y_continuous(breaks=breaks, trans='reverse')
    + scale_fill_viridis(option='inferno')
    + ggtitle("Global daily temperature anomaly {0}-{1} \n(°C above 1951-80 average)".format(first_year, final_year))
)

It's already been replaced in PR #43, if you prefer that way of updating code.

Adding watermarks to jupyter notebooks?

Look into pros and cons of adding watermarks to scripts using watermark, eg as PyMC3 do for their examples.

%load_ext watermark
%watermark -n -u -v -iv -w
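
To weigh up what such a stamp records, here is a rough standard-library stand-in that prints a date, the Python version, and selected package versions. It is only a sketch for comparison, not how the watermark extension works internally; the function name `environment_stamp` and the package list are hypothetical.

```python
import platform
from datetime import date
from importlib import metadata


def environment_stamp(packages: list[str]) -> dict[str, str]:
    """Collect today's date, the Python version, and the given package versions."""
    stamp = {
        "last_updated": date.today().isoformat(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            stamp[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap rather than failing, so the stamp is always printable.
            stamp[name] = "not installed"
    return stamp


# Hypothetical package list; substitute whatever a given notebook imports.
for key, value in environment_stamp(["pandas", "numpy"]).items():
    print(f"{key}: {value}")
```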

Error in Exercise Reqs

In the Working With Data chapter there is an exercise:

Create a pandas dataframe using the data=, index=, and columns= keyword arguments. The data should consist of one column with ascending integers from 0 to 5, the column name should be “series”, and the index should be the first 5 letters of the alphabet. Remember that the index and columns keyword arguments expect an iterable of some kind (not just a string).

I believe this is impossible due to there being 6 integers 0 to 5 but only 5 letters of the alphabet.

import pandas as pd

data = {"series": list(range(6))}  # the exercise asks for the column name "series"
index = list("abcdef")
df = pd.DataFrame(data=data, index=index)

print(df)

I believe this satisfies the exercise requirements.

Special fonts are not included in the dockerfile

Some pages, eg on narrative data visualisation (which uses 'varta'), need special fonts. These are not currently available in the Dockerfile.

In principle, this is possible and some example code to achieve it would be:

FROM continuumio/miniconda3:4.10.3-alpine
WORKDIR /app
COPY ./my-custom-font.ttf ./
RUN mkdir -p /usr/share/fonts/truetype/
RUN install -m644 my-custom-font.ttf /usr/share/fonts/truetype/
RUN rm ./my-custom-font.ttf

But it would be good to pull the font directly from a website, eg using Google fonts.

This article goes into detail on how to install fonts in Docker containers:
https://axellarsson.com/blog/install-fonts-in-docker-containers/

PackagesNotFoundError when setting up virtual environment

When running environment.yml from the Anaconda Prompt I get the following:

PackagesNotFoundError: The following packages are not available from current channels:

  - datatable

Current channels:

  - https://conda.anaconda.org/oxfordcontrol/win-64
  - https://conda.anaconda.org/oxfordcontrol/noarch
  - http://conda.anaconda.org/gurobi/win-64
  - http://conda.anaconda.org/gurobi/noarch
  - https://repo.anaconda.com/pkgs/main/win-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/win-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/msys2/win-64
  - https://repo.anaconda.com/pkgs/msys2/noarch

I'm pretty sure this is because datatable needs to be installed with pip rather than conda.

A less likely explanation could be operating system dependency (I'm on Windows 10), in which case appending --no-builds may be a solution.

I have moved datatable to the end of the pip section, as below:

  - pip:
    - specification_curve
    - twopiece
    - stargazer
    - matplotlib-scalebar
    - black-nb
    - pyhdfe
    - skimpy
    - dataprep
    - graphviz
    - pygraphviz
    - ruptures
    - deadlinks
    - datatable

This works, but the current set of dependencies seems to have conflicts, as I get:

Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.

I will investigate which packages have conflicts.

Non-breaking API changes for `pyfixest` version `0.17.0`

Hi @aeturrell , please see my comment in the associated PR: #69.
Best, Alex

Add better hot key symbols for, eg, opening the terminal

See the Hotkey list from https://github.com/tchapi/markdown-cheatsheet for examples.

Website is down

@aeturrell Thanks for the fantastic work!
Just to let you know that I tried to access the link in the description, but it's not working.
Cheers!

Page on sharing data: particulate matter sharer has now been disabled (change page to reflect this)

Lets-Plot: add Lets-Plot to geo-spatial visualisation

The Lets-Plot library is not mentioned in the Geo-Spatial Visualization section. However, it can work with cartographic data. Detailed information about geocoding can be found here.

Basic example with UK districts:

from lets_plot import *  # provides ggplot, geom_map, geom_livemap and aes
from lets_plot.geo_data import *

country = geocode_counties().scope('UK').inc_res().get_boundaries()

ggplot() + geom_map(data=country, show_legend=False, size=0.2)

Also, you can add an interactive basemap layer to create a beautiful map:

(
    ggplot() 
    + geom_livemap() 
    + geom_map(aes(fill='found name'), data=country, show_legend=False, size=0.2) 
)

Add challenges to the basic coding section

Also, make clear that pass is a reserved keyword in Python.

"Half baths" on /data-exploratory-analysis.html

Thanks for this amazing resource!

By the way, re:

…the number of Baths is a floating point number rather than an integer (is it possible to have half a bathroom? Maybe, but it doesn't sound very private), and there are some NaNs in there too. It's not clear what the fractional values of bathrooms mean (including from the documentation) so we'll just have to take care with that variable.

It's very much the norm in American real estate to count a bathroom with only a toilet and sink as a "half bath" (and sometimes one with a shower but no bathtub as a "three-quarter bath," which also shows up in the data). Nothing surprising in that data ;)

Alternate way of dealing with dates in /data-exploratory-analysis.html section 3.1

As someone somewhat experienced with EDA and data cleaning but new-ish to this work in Python, I was interested in learning your (perhaps Pythonic) solution:

start_code = 16436
end_code = df['Date'].max() + 1 # +1 because of how ranges are computed; we want to *include* the last date

datetime_dict = dict(zip(range(start_code, end_code),
                               pd.date_range(start='2005/01/01', periods=end_code-start_code)))

df['datetime'] = df['Date'].apply(lambda x: datetime_dict[x])

but thought I'd mention that the solution that occurred to me first and seems perhaps easier both to develop and explain was

def convert_date(d):
    return pd.to_datetime("01-01-2005") + pd.DateOffset(d-16436)

df['datetime'] = df['Date'].apply(convert_date)
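
To check that the two approaches agree, the sketch below reproduces both with the standard library only, using the mapping stated above (code 16436 corresponds to 2005-01-01). The sample codes and the names `START_CODE`, `START_DATE`, and `lookup` are illustrative assumptions rather than values from the dataset. In pandas, the same offset arithmetic could probably be applied to the whole column at once with `pd.to_timedelta` instead of `apply`.

```python
from datetime import date, timedelta

START_CODE = 16436             # integer code that the text maps to 2005-01-01
START_DATE = date(2005, 1, 1)


def convert_date(code: int) -> date:
    """Offset approach: add the number of days elapsed since the start code."""
    return START_DATE + timedelta(days=code - START_CODE)


# Illustrative codes only; these are not taken from the dataset.
codes = [16436, 16437, 16801]

# Lookup-table approach: one dictionary entry per code from the start code
# up to and including the largest code observed.
end_code = max(codes) + 1
lookup = {
    code: START_DATE + timedelta(days=offset)
    for offset, code in enumerate(range(START_CODE, end_code))
}

from_offset = [convert_date(code) for code in codes]
from_lookup = [lookup[code] for code in codes]
print(from_offset == from_lookup)
```

The offset version needs no intermediate dictionary, which is part of why it may be easier to explain.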

Set up continuous integration for coding for economists

Update Pyfixest and add IV table back in

Reorder /vis-common-plots.html ?

It may make more sense to start with scatters before introducing faceted scatters

Lets-Plot: geom_area() instead of geom_freqpoly()

In section Common Plots / Overlapping Area plot you could replace the code for Lets-Plot with

(
    ggplot(
        planets.groupby(["year", "method"])["number"].sum().reset_index(),
        aes(x="year", y="number", fill="method", group="method", color="method"),
    )
    + geom_area(alpha=.5)
    + scale_x_continuous(format="d")
)

This produces a nicer-looking plot.

It's already been replaced in PR #43, if you prefer that way of updating code.

Broken matplotlib demos

Errors can be seen here:
https://aeturrell.github.io/coding-for-economists/vis-intro.html#categorical-data

IndexError: index 0 is out of bounds for axis 0 with size 0

Zenodo store of 1st edition

plotly no longer produces output in built book

Various issues:

the module is not recognised when book is built (plotly 5.3.1)
pre-execution of the relevant notebooks does not lead to the inclusion of the interactive charts

Lets-Plot: joint_plot() instead of ggmarginal()

In section Common Plots / Marginal histograms you could replace the code for Lets-Plot with

from lets_plot.bistro.joint import *

(
    joint_plot(penguins, x="bill_length_mm", y="bill_depth_mm", reg_line=False)
    + labs(
        x="Bill length (mm)",
        y="Bill depth (mm)"
    )
)

This simplifies the code a bit and uses the function that is designed for the task at hand.

It's already been replaced in PR #43, if you prefer that way of updating code.

Explanations of the merits of list comprehensions are wrong

One section says:

High-level languages like Python and R do not get compiled into highly performant machine code ahead of being run, unlike C++ and FORTRAN. What this means is that although they are much less unwieldy to use, some types of operation can be very slow–and for loops are particularly cumbersome. (Although you may not notice this unless you’re working on a bigger computation.)
But there is a way around this, and it’s with something called a list comprehension. These can combine what a for loop and a condition do in a single line of efficiently executable code. Say we had a list of numbers and wanted to filter it according to whether the numbers divided by 3 or not:

Public sector data science colleagues have pointed out that this isn't right. List comprehensions can actually be slower, and the difference is not about compilation of code. See for example this article, this SO post, and this video (tl;dr: the language spec does not constrain which one is faster, so their relative performance can, and in fact does, change between versions).

It was also noted that most time is spent reading rather than optimising code, and list comprehensions are arguably a clearer pattern.
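
Since the relative speed depends on the Python version, it may be more useful to show readers how to measure it than to assert a winner. A minimal sketch using the standard-library `timeit` module follows; the function names, the list size, and the repeat count are arbitrary choices, and the timings will differ between machines and interpreter versions.

```python
import timeit

numbers = list(range(1_000))


def with_loop(values):
    """Keep the multiples of 3 using an explicit for loop."""
    kept = []
    for value in values:
        if value % 3 == 0:
            kept.append(value)
    return kept


def with_comprehension(values):
    """Keep the multiples of 3 using a list comprehension."""
    return [value for value in values if value % 3 == 0]


# Both versions must give the same answer before their speed is compared.
assert with_loop(numbers) == with_comprehension(numbers)

loop_seconds = timeit.timeit(lambda: with_loop(numbers), number=200)
comprehension_seconds = timeit.timeit(lambda: with_comprehension(numbers), number=200)

print(f"for loop:           {loop_seconds:.4f} s")
print(f"list comprehension: {comprehension_seconds:.4f} s")
```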
