aeturrell / coding-for-economists

This repository hosts the code behind the online book, Coding for Economists.

Home Page: https://aeturrell.github.io/coding-for-economists

License: MIT License

Languages: Python 0.61%, Jupyter Notebook 92.77%, TeX 0.79%, HTML 5.71%, Dockerfile 0.11%
Topics: learning, economics, econometrics, research, economics-models, jupyter-notebook, python, vscode, book, data-science

coding-for-economists's People

Contributors

aeturrell, asmirnov-horis, lukestein, pitmonticone, s3alfisc, zekiakyol

coding-for-economists's Issues

Fix Binder functionality

There is an error message when using the ':rocket: -> Binder' option on pages with code. This does not appear to be a main/master branch issue; rather, it has to do with the URL that JupyterBook uses to load a given Binder page. Rather than (for example)

https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/main?urlpath=tree/code-advanced.ipynb

being loaded, instead

https://notebooks.gesis.org/binder/v2/gh/aeturrell/coding-for-economists/e27d7c0ba0345eeebdeec37909baf54a744e8b76/v2/gh/aeturrell/coding-for-economists/main?urlpath=tree/code-advanced.ipynb

gets loaded (with some parts of the path repeated).
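
To make the duplication concrete, here is a minimal sketch that rebuilds the expected URL and checks a URL for the repeated path segment. The helper names binder_url and has_repeated_spec are hypothetical and are not part of JupyterBook or BinderHub; only the two URLs come from the report above.

```python
from urllib.parse import urlsplit

BINDER_BASE = "https://notebooks.gesis.org/binder"
REPO_SPEC = "v2/gh/aeturrell/coding-for-economists"


def binder_url(ref: str, notebook: str) -> str:
    """Build the URL form that the report says should be loaded (hypothetical helper)."""
    return f"{BINDER_BASE}/{REPO_SPEC}/{ref}?urlpath=tree/{notebook}"


def has_repeated_spec(url: str) -> bool:
    """Return True if the repository spec occurs more than once in the URL path."""
    return urlsplit(url).path.count(REPO_SPEC) > 1


expected = binder_url("main", "code-advanced.ipynb")
# The malformed URL from the report: a commit hash followed by a second copy of the spec.
broken = (
    f"{BINDER_BASE}/{REPO_SPEC}/e27d7c0ba0345eeebdeec37909baf54a744e8b76/"
    f"{REPO_SPEC}/main?urlpath=tree/code-advanced.ipynb"
)

print(has_repeated_spec(expected))  # False
print(has_repeated_spec(broken))    # True
```

A check like this could run against the generated HTML to catch the regression, although where the extra segment is introduced would still need to be traced in the JupyterBook configuration.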

Lets-Plot: geom_contourf() instead of geom_contour()

In section Common Plots / Contour Plot you could replace the code for Lets-Plot with

contour_data = {'x': X.flatten(), 'y': Y.flatten(), 'z': Z.flatten()}
(
    ggplot(contour_data)
    + geom_contourf(aes(x='x', y='y', z='z', fill='..level..')) 
    + scale_fill_viridis(option="plasma")
    + ggtitle("Maths equations don't currently work")
)

This produces a nicer-looking plot:

image

It's already been replaced in PR #43, if you prefer that way of updating code.

Create setup script for dev

Although the instructions for installing the environment and packages are fairly straightforward, it would be good to have a start-up script that also handled extras such as the installation of nltk and spacy models.
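
One possible shape for such a script is sketched below. The model and corpus names (en_core_web_sm, punkt, stopwords) are assumptions for illustration and may not be the ones the book needs; the function names are hypothetical. It only prints the download commands unless dry_run is set to False.

```python
import subprocess
import sys

# Assumed names for illustration; check the book's chapters for the
# models and corpora that are actually required.
SPACY_MODELS = ["en_core_web_sm"]
NLTK_PACKAGES = ["punkt", "stopwords"]


def build_setup_commands(spacy_models, nltk_packages):
    """Return one command (as an argument list) per model or corpus download."""
    commands = [[sys.executable, "-m", "spacy", "download", m] for m in spacy_models]
    commands += [[sys.executable, "-m", "nltk.downloader", p] for p in nltk_packages]
    return commands


def run_setup(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_setup_commands(SPACY_MODELS, NLTK_PACKAGES):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


run_setup(dry_run=True)  # prints the commands without downloading anything
```

Using sys.executable keeps the downloads inside whichever environment runs the script, which matters when the conda environment is not the system default.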

Issue on page /vis-common-plots.html

You may be wondering why Lets-Plot isn’t featured here: its functions have almost exactly the same names as those in lets-plot, and we have opted to include the latter as it is currently the more mature plotting package.

Did you mean

You may be wondering why plotnine isn’t featured here..

?

Lets-Plot: geom_segment() instead of geom_path()

In section Common Plots / Connected scatter plot you could replace the code for Lets-Plot with

path_df = df.iloc[:-1].reset_index(drop=True).join(
    df.iloc[1:].reset_index(drop=True), lsuffix='_from', rsuffix='_to'
)

(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(x="Unemployment_from", y="Vacancies_from", xend="Unemployment_to", yend="Vacancies_to"),
        data=path_df, size=1, color="gray", arrow=arrow(type='closed', length=20, angle=15),
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(aes(label='Year'), data=df[df['Year'].isin([2001, 2021])], position=position_nudge(y=0.3))
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)

This produces a nicer-looking plot:

image

It's already been replaced in PR #43, if you prefer that way of updating code.

Lets-Plot: notes for pyramid

In section Common Plots / Pyramid there are a few issues with the plot:

  • Clipped labels: unfortunately, the 20 character limit is hardcoded, so y labels are cut off. The full text can still be seen in the axis tooltip.

  • Weird-looking tooltips at the top of the pyramid: to improve how the tooltips are displayed, I suggest not using the identity statistic; instead, you can pass Users as the weight, as shown below:

    g = (
        ggplot(df, aes(x="Stage", y="Users", fill="Gender", weight='Users'))
        + geom_bar(width=0.8)  # baseplot
        + coord_flip()  # flip coordinates
        + theme_minimal()
        + ylab("Users (millions)")
    )
    g

    It's already been replaced in PR #43, if you prefer that way of updating code.

Sharper arrowheads in the "Connected scatter plot" section.

In this section: Connected scatter plot

Requires: Lets-Plot v4.3.0


Problem: arrowheads are sunk into circles.

image

Solution: use the "spacer" option with the value 5 (i.e. the point size in this chart) + 1 (to account for the circle stroke):

image
(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_segment(
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),  # <-- Slightly smaller arrow (was 20)
        spacer=5 + 1,  # <-- The spacer!
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)


Just as an option: geom_curve() often looks nicer :)

(
    ggplot(df, aes("Unemployment", "Vacancies"))
    + geom_curve(                                                         # <-- New !
        aes(
            x="Unemployment_from",
            y="Vacancies_from",
            xend="Unemployment_to",
            yend="Vacancies_to",
        ),
        data=path_df,
        size=1,
        color="gray",
        arrow=arrow(type="closed", length=15, angle=15),
        spacer=5 + 1,  # <-- The spacer!
        curvature=-0.1,  # <-- Not too curved.
    )
    + geom_point(shape=21, color="gray", fill="#c28dc3", size=5)
    + geom_text(
        aes(label="Year"),
        data=df[df["Year"].isin([2001, 2021])],
        position=position_nudge(y=0.3),
    )
    + labs(x="Unemployment rate, %", y="Vacancy rate, %")
)

image

Lets-Plot: add geom_area_ridges()

There are no Lets-Plot examples in section Common Plots / Ridge, or 'joy', plots, but the library does have a suitable function for it: geom_area_ridges(). You can add the following code:

final_year = df["Year"].max()
first_year = df["Year"].min()

breaks = [y for y in list(df.Year.unique()) if y % 10 == 0]
(
    ggplot(df, aes("Anomaly", "Year", fill="Year")) 
    + geom_area_ridges(scale=20, alpha=1, size=.2, trim=True, show_legend=False)
    + scale_y_continuous(breaks=breaks, trans='reverse')
    + scale_fill_viridis(option='inferno')
    + ggtitle("Global daily temperature anomaly {0}-{1} \n(°C above 1951-80 average)".format(first_year, final_year))
)

It's already been replaced in PR #43, if you prefer that way of updating code.

Error in Exercise Reqs

In Working With Data there is an exercise:

Create a pandas dataframe using the data=, index=, and columns= keyword arguments. The data should consist of one column with ascending integers from 0 to 5, the column name should be “series”, and the index should be the first 5 letters of the alphabet. Remember that the index and columns keyword arguments expect an iterable of some kind (not just a string).

I believe this is impossible as written: there are six integers from 0 to 5 but only five letters in the index.

import pandas as pd

# Six values (0 to 5) need six index labels, so use the first six letters.
data = list(range(6))
index = list("abcdef")
df = pd.DataFrame(data=data, index=index, columns=["series"])

print(df)

I believe this version, with six index letters, satisfies the exercise requirements.

Special fonts are not included in the dockerfile

Some pages, e.g. the one on narrative data visualisation (which uses 'varta'), need special fonts. These are not currently available in the Dockerfile.

In principle, this is possible and some example code to achieve it would be:

FROM continuumio/miniconda3:4.10.3-alpine
WORKDIR /app
COPY ./my-custom-font.ttf ./
RUN mkdir -p /usr/share/fonts/truetype/
RUN install -m644 my-custom-font.ttf /usr/share/fonts/truetype/
RUN rm ./my-custom-font.ttf

But it would be good to pull the font directly from a website, e.g. Google Fonts.

This article goes into detail on how to install fonts in Docker containers:
https://axellarsson.com/blog/install-fonts-in-docker-containers/

PackagesNotFoundError when setting up virtual environment

When creating the environment from environment.yml in the Anaconda Prompt I get the following:

PackagesNotFoundError: The following packages are not available from current channels:

  - datatable

Current channels:

  - https://conda.anaconda.org/oxfordcontrol/win-64
  - https://conda.anaconda.org/oxfordcontrol/noarch
  - http://conda.anaconda.org/gurobi/win-64
  - http://conda.anaconda.org/gurobi/noarch
  - https://repo.anaconda.com/pkgs/main/win-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/win-64
  - https://repo.anaconda.com/pkgs/r/noarch
  - https://repo.anaconda.com/pkgs/msys2/win-64
  - https://repo.anaconda.com/pkgs/msys2/noarch

I'm pretty sure this is because datatable needs to be installed with pip rather than conda.

A less likely explanation could be operating system dependency (I'm on Windows 10), in which case appending --no-builds may be a solution.

I have moved datatable to the end of the pip section, as below:

  - pip:
    - specification_curve
    - twopiece
    - stargazer
    - matplotlib-scalebar
    - black-nb
    - pyhdfe
    - skimpy
    - dataprep
    - graphviz
    - pygraphviz
    - ruptures
    - deadlinks
    - datatable

This works, but the current set of dependencies seems to have conflicts, as I get:

Collecting package metadata (repodata.json): done
Solving environment: -
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
  • I will investigate which packages have conflicts

Website is down

@aeturrell Thanks for the fantastic work!
Just to let you know that I tried to access the link in the description, but it's not working.
Cheers!

Lets-Plot: add Lets-Plot to geo-spatial visualisation

The Lets-Plot library is not mentioned in the Geo-Spatial Visualization section. However, it can work with cartographic data. Detailed information about geocoding can be found here.

Basic example with UK districts:

from lets_plot.geo_data import *

country = geocode_counties().scope('UK').inc_res().get_boundaries()

ggplot() + geom_map(data=country, show_legend=False, size=0.2)

image

Also, you can add an interactive basemap layer to create a beautiful map:

(
    ggplot() 
    + geom_livemap() 
    + geom_map(aes(fill='found name'), data=country, show_legend=False, size=0.2) 
)

image

"Half baths" on /data-exploratory-analysis.html

Thanks for this amazing resource!

By the way, re:

…the number of Baths is a floating point number rather than an integer (is it possible to have half a bathroom? Maybe, but it doesn't sound very private), and there are some NaNs in there too. It's not clear what the fractional values of bathrooms mean (including from the documentation) so we'll just have to take care with that variable.

It's very much the norm in American real estate to count a bathroom with only a toilet and sink as a "half bath" (and sometimes one with a shower but no bathtub as a "three-quarter bath," which also shows up in the data). Nothing surprising in that data ;)

Alternate way of dealing with dates in /data-exploratory-analysis.html section 3.1

As someone somewhat experienced with EDA and data cleaning but new-ish to doing this work in Python, I was interested to learn your (perhaps Pythonic) solution:

start_code = 16436
end_code = df['Date'].max() + 1 # +1 because of how ranges are computed; we want to *include* the last date

datetime_dict = dict(zip(range(start_code, end_code),
                               pd.date_range(start='2005/01/01', periods=end_code-start_code)))

df['datetime'] = df['Date'].apply(lambda x: datetime_dict[x])

but thought I'd mention that the solution that occurred to me first and seems perhaps easier both to develop and explain was

def convert_date(d):
    return pd.to_datetime("01-01-2005") + pd.DateOffset(d-16436)

df['datetime'] = df['Date'].apply(convert_date)
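
Either approach can also be written without apply. A minimal sketch, assuming (as in both snippets above) that code 16436 corresponds to 2005-01-01 and that the Date column holds integer day codes; the small DataFrame here is made up for illustration:

```python
import pandas as pd

START_CODE = 16436  # day code that the snippets above map to 2005-01-01
START_DATE = pd.Timestamp("2005-01-01")

# Made-up day codes standing in for the real 'Date' column.
df = pd.DataFrame({"Date": [16436, 16437, 16801]})

# Turn the offset from the start code into a timedelta for the whole column at once.
df["datetime"] = START_DATE + pd.to_timedelta(df["Date"] - START_CODE, unit="D")

print(df)
```

This avoids building a lookup dictionary and operates on the whole column in one step, which is usually faster than a row-by-row apply on large frames.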

Lets-Plot: geom_area() instead of geom_freqpoly()

In section Common Plots / Overlapping Area plot you could replace the code for Lets-Plot with

(
    ggplot(
        planets.groupby(["year", "method"])["number"].sum().reset_index(),
        aes(x="year", y="number", fill="method", group="method", color="method"),
    )
    + geom_area(alpha=.5)
    + scale_x_continuous(format="d")
)

This produces a nicer-looking plot:

image

It's already been replaced in PR #43, if you prefer that way of updating code.

Lets-Plot: joint_plot() instead of ggmarginal()

In section Common Plots / Marginal histograms you could replace the code for Lets-Plot with

from lets_plot.bistro.joint import *

(
    joint_plot(penguins, x="bill_length_mm", y="bill_depth_mm", reg_line=False)
    + labs(
        x="Bill length (mm)",
        y="Bill depth (mm)"
    )
)

This simplifies the code a bit and uses the function that is designed for the task at hand.

It's already been replaced in PR #43, if you prefer that way of updating code.

Explanations of the merits of list comprehensions are wrong

One section says:

High-level languages like Python and R do not get compiled into highly performant machine code ahead of being run, unlike C++ and FORTRAN. What this means is that although they are much less unwieldy to use, some types of operation can be very slow–and for loops are particularly cumbersome. (Although you may not notice this unless you’re working on a bigger computation.)
But there is a way around this, and it’s with something called a list comprehension. These can combine what a for loop and a condition do in a single line of efficiently executable code. Say we had a list of numbers and wanted to filter it according to whether the numbers divided by 3 or not:

Public sector data science colleagues have pointed out that this isn't right. List comprehensions can actually be slower, and it's not about compilation of code. See for example this article, this SO post, and this video (tl;dr: which one is faster isn't constrained by the language spec, so their relative performance can change between versions, and in fact it does).

It was also noted that most time is spent reading rather than optimising code, and list comprehensions are arguably a clearer pattern.
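
Anyone wanting to check the relative speed on their own setup can time both patterns directly. The sketch below uses the filtering example from the quoted section (numbers divisible by 3); the function names and the list of 1,000 integers are made up for illustration, and the timings will vary with Python version and machine, so no particular ordering should be expected.

```python
import timeit

numbers = list(range(1000))


def with_loop(values):
    """Filter with an explicit for loop and append."""
    result = []
    for value in values:
        if value % 3 == 0:
            result.append(value)
    return result


def with_comprehension(values):
    """Filter with a list comprehension."""
    return [value for value in values if value % 3 == 0]


# Both approaches must give the same answer before timing means anything.
assert with_loop(numbers) == with_comprehension(numbers)

loop_time = timeit.timeit(lambda: with_loop(numbers), number=1000)
comp_time = timeit.timeit(lambda: with_comprehension(numbers), number=1000)
print(f"for loop:           {loop_time:.4f} s")
print(f"list comprehension: {comp_time:.4f} s")
```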
