Project Overview

The U.S. in the 21st century is not particularly adept at recovering from recessions, and another one is upon us. This recession is likely to be deeper and longer than the last. Traditional measures of economic health (unemployment, GDP, the stock market) are too aggregated to be of much use to most Americans. A national rise or fall says little about the outlook for a particular town or city, nor does it speak to the resilience of particular economic sectors in the face of a recession.

This project is a continuation of my capstone work for the Galvanize Data Science Immersive. While the previous iteration was a success given its time constraints, I want to expand the project to provide a scalable framework for working with QCEW data, and use it to analyze the trends in recessions in the US.

While modeling the recovery from the 2020 COVID-19 recession was the initial goal, I have expanded the scope to include making a scalable, reusable library for working with QCEW data.

Project Goals

Develop an object-oriented module to define parameters of recessions, industries, and areas. (COMPLETE)
Expand parameters of the dataframe construction module to allow for automated construction of a timeline on any recession across any dimension (industry, area) or target variable (employment, wages, or firms). (COMPLETE)
Develop a module to automate charting of timelines. (COMPLETE)
Automate adjustment of NAICS industry classification changes into the timeline data and ensure consistency across recession timelines.
Develop a version of the "scariest chart", automated to include any dimension, variable, or recession. (COMPLETE)
Update the Flask interface to include new charts and features.
Create an AWS instance to allow others to run report cards on areas or industries.
Expand data collection to include political, fiscal, and population data for areas.
Experiment with neural networks to model economic recovery with the new dimensions.
Add previous recessions as possible parameters.
Update report card to include model projections.

Data Source

The dataset is compiled from the Bureau of Labor Statistics (BLS) Quarterly Census of Employment and Wages (QCEW). The BLS archives contain economic data stretching back decades, across geographic designations and NAICS industry classifications.

Data can be downloaded here.

While employment numbers are available on a monthly basis, wages and establishments only have quarterly data. Therefore, I will primarily be working with quarterly timelines.

Instructions for downloading data

The code makes several assumptions about the folder structure and file names. Follow these instructions to make sure the data is in a state the code can read.

Create "industry_files" and "area_files" folders in the data/ folder.
Download a file archive from the "CSVs By Area-->Quarterly" or the "CSVs By Industry-->Quarterly". The archive will contain many files, but you only need the total file included in each.
Save that file in the relevant folder. Note: Files that belong in the industry_files folder will be found in the Area column, and vice versa. This is because each Total file holds the total across one dimension (areas or industries), broken down by the other dimension.
Rename the file to only the four-digit year, retaining the .csv extension.
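The layout described above can be sketched in a few lines. This is a minimal illustration, not part of the project: the helper names prepare_data_folders and yearly_filename are hypothetical, and only the folder names and the four-digit-year naming rule come from the instructions.

```python
from pathlib import Path

# Folder names taken from the instructions above.
DATA_SUBFOLDERS = ("industry_files", "area_files")


def prepare_data_folders(root="data"):
    """Create the two expected subfolders under the data folder (hypothetical helper)."""
    paths = []
    for name in DATA_SUBFOLDERS:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths


def yearly_filename(year):
    """Return the expected file name: the four-digit year plus .csv (hypothetical helper)."""
    year = int(year)
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    return f"{year}.csv"
```

For example, the total file for 2008 from a "CSVs By Area" archive would be saved as data/industry_files/2008.csv.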

Definitions

The below chart is the inspiration for this project, and may help contextualize some of the definitions below.

Recession

Recession parameters are stored in recessions.py

Recessions are defined by economists, and while there is some debate on the qualifications of a recession, for simplicity's sake I will use the defined recessions included in the graph.

To capture the information required, each timeline starts with the full calendar year before the recession event and extends through the full calendar year prior to the next recession event. While the recession officially ends long before that, not every area/industry recovers on that timeline, and we must capture that information.

Recession Event

An event popularly considered to be a catalyst for the recession.

Included Recessions

2001- Timeline: 2000-2007. Event: Sept 11, Q3 2001.
2008- Timeline: 2007-2019. Event: Financial Crisis, Q3 2008.
Full: Timeline: 2007-2019. This designation exists only to produce timelines across all recessions included.
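As a rough sketch of how these parameters could be held in code, the class below stores a timeline and an event quarter and expands them into a list of quarters. The class name, field names, and methods are hypothetical and are not taken from recessions.py. It also assumes that both the start and end years of a timeline are inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recession:
    """Hypothetical container for one recession's parameters."""
    name: str
    start_year: int  # first calendar year of the timeline (assumed inclusive)
    end_year: int    # last calendar year of the timeline (assumed inclusive)
    event_year: int
    event_qtr: int   # quarter of the recession event, 1 through 4

    def quarters(self):
        """List every (year, quarter) pair in the timeline, in order."""
        return [(year, qtr)
                for year in range(self.start_year, self.end_year + 1)
                for qtr in range(1, 5)]

    def event_index(self):
        """Position of the event quarter within quarters()."""
        return (self.event_year - self.start_year) * 4 + (self.event_qtr - 1)


# Values copied from the list of included recessions above.
RECESSION_2001 = Recession("2001", 2000, 2007, 2001, 3)
RECESSION_2008 = Recession("2008", 2007, 2019, 2008, 3)
```

With these values, the 2001 timeline spans 32 quarters and the 2008 timeline spans 52.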

Dimension

Area

Area parameters are stored in area.py

The BLS data includes five different types of area designations:

National: The full United States economic data. There are also designations for Metropolitan and Non-Metropolitan areas.
State: One of fifty recognized states. This also includes Puerto Rico and the U.S. Virgin Islands.
County: County designations within states. This includes Puerto Rican Municipios and individual islands within the U.S. Virgin Islands.
Metropolitan Statistical Area: Cities within the US.
Combined Statistical Area: Wider definitions to capture populous areas across cities and even state lines.

area_fips (str) is the index for areas.

More information can be found here.

Industry

Industry parameters are stored in industries.py

Industries are defined according to NAICS. The data is hierarchical and quite complex. Broad-encompassing industries are broken down into more granular ones lower in the hierarchy.

Child Industry

When an industry is broken down into smaller industries, I am defining each of those as a child industry.

Parent Industry

The industry a child industry comes from.

Sibling Industries
The set of child industries from a single parent industry.

Generation

Where in the industry hierarchy an industry falls.

industry_code (int) is the index for industries.

More information can be found here.
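The parent, child, sibling, and generation definitions can be illustrated with a small lookup table. This is only a sketch: the PARENT_OF mapping, its codes, and the function names are placeholders and do not reflect how industries.py stores the hierarchy. It also assumes that generations are counted from 0 at the top of the hierarchy, and that siblings are returned exactly as defined above (all children of the same parent, including the industry itself).

```python
# Hypothetical child -> parent mapping, keyed by integer industry_code.
# The codes are illustrative placeholders only.
PARENT_OF = {
    101: 10,
    102: 10,
    1011: 101,
    1012: 101,
    1013: 101,
}


def parent(code):
    """Return the parent industry code, or None at the top of the hierarchy."""
    return PARENT_OF.get(code)


def children(code):
    """Return the sorted child industry codes of an industry."""
    return sorted(child for child, par in PARENT_OF.items() if par == code)


def siblings(code):
    """Return every child of this industry's parent (includes the industry itself)."""
    par = parent(code)
    return [] if par is None else children(par)


def generation(code):
    """Count the steps up to the top of the hierarchy (assumed to be generation 0)."""
    depth = 0
    while code in PARENT_OF:
        code = PARENT_OF[code]
        depth += 1
    return depth
```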

Potential future dimensions to add to the project:

Population (U.S. Census)
State budgets
Partisan control of government

Targets

One of the three targets (referred to as variables in the code) that is used to judge economic health.

Employment

The number of jobs in each industry/area. In this project, I will only be using quarterly numbers.

Column Name: month3_emplvl

Wages

The average weekly wage in each industry/area.

Column Name: avg_wkly_wage

Establishments/Firms

The number of firms operating in each industry/area.

Column Name: qtrly_estabs_count
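A minimal sketch of how the three targets map onto their columns is shown below. The column names come from the definitions above. The TARGET_COLUMNS dictionary, the read_target function, and the sample rows are hypothetical, and the sample numbers are made up for illustration. area_fips is kept as a string so that leading zeros survive.

```python
import csv
import io

# Target name -> column name, as listed in the definitions above.
TARGET_COLUMNS = {
    "employment": "month3_emplvl",
    "wages": "avg_wkly_wage",
    "firms": "qtrly_estabs_count",
}

# Made-up rows in a simplified QCEW-style layout (illustration only).
SAMPLE = """area_fips,year,qtr,month3_emplvl,avg_wkly_wage,qtrly_estabs_count
01001,2008,1,100,700,10
01001,2008,2,98,705,10
"""


def read_target(csv_text, target):
    """Return {(area_fips, year, qtr): value} for one target (hypothetical helper)."""
    column = TARGET_COLUMNS[target]
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row["area_fips"], int(row["year"]), int(row["qtr"])): float(row[column])
        for row in reader
    }
```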

Goals Progress

1. Object-Oriented Approach

area.py, recessions.py, and industries.py each contains variable constants referred to throughout the project, and a class to define and store important parameters for analysis. These will continue to be expanded as the project evolves.

2. Dataframe Construction Refactor

produce_datasets has been streamlined, with deprecated code moved to the deprecated code file. helper_functions.py has been deprecated, and its functions moved to produce_datasets.py.

The main function (create_timeline) has new parameters and options. It can now function on any target, dimension, or recession. It also contains options on whether to save the dataframe as a JSON file and whether to derive the variables listed below. (Deriving these variables greatly increases computing time.)

Derived Variables:

The below variables are computed based on the recession timelines

pre-peak: The high point of the timeline before the nadir.
pre_peak_time: The number of quarters (from the beginning of the timeline) until the pre-peak.
pre_peak_qtr: The quarter at which the pre-peak occurs.
decline_time: the number of quarters between the pre-peak and the nadir.

nadir: The low point in the timeline. Excludes the first seven columns when computing.
nadir_time: The number of quarters (from the beginning of the timeline) until the nadir.
nadir_qtr: The quarter at which the nadir occurs.

recovery: Whether or not the timeline recovers from the recession before the end of the timeline (Is post-peak >= pre-peak).
recovery_time: the number of quarters between the nadir and when the timeline surpasses the pre-peak. Will be NaN if recovery == 0.
recovery_qtr: The quarter at which the recovery occurs.

post-peak: The high point of the timeline after the nadir.
post_peak_time: The number of quarters (from the nadir) until the post-peak.
post_peak_qtr: The quarter at which the post-peak occurs.
growth_time: the number of quarters between the recovery and the post-peak.

delta: the difference between pre-peak and post-peak.
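The definitions above can be sketched for a single timeline held as a plain list of quarterly values. This is an illustration under stated assumptions, not the code in produce_datasets.py: the function name derive_variables is hypothetical, ties resolve to the earliest quarter, delta is taken as post-peak minus pre-peak, the name post_peak_time is used by analogy with pre_peak_time, and None stands in for NaN. The *_qtr variables are omitted because they only look up a quarter label at the corresponding index.

```python
def derive_variables(values, skip=7):
    """Compute the derived variables for one timeline (hypothetical sketch).

    values: quarterly values in time order.
    skip: number of leading quarters excluded from the nadir search.
    """
    if skip < 1 or len(values) <= skip:
        raise ValueError("timeline must be longer than the skipped quarters")

    # Nadir: lowest value after the skipped leading quarters.
    nadir_time = min(range(skip, len(values)), key=values.__getitem__)
    # Pre-peak: highest value before the nadir.
    pre_peak_time = max(range(nadir_time), key=values.__getitem__)
    pre_peak = values[pre_peak_time]

    # Post-peak: highest value after the nadir, if any quarters remain.
    after = range(nadir_time + 1, len(values))
    post_idx = max(after, key=values.__getitem__) if after else None
    post_peak = values[post_idx] if after else None

    # Recovery: first quarter after the nadir at or above the pre-peak.
    recovery_idx = next((i for i in after if values[i] >= pre_peak), None)
    recovered = recovery_idx is not None

    return {
        "pre_peak": pre_peak,
        "pre_peak_time": pre_peak_time,
        "decline_time": nadir_time - pre_peak_time,
        "nadir": values[nadir_time],
        "nadir_time": nadir_time,
        "recovery": int(recovered),
        "recovery_time": recovery_idx - nadir_time if recovered else None,
        "post_peak": post_peak,
        "post_peak_time": post_idx - nadir_time if after else None,
        "growth_time": post_idx - recovery_idx if recovered else None,
        "delta": post_peak - pre_peak if after else None,
    }
```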

3. Timeline Charting: The Vector Class

In order to support the Flask app, I need a good-looking chart to easily show the economic progress of a selected unit. I created charting.py to contain all the various charting functions needed.

I created a class Vector to gather all the necessary information and plot the various graphs. It is agnostic to recession, dimension, and target, to minimize technical debt.

The Vector class can produce the following charts:

The vector itself. Shading is optional. Future Item: work on shading between the line and peak, rather than a solid rectangle.
Industry children.
Industry parent.
Industry siblings.

charting.py also has functions to chart relative gains/losses, like the "scary chart" above.
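The arithmetic behind a relative gains/losses chart can be sketched as a percent change from a peak quarter. The function name relative_to_peak is hypothetical and this is not the code in charting.py; it only shows one plausible way to put timelines of different sizes on a common scale.

```python
def relative_to_peak(values, peak_index):
    """Percent change of each quarter from the peak quarter onward (hypothetical sketch)."""
    peak = values[peak_index]
    if peak == 0:
        raise ValueError("peak value must be non-zero")
    # Series starts at the peak, so the first entry is always 0.0.
    return [(value - peak) * 100 / peak for value in values[peak_index:]]
```

Plotting the result for several recessions on the same axes, each starting at its own peak, gives lines that begin at zero and dip by the percentage lost.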

There is a function for comparison across recessions:

as well as across target/variables:

The charts surface some issues with the data (e.g. seasonality). These will be addressed in later branches.
