chihacknight / chn-ghost-buses

"Ghost buses" analysis project through Chi Hack Night

Home Page: https://github.com/chihacknight/breakout-groups/issues/217

License: MIT License

Languages: Python 0.24%, Jupyter Notebook 0.09%, HTML 99.67%

chn-ghost-buses's People

Contributors: dcjohnson24, kristenhahn, lauriemerrell

chn-ghost-buses's Issues

Deploy updated scraper & aggregator

Deploy changes in PR #11 (this will not be closed by a PR; actions need to be taken in AWS):

  • Deploy new scraper to Lambda
  • Copy existing raw data into the public bucket
  • Deploy the aggregator to Lambda -- note that it needs more memory, copy settings from test
  • Batch aggregate the intervening data (August 7-September XX, whenever that second deployment occurs)

[Data] Time series plot of scheduled trips since 2019

It will be important to track the number of scheduled trips from pre-COVID levels up to the present. This will help check whether the CTA lowers the number of scheduled trips to match the actual trips. A reduction in scheduled trips would improve the trip ratios, but bus service would still be below pre-COVID levels, a less than ideal scenario.

Data

To access older data, you will need to look at schedule versions from transitfeeds.com dating back to 2019. You probably do not need every schedule version in a given year because the versions' date ranges overlap, but it is good to check that the versions you choose span the entire year. For 2019, for example, you could choose:

  • "7 November 2018" (6 November 2018 - 31 January 2019)
  • "31 January 2019" (30 January 2019 - 31 March 2019)
  • "14 April 2019" (29 March 2019 - 31 May 2019)
  • "16 May 2019" (13 May 2019 - 31 July 2019)
  • "5 August 2019" (1 August 2019 - 31 October 2019)
  • "4 October 2019" (4 October 2019 - 31 December 2019)

You could then drop the 2018 dates and any duplicates that arise from the overlapping date ranges.

Set up and activate your virtual environment with the required packages by following the instructions in the README. Once you have the schedule feeds of interest, run the snippet below from inside the data_analysis directory. If running from the project root, change static_gtfs_analysis to data_analysis.static_gtfs_analysis.

import logging
from typing import List

import pandas as pd
from tqdm import tqdm

import static_gtfs_analysis

logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)


def fetch_data_from_schedule(schedule_feeds: List[dict]) -> List[dict]:
    """Retrieve data from the GTFS file for various schedule versions.

    Args:
        schedule_feeds (List[dict]): A list of dictionaries containing
            schedule_version, feed_start_date, and feed_end_date as keys.

    Returns:
        List[dict]: A list of dictionaries with the schedule version and the
            corresponding data.
    """
    schedule_data_list = []
    pbar = tqdm(schedule_feeds)
    for feed in pbar:
        schedule_version = feed["schedule_version"]
        pbar.set_description(
            f"Generating daily schedule data for "
            f"schedule version {schedule_version}"
        )
        logging.info(
            f"\nDownloading zip file for schedule version "
            f"{schedule_version}"
        )
        CTA_GTFS = static_gtfs_analysis.download_zip(schedule_version)
        logging.info("\nExtracting data")
        data = static_gtfs_analysis.GTFSFeed.extract_data(
            CTA_GTFS,
            version_id=schedule_version
        )
        data = static_gtfs_analysis.format_dates_hours(data)

        logging.info("\nSummarizing trip data")
        trip_summary = static_gtfs_analysis.make_trip_summary(data)

        route_daily_summary = (
            static_gtfs_analysis
            .summarize_date_rt(trip_summary)
        )

        schedule_data_list.append(
            {"schedule_version": schedule_version,
             "data": route_daily_summary}
        )
    return schedule_data_list

schedule_feeds = [
    {
        "schedule_version": "20181107",
        "feed_start_date": "2018-11-06",
        "feed_end_date": "2019-01-31" 
    },
    # Enter remaining schedule versions of interest from 2019, 2020, and 2021
    {
        "schedule_version": "20220507",
        "feed_start_date": "2022-05-20",
        "feed_end_date": "2022-06-02",
    },
    {
        "schedule_version": "20220603",
        "feed_start_date": "2022-06-04",
        "feed_end_date": "2022-06-07",
    },
    {
        "schedule_version": "20220608",
        "feed_start_date": "2022-06-09",
        "feed_end_date": "2022-07-08",
    },
    {
        "schedule_version": "20220709",
        "feed_start_date": "2022-07-10",
        "feed_end_date": "2022-07-17",
    },
    {
        "schedule_version": "20220718",
        "feed_start_date": "2022-07-19",
        "feed_end_date": "2022-07-20",
    },
]

schedule_data_list = fetch_data_from_schedule(schedule_feeds)

Access the schedule data in the list with

schedule_df = pd.concat([feed["data"] for feed in schedule_data_list])
schedule_df.drop_duplicates(inplace=True)

schedule_df should have enough information to generate plots of scheduled trip counts by day.

Example

schedule_feeds = [
    {
        "schedule_version": "20181107",
        "feed_start_date": "2018-11-06",
        "feed_end_date": "2019-01-31"
    },
    {
        "schedule_version": "20190131",
        "feed_start_date": "2019-01-30",
        "feed_end_date": "2019-03-31"
    }
]

schedule_data_list = fetch_data_from_schedule(schedule_feeds)

schedule_df = pd.concat([feed["data"] for feed in schedule_data_list])
print(schedule_df.duplicated().sum())
# 194 duplicates
schedule_df.drop_duplicates(inplace=True)
print(schedule_df.head())
#        date route_id  trip_count
# 0  2018-11-06        1          95
# 1  2018-11-06      100          73
# 2  2018-11-06      103         194
# 3  2018-11-06      106         169
# 4  2018-11-06      108          97

# Let's check the date ranges
print(f"The earliest date is {schedule_df.date.min()}")
print(f"The latest date is {schedule_df.date.max()}")
# The earliest date is 2018-11-06
# The latest date is 2019-03-31

# Drop the 2018 dates
schedule_df['date'] = pd.to_datetime(schedule_df["date"])
schedule_df_2019 = schedule_df.loc[schedule_df['date'].dt.year != 2018].copy()

# Print the date ranges again
print(f"The earliest date is {schedule_df_2019.date.min()}")
print(f"The latest date is {schedule_df_2019.date.max()}")
# The earliest date is 2019-01-01 00:00:00
# The latest date is 2019-03-31 00:00:00

[Data] Update schedule configuration

The way we have to load the schedule data is a little arcane, because we have to handle the concept of "when was this online".

I think we need to do the following:

  • Create a schedule_config.yml file to store information about schedule versions (replacing the dictionary here)
  • Write up information about how we use schedule data in our data analysis README and/or our actual website methodology section

Here's an explanation I sent to @dcjohnson24 offline about how the schedule config dictionary is constructed:

The version uploaded on June 3 takes effect on June 4 because I don't know what time of day it should become effective.
And then that June 3 version ends on June 7 because there's a new version uploaded June 8.
It's just non-overlapping ranges showing when that version was online, with no reference to actual feed content.

Sidebar: I think we may have some slight weirdness with these dates, because the dates on TransitFeeds appear to be in UTC while our work is all in Central Time.
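
A minimal sketch of what this could look like, assuming PyYAML and an illustrative schedule_config.yml layout that mirrors the keys in the existing schedule_feeds dictionary (none of this is settled):

import yaml  # requires PyYAML

# Hypothetical schedule_config.yml contents, shown here as a comment:
#
# schedule_feeds:
#   - schedule_version: "20220603"
#     feed_start_date: "2022-06-04"  # day after upload, since time of day is unknown
#     feed_end_date: "2022-06-07"    # day before the next version was uploaded
#   - schedule_version: "20220608"
#     feed_start_date: "2022-06-09"
#     feed_end_date: "2022-07-08"

with open("schedule_config.yml") as f:
    config = yaml.safe_load(f)

schedule_feeds = config["schedule_feeds"]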

[Data] Comparison of ride share rides and bus routes

Using data on ride shares from Chicago's open data portal, compare ride shares to buses covering the same route.

Some things to look for:

  • time comparison of ride share and bus trips
  • carbon emissions
  • cost to Chicagoans of increased ride share usage because of unreliable bus service

See here for more information.
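
As a starting point, the trip records can be pulled straight from the portal's SODA endpoint with pandas. A minimal sketch; the dataset ID below is believed to be the Transportation Network Providers trips dataset but should be verified on the portal, and the column names come from its documentation:

import pandas as pd

# Dataset ID m6dm-c72p is believed to be "Transportation Network Providers - Trips"
# on data.cityofchicago.org -- verify on the portal before relying on it.
url = "https://data.cityofchicago.org/resource/m6dm-c72p.json?$limit=1000"
rides = pd.read_json(url)

# Columns of interest for a time comparison (per the portal's documentation):
# trip_start_timestamp, trip_end_timestamp, trip_seconds, trip_miles,
# pickup_community_area, dropoff_community_area
print(rides.head())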

[Data] More Bus Statistics

Calculate some additional bus statistics such as

  • Average speed
  • Bunching
  • Average headway
  • Excess wait time (actual headway - scheduled headway)

These numbers could also be calculated for different sections of each route.

See here for more information.
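
As a rough illustration of the headway metrics, here is a toy sketch with made-up arrival times at a single stop; the column names and input format are placeholders, not the project's actual data layout:

import pandas as pd

# Hypothetical input: one row per observed bus arrival at a single stop,
# with a matching column of scheduled arrivals.
arrivals = pd.DataFrame({
    "actual_arrival": pd.to_datetime([
        "2022-07-01 08:00", "2022-07-01 08:14", "2022-07-01 08:41",
    ]),
    "scheduled_arrival": pd.to_datetime([
        "2022-07-01 08:00", "2022-07-01 08:15", "2022-07-01 08:30",
    ]),
})

actual_headway = arrivals["actual_arrival"].diff().dt.total_seconds() / 60
scheduled_headway = arrivals["scheduled_arrival"].diff().dt.total_seconds() / 60

print(f"Average actual headway: {actual_headway.mean():.1f} min")
# Excess wait time as defined above: actual headway minus scheduled headway
print(f"Average excess wait: {(actual_headway - scheduled_headway).mean():.1f} min")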

[Data] Map of Community Areas

Find the community areas that each bus route passes through. Statistics can then be calculated for the routes passing through a given community area. An approach similar to the one used to create the ward maps could work.
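
A minimal sketch of the spatial join, assuming route shapes come from static_gtfs_analysis.main() as in the ratio > 1 issue below, and that the "Boundaries - Community Areas" GeoJSON export and its community name column are as believed (verify both on the portal):

import geopandas as gpd

import data_analysis.static_gtfs_analysis as sga

# Route shapes as a GeoDataFrame with a route_id column.
routes = sga.main()

# GeoJSON export of "Boundaries - Community Areas (current)" on
# data.cityofchicago.org; dataset ID and the "community" column should be
# double-checked against the portal.
community_areas = gpd.read_file(
    "https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON"
)

# Match CRSs before the spatial join.
routes = routes.to_crs(community_areas.crs)

# One row per (route, community area) pair where they intersect.
routes_by_area = gpd.sjoin(routes, community_areas, predicate="intersects")
print(routes_by_area.groupby("route_id")["community"].unique())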

Automate updates to JSON files

Overview
The manual task of updating route information could be turned into an automated process.

Problem
Updating route information requires deploying the entire frontend codebase. This is a manual task, and the codebase (including the presentation layer and business logic) need not be tied to the JSON data for routes.

Proposed Solution

  • A GitHub Actions workflow can run the main function in compare_scheduled_and_rt.py to generate new data.
  • The action can be based on a cron schedule.
  • The updated data can either live in the chn-ghost-buses repo or in s3.
  • The front end can fetch route data from a remote host instead of the local codebase. A proof of concept is here. By replacing the route JSON, the data can be updated without a deployment. A sketch of the data-generation step follows this list.
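
A minimal sketch of the data-generation step such a workflow could run; the bucket and key names are placeholders, not settled project settings:

import boto3

import data_analysis.compare_scheduled_and_rt as csrt

# Regenerate the summary data (as the proposed cron-triggered workflow would).
summary_df = csrt.main()

# Upload as JSON to a bucket the frontend can fetch from.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="chn-ghost-buses-public",  # placeholder bucket name
    Key="frontend_data_feeds/route_summary.json",  # placeholder key
    Body=summary_df.to_json(orient="records"),
    ContentType="application/json",
)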

[Data / General] Set up pre-commit hooks or linting

It would be helpful to set up some pre-commit configuration (https://pre-commit.com/) for this repo so that we are automatically running a linter (perhaps black).

Acceptance criteria would include:

  • Configuring some pre-commit hooks; probably at least something to standardize spacing type things
  • Adding the relevant dependencies to the top-level requirements.yml file for the repo
  • Adding documentation to the repo README if needed to explain the checks

[Data] Automatically check for CTA-observed holidays

Spinout from: #37 (comment)

In compare_scheduled_and_rt.py we have a hard-coded list of holidays in a few places (ex: https://github.com/chihacknight/chn-ghost-buses/blob/main/data_analysis/compare_scheduled_and_rt.py#L99) and ideally that would be handled more automatically.

There is a holidays library in Python (https://github.com/dr-prodigy/python-holidays) which could help us. The catch is that we do not want to check for generic US (or even Chicago / Cook County) holidays; we only want to check for the specific holidays on which the CTA runs Sunday service.

At time of posting, that is:

Our services operate on a Sunday schedule on New Year’s Day, Memorial Day, July 4th (Independence Day), Labor Day, Thanksgiving Day and Christmas Day.
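
A minimal sketch of how the holidays package could be restricted to that list; the exact holiday name strings should be verified against the library's output:

import datetime

import holidays

# Holidays on which the CTA runs Sunday service, per the CTA quote above.
# NB: verify the exact name strings against the library (e.g. it uses
# "Thanksgiving" rather than "Thanksgiving Day").
CTA_SUNDAY_SERVICE_HOLIDAYS = {
    "New Year's Day",
    "Memorial Day",
    "Independence Day",
    "Labor Day",
    "Thanksgiving",
    "Christmas Day",
}


def runs_sunday_service(date: datetime.date) -> bool:
    """Return True if the CTA runs a Sunday schedule on this date."""
    name = holidays.US(years=date.year).get(date)
    return name in CTA_SUNDAY_SERVICE_HOLIDAYS


print(runs_sunday_service(datetime.date(2022, 7, 4)))  # True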

[Data] Investigate routes with ratio of actual trips to scheduled trips greater than 1

Investigate routes with ratio > 1

There are some routes that have a ratio of actual trips to scheduled trips greater than one, and it would be good to know why.

Access the data

Jupyter Notebook

To access the data, run the notebook compare_scheduled_and_rt.ipynb. Add a cell at the bottom with %store summary and run it. The %store magic command allows you to share variables between notebooks (https://stackoverflow.com/questions/31621414/share-data-between-ipython-notebooks).

Next, run static_gtfs_analysis.ipynb. Add a cell at the bottom with %store -r summary and run it to read the summary DataFrame from the compare_scheduled_and_rt.ipynb notebook. Then merge the summary DataFrame with the final_gdf GeoDataFrame from static_gtfs_analysis.ipynb using summary_gdf = summary.merge(final_gdf, how="right", on="route_id")

Python

Run the following in an interpreter from the project root:

import pandas as pd

import data_analysis.compare_scheduled_and_rt as csrt
import data_analysis.static_gtfs_analysis as sga

summary_df = csrt.main()

gdf = sga.main()

summary_gdf = summary_df.merge(gdf, how="right", on="route_id")

Find routes with ratio > 1

To filter the rows with ratio > 1, use

ratio_over_one = summary_gdf.loc[summary_gdf.ratio > 1]
ratio_over_one.head()

A few things to look for:

  • Duration of trips
  • How many trips cross the hour boundary
  • Number of routes with ratio > 1 after reaggregating the data at a different frequency, e.g. daily; see #12

[Data] Text analysis of CTA communication

What has CTA communication looked like since the start of COVID?

A few questions to explore could be

  • What are the main topics of tweets?
    • Word cloud
    • Topic modelling
  • Are delays reported in a timely manner?
  • Are ghost buses acknowledged?
    • If so, are they acknowledged in a timely manner? Are riders given advance warning?
  • How frequent are CTA tweets?
  • How does the CTA respond to criticism?
  • Has CTA communication changed since the start of COVID to now?
    • Do the word clouds or topic modelling results change over time?
    • Have the explanations for lower service changed?
  • What about websites, press releases, etc.?
    • Is the communication in other places consistent with Twitter?

These questions could also be put to transit agencies in other cities to help compare the communication styles between Chicago and other places.
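
For the word-cloud angle, a toy sketch using the third-party wordcloud package, assuming tweets have already been collected into a CSV with a text column (both the file and column name are placeholders):

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud  # third-party "wordcloud" package

# Placeholder input: collected CTA tweets with a "text" column.
tweets = pd.read_csv("cta_tweets.csv")

wc = WordCloud(width=800, height=400, background_color="white").generate(
    " ".join(tweets["text"].dropna())
)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()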

[Data] Investigate the Fullerton bus more

In an early EDA session, we observed that the realtime API data for the Fullerton (74) bus had some trips with missing/non-distinct trip_id values that were a series of asterisks (like ******). At the time the issue did not seem too widespread, but the Fullerton bus is in our bottom 10 routes in terms of performance. It is probably worth taking a second look to see whether this data issue is causing the 74 to seem worse than it actually is.

Goals for this ticket:

  • Assess the prevalence of placeholder trip ID values on the Fullerton bus -- what is the frequency, on what days does it occur, etc.
  • Also assess whether this issue is observed on other routes and whether its frequency on the Fullerton bus is a true outlier
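
A rough sketch of the prevalence check, assuming the scraped realtime responses can be loaded into one DataFrame; rt, tatripid, and tmstmp are CTA bus tracker API field names, but adjust to however the scraper actually stores them:

import pandas as pd

# Placeholder path; load however the raw realtime data is actually stored.
df = pd.read_csv("realtime_sample.csv", dtype=str)

# Placeholder trip IDs observed so far are runs of asterisks, e.g. "******".
is_placeholder = df["tatripid"].str.fullmatch(r"\*+").fillna(False)

# Frequency on the 74 and the days on which it occurs
fullerton = df["rt"] == "74"
print(is_placeholder[fullerton].mean())
print(df.loc[fullerton & is_placeholder, "tmstmp"].str.slice(0, 8).value_counts())

# Is the 74 a true outlier relative to other routes?
print(
    df.assign(placeholder=is_placeholder)
      .groupby("rt")["placeholder"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)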

[Data] Keyword search trends for Google and other social media

It would be nice to know about trends for search terms such as "ghost buses", "late buses", "missed buses", and the like for the CTA. An exploration of trends in news articles about ghost buses would also be useful. Have there been more of late?

This could be done using Google Trends and the APIs of other social media sites such as Reddit, Twitter, etc. A graph of trends for these terms could bolster the content in the introduction section of the website.
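
For the Google Trends piece, a minimal sketch using the unofficial pytrends package; the search terms and Illinois geo code are just illustrative starting points:

from pytrends.request import TrendReq  # unofficial Google Trends client

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    kw_list=["ghost bus", "cta bus tracker", "cta late bus"],
    geo="US-IL",
    timeframe="today 5-y",
)
trends = pytrends.interest_over_time()
print(trends.head())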

[Data] Automate schedule downloads

In addition to scraping realtime data every 5 minutes, we should scrape the GTFS schedule (static) data on a daily basis so we don't have to get historical versions after the fact.

We should write a Lambda function that will scrape the CTA schedule GTFS data from https://www.transitchicago.com/downloads/sch_data/google_transit.zip every day.

Acceptance criteria for this should just be a Python script that will scrape the zipfile as bytes and write it to S3.
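
A minimal sketch of such a script, using requests and boto3; the bucket and key layout are placeholders:

import datetime

import boto3
import requests

CTA_GTFS_URL = "https://www.transitchicago.com/downloads/sch_data/google_transit.zip"


def scrape_schedule(bucket: str = "chn-ghost-buses-public") -> None:
    """Download the CTA GTFS zip and write it to S3, keyed by scrape date.

    The bucket and key layout are placeholders, not settled project conventions.
    """
    response = requests.get(CTA_GTFS_URL, timeout=60)
    response.raise_for_status()
    today = datetime.date.today().isoformat()
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=f"cta_schedule_zipfiles_raw/google_transit_{today}.zip",
        Body=response.content,
    )


if __name__ == "__main__":
    scrape_schedule()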

Once that's ready, we should make a follow-up ticket to deploy to AWS (has to be done by me, @lauriemerrell) and another follow-up ticket to describe desired follow-up processing.
