New York Taxi Analysis

1. Introduction

In this project, I used data collected by the New York City Taxi and Limousine Commission about “Green” Taxis. Green Taxis are taxis that are not allowed to pick up passengers inside the densely populated areas of Manhattan. The analysis uses the data from September 2015 and is carried out with Apache Spark. All trips are grouped into four categories: NJ → NJ, NJ → NYC, NYC → NJ, and NYC → NYC. We build association rules on intra- vs. inter-borough traffic and try to identify the hours of the day in which there is more inter-borough traffic than intra-borough traffic. To augment the dataset, we create derived features such as year, month, and hour.
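The grouping and the hourly comparison described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the project's Spark code. The field names (`pickup_datetime`, `pickup_borough`, `dropoff_borough`), the timestamp format, and the sample trips are assumptions made for the example, not values taken from the dataset.

```python
from collections import Counter
from datetime import datetime


def derive_time_features(timestamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string (assumed format) into year, month, hour."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {"year": dt.year, "month": dt.month, "hour": dt.hour}


def route_category(pickup_region, dropoff_region):
    """Label a trip with one of the four route categories, e.g. 'NYC -> NJ'."""
    return f"{pickup_region} -> {dropoff_region}"


def hours_with_more_inter_borough(trips):
    """Return the hours in which inter-borough trips outnumber intra-borough trips."""
    intra, inter = Counter(), Counter()
    for trip in trips:
        hour = derive_time_features(trip["pickup_datetime"])["hour"]
        if trip["pickup_borough"] == trip["dropoff_borough"]:
            intra[hour] += 1
        else:
            inter[hour] += 1
    return sorted(h for h in inter if inter[h] > intra[h])


# Made-up sample trips, for illustration only.
sample_trips = [
    {"pickup_datetime": "2015-09-01 08:15:00", "pickup_borough": "Brooklyn", "dropoff_borough": "Brooklyn"},
    {"pickup_datetime": "2015-09-01 08:40:00", "pickup_borough": "Brooklyn", "dropoff_borough": "Queens"},
    {"pickup_datetime": "2015-09-01 08:50:00", "pickup_borough": "Queens", "dropoff_borough": "Manhattan"},
    {"pickup_datetime": "2015-09-01 17:05:00", "pickup_borough": "Queens", "dropoff_borough": "Queens"},
    {"pickup_datetime": "2015-09-01 17:30:00", "pickup_borough": "Queens", "dropoff_borough": "Queens"},
    {"pickup_datetime": "2015-09-01 17:45:00", "pickup_borough": "Brooklyn", "dropoff_borough": "Queens"},
]

print(route_category("NYC", "NJ"))
print(hours_with_more_inter_borough(sample_trips))  # prints [8]
```

In the sample, hour 8 has two inter-borough trips against one intra-borough trip, so it is the only hour returned.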

2. The Objective of Analysis

A typical taxi company faces the common problem of efficiently assigning cabs to passengers. Our objective is to suggest how Green Taxi can improve its business. We explore the data set and check the relationships between the features through summary statistics and data visualizations. We use Frequent Itemset Mining and Association Rules to find relationships between sets of items. By clustering the pick-up and drop-off locations, Green Taxi will be better able to manage its taxis and customers, attracting more customers and improving quality of service. In this project, I try to answer the question "What story does the data tell about how New Yorkers use their green taxis?" and to discover meaningful insights and patterns in the Green Taxi data set.
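To make the association-rule idea concrete, the sketch below computes support and confidence for pairs of items by brute-force counting. It is not the FP-Growth algorithm that Spark MLlib provides, and it is not the project's code. The item labels (`hour=evening`, `payment=card`, and so on) and the four transactions are invented for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for single-item rules."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair is not frequent enough
        for antecedent, consequent in ((a, b), (b, a)):
            # confidence(A => B) = support(A and B) / support(A)
            confidence = pair_support / support({antecedent}, transactions)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, pair_support, confidence))
    return rules


# Made-up trips, each encoded as a set of categorical items.
transactions = [
    frozenset({"hour=evening", "route=NYC->NYC", "payment=card"}),
    frozenset({"hour=evening", "route=NYC->NYC", "payment=cash"}),
    frozenset({"hour=morning", "route=NYC->NYC", "payment=card"}),
    frozenset({"hour=evening", "route=NYC->NJ", "payment=card"}),
]

rules = pair_rules(transactions, min_support=0.5, min_confidence=0.6)
for antecedent, consequent, s, c in rules:
    print(f"{antecedent} => {consequent} (support={s:.2f}, confidence={c:.2f})")
```

Brute-force pair counting is only practical for a handful of items. On the full data set, a frequent-pattern algorithm such as FP-Growth would be the more realistic choice.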

3. Data understanding (Data descriptive)

The data set describes green taxi trips. It consists of 1,494,926 records and a total of 21 variables (features). The stated task is to predict whether a cab is cancelled.

4. Conclusion / Recommendation (Actionable Insights)

After analyzing the data set, we reached the following insights. They suggest how Green Taxi can improve its business capacity, and they address driver profitability, including when to drive. During this project, we checked the relationships between the variables in the data set through data visualization and summary statistics. We also used machine learning techniques: Frequent Itemset Mining and Association Rules, and clustering.

Firstly, the association rules for NY help us identify an important factor: pickup month. It can help predict which months of the year are likely to be busy for pickups in NY.

Secondly, k-means clustering helps us segment the market, so we can see which areas have concentrations of pick-up or drop-off locations. The green taxi company can then manage its taxis more efficiently, and taxi drivers who wait in those areas can get more passengers.

Thirdly, we looked at fare amount and trip distance by pick-up hour and drop-off hour. Overall, the busiest pickup hour is between 17:00 and 18:00, and 8:00-10:00 is also a busy pick-up period. The busiest pickup hours in NJ are between 17:00 and 21:00. However, the y-axis (the number of trips) shows that most trips are concentrated in NYC. For trips, the peak pick-up hours are usually from 7 PM to 9 PM. These observations suggest that evening trips are mainly driven by people going out for dinner and drinks at night. There were fewer long trips than short trips, and long trips (> 30 miles) appear to come from travelers arriving in or departing from the city.

Lastly, the data visualizations reveal further information. The histogram of inter- and intra-borough traffic shows that most traffic is on the (NYC, NYC) route. Trips from NJ to NYC are the second most common (37,133 trips). Payment is usually by credit card or cash. One interesting point is that, even though credit cards have become common, many customers still pay their fares in cash, so taxi drivers need to carry change. Trips from NYC to NJ have the highest average fare amount (25.3709) and so appear to be the most profitable. Therefore, focusing on this route can earn the taxi company and its drivers more money.

To improve our analysis of the data set, deeper data exploration and better feature engineering are needed. It would also help to increase the number of observations and to get a better balance in the data set. The data set could be augmented with other factors such as geographic distance calculations and weather and climatological data. The MLlib library provides machine learning algorithms, so predicting the total amount based on hour (regression) or predicting the customer trip type (classification) could yield more meaningful insights from the data set.
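As a rough illustration of the k-means step used for market segmentation, the sketch below runs a basic Lloyd-style k-means on pick-up coordinates. It is plain Python rather than Spark MLlib and is not the project's implementation. The coordinates are made up, the first k points serve as starting centroids for simplicity, and distances are plain Euclidean distances on (latitude, longitude) pairs, which ignores the curvature of the Earth.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=20):
    """Basic k-means; returns (centroids, labels)."""
    centroids = list(points[:k])  # naive deterministic start (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Made-up pick-up locations forming two loose groups.
pickups = [
    (40.80, -73.95), (40.68, -73.98), (40.81, -73.94),
    (40.69, -73.97), (40.82, -73.96), (40.67, -73.99),
]

centroids, labels = kmeans(pickups, k=2)
print(labels)
print(centroids)
```

Each centroid can be read as a candidate waiting area. On real data, a random or k-means++ initialization and several restarts would be the safer choice.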

branden-kang / newyork-taxi-analysis