Time Series Clustering

Time Series Clustering using Google Mobility Report of Mexico during COVID-19

License: MIT License

Data Preprocessing

We first performed data cleaning tasks: variable selection, data engineering, and handling of missing data.

Variable Selection

First, we eliminated the variables iso_3166_2_code, sub_region_2, metro_area, census_fips_code, and country_region. The variable iso_3166_2_code was dropped to avoid multicollinearity, since it repeats the information in sub_region_1. The variable sub_region_2 was deleted because it added no useful information about the observations. Finally, metro_area, census_fips_code, and country_region consisted entirely of missing values, so they were removed as well.

```python
import pandas as pd

# Load the mobility report; the first column becomes the index
series = pd.read_csv('2020_MX_Region_Mobility_Report.csv', header=0, index_col=0)
del series['iso_3166_2_code']
del series['sub_region_2']
del series['metro_area']
del series['census_fips_code']
del series['country_region']
```

Data Engineering

Next, we converted the date variable to a datetime type and set it as the index of the data frame.

```python
# Parse dates as datetime (applying strftime here would turn them back into strings)
series['date'] = pd.to_datetime(series['date'])
series.set_index('date', inplace=True)
```

Missing Data

We then analyzed the missing data, which appeared in the sub_region_1 and transit_stations_percent_change_from_baseline variables. sub_region_1 had 3.03% missing values; however, according to the dataset's documentation, missing entries in sub_region_1 represent data at the national level. We therefore imputed them with a National Level category.

```python
series['sub_region_1'] = series['sub_region_1'].fillna('National Level')
```

The transit_stations_percent_change_from_baseline variable, on the other hand, had 1.75% missing values. Because this is a time series analysis, we could not simply drop the affected rows, so we interpolated the missing values instead.

```python
# Linear interpolation fills each gap from its neighboring observations
series['transit_stations_percent_change_from_baseline'] = \
    series['transit_stations_percent_change_from_baseline'].interpolate()
```
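As a quick illustration of what `interpolate` does (toy values, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy series with gaps; linear interpolation fills each gap
# proportionally between its neighboring observed values
s = pd.Series([0.0, np.nan, 10.0, np.nan, np.nan, 40.0])
print(s.interpolate().tolist())  # [0.0, 5.0, 10.0, 20.0, 30.0, 40.0]
```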

Modeling

In this stage we grouped the states to identify which of them show similar workplace-mobility behavior. We applied hierarchical clustering with several method/distance combinations: single linkage with Pearson correlation, single linkage with Spearman correlation, single linkage with Dynamic Time Warping (DTW), and the Ward method with Euclidean distance.
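As a sketch of how single linkage with a correlation distance can be set up (toy rows, not the project's actual code), SciPy's 'correlation' metric computes 1 minus the Pearson correlation between rows:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy rows standing in for state time series (not the real data)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with row 0
              [4.0, 3.0, 2.0, 1.0]])   # perfectly anti-correlated
D = pdist(X, metric='correlation')     # 1 - Pearson r for each pair
Z = linkage(D, method='single')        # single-linkage hierarchy
print(np.round(D, 3))                  # [0. 2. 2.]
```

A Spearman variant would rank-transform each row first and then use the same metric.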

```python
from matplotlib import pyplot

State = ['Aguascalientes', 'Baja California', 'Baja California Sur', 'Campeche',
         'Chiapas', 'Chihuahua', 'Coahuila', 'Colima', 'Durango', 'Guanajuato',
         'Guerrero', 'Hidalgo', 'Jalisco', 'Mexico City', 'Michoacán', 'Morelos',
         'Nayarit', 'Nuevo Leon', 'Oaxaca', 'Puebla', 'Querétaro', 'Quintana Roo',
         'San Luis Potosi', 'Sinaloa', 'Sonora', 'State of Mexico', 'Tabasco',
         'Tamaulipas', 'Tlaxcala', 'Veracruz', 'Yucatan', 'Zacatecas']

rows = []
for i in State:
    data = series[series['sub_region_1'] == i]
    data['workplaces_percent_change_from_baseline'].plot(label=i)
    rows.append(data['workplaces_percent_change_from_baseline'])
pyplot.xticks(rotation=45)
pyplot.legend(bbox_to_anchor=(1, 1.2))
pyplot.title('Workplaces')
pyplot.show()

# One row per state (DataFrame.append was removed in pandas 2.0)
timeSeries = pd.DataFrame(rows)
timeSeries.index = State
```
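DTW is not built into SciPy; the following is a minimal dynamic-programming sketch of the distance (illustrative only, not the project's implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW recurrence with absolute-difference point cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated 2
```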

We plotted the dendrograms using the fancy_dendogram function from here

The Ward method with Euclidean distance generated the most balanced clusters. Under this method, the distance between two clusters A and B is the increase in the total within-cluster sum of squares that results from merging them. Given its balanced results, we grouped the states this way.
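A minimal sketch of the Ward/Euclidean clustering step with SciPy, assuming timeSeries holds one row per state (random stand-in data here, not the real mobility matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in for the 32-state workplace-mobility matrix (rows = states)
rng = np.random.default_rng(0)
timeSeries = rng.normal(size=(32, 300))

Z = linkage(timeSeries, method='ward', metric='euclidean')
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into three clusters
print(len(labels))  # 32: one cluster label per state
```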

Deployment

By implementing the print_clusters function, we produced the cluster visualization below (Clusters figure).

Conclusion

In conclusion, using the Ward method with Euclidean distance, we grouped the states into three clusters by workplace mobility. Governments could use this information to target policies at states that show high levels of COVID-19 infections and belong to a cluster with an increasing trend in workplace mobility.

Contributors

valeriapineda23
