
CrimeTrendsExplorer: A Multi-City Crime Analysis Project

Final Project for Data Engineering Zoomcamp Course

dashboard.png

Table of contents

  • Problem Description
  • Technologies & Tools
  • Data Pipeline Architecture diagram
  • Pipeline explanation
  • Dashboard & Results
  • How to reproduce
  • Future Improvements
  • Credits

Problem Description: CrimeTrendsExplorer - A Multi-City Crime Analysis Project

The CrimeTrendsExplorer is a comprehensive data engineering project aimed at processing, analyzing, and exploring crime records data for three major cities in the United States: Austin, Los Angeles, and San Diego. The project covers several years of data, providing valuable insights and trends that can help law enforcement agencies, city planners, researchers, and the general public make informed decisions.

Objective

The primary goal of the CrimeTrendsExplorer project is to build a robust data pipeline that ingests, processes, and transforms raw crime data from multiple sources, and stores the results in a highly accessible, structured, and queryable format. The project will involve various data engineering tasks, including data ingestion, data cleaning, data validation, data transformation, and data storage.

Data Sources

The crime records data will be obtained from the following sources:

  1. Austin: Austin Police Department's Public Data Portal (https://data.austintexas.gov/)
  2. Los Angeles: Los Angeles Open Data Portal (https://data.lacity.org/)
  3. San Diego: San Diego Data Portal (https://data.sandiego.gov/)

Data Processing

The raw crime data will be processed using Apache Spark, which provides a highly scalable and distributed computing framework for big data processing. The pipeline will involve the following steps:

  1. Ingest raw crime data from multiple sources in various formats (CSV, JSON, etc.).
  2. Perform data cleaning, validation, and preprocessing to ensure data consistency and integrity.
  3. Transform and enrich the data, extracting relevant features and aggregating the data as needed.
  4. Write the processed data to a partitioned and clustered table in Google BigQuery for efficient querying and analysis.
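As a rough illustration of these four steps, a single-city Spark job might look like the sketch below. The column name "Occurred Date", the bucket paths, and the target table name are placeholder assumptions for illustration, not the exact code used in this repo:

    # Minimal PySpark sketch of the processing stage (assumed column/bucket names).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("crime-trends-explorer").getOrCreate()

    # 1. Ingest raw CSV data for one city.
    raw = spark.read.option("header", True).csv("gs://<raw-bucket>/austin/*.csv")

    # 2. Clean and validate: normalize the date column, drop rows without a date.
    clean = (
        raw.withColumnRenamed("Occurred Date", "crime_date")
           .withColumn("crime_date", F.to_date("crime_date", "MM/dd/yyyy"))
           .dropna(subset=["crime_date"])
    )

    # 3. Transform / enrich: add helper columns used for aggregation.
    enriched = (
        clean.withColumn("year", F.year("crime_date"))
             .withColumn("month", F.month("crime_date"))
    )

    # 4. Write to BigQuery via the spark-bigquery connector,
    #    date-partitioned on crime_date (requires a Dataproc temp bucket).
    (enriched.write
        .format("bigquery")
        .option("table", "<project>.<dataset>.austin_crime")
        .option("temporaryGcsBucket", "<dataproc-temp-bucket>")
        .option("partitionField", "crime_date")
        .option("partitionType", "DAY")
        .mode("overwrite")
        .save())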

Data Storage

The processed crime data will be stored in Google BigQuery, a highly-scalable and fully-managed data warehouse solution that enables fast and efficient querying and analysis. The data will be partitioned and clustered to optimize query performance, and the schema will be designed to support easy exploration and analysis of the data.
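As a hedged example of such a destination table, the BigQuery Python client can create a date-partitioned, clustered table roughly as follows. The schema fields and the clustering columns (city, crime_type) are illustrative assumptions; the table name matches the dbt production model referenced later in this README:

    # Sketch: create a date-partitioned, clustered table in BigQuery
    # (schema and clustering columns are illustrative assumptions).
    from google.cloud import bigquery

    client = bigquery.Client(project="crime-trends-explorer")

    schema = [
        bigquery.SchemaField("crime_id", "STRING"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("crime_type", "STRING"),
        bigquery.SchemaField("crime_date", "DATE"),
    ]

    table = bigquery.Table(
        "crime-trends-explorer.prod_crime_reports.fact_crimedata", schema=schema
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="crime_date",
    )
    table.clustering_fields = ["city", "crime_type"]

    client.create_table(table, exists_ok=True)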

Analysis and Visualization

The CrimeTrendsExplorer project will provide a foundation for further analysis and visualization of the crime data. The processed data can be used to generate insights and trends, such as identifying high-crime areas, understanding seasonal patterns, and exploring the relationship between different types of crimes.

For visualization purposes, the project will utilize Looker (Google Data Studio), a powerful and user-friendly data visualization tool that integrates seamlessly with Google BigQuery. Looker enables users to create interactive charts, dashboards, and reports, allowing them to explore the data and derive actionable insights without the need for extensive technical expertise. Users can customize and share their visualizations, making it easy for stakeholders to access and interpret the data.

By leveraging modern data engineering technologies and best practices, the CrimeTrendsExplorer project aims to create a powerful platform for understanding and exploring crime patterns in major cities, ultimately contributing to a safer and more informed society.

Back to the top

Technologies & Tools

  1. Apache Spark: A highly scalable and distributed computing framework for big data processing, used for data ingestion, cleaning, transformation, and enrichment.
  2. Docker: A platform for developing, shipping, and running applications in containers, enabling consistent and portable deployment across environments.
  3. Docker Compose: A tool for defining and running multi-container Docker applications, simplifying the management and orchestration of containers.
  4. Google BigQuery: A highly-scalable and fully-managed data warehouse solution that provides fast and efficient querying and analysis.
  5. Google Cloud Storage: A highly-durable and scalable object storage service for storing and managing data.
  6. Google Cloud DataProc: A fully-managed service for running Apache Spark and other big data processing tools on Google Cloud.
  7. Prefect: A workflow management system for building, scheduling, and monitoring data pipelines, used for orchestrating the various tasks in the project.
  8. dbt (Data Build Tool): A modern data transformation tool for data warehouses, used for transforming and modeling data in BigQuery.
  9. Python: A versatile programming language used for various data engineering tasks, such as writing Apache Spark jobs and Prefect flows.
  10. Looker (Google Data Studio): A powerful and user-friendly data visualization tool that integrates with Google BigQuery for creating interactive charts, dashboards, and reports.
  11. Terraform: An infrastructure-as-code tool for provisioning and managing cloud resources, used for automating the creation and configuration of Google Cloud resources.

These technologies and tools were employed throughout the CrimeTrendsExplorer project to create a robust, efficient, and scalable data pipeline, as well as to enable effective analysis and visualization of the processed crime data.

Back to the top

Data Pipeline Architecture diagram

data_engineering_architecture.png

Back to the top

Pipeline explanation

This project processes crime records data for 3 cities (Austin, Los Angeles, and San Diego) for several years, providing insights and analysis of crime trends across these cities.

  1. Data Ingestion: In this stage, the raw crime data from the respective cities is fetched from various web sources as CSV files. The raw data is then stored in Google Cloud Storage, along with the Python file containing the Apache Spark job.

  2. Data Transformation: The raw data is processed using Apache Spark, running on a DataProc Cluster. The Spark job performs tasks such as data cleaning, transformation, enrichment, and partitioning. The transformed data is then directly loaded into Google BigQuery, a fully-managed data warehouse solution.

  3. Data Modeling: The transformed data is further processed using dbt (Data Build Tool) to create meaningful and structured data models in Google BigQuery.

  4. Data Partitioning & Clustering: The data is partitioned and clustered in BigQuery to optimize query performance and storage efficiency. Partitioning is done on the crime date (daily partitions), and clustering groups related records together within each partition.

  5. Data Visualization: The processed and modeled data in BigQuery is then used to create interactive visualizations, charts, and dashboards in Looker (Google Data Studio), enabling users to explore and analyze crime trends across cities.

  6. Pipeline Orchestration: The entire pipeline is orchestrated using Prefect, a workflow management system that schedules, manages, and monitors the various tasks in the pipeline, ensuring the smooth and efficient operation of the data pipeline.

  7. Dockerization: All components of the project, including the Spark job, Prefect pipeline, and dbt, are dockerized to ensure a consistent and portable environment for development, testing, and deployment.

This pipeline allows for the efficient processing and analysis of large-scale crime data, enabling users to explore trends, patterns, and insights across multiple cities and timeframes.
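To make the orchestration step more tangible, here is a stripped-down sketch of what a Prefect 2 ingestion flow can look like. The task names, the example URL, the GCS block name ("crime-gcs-bucket"), and the paths are placeholders rather than the project's actual flow code:

    # Sketch of a Prefect ingestion flow (names, URL, and paths are placeholders).
    from pathlib import Path

    import requests
    from prefect import flow, task
    from prefect_gcp.cloud_storage import GcsBucket


    @task(retries=3)
    def download_dataset(url: str, dest: Path) -> Path:
        """Fetch a raw CSV from a city's open data portal."""
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(requests.get(url, timeout=60).content)
        return dest


    @task
    def upload_to_gcs(local_path: Path, remote_path: str) -> None:
        """Upload the raw file to Google Cloud Storage via a Prefect block."""
        gcs = GcsBucket.load("crime-gcs-bucket")  # block name is an assumption
        gcs.upload_from_path(from_path=local_path, to_path=remote_path)


    @flow(log_prints=True)
    def ingest_city(city: str, url: str) -> None:
        local = download_dataset(url, Path(f"data/{city}.csv"))
        upload_to_gcs(local, f"raw/{city}.csv")
        print(f"Ingested {city}")


    if __name__ == "__main__":
        ingest_city("austin", "https://data.austintexas.gov/<dataset>.csv")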

Back to the top

Dashboard & Results

dashboard.png

Results:

  • Over the last 4 years, the number of crimes per month has remained relatively stable.
  • The number of crimes in Austin has not shown any significant increase or decrease over the analyzed period.
  • In Los Angeles, there has been an increase in the number of crimes in the last 2 years.
  • Conversely, San Diego has experienced a decrease in the number of crimes over the past 2 years.
  • The raw data for Austin and Los Angeles share a similar structure, which allows for a direct comparison between the two cities.
  • However, the raw data for San Diego has a different structure, so a direct comparison with Austin and Los Angeles is not feasible.

Back to the top

How to reproduce

Step 1. Set Up Cloud Environment. Using Google Console and local terminal

  1. Create an account on Google Cloud Platform with your Google email (if you don't have one).

  2. Go to Google Cloud Console and create a new project. Copy Project ID (in my case it was: crime-trends-explorer) and press Create.

    create_project

    Then choose to use your new project: change_project

  3. Create Service Account for this project:

    • Go to IAM & Admin -> Service accounts -> Create Service Account
    • Name: crime-trends-explorer-user
    • Grant these roles to this account:
      • Viewer
      • BigQuery Admin
      • Storage Admin
      • Storage Object Admin
      • Dataproc Administrator
      • Service Account User
    • After pressing Done, click Actions on the created account and choose Manage keys:
      • Add key -> Create new key -> JSON -> Create
    • Download the created service account key, rename it to crime-trends-explorer-user-key.json, and put it in the ~/.gc dir (create the folder if needed). This is just for convenience; later we'll copy it into our VM.
  4. Enable these APIs for your project:

  5. Generate SSH keys to log in to VM instances. This will generate a 2048-bit RSA SSH key pair named gcp with the comment de_user. The comment (de_user) will become the username on the VM:

    • In terminal:
    cd ~/.ssh
    ssh-keygen -t rsa -f ~/.ssh/gcp -C de_user -b 2048
    
    • Add the generated public key to Google Cloud (Compute Engine -> Metadata -> SSH Keys -> Add ssh key) by copying everything from the file gcp.pub. If you already have an SSH key that you use with your GCP account, you can use it instead.
  6. Create Virtual Machine Instance (Compute Engine -> VM Instances -> Create Instance).

    Name: whatever name you would like to call this VM (crime-vm)
    Region, Zone: select a region near you, same with the zone (us-east1-b)
    Machine type: Standard, 4 vCPU, 16 GB memory (e2-standard-4)
    Operating system: Ubuntu
    Version: Ubuntu 20.04 LTS
    Boot disk size: 30 GB
    

    vm_name vm_disk

    When the VM is created, note its external IP address.

  7. Connect to the created VM from your terminal (copy the VM's external IP):

    ssh -i ~/.ssh/gcp de_user@<external_ip_you_copied>
    

    You can also create or update a config file in your ~/.ssh folder with the external IP address of your VM. This will allow you to simply run ssh crime-vm to log in to your VM:

    $ cd ~/.ssh/
    # Create config file (or open if exists)
    $ touch config
    

    Add:

    Host crime-vm
     Hostname <external_ip_you_copied>
     User de_user
     IdentityFile ~/.ssh/gcp
    

    Then you can run ssh crime-vm to connect to this VM.

Step 2. Set Up Cloud Environment. Using VM Terminal

  1. Connect to remote VM:

    ssh crime-vm
    
  2. In the VM, clone the repository and cd into it:

    git clone https://github.com/albertaleksa/crime-reports-data.git
    cd crime-reports-data
    
  3. Run the bash script to install the required software on the VM:

    bash ./setup/setup.sh
    

    This will:

    • update the system
    • download and install Anaconda
    • install Docker
    • give permission to run docker commands without sudo in the VM
    • install docker-compose
    • install Terraform
    • install make
  4. IMPORTANT: Log out and log back in so that your group membership is re-evaluated.

  5. Check if docker works:

    docker run hello-world
    

    Check docker-compose version:

    docker-compose version
    
  6. (From local terminal) Copy the service account key crime-trends-explorer-user-key.json from ~/.gc to the VM. This makes it possible to work with Google Cloud from the VM using this service account:

    scp ~/.gc/crime-trends-explorer-user-key.json de_user@crime-vm:~/.gc/
    

    Also copy this key into your project folder on the VM.

  7. (In Remote VM) Configure gcloud with your service account .json file:

    • If needed, change the values in setup/activate_service_account.sh to match your setup
    • Run command:
      bash ./setup/activate_service_account.sh
      
    • Log out and log back in

Step 3. Run Terraform to deploy your infrastructure to your Google Cloud Project (In Remote VM)

  1. If needed, you can change variables in terraform.tfvars (region, etc.)

  2. Initialize terraform:

    terraform -chdir="./terraform" init
    
  3. Build a deployment plan (change crime-trends-explorer to your project_id if needed):

    terraform -chdir="./terraform" plan -var="project_id=crime-trends-explorer"
    
  4. Apply the deployment plan and deploy the infrastructure (change crime-trends-explorer to your project_id if needed). If needed type yes to accept actions:

    terraform -chdir="./terraform" apply -var="project_id=crime-trends-explorer"
    
  5. Go to the Google Cloud Console to make sure that the infrastructure was created:

    • Google Cloud Storage: 03_bucket.png
    • BigQuery 03_bigquery.png
    • DataProc 03_dataproc.png
  6. Copy the Dataproc temp bucket name from the GCS buckets into the DATAPROC_TEMP_BUCKET field of the .env file.

Step 4. Install Spark in the VM (optional: the data pipeline will work in the Docker container regardless).

  1. In VM Remote run:
    bash ./crime-reports-data/setup/install_spark.sh
    
  2. IMPORTANT: Log out and log back in.
  3. Go to the work dir, create a lib folder, and download the GCS connector (a sketch of using it from a local SparkSession follows below):

    cd ~/crime-reports-data/flows/
    mkdir lib
    cd lib
    gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar gcs-connector-hadoop3-2.2.5.jar
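If you do run Spark locally on the VM, the SparkSession needs to be pointed at this connector and at your service account key. A minimal sketch, where the jar path, key file path, and bucket name are assumptions based on the setup above:

    # Sketch: local SparkSession configured with the GCS connector
    # (jar path, key file path, and bucket are assumptions for illustration).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("crime-local-test")
        .config("spark.jars", "lib/gcs-connector-hadoop3-2.2.5.jar")
        .getOrCreate()
    )

    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hconf.set("fs.AbstractFileSystem.gs.impl",
              "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    hconf.set("google.cloud.auth.service.account.enable", "true")
    hconf.set("google.cloud.auth.service.account.json.keyfile",
              "/home/de_user/.gc/crime-trends-explorer-user-key.json")

    # Quick check that GCS access works.
    df = spark.read.option("header", True).csv("gs://<your-bucket>/raw/austin.csv")
    df.show(5)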

Step 5. Run the pipeline using Prefect for orchestration in a Docker container, which copies the datasets from the web to Google Cloud Storage, saves them as Parquet, loads them into BigQuery, and processes them with dbt (In Remote VM)

  1. Build Docker image in Remote VM:

    make docker-build
    
  2. Run docker-compose in the background. It starts the Prefect Orion server, the Prefect agent, the Prefect database, and the container that will execute the pipeline:

    make docker-up
    

    To stop:

    make docker-down
    
  3. For monitoring and running/scheduling pipelines with the Prefect UI, forward port 4200 (I used PyCharm Pro for this) and open the URL http://127.0.0.1:4200/ on your local machine. 05_orion_ui.png

  4. Run the Python script to create the Prefect blocks:

    make create-block
    

    You can check in the UI that the blocks for GCP Credentials and GCS Bucket were created: 05_orion_blocks.png

  5. Create a Prefect flow deployment to:

    • download the datasets from the web
    • upload them to Google Cloud Storage
    • upload the spark_job file to Google Cloud Storage
    • submit the Spark job to the Dataproc cluster (a sketch of the submission call is shown at the end of this step). The Spark job will:
      • read the csv files
      • modify columns
      • save to parquet
      • save to BigQuery with daily partitioning by the crime_date column. This type of partitioning improves performance, because the analysis aggregates by this field.
    make ingest-data
    
  6. Schedule the deployment in Prefect to run daily at 02:00 am (if needed):

    make ingest-data-schedule
    

    05_deployments.png

  7. To check Agent's logs (interactively):

    make docker-agent-logs
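For reference, the "submit the Spark job to the Dataproc cluster" part of step 5 roughly corresponds to the following use of the Dataproc Python client. This is only a sketch: the cluster name, GCS URIs, and job arguments are placeholders rather than the project's exact values:

    # Sketch: submit the PySpark job to a Dataproc cluster
    # (cluster name, GCS URIs, and args are placeholders).
    from google.cloud import dataproc_v1

    project_id = "crime-trends-explorer"
    region = "us-east1"
    cluster_name = "<your-dataproc-cluster>"

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": "gs://<your-bucket>/code/spark_job.py",
            "args": ["--input=gs://<your-bucket>/raw/", "--dataset=prod_crime_reports"],
        },
    }

    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    response = operation.result()  # blocks until the Spark job finishes
    print(f"Job finished: {response.reference.job_id}")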
    

Step 6. dbt data transformation in BigQuery (In Remote VM)

At the moment, the dbt transformation runs directly from the Docker container.

  1. Build dbt for dev:
    make dbt-dev
    
  2. Build dbt for production:
    make dbt-prod
    
  3. Go to Looker (Google Data Studio) and create a visualization from the BigQuery table crime-trends-explorer.prod_crime_reports.fact_crimedata
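Before building the dashboard, you can sanity-check the production table with a quick query from Python. This is only a sketch: the crime_date column name is taken from the ingestion step and may differ in the dbt model, and the monthly aggregation simply mirrors the kind of chart shown on the dashboard:

    # Sketch: quick sanity check of the dbt-built fact table.
    from google.cloud import bigquery

    client = bigquery.Client(project="crime-trends-explorer")

    query = """
        SELECT DATE_TRUNC(crime_date, MONTH) AS month,
               COUNT(*) AS crimes
        FROM `crime-trends-explorer.prod_crime_reports.fact_crimedata`
        GROUP BY month
        ORDER BY month
    """

    for row in client.query(query).result():
        print(row.month, row.crimes)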

Back to the top

Future Improvements

  • Create tests for all code and SQL
  • Move the remaining commands into the Makefile so the project can be reproduced with fewer commands
  • Implement CI/CD
  • Add an API for extracting data from the web

Back to the top

Credits

A special thanks to the instructors for their guidance and support throughout this incredible course. Their expertise and insights have been invaluable in the development of the CrimeTrendsExplorer project. I've learned a lot of useful skills and techniques that have greatly enhanced my knowledge as a data engineer.

Thank you for providing such a comprehensive and engaging course experience!

Back to the top
