Data Science Portfolio
A locally hosted application that identifies anomalous trends in key raw and transformed data sources. The application uses Python machine learning packages and database connections to query data, predict upper and lower bounds, and highlight where values fall above or below expected thresholds. When an anomaly is detected, the application triggers an alert sent directly to me via a Zapier <> Mattermost integration, which significantly reduces downtime as well as the time taken to diagnose the root cause of anomalous data.
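The core idea behind the bound prediction can be sketched as follows. This is a minimal, illustrative version using a rolling mean and standard deviation; the function name, window size, and multiplier are my own placeholders, and the real application uses ML packages and live database queries rather than an in-memory list.

```python
import statistics

def anomaly_bounds(series, window=7, k=3.0):
    """Flag points outside mean +/- k*stdev of the trailing window.

    Returns a list of (index, value, lower_bound, upper_bound) tuples
    for each point that falls outside its expected range -- these are
    the points that would trigger an alert.
    """
    alerts = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mean = statistics.fmean(trailing)
        stdev = statistics.stdev(trailing)
        lower, upper = mean - k * stdev, mean + k * stdev
        if not (lower <= series[i] <= upper):
            alerts.append((i, series[i], lower, upper))
    return alerts
```

In the real application, each tuple returned here would be formatted into a message and posted to Mattermost via the Zapier webhook.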
As your analytics infrastructure grows, so does your database spend. If the database and its accompanying data models are not optimized, the cost of the database can exceed the value of the analytics it supports. This happened at my company: our costs began to exceed the value provided, and I identified the need to optimize key data models by improving clustering, incremental build logic, and the amount of data cached during each job.

To accomplish this, I developed a custom SQL script that aggregates usage and calculates cost at the individual data model level, using custom calculations based on our contract rates for specific warehouse types. Snowflake only aggregates cost at the warehouse level, which limits the ability to identify more granular costs. This script allowed me to identify high-spend, high-usage data models to target for optimization.

Once the data models were identified and optimized, and enough time had passed to establish new daily spend baselines, I performed time series forecasting with Facebook's Prophet package to calculate expected spend for the next fiscal year. The forecast gave our finance team upper and lower bounds on database spend, which they used to negotiate the minimum contract price that would allow our analytics infrastructure to continue growing at the rate the company required.
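The cost-attribution step can be illustrated with a small sketch. The actual script ran as SQL against Snowflake usage data; the rates, column names, and function below are hypothetical stand-ins showing how credits are joined to contract rates per warehouse type and rolled up per data model.

```python
# Illustrative contract rates in dollars per credit, keyed by warehouse
# type. These values are placeholders, not real contract figures.
CONTRACT_RATES = {"XS": 2.0, "M": 8.0}

def cost_per_model(usage_rows):
    """Attribute spend to individual data models.

    usage_rows: iterable of (data_model, warehouse_type, credits_used).
    Returns a dict of data_model -> total dollar cost, sorted from the
    highest spend to the lowest, so optimization targets surface first.
    """
    totals = {}
    for model, wh_type, credits in usage_rows:
        totals[model] = totals.get(model, 0.0) + credits * CONTRACT_RATES[wh_type]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Sorting by spend descending mirrors how the real script was used: the top few rows were the data models chosen for clustering and incremental-build optimization.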
Our team released a cloud offering of our previously on-prem solution. The revenue model was freemium, allowing free usage up to a certain number of registered users. We already knew that keeping a cloud workspace and its user base engaged was key to converting free workspaces to paid workspaces. What we wanted to identify were the features with the greatest impact on keeping a workspace engaged, so that product design decisions could encourage workspaces and their users to adopt those features earlier in their lifecycle. To accomplish this, I wrote a SQL query and extracted the results into a dataframe containing a large set of features. I then wrote a Python method that looped through several machine learning models, trained each on the dataframe, and output key performance metrics. Finally, I selected the most performant model based on these metrics, evaluated feature importance, and advised the product team on where to focus their efforts.
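The model-selection loop can be sketched in miniature. The real version trained scikit-learn models on a dataframe of workspace features; here the two toy "models", the tiny dataset, and all names are illustrative, but the pattern is the same: fit each candidate, score it on a shared metric, and keep the winner for feature-importance analysis.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

class MajorityBaseline:
    """Predicts the most common training label for everything."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.label for _ in X]

class ThresholdStump:
    """Predicts 1 when a single feature exceeds a learned threshold.

    The chosen feature doubles as a crude feature-importance signal.
    """
    def fit(self, X, y):
        best_acc = -1.0
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                preds = [1 if row[j] > t else 0 for row in X]
                acc = accuracy(y, preds)
                if acc > best_acc:
                    best_acc, self.feature, self.threshold = acc, j, t
        return self
    def predict(self, X):
        return [1 if row[self.feature] > self.threshold else 0 for row in X]

def select_best_model(models, X, y):
    """Fit every candidate, score it, and return the best (name, model)."""
    scored = [(accuracy(y, m.fit(X, y).predict(X)), name, m)
              for name, m in models.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[0][1], scored[0][2]
```

In practice the candidates would be scikit-learn estimators and the metric would come from a held-out test split, but the loop-train-score-select structure is what the paragraph above describes.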