GithubHelp home page GithubHelp logo

ankushjain7 / clustering-with-llm Goto Github PK

View Code? Open in Web Editor NEW

This project forked from damiangilgonzalez1995/clustering-with-llm

0.0 0.0 0.0 94.06 MB

A customer segmentation project can be approached in multiple ways. In this repository, we will explore advanced techniques for defining clusters and analyzing the results.

Python 0.01% Jupyter Notebook 100.00%

clustering-with-llm's Introduction

Clustering with LLM

Introduction

A customer segmentation project can be approached in multiple ways. In this repository, we will explore advanced techniques for defining clusters and analyzing the results. This repo is intended for data scientists looking to expand their toolbox for tackling clustering problems and advancing towards becoming senior data scientists.

What Will We Cover?

In this repo, we will explore three methods to approach customer segmentation projects:

  1. Kmeans
  2. K-Prototype
  3. LLM + Kmeans

Getting Started

As a preview, we will provide a comparison of 2D representations (PCA) of the different models created. Additionally, you will learn about dimensionality reduction techniques such as PCA, t-SNE, and MCA, with results included.

Getting Started

Important Note: This project does not cover the exploratory data analysis (EDA) phase or variable selection, which are crucial steps in such projects.

Data

The original data used in this project is from a public Kaggle dataset called "Banking Dataset - Marketing Targets." Each row in this dataset contains information about a company's customers, including both numerical and categorical fields. This diversity in data types opens up various approaches to the problem.

For this project, we will focus on the first 8 columns of the dataset, which include:

  • age (numeric)
  • job: type of job (categorical)
  • marital: marital status (categorical)
  • education: education level (categorical)
  • default: has credit in default? (binary)
  • balance: average yearly balance in euros (numeric)
  • housing: has a housing loan? (binary)
  • loan: has a personal loan? (binary)

The project uses the training dataset from Kaggle, which can be found in the "data" folder of the project repository as a compressed file. Inside the compressed file, you'll find two CSV files: train.csv (the original training dataset) and embedding_train.csv (the dataset after performing an embedding, which will be explained later).

To understand the project's structure, here's an overview of the project directory:

clustering_llm
├─ data
│  ├─ data.rar
├─ img
├─ embedding.ipynb
├─ embedding_creation.py
├─ kmeans.ipynb
├─ kprototypes.ipynb
├─ README.md
└─ requirements.txt

Method 1: Kmeans

Kmeans is a commonly used clustering method, and we will dive into it to demonstrate advanced analysis techniques. The complete procedure can be found in the Jupyter notebook titled kmeans.ipynb.

Method 2: Kprototype

Method to create clusters when you have a mix of features (categorical and numerical). Jupyter notebook titled kprototypes.ipynb.

Method 3: LLM + Kmeans

The jewel in the crown, where you will find how to apply llm to obtain impeccable results in cluttering projects Jupyter notebook titled embedding.ipynb.

clustering-with-llm's People

Contributors

damiangilgonzalez1995 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.