GithubHelp home page GithubHelp logo

data_mining_bfr_alg's Introduction

BFR Algorithm Implementation

Overview of the Assignment

I will implement the Bradley-Fayyad-Reina (BFR) algorithm on a synthetic dataset. The goal is to familiarize myself with the clustering process and various distance measurements using synthetic datasets.

Dataset

Synthetic datasets were generated to meet the strong assumption of the BFR algorithm that clusters are normally distributed with independent dimensions. The dataset includes random centroids, data points, and outliers for algorithm evaluation. (not included because it is too large for Github)

Task

I will implement the Bradley-Fayyad-Reina (BFR) algorithm to cluster the data in hw6_clustering.txt. The algorithm involves three sets of points: Discard set (DS), Compression set (CS), and Retained set (RS). The steps of the BFR algorithm include loading data, running K-Means, and generating DS and CS clusters.

  • Step 1: Load 20% of the data randomly.
  • Step 2: Run K-Means with a large K on the data using the Euclidean distance as the similarity measurement.
  • Step 3: Move single-point clusters to RS (outliers).
  • Step 4: Run K-Means again to cluster the remaining data points.
  • Step 5: Generate DS clusters and statistics from the K-Means result.
  • Step 6: Run K-Means on the points in RS to generate CS and RS clusters.
  • Step 7: Update the DS and CS clusters based on the new points in the dataset.
  • Step 8: Repeat steps 1-7 until all data points are processed.
  • Step 9: Merge the DS and CS clusters to form the final set of clusters.
  • Step 10: Assign each data point to the nearest cluster.
  • Step 11: Output the cluster assignments for each data point.
  • Step 12: Evaluate the clustering performance using appropriate metrics.

The BFR algorithm aims to efficiently cluster the dataset and handle outliers.

Output

The output file is a text file, containing the following information:

a. The intermediate results (the line is named as “The intermediate results”): I will start each line with “Round { }” and output the numbers in the order of “the number of the discard points”, “the number of the clusters in the compression set”, “the number of the compression points”, and “the number of the points in the retained set”. I will leave one line in the middle before writing out the cluster results.

b. The clustering results (the line is named as “The clustering results”): I will include the data points index and their clustering results after the BFR algorithm. The clustering results should be in [0, the number of clusters). The cluster of outliers should be represented as -1.

Final Grade

  • Accuracy: 99.98%
  • Grade: 100%

data_mining_bfr_alg's People

Contributors

drewm8080 avatar

Watchers

Kostas Georgiou avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.