GithubHelp home page GithubHelp logo

ilmercu / mondrian-k-anonimity Goto Github PK

View Code? Open in Web Editor NEW

This project forked from andrew0133/mondrian-k-anonimity

0.0 0.0 0.0 6.59 MB

Implementation of Mondrian k-anonymity algorithm

License: Apache License 2.0

Python 100.00%
data-privacy data-protection k-anonimity mondrian-algorithm pandas python

mondrian-k-anonimity's Introduction

Mondrian k-anonimity algorithm

Introduction

Mondrian is a multidimensional k-anonymity model that anonymizes data through recursively splitting the attributes' dimensions with a median-partition strategy. This model is very fast and scalable.
The proposed solution is a Top-down greedy data anonymization algorithm for relational dataset, proposed by Kristen LeFevre in his papers.
The algorithm proposed by LeFevre imposes an intuitive ordering on each attribute. So, there are no generalization hierarchies for categorical attributes. This operation brings lower information loss, but worse semantic results.

Key concepts

Quasi-identifier (QI)

A QI set is a subset of attributes X1, ..., Xd in table T that could potentially be used to re-identify individual records. Examples of quasi-identifiers are sex, age, zip code, and race. There is no specific rule for determining which attributes are quasi-identifiers; in practical applications, a domain expert might need to determine the QI set among all the attributes in a table.

Equivalence class

With respect to attributes X1, ..., Xd in table T, an equivalence class is the set of all records in T containing identical values (x1, ..., xd) for X1, ..., Xd.

Generalization

The method of generalization transforms QI values into less specific, but semantically useful values so that records with the same transformed QI values are indistinguishable.

Local recoding

With local recoding, individual records are mapped to generalized forms. In this method, the data space is partitioned into different regions and then all records in the same region are mapped to the same generalized record. For example, a value of "60" for age in two records could possibly be mapped to two different intervals of "[55–60]" and "[60–65]" respectively if the two records are partitioned to different regions. Both multi-dimensional recoding (described above) and local recoding can improve the quality of anonymization without over-generalization and result in less information loss.

Strict model

In the strict mode, the algorithm splits a partition in two parts, lhs and rhs, by using a split value (median of partition projected on a quasi-identifier). These parts can't contain common split value, so there is no intersection between the two parts.

Performance metric

To asses the performance of the anonymization algorithm we used the concept of normalized average equivalence class size metric (CAVG). This metric measures how well partitioning approaches the best case (information loss). The formula is shown below:

Algorithm explanation

The algorithm starts with choosing the dim, one heuristic, used in our implementation, chooses the dimension (quasi-identifier) with the widest (normalized) range of values. If there is no allowable cut for choosed dim and the partition is greater than 2*k the algorithm changes dimension (QI) with the next widest (normalized) range of values, until there is no allowable cut for partition considering all QI’s.
After the value has been chosen it is calculated the frequency set of the dim.
Then the split value is calculated using the standard median-finding algorithm. The split value is the median of the partition projected on dim.
Then the current partition is splitted in two sub partitions based on a split_val parameter.
In the end, the Anonymize algorithm is called recursively in both sub partitions.

Pseudo-code algorithm

The following pseudo-code is based on paper and it was the starting point for the implemented solution.

Anonymize(partition)
    if (no allowable multidimensional cut for partition)
        return φ : partitionsummary
    else
        dimchoose_dimension()
        fsfrequency_set(partition, dim)
        splitValfind_median(fs)
        lhs ← {tpartition : t.dimsplitVal}
        rhs ← {tpartition : t.dim > splitVal}
        return Anonymize(rhs) ∪ Anonymize(lhs)

Datasets

For testing purpose there are different datasets in data folder:

  • adult.csv
  • data_test.csv
  • db_100.csv

Execution

Arguments

--qi, Quasi Identifier, required
--k, K-Anonimity, required
--dataset, Dataset to be anonymized, required
--rid, Remove id column, "y" otherwise "n"
--plt, Graphical representation test for different K, "y" otherwise "n"

Run examples

foo@bar:~$ python main.py --qi "age" "zip_code" "city_birth" --k 2 --dataset "data/db_100.csv"
...
foo@bar:~$ python main.py --qi "age" "work_class" "education_num" "marital_status" "occupation" "race" "sex" "native_country" --k 200 --dataset "data/adult.csv" --plt y
...

Output

The anonymized result is saved on data/output.csv file.
The file plot_output.jps will be generated if the argument --plt y is specified.

References

Mondrian_Multidimensiona_K-Anonymity.pdf

Authors

mondrian-k-anonimity's People

Contributors

andrew0133 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.