sdv-dev / sdmetrics

Metrics to evaluate quality and efficacy of synthetic datasets.

License: MIT License

Languages: Python 99.22%, Makefile 0.78%
Topics: synthetic-data, metrics, quality

sdmetrics's Introduction


This repository is part of The Synthetic Data Vault Project, a project from DataCebo.


Overview

The SDMetrics library evaluates synthetic data by comparing it to the real data that you're trying to mimic. It includes a variety of metrics to capture different aspects of the data, for example, quality and privacy. It also includes reports that you can run to generate insights, visualize data and share with your team.

The SDMetrics library is model-agnostic, meaning you can use any synthetic data. The library does not need to know how you created the data.

Install

Install SDMetrics using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

pip install sdmetrics
conda install -c conda-forge sdmetrics

For more information about using SDMetrics, visit the SDMetrics Documentation.

Usage

Get started with SDMetrics Reports using some demo data:

from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='single_table')

my_report = QualityReport()
my_report.generate(real_data, synthetic_data, metadata)
Creating report: 100%|██████████| 4/4 [00:00<00:00,  5.22it/s]

Overall Quality Score: 82.84%

Properties:
Column Shapes: 82.78%
Column Pair Trends: 82.9%

Once you generate the report, you can drill down on the details and visualize the results.

my_report.get_visualization(property_name='Column Pair Trends')

Save the report and share it with your team.

my_report.save(filepath='demo_data_quality_report.pkl')

# load it at any point in the future
my_report = QualityReport.load(filepath='demo_data_quality_report.pkl')

Want more metrics? You can also manually apply any of the metrics in this library to your data.

# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_column import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)
0.8503937007874016
# calculate whether the synthetic data is new or whether it's an exact copy of the real data
from sdmetrics.single_table import NewRowSynthesis

NewRowSynthesis.compute(
    real_data,
    synthetic_data,
    metadata
)
1.0

What's next?

To learn more about the reports and metrics, visit the SDMetrics Documentation.




The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

sdmetrics's People

Contributors

amontanez24, csala, fealho, frances-h, gsheni, k15z, katxiao, lajohn4747, npatki, pvk-developer, r-palazzo, rwedge, sdv-team, tejuafonja, zhuofanxie

sdmetrics's Issues

KSTestExtended - Fail when data contains PII fields

Environment Details

  • SDMetrics version: 0.3.0
  • Python version: Python 3.7
  • Operating System: Pop OS!

Error Description

When attempting to evaluate data that contains PII fields, the evaluation fails because the fake (anonymized) data doesn't contain a given record from the real data.

Steps to reproduce

Using the SDV tabular demo for PII:

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula

data_pii = load_tabular_demo('student_placements_pii')
model = GaussianCopula(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)

model.fit(data_pii)
new_data_pii = model.sample(200)

from sdv.metrics.tabular import KSTestExtended
KSTestExtended.compute(data_pii, new_data_pii)

This will end up producing the following error:

~/.virtualenvs/SDV/lib/python3.7/site-packages/rdt/transformers/categorical.py in _get_value(self, category)
    111             category = np.nan
    112 
--> 113         mean, std = self.intervals[category][2:]
    114 
    115         if self.fuzzy:

KeyError: 'USS Fowler\nFPO AA 99303'

Where the KeyError will change depending on the data that you may have on the real dataset.

Temporary workaround

Simply drop all the PII fields from both the real_data and the synthetic_data in order to evaluate with this metric.

Here is a working solution for this demo:

ks_data_pii = data_pii.drop('address', axis=1)
ks_new_data_pii = new_data_pii.drop('address', axis=1)
KSTestExtended.compute(ks_data_pii, ks_new_data_pii)

[Security] Workflow tests.yml is using vulnerable action actions/checkout

The workflow tests.yml references the action actions/checkout using the v1 reference. However, this reference is missing commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for a vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.

README doesn't accurately describe the output of `compute_metrics`

The README doesn't reflect the latest output. More specifically, the command sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata) currently doesn't print the same output as the README shows (e.g. the current code produces a column named error containing None values, which the README output doesn't have, among other changes).

Improve usability & documentation: Make SDMetrics easily interpretable

  • SDMetrics version: 0.4.5
  • Python version: 3.6.9
  • Operating System: Linux RHEL

Description

As end users, we find it difficult to interpret the results of SDMetrics. The documentation for SDMetrics could be more elaborate, so that users can easily interpret the quality of the generated data. Some parameters in the report are hard to interpret:

  1. Overall score.
  2. Detectability of synthetic data.

More comprehensive documentation is needed for SDMetrics.

What I Did

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

Implement privacy metrics

  • SDMetrics version: 0.1.1.dev0
  • Python version: 3.7.4
  • Operating System: MacOS Catalina 10.15.6

Description

Implement several privacy metrics for single_table data.

Scipy 1.6.0 causes an AttributeError

The latest version of scipy causes the following error:

AttributeError: 'str' object has no attribute 'decode' #981

Downgrading to a previous version fixes the issue, as suggested here.

More splits than classes

The code crashes when a class has fewer members than the requested number of splits for the dataset. Adding a try/except inside the single table detection metrics base and returning NaN could solve the issue (see the sketch below).

ValueError: n_splits=3 cannot be greater than the number of members in each class.
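
A minimal sketch of that guard, assuming numpy arrays and a scikit-learn classifier (the helper below is illustrative, not the actual SDMetrics internals):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def safe_auroc(X, y, n_splits=3):
    """Return the average AUROC, or NaN when a class has fewer members than n_splits."""
    try:
        scores = []
        kf = StratifiedKFold(n_splits=n_splits, shuffle=True)
        for train_index, test_index in kf.split(X, y):
            model = LogisticRegression(solver='lbfgs')
            model.fit(X[train_index], y[train_index])
            y_pred = model.predict_proba(X[test_index])[:, 1]
            scores.append(roc_auc_score(y[test_index], y_pred))

        return np.mean(scores)
    except ValueError:
        # Raised when n_splits is greater than the number of members in a class.
        return np.nan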

Relational `KSTest` crashes with `IncomputableMetricError` if a table doesn't have numerical columns

Environment Details

  • SDMetrics version: 0.4.1
  • Python version: 3.7

Error Description

The relational KSTest is supposed to run the KSTest on all numerical columns in all tables and return the average score.

However, this test crashes if it encounters a table that has no numerical columns. I expect this test to succeed as long as there is at least 1 numerical column in any of the tables.

Steps to reproduce

Use the relational demo dataset and pass it in with the metadata.

from sdv.metrics.demos import load_multi_table_demo
from sdv.metrics.relational import KSTest

real_data, synthetic_data, metadata = load_multi_table_demo()
KSTest.compute(real_data, synthetic_data, metadata)

Output:

/usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/base.py in _select_fields(cls, metadata, types)
     78 
     79         if len(fields) == 0:
---> 80             raise IncomputableMetricError(f'Cannot find fields of types {types}')
     81 
     82         return fields

IncomputableMetricError: Cannot find fields of types ('numerical',)

I believe this is happening because the sessions table has no numerical columns. Interestingly, it does work if I exclude the metadata object -- because then it starts assuming that the id field is a numerical column.

KSTest.compute(real_data, synthetic_data)

0.8555555555555556

New features for `KSTest.compute`

Update KSTest to include some new features for usability and evaluation.

Expected behavior

  1. Rename the class to KSComplement to be more descriptive
  2. Accept column type 'datetime'. If there are datetime columns, convert them to numerical using pandas built-in functionality (NOT RDTs)
  3. Ignore NaN values instead of filling them with 0s. By converting to 0s right now, we are changing the distributions (see the sketch after this list)
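
A minimal sketch of items 2 and 3, using plain pandas and scipy (this is illustrative, not the actual SDMetrics implementation):

import pandas as pd
from scipy.stats import ks_2samp


def ks_complement(real_column, synthetic_column):
    """Sketch: 1 - KS statistic, ignoring NaNs and converting datetimes with pandas."""
    # Item 3: ignore NaN/NaT values instead of filling them with 0s.
    real = pd.Series(real_column).dropna()
    synth = pd.Series(synthetic_column).dropna()

    # Item 2: convert datetime columns to numerical values using pandas built-ins.
    if pd.api.types.is_datetime64_any_dtype(real):
        real = real.astype('int64')
        synth = synth.astype('int64')

    result = ks_2samp(real, synth)
    return 1 - result.statistic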

Report metric errors from `compute_metrics`

Problem Description

Currently, compute_metrics does not capture what metrics are erroring out and what errors are being thrown. Instead, metrics that error out are being reported as NaNs and ultimately are ignored. As a result, end users have little information about what metrics are erroring out and why. Capturing this information could be useful for users of the library (such as sdv.evaluate) when debugging usage.

Expected behavior

We could add an error column to the final scores DataFrame.
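
A minimal sketch of that idea, as a hypothetical wrapper around the individual metric classes (not the actual compute_metrics internals):

import pandas as pd


def compute_metrics_with_errors(metrics, real_data, synthetic_data, **kwargs):
    """Sketch: run each metric and record either its score or the error it raised."""
    rows = []
    for name, metric in metrics.items():
        score, error = None, None
        try:
            score = metric.compute(real_data, synthetic_data, **kwargs)
        except Exception as exc:
            error = str(exc)

        rows.append({'metric': name, 'score': score, 'error': error})

    return pd.DataFrame(rows)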

Why are the frequencies normalized in `CSTest`?

I noticed that when I use the SDMetrics version for CSTest for a single column, it is providing higher pvalues than if I directly use scipy.stats.chisquare.

Upon closer examination, I think this is because CSTest is normalizing the frequencies (so they add up to 1) before calling chisquare. Why is this done? In fact, if I read the scipy docs, it provides a note that makes me believe it's expecting the total counts, not the normalized frequency:

This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.

Concrete example below:

expected = ['A']*10 + ['B']*10 + ['C']*10
observed = ['A']*15 + ['B']*15

# returns p value 0.7788
CSTest.compute(pd.DataFrame(data=expected), pd.DataFrame(data=observed))

# using normalized frequencies, pvalue is the same
chisquare([0.5, 0.5, 0], [0.33333333, 0.333333333, 0.33333333])

# using actual frequencies, pvalue is much lower at 0.00055
chisquare([15, 15, 0], [10, 10, 10]) 

If this is to be an indication of whether the synthesized data properly fits the real data, shouldn't we stop normalizing this way? As we synthesize more data points, we should expect more & more confidence about whether the synthesized data matches the real data?

For summarized metrics, add a method to get more details

Problem Description

A few of the metrics in this library will return a single number that is actually the combination of several values. For example,

  • CSTest runs chisquare across each categorical column and returns the average
  • ContinuousKLDivergence computes an entropy score across each pairwise combination (nC2) of numerical columns, and then returns an overall normalized score

A singular metric that summarizes multiple scores may not be that useful for debugging or auditing in greater detail.

Expected behavior

What if for each metric class, we added a get_details method that will return the intermediary results used to generate the summarized metric?

Eg.

  • For CSTest, it would return the chisquare results for each individual column
  • For ContinuousKLDivergence, it would return the pairwise entropy scores
  • For metrics that aren't summaries of multiple values, it can return None or raise an error
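
For the CSTest case, a minimal sketch of what such a breakdown could return, using plain pandas and scipy (the helper name and output structure are illustrative):

import pandas as pd
from scipy.stats import chisquare


def cstest_details(real_data, synthetic_data, categorical_columns):
    """Sketch: per-column chi-square results behind a summarized CSTest score."""
    details = {}
    for column in categorical_columns:
        real_counts = real_data[column].value_counts()
        synth_counts = synthetic_data[column].value_counts()

        # Align the synthetic counts to the categories observed in the real data
        # and scale the expected counts to the synthetic sample size.
        observed = synth_counts.reindex(real_counts.index, fill_value=0)
        expected = real_counts * observed.sum() / real_counts.sum()

        statistic, p_value = chisquare(observed, expected)
        details[column] = {'statistic': statistic, 'p_value': p_value}

    return details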

Cap pomegranate to <0.14.7

The latest pomegranate version, 0.14.7, does not work well with numpy<1.22 because of incompatibilities between the Python code and the compiled C backend.

When installed, SDMetrics currently ends up using numpy==1.21.5 because of third party restrictions (numba is not compatible with numpy~=1.22 yet), which means that SDMetrics crashes when pomegranate is used with the following error:

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

An example of this can be seen here: https://github.com/sdv-dev/SDMetrics/runs/4871113960?check_suite_focus=true

We can fix this by temporarily capping pomegranate on <0.14.7 until numba extends support to the latest numpy version.

Time series metrics fail with variable-length time series

The LSTM classifier doesn't support variable length time series unless they are sorted from longest to shortest. Since we don't need ONNX compatibility, we can remove this restriction:

RuntimeError: lengths array must be sorted in decreasing order when enforce_sorted is True. You can pass enforce_sorted=False to pack_padded_sequence and/or pack_sequence to sidestep this requirement if you do not need ONNX exportability.

Support mixed types in Privacy Metrics

Problem Description

The Privacy Metrics assume an adversarial attack model where a user with access to a few key_fields might be able to predict sensitive_fields.

I understand that we need to fit different models based on whether the sensitive_fields are categorical vs. numeric. However, it is expected that all the key_fields are also of the same type. Does this need to be the case? What if I think some categorical columns might be crucial in leaking numeric data (and vice versa)?

Expected behavior

Depending on the type of the sensitive_fields, it would be nice to convert the input columns so that they are compatible with the tests.

  1. If the sensitive_fields are numeric, then we can convert categorical key_fields to numeric similar to how we do it in KSTestExtended
  2. If the sensitive_fields are categorical, then it may be possible to bin the key_fields

Additional context

  • What should the user API be? It would be ideal to guide the user into making a choice (to drop the columns or convert them)
  • Should we be converting the columns ourselves or should we expect users to do this first (eg. using a transformer)?

KS Test Error Handling

Problem Description

As a user, I want only the relevant errors surfaced to me and expected behavior to be suppressed.

For now, focus on the KS Test metric (see #129 and #130 for more details)

Expected behavior

  • Create a new MetricComputationError to be used when there is a mathematical error when computing the metric (eg. when calling scipy or dividing by zero)

For tabular and relational tests compute method:

  • If there are 0 columns with valid data types, return None and throw a warning. This is not an error; the metric is simply undefined.
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
Warning: Incompatible data types. The InvertedKSTest is only defined for column types ['datetime', 'numerical']. None were found in the data.
None
  • If the entire test is resulting in mathematical errors (eg all results are invalid), throw a MetricComputationError
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
MetricComputationError: <message>
  • However, if only certain columns are returning a MetricComputationError, show a warning but keep going with the other columns
>>> InvertedKSTest.compute(real_data, synthetic_data, metadata)
Warning: InvertedKSTest returned a MetricComputationError for column 'user_age'. Skipping this column.
Warning: InvertedKSTest returned a MetricComputationError for column 'weight'. Skipping this column.
0.699382

SDMetrics 0.4.2 has incompatible copula version with SDV

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.4.2
  • SDV version: 0.14.1
  • Python version: 3.8.10
  • Operating System: ubuntu server lts 20.4.04

Error Description

The latest SDMetrics version (which is installed by default when installing SDV) has incompatible copula requirements with downstream SDV.

Steps to reproduce

On a fresh virtual environment, install pip-tools.

Place the following on a file named requirements.in

sdv
#sdmetrics==0.4.1

Type the following commands

pip install -r requirements.in
pip-compile requirements.in

pip-compile reports:

Could not find a version that matches copulas<0.7,<0.8,>=0.6.1,>=0.7.0 (from sdv==0.14.1->-r requirements.txt (line 1))
Tried: 0.0.0, 0.0.0, 0.1.0, 0.1.0, 0.1.1, 0.1.1, 0.2.0, 0.2.0, 0.2.1, 0.2.1, 0.2.3, 0.2.3, 0.2.4, 0.2.4, 0.2.5, 0.2.5, 0.3.0, 0.3.0, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.6.0, 0.6.0, 0.6.1, 0.6.1, 0.7.0, 0.7.0
Skipped pre-versions: 0.3.0.dev0, 0.3.0.dev0, 0.3.2.dev1, 0.3.2.dev1, 0.3.3.dev0, 0.3.3.dev0, 0.4.0.dev0, 0.4.0.dev0, 0.5.0.dev0, 0.5.0.dev0, 0.5.0.dev1, 0.5.0.dev1, 0.5.1.dev0, 0.5.1.dev0, 0.5.1.dev1, 0.5.1.dev1, 0.5.2.dev0, 0.5.2.dev0, 0.5.2.dev1, 0.5.2.dev1, 0.6.0.dev0, 0.6.0.dev0, 0.6.1.dev0, 0.6.1.dev0, 0.7.0.dev0, 0.7.0.dev0
There are incompatible versions in the resolved dependencies:
  copulas<0.8,>=0.7.0 (from sdmetrics==0.4.2->sdv==0.14.1->-r requirements.txt (line 1))
  copulas<0.7,>=0.6.1 (from sdv==0.14.1->-r requirements.txt (line 1))

From the setup.py of both projects, we can verify the above requirements.

pip install works correctly, but we get the following (snippet):

Collecting llvmlite<0.39,>=0.38.0rc1
  Using cached llvmlite-0.38.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting charset-normalizer~=2.0.0; python_version >= "3"
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5; python_version >= "3"
  Using cached idna-3.3-py3-none-any.whl (61 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
ERROR: numba 0.55.1 has requirement numpy<1.22,>=1.18, but you'll have numpy 1.22.3 which is incompatible.
ERROR: rdt 0.6.4 has requirement scipy<1.8,>=1.5.4, but you'll have scipy 1.8.0 which is incompatible.
ERROR: sdmetrics 0.4.2 has requirement copulas<0.8,>=0.7.0, but you'll have copulas 0.6.1 which is incompatible.
Installing collected packages: tqdm, typing-extensions, torch, numpy, six, python-dateutil, pytz, pandas, deepecho, scipy, threadpoolctl, joblib, scikit-learn, llvmlite, numba, pyts, pyyaml, psutil, rdt, fonttools, cycler, pyparsing, packaging, pillow, kiwisolver, matplotlib, copulas, sdmetrics, charset-normalizer, idna, certifi, urllib3, requests, torchvision, ctgan, graphviz, text-unidecode, Faker, sdv

When uncommenting sdmetrics from requirements.in, both commands run "correctly".

Furthermore, when pip-compile and pip have cached sdmetrics==0.4.1, they both select that version instead and no error is shown.

The following file never compiles:

sdv==0.14.1
sdmetrics==0.4.2

I don't know what the appropriate solution to something like this would be. I'm not a library developer.

Does `CSTest` quantify the synthesis of missing values?

If I have a table with some missing values, I want to synthesize data with missing values too -- ideally in the same ratio. I'm curious whether CSTest is an appropriate signal of this? If it isn't, should we modify it to be?

Details: From the API reference

This function applies the single column CSTest metric to all the discrete columns found in the table and then returns the average of all the scores obtained.

I know that the SDV internally creates a new, discrete binary column representing whether a column is null. But I don't know if this column is used in the CSTest computation because it's dropped before returning the synthetic data.

Privacy Metrics error if target column has missing values

Environment Details

  • SDV version: 0.13.0
  • Python version: 3.8.9
  • Operating System: MacOS

Error Description

The Numerical Privacy Metrics throw an error whenever the target columns (sensitive_fields) contain missing values.

Steps to Reproduce

Go through the User Guide to import & load data. Then, scroll down to the Privacy Metrics section.

The following code should work as-is according to the user guide.

NumericalLR.compute(
    real_data,
    synthetic_data,
    key_fields=['second_perc', 'mba_perc', 'degree_perc'],
    sensitive_fields=['salary']
)

However, when I try to run this, I get an error from sklearn because the salary column contains NaN values:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Note: The same error is thrown when any of the key_fields contain missing values too. Eg. if I switch around salary and degree_perc in the above example.

Suggested Fix

This used to work, so either this was a recent change on SDV or in sklearn. What were we doing before? Were we dropping the NaN values, filling them or imputing them?

Also, maybe it's ok if it crashes upon first running. Maybe the user can re-run with a flag for handling the missing values.
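
A minimal sketch of one possible behavior, dropping rows with missing values in the relevant fields before running the metric (a hypothetical wrapper, not current SDMetrics behavior):

from sdmetrics.single_table.privacy import NumericalLR


def numerical_lr_without_nans(real_data, synthetic_data, key_fields, sensitive_fields):
    """Sketch: drop rows with NaNs in the key/sensitive fields, then compute the metric."""
    columns = key_fields + sensitive_fields
    real = real_data.dropna(subset=columns)
    synthetic = synthetic_data.dropna(subset=columns)

    return NumericalLR.compute(
        real,
        synthetic,
        key_fields=key_fields,
        sensitive_fields=sensitive_fields,
    )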

ValueError: Input contains infinity or a value too large for dtype('float64').

sdv.evaluate call sometimes fails with a ValueError: Input contains infinity or a value too large for dtype('float64').

The exception is raised in the pipeline fit step inside sdmetrics/detection/tabular/logistic.py.

We should review whether we can prevent this, or at least capture it and return a 0 (see the sketch after the traceback).

This is the full traceback:

  ------------------
  from sdv.evaluation import evaluate
  
  evaluate(new_data, data)
  ------------------
  
  ---------------------------------------------------------------------------
  ValueError                                Traceback (most recent call last)
  <ipython-input-1-349ebfb54984> in <module>
        1 from sdv.evaluation import evaluate
        2 
  ----> 3 evaluate(new_data, data)
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in evaluate(synthetic_data, real_data, metadata, root_path, table_name, metrics, get_report, aggregate)
      152     computed = {}
      153     for metric in metrics:
  --> 154         computed[metric] = METRICS[metric](synth, real, metadata, details=get_report)
      155 
      156     if get_report:
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in _logistic_detection(synthetic, real, metadata, details)
       98 
       99 def _logistic_detection(synthetic, real, metadata=None, details=False):
  --> 100     return _tabular_metric(LogisticDetector(), synthetic, real, metadata, details)
      101 
      102 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in _tabular_metric(sdmetric, synthetic, real, metadata, details)
       86         return list(metrics)
       87 
  ---> 88     return np.mean([metric.value for metric in metrics])
       89 
       90 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdv/evaluation.py in <listcomp>(.0)
       86         return list(metrics)
       87 
  ---> 88     return np.mean([metric.value for metric in metrics])
       89 
       90 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in metrics(self, metadata, real_tables, synthetic_tables)
       48             Metric: The next metric.
       49         """
  ---> 50         yield from self._single_table_detection(metadata, real_tables, synthetic_tables)
       51         yield from self._parent_child_detection(metadata, real_tables, synthetic_tables)
       52 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in _single_table_detection(self, metadata, real_tables, synthetic_tables)
       57             auroc = self._compute_auroc(
       58                 real_tables[table_name][table_fields],
  ---> 59                 synthetic_tables[table_name][table_fields])
       60 
       61             yield Metric(
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/base.py in _compute_auroc(self, real_table, synthetic_table)
      123         kf = StratifiedKFold(n_splits=3, shuffle=True)
      124         for train_index, test_index in kf.split(X, y):
  --> 125             self.fit(X[train_index], y[train_index])
      126             y_pred = self.predict_proba(X[test_index])
      127             auroc = roc_auc_score(y[test_index], y_pred)
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sdmetrics/detection/tabular/logistic.py in fit(self, X, y)
       22             ('classifier', LogisticRegression(solver="lbfgs")),
       23         ])
  ---> 24         self.model.fit(X, y)
       25 
       26     def predict_proba(self, X):
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
      328         """
      329         fit_params_steps = self._check_fit_params(**fit_params)
  --> 330         Xt = self._fit(X, y, **fit_params_steps)
      331         with _print_elapsed_time('Pipeline',
      332                                  self._log_message(len(self.steps) - 1)):
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
      294                 message_clsname='Pipeline',
      295                 message=self._log_message(step_idx),
  --> 296                 **fit_params_steps[name])
      297             # Replace the transformer of the step with the fitted
      298             # transformer. This is necessary when loading the transformer
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
      350 
      351     def __call__(self, *args, **kwargs):
  --> 352         return self.func(*args, **kwargs)
      353 
      354     def call_and_shelve(self, *args, **kwargs):
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
      738     with _print_elapsed_time(message_clsname, message):
      739         if hasattr(transformer, 'fit_transform'):
  --> 740             res = transformer.fit_transform(X, y, **fit_params)
      741         else:
      742             res = transformer.fit(X, y, **fit_params).transform(X)
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
      691         else:
      692             # fit method of arity 2 (supervised transformation)
  --> 693             return self.fit(X, y, **fit_params).transform(X)
      694 
      695 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
     1200         X = self._validate_data(X, accept_sparse='csc', estimator=self,
     1201                                 dtype=FLOAT_DTYPES,
  -> 1202                                 force_all_finite='allow-nan')
     1203 
     1204         q_min, q_max = self.quantile_range
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
      418                     f"requires y to be passed, but the target y is None."
      419                 )
  --> 420             X = check_array(X, **check_params)
      421             out = X
      422         else:
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
       70                           FutureWarning)
       71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
  ---> 72         return f(**kwargs)
       73     return inner_f
       74 
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
      643         if force_all_finite:
      644             _assert_all_finite(array,
  --> 645                                allow_nan=force_all_finite == 'allow-nan')
      646 
      647     if ensure_min_samples > 0:
  
  ~/work/SDV/SDV/.tox/py36/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
       97                     msg_err.format
       98                     (type_err,
  ---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
      100             )
      101     # for object dtype data, we only check for NaNs (GH-13254)
  
  ValueError: Input contains infinity or a value too large for dtype('float64').
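
A minimal sketch of the 'capture it and return a 0' option, wrapping the private _compute_auroc helper that appears in the traceback above (illustrative only):

def safe_compute_auroc(detector, real_table, synthetic_table):
    """Sketch: return 0 instead of propagating sklearn's ValueError."""
    try:
        return detector._compute_auroc(real_table, synthetic_table)
    except ValueError:
        # e.g. "Input contains infinity or a value too large for dtype('float64')."
        return 0.0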

`CategoricalSVM` not being imported

The CategoricalSVM class should be added to the imports in the sdmetrics/single_table/privacy/__init__.py and sdmetrics/single_table/__init__.py files. It should also be added to the readme.

Running a detection metric on time series data with no entity_columns fails

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.3.2
  • Python version: 3.8.6
  • Operating System: CentOS

Error Description

My data is a sequence of a single type with no entity_columns set. As described in the user guide, model training works fine; however, I'm unable to run the detection metrics to check the "goodness" of fit. The error I get is ValueError: No group keys passed!

Looking into the sources, the problem is indeed that the code relies on the entity_columns variable:

@staticmethod
def _build_x(data, transformer, entity_columns):
    X = pd.DataFrame()
    for entity_id, entity_data in data.groupby(entity_columns):
        entity_data = entity_data.drop(entity_columns, axis=1)
        entity_data = transformer.transform(entity_data)
        entity_data = pd.Series({
            column: entity_data[column].values
            for column in entity_data.columns
        }, name=entity_id)
        X = X.append(entity_data)
    return X

I'm wondering whether a simple change can fix the problem correctly (see below). Could you please confirm if this is the right way of thinking?

This is change in _build_x:

    def _build_x(data, transformer, entity_columns):
        X = pd.DataFrame()
        if entity_columns:
            for entity_id, entity_data in data.groupby(entity_columns):
                # code as in the original detection.py L41-47...

        else:
            entity_data = transformer.transform(data)
            entity_data = pd.Series({
                column: entity_data[column].values
                for column in entity_data.columns
            })
            X = pd.DataFrame([entity_data])

        return X

and one more in the compute method, line

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)

to be changed to:

        if entity_columns:
            X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, stratify=y)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

Steps to reproduce

A simple example based on the user guide:

from sdv.demo import load_timeseries_demo
from sdv.timeseries import PAR
from sdv.metrics.timeseries import LSTMDetection

data = load_timeseries_demo()
no_context = data[['Symbol', 'Date', 'Open', 'Close', 'Volume']].copy()
real_data = no_context[no_context.Symbol == 'TSLA'].copy()
del real_data['Symbol']

sequence_index = 'Date'
model = PAR(sequence_index = sequence_index)

print('Fitting model...')
model.fit(real_data)

print('Sampling data...')
synth_data = model.sample()

print('Running evaluation...')
val = LSTMDetection.compute(real_data, synth_data, metadata={
    'sequence_index': sequence_index,
    'fields': {
        'Date': {'type': 'datetime'},
        'Open': {'type': 'numerical', 'subtype': 'float'},
        'Close': {'type': 'numerical', 'subtype': 'float'},
        'Volume': {'type': 'numerical', 'subtype': 'integer'}
    }})

Once this is run, the error traceback is as follows:

     val = LSTMDetection.compute(real_data, synth_data, metadata={
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 85, in compute
    real_x = cls._build_x(real_data, transformer, entity_columns)
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\sdmetrics\timeseries\detection.py", line 40, in _build_x
    for entity_id, entity_data in data.groupby(entity_columns):
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\frame.py", line 6515, in groupby
    return DataFrameGroupBy(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\groupby.py", line 525, in __init__
    grouper, exclusions, obj = get_grouper(
  File "...\.virtualenvs\SDV-tFkSueKv\lib\site-packages\pandas\core\groupby\grouper.py", line 821, in get_grouper
    raise ValueError("No group keys passed!")
ValueError: No group keys passed!

Logistic Detection metric goal

The logistic detection metric is calculated as 1 minus the average ROC AUC score across all validation splits.
A ROC AUC of 0.5 means that the classifier is making predictions at random because data is indistinguishable.

If the goal is to determine how similar the synthetic and original data are, shouldn't the goal of this metric be to get as close as possible to 0.5 (meaning the classifier can't distinguish between the two sets), rather than to maximize it?

Please help me clarify this
Thank you

Add `pip check` to CI workflows

The current CI workflows and local test and lint commands do not catch dependency incompatibilities.

For example, installing the repository for development on the v0.6.0-dev branch results in these errors:

ERROR: numba 0.54.1 has requirement numpy<1.21,>=1.17, but you'll have numpy 1.21.3 which is incompatible.
ERROR: sphinx-rtd-theme 0.5.2 has requirement docutils<0.17, but you'll have docutils 0.18 which is incompatible.
ERROR: autopep8 1.6.0 has requirement pycodestyle>=2.8.0, but you'll have pycodestyle 2.7.0 which is incompatible.

A pip check command should be made part of the local and CI tests to make sure that our dependency tree is always clean.

Get `KSComplement` Score Breakdown

Problem Description

As a user, I want to get a per-column breakdown of KS Scores when using this test on a tabular or multi-table dataset.

Expected behavior

See #129 for background.

Create a new method compute_breakdown that has the same arguments as compute but returns a per-column breakdown instead of a summarized score.

Requirements:

  • Return a dictionary of column names to scores. In the case of relational, it will be a nested dictionary.
  • All column names should be present, even ones that are not the correct data type. Score should be None when the data type is not compatible
  • If any other error occurs, the score should have the error object

Examples

Tabular:

from sdv.metrics.tabular import KSComplement 

KSComplement.compute_breakdown(real_data, synthetic_data, metadata)
{
  'age': 0.545656,
  'weight': 2343434,
  'gender': None, # the data type is categorical, which is not compatible
  'gpa': Error # this is a valid type but there was an error running it
}

Relational: Returned object is in the nested form, with the table names at the top

from sdv.metrics.relational import KSComplement 

KSComplement.compute_breakdown(real_data, synthetic_data, metadata)
{
  'users': {
    'age': 0.545656,
    'weight': 2343434,
    'gender': None, # the data type is categorical, which is not compatible
    'gpa': Error # this is a valid type but there was an error running it
  }, 
 'transactions': {
    'transaction_id': None,
    'purchase_amt': 0.988191
 }
}

Other Context

Eventually, we'll want to do the same thing for other metrics that are actually summaries of multiple scores.

`TSF` cannot handle series of variable length

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.3.2
  • Python version: 3.7

Error Description

Metrics TSFClassifierEfficacy and TSFCDetection crash when we try to compute the metrics of time series with non-fixed length. That is because sktime.classification.compose.TimeSeriesForestClassifier does not handle time series of variable length. We get the following error:

ValueError: Tabularization failed, it's possible that not all series were of equal length

We could either pad or truncate the time series to handle this situation (a sketch of the padding option follows).
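
A minimal sketch of the padding option, assuming sequences are grouped by the entity columns (a hypothetical helper, not part of SDMetrics):

import pandas as pd


def pad_sequences(data, entity_columns, fill_value=0.0):
    """Sketch: pad every entity's sequence to the longest length so sktime can tabularize."""
    groups = [group.reset_index(drop=True) for _, group in data.groupby(entity_columns)]
    max_length = max(len(group) for group in groups)

    padded = []
    for group in groups:
        missing = max_length - len(group)
        if missing:
            pad = pd.DataFrame({column: [fill_value] * missing for column in group.columns})
            for column in entity_columns:
                # Keep the entity identifier consistent in the padded rows.
                pad[column] = group[column].iloc[0]
            group = pd.concat([group, pad], ignore_index=True)
        padded.append(group)

    return pd.concat(padded, ignore_index=True)

Truncating every sequence to the shortest length would be the symmetric alternative.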

Steps to reproduce

import pandas as pd
from sdmetrics.timeseries import TSFCDetection, TSFClassifierEfficacy

real = pd.DataFrame({
    "seq_index": [1, 1, 2, 2, 2],
    "dim_0": [0, 0, 0, 0, 1],
    "dim_1": [3, 4, 3, 3, 3] 
})

synth = pd.DataFrame({
    "seq_index": [1, 1, 2, 2, 2],
    "dim_0": [1, 1, 0, 0, 1],
    "dim_1": [4, 4, 3, 3, 3] 
})

TSFClassifierEfficacy.compute(real, synth, entity_columns=['seq_index'], target='dim_0')

full trace

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-29d0a52ef600> in <module>()
----> 1 TSFClassifierEfficacy.compute(real, synth, entity_columns=['seq_index'], target='dim_0')

/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/efficacy/base.py in compute(cls, real_data, synthetic_data, metadata, entity_columns, target)
    107             real_data, synthetic_data, metadata, entity_columns, target)
    108 
--> 109         return cls._compute_score(real_data, synthetic_data, entity_columns, target)

/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/efficacy/base.py in _compute_score(cls, real_data, synthetic_data, entity_columns, target)
     78 
     79         real_acc = cls._scorer(real_x_train, real_x_test, real_y_train, real_y_test)
---> 80         synt_acc = cls._scorer(synt_x, real_x_test, synt_y, real_y_test)
     81 
     82         return synt_acc / real_acc

/usr/local/lib/python3.7/dist-packages/sdmetrics/timeseries/ml_scorers.py in tsf_classifier(X_train, X_test, y_train, y_test)
     19     ]
     20     clf = Pipeline(steps)
---> 21     clf.fit(X_train, y_train)
     22     return clf.score(X_test, y_test)
     23 

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    339         """
    340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
    342         with _print_elapsed_time('Pipeline',
    343                                  self._log_message(len(self.steps) - 1)):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
    305                 message_clsname='Pipeline',
    306                 message=self._log_message(step_idx),
--> 307                 **fit_params_steps[name])
    308             # Replace the transformer of the step with the fitted
    309             # transformer. This is necessary when loading the transformer

/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    350 
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353 
    354     def call_and_shelve(self, *args, **kwargs):

/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    752     with _print_elapsed_time(message_clsname, message):
    753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
    755         else:
    756             res = transformer.fit(X, y, **fit_params).transform(X)

/usr/local/lib/python3.7/dist-packages/sktime/transformations/base.py in fit_transform(self, Z, X)
     89         else:
     90             # Fit method of arity 2 (supervised transformation)
---> 91             return self.fit(Z, X).transform(Z)
     92 
     93     # def inverse_transform(self, Z, X=None):

/usr/local/lib/python3.7/dist-packages/sktime/transformations/panel/compose.py in transform(self, X, y)
    217         # them into a single column
    218         if isinstance(X, pd.DataFrame):
--> 219             Xt = from_nested_to_2d_array(X)
    220         else:
    221             Xt = from_3d_numpy_to_2d_array(X)

/usr/local/lib/python3.7/dist-packages/sktime/utils/data_processing.py in from_nested_to_2d_array(X, return_numpy)
    178         if Xt.ndim != 2:
    179             raise ValueError(
--> 180                 "Tabularization failed, it's possible that not "
    181                 "all series were of equal length"
    182             )

ValueError: Tabularization failed, it's possible that not all series were of equal length

Improve metric subclasses organization

The current class organization includes some metrics which are usable on their own, like BNLikelihood, but which have subclasses and therefore are not picked up by the get_subclasses method. As a result, BNLikelihood is not available in the sdv.evaluation.evaluate function: sdv-dev/SDV#327

The class organization and subclass selection should be reviewed to ensure that all the usable metrics are properly selected by the get_subclasses method.

Detection test doesn't look at metadata when determining which columns to use

Environment details

If you are already running SDMetrics, please indicate the following details about the environment in
which you are running it:

  • SDMetrics version: 2.4.2-dev0
  • Python version: 3.9
  • Operating System: ubuntu 20.04

Problem description

When the primary_key is set, the generated data index restarts from zero.
As a consequence, detection metrics can trivially detect generated instances by setting a threshold on the primary_key.

What I already tried

I will propose a patch to remove primary_key columns, if set, from these tests.

Numerical data passed to a categorical privacy metric should raise an error

And vice-versa. Currently if the wrong datatype is passed it will simply return nan. It should raise an error instead.

Below is code to reproduce this phenomenon:

import pandas as pd
from sdmetrics.single_table.privacy import CategoricalCAP


data = pd.DataFrame({   # data containing only numerical values
    'key': [1.4, 10.12, 3.4],
    'sensitive': [10.9, 9.8, 8.8]
})

score = CategoricalCAP.compute(  # privacy metric that's supposed to only work with categorical values
    data,
    data, 
    key_fields=['key'],
    sensitive_fields=['sensitive']
)

print(score) # this will print `nan`

Omit id columns from metric calculations

Problem Description

Columns of type id should be dropped when computing metrics. ID columns are not synthetically generated, so should not contribute to the metric calculation. Currently they are being classified as categorical, which is incorrect.

Expected behavior

DetectionMetric should use the provided metadata to drop id columns from the real and synthetic data when computing the metric. This logic should also be applied to any other relevant metrics.
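
A minimal sketch of the expected behavior, assuming the single-table metadata format with a 'fields' mapping used elsewhere in this repository (the helper is illustrative):

def drop_id_columns(real_data, synthetic_data, metadata):
    """Sketch: remove columns marked as 'id' in the metadata before computing a metric."""
    id_columns = [
        name
        for name, field in metadata.get('fields', {}).items()
        if field.get('type') == 'id'
    ]

    real = real_data.drop(columns=id_columns, errors='ignore')
    synthetic = synthetic_data.drop(columns=id_columns, errors='ignore')
    return real, synthetic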

`compute_metrics` for metrics with different signatures

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.3.2
  • Python version: 3.7

Error Description

When using sdmetrics.compute_metrics, we get None for metrics that were not able to resolve their keyword arguments. In most cases, we expect the user to pass a dictionary of metrics to compute_metrics, but if the signatures of these metrics differ, we get an error that we catch and then store None in its place.

For example, take the following two time series metrics TSFCDetection and TSFClassifierEfficacy. The first one is expected to be called with

TSFCDetection.compute(data, sampled, entity_columns=entity_columns)

and the second one with an additional argument target

TSFClassifierEfficacy.compute(data, sampled, entity_columns=entity_columns, target=target)

compute_metrics will pass target to both classes which TSFCDetection cannot handle, thus crashing.

Steps to reproduce

import pandas as pd
import sdmetrics
from sdmetrics.timeseries import TSFCDetection, TSFClassifierEfficacy

length = 10

real = pd.DataFrame({
    "seq_index": [1, 2, 3] * length,
    "dim_0": [0, 0, 0] * length,
    "dim_1": [4, 4, 4] * length
})

synth = pd.DataFrame({
    "seq_index": [1, 2, 3] * length,
    "dim_0": [1, 0, 0] * length,
    "dim_1": [4, 4, 3] * length
})

metrics = {
    'TSFClassifierEfficacy': TSFClassifierEfficacy,
    'TSFCDetection': TSFCDetection
}

sdmetrics.compute_metrics(metrics, real, synth, entity_columns=['seq_index'], target='dim_0')

ParentChildDetection metrics KeyError

  • SDMetrics version: v0.1.1

Description

The LogisticParentChildDetection metric crashes with a KeyError if the names of the primary_key and foreign_key are different and either table has another field with the same name as the key on the other table.

For example, the parent table has the field id as its primary key and a child table contains both the id as its own primary key and parent_id as the foreign key to the parent. When this happens, the id fields end up converted to id_x and id_y during the merge, and then the del statements after that fail.

How to reproduce

In [1]: import pandas as pd

In [2]: parent = pd.DataFrame({'id': [1, 2, 3, 4]})

In [3]: child = pd.DataFrame({'id': [1, 2, 3, 4], 'parent_id': [1, 2, 3, 4]})

In [4]: foreign_keys = [('parent', 'id', 'child', 'parent_id')]

In [5]: data = {'parent': parent, 'child': child}

In [6]: from sdmetrics.multi_table import LogisticParentChildDetection

In [7]: LogisticParentChildDetection.compute(data, data, foreign_keys=foreign_keys)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'id'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-91c73836e519> in <module>
----> 1 LogisticParentChildDetection.compute(data, data, foreign_keys=foreign_keys)

~/Projects/MIT/SDMetrics/sdmetrics/multi_table/detection/parent_child.py in compute(cls, real_data, synthetic_data, metadata, foreign_keys)
    104         scores = []
    105         for foreign_key in foreign_keys:
--> 106             real = cls._denormalize(real_data, foreign_key)
    107             synth = cls._denormalize(synthetic_data, foreign_key)
    108             scores.append(cls.single_table_metric.compute(real, synth))

~/Projects/MIT/SDMetrics/sdmetrics/multi_table/detection/parent_child.py in _denormalize(data, foreign_key)
     61         )
     62 
---> 63         del flat[parent_key]
     64         if child_key != parent_key:
     65             del flat[child_key]

~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/generic.py in __delitem__(self, key)
   3709             # there was no match, this call should raise the appropriate
   3710             # exception:
-> 3711             loc = self.axes[-1].get_loc(key)
   3712             self._mgr.idelete(loc)
   3713 

~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 'id'

Total Variation Distance (TVD) to compare categorical distributions

Problem Description

To compare a single column of data (synthetic vs. real), the SDV offers KSTest for numerical values and CSTest for categorical.

While KSTest is easy to understand & reason about, CSTest is not: It's returning a p-value, which is based on a confidence interval and subject to change if there are a different # of rows (even if the categories are in the same proportions) .

I would like another, easy-to-understand metric for comparing 2 categorical distributions. It should generally tell me if the frequencies of the categories are similar between the real vs. synthetic data.

Expected behavior

Should we add a TVD metric to compute the distance between 2 categorical columns? Per this tutorial the metric definition would be:

1/2 * SUM{across all categories}( abs(synthetic_frequency - real_frequency) )

Additional context

It seems to me that TVD for categorical variables is a similar & useful complement to the KSTest. It's computing the absolute value of differences between the two distributions, and it's also naturally bounded between 0 and 1.
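
A minimal sketch of the proposed metric for two categorical columns, using plain pandas (illustrative only):

import pandas as pd


def total_variation_distance(real_column, synthetic_column):
    """Sketch: 1/2 * sum over categories of |synthetic_frequency - real_frequency|."""
    real_freq = pd.Series(real_column).value_counts(normalize=True)
    synth_freq = pd.Series(synthetic_column).value_counts(normalize=True)

    categories = real_freq.index.union(synth_freq.index)
    real_freq = real_freq.reindex(categories, fill_value=0)
    synth_freq = synth_freq.reindex(categories, fill_value=0)

    return 0.5 * (synth_freq - real_freq).abs().sum()

A complementary score, 1 - TVD, would keep the result in [0, 1] with higher meaning more similar.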

Add normalize method to metrics

Description

Currently metrics return one score which can be defined in an arbitrary range and can be either a MAXIMIZATION or MINIMIZATION metric.

We could add a normalize method to all the base classes (with specific overrides for the individual metrics that require it) which takes the raw score as input and returns it normalized between 0 and 1, always using a MAXIMIZATION goal.

Usage

raw_score = WhateverMetric.compute(real, synthetic, metadata)
normalized = WhateverMetric.normalize(raw_score)
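
A minimal sketch of what such a normalize method could look like for a metric with finite bounds (the attribute names and values below are illustrative):

class WhateverMetric:
    """Sketch of a metric that knows its own bounds and goal."""

    min_value = -10.0    # illustrative bounds; unbounded metrics would need an override
    max_value = 10.0
    goal = 'MINIMIZE'    # or 'MAXIMIZE'

    @classmethod
    def normalize(cls, raw_score):
        """Map the raw score into [0, 1] so that higher always means better."""
        score = (raw_score - cls.min_value) / (cls.max_value - cls.min_value)
        if cls.goal == 'MINIMIZE':
            score = 1 - score

        return score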

Numerical Privacy Metrics should support NaN values

NaN values should be supported by numerical privacy metrics, but currently it raises ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The code below reproduces this issue:

import pandas as pd
from sdmetrics.single_table.privacy import NumericalLR

data = pd.DataFrame({
    'key': [1, 2, None],
    'sensitive': [1, 2, 3]
})

privacy_metric = NumericalLR.compute(
    data,
    data, 
    key_fields=['key'],
    sensitive_fields=['sensitive']
)

print(privacy_metric) # this will print nan

Explore other statistical tests for numerical columns

Currently SDMetrics only provides the two samples KS test to compare numerical values. We should consider adding other tests as an optional parameter, so the user can choose a test which better matches their understanding of their data.

Additionally, we should explore substituting the KS test with the Anderson-Darling test as the default, since it is a more powerful test overall. Such a change would require experimentation to show that the AD test indeed outperforms the KS test in most use cases.
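
A minimal sketch of an Anderson-Darling based comparison using scipy, assuming pandas Series inputs (how to turn the result into a 0-1 score is left open):

from scipy.stats import anderson_ksamp


def anderson_darling_comparison(real_column, synthetic_column):
    """Sketch: two-sample Anderson-Darling test as an alternative to the KS test."""
    result = anderson_ksamp([real_column.dropna(), synthetic_column.dropna()])

    # Note: scipy caps the returned significance level to the [0.001, 0.25] range,
    # so a production metric would need a different normalization of the statistic.
    return result.statistic, result.significance_level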

Increase code style lint

Problem Description

Currently our code is validated only by 'vanilla' flake8 and just a few plugins. We would like to increase the code style checks by adding more add-ons that follow our code style and our standards.

Also we would like to ensure that our docstrings are properly written and follow the rest of our format.

Additional context

We have already performed this task on RDT, more precisely on the following issue:
sdv-dev/RDT#248 (comment)

Docstring plugin

We need to add pydocstyle plugin with the following lines on our setup.cfg file as we are following the google convention.

[pydocstyle]
convention = google
add-ignore = D107, D407, D417

Flake8 plugins to be added

Flake8 comes with a lot of different add-ons that we can use to adapt it to our code style and checks. Here is a list of plugins that I found to be interesting for us:

  • flake8-builtins - Check for python builtins being used as variables or parameters.
  • flake8-comprehensions - Helps you write better list/set/dict comprehensions.
  • flake8-debugger - Debug statement checker.
  • flake8-variables-names - Extension that helps to make more readable variables names.
  • Dlint - Tool for encouraging best coding practices and helping ensure Python code is secure.
  • flake8-mock - Provides checking mock non-existent methods.
  • flake8-fixme - Check for FIXME, TODO and other temporary developer notes.
  • flake8-eradicate - Plugin to find commented out or dead code.
  • flake8-mutable - Extension for mutable default arguments.
  • flake8-print - Check for print statements in python files.
  • flake8-pytest-style - Plugin checking common style issues or inconsistencies.
  • flake8-quotes - Extension for checking quotes in python.
  • flake8-multiline-containers - Plugin to ensure a consistent format for multiline containers.
  • pandas-vet - Plugin that provides opinionated linting for pandas code.
  • pep8-naming - Check the PEP-8 naming conventions.
  • flake8-expression-complexity - Plugin to validate expressions complexity.
  • flake8-sfs - String formatting.

Upgrade dependency ranges

The latest versions of the libraries pandas, sktime and pomegranate are not supported by sdmetrics:

Library       Upper bound (unsupported)   Latest release
pandas        1.1.5                       1.3.1
pomegranate   0.14.2                      0.14.5
sktime        0.6                         0.7.0

We should investigate why and update the code if necessary to support them.

Upgrade sktime

sktime==0.5.2 is already released but SDMetrics only supports 'sktime>=0.4,<0.5'.
On top of that, only sktime>0.5 versions are available in Conda, so SDMetrics can only support conda installation if we upgrade to the latest sktime versions.

Provide baseline measurement for ML efficacy

Problem Description

Right now, the ML efficacy metrics are computed on real data vs. synthetic data. While this information is perfect to gauge whether a good-enough model could be fit, it would also be interesting to learn how much performance we lose because of the synthesization.

Expected behavior

Not sure how to best integrate this with the other metrics. Maybe as additional return values? To stay backwards compatible, they could be returned conditioned on the caller adding the arg compute_relative_performance=True to compute?

synth_f1 = BinaryDecisionTreeClassifier.compute(data, new_data, target='placed')

vs

synth_f1, real_f1, rel_perf = \
    BinaryDecisionTreeClassifier.compute(data, new_data, target='placed', 
                                         compute_relative_performance=True)

Anyhow the result I'd like to see is something like this:

from sdv.demo import load_tabular_demo
from sdv.metrics.tabular import BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier
from sdv.tabular import CopulaGAN

data = load_tabular_demo('student_placements')

model = CopulaGAN()
model.fit(data)

new_data = model.sample(200)

for clf in [BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier]:
    r_f1 = clf.compute(data, data, target='placed') 
    s_f1 = clf.compute(data, new_data, target='placed') 
    print(f'{clf.__name__:30s} real f1: {r_f1:5.4f} synth f1: {s_f1:5.4f} performance: {s_f1/r_f1:5.2f}')

Resulting in:

BinaryDecisionTreeClassifier   real f1: 1.0000 synth f1: 0.5391 performance:  0.54
BinaryAdaBoostClassifier       real f1: 1.0000 synth f1: 0.6296 performance:  0.63
BinaryMLPClassifier            real f1: 1.0000 synth f1: 0.5693 performance:  0.57

Additional context

I am a bit at a loss as to whether it is OK to compare both models so directly, as the SDV generation process may produce NaNs and infinities that are silently replaced in the evaluation code but may still have an impact.

Add a metric to evaluate anonymization

Description

Having an easy way of measuring the privacy of synthesized data would be very useful for users of the tool. It could be added on top of the existing evaluation metrics: sdv-dev/SDV#52.

An easy way to measure it would be to calculate the average Euclidean distance from each synthetic record to its closest real neighbour, as used in the TableGAN paper. However, that would apply only to numerical data, which in the case of SDV is sometimes not enough.

It could also be implemented in a way that lets the user specify which fields to use in the evaluation; for example, some more sensitive fields should be taken into account while others can be ignored.
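
A minimal sketch of the distance-to-closest-record idea for numerical fields, using scikit-learn (illustrative, not an existing SDMetrics API):

from sklearn.neighbors import NearestNeighbors


def mean_distance_to_closest_real(real_data, synthetic_data, fields):
    """Sketch: average Euclidean distance from each synthetic row to its nearest real row."""
    # In practice the columns should be scaled first so no single field dominates.
    real = real_data[fields].dropna()
    synthetic = synthetic_data[fields].dropna()

    nn = NearestNeighbors(n_neighbors=1).fit(real)
    distances, _ = nn.kneighbors(synthetic)

    # Larger averages suggest synthetic rows are less likely to be near-copies of real records.
    return distances.mean()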

New SDMetrics from Smartnoise

Hi all!
I am one of the smartnoise-sdk maintainers (part of the OpenDP collaboration). Specifically, I work on differentially private (DP) data synthesizers.

Problem Description

It would be nice if SDMetrics had some more methods geared towards DP synthesizers! (specifically, methods from from https://arxiv.org/pdf/2004.07740.pdf, https://arxiv.org/pdf/1604.06651.pdf and https://arxiv.org/pdf/1806.11345.pdf)

Expected behavior

SDMetrics will be able to produce pMSE (Snoke et al) and Wasserstein randomization test (Arnold et al) scores for single_table synthetic data (under privacy). (Potentially, also SRA scores (Jordon et al), although this is not as high priority, and may require too much support code to be feasible.)

Additional context

Here, we have some light implementations of the aforementioned methods. Though we use them to evaluate DP synthetic data, these metrics would also work for general purpose synthetic data (pMSE and Wasserstein essentially fit the interface described by the single_table metrics as is).

Reasoning for transition: The SDMetrics package is far more mature and well supported than our DP synthetic data gym, and so we would like to be able to use SDMetrics instead of our gym for smartnoise synthesizer evaluations. Metric parity would be nice before that transition, and so we hope that we can contribute at least pMSE, hopefully Wasserstein, and perhaps SRA, to the SDMetrics package.

I'm adding this issue to gather feedback, before I begin this effort in earnest! Would these metrics be welcome in SDMetrics? Are there concerns/limitations I should be aware of?
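
For context, a minimal sketch of the pMSE idea applied to a single (numerical) table; the logistic-regression propensity model here is an assumption for illustration, not the smartnoise implementation:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def pmse(real_data, synthetic_data):
    """Sketch: propensity mean squared error between real and synthetic rows."""
    combined = pd.concat([real_data, synthetic_data], ignore_index=True)
    labels = np.concatenate([np.zeros(len(real_data)), np.ones(len(synthetic_data))])

    # Propensity model predicting whether a row is synthetic.
    model = LogisticRegression(max_iter=1000).fit(combined, labels)
    propensities = model.predict_proba(combined)[:, 1]

    # Under indistinguishability, every propensity equals the synthetic share.
    expected = len(synthetic_data) / len(combined)
    return np.mean((propensities - expected) ** 2)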

Review GMLogLikelihood metric test noise

The GMLogLikelihood metric was added to cover the metrics that existed in the original SDGym iteration, which used the GM Log Likelihood metric over datasets that were simulated using Gaussian Mixtures.

However, even though the implementation was optimized and improved to make the output as stable and meaningful as possible, the scores produced when this metric is run on datasets which were not simulated from GMs tend to be very noisy and may produce inconsistent results between runs. As a consequence, the ranking-based integration test fails randomly.

We may want to remove this metric from the ranking test and have a separate one which is tested using GM-simulated data, and also add a disclaimer in the documentation indicating what this metric is best suited for.

Metric to measure the time series similarity

Problem Description

Current time series metrics in SDMetrics are detection/classifier based. It would be beneficial to have a metric that assesses the similarity between the synthetic time series and the original one. An example of such a metric would be something that compares the autocorrelation of the original time series with that of the synthetic one.

[Figure: autocorrelation (ACF) plot comparing the real and sampled sequences]

In this case, the sampled sequences do not preserve the correlation of the time series with itself.

Discussion

Since the most important autocorrelation values are the ones at low lags, we can take the maximum over low lags as a "metric" of how well the autocorrelation is preserved. Other metrics to assess the seasonality/periodicity of the signal could be constructed around the FFT of the two signals (see the sketch below).
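
A minimal sketch of an autocorrelation-based similarity score, using plain pandas and numpy (illustrative only):

import numpy as np
import pandas as pd


def autocorrelation_similarity(real_series, synthetic_series, max_lag=10):
    """Sketch: compare low-lag autocorrelations of the real and synthetic sequences."""
    real = pd.Series(real_series)
    synth = pd.Series(synthetic_series)

    lags = range(1, max_lag + 1)
    real_acf = np.array([real.autocorr(lag) for lag in lags])
    synth_acf = np.array([synth.autocorr(lag) for lag in lags])

    # Take the largest absolute difference at low lags, where autocorrelation matters most,
    # and turn it into a 0-1 score where higher means more similar.
    return 1 - np.nanmax(np.abs(real_acf - synth_acf)) / 2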
