
otdd's People

Contributors: chengrunyang, dmelis, microsoft-github-operations[bot], microsoftopensource


otdd's Issues

Calculating dataset distances of CSV format datasets

I see a function called "dataset_from_numpy".

I want to read some data from CSV files and then calculate dataset distances.

    import torch
    import numpy as np
    from torch.utils.data import TensorDataset
    from otdd.pytorch.distance import DatasetDistance

    def dataset_from_numpy(X, Y, classes = None):
        targets = torch.LongTensor(list(Y))
        ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor), targets)
        ds.targets = targets
        ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
        return ds

    x1 = np.array([[1,2,3,5,6,9],[4,5,6,2,4,5]])
    y1 = np.array([0,1])

    x2 = np.array([[2,2,3,5,3,7],[4,5,6,9,1,4]])
    y2 = np.array([1,2])

    ds1 = dataset_from_numpy(x1,y1)
    ds2 = dataset_from_numpy(x2,y2)
    dist = DatasetDistance(ds1,ds2)
    dist.distance()

and I face a problem:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/xxx/otdd/otdd/pytorch/distance.py", line 595, in distance
        _ = self._get_label_distances()
      File "/home/xxx/otdd/otdd/pytorch/distance.py", line 439, in _get_label_distances
        Means, Covs = self._get_label_stats()
      File "/home/xxx/otdd/otdd/pytorch/distance.py", line 385, in _get_label_stats
        **shared_args)
      File "/home/xxx/otdd/otdd/pytorch/moments.py", line 321, in compute_label_stats
        M = torch.stack([μ.to(device) for i,μ in sorted(M.items()) if μ is not None], dim=0)
    RuntimeError: stack expects a non-empty TensorList

Could you please help me solve it?
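A possible cause (my assumption, not confirmed by the maintainers): with only one sample per class, every class may get filtered out before the per-label statistics are stacked, leaving nothing to stack; DatasetDistance has a min_labelcount argument that governs how small a class may be before it is dropped. A minimal sketch of the CSV workflow with several samples per class, reusing the dataset_from_numpy helper above (the file name and column layout are hypothetical):

    import numpy as np
    from otdd.pytorch.distance import DatasetDistance

    # Hypothetical CSV layout: feature columns first, integer class label last.
    data = np.genfromtxt('my_dataset.csv', delimiter=',', skip_header=1)
    X, Y = data[:, :-1], data[:, -1].astype(int)

    ds1 = dataset_from_numpy(X, Y)        # helper defined in the snippet above
    ds2 = dataset_from_numpy(X, Y)        # the second dataset is loaded the same way

    # Several samples per class keeps the per-label statistics non-empty;
    # min_labelcount controls when a small class is dropped.
    dist = DatasetDistance(ds1, ds2, min_labelcount=2, device='cpu')
    print(dist.distance(maxsamples=1000))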

Question:

Hi
I have a small question about the 2-Wasserstein distance between Gaussians as described in the paper (Geometric Dataset Distances via Optimal Transport).

I guess that in the sentence "Furthermore, whenever $\Sigma_\alpha$ and $\Sigma_\beta$ commute," the matrices should be square roots.
[Screenshot of the corresponding equation in the paper]
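For reference, the standard closed form for the squared 2-Wasserstein distance between Gaussians, which the screenshot presumably shows:

    W_2^2\big(\mathcal{N}(\mu_\alpha,\Sigma_\alpha),\,\mathcal{N}(\mu_\beta,\Sigma_\beta)\big)
        = \|\mu_\alpha-\mu_\beta\|_2^2
        + \mathrm{tr}\big(\Sigma_\alpha+\Sigma_\beta-2\,(\Sigma_\beta^{1/2}\Sigma_\alpha\Sigma_\beta^{1/2})^{1/2}\big)

    \text{and, when } \Sigma_\alpha \text{ and } \Sigma_\beta \text{ commute, the trace term reduces to }
        \|\Sigma_\alpha^{1/2}-\Sigma_\beta^{1/2}\|_F^2

i.e. the commuting case is indeed expressed in terms of the square roots of the covariance matrices, which is the point of the question.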

Possible to use otdd with coco dataset?

Hello,

I am using a COCO style dataset.

    from torchvision.datasets.vision import VisionDataset

    class MyCocoDataset(VisionDataset):
        ...

OTDD fails in this block of code

    if hasattr(dataset, 'targets'): # most torchvision datasets
        targets = dataset.targets
    elif hasattr(dataset, '_data'): # some torchtext datasets
        targets = torch.LongTensor([e[0] for e in dataset._data])
    elif hasattr(dataset, 'tensors') and len(dataset.tensors) == 2: # TensorDatasets
        targets = dataset.tensors[1]
    elif hasattr(dataset, 'tensors') and len(dataset.tensors) == 1:
        logger.warning('Dataset seems to be unlabeled - this modality is in beta mode!')
        targets = None
    else:
        raise ValueError("Could not find targets in dataset.")

It raises a ValueError because the dataset doesn't have any of the checked attributes.

My dataset doesn't have targets, tensors, etc. Is it possible to use otdd?

It looks like the perfect tool for what I want to achieve!
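One possible workaround (a sketch only, not part of the library, and assuming each COCO image can be reduced to a single class label, e.g. the category of its first annotation) is to wrap the dataset and expose the targets/classes attributes that the check above looks for:

    import torch
    from torch.utils.data import Dataset

    class CocoAsLabeled(Dataset):
        # Wrap a COCO-style detection dataset and expose .targets / .classes.
        # Assumption: every sample has at least one annotation, and the first
        # annotation's category_id is an acceptable single label for the image.
        def __init__(self, coco_ds):
            self.coco_ds = coco_ds
            # This pass touches every sample once; fine for a sketch, slow for large data.
            labels = [coco_ds[i][1][0]['category_id'] for i in range(len(coco_ds))]
            self.targets = torch.LongTensor(labels)
            self.classes = sorted(set(labels))

        def __len__(self):
            return len(self.coco_ds)

        def __getitem__(self, idx):
            img, _ = self.coco_ds[idx]
            return img, self.targets[idx]

The images would also likely need a fixed-size tensor transform (e.g. Resize + ToTensor) for the distance computation.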

Cov is nan in the flow example in readme

Hello
When I run the example in the readme, I get the following error after fixing the issues mentioned in #20.
After tracing the code, I found it is caused by NaNs appearing around line 383 in compute_label_stats, called from flows.py.
I tried using eigen_correction='constant' to ensure the PSDness of the covariance matrices, but it didn't work.
A workaround I found helpful is setting diagonal_cov = True.

I would like to confirm whether setting diagonal_cov = True is a valid way to deal with this issue. A side question: do xonly, xonly-attached, and xyaug correspond to fd, jd-fl, and jd-vl in the paper? Thanks in advance for your time.

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_48265/2326306459.py in <module>
     22                           device='cpu'
     23                           )
---> 24 d,out = flow.flow()

~/Desktop/kuan/otdd/otdd/pytorch/flows.py in flow(self, tol)
    477             pbar.set_description(f'Flow Step {iter}/{len(self.times)}, F_t={obj:8.2f}')
    478             self.callback.on_step_begin(self.otdd, iter)
--> 479             obj = self.step(iter)
    480             logger.info(f't={t:8.2f}, F(a_t)={obj:8.2f}') # Although things have been updated, this is obj of time t still
    481             self.history.append(obj)

~/Desktop/kuan/otdd/otdd/pytorch/flows.py in step(self, iter)
    454         if self.otdd.inner_ot_method != 'exact':
    455             logger.info('Performing stats update...')
--> 456             self.stats_update()
    457 
    458         if self.compute_coupling == 'every_iteration':

~/Desktop/kuan/otdd/otdd/pytorch/flows.py in stats_update(self)
    393                                         )
    394             if torch.isnan(self.otdd.Covs[0]).any():
--> 395                 pdb.set_trace(header='Nans in Cov Matrices')
    396 
    397 

TypeError: set_trace() got an unexpected keyword argument 'header'
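For reference, the diagonal_cov workaround described above is just an extra constructor argument; a sketch mirroring the OTDD_Gradient_Flow call from the gradient-flow example elsewhere on this page (loaders_src / loaders_tgt assumed to be loaded as in that example):

    from otdd.pytorch.flows import OTDD_Gradient_Flow

    flow = OTDD_Gradient_Flow(loaders_src['train'], loaders_tgt['train'],
                              method='xonly-attached',
                              use_torchoptim=True,
                              optim='adam',
                              steps=10,
                              step_size=1,
                              online_stats=True,
                              diagonal_cov=True,   # the workaround: diagonal covariances only
                              device='cpu')
    d, out = flow.flow()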

Questions regarding using the "exact" method and default "gaussian_approx" method

Dear author,

I am trying to calculate the distance between a dataset and itself (the USPS training set). According to the paper, $d_{OT-N} \le d_{OT}$. So I used both the exact method and the default gaussian_approx method to compute the distance. However, the result from the Gaussian approximation is slightly larger than that from the exact method, which is almost 0 in this case; this seems to contradict the previous statement.

Here is the setting for the two experiments:

# For d_OT
from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance


# Load datasets
loaders_src = load_torchvision_data('USPS', valid_size=0, resize = 28, maxsize=20000)[0]
loaders_tgt = load_torchvision_data('USPS',  valid_size=0, resize = 28, maxsize=20000)[0]

# Instantiate distance
dist = DatasetDistance(loaders_src['train'], loaders_tgt['train'],
                       inner_ot_method='exact',
                       inner_ot_debiased=True,
                       device='cpu')

d = dist.distance(maxsamples = 20000)
print(f'OTDD(src,tgt)={d}')

# For d_OT_N

# Instantiate distance
dist = DatasetDistance(loaders_src['train'], loaders_tgt['train'],
                       inner_ot_debiased=True,
                       device='cpu')

d = dist.distance(maxsamples = 20000)
print(f'OTDD(src,tgt)={d}')

Unexpected results on Apple M1 processor

The following code compares the exact OTDD where the source and target datasets are the same. The expected result is that the distance d is zero. When the code is run on an Apple M1 processor, it does not return zero; for the given random seed it returns 2.99 (2 d.p.). One workaround when using an M1 is to set device="mps", and then the expected result of zero is returned.

from otdd.pytorch.distance import DatasetDistance
import torch
from torch.utils.data import TensorDataset

n_samples = 1000
n_feats = 100000
n_labels = 10
max_samples = n_samples

torch.manual_seed(42)
data_A = torch.randn(n_samples, n_feats)
labels_A = torch.randint(low=0, high=n_labels, size=(n_samples,))

ds_A = TensorDataset(data_A, labels_A)

dist = DatasetDistance(
    ds_A,
    ds_A,# compare to itself
    inner_ot_method="exact",
    debiased_loss=True,
    p=2,
    entreg = 1e-1,
    inner_ot_debiased=True,
    device="cpu",
)
d = dist.distance(maxsamples=max_samples)
print(f"distance: {d}")

>>>> HEAD in setup.py

Thanks for your impressive work! I found some leftover merge-conflict markers in setup.py; you can remove them when you have time. :)

When I execute the function `pwdist_exact`, I get an error

When I try to compute the distance between two subsets randomly sampled from the EMNIST dataset, using the 'exact' method and following the format of example.py, the code always falls into the except branch inside the function pwdist_exact.

from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance
from torch.utils.data import DataLoader, SubsetRandomSampler, Dataset
from torchvision.datasets import EMNIST
from torchvision.transforms import Compose, ToTensor, Normalize, Grayscale
import json

class CustomDataset(Dataset):
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = [int(i) for i in indices]
        self.targets = dataset.targets # keep the targets attribute
        self.classes = dataset.classes # keep the classes attribute
        
    def __len__(self):
        return len(self.indices)

    def __getitem__(self, item):
        x, y = self.dataset[self.indices[item]]
        return x, y
    
    def get_class_distribution(self):
        sub_targets = self.targets[self.indices]
        return sub_targets.unique(return_counts=True)

raw_data_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/EMNIST'
sub_train_config_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/sub_train_datasets_config.json'
sub_test_config_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/test_dataset_config.json'
train_id = 0
test_id = 0
dataset_name = "EMNIST"
sub_train_key = 'train_sub_{}'.format(train_id)
sub_test_key = 'test_sub_{}'.format(test_id)
BATCH_SIZE = 2048
with open(sub_train_config_path, 'r+') as f:
    current_subtrain_config = json.load(f)
    f.close()
with open(sub_test_config_path, 'r+') as f:
    current_subtest_config = json.load(f)
    f.close()
real_train_index = sorted(list(current_subtrain_config[dataset_name][sub_train_key]["indexes"]))
print("check last real_train_index: ", real_train_index[-1])
print(len(real_train_index))
real_test_index = sorted(list(current_subtest_config[dataset_name][sub_test_key]["indexes"])) 
print("check last real_test_index: ", real_test_index[-1])
print(len(real_test_index))

transform = Compose([
    Grayscale(3),
    ToTensor(),
    Normalize((0.1307,), (0.3081,))
])
train_dataset = EMNIST(
    root=raw_data_path,
    split="bymerge",
    download=False,
    train=True,
    transform=transform
)
test_dataset = EMNIST(
    root=raw_data_path,
    split="bymerge",
    download=False,
    train=False,
    transform=transform
)
print("begin train: {} test: {}".format(train_id, test_id))
print("check all size: train[{}] and test[{}]".format(len(train_dataset), len(test_dataset)))
train_dataset = CustomDataset(train_dataset, real_train_index)
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE)
test_dataset = CustomDataset(test_dataset, real_test_index)
test_loader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE)
print("Finished split datasets!")
print("check train_loader: {}".format(len(train_loader) * BATCH_SIZE))
print("check test_loader: {}".format(len(test_loader) * BATCH_SIZE))


# Instantiate distance
dist = DatasetDistance(train_loader, test_loader,
                          inner_ot_method = 'exact',
                          debiased_loss = True,
                          p = 2, entreg = 1e-1,
                          device='cuda:3')

d = dist.distance(maxsamples = 1000)
print(f'OTDD-EMNIST(train,test)={d:8.2f}')

It is worth noting that the label distribution of the EMNIST subset is not the same as that of the entire EMNIST dataset; in the subset, some labels have zero instances.

The function pwdist_exact seems to return the correct result when I take evenly spaced samples. Here is the code.

# real_train_index = sorted(list(current_subtrain_config[dataset_name][sub_train_key]["indexes"]))
real_train_index = list(range(0, 697932, 10))
# real_test_index = sorted(list(current_subtest_config[dataset_name][sub_test_key]["indexes"])) 
real_test_index = list(range(0, 116323, 3))

You can download my sub_train_datasets_config.json and test_dataset_config.json in Google Drive. Link: https://drive.google.com/drive/folders/1r_vyLJ-RmuuNZqneBP3meexrEZvgc_Ce?usp=sharing
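One thing worth checking before the distance call (a sketch using the get_class_distribution helper defined above): whether the sampled subsets contain classes with zero or very few instances, since that is exactly the asymmetry noted above between the custom indices and the evenly spaced ones; DatasetDistance also exposes a min_labelcount argument that controls when small classes are dropped.

    # Inspect per-class counts of both subsets before computing the distance.
    train_classes, train_counts = train_dataset.get_class_distribution()
    test_classes, test_counts = test_dataset.get_class_distribution()
    print("train classes:", train_classes.tolist(), "counts:", train_counts.tolist())
    print("test classes:", test_classes.tolist(), "counts:", test_counts.tolist())

    # Hypothetical mitigation: require a minimum number of samples per class.
    dist = DatasetDistance(train_loader, test_loader,
                           inner_ot_method='exact',
                           debiased_loss=True,
                           p=2, entreg=1e-1,
                           min_labelcount=10,   # assumption: drop classes with < 10 samples
                           device='cuda:3')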

How to extend this library to tabular datasets?

Hi,

Thanks for providing this library for vision datasets. My question is: can one apply these methods to tabular datasets? Is it even possible to compare two tabular datasets with mixed column types? See the sketch below for the purely numeric case.
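For purely numeric tabular data, the dataset_from_numpy pattern that appears in other issues on this page seems to apply directly; a sketch (mixed column types would first need a numeric encoding, e.g. one-hot or ordinal, which is my assumption rather than something the library documents):

    import numpy as np
    import torch
    from torch.utils.data import TensorDataset
    from otdd.pytorch.distance import DatasetDistance

    def dataset_from_numpy(X, Y, classes=None):
        # Same helper as in the other issues: wrap the arrays and attach the
        # targets/classes attributes that DatasetDistance looks for.
        targets = torch.LongTensor(list(Y))
        ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor), targets)
        ds.targets = targets
        ds.classes = classes if classes is not None else list(range(len(np.unique(Y))))
        return ds

    rng = np.random.default_rng(0)
    X1, y1 = rng.normal(size=(500, 8)), rng.integers(0, 3, size=500)   # toy tabular data
    X2, y2 = rng.normal(size=(500, 8)), rng.integers(0, 3, size=500)

    dist = DatasetDistance(dataset_from_numpy(X1, y1), dataset_from_numpy(X2, y2),
                           inner_ot_method='exact', debiased_loss=True,
                           entreg=1e-1, device='cpu')
    print(dist.distance(maxsamples=500))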

OTDD in datasets with missing values?

Hi, I would like to ask: what is the ideal/preferred preprocessing technique or setting for using IncomparableDatasetDistance on tabular datasets with missing values?
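For reference, another issue on this page simply imputes missing values with scikit-learn's SimpleImputer before building the datasets; a minimal sketch of that preprocessing step (whether imputation is actually the preferred approach is exactly the open question here):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Replace NaNs with per-column means before wrapping the arrays in datasets.
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    X1_clean = imp.fit_transform(X1)   # X1, X2: feature matrices containing NaNs
    X2_clean = imp.fit_transform(X2)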

Questions about debiased_loss

Hi, thanks for your work and code. I'm confused about the debiased_loss parameter in DatasetDistance, and I have two questions:

  1. In get_label_distances, why do we also need to compute the class distances within the same dataset when debiased_loss is True?
  2. I ran the example.py given in this repo, and I noticed that in the batch_augmented_cost function the size of W is [20, 20] rather than [10, 10]. I realise this is because the class distances of D1 and D2 are concatenated as [[DYY1, DYY12], [DYY21, DYY2]] here. But I'm afraid the following operation gives a wrong index into W: for example, the index for class 0 in D1 and class 0 in D2 would be 0 * 20 + 0 = 0, but W.flatten()[0] is the distance between class 0 in D1 and class 0 in D1.
M = W.shape[1] * Y1[:, :, None] + Y2[:, None, :]
C2 = W.flatten()[M.flatten(start_dim=1)].reshape(-1,Y1.shape[1], Y2.shape[1])

I don’t know whether my understanding is correct.
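For concreteness, a small sketch of the flattened-index arithmetic described in point 2 (hypothetical label tensors; note that another issue on this page shows the target labels being shifted by len(self.V1) via reindex_start when debiased_loss is used, which would move the lookup into the cross-dataset block):

    import torch

    n1 = n2 = 10                          # classes per dataset; W is (n1+n2) x (n1+n2)
    W = torch.arange((n1 + n2) ** 2, dtype=torch.float).reshape(n1 + n2, n1 + n2)

    Y1 = torch.tensor([[0]])              # class 0 in D1 (batch of one sample)
    Y2 = torch.tensor([[0]])              # class 0 in D2

    # Unshifted labels index the top-left (within-D1) block:
    M = W.shape[1] * Y1[:, :, None] + Y2[:, None, :]
    print(W.flatten()[M.flatten()])       # W[0, 0], i.e. DYY1[0, 0]

    # Shifting Y2 by n1 (as reindex_start would) indexes the cross block instead:
    M_shifted = W.shape[1] * Y1[:, :, None] + (Y2 + n1)[:, None, :]
    print(W.flatten()[M_shifted.flatten()])   # W[0, 10], i.e. DYY12[0, 0]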

"Unexpected keyword argument" bug in last commit

Hi! The refactoring in the last commit seems to have introduced a small bug where load_full_dataset() (here) is called with an argument reindex_start that does not exist (e.g. here and in a couple other places).

I'm not sure, but it might be enough to add reindex_start=None to the signature of load_full_dataset() and then change this into the following:

if type(reindex) is bool:
    if reindex_start is None:
        reindex_start = 0
    reindex_vals = range(reindex_start, reindex_start + len(labels))

Does it make sense? Thanks!

Parameter setup for the *MNIST+USPS distance

Thank you for your attention!
I was trying to reproduce the pairwise distances among the *MNIST+USPS datasets (Figure 4 of Geometric Dataset Distances via Optimal Transport). I used the parameter setup provided by example.py, but the measured distances are much smaller than those in Figure 4, even when trying the upper bound as the inner OT method, and some of the relative orderings differ. May I ask whether there is a detailed parameter setup for reproducing Figure 4? Thank you very much!

Input matrix is ill-conditioned

Thank you for your attention.

When I ran d, out = flow.flow(), I got:

RuntimeError: symeig_cpu: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 635).

Is there any solution for this error? Thanks a lot!

ACTION REQUIRED: Microsoft needs this private repository to complete compliance info

There are open compliance tasks that need to be reviewed for your otdd repo.

Action required: 4 compliance tasks

To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.

Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/otdd/compliance

  • The GitHub AE (GitHub inside Microsoft) migration survey has not been completed for this private repository
  • No Service Tree mapping has been set for this repo. If this team does not use Service Tree, they can also opt-out of providing Service Tree data in the Compliance tab.
  • No repository maintainers are set. The Open Source Maintainers are the decision-makers and actionable owners of the repository, irrespective of administrator permission grants on GitHub.
  • Classification of the repository as production/non-production is missing in the Compliance tab.

You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.

If you no longer need this repository, it might be quickest to delete the repo, too.

GitHub inside Microsoft program information

More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]

FYI: current admins at Microsoft include @nfusi, @philrosenfield, @JamesBHall, @dmelis, @ntenenz, @evanrgreen

Distance between the same dataset > 0?

Hi,

I don't understand why the distance between the same dataset is greater than 0.
MWE:

import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance

import openml
d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(31)

def dataset_from_numpy(X, Y, classes = None):
    targets =  torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets =  targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds


samples = 100
dim     = 6

x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)


# x2 = np.random.randn(samples, dim)
# y2 = np.random.randint(0, 2, size=(samples))


ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2, inner_ot_method = 'exact',
                       debiased_loss = True, entreg = 1e-1,
                       device='cuda')
print(dist.distance())


output:
tensor(1.8688, device='cuda:0')

RuntimeError: symeig_cpu: the algorithm failed to converge

When I ran the gradient flow example from GitHub, it ran into the following error:

RuntimeError: symeig_cpu: the algorithm failed to converge; 624 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.

OTDD between the complete dataset of CIFAR10 with itself gives non-zero value

Hi,

Thanks for the great work. Many thanks for releasing the code to the public. I have an issue with OTDD on the CIFAR10 dataset. Below is the code.

from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance

loaders_tgt = load_torchvision_data('CIFAR10', valid_size = 0, resize = 28)[0]
loaders_src = load_torchvision_data('CIFAR10', valid_size = 0, resize = 28)[0]

print('===> Reading both datasets done')

dist = DatasetDistance(loaders_src['train'], loaders_src['train'],
                       method = 'precomputed_labeldist',
                       inner_ot_method = 'exact',
                       inner_ot_debiased = True,
                       debiased_loss = True,
                       p = 2, entreg = 1e-1,
                       device='cuda')

d = dist.distance()
print(f'OTDD-Exact-CompleteData(CIFAR10 Img, CIFAR10 Img)={d:8.2f}')

  1. No subset random sampling is happening. The complete dataset is read and loaded only once since I am feeding src data in place of tgt data. It should give a zero distance, but below is the output.

$OTDD-Exact-CompleteData(CIFAR10 Img, CIFAR10 Img)= 723.36

Surprisingly, it gives 0 distance as expected when computed only on 2000 samples, the same as the default.

  2. When the complete MNIST/FashionMNIST dataset is used in place of CIFAR10, the error below is thrown:
    $Distance computation failed. Aborting.

The exact problem is as below
$geomloss/sinkhorn_samples.py", line 327, in lse_genred
"( B - (P * " + cost + " ) )",
TypeError: can only concatenate str (not "function") to str

But the same code works when given 2000 samples, as in the default code.

Please help me understand why this could be the case, especially the CIFAR10 issue.

Thanks!

Best,
Anuradha

Missing Dependency munkres

Following the installation instructions leads to "ModuleNotFoundError: No module named 'munkres'". A manual pip install resolves the issue.
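The manual fix amounts to installing the missing package into the same environment:

python -m pip install munkres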

Unsupervised dataset RunTime error (setting the debiased_loss=False and ignore_target_labels=True)

For unsupervised datasets and setting the parameters
debiased_loss=False
ignore_source_labels=True
ignore_target_labels=True

the code in file distance.py lines 300-303 will run:

if (targets2 is None) or self.ignore_target_labels:
    reindex_start = len(self.V1) if (self.loss == 'sinkhorn' and self.debiased_loss) else True
    X, Y_infer, Y_true = self._load_infer_labels(D2, classes2, reindex=True, reindex_start=reindex_start)
    self.targets2 = targets2 = Y_infer - reindex_start

This results in RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported, because of
Y_infer - reindex_start. (I think that in line 301, instead of True, the correct value may be 0?)

i.e., changing it to:
reindex_start = len(self.V1) if (self.loss == 'sinkhorn' and self.debiased_loss) else 0

How to use multiple GPUs for calculation?

When I use the CPU for the calculation, the time cost is very high, so I want to use multiple GPUs instead.
But compared with the CPU, even with the same 'maxsamples' setting, the GPU memory cost increases dramatically and easily leads to 'CUDA out of memory'.
Is it possible to speed up the calculation with GPUs?

Problem when using "Exact" method to calculate large samples

When I tried to calculate the "exact" distance between two large datasets (with 2 labels and more than 10000 samples in total), the following error was raised:

TypeError                                 Traceback (most recent call last)
~/otdd/otdd/pytorch/wasserstein.py in pwdist_exact(X1, Y1, X2, Y2, symmetric, loss, cost_function, p, debias, entreg, device)
    335         try:
--> 336             D[i, j] = distance(X1[Y1==c1[i]].to(device), X2[Y2==c2[j]].to(device)).item()
    337         except:

~/test/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1101                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102             return forward_call(*input, **kwargs)
   1103         # Do not call functions when jit is used

~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/samples_loss.py in forward(self, *args)
    283             labels_y=l_y,
--> 284             verbose=self.verbose,
    285         )

~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/sinkhorn_samples.py in sinkhorn_online(α, x, β, y, p, blur, reach, diameter, scaling, cost, debias, potentials, **kwargs)
    142     softmin = partial(
--> 143         softmin_online, log_conv=keops_lse(cost, D, dtype=str(x.dtype)[6:])
    144     )

~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/sinkhorn_samples.py in keops_lse(cost, D, dtype)
    107 #         "( B - (P * " + cost + " ) )",
--> 108         "( B - (P * " + cost+ " ) )",
    109         "A = Vi(1)",
TypeError: can only concatenate str (not "function") to str

I wonder whether it would be possible to use the "exact" method to compute distances over a large number of samples? Thanks!

Why the same datasets otdd is not zero ?

I tried to measure the OTDD between a dataset and itself, but the result is not zero.

loaders_src  = load_torchvision_data('MNIST', valid_size=0, resize = 28, maxsize=2000)[0]
loaders_tgt  = load_torchvision_data('MNIST',  valid_size=0, resize = 28, maxsize=2000)[0]

The result is 346.16
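For comparison (not an official answer), other issues on this page report a near-zero self-distance only with the debiased exact settings; a sketch mirroring those arguments:

    from otdd.pytorch.distance import DatasetDistance

    dist = DatasetDistance(loaders_src['train'], loaders_tgt['train'],
                           inner_ot_method='exact',
                           inner_ot_debiased=True,
                           debiased_loss=True,
                           p=2, entreg=1e-1,
                           device='cpu')
    print(dist.distance(maxsamples=2000))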

Went into (Pdb) mode when using 'exact' as inner_ot_method

I ran into (Pdb) mode when testing the code with inner_ot_method='exact'. It does not happen when inner_ot_method='gaussian_approx'. Basically I just followed the vanilla example. Here is my code:

dist = DatasetDistance(dataset1, dataset2,
                       inner_ot_method = 'exact',
                       debiased_loss = True,
                       p = 2, entreg = 1e-1,
                       device='cpu')
d = dist.distance(maxsamples = 1000)

Both dataset1 and dataset2 are created by TensorDataset.

And below is the output from the program.

Computing label-to-label distances:   0%|                                                                                                                                                                                                                              | 0/4 [00:00<?, ?it/s]
> mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py(338)pwdist_exact()
-> if symmetric:
(Pdb) True
True
(Pdb) False
False
(Pdb) 1
1
(Pdb) 0
0
(Pdb) 

There is no way for me to leave (Pdb) except quit, which also terminates the program and returns an error:

Traceback (most recent call last):                                                                                                                                                                                                                                                           
  File "otdd_distance.py", line 63, in <module>
    d = dist.distance(maxsamples = 1000)
  File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/distance.py", line 595, in distance
    _ = self._get_label_distances()
  File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/distance.py", line 529, in _get_label_distances
    DYY12 = pwdist(self.X1,self.Y1,self.X2, self.Y2)
  File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py", line 338, in pwdist_exact
    if symmetric:
  File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py", line 338, in pwdist_exact
    if symmetric:
  File "mypath/envs/OTDD/lib/python3.6/bdb.py", line 51, in trace_dispatch
    return self.dispatch_line(frame)
  File "mypath/envs/OTDD/lib/python3.6/bdb.py", line 70, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

About my environment:
A new environment with conda create -n name python=3.6
Pytorch installed by conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
Then I followed the instructions in this repository

python -m pip install -r requirements.txt
python -m pip install .

Please let me know how to solve this problem. Thank you very much!

IndexError: list index out of range while computing distance for datasets

Code:

import numpy as np
from sklearn.impute import SimpleImputer
# With incomparable dataset distance cost
import torch
import openml
from otdd.pytorch.distance import IncomparableDatasetDistance
from torchvision.models import resnet18
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset


imp = SimpleImputer(missing_values=np.nan, strategy='mean')
d1 = openml.datasets.get_dataset(8)
d2 = openml.datasets.get_dataset(39)

def dataset_from_numpy(X, Y, classes = None):
    targets =  torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets =  targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds


x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)
x1 = imp.fit_transform(x1)
x2 = imp.fit_transform(x2)
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')



dist = IncomparableDatasetDistance(ds1, ds2,
                          debiased_loss = False,
                          inner_ot_method = 'exact',
                          p = 5, entreg = 10e-1,
                          device='cpu')

d = dist.distance()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
[<ipython-input-3-1e152b2bf0fd>](https://localhost:8080/#) in <module>()
     37                           inner_ot_method = 'exact',
     38                           p = 5, entreg = 10e-1,
---> 39                           device='cpu')
     40 
     41 d = dist.distance()

3 frames
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in __init__(self, *args, **kwargs)
   1093     """
   1094     def __init__(self, *args, **kwargs):
-> 1095         super(IncomparableDatasetDistance, self).__init__(*args, **kwargs)
   1096         if self.debiased_loss:
   1097             raise ValueError('Debiased GWOTDD not implemented yet')

[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in __init__(self, D1, D2, method, symmetric_tasks, feature_cost, src_embedding, tgt_embedding, ignore_source_labels, ignore_target_labels, loss, debiased_loss, p, entreg, λ_x, λ_y, inner_ot_method, inner_ot_loss, inner_ot_debiased, inner_ot_p, inner_ot_entreg, diagonal_cov, min_labelcount, online_stats, sqrt_method, sqrt_niters, sqrt_pref, nworkers_stats, coupling_method, nworkers_dists, eigen_correction, device, precision, verbose, *args, **kwargs)
    232 
    233         if self.D1 is not None and self.D2 is not None:
--> 234             self._init_data(self.D1, self.D2)
    235         else:
    236             logger.warning('DatasetDistance initialized with empty data')

[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in _init_data(self, D1, D2)
    320 
    321 
--> 322         self.classes1 = [classes1[i] for i in self.V1]
    323         self.classes2 = [classes2[i] for i in self.V2]
    324 

[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in <listcomp>(.0)
    320 
    321 
--> 322         self.classes1 = [classes1[i] for i in self.V1]
    323         self.classes2 = [classes2[i] for i in self.V2]
    324 

IndexError: list index out of range

RuntimeError: symeig_cpu: the algorithm failed to converge; 643 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.

Hi, I tried to run your gradient flow code:
Here is my code:

import os
import matplotlib
%matplotlib inline 
#Comment out if not on notebook
import torch
from torchvision.models import resnet18

from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance, FeatureCost
from otdd.pytorch.flows import OTDD_Gradient_Flow
from otdd.pytorch.flows import CallbackList, ImageGridCallback, TrajectoryDump

# Load datasets
loaders_src = load_torchvision_data('MNIST', valid_size=0, resize = 28, maxsize=1000)[0]
loaders_tgt = load_torchvision_data('USPS',  valid_size=0, resize = 28, maxsize=1000)[0]


outdir =  os.path.join('out', 'flows')
callbacks = CallbackList([
  ImageGridCallback(display_freq=2, animate=False, save_path = outdir + '/grid'),
])

flow = OTDD_Gradient_Flow(loaders_src['train'], loaders_tgt['train'],
                          ### Gradient Flow Args
                          method = 'xonly-attached',                          
                          use_torchoptim=True,
                          optim='adam',
                          steps=10,
                          step_size=1,
                          callback=callbacks,              
                          clustering_method='kmeans',                                      
                          ### OTDD Args                          
                          online_stats=True,
                          diagonal_cov = False,
                          device='cuda'
                          )
d,out = flow.flow()

then I received this error:

RuntimeError: symeig_cpu: the algorithm failed to converge; 643 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.

Do you know what is wrong? Thank you so much!
