microsoft / otdd
Optimal Transport Dataset Distance
License: MIT License
I see a function called `dataset_from_numpy`.
I want to read some data from CSV files and then calculate dataset distances.
`
import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance
def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds
x1 = np.array([[1,2,3,5,6,9],[4,5,6,2,4,5]])
y1 = np.array([0,1])
x2 = np.array([[2,2,3,5,3,7],[4,5,6,9,1,4]])
y2 = np.array([1,2])
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2)
dist.distance()
`
and I get the following error:
`
Traceback (most recent call last):
File "", line 1, in
File "/home/xxx/otdd/otdd/pytorch/distance.py", line 595, in distance
_ = self._get_label_distances()
File "/home/xxx/otdd/otdd/pytorch/distance.py", line 439, in _get_label_distances
Means, Covs = self._get_label_stats()
File "/home/xxx/otdd/otdd/pytorch/distance.py", line 385, in _get_label_stats
**shared_args)
File "/home/xxx/otdd/otdd/pytorch/moments.py", line 321, in compute_label_stats
M = torch.stack([μ.to(device) for i,μ in sorted(M.items()) if μ is not None], dim=0)
RuntimeError: stack expects a non-empty TensorList
`
Could you please help me solve it?
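A likely cause, stated as an assumption rather than a confirmed diagnosis: each class in the toy data has a single sample, and the `DatasetDistance` constructor exposes a `min_labelcount` argument (it appears in the constructor signature quoted in a later traceback on this page). If classes with too few samples are filtered out before the label statistics are computed, nothing is left to stack. The sketch below uses a hypothetical helper, `classes_below_min_count`, with an assumed threshold of 2, to check label counts before building the datasets.

```python
from collections import Counter

def classes_below_min_count(labels, min_count=2):
    """Return {label: count} for labels with fewer than min_count samples.

    min_count=2 is an assumed threshold for illustration; check the value
    of min_labelcount used by your DatasetDistance call.
    """
    counts = Counter(int(y) for y in labels)
    return {label: n for label, n in sorted(counts.items()) if n < min_count}

# Labels from the snippet above: every class appears exactly once.
y1 = [0, 1]
y2 = [1, 2]
print(classes_below_min_count(y1))  # {0: 1, 1: 1}
print(classes_below_min_count(y2))  # {1: 1, 2: 1}

# A label vector with at least two samples per class passes the check.
print(classes_below_min_count([0, 0, 1, 1, 1]))  # {}
```

For data read from CSV files, the same check can be applied to the label column before calling `dataset_from_numpy`.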
Hello,
I am using a COCO-style dataset:
from torchvision.datasets.vision import VisionDataset
class MyCocoDataset(VisionDataset): ...
OTDD fails in this block of code:
if hasattr(dataset, 'targets'): # most torchvision datasets
    targets = dataset.targets
elif hasattr(dataset, '_data'): # some torchtext datasets
    targets = torch.LongTensor([e[0] for e in dataset._data])
elif hasattr(dataset, 'tensors') and len(dataset.tensors) == 2: # TensorDatasets
    targets = dataset.tensors[1]
elif hasattr(dataset, 'tensors') and len(dataset.tensors) == 1:
    logger.warning('Dataset seems to be unlabeled - this modality is in beta mode!')
    targets = None
else:
    raise ValueError("Could not find targets in dataset.")
It raises a ValueError because the dataset doesn't have any of the checked attributes.
My dataset doesn't have `targets`, `tensors`, etc. Is it possible to use OTDD with it?
It looks like the perfect tool for what I want to achieve!
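One possible route, offered as a sketch rather than an officially supported path: the block above only needs a `targets` attribute with one integer label per sample, so a detection-style dataset could be wrapped once a single label per image is chosen. The helper names below (`dominant_category`, `build_targets`) are hypothetical, and taking the category of the largest-area annotation is just one assumed convention. `category_id` and `area` are standard COCO annotation fields.

```python
def dominant_category(annotations, default=-1):
    """Pick one label for an image: the category of its largest annotation."""
    if not annotations:
        return default  # image without annotations; filter these out later
    largest = max(annotations, key=lambda ann: ann["area"])
    return largest["category_id"]

def build_targets(per_image_annotations):
    """Return one integer label per image, in dataset order."""
    return [dominant_category(anns) for anns in per_image_annotations]

# Toy annotations for three images (made-up values).
per_image = [
    [{"category_id": 3, "area": 120.0}, {"category_id": 7, "area": 900.0}],
    [{"category_id": 1, "area": 50.0}],
    [],
]
targets = build_targets(per_image)
print(targets)  # [7, 1, -1]

# In the dataset class, the list would then be exposed as an attribute, e.g.
# self.targets = torch.LongTensor(targets), so the hasattr check succeeds.
```

Whether a single label per image is meaningful for a given detection dataset is a modelling decision; the distance compares labelled samples, so images without a usable label would need to be dropped or treated as unlabeled.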
Hello,
When I run the example in the README, I get the following error after fixing the issues mentioned in #20.
After tracing the code, I found it is caused by NaNs produced at line 383 (`compute_label_stats`) in `flows.py`.
I tried using `eigen_correction='constant'` to ensure that the covariance matrix is PSD, but it didn't work.
A workaround I found helpful is setting `diagonal_cov = True`.
I would like to confirm whether setting `diagonal_cov = True` is a valid way to deal with the issue. A side question: do `xonly`, `xonly-attached`, and `xyaug` correspond to fd, jd-fl, and jd-vl in the paper? Thanks in advance for your time.
TypeError Traceback (most recent call last)
/tmp/ipykernel_48265/2326306459.py in <module>
22 device='cpu'
23 )
---> 24 d,out = flow.flow()
~/Desktop/kuan/otdd/otdd/pytorch/flows.py in flow(self, tol)
477 pbar.set_description(f'Flow Step {iter}/{len(self.times)}, F_t={obj:8.2f}')
478 self.callback.on_step_begin(self.otdd, iter)
--> 479 obj = self.step(iter)
480 logger.info(f't={t:8.2f}, F(a_t)={obj:8.2f}') # Although things have been updated, this is obj of time t still
481 self.history.append(obj)
~/Desktop/kuan/otdd/otdd/pytorch/flows.py in step(self, iter)
454 if self.otdd.inner_ot_method != 'exact':
455 logger.info('Performing stats update...')
--> 456 self.stats_update()
457
458 if self.compute_coupling == 'every_iteration':
~/Desktop/kuan/otdd/otdd/pytorch/flows.py in stats_update(self)
393 )
394 if torch.isnan(self.otdd.Covs[0]).any():
--> 395 pdb.set_trace(header='Nans in Cov Matrices')
396
397
TypeError: set_trace() got an unexpected keyword argument 'header'
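The `header` keyword of `pdb.set_trace()` was added in Python 3.7, so on older interpreters the debugging call itself fails with this `TypeError` and hides the underlying problem, which is NaNs in the covariance matrices. A minimal sketch of a version-independent alternative is to raise an explicit error instead of opening a debugger. `assert_finite_covs` is a hypothetical helper, not part of otdd, and NumPy arrays stand in for the tensors.

```python
import numpy as np

def assert_finite_covs(covs):
    """Raise ValueError naming the classes whose covariance has NaN/inf."""
    bad = [i for i, c in enumerate(covs) if not np.isfinite(c).all()]
    if bad:
        raise ValueError(f"Non-finite entries in covariance matrices of classes {bad}")

good = np.eye(3)
broken = np.eye(3)
broken[0, 0] = np.nan

assert_finite_covs([good, good])  # passes silently

try:
    assert_finite_covs([good, broken])
except ValueError as err:
    message = str(err)
    print(message)
```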
Dear author,
I am trying to calculate the distance between a dataset and itself (the USPS training set). According to the paper, it says
Here are the settings for the two experiments:
# For d_OT
from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance
# Load datasets
loaders_src = load_torchvision_data('USPS', valid_size=0, resize = 28, maxsize=20000)[0]
loaders_tgt = load_torchvision_data('USPS', valid_size=0, resize = 28, maxsize=20000)[0]
# Instantiate distance
dist = DatasetDistance(loaders_src['train'], loaders_tgt['train'],
                       inner_ot_method='exact',
                       inner_ot_debiased=True,
                       device='cpu')
d = dist.distance(maxsamples = 20000)
print(f'OTDD(src,tgt)={d}')
# For d_OT_N
# Instantiate distance
dist = DatasetDistance(loaders_src['train'], loaders_tgt['train'],
                       inner_ot_debiased=True,
                       device='cpu')
d = dist.distance(maxsamples = 20000)
print(f'OTDD(src,tgt)={d}')
The following code computes the exact OTDD between a dataset and itself, so the expected distance `d` is zero. When the code is run on an Apple M1 processor it does not return zero; for the given random seed it returns 2.99 (to 2 d.p.). One workaround on an M1 is to set `device="mps"`, after which the expected result of zero is returned.
from otdd.pytorch.distance import DatasetDistance
import torch
from torch.utils.data import TensorDataset
n_samples = 1000
n_feats = 100000
n_labels = 10
max_samples = n_samples
torch.manual_seed(42)
data_A = torch.randn(n_samples, n_feats)
labels_A = torch.randint(low=0, high=n_labels, size=(n_samples,))
ds_A = TensorDataset(data_A, labels_A)
dist = DatasetDistance(
    ds_A,
    ds_A,  # compare to itself
    inner_ot_method="exact",
    debiased_loss=True,
    p=2,
    entreg=1e-1,
    inner_ot_debiased=True,
    device="cpu",
)
d = dist.distance(maxsamples=max_samples)
print(f"distance: {d}")
Thanks for your impressive work! I found some leftover diff markers in `setup.py`; you can remove them when you have time :)
Is it possible to use OTDD on two datasets with different numbers of labels, different sizes of X and Y, and different numbers of distinct features?
If yes, what are the recommended settings for this problem?
When I try to compute the distance between two subsets randomly sampled from the EMNIST dataset, using the 'exact' method and following the format of `example.py`, I always end up in the `except` branch of the function `pwdist_exact`.
from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance
from torch.utils.data import DataLoader, SubsetRandomSampler, Dataset
from torchvision.datasets import EMNIST
from torchvision.transforms import Compose, ToTensor, Normalize, Grayscale
import json
class CustomDataset(Dataset):
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = [int(i) for i in indices]
        self.targets = dataset.targets  # keep the targets attribute
        self.classes = dataset.classes  # keep the classes attribute

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, item):
        x, y = self.dataset[self.indices[item]]
        return x, y

    def get_class_distribution(self):
        sub_targets = self.targets[self.indices]
        return sub_targets.unique(return_counts=True)
raw_data_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/EMNIST'
sub_train_config_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/sub_train_datasets_config.json'
sub_test_config_path = '/mnt/linuxidc_client/dataset/Amazon_Review_split/test_dataset_config.json'
train_id = 0
test_id = 0
dataset_name = "EMNIST"
sub_train_key = 'train_sub_{}'.format(train_id)
sub_test_key = 'test_sub_{}'.format(test_id)
BATCH_SIZE = 2048
# The with-statement closes each file, so no explicit f.close() is needed.
with open(sub_train_config_path, 'r') as f:
    current_subtrain_config = json.load(f)
with open(sub_test_config_path, 'r') as f:
    current_subtest_config = json.load(f)
real_train_index = sorted(list(current_subtrain_config[dataset_name][sub_train_key]["indexes"]))
print("check last real_train_index: ", real_train_index[-1])
print(len(real_train_index))
real_test_index = sorted(list(current_subtest_config[dataset_name][sub_test_key]["indexes"]))
print("check last real_test_index: ", real_test_index[-1])
print(len(real_test_index))
transform = Compose([
    Grayscale(3),
    ToTensor(),
    Normalize((0.1307,), (0.3081,))
])
train_dataset = EMNIST(
    root=raw_data_path,
    split="bymerge",
    download=False,
    train=True,
    transform=transform
)
test_dataset = EMNIST(
    root=raw_data_path,
    split="bymerge",
    download=False,
    train=False,
    transform=transform
)
print("begin train: {} test: {}".format(train_id, test_id))
print("check all size: train[{}] and test[{}]".format(len(train_dataset), len(test_dataset)))
train_dataset = CustomDataset(train_dataset, real_train_index)
train_loader = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE)
test_dataset = CustomDataset(test_dataset, real_test_index)
test_loader = DataLoader(dataset=test_dataset, batch_size=BATCH_SIZE)
print("Finished split datasets!")
print("check train_loader: {}".format(len(train_loader) * BATCH_SIZE))
print("check test_loader: {}".format(len(test_loader) * BATCH_SIZE))
# Instantiate distance
dist = DatasetDistance(train_loader, test_loader,
                       inner_ot_method = 'exact',
                       debiased_loss = True,
                       p = 2, entreg = 1e-1,
                       device='cuda:3')
d = dist.distance(maxsamples = 1000)
print(f'OTDD-EMNIST(train,test)={d:8.2f}')
It is worth noting that the label distribution of the EMNIST subset is not the same as that of the entire EMNIST dataset. In the subset, some labels have no instances at all.
The function `pwdist_exact` seems to return the correct result when I take evenly spaced samples. Here is the code:
# real_train_index = sorted(list(current_subtrain_config[dataset_name][sub_train_key]["indexes"]))
real_train_index = list(range(0, 697932, 10))
# real_test_index = sorted(list(current_subtest_config[dataset_name][sub_test_key]["indexes"]))
real_test_index = list(range(0, 116323, 3))
You can download my `sub_train_datasets_config.json` and `test_dataset_config.json` from Google Drive. Link: https://drive.google.com/drive/folders/1r_vyLJ-RmuuNZqneBP3meexrEZvgc_Ce?usp=sharing
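One thing that may be worth checking, as a guess from reading the snippet rather than a confirmed diagnosis: `CustomDataset` exposes the `targets` of the full EMNIST dataset, while `__len__` and `__getitem__` only cover the selected indices. Any code that reads `targets` directly would then see labels and label counts that do not match the subset. The sketch below keeps `targets` aligned with the subset. It uses NumPy arrays in place of tensors, and `SubsetWithTargets` is a hypothetical name.

```python
import numpy as np

class SubsetWithTargets:
    """Index-based view of a dataset whose targets match the subset."""

    def __init__(self, data, targets, indices):
        self.data = data
        self.indices = np.asarray(indices, dtype=np.int64)
        # Slice the labels so that targets[k] belongs to sample k of the subset.
        self.targets = np.asarray(targets)[self.indices]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, item):
        return self.data[self.indices[item]], self.targets[item]

# Toy data: 6 samples, labels 0..2, subset picks samples 1, 4 and 5.
data = np.arange(12).reshape(6, 2)
labels = np.array([0, 1, 2, 0, 1, 2])
subset = SubsetWithTargets(data, labels, [1, 4, 5])

print(len(subset), subset.targets.tolist())  # 3 [1, 1, 2]
x, y = subset[2]
print(x.tolist(), int(y))  # [10, 11] 2
```

In a PyTorch version the same idea would be `self.targets = dataset.targets[self.indices]`, assuming `dataset.targets` is a tensor that supports list indexing.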
Hi,
Thanks for providing this library for vision datasets. My question is: can the methods be applied to tabular datasets? Is it even possible to compare two tabular datasets that have mixed column types?
Hi, I would like to ask what the preferred preprocessing technique or setting is for using `IncomparableDatasetDistance` on tabular datasets with missing values.
Hi, thanks for your work and code. I'm confused about the `debiased_loss` parameter in `DatasetDistance`, and I have two questions:
1. [...] `debiased_loss` is `True`?
2. [...] `example.py` given by this repo, and I notice that in the `batch_augmented_cost` function, the size of `W` is [20, 20] rather than [10, 10]. I realise this is because the class distances of D1 and D2 are concatenated as [[DYY1, DYY12], [DYY21, DYY2]]. But I'm afraid that the following operation uses the wrong index to look up class distances in `W`. For example, the index of class 0 in D1 and class 0 in D2 is 0 * 20 + 0 = 0, but `W.flatten()[0]` is the distance between class 0 in D1 and class 0 in D1.
M = W.shape[1] * Y1[:, :, None] + Y2[:, None, :]
C2 = W.flatten()[M.flatten(start_dim=1)].reshape(-1,Y1.shape[1], Y2.shape[1])
I don't know whether my understanding is correct.
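A small NumPy check of the indexing may help here. It rests on one assumption drawn from another report on this page: with the debiased Sinkhorn loss, the labels of the second dataset are reindexed to start at `len(self.V1)`, so class 0 of D2 would carry index 10 rather than 0. Under that assumption the lookup lands in the DYY12 block. The sizes and values below are made up for illustration, and NumPy's `reshape` stands in for `flatten(start_dim=1)`.

```python
import numpy as np

K = 3  # classes per dataset (hypothetical; 10 in the question)
rng = np.random.default_rng(0)
W = rng.random((2 * K, 2 * K))  # stands in for [[DYY1, DYY12], [DYY21, DYY2]]

Y1 = np.array([[0, 2]])      # one batch, source labels in 0..K-1
Y2 = np.array([[0, 1]]) + K  # target labels shifted by K (assumed reindexing)

# Same arithmetic as in the question: row index * row length + column index.
M = W.shape[1] * Y1[:, :, None] + Y2[:, None, :]
C2 = W.flatten()[M.reshape(M.shape[0], -1)].reshape(-1, Y1.shape[1], Y2.shape[1])

# Entry (i, j) equals W[Y1[i], Y2[j]], which lies in the top-right block.
print(C2[0, 0, 0] == W[0, K])                              # True
print(np.array_equal(C2[0], W[np.ix_(Y1[0], Y2[0])]))      # True
```

If the target labels were not shifted, the same lookup would indeed read from the DYY1 block, which is the situation described in the question.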
Hi! The refactoring in the last commit seems to have introduced a small bug: `load_full_dataset()` is called with an argument `reindex_start` that does not exist in its signature (in a couple of places).
I'm not sure, but it might be enough to add `reindex_start=None` to the signature of `load_full_dataset()` and then change the reindexing code into the following:
if type(reindex) is bool:
    if reindex_start is None:
        reindex_start = 0
    reindex_vals = range(reindex_start, reindex_start + len(labels))
Does it make sense? Thanks!
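To make the proposal concrete, here is a standalone sketch of the reindexing logic, assuming `labels` holds the sorted unique label values. `reindex_labels` is a hypothetical helper written for illustration, not the library's function.

```python
def reindex_labels(targets, reindex_start=None):
    """Map the sorted unique labels to consecutive integers.

    The new labels start at reindex_start, or at 0 when it is None.
    """
    if reindex_start is None:
        reindex_start = 0
    labels = sorted(set(targets))
    reindex_vals = range(reindex_start, reindex_start + len(labels))
    lookup = dict(zip(labels, reindex_vals))
    return [lookup[t] for t in targets]

targets = [3, 7, 7, 9, 3]
print(reindex_labels(targets))                    # [0, 1, 1, 2, 0]
print(reindex_labels(targets, reindex_start=10))  # [10, 11, 11, 12, 10]
```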
Thank you for your attention!
I was trying to reproduce the pairwise distances among the *MNIST+USPS datasets (Figure 4 of Geometric Dataset Distances via Optimal Transport), using the parameter setup provided by `example.py`. But the measured distances are much smaller than those in Figure 4, even when I use the upper bound as the inner OT method, and some of the relations between datasets are different. Is there a detailed parameter setup for reproducing Figure 4? Thank you very much!
Thank you for your notice.
When I ran `d,out = flow.flow()` I got:
RuntimeError: symeig_cpu: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 635).
Is there any solution for this error? Thanks a lot!
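Eigendecomposition failures of this kind often point to covariance matrices that are singular or badly conditioned, for example when a class has fewer samples than features. One generic remedy, shown here as a NumPy sketch and not as what `eigen_correction` does inside otdd, is to add a small multiple of the identity before decomposing. The value of `eps` is an arbitrary assumption.

```python
import numpy as np

def regularized_cov(X, eps=1e-6):
    """Sample covariance of the rows of X plus eps * I (a ridge term)."""
    cov = np.cov(X, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))  # 5 samples, 20 features: rank-deficient

raw = np.cov(X, rowvar=False)
reg = regularized_cov(X)

print(np.linalg.matrix_rank(raw))         # at most 4, far below 20
print(np.linalg.eigvalsh(reg).min() > 0)  # True: strictly positive definite
```

Another option mentioned elsewhere on this page is `diagonal_cov = True`, which avoids the full eigendecomposition at the cost of ignoring feature correlations.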
There are open compliance tasks that need to be reviewed for your otdd repo.
To bring this repository to the standard required for 2021, we require administrators of this and all Microsoft GitHub repositories to complete a small set of tasks within the next 60 days. This is critical work to ensure the compliance and security of your microsoft GitHub organization.
Please take a few minutes to complete the tasks at: https://repos.opensource.microsoft.com/orgs/microsoft/repos/otdd/compliance
You can close this work item once you have completed the compliance tasks, or it will automatically close within a day of taking action.
If you no longer need this repository, it might be quickest to delete the repo, too.
More information about GitHub inside Microsoft and the new GitHub AE product can be found at https://aka.ms/gim or by contacting [email protected]
FYI: current admins at Microsoft include @nfusi, @philrosenfield, @JamesBHall, @dmelis, @ntenenz, @evanrgreen
Hi,
I don't understand why the distance between a dataset and itself is greater than 0.
MWE
import torch
import numpy as np
from torch.utils.data import TensorDataset
from otdd.pytorch.distance import DatasetDistance
import openml
d1 = openml.datasets.get_dataset(31)
d2 = openml.datasets.get_dataset(31)
def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds
samples = 100
dim = 6
x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)
# x2 = np.random.randn(samples, dim)
# y2 = np.random.randint(0, 2, size=(samples))
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
dist = DatasetDistance(ds1,ds2, inner_ot_method = 'exact',
                       debiased_loss = True, entreg = 1e-1,
                       device='cuda')
print(dist.distance())
output:
tensor(1.8688, device='cuda:0')
When I ran the gradient flow example from GitHub, I ran into the following error:
RuntimeError: symeig_cpu: the algorithm failed to converge; 624 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.
Hi,
Thanks for the great work. Many thanks for releasing the code to the public. I have an issue with OTDD on the CIFAR10 dataset. Below is the code.
from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance
loaders_tgt = load_torchvision_data('CIFAR10', valid_size = 0, resize = 28)[0]
loaders_src = load_torchvision_data('CIFAR10', valid_size = 0, resize = 28)[0]
print('===> Reading both datasets done')
dist = DatasetDistance(loaders_src['train'], loaders_src['train'],
                       method = 'precomputed_labeldist',
                       inner_ot_method = 'exact',
                       inner_ot_debiased = True,
                       debiased_loss = True,
                       p = 2, entreg = 1e-1,
                       device='cuda')
d = dist.distance()
print(f'OTDD-Exact-CompleteData(CIFAR10 Img, CIFAR10 Img)={d:8.2f}')
$OTDD-Exact-CompleteData(CIFAR10 Img, CIFAR10 Img)= 723.36
Surprisingly, it gives a distance of 0, as expected, when computed on only 2000 samples (the default).
The exact error is shown below:
$geomloss/sinkhorn_samples.py", line 327, in lse_genred
"( B - (P * " + cost + " ) )",
TypeError: can only concatenate str (not "function") to str
But the same code works when given 2000 samples, as in the default code.
Please help me understand why this could be the case, especially the nonzero distance on CIFAR10.
Thanks!
Best,
Anuradha
Following the installation instructions leads to "ModuleNotFoundError: No module named 'munkres'". Installing `munkres` manually with pip resolves the issue.
For unsupervised datasets, when setting the parameters
debiased_loss=False
ignore_source_labels=True
ignore_target_labels=True
the code in file distance.py lines 300-303 will run:
if (targets2 is None) or self.ignore_target_labels:
    reindex_start = len(self.V1) if (self.loss == 'sinkhorn' and self.debiased_loss) else True
    X, Y_infer, Y_true = self._load_infer_labels(D2, classes2, reindex=True, reindex_start=reindex_start)
    self.targets2 = targets2 = Y_infer - reindex_start
This results in `RuntimeError: Subtraction, the - operator, with a bool tensor is not supported`, because of `Y_infer - reindex_start`. I think that in line 301 the fallback value should be `0` instead of `True`, so the line changes to:
reindex_start = len(self.V1) if (self.loss == 'sinkhorn' and self.debiased_loss) else 0
When I use the CPU for the calculation, it takes a very long time, so I want to use multiple GPUs instead.
But compared with the CPU, even with the same `maxsamples` setting, GPU memory usage increases dramatically and easily ends in 'CUDA: out of memory'.
Is it possible to speed up the calculation with GPUs?
When I tried to calculate the "exact" distance between two large datasets (with 2 labels and more than 10000 samples in total), the following error was raised:
TypeError Traceback (most recent call last)
~/otdd/otdd/pytorch/wasserstein.py in pwdist_exact(X1, Y1, X2, Y2, symmetric, loss, cost_function, p, debias, entreg, device)
335 try:
--> 336 D[i, j] = distance(X1[Y1==c1[i]].to(device), X2[Y2==c2[j]].to(device)).item()
337 except:
~/test/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1101 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1102 return forward_call(*input, **kwargs)
1103 # Do not call functions when jit is used
~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/samples_loss.py in forward(self, *args)
283 labels_y=l_y,
--> 284 verbose=self.verbose,
285 )
~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/sinkhorn_samples.py in sinkhorn_online(α, x, β, y, p, blur, reach, diameter, scaling, cost, debias, potentials, **kwargs)
142 softmin = partial(
--> 143 softmin_online, log_conv=keops_lse(cost, D, dtype=str(x.dtype)[6:])
144 )
~/test/lib/python3.7/site-packages/geomloss-0.2.4-py3.7.egg/geomloss/sinkhorn_samples.py in keops_lse(cost, D, dtype)
107 # "( B - (P * " + cost + " ) )",
--> 108 "( B - (P * " + cost+ " ) )",
109 "A = Vi(1)",
TypeError: can only concatenate str (not "function") to str
Is it possible to use the "exact" method with a large number of samples? Thanks
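The traceback shows geomloss taking its `sinkhorn_online` (KeOps) path, which builds a formula by string concatenation and therefore cannot accept the callable cost passed here. My understanding, which should be treated as an assumption, is that geomloss selects this backend automatically once the point clouds get large, which would explain why small sample sizes work. One workaround is to cap the number of samples per class before computing the distance. The helper below is hypothetical, and the cap of 3 in the toy run is only there to keep the example readable.

```python
import numpy as np

def cap_per_class(labels, max_per_class, seed=0):
    """Return sorted indices keeping at most max_per_class samples per label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) > max_per_class:
            # Subsample without replacement when the class is too large.
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

labels = np.array([0] * 10 + [1] * 2)
idx = cap_per_class(labels, max_per_class=3)

print(len(idx))                           # 5
print(np.bincount(labels[idx]).tolist())  # [3, 2]
```

The returned indices can be used to build a smaller dataset; a lower `maxsamples` value in `dist.distance()` has a similar effect on the total sample count, though not per class.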
I tried to compute the OTDD between a dataset and itself, but the result is not zero.
loaders_src = load_torchvision_data('MNIST', valid_size=0, resize = 28, maxsize=2000)[0]
loaders_tgt = load_torchvision_data('MNIST', valid_size=0, resize = 28, maxsize=2000)[0]
The result is 346.16
Hi, I get the above error when trying to run the second example in the README (the first example in Advanced Usage), and I couldn't find where `embedded_feature_cost` is defined. I've followed the installation instructions in a conda virtual environment. Could you help take a look? Thanks!
I end up in (Pdb) mode when testing the code with `inner_ot_method='exact'`. It does not happen with `inner_ot_method='gaussian_approx'`. I basically just followed the vanilla example. Here is my code:
dist = DatasetDistance(dataset1, dataset2,
                       inner_ot_method = 'exact',
                       debiased_loss = True,
                       p = 2, entreg = 1e-1,
                       device='cpu')
d = dist.distance(maxsamples = 1000)
Both `dataset1` and `dataset2` are created with `TensorDataset`.
Below is the output from the program:
Computing label-to-label distances: 0%| | 0/4 [00:00<?, ?it/s]
> mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py(338)pwdist_exact()
-> if symmetric:
(Pdb) True
True
(Pdb) False
False
(Pdb) 1
1
(Pdb) 0
0
(Pdb)
There is no way for me to leave (Pdb) except with `quit`, which also terminates the program with an error:
Traceback (most recent call last):
File "otdd_distance.py", line 63, in <module>
d = dist.distance(maxsamples = 1000)
File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/distance.py", line 595, in distance
_ = self._get_label_distances()
File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/distance.py", line 529, in _get_label_distances
DYY12 = pwdist(self.X1,self.Y1,self.X2, self.Y2)
File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py", line 338, in pwdist_exact
if symmetric:
File "mypath/envs/OTDD/lib/python3.6/site-packages/otdd/pytorch/wasserstein.py", line 338, in pwdist_exact
if symmetric:
File "mypath/envs/OTDD/lib/python3.6/bdb.py", line 51, in trace_dispatch
return self.dispatch_line(frame)
File "mypath/envs/OTDD/lib/python3.6/bdb.py", line 70, in dispatch_line
if self.quitting: raise BdbQuit
bdb.BdbQuit
About my environment:
A new environment with conda create -n name python=3.6
Pytorch installed by conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
Then I followed the instructions in this repository
python -m pip install -r requirements.txt
python -m pip install .
Please let me know how to solve this problem. Thank you very much!
Code:
import numpy as np
from sklearn.impute import SimpleImputer
# With incomparable dataset distance cost
import torch
import openml
from otdd.pytorch.distance import IncomparableDatasetDistance
from torchvision.models import resnet18
from otdd.pytorch.datasets import load_torchvision_data
import numpy as np
from torch.utils.data import TensorDataset
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
d1 = openml.datasets.get_dataset(8)
d2 = openml.datasets.get_dataset(39)
def dataset_from_numpy(X, Y, classes = None):
    targets = torch.LongTensor(list(Y))
    ds = TensorDataset(torch.from_numpy(X).type(torch.FloatTensor),targets)
    ds.targets = targets
    ds.classes = classes if classes is not None else [i for i in range(len(np.unique(Y)))]
    return ds
x1,y1,_,_ = d1.get_data(dataset_format="array", target=d1.default_target_attribute)
x2,y2,_,_ = d2.get_data(dataset_format="array", target=d2.default_target_attribute)
x1 = imp.fit_transform(x1)
x2 = imp.fit_transform(x2)
ds1 = dataset_from_numpy(x1,y1)
ds2 = dataset_from_numpy(x2,y2)
print('datasets created')
dist = IncomparableDatasetDistance(ds1, ds2,
                                   debiased_loss = False,
                                   inner_ot_method = 'exact',
                                   p = 5, entreg = 10e-1,
                                   device='cpu')
d = dist.distance()
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-1e152b2bf0fd> in <module>()
37 inner_ot_method = 'exact',
38 p = 5, entreg = 10e-1,
---> 39 device='cpu')
40
41 d = dist.distance()
3 frames
/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py in __init__(self, *args, **kwargs)
1093 """
1094 def __init__(self, *args, **kwargs):
-> 1095 super(IncomparableDatasetDistance, self).__init__(*args, **kwargs)
1096 if self.debiased_loss:
1097 raise ValueError('Debiased GWOTDD not implemented yet')
[/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py](https://localhost:8080/#) in __init__(self, D1, D2, method, symmetric_tasks, feature_cost, src_embedding, tgt_embedding, ignore_source_labels, ignore_target_labels, loss, debiased_loss, p, entreg, λ_x, λ_y, inner_ot_method, inner_ot_loss, inner_ot_debiased, inner_ot_p, inner_ot_entreg, diagonal_cov, min_labelcount, online_stats, sqrt_method, sqrt_niters, sqrt_pref, nworkers_stats, coupling_method, nworkers_dists, eigen_correction, device, precision, verbose, *args, **kwargs)
232
233 if self.D1 is not None and self.D2 is not None:
--> 234 self._init_data(self.D1, self.D2)
235 else:
236 logger.warning('DatasetDistance initialized with empty data')
/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py in _init_data(self, D1, D2)
320
321
--> 322 self.classes1 = [classes1[i] for i in self.V1]
323 self.classes2 = [classes2[i] for i in self.V2]
324
/usr/local/lib/python3.7/dist-packages/otdd/pytorch/distance.py in <listcomp>(.0)
320
321
--> 322 self.classes1 = [classes1[i] for i in self.V1]
323 self.classes2 = [classes2[i] for i in self.V2]
324
IndexError: list index out of range
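The failing line indexes `classes1` with the label values in `self.V1`, so the error is consistent with labels that are not the contiguous integers `0..K-1` that `dataset_from_numpy` assumes when it builds `classes` from `range(len(np.unique(Y)))`. That reading is a guess from the traceback, not a confirmed diagnosis. A sketch of remapping labels before building the dataset, using `np.unique` with `return_inverse=True`; `contiguous_labels` is a hypothetical helper:

```python
import numpy as np

def contiguous_labels(y):
    """Map arbitrary label values to 0..K-1 and return (new_labels, classes)."""
    classes, new_labels = np.unique(np.asarray(y), return_inverse=True)
    # classes[new_labels[k]] recovers the original label of sample k.
    return new_labels.reshape(-1), classes.tolist()

y = np.array([2, 5, 5, 9, 2])
new_y, classes = contiguous_labels(y)

print(new_y.tolist())  # [0, 1, 1, 2, 0]
print(classes)         # [2, 5, 9]
```

The remapped labels and the `classes` list could then be passed to `dataset_from_numpy(x, new_y, classes=classes)`, so that every label value is a valid index into `classes`.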
Hi, when I run the text code, I encounter this error. Can you take a look, @dmelis? Thanks!
Hi, I tried to run your gradient flow code.
Here is my code:
import os
import matplotlib
%matplotlib inline
#Comment out if not on notebook
import torch
from torchvision.models import resnet18
from otdd.pytorch.datasets import load_torchvision_data
from otdd.pytorch.distance import DatasetDistance, FeatureCost
from otdd.pytorch.flows import OTDD_Gradient_Flow
from otdd.pytorch.flows import CallbackList, ImageGridCallback, TrajectoryDump
# Load datasets
loaders_src = load_torchvision_data('MNIST', valid_size=0, resize = 28, maxsize=1000)[0]
loaders_tgt = load_torchvision_data('USPS', valid_size=0, resize = 28, maxsize=1000)[0]
outdir = os.path.join('out', 'flows')
callbacks = CallbackList([
    ImageGridCallback(display_freq=2, animate=False, save_path = outdir + '/grid'),
])
flow = OTDD_Gradient_Flow(loaders_src['train'], loaders_tgt['train'],
                          ### Gradient Flow Args
                          method = 'xonly-attached',
                          use_torchoptim=True,
                          optim='adam',
                          steps=10,
                          step_size=1,
                          callback=callbacks,
                          clustering_method='kmeans',
                          ### OTDD Args
                          online_stats=True,
                          diagonal_cov = False,
                          device='cuda'
                          )
d,out = flow.flow()
Then I received this error:
RuntimeError: symeig_cpu: the algorithm failed to converge; 643 off-diagonal elements of an intermediate tridiagonal form did not converge to zero.
Do you know what is wrong? Thank you so much!