oxfordml / gpz
GPz 2.0: Heteroscedastic Gaussian processes for uncertain and incomplete data
Hello,
I'm a user of GPz 1.0 (the Python version), and I am wondering whether it is possible to save a trained model to a file, so that I can load the trained model later and predict the redshifts of a different sample.
Something like this:
I use a training/validation sample and a testing sample to train the model for the first time. After training is done, no information about the trained model is saved, so if I later want to evaluate the performance on another test set (one that was not available the first time I ran the model), I would need to train everything again. It would be best if I could load the already trained model and simply apply it to predict the redshifts on new data (similar to Keras's model saving/loading).
Is it possible? Thanks.
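As a workaround while waiting for an answer, one option worth trying is Python's `pickle` module. This is a hedged sketch, not a confirmed GPz feature: it assumes the trained GPz model is an ordinary Python object whose state survives pickling, and the `save_model`/`load_model` names and the `predict` call in the usage comment are illustrative, not the GPz API.

```python
import pickle

def save_model(model, path):
    """Serialize a trained model object to disk (assumes it is picklable)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Restore a previously saved model object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage (names are illustrative, not the GPz API):
# save_model(trained_model, "gpz_model.pkl")
# model = load_model("gpz_model.pkl")
# mu, sigma = model.predict(X_new)  # hypothetical predict call
```

If the model object holds handles that cannot be pickled, an alternative is to extract and save its learned parameter arrays instead, then restore them into a freshly initialized model.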
I am trying to modify your demo_sinc.py code to test it for an engineering problem I am working on and I have a few questions.
Ultimately, I would like to be able to use GPz to draw random measurements of the underlying process for usage in Monte Carlo simulations.
Thanks and great work, I found your paper very interesting.
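The Monte Carlo use case above can be sketched as follows, assuming (as in the demos) that `predict` returns a per-point mean `mu` and total predictive variance `sigma`: one realization of the underlying process is obtained by sampling each test point from its own Gaussian N(mu_i, sigma_i). Note this treats test points as independent and ignores cross-point correlations, which the sparse parameterization does not expose; the function name is hypothetical.

```python
import numpy as np

def draw_realizations(mu, sigma, n_draws, seed=0):
    """Draw n_draws independent Monte Carlo realizations of the process.

    mu:    per-point predictive mean
    sigma: per-point predictive *variance* (converted to std dev below)
    Returns an array of shape (n_draws, n_points); each row is one realization.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    std = np.sqrt(np.asarray(sigma, dtype=float))  # variance -> std dev
    return rng.normal(mu, std, size=(n_draws, mu.shape[0]))

# Usage (illustrative):
# samples = draw_realizations(mu, sigma, n_draws=1000)
```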
Hi @csibrahim @bperezorozco @a5a ,
I am trying the algorithm on a new data set and I am getting some strange results.
The data set has been simulated and I am using the demo_photoz script as a template.
I am getting a large rmse and I am not sure where I am going wrong in my analysis.
I have attached the MATLAB code below. I was wondering if you could take a look at my logic and advise on improvements accordingly.
Sincere thanks in advance.
Best,
Andrew
clear; clc; % clear workspace and command window
rng(1); % fix random seed
addpath GPz/ % path to GPz
addpath(genpath('minFunc_2012/')) % path to minfunc
%%%%%%%%%%%%%% Model options %%%%%%%%%%%%%%%%
m = 1000; % number of basis functions to use [required]
method = 'VD'; % select a method, options = GL, VL, GD, VD, GC and VC [required]
heteroscedastic = true; % learn a heteroscedastic noise process, set to false if only interested in point estimates [default=true]
normalize = true; % pre-process the input by subtracting the means and dividing by the standard deviations [default=true]
maxIter = 500; % maximum number of iterations [default=200]
maxAttempts = 50; % maximum iterations to attempt if there is no progress on the validation set [default=infinity]
trainSplit = 0.2; % percentage of data to use for training
validSplit = 0.2; % percentage of data to use for validation
testSplit = 0.6; % percentage of data to use for testing
inputNoise = true; % false = use mag errors as additional inputs, true = use mag errors as additional input noise
csl_method = 'balanced'; % cost-sensitive learning option [default='normal']:
%   'balanced':   weigh rare samples more heavily during training
%   'normalized': assign an error cost of 1/(z+1) to each sample
%   'normal':     no weights assigned, all samples are equally important
binWidth = 0.1; % width of the bin for 'balanced' cost-sensitive learning [default=range(output)/100]
%%%%%%%%%%%%%% Prepare data %%%%%%%%%%%%%%
%new data set
n = 500; % Sample size
Mdl = regARIMA('MA',{1.4,0.8},'AR',0.5,'Intercept',3,...
'Variance',1,'Beta',[2;-1.5],'D',1);
rng(1); % For reproducibility
X = randn(n,2);
Y = simulate(Mdl,n,'X',X);
figure;
plot(Y);
title 'Simulated Responses';
axis tight;
[n,d] = size(X);
filters = d/2;
% you can also select the size of each sample
% [training,validation,testing] = sample(n,10000,10000,10000);
% get the weights for cost-sensitive learning
omega = getOmega(Y,csl_method,binWidth);
if(inputNoise)
    % treat the mag-errors as input noise variance
    Psi = X(:,filters+1:end).^2;
    X(:,filters+1:end) = [];
else
    % treat the mag-errors as additional inputs
    X(:,filters+1:end) = log(X(:,filters+1:end));
    Psi = [];
end
% split data into training, validation and testing
[training,validation,testing] = sample(n,trainSplit,validSplit,testSplit);
%%%%%%%%%%%%%% Fit the model %%%%%%%%%%%%%%
% initialize the model
model = init(X,Y,method,m,'omega',omega,'training',training,'heteroscedastic',heteroscedastic,'normalize',normalize,'Psi',Psi);
% train the model
model = train(model,X,Y,'omega',omega,'training',training,'validation',validation,'maxIter',maxIter,'maxAttempts',maxAttempts,'Psi',Psi);
%%%%%%%%%%%%%% Compute Metrics %%%%%%%%%%%%%%
% use the model to generate predictions for the test set
[mu,sigma,nu,beta_i,gamma] = predict(X,model,'Psi',Psi,'selection',testing);
% mu = the best point estimate
% nu = variance due to data density
% beta_i = variance due to output noise
% gamma = variance due to input noise
% sigma = nu+beta_i+gamma
% compute metrics
errors=Y(testing)-mu;
hist(errors)
plot(mu)
hold on
plot(Y(testing))
% root mean squared error, i.e. sqrt(mean(errors.^2))
rmse = sqrt(metrics(Y(testing),mu,sigma,@(y,mu,sigma) (y-mu).^2));
% mean log likelihood: mean(-0.5*errors.^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi))
mll = metrics(Y(testing),mu,sigma,@(y,mu,sigma) -0.5*(y-mu).^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi));
% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.15
fr15 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.15));
% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.05
fr05 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.05));
% bias, i.e. mean(errors)
bias = metrics(Y(testing),mu,sigma,@(y,mu,sigma) y-mu);
% print metrics for the entire data
fprintf('RMSE\t\tMLL\t\tFR15\t\tFR05\t\tBIAS\n')
fprintf('%f\t%f\t%f\t%f\t%f\n',rmse(end),mll(end),fr15(end),fr05(end),bias(end))
%%%%%%%%%%%%%% Display Results %%%%%%%%%%%%%%%%
% reduce the sample for efficient plotting
[x,y,color,counts]=reduce(Y(testing),mu,sigma,200);
figure;scatter(x,y,12,log(color),'s','filled');title('Uncertainty');xlabel('Spectroscopic Redshift');ylabel('Photometric Redshift');colormap jet;
figure;scatter(x,y,12,log(counts),'s','filled');title('Density');xlabel('Spectroscopic Redshift');ylabel('Photometric Redshift');colormap jet;
% plot the change in metrics as functions of data percentage
x = [1 5:5:100];
ind = round(x*length(rmse)/100);
figure;plot(x,rmse(ind),'o-');xlabel('Percentage of Data');ylabel('RMSE');
figure;plot(x,mll(ind),'o-');xlabel('Percentage of Data');ylabel('MLL');
figure;plot(x,fr05(ind),'o-');xlabel('Percentage of Data');ylabel('FR05');
figure;plot(x,fr15(ind),'o-');xlabel('Percentage of Data');ylabel('FR15');
figure;plot(x,bias(ind),'o-');xlabel('Percentage of Data');ylabel('BIAS');
% plot mean and standard deviation of different scores as functions of spectroscopic redshift using 20 bins
[centers,means,stds] = bin(Y(testing),Y(testing)-mu,20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Bias');
[centers,means,stds] = bin(Y(testing),sqrt(nu),20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Model Uncertainty');
[centers,means,stds] = bin(Y(testing),sqrt(beta_i),20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Noise Uncertainty');
Hi,
the total variance is computed in the code as follows:
sigma = nu+beta_i+gamma;
Do these variables correspond to the ones explained in the paper, i.e. nu as nu and beta_i as beta^-1?
If that is the case, what is gamma? In the paper the variance is computed as nu + beta^-1 only; there is no gamma variable.
Thank you so much!
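For reference, the demo script's own comments spell out the decomposition: nu is the variance due to data density, beta_i the variance due to output noise (the paper's beta^-1), and gamma the variance due to input noise. A minimal sketch of that sum, assuming these are per-point arrays (this is a restatement of the script's comments, not an excerpt of GPz internals):

```python
import numpy as np

def total_variance(nu, beta_i, gamma=None):
    """Total predictive variance as the sum of three per-point components.

    nu:     variance due to data density
    beta_i: variance due to output noise (the paper's beta^-1)
    gamma:  variance propagated from input noise Psi; zero when no
            input noise is supplied
    """
    nu = np.asarray(nu, dtype=float)
    beta_i = np.asarray(beta_i, dtype=float)
    if gamma is None:
        gamma = np.zeros_like(nu)
    return nu + beta_i + np.asarray(gamma, dtype=float)
```

When no input noise `Psi` is supplied, the gamma term vanishes and the expression reduces to the paper's nu + beta^-1.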
Hi,
Thank you for sharing your work! I am familiar with basic Gaussian process regression, where the length scale can be adjusted according to the data. However, I have not been able to find where the length scale for each parameter is stored in the model. I understand that this is a sparse GP, so there is a parameterization in some sense; I believe that parameterization was derived from a particular covariance function with length scales. Can I modify the length scales before the parameterization is done?
I am just starting to use these methods, so any clarification would be highly appreciated. Thank you!
Great library - I am very interested in your heteroscedastic GPs, though I am also new to the methods.
I have looked at your thesis and tried running your demo code, "demo_sinc.m," and it looks like, while you are modeling variance very well with your BFMs on data with heteroscedastic noise, you are not able to generate samples from the learned distributions that reflect this variance. Here is a zoom-in of a section of the figure generated from "demo_sinc.m." Notice that the colored lines in this segment are all very close to the mean, despite much of the data being far away from the mean here.
Is this a feature or bug of the method and/or the implementation?
Many thanks in advance for the advice here.
Best,
Stephen