oxfordml / gpz
GPz 2.0: Heteroscedastic Gaussian processes for uncertain and incomplete data
Hello,
I'm a user of GPz 1.0 (the Python version), and I am wondering whether it is possible to save a trained model to a file, so that I can load the trained model later and predict the redshifts of a different sample.
Something like this:
I use a training/validation sample and a testing sample to train the model for the first time. After training is done, no information about the trained model is saved, so if I later want to evaluate the performance on another test set (one that was not available the first time I ran the model), I would need to train everything again. It would be best if I could load the already trained model and simply apply it to predict the redshifts on new data (similar to Keras's model saving/loading).
Is it possible? Thanks.
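As a workaround while waiting for an answer, one option worth trying is Python's `pickle` module. This is a hedged sketch, not a confirmed GPz feature: it assumes the trained GPz model is an ordinary Python object whose state survives pickling, and the `save_model`/`load_model` names and the `predict` call in the usage comment are illustrative, not the GPz API.

```python
import pickle

def save_model(model, path):
    """Serialize a trained model object to disk (assumes it is picklable)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Restore a previously saved model object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Usage (names are illustrative, not the GPz API):
# save_model(trained_model, "gpz_model.pkl")
# model = load_model("gpz_model.pkl")
# mu, sigma = model.predict(X_new)  # hypothetical predict call
```

If the model object holds handles that cannot be pickled, an alternative is to extract and save its learned parameter arrays instead, then restore them into a freshly initialized model.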
I am trying to modify your demo_sinc.py code to test it for an engineering problem I am working on and I have a few questions.
Ultimately, I would like to be able to use GPz to draw random measurements of the underlying process for usage in Monte Carlo simulations.
Thanks and great work, I found your paper very interesting.
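The Monte Carlo use case above can be sketched as follows, assuming (as in the demos) that `predict` returns a per-point mean `mu` and total predictive variance `sigma`: one realization of the underlying process is obtained by sampling each test point from its own Gaussian N(mu_i, sigma_i). Note this treats test points as independent and ignores cross-point correlations, which the sparse parameterization does not expose; the function name is hypothetical.

```python
import numpy as np

def draw_realizations(mu, sigma, n_draws, seed=0):
    """Draw n_draws independent Monte Carlo realizations of the process.

    mu:    per-point predictive mean
    sigma: per-point predictive *variance* (converted to std dev below)
    Returns an array of shape (n_draws, n_points); each row is one realization.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    std = np.sqrt(np.asarray(sigma, dtype=float))  # variance -> std dev
    return rng.normal(mu, std, size=(n_draws, mu.shape[0]))

# Usage (illustrative):
# samples = draw_realizations(mu, sigma, n_draws=1000)
```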
Hi @csibrahim @bperezorozco @a5a ,
I am trying the algorithm on a new data set and I am getting some strange results.
The data set has been simulated and I am using the demo_photoz script as a template.
I am getting a large rmse and I am not sure where I am going wrong in my analysis.
I have attached the MATLAB code below. I was wondering if you could take a look at my logic and advise on improvements accordingly.
Sincere thanks in advance.
Best,
Andrew
clear; clc; % clear workspace and command window
rng(1); % fix random seed
addpath GPz/ % path to GPz
addpath(genpath('minFunc_2012/')) % path to minfunc
%%%%%%%%%%%%%% Model options %%%%%%%%%%%%%%%%
m = 1000; % number of basis functions to use [required]
method = 'VD'; % select a method, options = GL, VL, GD, VD, GC and VC [required]
heteroscedastic = true; % learn a heteroscedastic noise process, set to false if only interested in point estimates [default=true]
normalize = true; % pre-process the input by subtracting the means and dividing by the standard deviations [default=true]
maxIter = 500; % maximum number of iterations [default=200]
maxAttempts = 50; % maximum iterations to attempt if there is no progress on the validation set [default=infinity]
trainSplit = 0.2; % percentage of data to use for training
validSplit = 0.2; % percentage of data to use for validation
testSplit = 0.6; % percentage of data to use for testing
inputNoise = true; % false = use mag errors as additional inputs, true = use mag errors as additional input noise
csl_method = 'balanced'; % cost-sensitive learning option [default='normal']:
%   'balanced':   weigh rare samples more heavily during training
%   'normalized': assign an error cost of 1/(z+1) to each sample
%   'normal':     no weights assigned, all samples are equally important
binWidth = 0.1; % width of the bin for 'balanced' cost-sensitive learning [default=range(output)/100]
%%%%%%%%%%%%%% Prepare data %%%%%%%%%%%%%%
%new data set
n = 500; % Sample size
Mdl = regARIMA('MA',{1.4,0.8},'AR',0.5,'Intercept',3,...
'Variance',1,'Beta',[2;-1.5],'D',1);
rng(1); % For reproducibility
X = randn(n,2);
Y = simulate(Mdl,n,'X',X);
figure;
plot(Y);
title 'Simulated Responses';
axis tight;
[n,d] = size(X);
filters = d/2;
% you can also select the size of each sample
% [training,validation,testing] = sample(n,10000,10000,10000);
% get the weights for cost-sensitive learning
omega = getOmega(Y,csl_method,binWidth);
if(inputNoise)
    % treat the mag-errors as input noise variance
    Psi = X(:,filters+1:end).^2;
    X(:,filters+1:end) = [];
else
    % treat the mag-errors as additional inputs
    X(:,filters+1:end) = log(X(:,filters+1:end));
    Psi = [];
end
% split data into training, validation and testing
[training,validation,testing] = sample(n,trainSplit,validSplit,testSplit);
%%%%%%%%%%%%%% Fit the model %%%%%%%%%%%%%%
% initialize the model
model = init(X,Y,method,m,'omega',omega,'training',training,'heteroscedastic',heteroscedastic,'normalize',normalize,'Psi',Psi);
% train the model
model = train(model,X,Y,'omega',omega,'training',training,'validation',validation,'maxIter',maxIter,'maxAttempts',maxAttempts,'Psi',Psi);
%%%%%%%%%%%%%% Compute Metrics %%%%%%%%%%%%%%
% use the model to generate predictions for the test set
[mu,sigma,nu,beta_i,gamma] = predict(X,model,'Psi',Psi,'selection',testing);
% mu = the best point estimate
% nu = variance due to data density
% beta_i = variance due to output noise
% gamma = variance due to input noise
% sigma = nu+beta_i+gamma
% compute metrics
errors=Y(testing)-mu;
hist(errors)
plot(mu)
hold on
plot(Y(testing))
% root mean squared error, i.e. sqrt(mean(errors.^2))
rmse = sqrt(metrics(Y(testing),mu,sigma,@(y,mu,sigma) (y-mu).^2));
% mean log likelihood: mean(-0.5*errors.^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi))
mll = metrics(Y(testing),mu,sigma,@(y,mu,sigma) -0.5*(y-mu).^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi));
% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.15
fr15 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.15));
% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.05
fr05 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.05));
% bias, i.e. mean(errors)
bias = metrics(Y(testing),mu,sigma,@(y,mu,sigma) y-mu);
% print metrics for the entire data
fprintf('RMSE\t\tMLL\t\tFR15\t\tFR05\t\tBIAS\n')
fprintf('%f\t%f\t%f\t%f\t%f\n',rmse(end),mll(end),fr15(end),fr05(end),bias(end))
%%%%%%%%%%%%%% Display Results %%%%%%%%%%%%%%%%
% reduce the sample for efficient plotting
[x,y,color,counts]=reduce(Y(testing),mu,sigma,200);
figure;scatter(x,y,12,log(color),'s','filled');title('Uncertainty');xlabel('Spectroscopic Redshift');ylabel('Photometric Redshift');colormap jet;
figure;scatter(x,y,12,log(counts),'s','filled');title('Density');xlabel('Spectroscopic Redshift');ylabel('Photometric Redshift');colormap jet;
% plot the change in metrics as functions of data percentage
x = [1 5:5:100];
ind = round(x*length(rmse)/100);
figure;plot(x,rmse(ind),'o-');xlabel('Percentage of Data');ylabel('RMSE');
figure;plot(x,mll(ind),'o-');xlabel('Percentage of Data');ylabel('MLL');
figure;plot(x,fr05(ind),'o-');xlabel('Percentage of Data');ylabel('FR05');
figure;plot(x,fr15(ind),'o-');xlabel('Percentage of Data');ylabel('FR15');
figure;plot(x,bias(ind),'o-');xlabel('Percentage of Data');ylabel('BIAS');
% plot mean and standard deviation of different scores as functions of spectroscopic redshift using 20 bins
[centers,means,stds] = bin(Y(testing),Y(testing)-mu,20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Bias');
[centers,means,stds] = bin(Y(testing),sqrt(nu),20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Model Uncertainty');
[centers,means,stds] = bin(Y(testing),sqrt(beta_i),20);
figure;errorbar(centers,means,stds,'s');xlabel('Spectroscopic Redshift');ylabel('Noise Uncertainty');
Hi,
the total variance is computed in the code as follows:
sigma = nu+beta_i+gamma;
Do these variables correspond to the ones explained in the paper, i.e. nu as nu and beta_i as beta^-1?
If that is the case, what is gamma? In the paper the variance is computed as nu + beta^-1 only; there is no gamma variable.
Thank you so much!
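For reference, the demo script's own comments spell out the decomposition: nu is the variance due to data density, beta_i the variance due to output noise (the paper's beta^-1), and gamma the variance due to input noise. A minimal sketch of that sum, assuming these are per-point arrays (this is a restatement of the script's comments, not an excerpt of GPz internals):

```python
import numpy as np

def total_variance(nu, beta_i, gamma=None):
    """Total predictive variance as the sum of three per-point components.

    nu:     variance due to data density
    beta_i: variance due to output noise (the paper's beta^-1)
    gamma:  variance propagated from input noise Psi; zero when no
            input noise is supplied
    """
    nu = np.asarray(nu, dtype=float)
    beta_i = np.asarray(beta_i, dtype=float)
    if gamma is None:
        gamma = np.zeros_like(nu)
    return nu + beta_i + np.asarray(gamma, dtype=float)
```

When no input noise `Psi` is supplied, the gamma term vanishes and the expression reduces to the paper's nu + beta^-1.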
Hi,
Thank you for sharing your work! I am familiar with basic Gaussian process regression, where the length scale can be adjusted according to the data. However, I have not been able to find where the length scale for each parameter is stored in the model. I understand that this is a sparse GP, so there is a parameterization in some sense; I believe that parameterization was derived from a particular covariance function with length scales. Can I modify the length scales before the parameterization is done?
I am just starting to use these methods, so any clarification would be highly appreciated. Thank you!
Great library - I am very interested in your heteroscedastic GPs, though I am also new to the methods.
I have looked at your thesis and tried running your demo code, "demo_sinc.m," and it looks like, while you are modeling variance very well with your BFMs on data with heteroscedastic noise, you are not able to generate samples from the learned distributions that reflect this variance. Here is a zoom-in of a section of the figure generated from "demo_sinc.m." Notice that the colored lines in this segment are all very close to the mean, despite much of the data being far away from the mean here.
Is this a feature or bug of the method and/or the implementation?
Many thanks in advance for the advice here.
Best,
Stephen