bkelly-lab / ipca
Instrumented Principal Components Analysis
License: MIT License
Matrix is singular to machine precision.
Could you please tell me the possible reasons for this error? I know the problem is in my own code, but I really can't find the cause. Thank you very much.
Hi guys,
Thanks for your kind sharing of this awesome package. However, when I checked the test_ipca.py file, it seems the tests only check whether the functions run, not whether their output is correct.
I tried to generate Gamma and Factors myself and then compared them with the Gamma solved by this package, and the two matrices are not identical. Is there any method for me to check the correctness of this package? Looking forward to your reply.
Best.
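A hedged sanity-check sketch (not taken from the package's test suite): since Gamma and the factors are only identified up to an invertible rotation, element-wise comparison will generally fail; simulate from the IPCA model and compare column spaces instead.

```python
import numpy as np
import pandas as pd
from ipca import InstrumentedPCA

rng = np.random.default_rng(0)
N, T, L, K = 200, 50, 10, 3    # assets, periods, characteristics, factors
Gamma_true = rng.standard_normal((L, K))
F_true = rng.standard_normal((K, T))

rows, X_rows, y_rows = [], [], []
for t in range(T):
    Z = rng.standard_normal((N, L))                            # characteristics at t
    r = Z @ Gamma_true @ F_true[:, t] + 0.1 * rng.standard_normal(N)
    for i in range(N):
        rows.append((i, t))
        X_rows.append(Z[i])
        y_rows.append(r[i])

idx = pd.MultiIndex.from_tuples(rows, names=["id", "date"])
X = pd.DataFrame(X_rows, index=idx)
y = pd.Series(y_rows, index=idx)

regr = InstrumentedPCA(n_factors=K, intercept=False).fit(X=X, y=y)
Gamma_hat, _ = regr.get_factors(label_ind=True)

# Compare column spaces via principal angles: if the spaces coincide, all
# singular values of Q_true' Q_hat are close to 1.
Q_true, _ = np.linalg.qr(Gamma_true)
Q_hat, _ = np.linalg.qr(np.asarray(Gamma_hat, dtype=float))
print(np.linalg.svd(Q_true.T @ Q_hat, compute_uv=False))
```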
I wanted to create a separate issue from #3 to discuss some potential further changes:

1. In order to fully align with sklearn we need to distinguish the characteristics in `X` from the indices (time periods/assets).
   1.a Maybe the most natural way is to add an input called something like `indices` or `groups`. Then `X` would just be the chars, `y` would just be the returns, and `indices` would hold the time/asset labels. What I'm imagining here then is: `fit(X=None, y=None, indices=None, ...)`
   1.b This might provide a good route to direct integration with pandas. Our `indices` would align with the MultiIndex in pandas and we could have a method to break things out properly when given a pandas `X` and `y` with a MultiIndex (see the sketch after this list).
   1.c sklearn also has a series of "multioutput" methods which somewhat align with what we're doing here. I need to read into this more but this might be one way to go.
2. Ultimately, we should add an `IPCARegressorCV` class instead of the current `fit_path` method. I'll take care of this once we handle the indices issue. This is also where I can add the "hot-start" approach for cross-validation (which should be faster on most machines).
3. It might make more sense to name the main class something like `InstrumentedPCA` instead of `IPCARegressor`. Two reasons for this:
   3.a This aligns with how other packages do it (e.g. IncrementalPCA) and allows us to better distinguish which IPCA we're working with here.
   3.b `Regressor` doesn't seem to add much information to the name.
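To make 1.b concrete, here's a hypothetical sketch of how a MultiIndexed panel could be broken out into an `indices` input (the helper name `split_panel` is illustrative, not an existing function):

```python
import numpy as np
import pandas as pd

def split_panel(X: pd.DataFrame, y: pd.Series):
    """Split a MultiIndexed panel into plain data arrays plus an indices array."""
    if not isinstance(X.index, pd.MultiIndex):
        raise ValueError("X must carry an (asset, time) MultiIndex")
    indices = np.asarray(list(X.index), dtype=object)  # (n_obs, 2) asset/time labels
    return X.to_numpy(), y.to_numpy(), indices

# fit(X=chars, y=returns, indices=indices) per the proposal in 1.a
```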
I'm not sure if this is pandas related or if the statsmodels example dataset changed, but running the example throws the following error:
```
     29 from ipca import IPCARegressor
     30 regr = IPCARegressor(n_factors=1, intercept=False)
---> 31 regr = regr.fit(X=X, y=y)
     32 Gamma, Factors = regr.get_factors(label_ind=True)

~/anaconda3/lib/python3.7/site-packages/ipca/ipca.py in fit(self, X, y, PSF, Gamma, Factors, data_type, **kwargs)
    170
    171 # init data dimensions
--> 172 self = self._init_dimensions(X)
    173
    174 # Handle pre-specified factors

~/anaconda3/lib/python3.7/site-packages/ipca/ipca.py in _init_dimensions(self, X)
    965 """
    966
--> 967 self.dates = np.unique(X[:, 1])
    968 self.ids = np.unique(X[:, 0])
    969 self.T = np.size(self.dates, axis=0)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978 if self.columns.nlevels > 1:
   2979     return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
   2981 if is_integer(indexer):
   2982     indexer = [indexer]

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2895 )
   2896 try:
-> 2897     return self._engine.get_loc(key)
   2898 except KeyError:
   2899     return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 1)' is an invalid key
```
Specifically, `X[:, 1]` throws the error.
My env:
Python 3.7.4
pandas 0.25.1
statsmodels 0.10.1
numpy 1.17.2
The line(s) of `d_temp = np.random.standard_t(5)` should be divided by sqrt(5/(5-2)) ≈ 1.291 to have unit variance. For example, `np.std(np.random.standard_t(5, 10**6))` is about 1.2910 instead of 1.
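A minimal sketch of the proposed rescaling (the variance of a t-distribution with ν degrees of freedom is ν/(ν−2)):

```python
import numpy as np

# t(nu) draws have variance nu / (nu - 2); dividing by its square root
# standardizes them to unit variance.
nu = 5
d_temp = np.random.standard_t(nu, size=10**6) / np.sqrt(nu / (nu - 2))
print(np.std(d_temp))  # ~1.0 instead of ~1.291
```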
I noticed the package uses Numba here to accelerate matrix calculations. Since `Q` and `W` could potentially be large matrices (consider when we have 500+ characteristics and 500+ points in time), has using CUDA been considered, to enable GPU acceleration?
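For reference, a hypothetical sketch of what a GPU-backed solve could look like using CuPy; this is not part of the package, and the argument names (`Denom`, `Numer`) only loosely mirror the internals:

```python
import cupy as cp  # requires a CUDA-capable GPU and a matching CUDA toolkit

def gpu_solve(Denom, Numer):
    """Solve Denom @ x = Numer on the GPU and return a NumPy array."""
    d = cp.asarray(Denom)   # host -> device copy
    n = cp.asarray(Numer)
    x = cp.linalg.solve(d, n)
    return cp.asnumpy(x)    # device -> host copy
```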
@matbuechner, I'm doing some work with @AllenHu95 using the IPCA code here and had a couple of questions I was hoping you could help with, as I'm fiddling around with the backend. In particular, we're doing some work with the bootstrap testing procedure and are hoping to improve the runtime somewhat.
Do you have a good sense of whether 10e-6 is the "right" threshold for convergence? Since we have to refit IPCA for each bootstrap sample, I'm considering using a less conservative threshold but wanted to check whether you foresee any issues there.
Would you be open to a PR that allows users to modify the fit params at run-time (as opposed to using the IPCA class values)? What I'm imagining here is adding some additional keywords to `_fit_ipca` for `iter_tol` and `max_iter` (which could be passed through the bootstrap tests). In cases where these keywords weren't provided we'd default to the class values (see the sketch below).
My use-case here would be a more conservative threshold for the main IPCA run, and then a lighter one for the bootstrap sample runs. Obviously, there are work-arounds (could set the class value after fitting the main run), which might be preferable, so I wanted to get your input.
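A hypothetical sketch of that keyword pass-through (signature details are illustrative, not the package's actual code):

```python
def _fit_ipca(self, X, y, indices, iter_tol=None, max_iter=None, **kwargs):
    # Per-call overrides, falling back to the values set on the class.
    iter_tol = self.iter_tol if iter_tol is None else iter_tol
    max_iter = self.max_iter if max_iter is None else max_iter
    ...  # ALS loop uses the resolved iter_tol / max_iter
```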
It looks like most of the work is done in the alternating least squares component. Off-hand, do you think there are any opportunities for performance improvement there?
I'll note that when we run each bootstrap iteration sequentially the run-time per iteration is better (about 5 minutes) vs. in parallel (< 1.2 hours). I'm guessing this is because something clever that numpy/numba/etc. are doing gets screwed up when joblib is layered on top, but I'm not familiar with all the details. Let me know if you see an obvious solution based on this result.
Thanks, and let me know if there are any other details I can provide.
-Leland
Orthogonalization is not adapted for the case with PSF yet.
Line 1044 in 57c0488
Thanks to Steffen Windmüller for pointing this out.
Thank you for providing this package.
When applying the `fit()` function I run into the following error: `LinAlgError: Matrix is singular to machine precision.` (I've attached the full stack trace below.)
I'm working on a large X matrix with shape `(344753, 77)`. Both the X DataFrame and the y Series have a MultiIndex consisting of a datetime64[ns] and an int64.
X matrix (head of DataFrame):
```
                       yy       ret       prc      size  q10  q20  q30  q40  q50  \
firm  time
19940 1963-07-31     -1.0 -1.000000  1.000000  1.000000 -1.0 -1.0 -1.0 -1.0 -1.0
25160 1963-07-31     -1.0 -0.578433  0.487544 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0
25478 1963-07-31     -1.0  1.000000 -1.000000 -0.936839 -1.0 -1.0 -1.0 -1.0 -1.0
19940 1963-08-31     -1.0 -1.000000  1.000000  1.000000 -1.0 -1.0 -1.0 -1.0 -1.0
25160 1963-08-31     -1.0  1.000000  0.815603 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0

                    q60  ...  idio_vol  total_vol  std_volume  std_turn  \
firm  time               ...
19940 1963-07-31   -1.0  ... -1.000000  -1.000000   -1.000000 -1.000000
25160 1963-07-31   -1.0  ...  1.000000   1.000000   -0.742541  1.000000
25478 1963-07-31   -1.0  ...  0.922743   0.875899    1.000000  0.396275
19940 1963-08-31   -1.0  ... -1.000000  -1.000000   -0.916715 -1.000000
25160 1963-08-31   -1.0  ...  0.165515   0.626392   -1.000000  0.115954

                   lme_adj  beme_adj   pm_adj    at_adj  mm_sin  mm_cos
firm  time
19940 1963-07-31  1.000000 -1.000000  0.29254  1.000000    -1.0    -1.0
25160 1963-07-31 -1.000000  1.000000 -1.00000 -1.000000    -1.0    -1.0
25478 1963-07-31 -0.319608 -0.638467  1.00000  0.438206    -1.0    -1.0
19940 1963-08-31  1.000000 -1.000000  0.29254  1.000000    -1.0    -1.0
25160 1963-08-31 -1.000000  1.000000 -1.00000 -1.000000    -1.0    -1.0

[5 rows x 77 columns]
```
y matrix (head of Series):
```
firm   date
19940  1963-07-31   -1.000000
25160  1963-07-31    1.000000
25478  1963-07-31   -0.809350
19940  1963-08-31    1.000000
25160  1963-08-31    0.193671
Name: TARGET, dtype: float64
```
I use the following code to perform the IPCA:
```python
ipca = InstrumentedPCA(n_factors=4, max_iter=20000)
ipca = ipca.fit(X=X_ipca, y=y_ipca)
gamma, factors = ipca.get_factors(label_ind=True)
factors.head()
```
The shapes are then recognized correctly:
```
The panel dimensions are:
n_samples: 3262, L: 77, T: 270
```
Afterwards I get the `LinAlgError: Matrix is singular to machine precision.`
I already tried to increase `max_iter` without any luck. Is there an explanation for this behaviour? Are there ways to circumvent it by applying more pre-processing or additional args?
Please let me know if you need more information. Thank you.
Here is the full stack trace:
```
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-110-830cb2dc72be> in <module>
      2 ipca = InstrumentedPCA(n_factors=4, max_iter=20000)
      3
----> 4 ipca = ipca.fit(X=X_ipca,y=y_ipca)
      5 gamma, factors = ipca.get_factors(label_ind=True)
      6

~\anaconda3\lib\site-packages\ipca\ipca.py in fit(self, X, y, indices, PSF, Gamma, Factors, data_type, label_ind, **kwargs)
    219
    220 # Run IPCA
--> 221 Gamma, Factors = self._fit_ipca(X=X, y=y, indices=indices, Q=Q,
    222                                 W=W, val_obs=val_obs, PSF=PSF,
    223                                 Gamma=Gamma, Factors=Factors,

~\anaconda3\lib\site-packages\ipca\ipca.py in _fit_ipca(self, X, y, indices, PSF, Q, W, val_obs, Gamma, Factors, quiet, data_type, **kwargs)
   1011 while((iter <= self.max_iter) and (tol_current > self.iter_tol)):
   1012
-> 1013     Gamma_New, Factor_New = ALS_fit(Gamma_Old, *ALS_inputs,
   1014                                     PSF=PSF, **kwargs)
   1015

~\anaconda3\lib\site-packages\ipca\ipca.py in _ALS_fit_portfolio(self, Gamma_Old, Q, W, val_obs, PSF, **kwargs)
   1097
   1098 # ALS Step 2
-> 1099 Gamma_New = _Gamma_fit_portfolio(F_New, Q, W, val_obs, PSF, L, K,
   1100                                  Ktilde, T)
   1101

~\anaconda3\lib\site-packages\ipca\ipca.py in _Gamma_fit_portfolio(F_New, Q, W, val_obs, PSF, L, K, Ktilde, T)
   1518                 * val_obs[t]
   1519
-> 1520 Gamma_New = _numba_solve(Denom, Numer).reshape((L, Ktilde))
   1521
   1522 return Gamma_New

~\anaconda3\lib\site-packages\numba\np\linalg.py in _inv_err_handler()
    822     assert 0 # unreachable
    823 if r > 0:
--> 824     raise np.linalg.LinAlgError(
    825         "Matrix is singular to machine precision.")
    826

LinAlgError: Matrix is singular to machine precision.
```
I encountered these two issues when trying to apply the IPCA model to stock market index options, following the pattern in Büchner and Kelly (2022). To be specific, I construct 15 features as in the paper. However, when fitting the IPCA model, it never converges when the number of factors is larger than 1; in fact, the aggregate update in every step tends to go up and explode in these multi-factor cases.
For reference, the last eight columns of my feature dataset are Greeks interacted with a 0-1 variable indicating put options, and the features are re-scaled into [-0.5, 0.5].
Next, as I was fitting the IPCA model with characteristic-managed portfolios, not only did it not converge, but an error was also raised: `LinAlgError: Matrix is singular to machine precision.`
Looking into the source code where this error is raised, I found that a singular matrix was passed into the function `_Ft_fit_portfolio` during the ALS process.
```
Traceback (most recent call last):
  File "C:\code.py", line 345, in <module>
    analyzer.run()
  File "C:\code.py", line 298, in run
    rgsr = self.IPCA(K, False, 'portfolio', quiet=False)
  File "C:\code.py", line 235, in IPCA
    rgsr = rgsr.fit(X=self.features, y=self.label, data_type=datatype, **kwargs)
  File "D:\ana\lib\site-packages\ipca\ipca.py", line 221, in fit
    Gamma, Factors = self._fit_ipca(X=X, y=y, indices=indices, Q=Q,
  File "D:\ana\lib\site-packages\ipca\ipca.py", line 1013, in _fit_ipca
    Gamma_New, Factor_New = ALS_fit(Gamma_Old, *ALS_inputs,
  File "D:\ana\lib\site-packages\ipca\ipca.py", line 1074, in _ALS_fit_portfolio
    F_New[:,t] = _Ft_fit_portfolio(Gamma_Old, W[:,:,t],
  File "D:\ana\lib\site-packages\ipca\ipca.py", line 1449, in _Ft_fit_portfolio
    return np.squeeze(_numba_solve(m1, m2.reshape((-1, 1))))
  File "D:\ana\lib\site-packages\numba\np\linalg.py", line 899, in _inv_err_handler
    raise np.linalg.LinAlgError(
LinAlgError: Matrix is singular to machine precision.
```
Combining these two issues, I think something is wrong numerically (or statistically) in my feature data, but I have no clue where exactly it goes wrong or how to fix the problem so my code works. Additionally, I have seen an issue about the same LinAlgError raised before; however, I guess that was a different situation from mine, since there are no columns in my data with all entries equal to 0.
Please tell me how I can fix this issue; any hint or suggestion is welcome as well.
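As a starting point, here's a diagnostic sketch (not from the package) that checks each period's characteristic matrix for rank deficiency or bad conditioning; interacted Greeks can easily be collinear even when no column is identically 0:

```python
import numpy as np

def check_conditioning(X, cond_limit=1e12):
    """Flag periods whose characteristic matrix is rank-deficient or ill-conditioned.

    X: (firm, time) MultiIndexed DataFrame of characteristics.
    """
    for t, Z in X.groupby(level=1):
        z = Z.to_numpy()
        rank = np.linalg.matrix_rank(z)
        cond = np.linalg.cond(z.T @ z)
        if rank < z.shape[1] or cond > cond_limit:
            print(f"{t}: n={z.shape[0]}, rank={rank}/{z.shape[1]}, cond={cond:.2e}")
```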
Hi author, I tried to do out-of-sample prediction for portfolios but realized that there is no `data_type` parameter in the `predictOOS` function, so I tried to run `Ypred = regr.predict(X=data_OOS, data_type='portfolio', mean_factor=True)`. In the source code the `predict` function does have a `data_type` parameter, but if you run the above code it enters the `elif` branch where `X` is not None and then runs `pred = self.predict_portfolio(W, L, T, mean_factor)`, yet the values of `L` and `T` are never set in that branch. This makes it impossible to predict out-of-sample portfolios. My workaround is to compute the values of `W`, `L` and `T` myself and then call the `predict_portfolio` function directly.
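A hedged sketch of that workaround, assuming `W` follows the usual IPCA characteristic-managed-portfolio construction W_t = Z_t'Z_t / N_t (the package's internal definition may differ, so treat this as illustrative):

```python
import numpy as np

def build_portfolio_inputs(X):
    """X: (id, time) MultiIndexed DataFrame of characteristics.

    Returns W (L x L x T), L, and T for predict_portfolio.
    """
    dates = X.index.get_level_values(1).unique()
    L, T = X.shape[1], len(dates)
    W = np.zeros((L, L, T))
    for t, d in enumerate(dates):
        Z = X.xs(d, level=1).to_numpy()
        W[:, :, t] = Z.T @ Z / Z.shape[0]   # characteristic second moments
    return W, L, T
```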
Hi,
First, thank you for providing this great package!
I have been playing around with the example code and noticed that the length of the array returned by `predictOOS` does not match the length of the data when setting `mean_factor` to False.
One can use the Grunfeld data to see this:
```python
regr = InstrumentedPCA(n_factors=1, intercept=True)
regr = regr.fit(X=data_IS, y=y_IS)
Ypred1 = regr.predictOOS(X=data_OOS, y=y_OOS, mean_factor=True)
Ypred2 = regr.predictOOS(X=data_OOS, y=y_OOS, mean_factor=False)
print(y_OOS, "\n")
print(Ypred1, "\n")
print(Ypred2, "\n")
```
This is what the output looks like:
```
firm  year
11    1954    1486.700
14    1954     459.300
10    1954     189.600
8     1954     172.490
7     1954      81.430
13    1954     135.720
15    1954      89.510
16    1954      68.600
12    1954      49.340
9     1954       5.120
6     1954       6.281
Name: invest, dtype: float64

[780.89716506 291.46947239 380.59479685 101.20025869  65.84867622
 126.53074081  36.65394197 160.14977461  72.50646239   7.91635846
   8.04336863]

[1244.4912375]
```
Apparently the method only returns a prediction for the first entity-time pair.
Hello, I tried to feed in a bunch of Barra factors but I bumped into the following problems:
I wonder whether you have some intuition as to why? My guess is that the Barra factors are pre-processed to be orthogonal to each other, and somehow the model then cannot support a larger/smaller number of latent factors?
Hello,
I don't know if this is the right place for my question. I'm currently writing my master's thesis about IPCA and I want to test whether the assumption that the matrix Gamma is constant over time holds. Does anyone have a clue how this could be tested? Is it reasonable to compare the model with a constant Gamma to a model with a time-varying Gamma? Or would you say that there are only theoretical arguments for and against this assumption?
Thank you in advance.
Simon
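One rough empirical check (a sketch, not a formal test): fit the model on two subperiods and compare the estimated Gammas. Since Gamma is only identified up to a rotation, compare column spaces via principal angles rather than entries:

```python
import numpy as np

def gamma_subspace_distance(G1, G2):
    """Sine of the largest principal angle between the column spaces of G1, G2."""
    Q1, _ = np.linalg.qr(np.asarray(G1, dtype=float))
    Q2, _ = np.linalg.qr(np.asarray(G2, dtype=float))
    s = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    return np.sqrt(1.0 - s.min() ** 2)   # 0 if the spaces coincide

# Usage: fit InstrumentedPCA separately on the first and second half of the
# sample, extract each Gamma via get_factors(label_ind=True), then compare.
```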
Hi,
Thank you for writing a great package!
I'm fitting a model with only PSFs (e.g. 4 factors) and an intercept. The returned `Factors` has 6, including the original 4 plus another 2 constant ones, but `Gamma` correctly gives 5. Do you know why this is the case? Is this case currently handled here?
Hi Bryan,
Thank you for your reply! It did help me a lot to dive deep into this code. I think there is a typo in ipca.py regarding the parameters `n_jobs` and `backend`: in line 1082 of the ipca.py file, I think it would be better to use `n_jobs=self.n_jobs` and `backend=self.backend`.
Besides, I wonder whether this ALS method's estimate is a global minimum of the objective function. If not, how should we convince ourselves that initializing Gamma with an SVD decomposition is a good method in some specific contexts?
Best,
Kendrick
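One hypothetical way to probe for local minima: refit from several random initial Gammas and compare the converged estimates. This assumes the `Gamma` keyword of `fit` (visible in the tracebacks above) accepts an initial value; if all converged Gammas span the same column space, the SVD initialization looks safe in that context.

```python
import numpy as np
from ipca import InstrumentedPCA

def multistart_gammas(X, y, L, K, n_starts=5, seed=0):
    """Fit from several random initial Gammas and return the estimates.

    L: number of characteristics, K: number of factors.
    """
    rng = np.random.default_rng(seed)
    gammas = []
    for _ in range(n_starts):
        Gamma0 = rng.standard_normal((L, K))
        regr = InstrumentedPCA(n_factors=K).fit(X=X, y=y, Gamma=Gamma0)
        gammas.append(regr.get_factors(label_ind=True)[0])
    return gammas  # compare, e.g., via principal angles between column spaces
```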