A qc.matrix design proposal about tabmat HOT 3 CLOSED

quantco commented on May 29, 2024

A qc.matrix design proposal

from tabmat.

Comments (3)

MarcAntoineSchmidtQC commented on May 29, 2024

My thoughts (mostly the same as when we first talked about this @tbenthompson):

Overall, I agree with you that we should limit the API in the short-term.
However, I'm a bit sad about this. I got really excited about quantcore.matrix recently. When I was trying to write a function that would accept both a dense NumPy array and a sparse SciPy matrix, I ran into a hell of incompatibilities. I believe that quantcore.matrix has the potential to become the go-to data storage for modeling. But we are a very long way from there.
I'm not convinced that we need to go as far as removing Dense-, Sparse-, and CategoricalMatrix classes from the public-facing API in order to streamline the release. A couple of points regarding this:
- If we want to implement something with SplitMatrix, it needs to be implemented with the other underlying classes. Currently, do we need to work on something with those underlying classes? I thought most of the work was with SplitMatrix anyway.
- For a data scientist with limited knowledge of things like AVX instructions or multiprocessing, CategoricalMatrix is a super nice feature that is easy to understand. Having a public-facing API makes it more salient.
- Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix? If so, we should make sure that we are not much slower compared to a direct DenseMatrix or SparseMatrix.
Having a broader goal allows us to gradually improve it. I feel like qc.matrix is a good way for interns or new employees to have a 1-week task to learn about Git, the review process, scientific computing, and modeling.

Overall:

- We should drop the idea of supporting basic arithmetic operations. Let's keep the classes like the Dataset class of LightGBM (good analogy btw).
- @tbenthompson, can you explain in more detail what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.


tbenthompson commented on May 29, 2024

Thanks Marc!!

First, an alternate proposal: don't change anything but make it clear which methods are "supported" and which are simply accidentally inherited from their parent classes (np.ndarray and scipy.sparse.csr_matrix). I would be pretty happy with this option! In some ways, I prefer it.

This alternate proposal is also the least amount of work.

On to the original proposal.

> @tbenthompson, can you explain in more detail what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.

My basic logic is that this will:

- remove API inconsistencies.
- be easier to maintain/understand because the API is smaller.

I'll expand on both these points.

Removing API inconsistencies
Currently, the DenseMatrix inherits from np.ndarray and, as a result, has the entire API of a numpy array. The SparseMatrix inherits from scipy.sparse.csr_matrix and has the entire API of a scipy.sparse matrix. The Categorical, Split and Standardized Matrix classes are written from scratch and have a much smaller API. Having Split and Standardized as the only user-facing classes would make having a consistent API easy.
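To make the inconsistency concrete, here is a minimal sketch. `ToyDenseMatrix` and `ToyCategoricalMatrix` are hypothetical stand-ins, not the actual classes from the package; the only point is that a class subclassing `np.ndarray` exposes the whole NumPy array API, while a class written from scratch exposes only what it defines.

```python
import numpy as np


class ToyDenseMatrix(np.ndarray):
    """Hypothetical stand-in for a dense matrix class that subclasses np.ndarray."""

    def __new__(cls, input_array):
        # Standard NumPy subclassing pattern: view the data as the subclass.
        return np.asarray(input_array, dtype=float).view(cls)

    def matvec(self, vec):
        # Drop back to a plain ndarray before multiplying.
        return np.asarray(self) @ np.asarray(vec)


class ToyCategoricalMatrix:
    """Hypothetical from-scratch class: its API is only what is written here."""

    def __init__(self, codes, n_categories):
        self.codes = np.asarray(codes)
        self.n_categories = n_categories

    def matvec(self, vec):
        # Row i of the implied one-hot matrix selects vec[codes[i]].
        return np.asarray(vec)[self.codes]


def public_names(obj):
    """Return the set of attribute names that do not start with an underscore."""
    return {name for name in dir(obj) if not name.startswith("_")}


dense = ToyDenseMatrix([[1.0, 2.0], [3.0, 4.0]])
cat = ToyCategoricalMatrix([0, 1, 1], n_categories=2)

# Names a user can call on the dense toy class but not on the categorical one.
inherited_only = public_names(dense) - public_names(cat)
print(len(public_names(dense)), len(public_names(cat)))
print(sorted(inherited_only)[:5])
```

Both toy classes share `matvec`, but everything else on the dense one (`cumsum`, `reshape`, and so on) comes along by inheritance rather than by design.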

Smaller API
As a general rule, smaller APIs are easier to understand and maintain. Many costs grow in proportion to the number of methods provided: backwards compatibility, testing, and error/input checking. The last one is a concrete example here. For any of our "user"-facing methods, we need to check things like input types and input shapes. See, for example, L124-L134 in categorical_matrix.py. Currently, even if we reduce to just supporting matvec/transpose_matvec/sandwich/standardize, we will have 20 methods (those four across five matrix classes) where we need to do that error and input checking. By reducing the user-facing surface to just one or two classes, we can feel safe about the correctness of inputs in many more parts of the code.
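As a rough illustration of the input-checking argument, the sketch below validates once in a single user-facing class and lets the inner blocks skip their own checks. `check_matvec_input` and `ToyFacade` are hypothetical names invented for this example; they are not the package's actual code.

```python
import numpy as np


def check_matvec_input(vec, n_cols):
    """Hypothetical validation helper: coerce to an array and check its shape."""
    vec = np.asarray(vec, dtype=float)
    if vec.ndim not in (1, 2):
        raise ValueError(f"vec must be 1- or 2-dimensional, got ndim={vec.ndim}")
    if vec.shape[0] != n_cols:
        raise ValueError(
            f"vec has {vec.shape[0]} rows but the matrix has {n_cols} columns"
        )
    return vec


class ToyFacade:
    """Hypothetical single user-facing class: validate once, then delegate."""

    def __init__(self, blocks):
        # Each block is a 2-D array; blocks are concatenated column-wise.
        self.blocks = [np.asarray(b, dtype=float) for b in blocks]
        self.n_cols = sum(b.shape[1] for b in self.blocks)

    def matvec(self, vec):
        # The only place where user input is checked.
        vec = check_matvec_input(vec, self.n_cols)
        out, start = 0.0, 0
        for block in self.blocks:
            stop = start + block.shape[1]
            # Inner products trust that vec was already validated.
            out = out + block @ vec[start:stop]
            start = stop
        return out


facade = ToyFacade([[[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]]])
print(facade.matvec([1.0, 1.0, 1.0]))
```

With this layout, the per-block code paths never see unvalidated input, which is the safety property the smaller API is meant to buy.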

You also asked:

> Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix?

Yes, that or a StandardizedMatrix. The overhead is very low. I just measured it and it seems to be on the order of 100 microseconds. We could probably make that smaller, but it's very small already.
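For anyone who wants to check this kind of overhead themselves, here is a sketch of one way to time it. It does not reproduce the measurement above: `ToyWrapper` is a hypothetical stand-in for a container class, and the array sizes are arbitrary assumptions. It only compares a direct NumPy product against the same product routed through a thin Python wrapper.

```python
import timeit

import numpy as np


class ToyWrapper:
    """Hypothetical thin wrapper standing in for a container around a dense block."""

    def __init__(self, array):
        self.array = np.asarray(array, dtype=float)

    def matvec(self, vec):
        # Extra Python-level work (conversion and shape check) before the product.
        vec = np.asarray(vec, dtype=float)
        if vec.shape[0] != self.array.shape[1]:
            raise ValueError("vec has the wrong length")
        return self.array @ vec


def seconds_per_call(func, number=2000, repeat=5):
    # Take the minimum over repeats to reduce noise from other processes.
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))  # arbitrary size chosen for illustration
v = rng.standard_normal(20)
wrapped = ToyWrapper(X)

direct = seconds_per_call(lambda: X @ v)
via_wrapper = seconds_per_call(lambda: wrapped.matvec(v))
print(f"direct:     {direct * 1e6:.2f} us per call")
print(f"wrapper:    {via_wrapper * 1e6:.2f} us per call")
print(f"difference: {(via_wrapper - direct) * 1e6:.2f} us per call")
```

The difference is a fixed per-call cost, so it matters less as the matrix grows and the product itself dominates.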


lbittarello commented on May 29, 2024

I would rather ditch inherited methods than preserve an inconsistent API. Nevertheless, I think it's reasonable at this point to just clarify which methods are officially supported (i.e. changes will first prompt a deprecation warning and imply a major update) and which aren't (i.e. changes may occur without warning).
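A sketch of what "changes will first prompt a deprecation warning" could look like for a supported method that gets renamed. `ToyMatrix` and the old name `dot` are hypothetical and chosen only for illustration; the one real API used is `warnings.warn` from the standard library.

```python
import warnings


class ToyMatrix:
    """Hypothetical class with one supported method and one deprecated alias."""

    def __init__(self, rows):
        self.rows = [[float(x) for x in row] for row in rows]

    def matvec(self, vec):
        # Supported method: plain row-by-row dot products.
        return [sum(a * b for a, b in zip(row, vec)) for row in self.rows]

    def dot(self, vec):
        # Hypothetical old name, kept for a release cycle before removal.
        warnings.warn(
            "dot is deprecated; use matvec instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this method
        )
        return self.matvec(vec)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ToyMatrix([[1, 2], [3, 4]]).dot([1, 1])

print(result)
print([str(w.message) for w in caught])
```

Methods outside the supported set would simply carry no such promise and could change between releases without this step.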

I wouldn't mind if the SplitMatrix became the only user-facing class in the package. I'd also love it if we gradually augmented it till it took over NumPy and SciPy. 😬
