A qc.matrix design proposal

quantco commented on May 29, 2024
A qc.matrix design proposal

from tabmat.

Comments (3)

MarcAntoineSchmidtQC commented on May 29, 2024

My thoughts (mostly the same as when we first talked about this @tbenthompson):

  • Overall, I agree with you that we should limit the API in the short-term.
  • However, I'm a bit sad about this. I got really excited about quantcore.matrix recently. When I was trying to write a function that would take in both a dense NumPy array and a sparse SciPy matrix, I ran into a hell of incompatibility. I believe that quantcore.matrix has the potential to become the go-to data storage for modeling. But we are a very long way from there.
  • I'm not convinced that we need to go as far as removing Dense-, Sparse-, and CategoricalMatrix classes from the public-facing API in order to streamline the release. A couple of points regarding this:
    • If we want to implement something with SplitMatrix, it needs to be implemented with the other underlying classes. Currently, do we need to work on something with those underlying classes? I thought most of the work was with SplitMatrix anyway.
    • For a data scientist with limited knowledge of things like AVX instructions or multiprocessing, CategoricalMatrix is a super nice feature that is easy to understand. Having a public-facing API makes it more salient.
    • Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix? If so, we should make sure that we are not much slower compared to a direct DenseMatrix or SparseMatrix.
  • Having a broader goal allows us to gradually improve it. I feel like qc.matrix is a good source of 1-week tasks through which interns or new employees can learn about git, the review process, scientific computing, and modeling.
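
To make the dense/sparse incompatibility mentioned above concrete, here is a small sketch that uses only NumPy and SciPy, not quantcore.matrix itself. The same reduction returns different types and shapes depending on the input, so a function that accepts both has to normalize the result. The helper name `column_sums` is hypothetical.

```python
import numpy as np
import scipy.sparse as sps

dense = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
sparse = sps.csr_matrix(dense)

# The same call gives a 1-D ndarray for the dense input ...
dense_sums = dense.sum(axis=0)    # shape (2,)
# ... and a 2-D np.matrix for the csr_matrix input.
sparse_sums = sparse.sum(axis=0)  # shape (1, 2)


def column_sums(X):
    """Hypothetical helper: column sums as a 1-D array for either input type."""
    return np.asarray(X.sum(axis=0)).ravel()
```

Every such difference needs its own workaround in calling code, which is the kind of friction a shared matrix interface could hide.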

Overall:

  • We should drop the idea of supporting basic arithmetic operations. Let's keep the classes like the Dataset class of LightGBM (good analogy btw)
  • @tbenthompson, can you explain in more detail what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.

tbenthompson commented on May 29, 2024

Thanks Marc!!

First, an alternate proposal: don't change anything but make it clear which methods are "supported" and which are simply accidentally inherited from their parent classes (np.ndarray and scipy.sparse.csr_matrix). I would be pretty happy with this option! In some ways, I prefer it.

This alternate proposal is also the least amount of work.
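
One way the alternate proposal could look in code is sketched below. This is only an illustration: the `supported` decorator, the `supported_methods` helper, and the toy `DenseMatrix` are all hypothetical and are not taken from the package. The idea is to tag the methods that are officially supported so that they can be listed and documented, while everything inherited from np.ndarray stays untagged.

```python
import numpy as np


def supported(method):
    """Hypothetical marker: tag a method as part of the supported API."""
    method._is_supported = True
    return method


class DenseMatrix(np.ndarray):
    """Toy stand-in (assumption): only mimics inheriting from np.ndarray."""

    def __new__(cls, input_array):
        return np.asarray(input_array, dtype=float).view(cls)

    @supported
    def matvec(self, v):
        return np.asarray(self) @ np.asarray(v)

    @supported
    def transpose_matvec(self, v):
        return np.asarray(self).T @ np.asarray(v)


def supported_methods(cls):
    """Names of the methods explicitly tagged as supported."""
    return sorted(
        name
        for name in dir(cls)
        if getattr(getattr(cls, name, None), "_is_supported", False)
    )
```

With this sketch, `supported_methods(DenseMatrix)` lists only `matvec` and `transpose_matvec`, even though inherited methods such as `cumsum` remain callable.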

On to the original proposal.


@tbenthompson, can you explain in more details what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.

My basic logic is that this will:

  1. Remove API inconsistencies.
  2. Be easier to maintain and understand because the API is smaller.

I'll expand on both these points.

Removing API inconsistencies
Currently, the DenseMatrix inherits from np.ndarray and, as a result, has the entire API of a numpy array. The SparseMatrix inherits from scipy.sparse.csr_matrix and has the entire API of a scipy.sparse matrix. The Categorical, Split and Standardized Matrix classes are written from scratch and have a much smaller API. Having Split and Standardized as the only user-facing classes would make having a consistent API easy.
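
The size of that gap can be seen with a short sketch. Both classes below are toy stand-ins written for this illustration (the names `ToyDenseMatrix`, `ToyCategoricalMatrix`, and `public_api` are hypothetical); they only mimic the inheritance pattern described above.

```python
import numpy as np


class ToyDenseMatrix(np.ndarray):
    """Toy stand-in (assumption): inherits the whole np.ndarray API."""

    def __new__(cls, input_array):
        return np.asarray(input_array, dtype=float).view(cls)

    def matvec(self, v):
        return np.asarray(self) @ np.asarray(v)


class ToyCategoricalMatrix:
    """Toy stand-in (assumption): written from scratch, one method only."""

    def __init__(self, codes):
        self.codes = np.asarray(codes)

    def matvec(self, v):
        # Multiplying a one-hot matrix by v just picks v[code] for each row.
        return np.asarray(v)[self.codes]


def public_api(cls):
    """Public attribute names visible on a class."""
    return {name for name in dir(cls) if not name.startswith("_")}


# Everything the dense class exposes beyond the one method we wrote.
inherited_only = public_api(ToyDenseMatrix) - {"matvec"}
```

Here `public_api(ToyCategoricalMatrix)` contains only `matvec`, while `inherited_only` contains names such as `cumsum`, `reshape`, and `tofile` that nobody chose to support.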

Smaller API
As a general rule, smaller APIs are easier to understand and maintain. Many costs grow in proportion to the number of methods provided: backwards compatibility, testing, and error/input checking. The last one is a concrete example here. For any of our "user"-facing methods, we need to check things like input types, input shapes, etc. See, for example, L124-L134 in categorical_matrix.py. Even if we reduce to just supporting matvec/transpose_matvec/sandwich/standardize, we will still have 20 methods (four methods on each of the five matrix classes) where we need to do that error and input checking. By reducing that to just one or two classes, we can feel safe about the correctness of inputs in many more parts of the code.
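
For a sense of what this checking involves, here is a minimal sketch of input validation for a matvec-style method. It is not the code at the lines referenced above; the function name `check_matvec_input` and the error messages are assumptions made for illustration.

```python
import numpy as np


def check_matvec_input(matrix_shape, other):
    """Hypothetical validation of the right-hand side of a matvec call."""
    other = np.asarray(other)
    if other.ndim not in (1, 2):
        raise ValueError(
            f"other must be 1- or 2-dimensional, got ndim={other.ndim}"
        )
    if other.shape[0] != matrix_shape[1]:
        raise ValueError(
            f"Shapes do not match: matrix has {matrix_shape[1]} columns, "
            f"other has {other.shape[0]} rows"
        )
    return other
```

Each user-facing method on each class needs a check of this kind, which is why the count of public methods matters.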


You also asked:

Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix?

Yes, that or a StandardizedMatrix. The overhead is very low. I just measured it and it seems to be on the order of 100 microseconds. We could probably make that smaller, but it's very small already.

lbittarello commented on May 29, 2024

I would rather ditch inherited methods than preserve an inconsistent API. Nevertheless, I think it's reasonable at this point to just clarify which methods are officially supported (i.e. changes will first prompt a deprecation warning and imply a major update) and which aren't (i.e. changes may occur without warning).

I wouldn't mind if the SplitMatrix became the only user-facing class in the package. I'd also love it if we gradually augmented it till it took over NumPy and SciPy. 😬
