Comments (3)
My thoughts (mostly the same as when we first talked about this @tbenthompson):
- Overall, I agree with you that we should limit the API in the short-term.
- However, I'm a bit sad about this. I got really excited about quantcore.matrix recently. When I was trying to write a function that would accept both a dense NumPy array and a sparse SciPy matrix, I ran into a mess of incompatibilities. I believe that quantcore.matrix has the potential to become the go-to data storage for modeling, but we are a very long way from there.
- I'm not convinced that we need to go as far as removing Dense-, Sparse-, and CategoricalMatrix classes from the public-facing API in order to streamline the release. A couple of points regarding this:
- If we want to implement something with SplitMatrix, it needs to be implemented with the other underlying classes. Currently, do we need to work on something with those underlying classes? I thought most of the work was with SplitMatrix anyway.
- For a data scientist with limited knowledge of things like AVX instructions or multiprocessing, CategoricalMatrix is a super nice feature that is easy to understand. Having a public-facing API makes it more salient.
- Would this mean that the underlying type of `X` in quantcore.glm will always be a `SplitMatrix`? If so, we should make sure that we are not much slower compared to a direct `DenseMatrix` or `SparseMatrix`.
- Having a broader goal allows us to gradually improve it. I feel like qc.matrix is a good way for interns or new employees to have a 1-week task to learn about git, the review process, scientific computing, and modeling.
Overall:
- We should drop the idea of supporting basic arithmetic operations. Let's keep the classes like the `Dataset` class of LightGBM (good analogy btw).
- @tbenthompson, can you explain in more detail what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.
from tabmat.
Thanks Marc!!
First, an alternate proposal: don't change anything but make it clear which methods are "supported" and which are simply accidentally inherited from their parent classes (np.ndarray and scipy.sparse.csr_matrix). I would be pretty happy with this option! In some ways, I prefer it.
This alternate proposal is also the least amount of work.
On to the original proposal.
> @tbenthompson, can you explain in more details what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.
My basic logic is that this will:
- Remove API inconsistencies.
- Make the package easier to maintain and understand, because the API is smaller.
I'll expand on both these points.
Removing API inconsistencies
Currently, the DenseMatrix inherits from np.ndarray and, as a result, has the entire API of a numpy array. The SparseMatrix inherits from scipy.sparse.csr_matrix and has the entire API of a scipy.sparse matrix. The Categorical, Split and Standardized Matrix classes are written from scratch and have a much smaller API. Having Split and Standardized as the only user-facing classes would make having a consistent API easy.
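To make the inconsistency concrete, here is a minimal sketch. `DenseLike` and `CategoricalLike` are hypothetical stand-ins, not the package's actual classes; they only mimic the pattern described above, where one class subclasses np.ndarray and another is written from scratch.

```python
import numpy as np


class DenseLike(np.ndarray):
    """Hypothetical stand-in for a class that subclasses np.ndarray."""

    def __new__(cls, data):
        return np.asarray(data, dtype=float).view(cls)

    def matvec(self, vec):
        # Drop back to a plain ndarray so the result is not a DenseLike.
        return np.asarray(self) @ np.asarray(vec, dtype=float)


class CategoricalLike:
    """Hypothetical stand-in for a class written from scratch."""

    def __init__(self, codes, n_categories):
        self.codes = np.asarray(codes, dtype=int)
        self.shape = (len(self.codes), n_categories)

    def matvec(self, vec):
        # Row i of the implied one-hot matrix selects vec[codes[i]].
        return np.asarray(vec, dtype=float)[self.codes]


def public_api(obj):
    """Names of all public attributes and methods on obj."""
    return {name for name in dir(obj) if not name.startswith("_")}


dense = DenseLike([[1.0, 2.0], [3.0, 4.0]])
cat = CategoricalLike([0, 1, 1], n_categories=2)

# Everything ndarray offers leaks into the subclass, whether intended or not.
inherited_only = public_api(dense) - public_api(cat)
print(len(public_api(dense)), len(public_api(cat)))
print(sorted(inherited_only)[:5])
```

Both objects answer `matvec`, but only the ndarray subclass also answers things like `cumsum` or `reshape`, so code written against one class can silently fail on the other.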
Smaller API
As a general rule, smaller APIs are easier to understand and maintain. Many costs grow in proportion to the number of methods provided: backwards compatibility, testing, and error/input checking. The last one is a concrete example here. For any of our "user-facing" methods, we need to check things like input types and input shapes. See, for example, L124-L134 in categorical_matrix.py. Currently, even if we reduce to just supporting matvec/transpose_matvec/sandwich/standardize, we will have 20 methods (four on each of the five matrix classes) where we need to do that error and input checking. By reducing the user-facing surface to just one or two classes, we can feel safe about the correctness of inputs in many more parts of the code.
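As a rough illustration of the point about input checking, the sketch below validates once at the user-facing boundary and lets internal helpers trust their inputs. `UserFacingMatrix`, `check_matvec_input`, and `_matvec_unchecked` are hypothetical names invented for this example; this is not the code in categorical_matrix.py.

```python
import numpy as np


def check_matvec_input(vec, n_cols):
    """Validate a vector before a matrix-vector product (illustrative only)."""
    vec = np.asarray(vec)
    if vec.ndim != 1:
        raise ValueError(f"Expected a 1-d vector, got {vec.ndim} dimensions.")
    if vec.shape[0] != n_cols:
        raise ValueError(
            f"Vector has length {vec.shape[0]}, but the matrix has {n_cols} columns."
        )
    if not np.issubdtype(vec.dtype, np.number):
        raise TypeError(f"Expected a numeric vector, got dtype {vec.dtype}.")
    return vec.astype(float, copy=False)


def _matvec_unchecked(block, vec):
    # Internal helper: trusts its caller and does no validation of its own.
    return block @ vec


class UserFacingMatrix:
    """Hypothetical wrapper: the only place where inputs are validated."""

    def __init__(self, dense_block):
        self._block = np.asarray(dense_block, dtype=float)
        self.shape = self._block.shape

    def matvec(self, vec):
        vec = check_matvec_input(vec, self.shape[1])
        return _matvec_unchecked(self._block, vec)


m = UserFacingMatrix([[1, 2], [3, 4]])
print(m.matvec([1, 1]))
```

With this layout, every additional user-facing class would need its own copy of the checks, while every additional internal helper needs none.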
You also asked:
> Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix?
Yes, that or a StandardizedMatrix. The overhead is very low. I just measured it and it seems to be on the order of 100 microseconds. We could probably make that smaller, but it's very small already.
from tabmat.
I would rather ditch inherited methods than preserve an inconsistent API. Nevertheless, I think it's reasonable at this point to just clarify which methods are officially supported (i.e. changes will first prompt a deprecation warning and imply a major update) and which aren't (i.e. changes may occur without warning).
I wouldn't mind if the `SplitMatrix` became the only user-facing class in the package. I'd also love if we gradually augmented it till it took over NumPy and SciPy. 😬
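One possible way to make the supported-versus-unsupported distinction explicit is sketched below, using only the standard library. The `supported` and `deprecated` decorators and the `ExampleMatrix` class are hypothetical and exist only for illustration; nothing like them is confirmed to be in the package.

```python
import functools
import warnings


def supported(method):
    """Mark a method as part of the officially supported API."""
    method._supported = True
    return method


def deprecated(message):
    """Emit a DeprecationWarning each time the wrapped method is called."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(*args, **kwargs):
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return method(*args, **kwargs)

        return wrapper

    return decorator


class ExampleMatrix:
    def __init__(self, rows):
        self._rows = [[float(x) for x in row] for row in rows]

    @supported
    def matvec(self, vec):
        return [sum(a * b for a, b in zip(row, vec)) for row in self._rows]

    @supported
    @deprecated("dot() is deprecated; use matvec() instead.")
    def dot(self, vec):
        # Supported, so a change is announced with a warning first.
        return self.matvec(vec)

    def tolist(self):
        # Not marked as supported: may change without warning.
        return [row[:] for row in self._rows]


def supported_methods(cls):
    """List the methods defined on cls that carry the supported marker."""
    return sorted(
        name
        for name, attr in vars(cls).items()
        if getattr(attr, "_supported", False)
    )


print(supported_methods(ExampleMatrix))
```

A list like the one returned by `supported_methods` could then feed the documentation, so that the supported surface is stated in one place.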
from tabmat.