cxd / scala-au.id.cxd.math

Libraries containing math-related functions in Scala. Provides probability distributions and related operations.

License: MIT License

Scala 93.43% R 0.06% HTML 6.51%
analysis bayesian experiments hmm linear-regression manova math mcmc numerical regression scala statistics svd


scala-au.id.cxd.math's Issues

Generalized linear models.

Begin work on an implementation of the base family for GLMs using the common approaches to optimisation, such as maximum likelihood estimation. Families and link functions should follow those available in R (a sketch of the abstraction follows the list):
Normal, identity (equal to lm)
Binomial, logit and probit
Poisson, log
Gamma, inverse
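
A minimal sketch of the family/link abstraction referenced above; the trait and object names are illustrative assumptions, not the repository's existing API:

// Hypothetical sketch of a GLM family/link abstraction (names are illustrative).
trait LinkFunction {
  def link(mu: Double): Double      // eta = g(mu)
  def invLink(eta: Double): Double  // mu = g^{-1}(eta)
}

object IdentityLink extends LinkFunction {
  def link(mu: Double): Double = mu
  def invLink(eta: Double): Double = eta
}

object LogitLink extends LinkFunction {
  def link(mu: Double): Double = math.log(mu / (1.0 - mu))
  def invLink(eta: Double): Double = 1.0 / (1.0 + math.exp(-eta))
}

object LogLink extends LinkFunction {
  def link(mu: Double): Double = math.log(mu)
  def invLink(eta: Double): Double = math.exp(eta)
}

object InverseLink extends LinkFunction {
  def link(mu: Double): Double = 1.0 / mu
  def invLink(eta: Double): Double = 1.0 / eta
}

// A family pairs a variance function with its canonical link.
trait Family {
  def variance(mu: Double): Double
  def defaultLink: LinkFunction
}

object BinomialFamily extends Family {
  def variance(mu: Double): Double = mu * (1.0 - mu)
  def defaultLink: LinkFunction = LogitLink
}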

Implement t-distribution

Add an implementation for the t distribution. This is required for inference in both generalized linear models and linear models, as well as for inference with smaller sample sizes.
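
As a starting point, a minimal sketch of the t density, assuming Breeze's lgamma is available for the log-gamma terms:

import breeze.numerics.lgamma

// Student's t PDF: Gamma((v+1)/2) / (sqrt(v*pi) * Gamma(v/2)) * (1 + x^2/v)^(-(v+1)/2),
// evaluated on the log scale for numerical stability.
def studentTPdf(x: Double, df: Double): Double = {
  val logC = lgamma((df + 1.0) / 2.0) - lgamma(df / 2.0) - 0.5 * math.log(df * math.Pi)
  math.exp(logC - ((df + 1.0) / 2.0) * math.log(1.0 + x * x / df))
}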

Inverse CDF functions or optimise CriticalValue search method

The current method of estimating critical values or quantiles for a given distribution makes use of an approximation of the proper integral of the distribution's PDF over a given range, for a probability at which the critical value or quantile is sought.

The method of evaluating the integral makes use of a large stream sequence which is inefficient.

/**
* generate a sequence from start by increment.
* Default increment is 0.1.
*/
def sequence(last: Double, by: Double = 0.1): Stream[Double] = {
last #:: sequence(last + by, by)
}

Numerical Recipes lists the inverse CDF for the major distributions. These can be approximated directly, without approximating via the trapezoidal integration method.

Either implement the inverse CDF functions for each distribution, or implement a more efficient search method, such as an annealing technique, when searching for quantiles by approximating the integral.
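
As one possible search improvement, a bisection sketch is shown below; it assumes a monotone cdf function is available for the distribution and that the sought probability lies within the bracketing interval. It converges in O(log((upper - lower)/tol)) steps rather than scanning a stream linearly.

// Bisection search for the quantile q such that cdf(q) ~= p.
// Assumes cdf is monotone non-decreasing and p lies within [cdf(lower), cdf(upper)].
def invCdf(cdf: Double => Double, p: Double,
           lower: Double, upper: Double, tol: Double = 1e-10): Double = {
  @annotation.tailrec
  def search(a: Double, b: Double): Double = {
    val mid = 0.5 * (a + b)
    if (b - a < tol) mid
    else if (cdf(mid) < p) search(mid, b)
    else search(a, mid)
  }
  search(lower, upper)
}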

Summarizing p-values for high dimensional predictors

When assessing the significance level of the $\beta$ parameter, the z-score and associated p-value can be used in a test of significance assuming that $\hat{\beta}_j \sim N(0, v_j)$, giving a confidence bound
$[\hat{\beta}_j \pm z_\alpha \sqrt{v_j}]$
One issue with high dimensional predictors is how to assess the significance. It is desirable to communicate to others why certain attributes have a given level of significance. Find methods used to assess the significance of high dimensional predictors, along with examples of interpreting and communicating their significance.
For example review:
https://arxiv.org/pdf/1202.1377.pdf
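
For reference, the z-score and two-sided p-value under the normal approximation can be computed directly; this sketch assumes Breeze's Gaussian distribution (newer Breeze versions require an implicit RandBasis in scope):

import breeze.stats.distributions.{Gaussian, RandBasis}

implicit val rb: RandBasis = RandBasis.mt0
val standardNormal = Gaussian(0.0, 1.0)

// z-score and two-sided p-value for a coefficient beta_j with variance v_j.
def zScore(beta: Double, v: Double): Double = beta / math.sqrt(v)

def pValue(beta: Double, v: Double): Double =
  2.0 * (1.0 - standardNormal.cdf(math.abs(zScore(beta, v))))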

Regularized Incomplete gamma, beta and log gamma functions

These functions are required for the implementation of the CDF for the beta and F distributions.
The log gamma function is useful as a basic building block for the other functions.
The incomplete beta can be expressed in terms of the beta distribution and products of the gamma function.
Review available texts, including Numerical Recipes and the GSL, for implementation approaches, and devise a suitable implementation in Scala.
Then continue on to testing the CDF for the beta and F distributions.
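
As a building block, a log-gamma sketch following the Numerical Recipes gammln routine (a Lanczos approximation, valid for x > 0):

// Lanczos approximation to ln Gamma(x), after Numerical Recipes' gammln.
def gammln(x: Double): Double = {
  val cof = Array(76.18009172947146, -86.50532032941677, 24.01409824083091,
    -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5)
  var y = x
  var tmp = x + 5.5
  tmp -= (x + 0.5) * math.log(tmp)
  var ser = 1.000000000190015
  var j = 0
  while (j < 6) { y += 1.0; ser += cof(j) / y; j += 1 }
  -tmp + math.log(2.5066282746310005 * ser / x)
}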

Tests for multivariate normality

Implement tests for MVN and demonstrate their usage.

Currently planning to provide Mardia's statistics for skewness and kurtosis, the Henze-Zirkler test, and the Royston test statistic; a sketch of the Mardia statistics follows.
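
A sketch of Mardia's skewness and kurtosis statistics using Breeze; the covariance here is the biased maximum-likelihood estimate (dividing by n), following Mardia's original formulation:

import breeze.linalg._
import breeze.stats.mean

// Mardia's multivariate skewness (b1p) and kurtosis (b2p) for an n x p data matrix.
def mardia(x: DenseMatrix[Double]): (Double, Double) = {
  val n = x.rows
  val mu = mean(x(::, *)).t                     // column means
  val centered = x(*, ::) - mu                  // subtract the mean from every row
  val s = (centered.t * centered) / n.toDouble  // ML covariance estimate
  val g = centered * inv(s) * centered.t        // g(i,j) = (x_i - mu)' S^{-1} (x_j - mu)
  val b1p = sum(g.map(v => v * v * v)) / (n.toDouble * n.toDouble)
  val b2p = (0 until n).map(i => g(i, i) * g(i, i)).sum / n.toDouble
  (b1p, b2p)
}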

DummyVariableBuilder is slow to perform mapping on large files - latency increases as n*m

When processing discrete data it's useful to build an indicator matrix.

The current implementation of DummyVariableBuilder does this by extracting the unique values for each column, determining the total number of unique values per variable, and creating a matrix of n rows by m total unique value columns.

The process of generating the matrix is quite slow, since it currently uses the DenseMatrix.tabulate method and iterates n by m times.

When dealing with large matrices the number of iterations increases in proportion to n by m.

This can take quite some time when generating dummy variables for files with many unique column values over large numbers of rows (say cols = 1000+ and rows = 500000, which causes 500,000,000 iterations).

This can be reduced by identifying, ahead of time, the index position of each row value within the unique values for each column.
(columnValues should also be augmented with columnIndices, which represents the position of each row's value for that column within the ordered set of unique values.)

When generating the matrix, first build the entire matrix using the DenseMatrix.zeros(n, m) function.
Then do not iterate over the entire n rows by m columns; instead iterate only over the rows and the number of DummyVariableBuilders, which in themselves represent a subset of the columns (one DummyVariableBuilder corresponds to k columns inside the total m columns).

For every row value of the kth DummyVariableBuilder, look up the corresponding columnIndex for that row, then assign a 1 to the matrix: M(row, k) = 1.

This means applying n x k operations instead of n x m operations, which should reduce the number of operations considerably, as long as the matrix zeros operation is optimised to start with when first building the initial matrix, and the apply method of matrix(row, col) := value is hopefully optimised as well. A sketch of this approach follows.
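
A minimal sketch of this approach; the layout below (one array of raw string values per original variable) is an illustrative assumption rather than the actual DummyVariableBuilder structure:

import breeze.linalg.DenseMatrix

// Precompute the offset of each value within its variable's sorted unique values,
// then fill a single cell per (row, variable) pair instead of tabulating n x m cells.
def indicatorMatrix(columns: Seq[Array[String]], n: Int): DenseMatrix[Double] = {
  val uniques = columns.map(_.distinct.sorted)
  val offsets = uniques.scanLeft(0)(_ + _.length)       // starting column per variable
  val lookups = uniques.map(u => u.zipWithIndex.toMap)  // value -> local column index
  val mat = DenseMatrix.zeros[Double](n, offsets.last)
  for (k <- columns.indices; row <- 0 until n) {
    mat(row, offsets(k) + lookups(k)(columns(k)(row))) = 1.0
  }
  mat
}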

Add routines to assist in analysis of SVD and PCA

The SVD

A = USV'

similar to PCA, provides a method to explore the amount of variation explained by objects or attributes.
In the case of PCA, the matrix being decomposed is (AA') for objects, and (A'A) for attributes.

The SVD provides the measure of variation in the diagonal of S: the non-zero values on the diagonal are the square roots of the non-zero eigenvalues (from PCA).

U provides the components/orthonormal eigenvectors of the relation between objects (rows), from AA'.
V' provides the components/orthonormal eigenvectors of the relation between attributes (columns), from A'A.

There are a number of applications of the SVD related to PCA. Dimensionality reduction is one particular application, as is inference relating to the parameters resulting from the decomposition.

It will be useful to leverage the SVD provided by Breeze to provide methods for attribute selection based on explained variance, and methods for "clustering" related to the SVD; a sketch is given below.
Similarly for PCA.
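
For instance, the proportion of variation explained by each singular direction falls directly out of Breeze's svd (a minimal sketch):

import breeze.linalg.{DenseMatrix, DenseVector, svd, sum}

// The squared singular values are the non-zero eigenvalues of A'A (and AA'),
// so s_i^2 / sum(s^2) measures the proportion of variation for component i.
def varianceExplained(a: DenseMatrix[Double]): DenseVector[Double] = {
  val svd.SVD(_, s, _) = svd(a)
  val s2 = s.map(v => v * v)
  s2 / sum(s2)
}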

Additional applications of dimensionality reduction include simple encoders/decoders, where the original input is compressed by its projection against the eigenvectors for objects and can be decompressed (Eigenfaces, for example, demonstrates a similar approach).

There are also synergies that have been proposed between the basis matrix and partitions within clusters (although this is somewhat contentious).

Additionally a writeup in the notes section pertaining to both SVD and PCA would be useful.

References would include:

https://en.wikipedia.org/wiki/Singular_value_decomposition

T.W. Anderson, "An Introduction to Multivariate Statistical Analysis"

D. Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions"

T. Hastie et al., "The Elements of Statistical Learning"

as well as

C. Bishop "Pattern Recognition and Machine Learning"

Calculate p-values for beta parameter

When calculating the p-values for the beta parameter, the pseudo-inverse of the predictors, (X'X)^{-1}, seems to have negative values on the diagonal.
This impacts the approximation of \hat{\sigma} when taking the square root, as this results in a complex number. So when calculating the z-score from the diagonal by dividing \frac{\hat{\beta}_j}{\sqrt{v_j}}, the value cannot be determined using the standard double type.
Firstly, we need to determine whether the negative values are expected in the pseudo-inverse. Secondly, this may mean introducing the use of complex numbers when calculating the z-score for the beta parameter. Otherwise, can we get away with using only the real part of the complex number?

Implement MANOVA

Implement MANOVA tests

using within-group and between-group variance measures, with the following statistics (a sketch of their computation follows the list):

  • Wilks' Lambda
  • Pillai's trace
  • Roy's largest root
  • Lawley-Hotelling trace
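
Each statistic can be computed from the eigenvalues of W^{-1}B, where W and B are the within-group and between-group SSCP matrices; a sketch, assuming W is invertible and at least one eigenvalue is non-zero:

import breeze.linalg.{DenseMatrix, eig, inv}

// The four MANOVA statistics from the eigenvalues of W^{-1} B.
def manovaStats(w: DenseMatrix[Double], b: DenseMatrix[Double]): (Double, Double, Double, Double) = {
  val lambdas = eig(inv(w) * b).eigenvalues.toArray.filter(_ > 1e-12)
  val wilks = lambdas.map(l => 1.0 / (1.0 + l)).product  // Wilks' Lambda
  val pillai = lambdas.map(l => l / (1.0 + l)).sum       // Pillai's trace
  val roys = lambdas.max                                 // Roy's largest root (one common convention)
  val lawleyHotelling = lambdas.sum                      // Lawley-Hotelling trace
  (wilks, pillai, roys, lawleyHotelling)
}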

Implement Lasso regression regularisation

Pang, Lin and Jian discuss methods for automatic regularisation, including lasso regularisation.
Experiment with this method and the algorithms discussed, as well as the evaluation approach.

Note they describe the lasso solution as
$$
\hat{\beta}_j = \operatorname{sign}(\hat{\beta}_j^0)\left(\sqrt{n}\,|\hat{\beta}_j^0| - \frac{\lambda}{2}\right)
$$
The denominator of 2 in the $\lambda/2$ term may be due to their presentation using only two orthogonal predictors, whereas in Hastie the approximation is shown without the denominator.

Examine the article and the chapter on the lasso in ESL, determine an appropriate method of implementation, and experiment with the results.
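
One candidate implementation route is cyclic coordinate descent with the soft-threshold update, as presented in ESL; a minimal sketch, assuming the columns of x are standardized and y is centred:

import breeze.linalg.{DenseMatrix, DenseVector}

// Soft-threshold operator: sign(z) * (|z| - g)_+
def softThreshold(z: Double, g: Double): Double =
  math.signum(z) * math.max(math.abs(z) - g, 0.0)

// Cyclic coordinate descent for the lasso.
def lasso(x: DenseMatrix[Double], y: DenseVector[Double],
          lambda: Double, iters: Int = 100): DenseVector[Double] = {
  val n = x.rows.toDouble
  val beta = DenseVector.zeros[Double](x.cols)
  for (_ <- 0 until iters; j <- 0 until x.cols) {
    val xj = x(::, j)
    val r = y - x * beta + xj * beta(j)  // partial residual excluding predictor j
    val z = (xj dot r) / n
    beta(j) = softThreshold(z, lambda) / ((xj dot xj) / n)
  }
  beta
}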

Investigate application of ODEs

After spending this last semester learning about dynamical systems, phase plots, and the application of ODEs to the qualitative analysis of phases of dynamic systems, explore the application of ODEs in the domain of statistical learning. Are ODEs applied only in optimisation processes?
Is there a manner in which the qualitative analysis used in dynamical systems can be applied in the domain of statistical models, such as in time series analysis?

Alternately, there is definitely the ability to apply modelling in terms of control algorithms; perhaps explore that further in relation to playing with Sphero for navigation and control.

Is there also an application of dynamical modelling in image processing? For example models of light and colour? Do some investigation into this also.

Implement quadratic discriminant function.

Based on the similar approach in canonical discriminant analysis, implement the assumption that there is no common variance between groups, and then apply the additional component in the decision functions as outlined in Duda and Hart, ch. 2.
The calculation of the discriminant functions will, however, require a basis for each group; a sketch follows.
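
A sketch of the per-group quadratic discriminant along the lines of Duda and Hart, assuming each group supplies its own mean, covariance estimate, and prior:

import breeze.linalg.{DenseMatrix, DenseVector, det, inv}

// Quadratic discriminant for one group with its own covariance:
// g_k(x) = -0.5 ln|S_k| - 0.5 (x - mu_k)' S_k^{-1} (x - mu_k) + ln(prior_k)
def quadraticDiscriminant(x: DenseVector[Double], mu: DenseVector[Double],
                          sigma: DenseMatrix[Double], prior: Double): Double = {
  val d = x - mu
  -0.5 * math.log(det(sigma)) - 0.5 * (d.t * inv(sigma) * d) + math.log(prior)
}

A new observation is assigned to the group with the largest discriminant value.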
