GithubHelp home page GithubHelp logo

oezgesahin / vineclust Goto Github PK

View Code? Open in Web Editor NEW
8.0 2.0 2.0 109 KB

Model-based clustering with vine copulas

License: GNU General Public License v3.0

R 100.00%
clustering copula mixture-models statistics vine

vineclust's Introduction

CRAN status R-CMD-check codecov

vineclust: Model-Based Clustering with Vine Copulas

An R package that fits vine copula based mixture model distributions to the continuous data for a given number of components as proposed in VCMM algorithm and use its results for clustering.

It depends on VineCopula, fGarch, mclust, and univariateML.

Installation

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("oezgesahin/vineclust")

Package overview

Below is an overview of some functions and features.

  • vcmm(): fits vine copula based mixture model distributions to the continuous data for a given number of components. Returns an object of class vcmm_res(). The class has the following methods:
    • print: a brief overview of the model statistics.
    • summary: list of fitted model components, including selected vine tree structures, bivariate copula families, univariate marginal distributions, and estimated parameters.
  • dvcmm(), rvcmm(): density and random generation for the vine copula based mixture model distributions.

Bivariate copula families

This package works with a wide range of parametric bivariate copula families for bivariate or multivariate clustering using vine copulas. Specifically, it allows fitting elliptical (Gaussian, Student-t) and Archimedean (Clayton, Gumbel, Frank, Joe, BB1, BB6, and BB8) copulas with their possible 90, 180, 270 degrees rotations to cover a large range of dependence patterns. Their encoding is detailed on VineCopula.

Univariate marginal distributions

This package currently includes following unimodal univariate marginal distributions.

  • cauchy(a,b): Cauchy distribution with location parameter a and scale parameter b,
  • gamma(a,b): gamma distribution with shape parameter a and rate parameter b,
  • llogis(a,b): log-logistic distribution with shape parameter a and rate parameter b,
  • lnorm(a,b): log-normal distribution with mean parameter a and standard deviation parameter b on the logarithmic scale,
  • logis(a,b): logistic distribution with location parameter a and scale parameter b,
  • norm(a,b): normal distribution with mean parameter a and standard deviation parameter b,
  • snorm(a,b,c): skew normal distribution with location parameter a, scale parameter b, and skewness parameter c.
  • std(a,b,c): Student’s t distribution with location parameter a, scale parameter b, and shape parameter c,
  • sstd(a,b,c,d): skew Student’s t distribution with location parameter a, scale parameter b, shape parameter c, and skewness parameter d.

Initial partition methods

This package currently implements following partition approaches to have starting values.

  • kmeans: performs k-means clustering (Hartigan-Wong) on given data after scaling,
  • hcVVV: performs model-based hierarchical clustering on given data after scaling,
  • gmm: performs model-based clustering with Gaussian mixture models on given data.

Usage

library(vineclust)
# data from UCI Machine Learning Repository 
data_wisc <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data", header = FALSE)

Model fitting

 # R-vine copula based mixture model with total components 2
fit <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2)
 # Model statistics
print(fit)
#> logLik = 2159.11   BIC = -4108.88   ICL = -4068.3   npars = 33   initial clustering = kmeans   ECM iterations = 4
# Fitted vine copula distributions
summary(fit) 
#> $margins
#>      [,1]          [,2]         
#> [1,] "Lognormal"   "Lognormal"  
#> [2,] "Lognormal"   "Gamma"      
#> [3,] "Lognormal"   "Skew Normal"
#> [4,] "Skew Normal" "Normal"     
#> 
#> $marginal_pars
#> , , 1
#> 
#>           [,1]       [,2]       [,3]       [,4]
#> [1,] 1.3257284 -1.9132582 -0.8083077 0.18916084
#> [2,] 0.5085319  0.1325288  0.3702962 0.03873749
#> [3,]        NA         NA         NA 1.81086926
#> [4,]        NA         NA         NA         NA
#> 
#> , , 2
#> 
#>           [,1]      [,2]       [,3]       [,4]
#> [1,] 0.6513504  42.84792  0.1607200 0.07370799
#> [2,] 0.3849773 347.42118  0.1222864 0.03358325
#> [3,]        NA        NA 13.0380866         NA
#> [4,]        NA        NA         NA         NA
#> 
#> 
#> $copula
#> , , 1
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    0    0    0    0
#> [2,]    1    0    0    0
#> [3,]   24    1    0    0
#> [4,]    5    1   20    0
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    0    0    0    0
#> [2,]   33    0    0    0
#> [3,]   33   23    0    0
#> [4,]   10    3   14    0
#> 
#> 
#> $copula_first_par
#> , , 1
#> 
#>            [,1]       [,2]  [,3] [,4]
#> [1,]  0.0000000  0.0000000 0.000    0
#> [2,] -0.3781992  0.0000000 0.000    0
#> [3,] -1.0644572 -0.2277139 0.000    0
#> [4,]  2.1769280  0.3734696 5.682    0
#> 
#> , , 2
#> 
#>             [,1]        [,2]    [,3] [,4]
#> [1,]  0.00000000  0.00000000 0.00000    0
#> [2,] -0.09652639  0.00000000 0.00000    0
#> [3,] -0.19082564 -0.07882284 0.00000    0
#> [4,]  1.30499379  0.41096477 2.66261    0
#> 
#> 
#> $copula_second_par
#> , , 1
#> 
#>      [,1] [,2]      [,3] [,4]
#> [1,]    0    0 0.0000000    0
#> [2,]    0    0 0.0000000    0
#> [3,]    0    0 0.0000000    0
#> [4,]    0    0 0.6308382    0
#> 
#> , , 2
#> 
#>           [,1] [,2] [,3] [,4]
#> [1,] 0.0000000    0    0    0
#> [2,] 0.0000000    0    0    0
#> [3,] 0.0000000    0    0    0
#> [4,] 0.9357103    0    0    0
#> 
#> 
#> $vine_structure
#> , , 1
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    2    0    0    0
#> [2,]    1    1    0    0
#> [3,]    4    3    3    0
#> [4,]    3    4    4    4
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    0    0    0
#> [2,]    3    2    0    0
#> [3,]    2    3    3    0
#> [4,]    4    4    4    4
#> 
#> 
#> $mixture_probs
#> [1] 0.3524161 0.6475839
# Evaluate the density of the fitted model at (2.747, 0.1467, 0.13, 0.05334)
RVMs_fitted <- list()
RVMs_fitted[[1]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,1],
                        family=fit$output$bicop_familyset[,,1],
                        par=fit$output$bicop_param[,,1],
                        par2=fit$output$bicop_param2[,,1])
RVMs_fitted[[2]] <- VineCopula::RVineMatrix(Matrix=fit$output$vine_structure[,,2],
                        family=fit$output$bicop_familyset[,,2],
                        par=fit$output$bicop_param[,,2],
                        par2=fit$output$bicop_param2[,,2])
dvcmm(c(2.747, 0.1467, 0.13, 0.05334), fit$output$margin, fit$output$marginal_param, RVMs_fitted, fit$output$mixture_prob)
#> [1] 26.70683
# C-vine copula based mixture model
fit_cvine <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, is_cvine=1)  
# Confusion matrix w.r.t. true classification
table(fit_cvine$cluster, data_wisc$V2) 
#>    
#>       B   M
#>   1 325  21
#>   2  32 191
# Fit only bivariate Clayton copula for pairs of variables in both components
fit_clayton <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, bicop=c(3))  
# Fix vine tree structures of both components 
fit_fix_vinestr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2,
                        vinestr=matrix(c(1,2,3,4,0,2,4,3,0,0,4,3,0,0,0,3),4,4)) 
# Run ECM iterations shorter with a smaller threshold than the default threshold
fit_sthr <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, threshold=0.001)  
# Use a different initial partition approach from k-means
fit_best_init <- vcmm(data=data_wisc[,c(15,27,29,30)], total_comp=2, methods=c("gmm"))  

Simulation

# Simulation setup given in Section 5.2 of the paper at https://arxiv.org/pdf/2102.03257.pdf
dims <- 3
obs <- c(500,500) 
RVMs <- list()
RVMs[[1]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2),dims,dims),
                        family=matrix(c(0,3,4,0,0,14,0,0,0),dims,dims),
                        par=matrix(c(0,0.8571429,2.5,0,0,5,0,0,0),dims,dims),
                        par2=matrix(sample(0, dims*dims, replace=TRUE),dims,dims)) 
RVMs[[2]] <- VineCopula::RVineMatrix(Matrix=matrix(c(1,3,2,0,3,2,0,0,2), dims,dims),
                        family=matrix(c(0,6,5,0,0,13,0,0,0), dims,dims),
                        par=matrix(c(0,1.443813,11.43621,0,0,2,0,0,0),dims,dims),
                        par2=matrix(sample(0, dims*dims, replace=TRUE),dims,dims))
margin <- matrix(c('Normal', 'Gamma', 'Lognormal', 'Lognormal', 'Normal', 'Gamma'), 3, 2) 
margin_pars <- array(0, dim=c(2, 3, 2))
margin_pars[,1,1] <- c(1, 2)
margin_pars[,1,2] <- c(1.5, 0.4)
margin_pars[,2,1] <- c(1, 0.2)
margin_pars[,2,2] <- c(18, 5)
margin_pars[,3,1] <- c(0.8, 0.8)
margin_pars[,3,2] <- c(1, 0.2)
x_data <- rvcmm(dims, obs, margin, margin_pars, RVMs)

Contact

Please contact [email protected] if you have any questions.

References

Sahin, {"O}., & Czado, C. (2022). Vine copula mixture models and clustering for non-Gaussian data. Econometrics and Statistics. doi:10.1016/j.ecosta.2021.08.011. preprint, article

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.