Het-SBM

This toolbox provides fitting procedures for the multi-subject heterogeneous stochastic blockmodel, termed Het-SBM. Het-SBM was first described in Chapter 4 of the thesis 'Generalised Stochastic Blockmodels and their Applications in the Analysis of Brain Networks' (Pavlovic, 2015) and subsequently in the paper 'Multi-Subject Stochastic Blockmodels for Adaptive Analysis of Individual Differences in Human Brain Network Cluster Structure' (https://www.biorxiv.org/content/10.1101/672071v1.abstract).

Data Provided by User

The HetSBM function expects two datasets: X (adjacency data), which enters the analysis as a tensor, and D (covariates), which enters as a matrix. Specifically, X is a 3-dimensional tensor (array) encoding one adjacency matrix per subject. The dimension of X is n by n by K, where n is the total number of nodes in a network and K is the total number of subjects in the study. For example, X[ , , 1] is the square adjacency matrix of the first subject. Each subject's matrix (i.e. X[ , , k]) is assumed to be binary, symmetric and without self-connected nodes (i.e. the principal diagonal of X[ , , k] is 0). Covariates of interest are encoded in the design matrix D, which is K by P, where K is the total number of subjects and P is the total number of covariates in the model including the intercept.
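As a minimal sketch of the expected input shapes (simulated data, for illustration only; all variable names are hypothetical), X and D could be assembled as follows:

```r
set.seed(1)
n <- 50   # total number of nodes per network
K <- 20   # total number of subjects

# X: n x n x K binary, symmetric adjacency tensor with an empty diagonal
X <- array(0L, dim = c(n, n, K))
for (k in seq_len(K)) {
  A <- matrix(rbinom(n * n, 1, 0.1), n, n)
  A[lower.tri(A)] <- t(A)[lower.tri(A)]  # symmetrise
  diag(A) <- 0L                          # no self-connections
  X[, , k] <- A
}

# D: K x P design matrix whose first column is the intercept
D <- cbind(intercept = rep(1, K), covariate = rnorm(K))
```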

Fitted Variables Returned by the Function

The Het-SBM function returns the following set of fitted parameters: tau, alpha, beta, FI, ICL, VB, convergence and monoblock.

  • tau is an n by Q matrix giving the probability that each node falls into each cluster. For a given node, the final cluster label is obtained by selecting the cluster with the highest probability; repeating this for all nodes yields a vector of cluster labels.

  • alpha is a vector of length Q giving the probability that a randomly selected node falls into a particular cluster.

  • beta holds the fitted linear regression coefficients for each element of the cluster structure, given as a Q by Q by P tensor of regression coefficient values.

  • FI is the Fisher information, provided as a Q by Q by P by P array.

  • ICL is a numeric score indicating the goodness of fit of a candidate model with Q clusters; the model with the highest ICL score is preferred.

  • VB is a vector of length t_max giving the value of the variational bound at each M-step iteration.

  • convergence is a TRUE/FALSE value: TRUE indicates that the selected solution is non-degenerate, while FALSE flags solutions with empty clusters or clusters containing a single node.

  • monoblock is a logical value indicating whether the solution includes a single-node cluster (TRUE) or not (FALSE).
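For example, assuming the fitted object for a given Q is a list with elements named as above (the object names fit and fits below are hypothetical), the hard cluster labels and the best candidate model could be recovered as follows:

```r
# Hard labels: for each node (row of tau), pick the cluster with the largest probability
labels <- apply(fit$tau, 1, which.max)          # integer vector of length n

# With one fitted object per candidate Q, keep the model with the highest ICL score
best_q   <- which.max(sapply(fits, function(f) f$ICL))
best_fit <- fits[[best_q]]
```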

Input Function Variables

Het-SBM fits the data over a range of candidate models, i.e. candidate numbers of clusters. The candidate models are specified by the parameters 'qmin' and 'qmax', which indicate, respectively, the smallest and the largest number of clusters to consider. For example, setting qmin=2 and qmax=5 fits a total of 4 models, with 2, 3, 4 and 5 clusters respectively. As noted in the referenced work, the model that maximises the ICL criterion is taken as the optimal solution.

The parameters related to algorithmic iterations are t_max and h_max. Specifically, t_max is the maximum number of iterations of the M-step (default value is 10) and h_max is the maximum number of iterations of the E-step (default value is 10).

The convergence criterion for the update of taus is given by threshold_h (default value is 10^-10), while the convergence criterion for betas and alphas is given by threshold_psi (default value is set to 10^-10).
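Putting these together, a call might look roughly like the sketch below; the function name HetSBM and the argument names are taken from this description, but the exact signature and return structure should be checked against the R source.

```r
# Hypothetical call: fit candidate models with 2 to 5 clusters
fits <- HetSBM(X = X, D = D,
               qmin = 2, qmax = 5,        # candidate numbers of clusters
               t_max = 10, h_max = 10,    # M-step / E-step iteration limits
               threshold_h   = 1e-10,     # convergence criterion for the tau updates
               threshold_psi = 1e-10)     # convergence criterion for alphas and betas
```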

To fit the regression coefficients (betas), the user can choose between Firth-type estimation and classical maximum likelihood estimation.

  • By setting method = "Firth", the user obtains Firth-type estimates for the betas via the function firth_bs().
  • By setting method = "MLE", the user obtains maximum likelihood estimates (MLE) via the function no_firth_bs().

In both cases, the convergence criterion for the betas is given by threshold_lik (default value is 10^-10), the maximum number of steps in the logistic regression fit is given by M_iter (default value is 10), the maximal absolute change in a beta value is given by maxstep (default value is 5), and the step-halving parameter is half_iter (default value is 5).
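As a sketch (argument names as listed above; the exact interface is an assumption, not verified against the source), the two estimation options might be requested like this:

```r
# Firth-type (penalised) estimation of the betas
fit_firth <- HetSBM(X, D, qmin = 2, qmax = 5,
                    method = "Firth",
                    threshold_lik = 1e-10,  # convergence of the logistic fit
                    M_iter = 10,            # max logistic-regression iterations
                    maxstep = 5,            # cap on the absolute change in a beta
                    half_iter = 5)          # step-halving parameter

# Classical maximum likelihood estimation
fit_mle <- HetSBM(X, D, qmin = 2, qmax = 5, method = "MLE")
```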

Initialisation: There are 4 main initialisation strategies, which can be selected using the startType parameter.

  1. When startType = "KMeans", the starting point is based on the k-means algorithm from the 'amap' package. Linked to this option, kmeanType passes the distance measure used for the centres, which can be "euclidean", "maximum", "manhattan", "canberra", "binary", "pearson", "abspearson", "correlation", "abscorrelation", "spearman" or "kendall" (default value is "correlation"), and kmeanMax sets the maximum number of iterations (default is 30).

  2. When startType is set to "StartingPoint", the algorithm expects initialisation values for each candidate model. For example, with qmin=2 and qmax=3, a variable Z_initialisation must be supplied as an argument. Z_initialisation is expected to be a list such that Z_initialisation[[1]] = NULL, Z_initialisation[[2]] is a partition vector with 2 clusters and Z_initialisation[[3]] is a partition vector with 3 clusters (see the sketch after this list).

  3. When startType is set to "Random", the algorithm uses uniform sampling to generate cluster labels for each candidate number of clusters (i.e. from qmin to qmax).

  4. When startType is set to "Hclust", the algorithm uses hierarchical clustering from the stats package, with the distance metric set to "manhattan" and the method set to "ward.D2".

It is worth highlighting that, for the "KMeans" and "Hclust" options, it is also possible to pick a specific subject whose data will be used for initialisation. This is handled by the iSubjStartingPoint parameter. For example, iSubjStartingPoint = 10 indicates that the network of the 10th subject will be used for initialisation.
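The two sketches below illustrate the "StartingPoint" and "KMeans" options; as above, the function name and exact interface are assumptions based on the parameter names in this description.

```r
# "StartingPoint": one partition vector per candidate Q (here qmin = 2, qmax = 3);
# the slot below qmin is left as NULL
Z_initialisation <- list(NULL,
                         sample(1:2, n, replace = TRUE),   # 2-cluster start
                         sample(1:3, n, replace = TRUE))   # 3-cluster start
fit_sp <- HetSBM(X, D, qmin = 2, qmax = 3,
                 startType = "StartingPoint",
                 Z_initialisation = Z_initialisation)

# "KMeans" initialisation computed on the 10th subject's network
fit_km <- HetSBM(X, D, qmin = 2, qmax = 5,
                 startType = "KMeans",
                 kmeanType = "correlation", kmeanMax = 30,
                 iSubjStartingPoint = 10)
```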
