fsavje / scclust-r

Size-constrained Clustering in R

License: GNU General Public License v3.0

R 45.30% C 54.41% Makefile 0.11% Shell 0.18%

scclust-r's Issues

Make more informative error message for input checking

Make replica for new scclust API

Error in check_clustering(clustering = my_clustering): size_constraint must be scalar.

I am just trying to rerun the example, and I get an error:

Error in check_clustering(clustering = my_clustering): size_constraint must be scalar.

Thanks!

library(scclust)
#> Loading required package: distances

my_data <- data.frame(id = 1:100000,
                      type = factor(rbinom(100000, 3, 0.3),
                                    labels = c("A", "B", "C", "D")),
                      x1 = rnorm(100000),
                      x2 = rnorm(100000),
                      x3 = rnorm(100000))

# Construct distance metric
my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2", "x3"))

# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)
check_clustering(my_clustering)
#> Error in check_clustering(clustering = my_clustering): `size_constraint` must be scalar.

Created on 2023-05-22 with reprex v2.0.2
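One way to sidestep the error while still verifying the result is to check the minimum cluster size by hand. The sketch below uses base R only and works on a plain vector of cluster labels; `min_size_ok` is a hypothetical helper, not part of scclust, and the toy labels are made up. It assumes the labels of a clustering can be extracted as an ordinary vector (for example with `as.integer(my_clustering)`), which is not confirmed by the issue. The package documentation also suggests that `check_clustering` takes the size constraint as its second argument, so passing it explicitly may be worth trying.

```r
# Hypothetical helper (not part of scclust): TRUE if every cluster
# contains at least `size_constraint` data points.
min_size_ok <- function(labels, size_constraint) {
  stopifnot(length(size_constraint) == 1, size_constraint >= 1)
  sizes <- table(labels)  # NA labels (unassigned points) are not counted
  all(sizes >= size_constraint)
}

# Toy labels, not scclust output: cluster 3 has only two members.
labels <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)

min_size_ok(labels, 3)
min_size_ok(labels, 2)
```

With the toy labels, the first call fails the constraint because of the two-member cluster, and the second call passes.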

error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode

I'm just trying to install the package after cloning it from GitHub:

R CMD INSTALL .

I tried explicitly setting -std=c99 like this:

# ~/.R/Makevars
CC=gcc -std=c99

Still, I got:

src/digraph_core.c: In function ‘iscc_digraph_is_valid’:
src/digraph_core.c:62:2: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
  for (size_t i = 0; i < dg->vertices; ++i) {
  ^
src/digraph_core.c:62:2: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code

The problem is in src/libscclust/Makefile:

scclust-R/src/libscclust/Makefile

Lines 33 to 34 in 796f459

%.o: %.c
	$(CC) -c $(ALL_CPPFLAGS) $(ALL_CFLAGS) $(XTRA_FLAGS) $< -o $@

I changed it to be this:

%.o: %.c
	$(CC) -std=c99 -c $(ALL_CPPFLAGS) $(ALL_CFLAGS) $(XTRA_FLAGS) $< -o $@

And then the build worked without errors.

So, the Makefile needs to be modified so it can "see" the configuration variables inside ~/.R/Makevars.

Unfortunately, I can't find any good documentation for how to do this the right way...

The R-exts documentation is difficult to read, but maybe it has the answer.

Report seeds as attribute

sc_clustering does not converge to the optimal solution

Dear Fredrik,

I have tried this package with a simple problem of 50 data points. I have solved it with a genetic algorithm for two clusters, of which one has 26 points and the other 24. The global minimum has a sum of distances of about 168.6613. A local minimum has a sum of distances of about 185.4753.

Using sc_clustering, I need to fix the size constraint to 20 (my_clustering <- sc_clustering(my_dist, 20)). The solution I get points to the local minimum, not the global one. Is there something in the configuration of sc_clustering that I have not correctly set up?

If you need it, I am attaching the dataset for your evaluation.

Regards,
Julio

Matriz50.csv

Create distances object

Hello,

I have a similarity matrix df0 such as
object1  object2  similarity
x1       y1       0.09
x2       y2       0.25

I can create a dist object with
nams <- with(df0, unique(c(as.character(object1), as.character(object2))))
df1 <- with(df0, structure(similarity, Size = length(nams), Labels = nams, Diag = FALSE, Upper = FALSE, method = "user", class = "dist"))

Then, I can use for example kmeans with this dist matrix. However, in order to use sc_clustering from scclust, I need a distances object. Do you know how I can create it, either directly from the similarity matrix, or from the dist object?

Thanks in advance.
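A possible workaround, offered as a sketch rather than a confirmed scclust feature: embed the dist object into Euclidean coordinates with classical multidimensional scaling (`cmdscale` from base R's stats package), then build the distances object from those coordinates. The toy points below are made up so that the embedding is exact; for a real dissimilarity matrix the embedding is generally only an approximation, and a similarity would first have to be turned into a dissimilarity (for instance `1 - similarity`, which is an assumption about the data). The final `distances::distances()` call is commented out because it needs the distances package.

```r
# Toy data: four points in the plane, so the dist object is exactly Euclidean.
pts <- matrix(c(0, 0,
                3, 0,
                0, 4,
                3, 4),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d"), NULL))
d <- dist(pts)

# Classical MDS: find coordinates whose Euclidean distances reproduce d.
coords <- cmdscale(d, k = 2)

# Largest deviation between the original and the reconstructed distances.
max_error <- max(abs(as.vector(dist(coords)) - as.vector(d)))
max_error

# Hypothetical next step (requires the distances package):
# my_dist <- distances::distances(coords)
```

Checking `max_error` on real data shows how much information the embedding loses; increasing `k` usually reduces it.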

Setting number of clusters?

Hi Fredrik!

Is there a way to set the exact number of clusters desired, on top of the cluster minimum size? Maybe in a second agglomerative step?

The problem is I am only interested in a few clusters, typically 3-10. I tried in the example to set size_constraint=50000/10 but it seems sc_clustering is very slow at creating clusters with large minimum size? Try:

example(sc_clustering)
my_clustering <- sc_clustering(my_dist, size_constraint=50000/10)

Thanks!
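The reasoning behind `size_constraint = n / k` can be made explicit with a little arithmetic: a minimum cluster size only caps the number of clusters from above, it does not fix it. The sketch below is plain R; `max_clusters` is a hypothetical helper, and the 50000 data points are assumed from the `50000/10` expression in the question.

```r
# Hypothetical helper: with n_points data points and a minimum cluster
# size, there can be at most floor(n_points / size_constraint) clusters.
max_clusters <- function(n_points, size_constraint) {
  n_points %/% size_constraint
}

n_points <- 50000   # assumed from the 50000/10 expression in the question
target_k <- 10

size_constraint <- n_points %/% target_k
size_constraint

max_clusters(n_points, size_constraint)
max_clusters(n_points, 4000)
```

A clustering algorithm is free to return fewer clusters than this upper bound, which is why a minimum size alone cannot guarantee an exact cluster count.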

Split dist search function picker, and add search options

Weird crash when giving an existing_clustering to hierarchical_clustering

Hello there,

I'm using the latest version of scclust (0.2.2) and distances (0.1.8) to construct many size-constrained clusters.
As I want to get the smallest possible clusters while respecting the minimum size constraint, I first use sc_clustering then hierarchical_clustering which works flawlessly for most of the groups.

However, I noticed that, on certain groups, calling hierarchical_clustering with an existing_clustering argument crashes my R session.

The attached issue_data.csv contains data that produces a crash with the following code snippet.

library(distances)
library(scclust)

# as.matrix() is called directly because %>% requires magrittr, which is not loaded here
X <- as.matrix(read.csv2("issue_data.csv"))

X_distances <- distances(X)
clustering <- sc_clustering(X_distances, size_constraint=10, seed_method="inwards_updating")

# crashes happen with the following line
h_clustering <- hierarchical_clustering(X_distances, size_constraint=10, existing_clustering=clustering)

The problem doesn't seem to be related to memory issues since the snippet above works on way bigger groups.
Moreover, the final line works when no existing_clustering is given.
I tried to run hierarchical_clustering on each cluster produced by clustering independently and met no issue.

Thank you for your support.

Possible to set a maximum limit on cluster size?

I can't seem to see this implemented here or in any R pkg for clustering. Do you think it's possible?

Alternative input to distances objects in sc_clustering

Dear Fredrik,
Thanks for the nice package. I noticed that the base 'sc_clustering' function requires a 'distances' object. However, I am comparing lands based on environmental data and computing a multivariate ecological distance between each pair of lands. The output looks like this (fake example):
data.frame(land1=c("A","A","A","B","B","B","C","C","C"),land2=c("B","C","D","B","C","D","B","C","D"),ecodist=c(0.2,0.3,0.4,0.2,0.1,0.6,0.6,0.5,0.1))
From that I can re-create a distance matrix where ecodist serves as the distance. The format is essentially the same as a distances object, and could theoretically be passed into the sc_clustering function, couldn't it?
It would be very useful to have more flexibility to pass alternative distance matrix formats.
Many thanks!
Ervan
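For reference, the step of re-creating a distance matrix from pairwise rows can be sketched in base R as follows. The data frame is a made-up toy in which each unordered pair of lands appears exactly once (unlike the fake example above, which repeats some pairs), and the names `pairs`, `m` and `eco_dist` are illustrative only.

```r
# Toy long-format data: one row per unordered pair of lands.
pairs <- data.frame(
  land1   = c("A", "A", "A", "B", "B", "C"),
  land2   = c("B", "C", "D", "C", "D", "D"),
  ecodist = c(0.2, 0.3, 0.4, 0.1, 0.6, 0.5),
  stringsAsFactors = FALSE
)

lands <- sort(unique(c(pairs$land1, pairs$land2)))
m <- matrix(0, nrow = length(lands), ncol = length(lands),
            dimnames = list(lands, lands))

# Fill both triangles so the matrix is symmetric; the diagonal stays 0.
m[cbind(pairs$land1, pairs$land2)] <- pairs$ecodist
m[cbind(pairs$land2, pairs$land1)] <- pairs$ecodist

# Standard dist object, usable with base R functions such as hclust().
eco_dist <- as.dist(m)
eco_dist
```

Whether such a dist object can be handed to sc_clustering directly is exactly the open question of this issue; the sketch only covers the conversion from the pairwise table.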
