scclust-r's People

Contributors

fsavje

scclust-r's Issues

Error in check_clustering(clustering = my_clustering): size_constraint must be scalar.

Hi

I am just trying to rerun the example, and I get an error:

Error in check_clustering(clustering = my_clustering): size_constraint must be scalar.

Thanks!

library(scclust)
#> Loading required package: distances

my_data <- data.frame(id = 1:100000,
                      type = factor(rbinom(100000, 3, 0.3),
                                    labels = c("A", "B", "C", "D")),
                      x1 = rnorm(100000),
                      x2 = rnorm(100000),
                      x3 = rnorm(100000))

# Construct distance metric
my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2", "x3"))

# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)
check_clustering(my_clustering)
#> Error in check_clustering(clustering = my_clustering): `size_constraint` must be scalar.

Created on 2023-05-22 with reprex v2.0.2
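
The message suggests that check_clustering was called without a size constraint to check against. A possible workaround, assuming the installed version takes the constraint as an argument named size_constraint, is to pass the same value that was given to sc_clustering. The smaller data set below is only there to keep the sketch quick to run.

```r
library(scclust)

set.seed(1)
my_data <- data.frame(id = 1:1000,
                      x1 = rnorm(1000),
                      x2 = rnorm(1000))

my_dist <- distances(my_data,
                     id_variable = "id",
                     dist_variables = c("x1", "x2"))

# Make clustering with at least 3 data points in each cluster
my_clustering <- sc_clustering(my_dist, 3)

# Assumption: the constraint to verify must be supplied explicitly
ok <- check_clustering(my_clustering, size_constraint = 3)
```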

error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode

I'm just trying to install the package after cloning it from github:

R CMD INSTALL .

I tried explicitly setting -std=c99 like this:

# ~/.R/Makevars
CC=gcc -std=c99

Still, I got:

src/digraph_core.c: In function ‘iscc_digraph_is_valid’:
src/digraph_core.c:62:2: error: ‘for’ loop initial declarations are only allowed in C99 or C11 mode
  for (size_t i = 0; i < dg->vertices; ++i) {
  ^
src/digraph_core.c:62:2: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code

The problem is in src/libscclust/Makefile:

%.o: %.c
	$(CC) -c $(ALL_CPPFLAGS) $(ALL_CFLAGS) $(XTRA_FLAGS) $< -o $@

I changed it to be this:

%.o: %.c
	$(CC) -std=c99 -c $(ALL_CPPFLAGS) $(ALL_CFLAGS) $(XTRA_FLAGS) $< -o $@

And then the build worked without errors.

So, the Makefile needs to be modified so it can "see" the configuration variables inside ~/.R/Makevars.

Unfortunately, I can't find any good documentation for how to do this the right way...

The R-exts documentation is difficult to read, but maybe it has the answer.

sc_clustering does not converge to the optimal solution

Dear Fredrik,

I have tried this package with a simple problem of 50 data points. I have solved it with a genetic algorithm for two clusters, of which one has 26 points and the other 24. The global minimum has a sum of distances of about 168.6613. A local minimum has a sum of distances of about 185.4753.

Using sc_clustering, I need to fix the size constraint to 20 (my_clustering <- sc_clustering(my_dist, 20)). The solution I get points to the local minimum, not the global one. Is there something in the configuration of sc_clustering that I have not correctly set up?

If you need it, I am attaching the dataset for your evaluation.

Regards,
Julio

Matriz50.csv

Create distances object

Hello,

I have a similarity matrix df0 such as:
object1  object2  similarity
x1       y1       0.09
x2       y2       0.25

I can create a dist object with
nams <- with(df0, unique(c(as.character(object1), as.character(object2))))
df1 <- with(df0, structure(similarity, Size = length(nams), Labels = nams, Diag = FALSE, Upper = FALSE, method = "user", class = "dist"))

Then, I can use for example kmeans with this dist matrix. However, in order to use sc_clustering from scclust, I need a distances object. Do you know how I can create it, either directly from the similarity matrix, or from the dist object?

Thanks in advance.
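
One possible route, offered as a sketch rather than a confirmed feature of the package: the examples in this tracker only show distances() being built from data points, so a precomputed dist object could first be embedded into coordinates with classical multidimensional scaling (cmdscale from base R). The toy values below are illustrative, not taken from df0, and a similarity would first need converting to a dissimilarity (for example 1 - similarity, which is an assumption about the measure). The embedding is only exact when the dissimilarities are Euclidean.

```r
# Toy pairwise dissimilarities: the four corners of a unit square.
labels <- c("a", "b", "c", "d")
s2 <- sqrt(2)
d_mat <- matrix(c(0,  1,  s2, 1,
                  1,  0,  1,  s2,
                  s2, 1,  0,  1,
                  1,  s2, 1,  0),
                nrow = 4, dimnames = list(labels, labels))
d <- as.dist(d_mat)

# Classical multidimensional scaling: find coordinates whose Euclidean
# distances approximate the dissimilarities in `d`.
coords <- cmdscale(d, k = 2)

# The coordinates are an ordinary numeric matrix, so they can be passed to
# distances() like any other data (skipped if the package is not installed).
if (requireNamespace("distances", quietly = TRUE)) {
  my_dist <- distances::distances(coords)
}
```

The resulting my_dist could then be given to sc_clustering in the usual way.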

Setting number of clusters?

Hi Fredrik!

Is there a way to set the exact number of clusters desired, on top of the cluster minimum size? Maybe in a second agglomerative step?

The problem is I am only interested in a few clusters, typically 3-10. I tried in the example to set size_constraint=50000/10 but it seems sc_clustering is very slow at creating clusters with large minimum size? Try:

example(sc_clustering)
my_clustering <- sc_clustering(my_dist, size_constraint=50000/10)

Thanks!
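
One way to read the "second agglomerative step" idea: merging whole clusters can never shrink a cluster, so a minimum size satisfied by a fine clustering carries over to any coarser grouping built from it. The sketch below uses base R only. The fine labels are a hypothetical stand-in for the output of sc_clustering with a small size_constraint, and average-linkage hclust on cluster centroids is just one possible merge rule, not something scclust is known to provide.

```r
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)

# Hypothetical fine-grained labels: 20 clusters of 10 points each.
fine <- rep(seq_len(20), each = 10)

# One centroid per fine cluster (rows are ordered by cluster label).
centroids <- rowsum(x, fine) / as.vector(table(fine))

# Merge the fine clusters agglomeratively and cut the tree at 5 groups.
tree <- hclust(dist(centroids), method = "average")
group_of_fine <- cutree(tree, k = 5)

# Map every data point to the merged group of its fine cluster.
coarse <- unname(group_of_fine[as.character(fine)])
```

Because every fine cluster is moved as a whole, each of the 5 merged groups contains at least as many points as the smallest fine cluster.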

Weird crash when giving an existing_clustering to hierarchical_clustering

Hello there,

I'm using the latest version of scclust (0.2.2) and distances (0.1.8) to construct many size-constrained clusters.
As I want to get the smallest possible clusters while respecting the minimum size constraint, I first use sc_clustering then hierarchical_clustering which works flawlessly for most of the groups.

However, I noticed that, on certain groups, calling hierarchical_clustering with an existing_clustering argument crashes my R session.

You may find here: issue_data.csv, data that produces a crash on the following code snippet.

library(distances)
library(scclust)

# as.matrix() is called directly because %>% is not available unless magrittr is loaded
X <- as.matrix(read.csv2("issue_data.csv"))

X_distances <- distances(X)
clustering <- sc_clustering(X_distances, size_constraint=10, seed_method="inwards_updating")

# crashes happen with the following line
h_clustering <- hierarchical_clustering(X_distances, size_constraint=10, existing_clustering=clustering)

The problem doesn't seem to be related to memory issues since the snippet above works on way bigger groups.
Moreover, the final line works when no existing_clustering is given.
I tried to run hierarchical_clustering on each cluster produced by clustering independently and met no issue.

Thank you for your support.

Alternative input to distances objects in sc_clustering

Dear Fredrik,
Thanks for the nice package. I noted that the base 'sc_clustering' function requires a 'distances' object. However, I am comparing lands based on environmental data and compute a multivariate ecological distance between each pair of lands. The output looks like this (fake example):
data.frame(land1 = c("A","A","A","B","B","B","C","C","C"), land2 = c("B","C","D","B","C","D","B","C","D"), ecodist = c(0.2,0.3,0.4,0.2,0.1,0.6,0.6,0.5,0.1))
From that I can re-create a distance matrix where ecodist serves as the distance. The format is essentially the same as a distances object and could theoretically be passed to the sc_clustering function, couldn't it?
It would be very useful to have more flexibility to pass alternative distance matrix formats.
Many thanks!
Ervan
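
The reshaping step mentioned above can be sketched in base R. The pair table here is hypothetical (one row per unordered pair, values made up for illustration), and whether scclust can consume the resulting matrix directly is not confirmed here; the sketch only covers turning a long pair table into a symmetric matrix and a dist object.

```r
# Hypothetical pairwise table: one row per unordered pair of lands.
pairs <- data.frame(
  land1   = c("A", "A", "A", "B", "B", "C"),
  land2   = c("B", "C", "D", "C", "D", "D"),
  ecodist = c(0.2, 0.3, 0.4, 0.1, 0.6, 0.5),
  stringsAsFactors = FALSE
)

lands <- sort(unique(c(pairs$land1, pairs$land2)))
d_mat <- matrix(0, nrow = length(lands), ncol = length(lands),
                dimnames = list(lands, lands))

# Fill both triangles so the matrix is symmetric; the diagonal stays at zero.
d_mat[cbind(pairs$land1, pairs$land2)] <- pairs$ecodist
d_mat[cbind(pairs$land2, pairs$land1)] <- pairs$ecodist

eco_dist <- as.dist(d_mat)
```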
