Customizable modular pipeline for testing an improved version of CM for generating well-connected clusters. Image below from arXiv preprint: Park et al. (2023). https://github.com/illinois-or-research-analytics/cm_pipeline/tree/main
The CM Pipeline is a modular pipeline for community detection that contains the following modules:
- Cluster Statistics: Compute statistics such as node and edge count,
- MacOS or Linux operating system
- python3.9 or higher
- cmake 3.2.0 or higher
- gcc of any version (in our analysis, gcc 9.2.0 was used)
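The requirements above can be checked before installing. The following is a minimal sketch and is not part of the repository: the helper names `version_tuple` and `check_requirements` are hypothetical, and it assumes that `cmake --version` prints a dotted version number such as `3.22.1` somewhere in its output.

```python
import re
import shutil
import subprocess
import sys


def version_tuple(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def check_requirements():
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("python 3.9 or higher is required")
    if shutil.which("gcc") is None:
        problems.append("gcc was not found on PATH")
    cmake = shutil.which("cmake")
    if cmake is None:
        problems.append("cmake was not found on PATH")
    else:
        # Assumption: the version number appears in the --version output.
        out = subprocess.run(
            [cmake, "--version"], capture_output=True, text=True
        ).stdout
        found = version_tuple(out)
        if found is None or found < (3, 2, 0):
            problems.append("cmake 3.2.0 or higher is required")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("missing:", problem)
```

An empty result only means the tools were found; it does not guarantee that the build itself will succeed.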
There are several strategies for installation
- Clone the cm_pipeline repository
- Activate the venv which has the necessary packages
- Run pip install -r requirements.txt && pip install .
- Make sure everything installed properly by running cd tests && pytest
Simply run pip install git+https://github.com/illinois-or-research-analytics/cm_pipeline. This will install CM++, but to use pipeline functionality, please set up via cloning.
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c leiden -g 0.5 --threshold 1log10 --nprocs 4 --quiet
  - Runs CM++ on a Leiden clustering with resolution 0.5, using connectivity threshold $\log_{10}(n)$ (every cluster with connectivity over the base-10 log of the number of nodes is considered "well-connected").
- python3 -m hm01.cm -i network.tsv -e clustering.tsv -o output.tsv -c ikc -k 10 --threshold 1log10 --nprocs 4 --quiet
  - Similar idea, but with IKC using hyperparameter $k=10$.
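To make the `1log10` threshold concrete, here is a small sketch of the rule described above. It is not CM++ code: the function names are hypothetical, `connectivity` stands for whatever connectivity value CM++ computes for a cluster, and `n` is the node count used in $\log_{10}(n)$.

```python
import math


def log10_threshold(n):
    """Connectivity threshold log10(n) for a node count n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log10(n)


def is_well_connected(connectivity, n):
    """True when connectivity is strictly over log10(n), as described above."""
    return connectivity > log10_threshold(n)
```

For example, with `n = 1000` the threshold is 3, so under this sketch a cluster with connectivity 4 would count as well-connected and one with connectivity 2 would not.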
- Suppose you have a pipeline like the one here. Call it pipeline.json
- Then from the root of this repository run: python -m main pipeline.json
For usage instructions on CM++, see the following documentation.
- The input to the pipeline script is a pipeline.json file. NOTE that you can use any other json file as input as long as it fits the requirements in the documentation.
- A description of the supported key-value pairs in the config file can be found in pipeline_template.json
- Edit the fields of the pipeline.json file to reflect your inputs and requirements.
- Run python -m main pipeline.json
- Please refer to the json format documentation on how to write the pipeline.json file.
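Because the pipeline takes a JSON file as input, a malformed file is an easy mistake to catch early. The sketch below is an assumption and not part of the repository: it only checks that the file parses as JSON and then builds the `python -m main` command shown above. It does not validate the keys required by the json format documentation, and the function names are hypothetical.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_pipeline_command(config_path):
    """Check that config_path parses as JSON and return the command to run."""
    path = Path(config_path).resolve()
    with path.open() as handle:
        json.load(handle)  # raises json.JSONDecodeError on malformed input
    return [sys.executable, "-m", "main", str(path)]


def run_pipeline(config_path, repo_root="."):
    """Run the pipeline; repo_root is assumed to be the repository root."""
    command = build_pipeline_command(config_path)
    return subprocess.run(command, cwd=repo_root, check=True)
```

The config path is resolved to an absolute path so that it still points at the same file when the command is launched from the repository root.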
To quickly set up a developer environment for the CM++ Pipeline, simply run the following commands. (NOTE: Make sure you have Conda installed)
conda env create -f environment.yml
conda activate
- The CM++ Pipeline also allows users to add their own pipeline stages and clustering methods.
- Please refer to the customization documentation on how to modify the code to allow for your own pipeline stages and clustering methods.
- The commands executed during the workflow are captured in {output_dir}/{run_name}-{timestamp}/commands.sh. This is the shell script generated by the pipeline that is run to generate outputs.
- The output files generated during the workflow are stored in the folder {output_dir}/{run_name}-{timestamp}/
- The descriptive analysis files can be found in the folder {output_dir}/{run_name}-{timestamp}/analysis with the *.csv file for each of the resolution values.
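The output layout above can be navigated programmatically. The following sketch relies only on the directory pattern stated above. The function names are hypothetical, and picking the most recently modified run directory is an assumption made for illustration, since the timestamp format is not specified here.

```python
from pathlib import Path


def latest_run_dir(output_dir, run_name):
    """Return the most recently modified {run_name}-* directory, or None."""
    candidates = [
        p for p in Path(output_dir).glob(f"{run_name}-*") if p.is_dir()
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def analysis_csvs(run_dir):
    """List the *.csv files in the run's analysis folder, sorted by name."""
    return sorted((Path(run_dir) / "analysis").glob("*.csv"))
```

If no matching run directory exists, `latest_run_dir` returns `None`, so callers should check the result before passing it to `analysis_csvs`.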
@misc{cm_pipe2023,
author = {Vikram Ramavarapu and Vidya Kamath and Minhyuk Park and Fabio Ayres and George Chacko},
title = {Connectivity Modifier Pipeline},
howpublished = {\url{https://github.com/illinois-or-research-analytics/cm_pipeline}},
year={2023},
doi={10.5281/zenodo.10076514}
}
@misc{park2023wellconnected,
title={Well-Connected Communities in Real-World and Synthetic Networks},
author={Minhyuk Park and Yasamin Tabatabaee and Vikram Ramavarapu and Baqiao Liu and Vidya Kamath Pailodi and Rajiv Ramachandran and Dmitriy Korobskiy and Fabio Ayres and George Chacko and Tandy Warnow},
year={2023},
eprint={2303.02813},
archivePrefix={arXiv},
primaryClass={cs.SI}
}