scalemine's Introduction

ScaleMine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph

Overview:

ScaleMine is a parallel frequent subgraph mining system for a single large graph. It introduces a novel two-phase approach. The first phase is approximate: it quickly identifies subgraphs that are frequent with high probability while collecting various statistics. The second phase computes the exact solution, using the results of the approximation to achieve good load balance, prune the search space, generate efficient execution plans, and guide intra-task parallelism.

If you use ScaleMine in your research, please cite our paper:

@inproceedings{hamid2016scalemine,
 title={ScaleMine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph},
 author={Ehab Abdelhamid and Ibrahim Abdelaziz and Panos Kalnis and Zuhair Khayyat and Fuad Jamour},
 booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
 pages={11},
 year={2016},
 organization={ACM}
}

Contents:

README ...................  This file
LICENSE.txt ..............  License file (Open Source)
makefile .................  build ScaleMine binary files
Datasets/ ................  Example graphs
SRC_*/ ...................  Directory containing ScaleMine source files

Dependencies:

There are a few dependencies which must be satisfied in order to compile and run ScaleMine.

  • build-essential and g++ (>= 4.4.7) [Required]

    • Needed for compiling ScaleMine.
  • openssh-server [Required]

    • Required to initialize MPI and establish connections among compute nodes.
  • MPICH2 [Required]

    • ScaleMine uses MPI for inter-node communication. Open MPI is not tested with ScaleMine.

Installation:

  • Install the MPI and Boost libraries on the target machine
  • Uncompress ScaleMine using any compression tool
  • Build ScaleMine using the "compile.sh" script file

Running:

Single Machine Mode:

Run ScaleMine using the following command:

mpirun -n N pfsm -file GRAPH_FILE -freq F -threads T

N: the number of MPI compute nodes (processes). Make sure there is at least one for the master and one for a worker. Best practice is to run one compute node on each separate machine and then set the number of parallel threads for each machine.

GRAPH_FILE: the input graph file name; the supported graph format is .lg

F: the user-given support threshold

T: the number of threads per compute node

Example:

Use the following command:

mpirun -n 2 pfsm -file ./Datasets/patent_citations.lg -freq 28000 -threads 4

This mines the patent_citations graph for subgraphs with support greater than or equal to 28000 using 2 compute nodes (1 master and 1 worker); the worker runs 4 threads.

Distributed Mode:

To run on a supercomputer that uses the SLURM job scheduler, submit a batch script such as:

#!/bin/bash
#SBATCH --account=user-account
#SBATCH --job-name=pfsm
#SBATCH --output=/output/mining.out
#SBATCH --error=/output/mining.err
#SBATCH --time=01:00:00
#SBATCH --threads-per-core=1
#SBATCH --nodes=256
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --cpus-per-task=32
export OMP_NUM_THREADS=32
export MKL_NUM_THREADS=32
export MPICH_MAX_THREAD_SAFETY=multiple
srun --ntasks=256 pfsm -file /Datasets/patent_citations.lg -freq 28000 -threads 32

Output:

ScaleMine prints the list of frequent subgraphs to standard output, followed by the elapsed time at the end.

License:

ScaleMine is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version.

ScaleMine is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You should have received a copy of the GNU Lesser General Public License along with ScaleMine. If not, see http://www.gnu.org/licenses/.

scalemine's Issues

MPI_Finalize() does not seem to work when running the example dataset.

I tried to run the example dataset:

mpirun -n 72 pfsm -file ./Datasets/lg/patent_citations.lg -freq 28000 -threads 4

but in the end, the output stops at

Graph loading [master] took 68 sec and 701 ms
Mining took 101 sec and 155 ms

I looked at the code; next it should execute the following:

		cout<<"Graph loading [master] took "<<(Settings::graphLoadingTime/1000)<<" sec and "<<(Settings::graphLoadingTime%1000)<<" ms"<<endl;

		cout<<"Mining took "<<(elapsed/1000)<<" sec and "<<(elapsed%1000)<<" ms"<<endl;

		cout<<"Finished!";

		// MPI::Finalize();
		MPI_Finalize();
		delete miner;
		delete exactMiner;

I am confused as to why it stops at the point where it should print "Finished!".

pfsm file seems to be missing

Below is the error encountered while running the mpirun command specified in the README:

[proxy:0:0@vmName] HYDU_create_process (utils/launch/launch.c:74): execvp error on file pfsm (No such file or directory)
[proxy:0:0@vmName] HYDU_create_process (utils/launch/launch.c:74): execvp error on file pfsm (No such file or directory)

Also, the following error is encountered while compiling:

Settings.h:38:22: error: 'constexpr' needed for in-class initialization of static data member 'const double Settings::errorMargin' of non-integral type [-fpermissive]
  static const double errorMargin = 0;
                      ^~~~~~~~~~~

Any help regarding the same is appreciated.

Time runs out on my dataset.

I am trying to run ScaleMine on my dataset, which is much smaller than the example dataset.
My dataset contains 24577 nodes and 88060 edges.
This is my command:

mpirun -quiet -n 72 pfsm -file ./Datasets/lg/triad.lg -freq 2000 -threads 4 > ./Datasets/output/triad.out

but it stops after printing

Start mining approximate ...

I am confused as to why it stops. Do I need to set a higher frequency threshold?
