lolei / spmf-py

Python SPMF Wrapper 🐍 🎁

License: GNU General Public License v3.0

spmf python wrapper data-mining pattern-mining frequent-patterns sequential-patterns hacktoberfest

spmf-py

Python Wrapper for SPMF 🐍 🎁

Information

The SPMF [1] Java data mining library, made usable from Python.

Essentially, this module calls the Java command line tool of SPMF, passes the user arguments to it, and parses the output.
In addition, transformation of the data to Pandas DataFrame and CSV is possible.

In theory, all algorithms featured in SPMF are callable. Nothing is hardcoded; look up the desired algorithm and its parameters in the SPMF documentation.
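
As a rough illustration of that mechanism, the sketch below assembles the kind of command line that the SPMF documentation describes (`java -jar spmf.jar run <algorithm> <input> <output> <parameters...>`). The helper name `build_spmf_command` is hypothetical and not part of spmf-py; the wrapper's actual internals may differ.

```python
from pathlib import Path


def build_spmf_command(algorithm, input_file, output_file, arguments=(), jar_dir="."):
    """Assemble an SPMF command line as a list suitable for subprocess.run.

    Hypothetical helper for illustration only; not the spmf-py implementation.
    """
    jar = Path(jar_dir) / "spmf.jar"
    command = ["java", "-jar", str(jar), "run", algorithm, input_file, output_file]
    # SPMF takes algorithm parameters as plain positional strings.
    command.extend(str(arg) for arg in arguments)
    return command


command = build_spmf_command(
    "PrefixSpan", "contextPrefixSpan.txt", "output.txt", [0.7, 5]
)
print(" ".join(command))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.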

Installation

pip install spmf

Usage

Example:

from spmf import Spmf

spmf = Spmf("PrefixSpan", input_filename="contextPrefixSpan.txt",
            output_filename="output.txt", arguments=[0.7, 5])
spmf.run()
print(spmf.to_pandas_dataframe(pickle=True))
spmf.to_csv("output.csv")

Output:

=============  PREFIXSPAN 0.99-2016 - STATISTICS =============
 Total time ~ 2 ms
 Frequent sequences count : 14
 Max memory (mb) : 6.487663269042969
 minsup = 3 sequences.
 Pattern count : 14
===================================================

      pattern sup
0         [1]   4
1      [1, 2]   4
2      [1, 3]   4
3   [1, 3, 2]   3
4   [1, 3, 3]   3
5         [2]   4
6      [2, 3]   3
7         [3]   4
8      [3, 2]   3
9      [3, 3]   3
10        [4]   3
11     [4, 3]   3
12        [5]   3
13        [6]   3

The usage is similar to the one described in the SPMF documentation.
For all Python parameters, see the Spmf class.

SPMF Arguments

The arguments parameter holds the values that are passed to SPMF; they depend on the chosen algorithm. SPMF handles optional parameters as an ordered list. Because the algorithms have no named parameters, if, for example, only the first and the last parameter of an algorithm are to be set, the ones in between must be filled with "" (empty strings).
For advanced usage examples, see examples.
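
To make the positional rule concrete, here is a small sketch that builds such a list from the positions you actually want to set. The helper `positional_arguments` is hypothetical (it is not part of spmf-py), and which position means what depends entirely on the algorithm's entry in the SPMF documentation.

```python
def positional_arguments(values_by_position, total):
    """Return a list of length `total`, with "" at every unspecified position."""
    return [values_by_position.get(i, "") for i in range(total)]


# Set only the first and third parameter; the second is left blank.
arguments = positional_arguments({0: 1, 2: True}, total=3)
print(arguments)  # [1, '', True]
```

The resulting list can then be passed as `arguments=...` to the `Spmf` constructor.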

SPMF Executable

Download it from the SPMF Website.
It is assumed that the SPMF binary spmf.jar is located in the same directory as spmf-py. If it is not, either symlink it, or use the spmf_bin_location_dir parameter.

Input Formats

Either use an input file as specified by SPMF, or use one of the in-line formats as seen in examples.
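
For orientation, the sketch below converts a nested Python list into the sequence-database text format described in the SPMF documentation, where items within an itemset are separated by spaces, `-1` closes an itemset, and `-2` closes a sequence. The function name `sequences_to_spmf` is made up for this example; spmf-py's own conversion of in-line input may differ in detail.

```python
def sequences_to_spmf(sequences):
    """Encode a list of sequences (each a list of itemsets) as SPMF text."""
    lines = []
    for sequence in sequences:
        parts = []
        for itemset in sequence:
            parts.extend(str(item) for item in itemset)
            parts.append("-1")  # end of itemset
        parts.append("-2")  # end of sequence
        lines.append(" ".join(parts))
    return "\n".join(lines)


text = sequences_to_spmf([[[1], [1, 2, 3]], [[4], [3]]])
print(text)
```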

Memory

The maximum memory can be increased in the constructor via Spmf(memory=n), where n is the amount in megabytes; see SPMF's FAQ.
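
On the Java side, a memory limit of this kind is normally expressed with the JVM's `-Xmx` option. The sketch below shows how a megabyte value could be turned into that flag; that spmf-py maps `memory` to `-Xmx` in exactly this way is an assumption, and `java_prefix` is a hypothetical helper.

```python
def java_prefix(memory_mb=None):
    """Return the leading part of a java command, optionally with a heap limit."""
    prefix = ["java"]
    if memory_mb is not None:
        # -Xmx sets the maximum JVM heap size; the "m" suffix means megabytes.
        prefix.append(f"-Xmx{int(memory_mb)}m")
    prefix += ["-jar", "spmf.jar"]
    return prefix


print(java_prefix(2048))
```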

Background

Why? If you're in a Python pipeline, like a Jupyter Notebook, it might be cumbersome to use Java as an intermediate step. Using spmf-py you can stay in your pipeline as though Java is never used at all.

Bibliography

[1] Fournier-Viger, P., Lin, C.W., Gomariz, A., Gueniche, T., Soltani, A., Deng, Z., Lam, H. T. (2016).  
The SPMF Open-Source Data Mining Library Version 2.  
Proc. 19th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2016) Part III, Springer LNCS 9853,  pp. 36-40.

Disclaimer

Use at your own risk. This repo is not/barely maintained. Use SPMF itself for more robust results.

This module has been tested with only a fraction of the algorithms offered in SPMF. Calling an algorithm and writing to the output file should be possible for all of them. Output parsing, however, is only expected to work for algorithms whose output resembles that of the sequential pattern mining algorithms. It has not been tested with other output types, so some adaptation of the output parsing might be necessary.

If something is not working, submit an issue or create a PR yourself!

spmf-py's Issues

Windows PermissionError in temp file

Hello,

I appear to be encountering an issue when running algorithms on a Windows system with a direct input. I've tested this with a number of differently sized inputs, all nested lists.

Example input:
test = [ [[127], [128], [129], [130]], [[178], [179], [180], [181], [182], [183], [184], [185]], [[251], [252], [253], [254], [255], [256], [257]] ]

Example call:
spmf = Spmf('GPS', input_direct=test, arguments=[0.5])
spmf.run()
(I've also tried the 'PrefixSpan' algorithm)

Error text (with my actual username replaced):

  File "C:\Python39\lib\site-packages\spmf\__init__.py", line 46, in __init__
    self.input_ = self.handle_input(
  File "C:\Python39\lib\site-packages\spmf\__init__.py", line 73, in handle_input
    return self.write_temp_input_file(seq_spmf, ".txt")
  File "C:\Python39\lib\site-packages\spmf\__init__.py", line 87, in write_temp_input_file
    os.rename(name, name + file_ending)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\username\AppData\Local\Temp\tmp1qxu06i7' -> 'C:\Users\username\AppData\Local\Temp\tmp1qxu06i7.txt'

This appears to be caused by renaming the file to change its extension before the file stream has been properly closed?
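
If that diagnosis is right, one possible workaround is to create the temporary file with the desired suffix from the start and close it before anything else touches it, so no rename is needed. The following is a standalone sketch under that assumption, not the library's code; it only reuses the method name from the traceback for readability.

```python
import os
import tempfile


def write_temp_input_file(text, file_ending=".txt"):
    """Write `text` to a new temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so another process (such as the JVM) can open it afterwards.
    with tempfile.NamedTemporaryFile(
        "w", suffix=file_ending, delete=False
    ) as handle:
        handle.write(text)
    # The handle is closed here, which releases the lock Windows holds on it.
    return handle.name


path = write_temp_input_file("1 -1 2 -1 -2\n")
print(path)
os.remove(path)  # the caller is responsible for cleanup
```

On Windows an open file generally cannot be renamed or reopened by another process, which is why closing the handle first matters there and not on most Unix systems.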

ValueError: invalid literal for int() with base 10: '#SUP:'

Thank you for writing this wrapper.
I have an issue when using the to_pandas_dataframe() method with 'Apriori_with_hash_tree'.
The following error appears: ValueError: invalid literal for int() with base 10: '#SUP:'

from spmf import Spmf
spmf = Spmf("Apriori_with_hash_tree",
            input_filename="contextPasquier99_name.txt",
            output_filename="output.txt",
            arguments=[0.40, 30, 2])

spmf.run()
print(spmf.to_pandas_dataframe())
spmf.to_csv("output.csv")

contextPasquier99_name.txt

Regards.
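
The error message suggests that the parser tried to convert the literal token `#SUP:` to an integer, which would happen if it expects a different line layout than the one itemset algorithms produce (for example `1 3 #SUP: 2`). A minimal sketch of parsing such a line by hand is shown below; `parse_itemset_line` is a hypothetical helper, and lines that carry further fields after the support value would need extra handling.

```python
def parse_itemset_line(line):
    """Split a line like '1 3 #SUP: 2' into (items, support)."""
    pattern_part, _, support_part = line.partition("#SUP:")
    items = pattern_part.split()  # kept as strings, since items may be names
    support = int(support_part.strip())
    return items, support


print(parse_itemset_line("1 3 #SUP: 2"))  # (['1', '3'], 2)
```

Applied to every line of the output file, this yields rows that can be handed to pandas directly.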

FileNotFoundError at subprocess

Hello Sir,
Thank you very much for writing this wrapper.
I have an issue running the program, I would appreciate it if you helped me fix it.
I face a problem at the run function. Here is the log:

  File "C:\ProgramData\Miniconda3\envs\vaenv\lib\site-packages\spmf\__init__.py", line 102, in run
    proc = subprocess.check_output(subprocess_arguments)
  File "C:\ProgramData\Miniconda3\envs\vaenv\lib\subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "C:\ProgramData\Miniconda3\envs\vaenv\lib\subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\ProgramData\Miniconda3\envs\vaenv\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Miniconda3\envs\vaenv\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

Here is an example of how I run the program from example.py (spmf.jar and the input file are both in the same directory as example.py):

import pathlib

from spmf import Spmf

spmf_jar_dir = pathlib.Path(__file__).parent.absolute()

spmf = Spmf("PrefixSpan", input_filename="contextPrefixSpan.txt",
            spmf_bin_location_dir=spmf_jar_dir,
            output_filename="output.txt", arguments=[1, "", True])
spmf.run()

Here is the state of the variables at line 102 of __init__.py:

I've also tried with the absolute path of the input file, but got the same error.
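
When `subprocess` raises `FileNotFoundError` with WinError 2, a common cause is that the program being launched cannot be found, not the input file. Assuming the wrapper launches a command named `java`, the sketch below checks both that the executable is on PATH and that spmf.jar is where it is expected; `find_launch_problems` is a hypothetical diagnostic helper, not part of spmf-py.

```python
import shutil
from pathlib import Path


def find_launch_problems(jar_dir, java_command="java"):
    """Return a list of human-readable problems that would prevent a launch."""
    problems = []
    # shutil.which returns None when the command is not found on PATH.
    if shutil.which(java_command) is None:
        problems.append(f"'{java_command}' was not found on PATH")
    if not (Path(jar_dir) / "spmf.jar").is_file():
        problems.append(f"spmf.jar was not found in {jar_dir}")
    return problems


for problem in find_launch_problems(Path.cwd()):
    print(problem)
```

An empty result means both prerequisites are in place and the cause lies elsewhere.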

permission error spmf

When I ran into this problem, I could not find a suitable solution. Could you please take a look? If you can solve it, I would be very grateful. Thank you.

How to handle datasets with timestamps for algorithms that involve time constraints?

I'm trying to run the HirateYamana algorithm with time constraints. In which format should I encode the dataset to include the timestamp value for each subsequence?

e.g.

dataset = [
    # sequence: list of events
    [(1, ['a']), (2, ['a', 'b', 'c']), (3, ['a', 'c']), (4, ['c'])],  # event: (timestamp, [list of items])
    [(1, ['a']), (2, ['c']), (3, ['b', 'c'])],
    [(1, ['a', 'b']), (2, ['d']), (3, ['c']), (4, ['b']), (5, ['c'])],
    [(1, ['a']), (2, ['c']), (3, ['b']), (4, ['c'])]
]
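
One option may be to write the data to a file yourself and pass it via input_filename. The sketch below assumes the time-extended sequence format shown in the SPMF documentation for this algorithm, in which each itemset is prefixed with its timestamp in angle brackets (for example `<0> 1 -1 <1> 1 2 3 -1 -2`), and assumes that items must be positive integers. The function `to_time_extended_spmf` and the letter-to-integer mapping are illustrative only; please check the format against the SPMF documentation before relying on it.

```python
def to_time_extended_spmf(dataset):
    """Encode [(timestamp, [items]), ...] sequences as SPMF-style text lines."""
    # Assumption: items must be positive integers, so give each item an id.
    distinct = sorted({item for seq in dataset for _, items in seq for item in items})
    item_ids = {item: i for i, item in enumerate(distinct, start=1)}
    lines = []
    for seq in dataset:
        parts = []
        for timestamp, items in seq:
            parts.append(f"<{timestamp}>")
            parts.extend(str(item_ids[item]) for item in sorted(items))
            parts.append("-1")  # end of itemset
        parts.append("-2")  # end of sequence
        lines.append(" ".join(parts))
    return "\n".join(lines), item_ids


text, item_ids = to_time_extended_spmf([
    [(1, ["a"]), (2, ["a", "b", "c"])],
    [(1, ["a", "b"]), (2, ["d"])],
])
print(text)
```

The returned `item_ids` mapping lets you translate the integer items in the mined patterns back to the original labels.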

Gives different output when tested on other algorithm

Thank you very much for this repo. It is very useful and helpful.
However, when I use it with the ToPKClass rule algorithm, the output I get for any value of K (the bound on the number of rules to generate) is 2. When I run the jar file directly with the same input parameters, I get the output I expect. Is there a way to fix this?

[help] [frequent itemset mining] Understanding output with negative value

Looking for clarity on the output of FP Growth Algorithm.
I am doing frequent itemset mining, and I often see negative values in the output itemsets even though my data set doesn't contain any negative values.
Curious as to how to interpret this negative value.

Below is an example:

from spmf import Spmf
input_example_list = [
    "1, 3, 4",
    "2, 3, 5",
    "1, 2, 3, 5",
    "2, 5",
    "1, 2, 4, 5"
]

spmf = Spmf("FPGrowth_itemsets",
            input_direct=input_example_list,
            input_type="text",
            output_filename="C:\\spaces\\igt_eye\\trials\\itemset\\output.txt",
            arguments=[0.4, 3, 3],
            spmf_bin_location_dir="\\site-packages\\spmf\\")
spmf.run()
print(spmf.parse_output())

This produces the following output:

=============  FP-GROWTH 2.42 - STATS =============
 Transactions count from database : 5
 Max memory usage: 8.0 mb 
 Frequent itemsets count : 9
 Total time ~ 4 ms
===================================================
Post-processing to show result in terms of string values.
Post-processing completed.

[
['-2 1 4 #SUP: 2'], 
['-2 3 5 #SUP: 2'], 
['3 2 5 #SUP: 2'], 
['-2 3 2 #SUP: 2'], 
['-2 1 3 #SUP: 2'], 
['-2 1 2 #SUP: 2'], 
['-2 1 5 #SUP: 2'], 
['1 2 5 #SUP: 2'], 
['-2 2 5 #SUP: 4']
]

In the above output, I am not sure how to interpret this negative value (-2) in the itemset.
Any pointers/hints from the community?