chuanconggao / prefixspan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, the closed sequential pattern mining algorithm BIDE, and the generator sequential pattern mining algorithm FEAT.

Home Page: https://git.io/prefixspan

License: MIT License

Topics: bide, data-mining, feat, pattern-mining, prefixspan

prefixspan-py's Introduction


Featured on ImportPython Issue 173. Thank you so much for the support!

The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm PrefixSpan, the famous frequent closed sequential pattern mining algorithm BIDE (in closed.py), and the frequent generator sequential pattern mining algorithm FEAT (in generator.py), as a unified and holistic algorithm framework.

  • BIDE is usually much faster than PrefixSpan on large datasets, as it returns only a small subset of closed patterns that carries the same information as the full set of patterns.

  • FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.

To keep the code simple, some general-purpose functions have been moved into a separate library, extratools.

Reference

Research Papers

  • PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
    Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
    Proceedings of the 17th International Conference on Data Engineering, 2001.

  • BIDE: Efficient Mining of Frequent Closed Sequences.
    Jianyong Wang, Jiawei Han.
    Proceedings of the 20th International Conference on Data Engineering, 2004.

  • Efficient Mining of Frequent Sequence Generators.
    Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
    Proceedings of the 17th International Conference on World Wide Web, 2008.

Alternative Implementations

I created this project from the original minimal 15-line implementation of PrefixSpan, for educational purposes. However, as the project has grown into a full-featured library, its code size has inevitably grown as well. I have revised and re-uploaded the original implementation as a GitHub Gist here for reference.

You can also try my Scala version of PrefixSpan.

Features

Outputs traditional single-item sequential patterns, where gaps are allowed between items.

  • Mining top-k patterns is supported, with dedicated efficiency optimizations.

  • You can limit the length of mined patterns. Note that setting the maximum pattern length properly can significantly speed up the algorithm (an API sketch for this is included at the end of the API Usage section below).

  • Custom key function, custom filter function, and custom callback function can be applied.

Installation

This package is available on PyPI. Just use pip3 install -U prefixspan to install it.

CLI Usage

You can use the algorithms directly from the terminal.

Usage:
    prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]

    prefixspan-cli --help


Options:
    --text             Treat each item as text instead of integer.

    --closed           Return only closed patterns.
    --generator        Return only generator patterns.

    --key=<key>        Custom key function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
    --bound=<bound>    The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
                       Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
                       Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).

    --filter=<filter>  Custom filter function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.

    --minlen=<minlen>  Minimum length of patterns. [default: 1]
    --maxlen=<maxlen>  Maximum length of patterns. [default: 1000]
  • Sequences are read from standard input. Each sequence consists of integers separated by spaces, as in this example:
cat test.dat

0 1 2 3 4
1 1 1 3 4
2 1 2 2 0
1 1 1 2 2
  • When dealing with text data, use the --text option. Each sequence consists of words separated by spaces (assuming stop words have already been removed), as in this example:
cat test.txt

a b c d e
b b b d e
c b c c a
b b b c c
  • The patterns and their respective frequencies are printed to standard output.
prefixspan-cli frequent 2 test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 : 2
1 3 4 : 2
1 4 : 2
1 1 : 2
1 1 1 : 2
2 : 3
2 2 : 2
3 : 2
3 4 : 2
4 : 2
prefixspan-cli frequent 2 --text test.txt

a : 2
b : 4
b c : 3
b c c : 2
b d : 2
b d e : 2
b e : 2
b b : 2
b b b : 2
c : 3
c c : 2
d : 2
d e : 2
e : 2

API Usage

Alternatively, you can use the algorithms via the API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

print(ps.frequent(2))
# [(2, [0]),
#  (4, [1]),
#  (3, [1, 2]),
#  (2, [1, 2, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4]),
#  (2, [1, 4]),
#  (2, [1, 1]),
#  (2, [1, 1, 1]),
#  (3, [2]),
#  (2, [2, 2]),
#  (2, [3]),
#  (2, [3, 4]),
#  (2, [4])]

print(ps.topk(5))
# [(4, [1]),
#  (3, [2]),
#  (3, [1, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4])]


print(ps.frequent(2, closed=True))

print(ps.topk(5, closed=True))


print(ps.frequent(2, generator=True))

print(ps.topk(5, generator=True))
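
The pattern length limits exposed on the CLI (--minlen and --maxlen) can also be set when using the API. A minimal sketch, assuming minlen and maxlen are plain attributes on the PrefixSpan instance (as in one of the issue reports further down):

ps = PrefixSpan(db)

# Assumption: these attributes mirror the CLI's --minlen/--maxlen options.
ps.minlen = 2  # only report patterns with at least 2 items
ps.maxlen = 3  # prune the search beyond 3 items; tightening this can speed up mining considerably

print(ps.frequent(2))
# Expected to return only the length-2 and length-3 patterns from the
# full output above, e.g. [1, 2], [1, 2, 2], [1, 3], [1, 3, 4], ...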

Closed Patterns and Generator Patterns

Closed patterns are much more compact, as there are far fewer of them.

  • A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

Generator patterns are even more compact, as there are both fewer of them and they are shorter.

  • A pattern is a generator if there is no sub-pattern with the same frequency.

  • Due to their high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2

Custom Key Function

For both frequent and top-k algorithms, a custom key function key=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • By default, len(matches) is used, denoting the frequency of the current pattern.

  • Alternatively, any key function can be used. As an example, sum(len(db[i]) for i, _ in matches) ranks patterns by the total number of items in their matching sequences.

  • For efficiency, an anti-monotone upper-bound function should also be specified for pruning (see the sketch after the example below).

    • If unspecified, the key function is also used as the upper-bound function, and must therefore be anti-monotone.
print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
# [(20, [1]),
#  (15, [2]),
#  (15, [1, 2]),
#  (10, [1, 3]),
#  (10, [1, 3, 4])]
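
An explicit upper bound is only needed when the key itself is not anti-monotone. A sketch, assuming the API mirrors the CLI's --key/--bound options and accepts a bound= keyword alongside key=:

# key: pattern length times the number of matching sequences; not
#      anti-monotone on its own, since len(patt) grows with the pattern.
# bound: total length of the matching sequences; always >= the key
#        (every matching sequence is at least as long as the pattern)
#        and anti-monotone (extending the pattern can only shrink matches).
print(ps.topk(
    5,
    key=lambda patt, matches: len(patt) * len(matches),
    bound=lambda patt, matches: sum(len(db[i]) for i, _ in matches),
))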

Custom Filter Function

For both frequent and top-k algorithms, a custom filter function filter=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • By default, no filter is applied and all the patterns are returned.

  • Alternatively, any function can be used. As an example, matches[0][0] > 0 can be used to exclude the patterns covering the first sequence.

print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
# [(2, [1, 1]),
#  (2, [1, 1, 1]),
#  (2, [1, 2, 2]),
#  (2, [2, 2]),
#  (1, [1, 2, 2, 0])]

Custom Callback Function

For both the frequent and the top-k algorithm, you can use a custom callback function callback=lambda patt, matches: ... instead of returning the normal results of patterns and their respective frequencies.

  • When a callback function is specified, None is returned.

  • For large datasets, when mining frequent patterns, you can use a callback function to process each pattern immediately and avoid accumulating a huge list of patterns in memory.

  • The following example finds the longest frequent pattern covering each sequence.

coverage = [[] for i in range(len(db))]

def cover(patt, matches):
    for i, _ in matches:
        coverage[i] = max(coverage[i], patt, key=len)


ps.frequent(2, callback=cover)

print(coverage)
# [[1, 3, 4],
#  [1, 3, 4],
#  [1, 2, 2],
#  [1, 2, 2]]

Tip

I strongly encourage using PyPy instead of CPython to run the script for the best performance. In my own experience, it is nearly 10 times faster on average. To start, you can install this package in a virtual environment created for PyPy.

Note that only the earlier version 0.4 works with the latest PyPy3 6.0.0 (compatible with Python 3.5.3); install it via pip3 install prefixspan==0.4. The latest version should work with future PyPy3 releases (compatible with Python 3.6).

prefixspan-py's People

Contributors

chuanconggao, comonut, ikuyadeu, toddrme2178


prefixspan-py's Issues

Specify occurrence of each sequence in input

Hello,

Could the library support occurrence information of sequences in the input? I have data in this form, where the last element of a sequence indicates the number of times this sequence has occurred in the dataset.

c d e 7
b b b d e 89
c c c a 123
b b c c 789
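
This is not a built-in feature, but the custom key mechanism documented above can serve as a workaround. A sketch, assuming the counts have been split off into a list parallel to the sequences (and that string items are accepted by the API, since items only need to be hashable):

from prefixspan import PrefixSpan

db = [
    ["c", "d", "e"],
    ["b", "b", "b", "d", "e"],
    ["c", "c", "c", "a"],
    ["b", "b", "c", "c"],
]
counts = [7, 89, 123, 789]  # hypothetical per-sequence occurrence counts

ps = PrefixSpan(db)

# Weighted support: sum the occurrence count of every matching sequence.
# This key is anti-monotone (growing the pattern can only drop sequences
# from matches), so no separate upper bound is needed.
print(ps.topk(5, key=lambda patt, matches: sum(counts[i] for i, _ in matches)))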

wrong results

prefix = PrefixSpan(basket)
prefix.topk(10)

[(47, [(3657,)]), (42, [(3655,)]), (23, [(1915,)]), (13, [(1284,)]), (12, [(2098,)]), (11, [(372,)]), (10, [(3655,), (3655,)]), (9, [(395,)]), (9, [(660,)]), (9, [(1566,)])]

3657 appears 47 times!!

When I use the SPMF library, it gives me 242!
I also checked manually on the first 10 sequences: this library's algorithm reported 6 occurrences, while my manual check found 3, which matches SPMF's result.

How to find support of given sequences and subsequences?

Hi,

First, I'm hoping you can help me find the support for all sequences of any length in my database (db) that end with a particular item (e.g., 'LATE').
Is the best method to do this as shown below? (note: 'LATE' is always the final item in the sequence if it is present)
ps = PrefixSpan(db)
LATE_results = ps.frequent(1, filter=lambda patt, matches: 'LATE' in patt == True)

Second, after achieving the result above, what is the best way to take every sequence present in LATE_results, remove 'LATE' from the sequence, and then calculate the support of the remaining subsequence in the original db? I'm not sure whether to use the filter, key, or callback options for this.

Ultimately, I am trying to use your great package to perform sequential rule mining. In other words, I want to calculate support(subsequence_i --> LATE) / support(subsequence_i) for all sequences that end in 'LATE'. Another example of this would look like support (A --> B --> F --> G --> LATE) / support(A --> B --> F --> G).

Thank you in advance for any guidance you can provide!
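
A possible sketch for the first part, using the filter= parameter documented above. Note that 'LATE' in patt == True chains in Python to ('LATE' in patt) and (patt == True), which is always False for a list, so the last item is tested directly instead. Here db is assumed to be the list of sequences described above, and minsup a sensible minimum support:

from prefixspan import PrefixSpan

ps = PrefixSpan(db)

# Part 1: keep only patterns whose final item is 'LATE'.
late_results = ps.frequent(minsup, filter=lambda patt, matches: patt[-1] == "LATE")

# Part 2: the support of each prefix without 'LATE', looked up from an
# unfiltered run, gives the confidence support(sub -> LATE) / support(sub).
all_freqs = {tuple(patt): freq for freq, patt in ps.frequent(minsup)}
rules = [
    (patt[:-1], freq / all_freqs[tuple(patt[:-1])])
    for freq, patt in late_results
    if len(patt) > 1
]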

import error: SyntaxError: invalid syntax

Hi,
I have installed prefixspan 0.5.2 and want to import it in Jupyter Notebook.

import prefixspan

Traceback (most recent call last):

File "E:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2961, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 6, in
import prefixspan

File "E:\anaconda3\lib\site-packages\prefixspan_init_.py", line 5, in
from .frequent import PrefixSpan_frequent

File "E:\anaconda3\lib\site-packages\prefixspan\frequent.py", line 5, in
from extratools.dicttools import nextentries

File "E:\anaconda3\lib\site-packages\extratools\dicttools.py", line 30
r: Mapping[VT, List[KT]] = defaultdict(list)
^
SyntaxError: invalid syntax

Could you help me take a look then?

Thanks,
Wei

The output I get looks very weird to me

Trying prefixspan-0.5.2 with python 3.6.8

I am running prefixspan-cli frequent 5 --closed ids.txt > seqs.ids

ids.txt: https://gist.github.com/johann-petrak/9d07e3bacd167639c26defb822dbe6aa
seqs.ids: https://gist.github.com/johann-petrak/7db1b94153816075556798db1d068069

If you look at the fourth line from the bottom, it is "34 1 0 2 : 6", and the next line is
"34 0 2 : 8".

If you look at the input you will notice that "34 0 2" does not occur anywhere in the input, so why is it included in the output with frequency 8?

Returning matching pattern indices

When I run topk, I get out a list of patterns and their frequencies. e.g.

[[66741, ['/home', '/home']],
 [42351, ['/home', '/basket.html']],
 [41275, ['/home', '/signin.html']]]

I would like to use these as features, so I need to map them back to their indices in the training data. This data is available in the callback/key/filter functions e.g. key=lambda patt, matches (I want matches).

Is there a way to return this match data? Should I use a callback function to store it?
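
One way to get the matches, as the question suggests, is a callback that records them alongside each pattern. A minimal sketch using the callback= parameter and the (sequence id, position) tuples described above (shown with frequent(); per the README, the same parameter is available for topk()):

results = []

def collect(patt, matches):
    # Copy the pattern and keep the matching sequence ids, which index
    # back into the training data.
    results.append((list(patt), [i for i, _ in matches]))

ps.frequent(2, callback=collect)  # returns None when a callback is given

# results now holds (pattern, [matching sequence indices]) pairs.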

Configure minlen, maxlen outside instead of hardcode to 1 and 1000 respectively

Hello,

I have a requirement to find matching patterns (along with their frequency counts) from a given input list, where each matched pattern's length should be greater than 5. I noticed that there is a parameter named 'minlen' that can be used for this purpose.

So is it possible to configure minlen and maxlen externally, instead of having them hardcoded to 1 and 1000 respectively? Or is there another way to achieve this?

Please let me know if you need any details on that.

Inclusion in scikit-mine

Hi there, very nice and rich implementation of these 3 algorithms

The INRIA center at Rennes is creating a new python library, namely scikit-mine, to centralise pattern mining methods, and improve inter-operability and consistency with other fields, such as Machine Learning.

Your API already has similarities with what scikit-mine provides.

In the context of scikit-mine, only BIDE and FEAT would be nice to have, as PrefixSpan mines too many patterns, and we encourage concise representations.

I also plan to try FEAT as a candidate generator for SQS-candidates, an algorithm based on MDL.
For this purpose, handling gaps would be required, as SQS natively accounts for them when running its optimization process.

Is anyone willing to provide support for integration into scikit-mine?

How to handle the multiple items per transactions?

Hi, your work is great. I have a question: if itemsets can contain multiple items, such as [1, [2, 3], 4, 2, 1] (that is, 2 and 3 appear simultaneously), how should I adjust the input data so that the algorithm can handle this?

Algorithm outputs a series of repeated items but there are none in the training data

Hallo,

I have noticed a behaviour that, to me, is a bit strange. I trained the algorithm with a series of sequences that had no repeated items, i.e. it's not possible that an item appears again immediately after itself, like 1 in the sequence [3, 2, 1, 1, 5, 7, 2].

When I generated the most frequent sequences, though, I obtained repeated items. Is it possible?

For example, given the code:
import numpy as np
from prefixspan import PrefixSpan

seqs = [[22, 16],
[22, 21],
[22, 16, 14, 20],
[22, 16],
[22, 16, 34, 24, 26, 24, 26, 14, 13],
[22, 16],
[22, 26],
[22, 13, 34],
[22, 16],
[22, 21, 16]]

ps = PrefixSpan(seqs)
ps.minlen = 2
ps.maxlen = 10

freq_ratio = 0.1
freq = np.ceil(freq_ratio * len(seqs)).astype(int)

res = ps.frequent(freq)

The output has [26, 26, 14, 13]

This is just a small reproducible example; in my case the sequence dataset has ~1000 sequences, but the problem remains.

Thanks

Does not work properly with text

The given code does not work properly with text.
Example:
input I gave was,
What is your name?
What is the main theme of the party?
.....
The output had a lot of problems.
Can I use your code with sentences?

Incorrect frequent patterns

I'm getting patterns that have the wrong frequency counts. I am assuming there should be no gaps when generating these frequent patterns and no double counting. I've asterisked a few that I believe to be incorrect.

input:

sequences = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

output:

from prefixspan import PrefixSpan
prefix_span = PrefixSpan(sequences)

min_support = 7
frequent_patterns = prefix_span.frequent(min_support, closed=True)

for pattern in frequent_patterns:
    print(pattern)

(15, [0])
(10, [0, 1])
(7, [0, 1, 0])
(14, [0, 0])
(13, [0, 0, 0]) *
(7, [0, 0, 1])
(13, [1])
(13, [1, 0])
(11, [1, 0, 0])
(7, [1, 0, 0, 0]) *
(7, [1, 0, 1]) *
(9, [1, 1]) *

how to install it on windows 10 computer

It is not clear how to install it on a Windows 10 computer, or how to use it for supervised learning, for example for training data with only 2 classes ("yes" and "no") as the last token in each sequence.

Multiple items per transactions and retrieving patterns

Hi,

is it possible to specify multiple items per transaction?

For instance,

[ [0, [1,2], 3], [0, 1, 3], [4, 5, 0, [1,2], 3] ]

Would also have the pattern [0, [1,2], 3] observed two times. Second, is it possible to specify that the framework should return all patterns that meet the minimum support criteria, without knowing how many there are in advance?

Thanks for a great tool! It seems to be super fast, even for millions of sequences.

TypeError: unhashable type: 'list'

I am trying the PrefixSpan algorithm (def frequent_rec) on a dataset of Freeman codes (a list of lists of variable length, containing integer values):
[
[4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 6, 5, 5, 4, 5, 5, 5, 5, 6, 5, 6, 7, 6, 7, 7, 6, 7, 0, 0, 2, 3, 3, 3, 2, 3, 2, 2, 1, 1, 0, 7, 7, 7, 7, 7, 7, 7, 1, 1, 3, 2, 3, 3, 2, 3, 2, 2, 0, 0, 7, 0, 7, 7, 7, 7, 7, 7, 0, 1, 3, 2],
[3, 3, 3, 4, 3, 3, 4, 3, 3, 3, 3, 4, 5, 4, 5, 5, 4, 5, 5, 5, 5, 6, 7, 7, 6, 5, 5, 5, 5, 6, 7, 7, 7, 0, 2, 1, 1, 0, 1, 0, 7, 7, 0, 0, 2, 2, 2, 1, 0, 0, 7, 0, 1, 1, 2],
..............
..............
[3, 3, 3, 3, 3, 3, 4, 5, 6, 6, 5, 5, 5, 6, 5, 6, 6, 5, 5, 5, 6, 5, 5, 5, 5, 7, 7, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 0, 2, 1, 1, 0, 6, 6, 7, 6, 5, 5, 4, 7, 0, 0, 1, 1, 1, 2, 3, 3, 2]]
The dataset has about 60000 rows. However, when running the algorithm, I get an error

TypeError: unhashable type: 'list'

in this line:
l = occurs[seq[j]]
I also tried your original implementation and got the same error. Do the items have to be of the same length? It would be nice if you had an idea. Thank you.

a way to edit the minlen

hi,

Is there a way to set minlen? I am only interested in patterns between 2 and 5 items long. (I am working in Python, btw.)

Thanks in advance, the package works great!

How to get the prefixspan-cli working?

I tried to install setup.py using "python setup.py" but it throws an error. How do I use the prefixspan-cli option? It keeps throwing an error for me. I am particularly interested in using the maxlen option in my Python code. How do I use maxlen in Python code rather than on the terminal? Would greatly appreciate any help!

Finding frequent patterns in multiple lists

Can PrefixSpan find sequential patterns in multiple lists, instead of just one list?
For example, I'd want to find the most frequent patterns that occur in both of these lists:

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]
db2 = [
    [0, 2, 2, 3, 5],
    [1, 1, 1, 3, 5],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]
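
One possible approach is to mine each database separately and intersect the resulting pattern sets; a sketch (plain Python on top of the documented frequent() output), ranking the shared patterns by their smaller support:

from prefixspan import PrefixSpan

freqs1 = {tuple(patt): freq for freq, patt in PrefixSpan(db).frequent(2)}
freqs2 = {tuple(patt): freq for freq, patt in PrefixSpan(db2).frequent(2)}

# Patterns frequent in both databases, ranked by the smaller of their two supports.
common = [(min(freqs1[p], freqs2[p]), list(p)) for p in freqs1.keys() & freqs2.keys()]
print(sorted(common, reverse=True))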

Question: Sequential Rule Mining

Do you have a plan to implement for sequential rule mining algorithm?
(If not) Can I implement and submit a pull request to this repository?

Incorrect result

Data

a a b c a c
a d c b c a
e f a b d f

Running prefixspan-cli frequent 3 --text fmt-db1.txt gives me

e : 3
e e : 3
e b : 3
b : 3

How the support value is calculated in PrefixSpan-py module

Hi Chuanconggao,

I am using your module in my research for closed frequent sequence mining; the BIDE algorithm is the obvious choice for this. I have a few questions regarding how the support value is calculated in this module.

Details given below.

I have 47088 transactions, each with a variable number of items. I used these 47088 transactions for BIDE training, and used topk() to extract the top 1000 rules. One of the rules is listed below.

(1033,
['bdb061461ee36ed04a69b4913a4c50af433d8de1',
'68d234c39fd9217721d3ee22d5b81981f6a19cdc',
'85a66b6b2cbe4e35ffe838f7080eeac4bb67cd1d',
'4728d1f5afc614cbe76fc729de89c6634faecf85'])

Here 1033 is the support value. I am under the impression that this pattern appears 1033 times across the 47088 transactions, but I actually found that it occurs only 339 times in all the transactions.

Can you explain how this support value is calculated here? Also, please let me know if there is any error in the calculation.

Thanks,
Uday Vakalapudi
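
For reference, the support reported by this library counts matching sequences: the number of input sequences that contain the pattern as a subsequence, with gaps allowed between items (see the Features section above), not the total number of occurrences. A standard containment check, independent of the library, can be used to verify a reported value:

def is_subsequence(pattern, sequence):
    # True if pattern occurs in sequence in order, with gaps allowed.
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, db):
    # Number of sequences in db containing pattern as a subsequence.
    return sum(is_subsequence(pattern, seq) for seq in db)

# e.g. support(patt, transactions) for any (freq, patt) pair returned by topk().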

Installation problem

Hello!
I'm interested in your project and trying to install using pip.
However, I get this error and cannot proceed.

pip3 install -U prefixspan
Collecting prefixspan
Using cached https://files.pythonhosted.org/packages/53/8d/f48a549b96f63234e0ff446be390cb8c915d7596401764ca76a0ff508a86/prefixspan-0.5.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-vgkdsun3/prefixspan/setup.py", line 18
download_url=f"{url}/tarball/{version}",
^
SyntaxError: invalid syntax
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-vgkdsun3/prefixspan/

Is there any way to solve this?
Thank you in advance!

cannot use prefixspan-cli

Sorry for bothering you.
Firstly I installed prefixspan by "pip3 install -U prefixspan" in Windows. The command "prefixspan-cli" can't be used in cmd, while it can be used in git bash. My python version is 3.8.3.
Then I tried to install prefixspan on Linux, but the command "prefixspan-cli" still can't be used; my system doesn't recognize "prefixspan-cli" as a command. My Python versions here are 2.7.17 and 3.6.9.

Verbose mode

Do you think it would be possible to develop a verbose mode? Thanks!

Gaps

Hi,

This is not a real issue; I'm just wondering whether there is any way to set a maximum and/or minimum gap for the pattern mining?

Issue with key parameter input

Hello,

Thank you for your very convenient package. Computation time is great.

I'd like to count the occurrences of patterns in my database, not just the number of times they appear as prefixes. Thus, I wanted to specify a key function that could count this. Unfortunately, every time I try to use the key parameter, the output is 'NoneType' object is not callable. I then tried to change the defaultkey in prefixspan.py to len(patt) instead of len(matches) and the same happened. I then changed it back to len(matches) and, rather than going back to normal output, it was still 'NoneType' object is not callable.

Would you have any idea where the problem comes from?

I work on Oracle VM Ubuntu 16, Anaconda python 3.7, Spyder IDE

Thank you.

Handling of --text does not work properly

The way the word dictionary, and subsequently the inverted word dictionary, is created is broken.

Take this input file:

a b c
b c d
e f g b c
a f g
f b c

Running prefixspan-cli frequent 2 --closed --text test1.txt creates the following output:

e f g : 3
f g : 5
f g g : 2
f f g : 2

This is obviously wrong.

The reason is that the word dictionary created is: {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 0, 'f': 1, 'g': 2}
As you can see, the indices are not properly mapped to words and there are duplicate indices.

The inverted map is therefore also wrong: {0: 'e', 1: 'f', 2: 'g'}

BIDE algorithm returns extra patterns in the result.

from prefixspan import PrefixSpan

db = [['c', 'a', 'a', 'b', 'c'], ['a', 'b', 'c', 'b'], ['c', 'a', 'b', 'c'], ['a', 'b', 'b', 'c', 'a']]
ps = PrefixSpan(db)
ps.frequent(2, closed=True)

shows result:
[(3, ['c', 'a']), (2, ['c', 'a', 'b', 'c']), (3, ['c', 'b']), (4, ['a']), (2, ['a', 'a']), (4, ['a', 'b']), (4, ['a', 'b', 'c']), (2, ['a', 'b', 'b'])]

which was supposed to be:
[(3, ['c', 'a']), (2, ['c', 'a', 'b', 'c']), (3, ['c', 'b']), (2, ['a', 'a']), (4, ['a', 'b', 'c']), (2, ['a', 'b', 'b'])]

The input is from the original BIDE paper.
