Persistent Cascades

Implementation of the Persistent Cascades graph mining algorithm described in this paper.

Scripts

The main script is cascades.py which contains the class Cascades and several "helper" methods.

The Cascades class relies on the Node class defined in zss.py and the associated simple_distance method. The Node class is a tree structure, and simple_distance computes the tree edit distance between two Nodes using the Zhang-Shasha algorithm. This code is modified from this repo. I did a hard copy, not a fork, because ... I have no good reason. Sorry.

utils.py contains a few helper functions: get_depth computes the depth of a Node tree; load_calls loads mobile-phone datasets saved in binary format (for example with NumPy's ndarray.tofile()), where each file is a different day; and enrich_calls takes such a dataset plus the output of Cascades and annotates which calls are part of "persistent cascades."
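As a rough illustration of what a depth helper such as get_depth computes, here is a minimal sketch. The TreeNode class is a hypothetical stand-in for the Node class in zss.py (it assumes only a label and a list of children), and treating a single node as depth 0 is also an assumption; the real helper may count differently.

```python
class TreeNode:
    """Hypothetical stand-in for the Node class: a label plus a list of children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children) if children else []


def tree_depth(node):
    """Return the number of edges on the longest root-to-leaf path."""
    if not node.children:
        return 0
    return 1 + max(tree_depth(child) for child in node.children)


# A small cascade: a called b and c, and b then called d.
root = TreeNode("a", [TreeNode("b", [TreeNode("d")]), TreeNode("c")])
print(tree_depth(root))  # 2
```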

Usage

Usage is designed for time series datasets in the following format:

Caller   [NA]   Callee   [NA]   Timestamp   [Duration]   Day

The "NA"s can be additional information about the caller/callee/call (such as a tower location), or just blank. The duration field is also not used in this implementation, and can be blank. The Day field should be a sequential index of what 24-hour period the timestamp refers to.

The dataset should be stored as a NumPy array with shape (n, 7), where n is the number of call records, in the exact column order above. Unfortunately the column index locations are hard-coded for now, so if your data takes a different form you need to coerce it into this format with blank columns. Cleaning this up is on the to-do list; a recarray or Pandas DataFrame might be more sensible.
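If your records only have caller, callee, timestamp and day, one way to coerce them into the seven-column layout is sketched below. The sample records, the int64 dtype, and the use of 0 as the "blank" filler are all assumptions for illustration; check the code for what it actually expects in the unused columns.

```python
import numpy as np

# Hypothetical records: (caller, callee, timestamp, day).
# Timestamps here use the DDDSSSSS format described in this README.
records = [
    (101, 202, 103600, 1),
    (202, 303, 107200, 1),
    (101, 202, 203600, 2),
]
raw = np.array(records, dtype=np.int64)

# Column order: Caller, NA, Callee, NA, Timestamp, Duration, Day.
# The NA and Duration columns are left as zeros (assumed "blank").
calls = np.zeros((raw.shape[0], 7), dtype=np.int64)
calls[:, [0, 2, 4, 6]] = raw

print(calls.shape)  # (3, 7)
```

The resulting calls array could then be passed as the calls argument of Cascades.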

Given a dataset like this, let's say it's called df, you can call:

from cascades import Cascades

C = Cascades(calls=df)
C.build(nsample=-1)

which will construct a Cascades instance C, build all possible cascades from the dataset df, and compute the similarities between all cascades with the same root (as defined in the paper). To do the persistent analysis (that is, cluster similar cascades together), do

nted, jacc = C.build_persistence_classes()

which by default uses a similarity threshold of ell=0.8. This returns a dictionary mapping each user to that user's persistence classes, where the values in the arrays are day indexes.

Loading datasets from file

Instead of passing a dataset, you can pass Cascades a path and it will load the data itself. Currently the data needs to be in binary format (for example written with NumPy's ndarray.tofile()), with a separate file for each day of records, and with a filename of the form *_*_*_*_YYYYMMDD.dat, where each * can be any string of the characters a-zA-Z of any length. For example, city_result_call_date_20120101.dat is a valid file name corresponding to Jan 1, 2012.
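To check your file names against this pattern before loading, a small validator along these lines may help. This is only a sketch and not the loader's own logic: the function name parse_day_file is hypothetical, and the regular expression assumes each of the four fields has at least one letter.

```python
import re
from datetime import date, datetime

# Four underscore-separated alphabetic fields, then an eight-digit date.
FILENAME_RE = re.compile(r"^(?:[A-Za-z]+_){4}(\d{8})\.dat$")


def parse_day_file(filename):
    """Return the date encoded in a daily file name, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None


print(parse_day_file("city_result_call_date_20120101.dat"))  # 2012-01-01
```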

Here's an example of loading from file:

C = Cascades(path='path/to/data', city='citysubfolder', nMonths=2, moyr=[(1,2012), (2,2012)])

Note on timestamp formats

The default format for timestamps is DDDSSSSS, where DDD is the day period and SSSSS is the second within that day (0 to 86399). So 3886399 would be 23:59:59 (11:59:59 PM) on the 38th day. This is nonstandard and clunky. If your timestamps are in the more typical UNIX epoch style, you can specify UTC=True in the constructor call.
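To make the DDDSSSSS layout concrete, the sketch below splits such a timestamp into its day and time-of-day parts. The helper name decode_timestamp is hypothetical and is not part of cascades.py.

```python
def decode_timestamp(ts):
    """Split a DDDSSSSS timestamp into (day, hour, minute, second)."""
    # The last five digits are the second of the day; the rest is the day.
    day, seconds = divmod(ts, 100_000)
    if seconds >= 86_400:
        raise ValueError(f"second-of-day out of range: {seconds}")
    hour, remainder = divmod(seconds, 3600)
    minute, second = divmod(remainder, 60)
    return day, hour, minute, second


print(decode_timestamp(3886399))  # (38, 23, 59, 59)
```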

Other settings

For usage details beyond the default settings, check out the code. Most of the functions are documented with standard Python docstrings.

Contributors

stmorse
