RSC-Model

Matlab implementation of the RSC-Model which is described in the following paper:

RSC: Mining and Modeling Temporal Activity in Social Media

Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina Jr., and Christos Faloutsos

The 21st SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015

How to Use

Generating Synthetic Time-stamps

Using RSC to generate 100,000 synthetic time-stamps:

> addpath(genpath('.')); % Add sub-folders to Matlab path.
> [ fGen, ~, paramGuess ] = rsc_model();
> Tsynth = fGen(paramGuess, 1e5);

We can use the plot_iat_hist function to plot the log-binned histogram of the synthetic inter-arrival times (IAT):

> plot_iat_hist(Tsynth);

The result should be similar to the following figure:

[Figure: Log-Binned Histogram]
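
For readers unfamiliar with log-binning, here is a minimal sketch of how inter-arrival times can be computed and counted in logarithmically spaced bins. It is not the code of plot_iat_hist: the toy time-stamps and the names binsPerDecade, binIdx and density are assumptions made for illustration only.

```matlab
% Toy time-stamps in seconds (hypothetical values, not from the dataset).
T = cumsum([0 1 2 4 8 16 32 64 128 256 512]);

iat = diff(T);          % inter-arrival times: 1, 2, 4, ..., 512
iat = iat(iat > 0);     % zero delays have no logarithm, so drop them

% Bin edges equally spaced in log10 (here: one bin per decade).
binsPerDecade = 1;
lo = floor(log10(min(iat)));
hi = ceil(log10(max(iat)));
nBins = (hi - lo) * binsPerDecade;
edges = logspace(lo, hi, nBins + 1);   % [1 10 100 1000]

% Map each IAT to its bin; clamp values that fall exactly on the last edge.
binIdx = floor((log10(iat) - lo) * binsPerDecade) + 1;
binIdx = min(binIdx, nBins);
counts = accumarray(binIdx(:), 1, [nBins, 1])';

% Divide by bin width so that wide bins are not over-represented.
density = counts ./ diff(edges) / sum(counts);
```

Plotting density against the bin edges on log-log axes (for example with loglog) gives a log-binned histogram of the kind discussed above.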

Fitting Data

Instead of using the default parameters paramGuess returned by the rsc_model function, we can estimate (fit) the parameters from real data. The function load_reddit_data loads sample data from Reddit users:

> Tcell = load_reddit_data();

The function fit_model is used to estimate the parameters. We are using the RSC default parameters as a starting point for the fit algorithm:

> paramEst = fit_model(Tcell, @rsc_model, 'paramGuess', paramGuess);

The function plot_iat_hist_fit compares the log-binned histogram for real data against synthetic time-stamps. If both histograms are similar, then the RSC fit was successful:

> timeStampTotal = numel(cell2mat(Tcell));
> Tsynth = fGen(paramEst, timeStampTotal);
> plot_iat_hist_fit(Tcell, Tsynth);

The result should be similar to the following figure:

[Figure: Log-Binned Histogram]

Detecting Bots

The sample dataset of Reddit users has some users that are bots. We can use the load_reddit_data function to get a grouping variable that tells whether the i-th entry of Tcell is a bot or a human:

> [ Tcell, ~, ~, ~, ~, userType] = load_reddit_data();

userType(idx) == 1 indicates that the time-stamp sequence in Tcell{idx} is from a bot. The function estimate_bot_likelihood returns a vector L where each entry L(idx) is the likelihood (i.e., the score) that the time-stamp sequence Tcell{idx} is from a bot:

% Split data into train and test subsets.
> CrossValIdxs = my_crossvalind('Kfold', userType, 2);
> TcellTest = Tcell(CrossValIdxs == 1);
> TcellTrain = Tcell(CrossValIdxs == 2);
> userTypeTrain = userType(CrossValIdxs == 2);
> [Ltest, Ltrain] = estimate_bot_likelihood(TcellTest, TcellTrain, userTypeTrain);
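
The split above passes the labels userType to my_crossvalind. A minimal sketch of a stratified K-fold assignment is given below, assuming my_crossvalind behaves like a stratified splitter (each class divided roughly evenly across folds). That behaviour is an assumption, not something confirmed here, and the labels and the name foldIdx are hypothetical.

```matlab
% Hypothetical labels: 1 = bot, 0 = human.
userType = [1 0 0 1 0 0 0 1 0 0];
K = 2;

foldIdx = zeros(size(userType));
classes = unique(userType);            % row vector, since userType is a row
for c = classes
    members = find(userType == c);                 % indices of this class
    members = members(randperm(numel(members)));   % shuffle within the class
    % Deal the shuffled members round-robin into folds 1..K.
    foldIdx(members) = mod(0:numel(members) - 1, K) + 1;
end
```

Because each class is dealt out separately, the number of bots in any two folds differs by at most one, which matters when bots are rare.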

To classify users as bots or humans, our paper uses a cost-sensitive approach. Assuming costs FpCost = 10 and FnCost = 1 for false-positive (FP) and false-negative (FN) errors, we can detect bots using the likelihood_thresh function as follows:

> FpCost = 10; FnCost = 1;
> Lthresh = likelihood_thresh(Ltrain, userTypeTrain, FpCost, FnCost);
> IsBot = Ltrain > Lthresh;
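
To illustrate what a cost-sensitive threshold means, the sketch below tries every candidate threshold and keeps the one with the lowest total cost FpCost * FP + FnCost * FN. This is one plausible way to choose such a threshold and may differ from what likelihood_thresh does internally. The scores, the labels and the names candidates, cost and minCost are hypothetical.

```matlab
% Hypothetical scores and labels (1 = bot, 0 = human); not from the dataset.
L      = [0.1 0.2 0.35 0.4 0.6 0.7 0.8 0.9];
labels = [0   0   1    0   0   1   1   1  ];
FpCost = 10; FnCost = 1;

% Candidate thresholds: below every score, plus each observed score.
candidates = [-Inf, sort(L)];
cost = zeros(size(candidates));
for k = 1:numel(candidates)
    pred = L > candidates(k);                  % predicted bots
    FP = sum(pred == 1 & labels == 0);         % humans flagged as bots
    FN = sum(pred == 0 & labels == 1);         % bots that were missed
    cost(k) = FpCost * FP + FnCost * FN;
end

[minCost, best] = min(cost);
Lthresh = candidates(best);                    % 0.6, with total cost 1
```

With a false positive ten times as costly as a false negative, the chosen threshold accepts one missed bot (score 0.35) rather than flagging any human.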

We can also use the print_conf_matrix function to print the confusion matrix:

> TP = sum(userTypeTrain == 1 & IsBot == 1);
> FP = sum(userTypeTrain == 0 & IsBot == 1);
> TN = sum(userTypeTrain == 0 & IsBot == 0);
> FN = sum(userTypeTrain == 1 & IsBot == 0);
> print_conf_matrix(TP, FP, TN, FN);

                  Predicted Class
                .---------.--------.
                |   Pos.  |   Neg. |
        .-------|---------|--------|
 Actual | Pos.  |      9  |     5  |
 Class  | Neg.  |      1  |   498  |
        `--------------------------
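
As a worked example, precision, recall and accuracy can be derived from the four counts in the confusion matrix above. The variable names precision, recall and accuracy are introduced here for illustration and are not part of the repository.

```matlab
% Counts read from the confusion matrix above.
TP = 9; FN = 5; FP = 1; TN = 498;

precision = TP / (TP + FP);                    % 9 / 10
recall    = TP / (TP + FN);                    % 9 / 14
accuracy  = (TP + TN) / (TP + FP + TN + FN);   % 507 / 513

fprintf('precision = %.3f, recall = %.3f, accuracy = %.3f\n', ...
        precision, recall, accuracy);
% prints: precision = 0.900, recall = 0.643, accuracy = 0.988
```

The high precision and lower recall are consistent with the cost setting FpCost = 10, FnCost = 1, which penalizes false positives more heavily than missed bots.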

Datasets

This repository includes a sample dataset with time-stamps of 1,036 Reddit users in the sample_data/reddit/ directory.

We also include links below to download the complete datasets used in our paper. Each dataset has a README.md file with a more detailed description of the data.

Acknowledgements

This material is based upon work supported by FAPESP, CNPq, CAPES, STIC-AmSud, the RESCUER project funded by the European Commission (Grant: 614154) and by the CNPq/MCTI (Grant: 490084/2013-3), JSPS KAKENHI, Grant-in-Aid for JSPS Fellows #242322, the National Science Foundation under Grant No. CNS-1314632, IIS-1408924, ARO/DARPA under Contract Number W911NF-11-C-0088 and by the Army Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053.
