
This project forked from uvacw/df2markov


License: GNU General Public License v3.0


df2markov

A simple way to create Markov Chains from dataframes

Description

df2markov is a simple tool for creating a Markov chain from timestamped data. A typical use case is the analysis of web browsing data. df2markov lets you estimate transition probabilities such as the probability of reading an article on topic t1 after having read an article on topic t2, or the probability of continuing to read an article on a news website after having encountered it on social media. df2markov can also plot the transition probabilities.

Installation

You can either install df2markov from this GitHub repository or simply install it via pip:

pip install df2markov

For (optional) plotting, we create DOT files, which can then be converted into various output formats, such as PNG or PS. To make use of this, the dot command from Graphviz needs to be available on your system. On Ubuntu, you can install it via

sudo apt install graphviz

For Windows and Mac, have a look at the Graphviz website.

Usage

As input, df2markov expects your data to be (roughly) organized like this:

Timestamp           Session  User  State
2019-2-1 13:44:21   1        Anna  C
2019-2-1 13:44:45   1        Anna  A
2019-2-1 13:44:59   1        Anna  A
2019-2-1 13:46:05   1        Anna  F
2019-2-1 17:46:05   2        Anna  A
2019-2-1 17:46:47   2        Anna  F
2019-2-1 13:44:22   1        Bob   D
2019-2-1 13:45:38   1        Bob   D
2019-2-1 13:46:01   1        Bob   F

df2markov is relatively flexible about the data types in the columns of the table: in principle, all columns accept integers, floats, or strings. The only restriction is that the timestamp must be sortable in a meaningful way. A simple integer (with increasing values) is fine, as are datetime objects or strings in, for example, ISO 8601 format ("1997-07-16T19:20"). Strings that do not sort in chronological order (e.g., "16-7-1997") would lead to incorrect results.
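To illustrate the sorting restriction, here is a minimal sketch using only the Python standard library. The date strings are made-up example values, and the day-month-year format string is an assumption about how such data might look; adapt it to your own data.

```python
from datetime import datetime

# Hypothetical day-first date strings, listed in chronological order.
day_first = ["16-7-1997", "2-8-1997", "1-1-1998"]

# Sorting them as plain strings compares character by character,
# so the result is no longer chronological.
as_strings = sorted(day_first)

# Parsing into datetime objects restores a meaningful sort order.
parsed = sorted(datetime.strptime(s, "%d-%m-%Y") for s in day_first)

# ISO 8601 strings sort chronologically even when compared as text.
iso_strings = [d.isoformat() for d in parsed]

print(as_strings)
print(iso_strings)
```

Either the parsed datetime objects or the ISO 8601 strings could then be used as the timestamp column.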

The session column allows you to group data into sessions, such as a web browsing session. For instance, if a user visits website B four hours after visiting website A, you may not want to consider this as a transition. Important: Within each user, session IDs must be unique. In other words, Anna cannot have two sessions both identified as 'session 1', even if they are on different days.
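If your raw data has no session column yet, one possible way to derive it is to start a new session whenever a user has been inactive for longer than some threshold. The sketch below is not part of df2markov; the function name `assign_sessions` and the 30-minute gap are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def assign_sessions(rows, gap=timedelta(minutes=30)):
    """Add a per-user session number to (timestamp, user, state) rows.

    A new session starts when a user's first event is seen or when more
    than `gap` has passed since that user's previous event.
    """
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> current session number
    out = []
    # Sort by user first, then by time, so each user's events are contiguous.
    for ts, user, state in sorted(rows, key=lambda r: (r[1], r[0])):
        if user not in last_seen or ts - last_seen[user] > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append((ts, session_no[user], user, state))
    return out

rows = [
    (datetime(2019, 2, 1, 13, 44, 21), "Anna", "C"),
    (datetime(2019, 2, 1, 13, 44, 45), "Anna", "A"),
    (datetime(2019, 2, 1, 13, 46, 5), "Anna", "F"),
    (datetime(2019, 2, 1, 17, 46, 5), "Anna", "A"),
    (datetime(2019, 2, 1, 17, 46, 47), "Anna", "F"),
    (datetime(2019, 2, 1, 13, 44, 22), "Bob", "D"),
]
with_sessions = assign_sessions(rows)
```

Because the counter only ever increases for a given user, the resulting session IDs are unique within each user.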

This is particularly useful if one of your states is a (meaningful) absorbing state, such as 'End of Web session'. In that case, you should add a final absorbing state (i.e., a state that, once entered, cannot be left) to every session to represent the exit point. In the example, this state is called 'F'.
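As a rough sketch of this preprocessing step, the helper below appends an exit state to every session that does not already end with it. The function name `add_exit_state` and the dictionary layout (keys are user and session, values are state lists) are hypothetical and not part of df2markov.

```python
def add_exit_state(sessions, exit_state="F"):
    """Return a copy in which every state sequence ends with exit_state."""
    closed = {}
    for key, states in sessions.items():
        if states and states[-1] == exit_state:
            closed[key] = list(states)                 # already closed
        else:
            closed[key] = list(states) + [exit_state]  # append the exit point
    return closed

sessions = {
    ("Anna", 1): ["C", "A", "A"],
    ("Anna", 2): ["A", "F"],
    ("Bob", 1): ["D", "D"],
}
closed = add_exit_state(sessions)
```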

However, when there is no absorbing state (for example, when examining the weather, which can only be "sunny" or "rainy"), it is also possible to rely on the Timestamp column alone to detect sequence patterns in the data. Again, strings that do not sort in chronological order (e.g., "16-7-1997") would lead to incorrect results. In this case, just fill the session-ID column with a single value, such as 1, in all cells.

To use df2markov, you can import it as follows:

from df2markov import SAMPLEDATA, Markov

Sequential patterns per user:

By creating transition matrices, df2markov organizes the data into meaningful sequential patterns:

mymodel = Markov(SAMPLEDATA)

Next, df2markov can create a transition probability matrix: the likelihood of transitioning between any two states:

mymodel.get_probability_matrices()      
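To make the idea of a transition probability matrix concrete, here is a standalone sketch of the usual estimate: count how often each state is followed by each other state within a session, then divide by the row total. This is an illustration in plain Python, not df2markov's own implementation, and the function name `transition_probabilities` is an assumption.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next state | current state) from chronological state lists.

    Transitions are only counted within a session, never across sessions.
    """
    counts = defaultdict(Counter)
    for states in sessions:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, row in counts.items():
        total = sum(row.values())
        probs[src] = {dst: n / total for dst, n in row.items()}
    return probs

# Anna's two sessions from the example table above.
anna = [["C", "A", "A", "F"], ["A", "F"]]
probs = transition_probabilities(anna)
print(probs)
```

For Anna, state A is followed once by A and twice by F, so the estimated probabilities are 1/3 and 2/3 respectively.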

It is also able to visualize the transitions:

mymodel.plot(outputdirectory='/path/to/store/output', user='Anna')      

Convert to common graphic formats:

dot -T png Anna_probabilities.dot > anna_markov.png
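For readers unfamiliar with the DOT format, the sketch below shows how transition probabilities could be written as a Graphviz digraph with labelled edges. It is only an illustration: the function `to_dot` is hypothetical, and the files that df2markov itself writes may be laid out differently.

```python
def to_dot(probs, name="transitions"):
    """Render nested {source: {target: probability}} dicts as DOT source."""
    lines = [f"digraph {name} {{"]
    for src in sorted(probs):
        for dst in sorted(probs[src]):
            # One labelled edge per observed transition.
            lines.append(f'    "{src}" -> "{dst}" [label="{probs[src][dst]:.2f}"];')
    lines.append("}")
    return "\n".join(lines)

dot_source = to_dot({"C": {"A": 1.0}, "A": {"A": 1 / 3, "F": 2 / 3}})
print(dot_source)
```

Saved to a file, text like this can be passed to the dot command in the same way as shown above.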

Aggregate sequential patterns:

Finally, df2markov is able to aggregate the data from all users in the data set to create an overall (a) percentage matrix (0-100 percent), (b) probability matrix (0-1), or (c) frequency matrix:

mymodel.aggregate(how="frequency")
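The three output scales are closely related, as the following standalone sketch shows. It pools transition counts over all users and then rescales each row. Whether df2markov pools counts in this way (rather than, say, averaging per-user matrices) is an assumption, and the function `aggregate_transitions` and its `how` values "probability" and "percentage" are hypothetical names used only in this example.

```python
from collections import Counter

def aggregate_transitions(per_user_counts, how="frequency"):
    """Pool (source, target) counts over users and rescale them.

    how: "frequency" (raw counts), "probability" (0-1), or "percentage" (0-100).
    """
    pooled = Counter()
    for counts in per_user_counts.values():
        pooled.update(counts)
    if how == "frequency":
        return dict(pooled)
    # Total number of transitions leaving each source state.
    row_totals = Counter()
    for (src, _dst), n in pooled.items():
        row_totals[src] += n
    scale = {"probability": 1.0, "percentage": 100.0}[how]
    return {(src, dst): scale * n / row_totals[src]
            for (src, dst), n in pooled.items()}

# Transition counts derived from the example table above.
counts = {
    "Anna": Counter({("C", "A"): 1, ("A", "A"): 1, ("A", "F"): 2}),
    "Bob": Counter({("D", "D"): 1, ("D", "F"): 1}),
}
frequency = aggregate_transitions(counts, how="frequency")
percentage = aggregate_transitions(counts, how="percentage")
```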

Citation

If you find this package useful and build on it in your academic work, we would appreciate it if you cited our paper:

Vermeer, S.A.M. & Trilling, D. (2020): Toward a better understanding of news user journeys: A Markov chain approach. Journalism Studies, forthcoming.

Bibtex:

@article{df2markov,
  author = {Vermeer, Susan A.M. and Trilling, Damian},
  title = {Toward a better understanding of news user journeys: A Markov chain approach},
  journal = {Journalism Studies},
  year = {2020},
  volume = {tbd},
  pages = {tbd},
  doi = {tbd}
}
