alexworldd / netembs

Framework for Representation Learning on Financial Statement Networks

License: Apache License 2.0

Languages: Python 65.17%, Jupyter Notebook 34.83%
Topics: machine-learning, representation-learning, randomwalk, auditing, data-science, skip-gram

netembs's People

Contributors: alexworldd

Forkers: droiter

netembs's Issues

generation of noise

Hi Aleksei,

Just a question about the noise generation in the toy example. I see that you have core processes and then add random noise accounts. For example:

0.5 A + 0.49 B + 0.01 X -> C

My question: are the noise accounts (X in the example) unique, or can there be a second process where X is either part of the core or also participates as noise?
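
To make the two readings concrete, here is a tiny hypothetical sketch (not your actual generator) of the difference I mean:

import random

CORE_ACCOUNTS = ["A", "B", "C"]            # accounts used by the core processes
NOISE_POOL = ["X", "Y", "Z"]               # case 1: noise accounts disjoint from the core
ALL_ACCOUNTS = CORE_ACCOUNTS + NOISE_POOL  # case 2: noise drawn from the full pool

def pick_noise_account(unique_noise=True):
    # Case 1: X can never appear inside a core process.
    # Case 2: X may be core in one process and noise in another.
    pool = NOISE_POOL if unique_noise else ALL_ACCOUNTS
    return random.choice(pool)

print(pick_noise_account(unique_noise=True))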

Thank you for clarifying my understanding.

Kind regards,

Marcel Boersma

suggestion

Hi Aleksei,

Currently I see that my data is somewhat noisy, in the following form:

0.1 A + 0.2 A + 0.3 A + 0.5 B -> C

if another process is:

0.09 A + 0.21 A + 0.3 A + 0.5 B -> C

then it is considered a unique process (this is correct). However, I was wondering what would happen if we slightly simplified the two records to:

0.5 A + 0.5 B -> 1 C

because currently the algorithm clusters the above two records, which is good, but I'm also curious to see what happens when we simplify the journal entry structure :)
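
As a minimal sketch of that simplification (hypothetical data; column names as in the loading-data snippet further down):

import pandas as pd

# One journal entry with account A split over several lines
records = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "FA_Name": ["A", "A", "A", "B"],
    "Credit": [0.1, 0.2, 0.3, 0.5],
})

# Collapse repeated accounts within an entry by summing their amounts:
# 0.1 A + 0.2 A + 0.3 A + 0.5 B  ->  0.6 A + 0.5 B
simplified = records.groupby(["ID", "FA_Name"], as_index=False)["Credit"].sum()
print(simplified)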

What do you think?

Kind regards,

Marcel

Images

Hi Aleksei,

I've tried some things with the plots and I think it has to do with the marker size. Resizing the markers from 150 to 5 already produced better plots.

We can play around with that to see what the right parameters are.
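
For reference, marker size in matplotlib scatter plots is the s parameter (in points squared); a minimal sketch of the change, with stand-in coordinates:

import numpy as np
import matplotlib.pyplot as plt

xs, ys = np.random.rand(2, 500)  # stand-in for the 2-D embedding coordinates
plt.scatter(xs, ys, s=5)  # s=150 gave cluttered plots; s=5 already looks much better
plt.show()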

Tensorflow

Hi Aleksei,

I couldn't find the TensorFlow code at first because in PyCharm the last cells of the notebook were not rendering correctly. Only when opening it in Jupyter Notebook could I see the TensorFlow code.

However, I ran into some problems:

Average loss at step  0 :  8.046344757080078
Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 145, in <module>
    run(graph, num_steps, skip_grams, 128)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 134, in run
    final_embeddings = normalized_embeddings.eval()
NameError: name 'normalized_embeddings' is not defined
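
My guess at the cause: normalized_embeddings is defined in a notebook cell that analysisMB.py never runs. A minimal TF1 sketch of the usual word2vec-style definition (the sizes and the embeddings variable are stand-ins, not the repo's actual names):

import tensorflow as tf

vocabulary_size, embedding_size = 1000, 128  # stand-in sizes
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# L2-normalize so that cosine similarity between rows reduces to a dot product
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    final_embeddings = normalized_embeddings.eval()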

Furthermore, I would like to render a simple t-SNE plot. Could you combine this with the following?


import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embed_mat and int_to_vocab come from the notebook context:
# the learned embedding matrix and the index -> BP-node decoder
viz_words = 500
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])

fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)
plt.show()

Next steps

  • Start clustering over the embedding space (see the sketch below)
  • Parallel implementation of RandomWalks
  • Think about what we can do with these embeddings
  • Word clouds
  • Run on real data with different parameters
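
A minimal sketch for the clustering item above, with stand-in data (in our case final_embeddings would be the normalized embeddings from the TensorFlow notebook):

import numpy as np
from sklearn.cluster import KMeans

final_embeddings = np.random.rand(200, 32)  # stand-in for the learned BP embeddings

kmeans = KMeans(n_clusters=8, random_state=42).fit(final_embeddings)
labels = kmeans.labels_  # one cluster id per BP node, in encoder order
print(np.bincount(labels))  # cluster sizes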

parallel implementation

Hi Aleksei,

I've started on a parallel implementation; it somewhat works but definitely requires additional debugging.

Here is the code snippet:

# encoding: utf-8
__author__ = 'Aleksei Maliutin'
"""
utils.py
Created by lex at 2019-03-15.
"""
import numpy as np
from scipy.special import softmax
import random
import networkx as nx
from networkx.algorithms import bipartite
from NetEmbs.CONFIG import *
from collections import Counter
import pandas as pd
from NetEmbs.FSN.graph import FSN
import logging
from NetEmbs.CONFIG import LOG
np.seterr(all="raise")

from multiprocessing import Process, Queue


def default_step(G, vertex, direction="IN", mode=0, return_full_step=False, debug=False):
    """
     One step according to the original implementation of RandomWalk by Perozzi et al.
     (uniform probabilities, follows the same direction)
    :param G: graph/network on which step should be done
    :param vertex: current vertex
    :param direction: the direction of step: IN or OUT
    :param mode: use the edge's weight for transition probability
    :param return_full_step: if True, then step includes intermediate node of FA type
    :param debug: print intermediate stages
    :return: next step if succeeded or -1 if failed
    """
    if vertex in [-1, -2, -3]:
        #         Step cannot be made, return -1
        return vertex
    elif not G.has_node(vertex):
        raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    if mode in [0, 1]:
        pass
    else:
        raise ValueError(
            "For DefaultStep only two modes available: 0 (uniform) or 1(weighted) byt given {!r}!".format(mode))
    # Get the neighborhood of current node regard the chosen direction
    if direction == "IN":
        ins = G.in_edges(vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(vertex, data=True)
    else:
        raise ValueError("Wrong direction argument! {!s} used while IN or OUT are allowed!".format(direction))
    output = list()
    indexes = ["IN", "OUT"]
    # Check that we can make step, otherwise return special value -1
    if len(ins) > 0:
        # Apply weighted probabilities
        if mode == 1:
            ws = [edge[-1]["weight"] for edge in ins]
            p_ws = ws / np.sum(ws)
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
            tmp_vertex = ins[tmp_idx]
            tmp_weight = ws[tmp_idx]
        #     Apply uniform probabilities
        elif mode == 0:
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_vertex = np.random.choice(ins)
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
    else:
        return -1
    # ///////////// \\\\\\\\\\\\\\\
    #     Second sub-step, from FA to BP
    if direction == "IN":
        ins = G.in_edges(tmp_vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(tmp_vertex, data=True)
    else:
        raise ValueError("Wrong direction argument! {!s} used while IN or OUT are allowed!".format(direction))
    # Check that we can make step, otherwise return special value -1
    if len(ins) > 0:
        if mode == 1:
            ws = [edge[-1]["weight"] for edge in ins]
            p_ws = ws / np.sum(ws)
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
            tmp_vertex = ins[tmp_idx]
            tmp_weight = ws[tmp_idx]
        elif mode == 0:
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_vertex = np.random.choice(ins)
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
        if return_full_step:
            return output
        else:
            return output[-1]
    else:
        return -1


def diff_function(prev_edge, new_edges, pressure):
    """
    Function for calculating transition probabilities based on the differences between the previous edge and candidate edges
    :param prev_edge: Monetary amount on the previous edge
    :param new_edges: Monetary amounts on all candidate edges
    :param pressure: The regularization term; higher pressure leads to a more peaked distribution
    :return: array of transition probabilities
    """
    return softmax((1.0 - abs(new_edges - prev_edge)) * pressure)
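
# Worked example (sketch): with prev_edge=0.5, new_edges=np.array([0.5, 0.1]) and
# pressure=20, the scores are (1 - abs(new_edges - prev_edge)) * pressure = [20., 12.],
# so diff_function returns softmax([20, 12]) ~= [0.9997, 0.0003]: the walk strongly
# prefers the edge whose monetary amount matches the previous one.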

def make_pairs(sampled_seq, window=3, debug=False):
    """
    Helper function for constructing pairs from a sequence of nodes with a given window size
    :param sampled_seq: Original sequence of nodes (output of the RandomWalk procedure)
    :param window: window size, i.e. how many predecessors and successors to take into account
    :param debug: print intermediate stages
    :return: list of (center, context) node pairs
    """
    if debug:
        print(sampled_seq)
    output = list()
    for cur_idx in range(len(sampled_seq)):
        for drift in range(max(0, cur_idx - window), min(cur_idx + window + 1, len(sampled_seq))):
            if drift != cur_idx:
                output.append((sampled_seq[cur_idx], sampled_seq[drift]))
    if len(output) < 2 and debug:
        print(output)
    return output
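
# Example: make_pairs([1, 2, 3], window=1) yields
# [(1, 2), (2, 1), (2, 3), (3, 2)] -- each node is paired with its
# context within the window, self-pairs excluded.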


def step(G, vertex, direction="IN", mode=2, allow_back=True, return_full_step=False, pressure=20, debug=False):
    """
     Meta-Random step with changing direction.
    :param G: graph/network on which step should be done
    :param vertex: current vertex
    :param direction: the initial direction of step: IN or OUT
    :param mode: use the edge's weight for transition probability or difference between weights
    :param allow_back: if True, the walk may immediately return to the same BP. TODO check whether this is needed
    :param return_full_step: if True, then step includes intermediate node of FA type
    :param pressure: The regularization term, higher pressure leads to more strict function
    :param debug: print intermediate stages
    :return: next step if succeeded or -1 if failed
    """
    # ////// THE FIRST STEP TO OPPOSITE SET OF NODES \\\\\
    if vertex in [-1, -2, -3]:
        #         Step cannot be made, return -1
        return vertex
    elif not G.has_node(vertex):
        raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    if direction == "IN":
        ins = G.in_edges(vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(vertex, data=True)
    else:
        raise ValueError("Wrong direction argument! {!s} used while IN or OUT are allowed!".format(direction))
    output = list()
    mask = {"IN": "OUT", "OUT": "IN"}
    indexes = ["IN", "OUT"]
    if len(ins) > 0:
        ws = [edge[-1]["weight"] for edge in ins]
        p_ws = ws / np.sum(ws)
        ins = [edge[indexes.index(direction)] for edge in ins]
        if mode == 0:
            tmp_idx = np.random.choice(range(len(ins)))
        else:
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
        tmp_vertex = ins[tmp_idx]
        tmp_weight = ws[tmp_idx]
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
    else:
        return -1
    # ////// THE SECOND STEP TO OPPOSITE SET OF NODES (to original one) \\\\\
    if mask[direction] == "IN":
        outs = G.in_edges(tmp_vertex, data=True)
    elif mask[direction] == "OUT":
        outs = G.out_edges(tmp_vertex, data=True)
    if len(outs) > 0:
        ws = [edge[-1]["weight"] for edge in outs]
        outs = [edge[indexes.index(mask[direction])] for edge in outs]
        if not allow_back:
            rm_idx = outs.index(vertex)
            ws.pop(rm_idx)
            outs.pop(rm_idx)
        if len(outs) == 0:
            return -3
        ws = np.array(ws)
        probas = None
        try:
            if mode == 2:
                # Transition probability depends on the difference between monetary flows
                probas = diff_function(tmp_weight, ws, pressure)
                if debug:
                    print(list(zip(outs, ws)))
                tmp_vertex = np.random.choice(outs, p=probas)
                output.append(tmp_vertex)
            elif mode == 1:
                # Transition probability depends on the monetary flows - "rich gets richer"
                probas = ws / np.sum(ws)
                if debug:
                    print(list(zip(outs, ws)))
                tmp_vertex = np.random.choice(outs, p=probas)
                output.append(tmp_vertex)
            elif mode == 0:
                # Transition probability is uniform
                if debug:
                    print(outs)
                tmp_vertex = np.random.choice(outs)
                output.append(tmp_vertex)
        except Exception as e:
            if LOG:
                snapshot = {"CurrentNode": tmp_vertex, "CurrentWeight": tmp_weight,
                            "NextCandidates": list(zip(outs, ws)), "Probas": probas}
                local_logger = logging.getLogger("NetEmbs.Utils.step")
                local_logger.error("Fatal ValueError during step", exc_info=True)
                local_logger.info("Snapshot" + str(snapshot))
            # Terminate the walk instead of silently returning the intermediate FA node
            return -1
        #     Return next vertex here
        if return_full_step:
            return output
        else:
            return output[-1]
    else:
        return -2


def randomWalk(G, vertex=None, length=3, direction="IN", version="MetaDiff", return_full_path=False, debug=False):
    """
    RandomWalk function for sampling the sequence of nodes from given graph and initial node
    :param G: Bipartite graph, an instance of networkx
    :param vertex: initial node
    :param length: the maximum length of RandomWalk
    :param direction: The direction of walking. IN - go via source financial accounts, OUT - go via target financial accounts
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param return_full_path: If True, return the full path with FA nodes
    :param debug: Debug boolean flag, print intermediate steps
    :return: Sampled sequence of nodes
    """
    if version not in STEPS_VERSIONS:
        raise ValueError(
            "Given not supported step version {!s}!".format(version) + "\nAllowed only " + str(STEPS_VERSIONS))
    context = list()
    if vertex is None:
        context.append(random.choice(list(G.nodes)))
    else:
        context.append(vertex)
    cur_v = context[-1]
    mask = {"IN": "OUT", "OUT": "IN"}
    cur_direction = "IN"
    while len(context) < length + 1:
        try:
            if version == "DefUniform":
                new_v = default_step(G, cur_v, direction, mode=0, return_full_step=return_full_path, debug=debug)
            elif version == "DefWeighted":
                new_v = default_step(G, cur_v, direction, mode=1, return_full_step=return_full_path, debug=debug)
            elif version == "MetaUniform":
                new_v = step(G, cur_v, direction, mode=0, return_full_step=return_full_path, debug=debug)
            elif version == "MetaWeighted":
                new_v = step(G, cur_v, direction, mode=1, return_full_step=return_full_path, debug=debug)
            elif version == "MetaDiff":
                if direction == "COMBI":
                    new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
                    cur_direction = mask[cur_direction]
                else:
                    new_v = step(G, cur_v, direction, mode=2, return_full_step=return_full_path, debug=debug)
        except nx.NetworkXError:
            # TODO modify to more robust behaviour
            break
        if new_v in [-1, -2, -3]:
            if debug: print("Cannot continue walking... Termination.")
            break
        if return_full_path:
            if isinstance(new_v, list):
                context.extend(new_v)
            else:
                context.append(new_v)
        else:
            context.append(new_v)
        cur_v = context[-1]
    return context


def get_pairs(fsn, version="MetaDiff", walk_length=10, walks_per_node=10, direction="ALL", drop_duplicates=True):
    """
    Constructing pairs (skip-grams) of nodes from sampled sequences
    :param fsn: the FSN under study
    :param version: version of the step method:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param walk_length: max length of RandomWalk
    :param walks_per_node: max number of RandomWalks per each node in FSN
    :param direction: initial direction
    :param drop_duplicates: if True, delete pairs with equal elements
    :return: array of pairs (joint appearances of two BP nodes)
    """
    # TODO implement parallel version!
    if direction not in ["ALL", "IN", "OUT", "COMBI"]:
        raise ValueError(
            "Given not supported direction of walking {!s}!".format(direction) + "\nAllowed only " + str(
                ["ALL", "IN", "OUT", "COMBI"]))
    if direction == "ALL":
        #     Apply RandomWalk for both IN and OUT direction
        pairs = [make_pairs(randomWalk(fsn, node, walk_length, direction="IN", version=version)) for _ in
                 range(walks_per_node) for node
                 in fsn.get_BP()] + [make_pairs(randomWalk(fsn, node, walk_length, direction="OUT", version=version))
                                     for _
                                     in
                                     range(walks_per_node) for node
                                     in fsn.get_BP()]
    elif direction == "IN":
        pairs = [make_pairs(randomWalk(fsn, node, walk_length, direction=direction, version=version)) for _ in
                 range(walks_per_node) for node
                 in fsn.get_BP()]
    elif direction == "OUT":
        pairs = [make_pairs(randomWalk(fsn, node, walk_length, direction=direction, version=version)) for _ in
                 range(walks_per_node) for node
                 in fsn.get_BP()]
    elif direction == "COMBI":
        print("Start multi-core Random-Walks")
        processes = []
        all_bps = fsn.get_BP()
        processes_count = 4
        chunks = np.array_split(all_bps, processes_count)
        q = Queue()

        for i in range(processes_count):
            p = Process(target=rwWrapper, args=(walks_per_node, fsn, walk_length, direction, version, q, chunks[i]))
            processes.append(p)
            p.start()

        pairs = []
        # Drain the queue BEFORE joining: each worker puts exactly one flat list
        # of pairs, and join() would deadlock while results were still buffered.
        for _ in range(processes_count):
            pairs.append(q.get())

        for process in processes:
            process.join()

        q.close()
        q.join_thread()

    if drop_duplicates:
        pairs = [item for sublist in pairs for item in sublist if item[0] != item[1]]
    else:
        pairs = [item for sublist in pairs for item in sublist]
    return pairs

def rwWrapper(walks_per_node, fsn, walk_length, direction, version, q, nodes):
    """Worker process: sample walks for a chunk of nodes and put ONE flat list of pairs on the queue."""
    result = []
    for node in nodes:
        for _ in range(walks_per_node):
            result.extend(make_pairs(randomWalk(fsn, node, walk_length, direction=direction, version=version)))
    q.put(result)


def get_top_similar(all_pairs, top=3, as_DataFrame=True, sort_ids=True, title="Similar_BP"):
    """
    Helper function for counting joint appearance of nodes and returning top N
    :param all_pairs: all found pairs
    :param top: required number of top elements for each node
    :param as_DataFrame: convert output to DataFrame
    :param sort_ids: Sort output DataFrame w.r.t. ID column
    :param title: title of column in returned DataFrame
    :return: dictionary with node number as a key and values as list[node, cnt]
    """
    per_node = {item[0]: list() for item in all_pairs}
    output_top = dict()
    for item in all_pairs:
        per_node[item[0]].append(item[1])
    for key, data in per_node.items():
        output_top[key] = Counter(data).most_common(top)
    if as_DataFrame:
        if sort_ids:
            return pd.DataFrame(output_top.items(), columns=["ID", title]).sort_values(by=["ID"])
        else:
            return pd.DataFrame(output_top.items(), columns=["ID", title])
    else:
        return output_top


def get_SkipGrams(df, version="MetaDiff", walk_length=10, walks_per_node=10, direction="COMBI"):
    """
    Get Skip-Grams for given DataFrame with Entries records
    :param df: original DataFrame
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param walk_length: max length of RandomWalk
    :param walks_per_node: max number of RandomWalks per each node in FSN
    :param direction: initial direction
    :return: list of all pairs
    :return fsn: FSN class instance for given DataFrame
    :return tr: Encoder/Decoder for given DataFrame
    """
    fsn = FSN()
    fsn.build(df, name_column="FA_Name")
    tr = TransformationBPs(fsn.get_BP())
    return tr.encode_pairs(get_pairs(fsn, version, walk_length, walks_per_node, direction)), fsn, tr


class TransformationBPs:
    """
    Encode/Decode original BP nodes number to/from sequential integers for TensorFlow
    """

    def __init__(self, original_bps):
        self.len = len(original_bps)
        self.original_bps = original_bps
        self._enc_dec()

    def _enc_dec(self):
        self.encoder = dict(list(zip(self.original_bps, range(self.len))))
        self.decoder = dict(list(zip(range(self.len), self.original_bps)))

    def encode(self, original_seq):
        return [self.encoder[item] for item in original_seq]

    def decode(self, seq):
        return [self.decoder[item] for item in seq]

    def encode_pairs(self, original_pairs):
        return [(self.encoder[item[0]], self.encoder[item[1]]) for item in original_pairs]


def find_similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction="IN",
                 column_title="Similar_BP"):
    fsn = FSN()
    fsn.build(df, name_column="FA_Name")
    if LOG:
        local_logger = logging.getLogger("NetEmbs.Utils.find_similar")
    if not isinstance(version, list) and not isinstance(direction, list):
        pairs = get_pairs(fsn, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                          direction=direction)
        return get_top_similar(pairs, top=top_n, title=column_title)
    else:
        #         Multiple parameters, build grid over them
        if not isinstance(version, list) and isinstance(version, str):
            version = [version]
        if not isinstance(direction, list) and isinstance(direction, str):
            direction = [direction]
        #             All possible combinations:
        _first = True
        for ver in version:
            for _dir in direction:
                if LOG:
                    local_logger.info("Current arguments are " + ver + " and " + _dir)
                if _first:
                    _first = False
                    output_df = get_top_similar(
                        get_pairs(fsn, version=ver, walk_length=walk_length, walks_per_node=walks_per_node,
                                  direction=_dir), top=top_n, title=str(ver + "_" + _dir))
                else:
                    output_df[str(ver + "_" + _dir)] = get_top_similar(
                        get_pairs(fsn, version=ver, walk_length=walk_length, walks_per_node=walks_per_node,
                                  direction=_dir), top=top_n, title=str(ver + "_" + _dir))[str(ver + "_" + _dir)]
        return output_df


def add_similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction="IN"):
    """
    Adding "similar" BP
    :param df: original DataFrame
    :param top_n: the number of BP to store
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param walk_length: max length of RandomWalk
    :param walks_per_node: max number of RandomWalks per each node in FSN
    :param direction: initial direction
    :return: original DataFrame with Similar_BP column
    """
    return df.merge(
        find_similar(df, top_n=top_n, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                     direction=direction),
        on="ID", how="left")


def get_JournalEntries(df):
    """
    Helper function for extracting Journal Entries from an Entry Records DataFrame
    :param df: Original DataFrame with Entry Records
    :return: Journal Entries DataFrame, each row is a separate business process
    """
    if "Signature" not in list(df):
        from NetEmbs.DataProcessing.unique_signatures import unique_BPs
        df = unique_BPs(df)
    return df[["ID", "Signature"]].drop_duplicates("ID")


# Module-level decoder: populated by similar() and read by decode_row()
journal_decoder = None


def decode_row(row):
    global journal_decoder
    output = dict()
    output["ID"] = row["ID"]
    output["Signature"] = row["Signature"]
    for cur_title in row.index[2:]:  # skip the ID and Signature columns
        cur_row_decoded = list()
        if row[cur_title] == -1.0:
            output[cur_title] = None
        else:
            for item in row[cur_title]:
                cur_row_decoded.append(journal_decoder[item[0]])
                cur_row_decoded.append("---------")
            output[cur_title] = cur_row_decoded

    return pd.Series(output)


def similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction=["IN", "ALL", "COMBI"]):
    """
    Finding "similar" BP
    :param df: original DataFrame
    :param top_n: the number of BP to store
    :param version: Version of step:
    "DefUniform" - Pure RandomWalk (uniform probabilities, follows the direction),
    "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
    "MetaUniform" - Default Metapath-version (uniform probabilities, change directions),
    "MetaWeighted" - Weighted Metapath version (weighted probabilities "rich gets richer", change directions),
    "MetaDiff" - Modified Metapath version (probabilities depend on the differences between edges, change directions)
    :param walk_length: max length of RandomWalk
    :param walks_per_node: max number of RandomWalks per each node in FSN
    :param direction: initial direction
    :return: original DataFrame with Similar_BP column
    """
    global journal_decoder
    if LOG:
        local_logger = logging.getLogger("NetEmbs.Utils.Similar")
        local_logger.info("Given directions are " + str(direction))
        local_logger.info("Given versions are " + str(version))
    journal_entries = get_JournalEntries(df)

    if LOG:
        local_logger.info("Journal entries have been extracted!")
    journal_decoder = journal_entries.set_index("ID").to_dict()["Signature"]
    print("Done with extraction Journal Entries data!")
    output = find_similar(df, top_n=top_n, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                          direction=direction)
    print("Done with RandomWalking... Found ", str(top_n), " top")
    journal_entries = journal_entries.merge(output,
                                            on="ID", how="left")
    journal_entries.fillna(-1.0, inplace=True)
    res = journal_entries.apply(decode_row, axis=1)
    return res

evaluation of embeddings

When we have good embeddings, they should yield useful clusters. In one paper (Zhang et al., "Learning Node Embeddings in Interaction Graphs") I found the following paragraph describing how we can evaluate the performance of the clusters:

Clustering. We first use K-Means to test embeddings on the unsupervised task. We use the Normalized Mutual Information (NMI) [23] score to evaluate clustering results. The NMI score is between 0 and 1. The larger the value, the better the performance. A labeling will have score 1 if it matches the ground truth perfectly, and 0 if it is completely random. Since entities in the Yelp dataset are multi-labeled, we ignore the entities that belong to multiple categories when calculating the NMI score.

With our toy set we can create ground-truth labels and evaluate the embedding technique. We can even compare this with directly applying other techniques (metapath2vec, DeepWalk, etc.).

For the real datasets no ground truth is known, hence we must evaluate them in a different way.
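
A minimal sketch of that evaluation, with synthetic stand-ins for the embeddings and the toy-set ground-truth labels:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Stand-ins: three well-separated "process" clusters in embedding space
rng = np.random.RandomState(42)
embeddings = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 32)) for c in (0.0, 1.0, 2.0)])
true_labels = np.repeat([0, 1, 2], 50)

pred_labels = KMeans(n_clusters=3, random_state=42).fit_predict(embeddings)
print(normalized_mutual_info_score(true_labels, pred_labels))  # 1.0 = perfect, ~0.0 = random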

Cached files

Hi Aleksei,

The cached-files feature is awesome! However, if the directory doesn't exist yet, it crashes:

Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-12/b_experiments/experiment.py", line 21, in <module>
    embds = get_embs_TF(df, embed_size = 2, walks_per_node = 2, num_steps=200, use_cached_skip_grams= False)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-12/NetEmbs/SkipGram/tensor_flow.py", line 233, in get_embs_TF
    pd.DataFrame(embs).to_pickle(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/snapshot.pkl")
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 2593, in to_pickle
    protocol=protocol)
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/io/pickle.py", line 73, in to_pickle
    is_text=False)
  File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/io/common.py", line 430, in _get_handle
    f = open(path_or_buf, mode)
FileNotFoundError: [Errno 2] No such file or directory: '2_walks30_pressure30_window3/TFsteps200000batch64_emb32/cache/snapshot.pkl'

I added a couple of lines so that it creates the directory when it is not found; this seems to be working:

in utils.py I added:

        skip_gr = tr.encode_pairs(get_pairs(N_JOBS, version, walk_length, walks_per_node, direction))
        if not os.path.exists(WORK_FOLDER[0]):
            os.makedirs(WORK_FOLDER[0])
        with open(WORK_FOLDER[0] + "skip_grams_cached.pkl", "wb") as file:
            pickle.dump(skip_gr, file)


in tensor_flow.py:


    if not os.path.exists(WORK_FOLDER[0] + WORK_FOLDER[1] + 'cache/'):
        os.makedirs(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/")
    pd.DataFrame(embs).to_pickle(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/snapshot.pkl")

such that it creates a cache folder.
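
For what it's worth, on Python 3.2+ the existence check can be folded into the call itself:

import os

# exist_ok=True makes makedirs a no-op when the directory already exists,
# which also avoids the check-then-create race
os.makedirs(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/", exist_ok=True)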

loading data

Please change the cell:

from NetEmbs.DataProcessing import *
YOUR_DATAFRAME = None
if YOUR_DATAFRAME is None:
    d = prepare_data(d)
else:
    d = prepare_data(rename_columns(YOUR_DATAFRAME), split=False)
d.head(20)

to

from NetEmbs.DataProcessing import *
df = pd.DataFrame.from_dict(journal_entries)
df.columns = ['ID', 'FA_Name', 'Debit', 'Credit']
YOUR_DATAFRAME = df
if YOUR_DATAFRAME is None:
    d = prepare_data(d)
else:
    d = prepare_data(YOUR_DATAFRAME, split=False)
d.head(20)

When I load the data frame it doesn't contain any column names, so the renaming in rename_columns doesn't work as intended. The new code snippet fixes this.

transition probabilities zero

Hi Aleksei,

I'm running it on multiple datasets and sometimes I get the following error:

Traceback (most recent call last):
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 52, in <module>
    simdata = similar(d, direction=["COMBI"])
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 502, in similar
    direction=direction)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 418, in find_similar
    direction=_dir), top=top_n, title=str(ver + "_" + _dir))
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 318, in get_pairs
    range(walks_per_node) for node
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 319, in <listcomp>
    in fsn.get_BP()]
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 257, in randomWalk
    new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
  File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 193, in step
    tmp_vertex = np.random.choice(outs, p=ws)
  File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN

Process finished with exit code 1

I haven't found the cause of this; for some datasets it fails, for some it passes. The failure rate for now is approximately 50/50...

I will continue with the analysis and run it on examples where it does work; the results so far are good. I find interesting groups of transactions, although I'm currently only studying the COMBI approach because the others seem to give weird results.
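
One hedged guess (not verified against the failing datasets): if the candidate edge weights at some step sum to zero, or contain NaN from the source data, ws / np.sum(ws) yields NaN probabilities. A small guard around the locals ws and outs in step(), before the np.random.choice call, would at least make the failure explicit:

import numpy as np

probas = np.asarray(ws, dtype=float) / np.sum(ws)
if not np.all(np.isfinite(probas)):
    # Surface the offending weights instead of failing inside np.random.choice
    raise ValueError("Non-finite transition probabilities; edge weights were: {}".format(list(ws)))
tmp_vertex = np.random.choice(outs, p=probas)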

bug random walk

When I run:

from NetEmbs.FSN import *
randomWalk(fsn, 1, length=10, direction="COMBI")

I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-95ca35af2a0e> in <module>()
      1 from NetEmbs.FSN import *
----> 2 randomWalk(fsn, 1, length=10, direction="COMBI")

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in randomWalk(G, vertex, length, direction, version, return_full_path, debug)
    255             elif version == "MetaDiff":
    256                 if direction is "COMBI":
--> 257                     new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
    258                     cur_direction = mask[cur_direction]
    259                 else:

/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in step(G, vertex, direction, mode, allow_back, return_full_step, pressure, debug)
    146         return vertex
    147     elif not G.has_node(vertex):
--> 148         raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    149     if direction == "IN":
    150         ins = G.in_edges(vertex, data=True)

ValueError: Vertex 1 is not in FSN!

Any idea how I can fix this? Do you need more info? If so, please let me know.
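
A hedged workaround, assuming the FSN's BP labels simply don't include the integer 1: start from a node that actually exists, or pass vertex=None so randomWalk picks a random start node itself:

import random
from NetEmbs.FSN import *

# Start from an existing BP node instead of the hard-coded label 1
start = random.choice(list(fsn.get_BP()))
randomWalk(fsn, start, length=10, direction="COMBI")

# Or let randomWalk choose: vertex=None picks a random node from the whole graph
randomWalk(fsn, None, length=10, direction="COMBI")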
