alexworldd / netembs
Framework for Representation Learning on Financial Statement Networks
License: Apache License 2.0
Hi Aleksei,
Just a question about the noise generation in the toy example. I see that you have core processes and then add random noise accounts, for example:
0.5 A + 0.49 B + 0.01 X -> C
My question is: are the noise accounts (X in the example) unique, or can it be that I have a second process where X is either part of the core or also participates as noise?
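For reference, the two interpretations can be sketched as follows. This is only an illustration, not the actual toy-example generator; the account names, the noise fraction, and the pools are all made up:

```python
import random

CORE = ["A", "B"]        # hypothetical core accounts of this process
NOISE_POOL = ["X", "Y"]  # hypothetical noise-only accounts

def make_entry(noise_frac=0.01, shared_noise=False):
    """Build one toy journal entry: fixed core weights plus one small noise account.

    shared_noise=False -> noise drawn from a dedicated pool (X is never core);
    shared_noise=True  -> noise drawn from all accounts, so a core account of
                          another process can reappear here as noise.
    """
    pool = CORE + NOISE_POOL if shared_noise else NOISE_POOL
    noise_acc = random.choice(pool)
    # 0.5 + 0.49 + 0.01 = 1.0, so the entry stays balanced
    return [("A", 0.5), ("B", 1.0 - 0.5 - noise_frac), (noise_acc, noise_frac)]
```

The question above is essentially which of the two `shared_noise` settings the toy generator uses.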
Thank you for clarifying my understanding.
Kind regards,
Marcel Boersma
Hi Aleksei,
Currently I see that my data is somewhat noisy, in the following form:
0.1A + 0.2A + 0.3A + 0.5B -> C
If another process is:
0.09A + 0.21A + 0.3A + 0.5B -> C
then it is considered a unique process (which is correct). However, I was wondering what would happen if we slightly simplified the two records to:
0.5 A + 0.5B -> 1C
because currently the algorithm clusters the above two records together, which is good, but I'm also curious to see what happens when we simplify the journal entry structure :)
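One possible way to test this (a sketch, not part of NetEmbs): aggregate repeated accounts within one record and round the weights before building the network, so that both noisy records above map to the same simplified signature.

```python
from collections import defaultdict

def simplify_entry(entry, ndigits=1):
    """Aggregate repeated accounts and round their weights, e.g.
    [("A", 0.1), ("A", 0.2), ("A", 0.3), ("B", 0.5)] -> [("A", 0.6), ("B", 0.5)]."""
    totals = defaultdict(float)
    for account, weight in entry:
        totals[account] += weight
    return sorted((acc, round(w, ndigits)) for acc, w in totals.items())

noisy_1 = [("A", 0.1), ("A", 0.2), ("A", 0.3), ("B", 0.5)]
noisy_2 = [("A", 0.09), ("A", 0.21), ("A", 0.3), ("B", 0.5)]
print(simplify_entry(noisy_1))  # -> [('A', 0.6), ('B', 0.5)]
```

With this preprocessing the two records become identical, so the question reduces to whether the clustering still behaves well on the simplified structure.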
What do you think?
Kind regards,
Marcel
Hi aleksei,
I've tried some things with the plots and I think it has to do with the marker size. Resizing it from 150 to 5 already produced better plots.
We can play around with that to see what the right parameters are.
Hi Aleksei,
I couldn't find the TensorFlow code at first because in PyCharm the last cell of the notebook was not rendering correctly. Only when opening it in Jupyter Notebook could I see the TensorFlow code.
However, I ran into some problems:
Average loss at step 0 : 8.046344757080078
Traceback (most recent call last):
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 145, in <module>
run(graph, num_steps, skip_grams, 128)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 134, in run
final_embeddings = normalized_embeddings.eval()
NameError: name 'normalized_embeddings' is not defined
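For context, in the standard TensorFlow word2vec tutorial `normalized_embeddings` is the row-wise L2-normalised embedding matrix, defined inside the graph before `eval()` is called; if the cell that defines it was skipped (e.g. because PyCharm did not render it), the name is undefined. A NumPy sketch of the same operation, with illustrative names:

```python
import numpy as np

def normalize_rows(embeddings):
    """L2-normalise each embedding vector (row), as normalized_embeddings
    does in the TensorFlow word2vec tutorial graph."""
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    return embeddings / norm

emb = np.array([[3.0, 4.0], [0.0, 2.0]])
normalize_rows(emb)  # rows now have unit length: [[0.6, 0.8], [0.0, 1.0]]
```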
Furthermore, I would like to render a simple t-SNE plot. Could you combine this with:
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

viz_words = 500
tsne = TSNE()
embed_tsne = tsne.fit_transform(embed_mat[:viz_words, :])
fig, ax = plt.subplots(figsize=(14, 14))
for idx in range(viz_words):
    plt.scatter(*embed_tsne[idx, :], color='steelblue')
    plt.annotate(int_to_vocab[idx], (embed_tsne[idx, 0], embed_tsne[idx, 1]), alpha=0.7)
Hi Aleksei,
I've started on a parallel implementation; it somewhat works but definitely requires additional debugging. Here is the code snippet:
# encoding: utf-8
__author__ = 'Aleksei Maliutin'
"""
utils.py
Created by lex at 2019-03-15.
"""
import numpy as np
from scipy.special import softmax
import random
import networkx as nx
from networkx.algorithms import bipartite
from NetEmbs.CONFIG import *
from collections import Counter
import pandas as pd
from NetEmbs.FSN.graph import FSN
import logging
from NetEmbs.CONFIG import LOG
from multiprocessing import Process, Queue

np.seterr(all="raise")
def default_step(G, vertex, direction="IN", mode=0, return_full_step=False, debug=False):
    """
    One step according to the original implementation of RandomWalk by Perozzi et al.
    (uniform probabilities, follows the same direction)
    :param G: graph/network on which the step should be made
    :param vertex: current vertex
    :param direction: the direction of the step: IN or OUT
    :param mode: use the edge's weight for the transition probability
    :param return_full_step: if True, the step includes the intermediate node of FA type
    :param debug: print intermediate stages
    :return: next step if succeeded or -1 if failed
    """
    if vertex in [-1, -2, -3]:
        # Step cannot be made, return the special value as-is
        return vertex
    elif not G.has_node(vertex):
        raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    if mode not in [0, 1]:
        raise ValueError(
            "For DefaultStep only two modes are available: 0 (uniform) or 1 (weighted), but {!r} was given!".format(mode))
    # Get the neighbourhood of the current node with regard to the chosen direction
    if direction == "IN":
        ins = G.in_edges(vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(vertex, data=True)
    else:
        raise ValueError("Wrong direction argument! {!s} used while only IN or OUT are allowed!".format(direction))
    output = list()
    indexes = ["IN", "OUT"]
    # Check that we can make a step, otherwise return the special value -1
    if len(ins) > 0:
        # Apply weighted probabilities
        if mode == 1:
            ws = [edge[-1]["weight"] for edge in ins]
            p_ws = ws / np.sum(ws)
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
            tmp_vertex = ins[tmp_idx]
            tmp_weight = ws[tmp_idx]
        # Apply uniform probabilities
        elif mode == 0:
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_vertex = np.random.choice(ins)
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
    else:
        return -1
    # ///////////// \\\\\\\\\\\\\\\
    # Second sub-step, from FA to BP
    if direction == "IN":
        ins = G.in_edges(tmp_vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(tmp_vertex, data=True)
    # Check that we can make a step, otherwise return the special value -1
    if len(ins) > 0:
        if mode == 1:
            ws = [edge[-1]["weight"] for edge in ins]
            p_ws = ws / np.sum(ws)
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
            tmp_vertex = ins[tmp_idx]
            tmp_weight = ws[tmp_idx]
        elif mode == 0:
            ins = [edge[indexes.index(direction)] for edge in ins]
            tmp_vertex = np.random.choice(ins)
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
        if return_full_step:
            return output
        else:
            return output[-1]
    else:
        return -1
def diff_function(prev_edge, new_edges, pressure):
    """
    Calculate transition probabilities based on the difference between the previous edge and each candidate edge
    :param prev_edge: monetary amount on the previous edge
    :param new_edges: monetary amounts on all candidate edges
    :param pressure: the regularisation term; higher pressure leads to a stricter function
    :return: array of transition probabilities
    """
    return softmax((1.0 - abs(new_edges - prev_edge)) * pressure)


def make_pairs(sampled_seq, window=3, debug=False):
    """
    Helper function for constructing pairs from a sequence of nodes with the given window size
    :param sampled_seq: original sequence of nodes (output of the RandomWalk procedure)
    :param window: window size, i.e. how many predecessors and successors are taken into account
    :param debug: print intermediate stages
    :return: list of (centre, context) pairs
    """
    if debug:
        print(sampled_seq)
    output = list()
    for cur_idx in range(len(sampled_seq)):
        for drift in range(max(0, cur_idx - window), min(cur_idx + window + 1, len(sampled_seq))):
            if drift != cur_idx:
                output.append((sampled_seq[cur_idx], sampled_seq[drift]))
    if len(output) < 2 and debug:
        print(output)
    return output
def step(G, vertex, direction="IN", mode=2, allow_back=True, return_full_step=False, pressure=20, debug=False):
    """
    Meta-random step with changing direction.
    :param G: graph/network on which the step should be made
    :param vertex: current vertex
    :param direction: the initial direction of the step: IN or OUT
    :param mode: use the edge's weight for the transition probability, or the difference between weights
    :param allow_back: if True, one can get a sequence of the same BPs. TODO check whether this is needed or can be deleted
    :param return_full_step: if True, the step includes the intermediate node of FA type
    :param pressure: the regularisation term; higher pressure leads to a stricter function
    :param debug: print intermediate stages
    :return: next step if succeeded or -1 if failed
    """
    # ////// THE FIRST STEP TO THE OPPOSITE SET OF NODES \\\\\
    if vertex in [-1, -2, -3]:
        # Step cannot be made, return the special value as-is
        return vertex
    elif not G.has_node(vertex):
        raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
    if direction == "IN":
        ins = G.in_edges(vertex, data=True)
    elif direction == "OUT":
        ins = G.out_edges(vertex, data=True)
    else:
        raise ValueError("Wrong direction argument! {!s} used while only IN or OUT are allowed!".format(direction))
    output = list()
    mask = {"IN": "OUT", "OUT": "IN"}
    indexes = ["IN", "OUT"]
    if len(ins) > 0:
        ws = [edge[-1]["weight"] for edge in ins]
        p_ws = ws / np.sum(ws)
        ins = [edge[indexes.index(direction)] for edge in ins]
        if mode == 0:
            tmp_idx = np.random.choice(range(len(ins)))
        else:
            tmp_idx = np.random.choice(range(len(ins)), p=p_ws)
        tmp_vertex = ins[tmp_idx]
        tmp_weight = ws[tmp_idx]
        if debug:
            print(tmp_vertex)
        output.append(tmp_vertex)
    else:
        return -1
    # ////// THE SECOND STEP TO THE OPPOSITE SET OF NODES (back to the original one) \\\\\
    if mask[direction] == "IN":
        outs = G.in_edges(tmp_vertex, data=True)
    elif mask[direction] == "OUT":
        outs = G.out_edges(tmp_vertex, data=True)
    if len(outs) > 0:
        ws = [edge[-1]["weight"] for edge in outs]
        outs = [edge[indexes.index(mask[direction])] for edge in outs]
        if not allow_back:
            rm_idx = outs.index(vertex)
            ws.pop(rm_idx)
            outs.pop(rm_idx)
            if len(outs) == 0:
                return -3
        ws = np.array(ws)
        probas = None
        try:
            if mode == 2:
                # Transition probability depends on the difference between monetary flows
                probas = diff_function(tmp_weight, ws, pressure)
                if debug:
                    print(list(zip(outs, ws)))
                tmp_vertex = np.random.choice(outs, p=probas)
                output.append(tmp_vertex)
            elif mode == 1:
                # Transition probability depends on the monetary flows - "rich gets richer"
                probas = ws / np.sum(ws)
                if debug:
                    print(list(zip(outs, ws)))
                tmp_vertex = np.random.choice(outs, p=probas)
                output.append(tmp_vertex)
            elif mode == 0:
                # Transition probability is uniform
                if debug:
                    print(outs)
                tmp_vertex = np.random.choice(outs)
                output.append(tmp_vertex)
        except Exception:
            if LOG:
                snapshot = {"CurrentNode": tmp_vertex, "CurrentWeight": tmp_weight,
                            "NextCandidates": list(zip(outs, ws)), "Probas": probas}
                local_logger = logging.getLogger("NetEmbs.Utils.step")
                local_logger.error("Fatal ValueError during step", exc_info=True)
                local_logger.info("Snapshot " + str(snapshot))
        # Return the next vertex here
        if return_full_step:
            return output
        else:
            return output[-1]
    else:
        return -2
def randomWalk(G, vertex=None, length=3, direction="IN", version="MetaDiff", return_full_path=False, debug=False):
    """
    RandomWalk function for sampling a sequence of nodes from the given graph and initial node
    :param G: bipartite graph, an instance of networkx
    :param vertex: initial node
    :param length: the maximum length of the RandomWalk
    :param direction: the direction of walking. IN - go via source financial accounts, OUT - go via target financial accounts
    :param version: version of the step:
        "DefUniform" - pure RandomWalk (uniform probabilities, follows the direction),
        "DefWeighted" - RandomWalk (weighted probabilities, follows the direction),
        "MetaUniform" - default Metapath version (uniform probabilities, changes directions),
        "MetaWeighted" - weighted Metapath version (weighted probabilities, "rich gets richer", changes directions),
        "MetaDiff" - modified Metapath version (probabilities depend on the differences between edges, changes directions)
    :param return_full_path: if True, return the full path including FA nodes
    :param debug: debug flag, print intermediate steps
    :return: sampled sequence of nodes
    """
    if version not in STEPS_VERSIONS:
        raise ValueError(
            "Given unsupported step version {!s}!".format(version) + "\nAllowed only " + str(STEPS_VERSIONS))
    context = list()
    if vertex is None:
        context.append(random.choice(list(G.nodes)))
    else:
        context.append(vertex)
    cur_v = context[-1]
    mask = {"IN": "OUT", "OUT": "IN"}
    cur_direction = "IN"
    while len(context) < length + 1:
        try:
            if version == "DefUniform":
                new_v = default_step(G, cur_v, direction, mode=0, return_full_step=return_full_path, debug=debug)
            elif version == "DefWeighted":
                new_v = default_step(G, cur_v, direction, mode=1, return_full_step=return_full_path, debug=debug)
            elif version == "MetaUniform":
                new_v = step(G, cur_v, direction, mode=0, return_full_step=return_full_path, debug=debug)
            elif version == "MetaWeighted":
                new_v = step(G, cur_v, direction, mode=1, return_full_step=return_full_path, debug=debug)
            elif version == "MetaDiff":
                # Use ==, not "is": identity comparison of strings is unreliable
                if direction == "COMBI":
                    new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
                    cur_direction = mask[cur_direction]
                else:
                    new_v = step(G, cur_v, direction, mode=2, return_full_step=return_full_path, debug=debug)
        except nx.NetworkXError:
            # TODO modify to more robust behaviour
            break
        if new_v == -1:
            if debug:
                print("Cannot continue walking... Termination.")
            break
        if return_full_path and isinstance(new_v, list):
            context.extend(new_v)
        else:
            context.append(new_v)
        cur_v = context[-1]
    return context
def get_pairs(fsn, version="MetaDiff", walk_length=10, walks_per_node=10, direction="ALL", drop_duplicates=True):
    """
    Construct pairs (skip-grams) of nodes from the sampled sequences
    :param fsn: the FSN under study
    :param version: which version of the step method to apply
        (see randomWalk for the list of supported versions)
    :param walk_length: max length of a RandomWalk
    :param walks_per_node: max number of RandomWalks per node in the FSN
    :param direction: initial direction
    :param drop_duplicates: if True, delete pairs with equal elements
    :return: array of pairs (joint appearances of two BP nodes)
    """
    if direction not in ["ALL", "IN", "OUT", "COMBI"]:
        raise ValueError(
            "Given not yet supported direction of walking {!s}!".format(direction) + "\nAllowed only " + str(
                ["ALL", "IN", "OUT", "COMBI"]))
    if direction == "ALL":
        # Apply RandomWalk for both the IN and the OUT direction
        pairs = [make_pairs(randomWalk(fsn, node, walk_length, direction="IN", version=version))
                 for _ in range(walks_per_node) for node in fsn.get_BP()] \
                + [make_pairs(randomWalk(fsn, node, walk_length, direction="OUT", version=version))
                   for _ in range(walks_per_node) for node in fsn.get_BP()]
    elif direction in ["IN", "OUT"]:
        pairs = [make_pairs(randomWalk(fsn, node, walk_length, direction=direction, version=version))
                 for _ in range(walks_per_node) for node in fsn.get_BP()]
    elif direction == "COMBI":
        print("Start multi-core Random-Walks")
        processes = []
        all_bps = fsn.get_BP()
        print("before chunks")
        processes_count = 4
        chunks = np.array_split(all_bps, processes_count)
        print("Chunks done")
        q = Queue()
        for i in range(processes_count):
            print("Process " + str(i) + " starting")
            p = Process(target=rwWrapper, args=(walks_per_node, fsn, walk_length, direction, version, q, chunks[i]))
            processes.append(p)
            p.start()
        pairs = []
        print("Waiting for results")
        # NOTE: rwWrapper puts one item on the queue per walk, not one per process,
        # so reading only processes_count items here likely loses results -
        # this is part of the debugging still needed.
        for i in range(processes_count):
            # q.get() blocks until a result is available
            pairs.append(q.get())
        for process in processes:
            process.join()
        print("Received results from processes")
        q.close()
        q.join_thread()
    if drop_duplicates:
        pairs = [item for sublist in pairs for item in sublist if item[0] != item[1]]
    else:
        pairs = [item for sublist in pairs for item in sublist]
    return pairs


def rwWrapper(walks_per_node, fsn, walk_length, direction, version, q, nodes):
    print("Start walks")
    print(len(nodes))
    for node in nodes:
        for _ in range(walks_per_node):
            q.put(make_pairs(randomWalk(fsn, node, walk_length, direction=direction, version=version)))
def get_top_similar(all_pairs, top=3, as_DataFrame=True, sort_ids=True, title="Similar_BP"):
    """
    Helper function for counting joint appearances of nodes and returning the top N
    :param all_pairs: all found pairs
    :param top: required number of top elements for each node
    :param as_DataFrame: convert the output to a DataFrame
    :param sort_ids: sort the output DataFrame w.r.t. the ID column
    :param title: title of the column in the returned DataFrame
    :return: dictionary with the node number as key and a list of (node, count) tuples as value
    """
    per_node = {item[0]: list() for item in all_pairs}
    output_top = dict()
    for item in all_pairs:
        per_node[item[0]].append(item[1])
    for key in per_node:
        output_top[key] = Counter(per_node[key]).most_common(top)
    if as_DataFrame:
        output_df = pd.DataFrame(output_top.items(), columns=["ID", title])
        if sort_ids:
            output_df = output_df.sort_values(by=["ID"])
        return output_df
    else:
        return output_top
def get_SkipGrams(df, version="MetaDiff", walk_length=10, walks_per_node=10, direction="COMBI"):
    """
    Get skip-grams for the given DataFrame with entry records
    :param df: original DataFrame
    :param version: version of the step (see randomWalk for the supported versions)
    :param walk_length: max length of a RandomWalk
    :param walks_per_node: max number of RandomWalks per node in the FSN
    :param direction: initial direction
    :return: list of all pairs, the FSN instance built from the DataFrame,
        and the encoder/decoder (TransformationBPs) for it
    """
    fsn = FSN()
    fsn.build(df, name_column="FA_Name")
    tr = TransformationBPs(fsn.get_BP())
    return tr.encode_pairs(get_pairs(fsn, version, walk_length, walks_per_node, direction)), fsn, tr


class TransformationBPs:
    """
    Encode/decode the original BP node numbers to/from sequential integers for TensorFlow
    """

    def __init__(self, original_bps):
        self.len = len(original_bps)
        self.original_bps = original_bps
        self._enc_dec()

    def _enc_dec(self):
        self.encoder = dict(zip(self.original_bps, range(self.len)))
        self.decoder = dict(zip(range(self.len), self.original_bps))

    def encode(self, original_seq):
        return [self.encoder[item] for item in original_seq]

    def decode(self, seq):
        return [self.decoder[item] for item in seq]

    def encode_pairs(self, original_pairs):
        return [(self.encoder[item[0]], self.encoder[item[1]]) for item in original_pairs]
def find_similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction="IN",
                 column_title="Similar_BP"):
    fsn = FSN()
    fsn.build(df, name_column="FA_Name")
    if LOG:
        local_logger = logging.getLogger("NetEmbs.Utils.find_similar")
    if not isinstance(version, list) and not isinstance(direction, list):
        pairs = get_pairs(fsn, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                          direction=direction)
        return get_top_similar(pairs, top=top_n, title=column_title)
    else:
        # Multiple parameters, build a grid over them
        if isinstance(version, str):
            version = [version]
        if isinstance(direction, str):
            direction = [direction]
        # All possible combinations:
        _first = True
        for ver in version:
            for _dir in direction:
                if LOG:
                    local_logger.info("Current arguments are " + ver + " and " + _dir)
                col = ver + "_" + _dir
                if _first:
                    _first = False
                    output_df = get_top_similar(
                        get_pairs(fsn, version=ver, walk_length=walk_length, walks_per_node=walks_per_node,
                                  direction=_dir), top=top_n, title=col)
                else:
                    output_df[col] = get_top_similar(
                        get_pairs(fsn, version=ver, walk_length=walk_length, walks_per_node=walks_per_node,
                                  direction=_dir), top=top_n, title=col)[col]
        return output_df
def add_similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction="IN"):
    """
    Add a "similar" BP column
    :param df: original DataFrame
    :param top_n: the number of BPs to store
    :param version: version of the step (see randomWalk for the supported versions)
    :param walk_length: max length of a RandomWalk
    :param walks_per_node: max number of RandomWalks per node in the FSN
    :param direction: initial direction
    :return: original DataFrame with a Similar_BP column
    """
    return df.merge(
        find_similar(df, top_n=top_n, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                     direction=direction),
        on="ID", how="left")


def get_JournalEntries(df):
    """
    Helper function for extracting journal entries from the entry-records DataFrame
    :param df: original DataFrame with entry records
    :return: journal-entries DataFrame, each row being a separate business process
    """
    if "Signature" not in list(df):
        from NetEmbs.DataProcessing.unique_signatures import unique_BPs
        df = unique_BPs(df)
    return df[["ID", "Signature"]].drop_duplicates("ID")
# journal_decoder maps a journal-entry ID to its Signature; it is set in similar() below
journal_decoder = None


def decode_row(row):
    global journal_decoder
    output = dict()
    output["ID"] = row["ID"]
    output["Signature"] = row["Signature"]
    # Skip the first two columns (ID and Signature); use the public Index API
    # instead of the private Index._data attribute
    for cur_title in row.index[2:]:
        cur_row_decoded = list()
        if row[cur_title] == -1.0:
            output[cur_title] = None
        else:
            for item in row[cur_title]:
                cur_row_decoded.append(journal_decoder[item[0]])
                cur_row_decoded.append("---------")
            output[cur_title] = cur_row_decoded
    return pd.Series(output)
def similar(df, top_n=3, version="MetaDiff", walk_length=10, walks_per_node=10, direction=["IN", "ALL", "COMBI"]):
    """
    Find "similar" BPs
    :param df: original DataFrame
    :param top_n: the number of BPs to store
    :param version: version of the step (see randomWalk for the supported versions)
    :param walk_length: max length of a RandomWalk
    :param walks_per_node: max number of RandomWalks per node in the FSN
    :param direction: initial direction
    :return: journal-entries DataFrame with the decoded most similar BPs per entry
    """
    global journal_decoder
    if LOG:
        local_logger = logging.getLogger("NetEmbs.Utils.Similar")
        local_logger.info("Given directions are " + str(direction))
        local_logger.info("Given versions are " + str(version))
    journal_entries = get_JournalEntries(df)
    if LOG:
        local_logger.info("Journal entries have been extracted!")
    journal_decoder = journal_entries.set_index("ID").to_dict()["Signature"]
    print("Done with extracting the journal-entries data!")
    output = find_similar(df, top_n=top_n, version=version, walk_length=walk_length, walks_per_node=walks_per_node,
                          direction=direction)
    print("Done with RandomWalking... Found the top " + str(top_n))
    journal_entries = journal_entries.merge(output, on="ID", how="left")
    journal_entries.fillna(-1.0, inplace=True)
    res = journal_entries.apply(decode_row, axis=1)
    return res
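The pure helpers in the snippet above can be sanity-checked in isolation. Here is a standalone sketch that re-implements make_pairs and diff_function with the same formulas as in utils.py (without the NetEmbs imports), plus a small usage example:

```python
import numpy as np
from scipy.special import softmax

def diff_function(prev_edge, new_edges, pressure):
    # Same formula as in utils.py: candidate edges whose monetary amount is
    # closer to the previous edge get a higher transition probability
    return softmax((1.0 - np.abs(new_edges - prev_edge)) * pressure)

def make_pairs(sampled_seq, window=3):
    # Same pairing as in utils.py: each node paired with its neighbours
    # within the given window, excluding itself
    output = []
    for i in range(len(sampled_seq)):
        for j in range(max(0, i - window), min(i + window + 1, len(sampled_seq))):
            if j != i:
                output.append((sampled_seq[i], sampled_seq[j]))
    return output

make_pairs([1, 2, 3], window=1)  # [(1, 2), (2, 1), (2, 3), (3, 2)]
probas = diff_function(0.5, np.array([0.5, 0.1]), pressure=20)
# the edge whose amount equals the previous one (0.5) dominates
```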
When we have good embeddings, they should yield useful clusters. In one paper (Zhang et al., "Learning Node Embeddings in Interaction Graphs") I found the following paragraph describing how we can evaluate the performance of the clusters:
Clustering. We first use K-Means to test embeddings on
the unsupervised task. We use Normalized Mutual Information (NMI) [23] score to evaluate clustering results. The NMI score is between 0 and 1. The larger the value, the better the performance. A labeling will have score 1 if it matches the ground truth perfectly, and 0 if it is completely random. Since entities in the Yelp dataset are multi-labeled, we ignore the entities that belong to multiple categories when calculate NMI score.
With our toy set we can create ground-truth labels and evaluate the embedding technique. We can even compare this with directly applying other techniques (metapath2vec, DeepWalk, etc.).
For the real data sets no ground truth is known, hence we must describe the quality in a different way.
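With ground-truth labels on the toy set, the K-Means + NMI evaluation described above is a few lines with scikit-learn. A sketch with made-up embeddings and labels:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical 2-D embeddings of 6 business processes and their true process labels
embeddings = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                       [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
truth = [0, 0, 0, 1, 1, 1]

# Cluster the embeddings and compare against the ground truth
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
nmi = normalized_mutual_info_score(truth, pred)  # 1.0 for a perfect match, 0 for random
```

The same score could then be computed for metapath2vec or DeepWalk embeddings of the identical toy set for a direct comparison.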
Hi Aleksei,
The cached files are awesome! However, if the directory doesn't exist yet, it crashes:
Traceback (most recent call last):
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-12/b_experiments/experiment.py", line 21, in <module>
embds = get_embs_TF(df, embed_size = 2, walks_per_node = 2, num_steps=200, use_cached_skip_grams= False)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-dev-12/NetEmbs/SkipGram/tensor_flow.py", line 233, in get_embs_TF
pd.DataFrame(embs).to_pickle(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/snapshot.pkl")
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 2593, in to_pickle
protocol=protocol)
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/io/pickle.py", line 73, in to_pickle
is_text=False)
File "/Users/mboersma/PycharmProjects/networkembedding/venv/lib/python3.7/site-packages/pandas/io/common.py", line 430, in _get_handle
f = open(path_or_buf, mode)
FileNotFoundError: [Errno 2] No such file or directory: '2_walks30_pressure30_window3/TFsteps200000batch64_emb32/cache/snapshot.pkl'
I added a couple of lines so that the directory is created when it is not found; this seems to be working.
In utils.py I added:

skip_gr = tr.encode_pairs(get_pairs(N_JOBS, version, walk_length, walks_per_node, direction))
if not os.path.exists(WORK_FOLDER[0]):
    os.makedirs(WORK_FOLDER[0])
with open(WORK_FOLDER[0] + "skip_grams_cached.pkl", "wb") as file:
    pickle.dump(skip_gr, file)

In tensor_flow.py:

if not os.path.exists(WORK_FOLDER[0] + WORK_FOLDER[1] + 'cache/'):
    os.makedirs(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/")
pd.DataFrame(embs).to_pickle(WORK_FOLDER[0] + WORK_FOLDER[1] + "cache/snapshot.pkl")

such that it creates the cache folder.
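As an aside, the exists/makedirs pair can be collapsed into one call: os.makedirs with exist_ok=True (Python >= 3.2) creates every missing parent and is a no-op if the directory already exists, which also avoids the race between the check and the creation. A self-contained sketch using a temporary directory:

```python
import os
import tempfile

# Build a nested cache path (the folder names mirror the ones in the traceback)
cache_dir = os.path.join(tempfile.mkdtemp(), "TFsteps200000batch64_emb32", "cache")
os.makedirs(cache_dir, exist_ok=True)  # creates all missing parents
os.makedirs(cache_dir, exist_ok=True)  # second call is a harmless no-op
```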
Please change the cell:

from NetEmbs.DataProcessing import *

YOUR_DATAFRAME = None
if YOUR_DATAFRAME is None:
    d = prepare_data(d)
else:
    d = prepare_data(rename_columns(YOUR_DATAFRAME), split=False)
d.head(20)

to:

from NetEmbs.DataProcessing import *

df = pd.DataFrame.from_dict(journal_entries)
df.columns = ['ID', 'FA_Name', 'Debit', 'Credit']
YOUR_DATAFRAME = df
if YOUR_DATAFRAME is None:
    d = prepare_data(d)
else:
    d = prepare_data(YOUR_DATAFRAME, split=False)
d.head(20)
When I load the data frame it doesn't contain any names, therefore the replace function doesn't work as intended. The new code snippet fixes this.
Hi Aleksei,
I'm running it on multiple datasets and sometimes I get the following error:
Traceback (most recent call last):
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/analysisMB.py", line 52, in <module>
simdata = similar(d, direction=["COMBI"])
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 502, in similar
direction=direction)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 418, in find_similar
direction=_dir), top=top_n, title=str(ver + "_" + _dir))
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 318, in get_pairs
range(walks_per_node) for node
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 319, in <listcomp>
in fsn.get_BP()]
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 257, in randomWalk
new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
File "/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py", line 193, in step
tmp_vertex = np.random.choice(outs, p=ws)
File "mtrand.pyx", line 1144, in mtrand.RandomState.choice
ValueError: probabilities contain NaN
Process finished with exit code 1
I haven't found the cause of this: for some datasets it fails, for some it passes. However, the failure rate for now is approximately 50/50...
I will continue with the analysis and run it on the examples where it does work; the results so far are good. I find interesting groups of transactions, although for now I'm only studying the COMBI approach because the others seem to give weird results.
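For what it's worth, ws / np.sum(ws) yields NaNs exactly when the candidate weights sum to zero (or already contain a NaN), which would explain why only some datasets trigger "probabilities contain NaN". A minimal reproduction plus a defensive guard (a sketch, not the NetEmbs fix):

```python
import numpy as np

def safe_probabilities(ws):
    """Normalise edge weights into transition probabilities, falling back to
    a uniform distribution when the weights are unusable (all zero, or NaN)."""
    ws = np.asarray(ws, dtype=float)
    total = ws.sum()
    if not np.isfinite(total) or total <= 0:
        # this is the case that makes np.random.choice raise "probabilities contain NaN"
        return np.full(len(ws), 1.0 / len(ws))
    return ws / total

safe_probabilities([2.0, 0.0, 2.0])  # array([0.5, 0. , 0.5])
safe_probabilities([0.0, 0.0])       # uniform fallback: array([0.5, 0.5])
```

A guard like this (or logging the offending weights) in the step function would at least turn the 50/50 crashes into something inspectable.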
When I run
from NetEmbs.FSN import *
randomWalk(fsn, 1, length=10, direction="COMBI")
I get
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-95ca35af2a0e> in <module>()
1 from NetEmbs.FSN import *
----> 2 randomWalk(fsn, 1, length=10, direction="COMBI")
/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in randomWalk(G, vertex, length, direction, version, return_full_path, debug)
255 elif version == "MetaDiff":
256 if direction is "COMBI":
--> 257 new_v = step(G, cur_v, cur_direction, mode=2, return_full_step=return_full_path, debug=debug)
258 cur_direction = mask[cur_direction]
259 else:
/Users/mboersma/Documents/phd/students/alex/NetEmbs-master/NetEmbs/FSN/utils.py in step(G, vertex, direction, mode, allow_back, return_full_step, pressure, debug)
146 return vertex
147 elif not G.has_node(vertex):
--> 148 raise ValueError("Vertex {!r} is not in FSN!".format(vertex))
149 if direction == "IN":
150 ins = G.in_edges(vertex, data=True)
ValueError: Vertex 1 is not in FSN!
Any idea how I can fix this? Do you need more info? Then please let me know.
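A likely cause (a guess without seeing your FSN): randomWalk must start from a node that actually exists in the built graph, and the BP node IDs need not include 1. A toy illustration with a stand-in graph (the node IDs are made up):

```python
import random
import networkx as nx

# Toy stand-in for an FSN: only nodes 10 and 20 are BP nodes (hypothetical IDs)
g = nx.DiGraph()
g.add_edges_from([(10, "FA"), ("FA", 20)])
bp_nodes = [n for n in g.nodes if isinstance(n, int)]

assert g.has_node(10) and not g.has_node(1)  # hard-coding vertex 1 would raise
start = random.choice(bp_nodes)              # pick an existing BP node instead
```

In the notebook the equivalent would be randomWalk(fsn, random.choice(list(fsn.get_BP())), length=10, direction="COMBI"), or simply passing vertex=None, in which case randomWalk picks a random node itself.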