beancount / smart_importer

Augment Beancount importers with machine learning functionality.

License: MIT License


smart_importer's Introduction

beancount: Double-Entry Accounting from Text Files

Description

A double-entry bookkeeping computer language that lets you define financial transaction records in a text file, read them into memory, and generate a variety of reports from them; it also provides a web interface.

Documentation

Documentation can be read at:

https://beancount.github.io/docs/

Documentation authoring happens on Google Docs, where you can contribute by requesting access or commenting on individual documents. An index of all source documents is available here:

http://furius.ca/beancount/doc/index

There's a mailing-list dedicated to Beancount; please post questions there so others can share in the responses. More general discussions about command-line accounting also occur on the Ledger mailing-list, so you might be interested in that group as well.

Download & Installation

You can obtain the source code from the official Git repository on Github:

See the Installing Beancount document for more details.

Versions

There are three versions:

  • Version 3 (branch master): The in-development next version of Beancount since June 2020. This is unstable; you should use version 2 below. The scope of changes is described in this document.
  • Version 2 (branch v2): The current stable version of Beancount, in maintenance mode as of July 2020. This was a complete rewrite of the first version, which introduced a number of constraints and a new grammar and much more. Use this now.
  • Version 1 (branch v1): The original version of Beancount. Development on this version halted in 2013. This initial version was intended to be similar to and partially compatible with Ledger. Do not use this.

Filing Bugs

Tickets can be filed on the GitHub project page:

https://github.com/beancount/beancount/issues

Copyright (C) 2007-2022 Martin Blais. All Rights Reserved.

This code is distributed under the terms of the "GNU GPLv2 only". See COPYING file for details.

Donations

Beancount has found itself being useful to many users, companies, and foundations since I started it around 2007. I never ask for money, as my intent with this project is to build something that is useful to me first, as well as for others, in the simplest, most durable manner, and I believe in the genuinely free and open stance of Open Source software. Though its ends are utilitarian (it is about doing my own accounting in the first instance), it is also a labor of love, and I take great pride in it, pride which has pushed me to add the polish so that it would be usable and understandable by others. This is one of the rare areas of my software practice where I can let my desire for perfection and minimalism run untamed by the demands of time and external constraints.

Many people have asked where they can donate for the project. If you would like to give back, you can send a donation via Wise (preferably):

https://wise.com/share/martinb4019

or PayPal at:

https://www.paypal.com/paypalme/misislavski

Your donation is always appreciated in any amount, and while the countless hours spent on building this project are impossible to match, the impact of each donation is much larger than its financial import. I truly appreciate every person who offers one; software can be a lonely endeavour, and those donations as well as words of appreciation keep reminding me of the positive impact my side projects can have on others. I feel gratitude for all users of Beancount.

Thank you!

Author

Martin Blais <[email protected]>

smart_importer's People

Contributors

aclindsa, dukejones, dumbpy, endle, johannesjh, m-d-brown, seltzered, tarioch, tbm, yagebu


smart_importer's Issues

Fails if payee is not defined

In my data I don't have a payee defined. This fails the training with

ValueError: empty vocabulary; perhaps the documents only contain stop words

If I comment out the payee part, it works fine.

Empty narration but payee leads to wrong posting

I edited tests/predict_postings_test.py to add some more test cases. I added the following to test an empty narration:

                2017-01-12 * "Uncle Boons" ""
                  Assets:US:BofA:Checking  -27.00 USD

It predicted Expenses:Food:Groceries even though looking at the training data it seems clear to me that the result should be Expenses:Food:Restaurant.

Use pytest; remove coloredlogs

I've already switched the test runner to be pytest. This allows us to have much nicer test output without any further work so I'd remove everything related to coloredlogs, which seems like a gimmick with little use to me.

Can this use metadata to improve matching?

One thing I've done to ensure I can track transactions back to original sources is to add metadata to the transactions, i.e.:

original-description: "Publix Supermarkets #1234 000001234 - Any City, Ga"

And the payee is normalized to "Publix". It would be nice if the original-description data could be used to improve matching because as noted in the beancount docs, sometimes the original description does not match at all. I can't remember the exact documentation, but basically if the original description is "*SQ PARENT FOOD COMPANY - 123557 - PARTOFCITYNA" but you know where you ate, so you set the payee to "Sandwich Shoppe".

Without referencing the original-description there's no hope of matching that, right? So how can the original-description meta be fed into the ML?
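One way the original-description metadata could be fed into the model is to fold it into the text that gets vectorized. The sketch below is illustrative, not smart_importer's actual code: the Txn stand-in mimics the shape of a beancount Transaction (payee, narration, meta), and text_features is a hypothetical helper.

```python
# Hypothetical sketch: combine payee, narration, and the "original-description"
# metadata value into a single text feature string for the classifier.
# Txn is a stand-in for beancount's Transaction namedtuple.
from typing import NamedTuple, Optional


class Txn(NamedTuple):
    payee: Optional[str]
    narration: str
    meta: dict


def text_features(txn: Txn) -> str:
    """Concatenate payee, narration and original-description, skipping blanks."""
    parts = [txn.payee or "", txn.narration,
             txn.meta.get("original-description", "")]
    return " ".join(p for p in parts if p)


txn = Txn(payee="Sandwich Shoppe",
          narration="Lunch",
          meta={"original-description": "*SQ PARENT FOOD COMPANY - 123557"})
print(text_features(txn))
```

With both the normalized payee and the raw bank description in one feature string, future raw descriptions can match even when the payee alone would not.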

Refactor common part of predict_payees and predict_postings

I think it would make it cleaner and easier to expand if we refactor the common part into an own class.
That way each could simply focus on their respective parts.

Parts I see:

  • [Common] dealing with it being a decorator
  • [Common] loading training data
  • [Specific] training a model
  • [Common] getting a list of imported transactions
  • [Specific] enhancing the transactions
  • [Common] merge the enhanced transactions with the rest of imported entries

What do you think? @johannesjh
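The proposed split could be sketched as a base class owning the common plumbing, with subclasses filling in the model-specific hooks. All names here are illustrative, not smart_importer's actual API; the "merge" and decorator steps are elided for brevity.

```python
# Hypothetical sketch of the refactoring: common logic in a base class,
# model-specific steps as overridable hooks.
class EntryPredictor:
    """[Common] plumbing: load training data, run hooks, merge results."""

    def load_training_data(self, entries):
        # [Common] in the real project this would filter beancount entries
        return entries

    def train(self, data):            # [Specific] training a model
        raise NotImplementedError

    def enhance(self, txns):          # [Specific] enhancing the transactions
        raise NotImplementedError

    def apply(self, training, imported):
        self.train(self.load_training_data(training))
        return self.enhance(imported)  # [Common] merge step elided


class PredictPayees(EntryPredictor):
    def train(self, data):
        self.seen = set(data)

    def enhance(self, txns):
        return [t + " (known)" if t in self.seen else t for t in txns]


p = PredictPayees()
print(p.apply(["Coffee"], ["Coffee", "Rent"]))  # → ['Coffee (known)', 'Rent']
```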

Code standards

So far the code doesn't really follow any code standards and is (in parts) horrible to read. It should at least pass something like flake8 and ideally also pylint.

Add duplicate detection functionality

I'm currently looking at fava again for handling my import process.
What I noticed is that the duplicate detection logic of bean-extract is not present when importers are used through fava.
I think it should be quite simple to create interceptors similar to the PredictPosting that

  1. provides the same functionality as bean-extract (key logic for this is already reusable, so should be very simple)
  2. detects duplicates by using a specific metadata key with a reference nr (another use case I have in some cases)

What do you think, should I add them to smart_importer or keep them separate?

apply decorator to importer class

If I understood it right, beancount comes with a csv parser which can be configured and you don't need to actually subclass it.
So I think this would be simply a configuration in your foo.import file. It would be nice if we could then configure it somehow to add smart importer functionality without having to extend the existing importer to add the interceptor.

Fix deprecation warnings if caused by smart_importer

Follow-up ticket for the issue raised in #53 by @sprnza:

How could I disable this output with python 3.7?

/usr/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict

The above deprecation warning comes from inside scikit-learn. We probably can't do much about it. But some deprecation warnings do come from smart_importer:

python3.7/site-packages/numpy/matrixlib/defmatrix.py:68: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
  return matrix(data, dtype=dtype, copy=False)
python3.7/site-packages/numpy/matrixlib/defmatrix.py:68: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
  return matrix(data, dtype=dtype, copy=False)
python3.7/site-packages/numpy/matrixlib/defmatrix.py:68: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
  return matrix(data, dtype=dtype, copy=False)
smart_importer/smart_importer/machinelearning_helpers.py:216: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
  return np.transpose(np.matrix(data))

The problematic code seems to be smart_importer.machinelearning_helpers.ArrayCaster
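The PendingDeprecationWarning comes from np.matrix. A minimal sketch of an equivalent cast built on plain ndarrays follows; to_column is a stand-in for what a caster like ArrayCaster might do, not the project's actual implementation.

```python
# Sketch: reshape to a 2-D column array instead of np.transpose(np.matrix(...)),
# avoiding the deprecated np.matrix subclass entirely.
import numpy as np


def to_column(data):
    """Cast a 1-D sequence to a (n, 1) column ndarray."""
    return np.asarray(data).reshape(-1, 1)


print(to_column([1.0, 2.0, 3.0]).shape)  # → (3, 1)
```

For 1-D input, reshape(-1, 1) produces the same column-vector shape that np.transpose(np.matrix(data)) did, so downstream scikit-learn steps should be unaffected.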

decorated importer classes do not work in fava

decorating importer classes does not work properly in fava.

import.config.py:

CONFIG = [
    bawag_psk.SmartBawagPSKImporter(),
    paylife.SmartPaylifeImporter()
]

fava's import page will then show a csv file from paylife as follows:

File Importer Account [Button]
MeineTransaktionen.csv smart_importer.predict_postings.PredictPostingsImporter Liabilities:Mastercard Extract

Clicking extract triggers the following HTTP request:

GET /johannes-ledger-2018/extract/?filename=%2FUsers%2Fjohannes%2FDownloads%2FMeineTransaktionen.csv&importer=smart_importer.predict_postings.PredictPostingsImporter&partial=true HTTP/1.1

Problem:
The request should contain the classname of the importer, not of the decorator!

As a result, fava loads the wrong importer (the one for BawagPSK instead of the one for paylife) and produces a runtime exception.

Bad prediction if account is only differing factor

See multiaccounts test on bug/bad_account_prediction

It looks like, if the account is the only differing factor, something goes wrong.

@johannesjh, can you maybe have a look? I tried to figure this out for quite a while, but somehow I'm not getting it. It looks like the "from" account is getting predicted instead of the "to" account.

Operate on single transactions

Is there a reason why the predictors operate on a list of transactions? After the ML model is trained, predictions should only depend on a single transaction, no?

Changing this would simplify a lot of the logic and would also be required for something like #32.

Predict multiple postings

Currently, PredictPostings will only predict a single posting and not a list of postings. It seems to me that the latter would be preferable, is there a reason why this is not done?

Improve matching quality

I now imported my historical data from gnucash into beancount and I'm trying to match new imports with the existing data.
Right now the matches are really terrible.

I'd like to figure out why that's the case and improve the matching to actually be usable for me.

Points that I can think of

  • imported narrations are "chatty": they contain parts which are the same or almost the same in a lot of other transactions, and only a small part is actually "interesting"
  • suggestions are independent of amount => #22
  • not sure if this is taken into account but I think the frequency of a certain transaction "type" should also be important. E.g. if I have a transaction "type" T1 which I have 100 times and a similar type T2 which occurs 1 time, it's way more likely that T1 is the correct match

Use amount as another dimension

I think it would make sense to also include the amount as a dimension as most of the time, the same entries should also have a similar amount.
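One simple way to fold the amount into an otherwise token-based model is to bucket it and append the bucket as an extra token. The bucketing scheme below is an assumption for illustration, not smart_importer's behaviour.

```python
# Hypothetical sketch: turn the amount into a coarse log-scale bucket token,
# so -20.00 and -25.00 land in the same bucket but -400.00 does not.
import math


def featurize(narration: str, amount: float) -> list:
    tokens = narration.lower().split()
    # illustrative bucketing: order of magnitude of the absolute amount
    bucket = "amt_%d" % int(math.log10(max(abs(amount), 1.0)))
    return tokens + [bucket]


print(featurize("Grocery Store", -25.0))   # → ['grocery', 'store', 'amt_1']
print(featurize("Airline ticket", -400.0))  # → ['airline', 'ticket', 'amt_2']
```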

Canonical usage

The currently recommended way to use this is quite ugly:

@PredictPostings()
class SmartMyBankImporter(MyImporter):
    pass

Why even have the decorators, if they're not intended to be used as such?
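For comparison, the same effect can be had without a pass-only subclass by calling the decorator as a plain wrapper. PredictPostings below is a minimal stand-in with the same call shape, not the real class.

```python
# Sketch: a class decorator is just a callable, so it can be applied
# explicitly instead of via a pass-only subclass.
class PredictPostings:
    def __init__(self, **options):
        self.options = options

    def __call__(self, importer_cls):
        importer_cls.smart = True  # illustrative: attach smart behaviour
        return importer_cls


class MyImporter:
    pass


# equivalent to the @PredictPostings() / pass-only subclass pattern:
SmartMyBankImporter = PredictPostings()(MyImporter)
print(SmartMyBankImporter.smart)  # → True
```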

Allow to specify confidence score

Moved here from discussion in #4

Another thing which comes to mind. Is there a confidence for the prediction? Because if that one is very low I would rather not have a prediction.

Good point. The decorator could accept a parameter through which users can set a threshold.

The SVM classifier in scikit-learn can calculate probabilities, compare How to get a classifier's confidence score for a prediction in sklearn? on stackoverflow. I don't know how much this will slow down the pipeline.fit method, but I would not worry too much about it.
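A threshold could be applied on top of predict_proba roughly as follows. The classifier choice (MultinomialNB), the tiny feature matrix, and the 0.8 threshold are all illustrative.

```python
# Sketch: only accept a prediction when the top class probability clears
# a user-configurable threshold; otherwise return no prediction.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# toy token-count features and target accounts
X = np.array([[2, 0], [3, 0], [0, 2], [1, 1]])
y = ["Expenses:Food", "Expenses:Food", "Expenses:Travel", "Expenses:Food"]

clf = MultinomialNB().fit(X, y)
probs = clf.predict_proba([[2, 0]])[0]
best = probs.argmax()
threshold = 0.8  # illustrative decorator parameter
prediction = clf.classes_[best] if probs[best] >= threshold else None
print(prediction)  # → Expenses:Food
```

Note that not every scikit-learn classifier exposes predict_proba out of the box; SVC, for instance, needs probability=True at construction time, which slows down fitting.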

predict tags/metadata/links

I'm loving smart_importer, but it left me with the desire of also being able to predict tags/metadata/links.
I understand the model will be more complex/larger, but it should be doable, right? Have you already considered the idea?
Either way, thanks a bunch for smart_importer!

False predictions when importing several files at once

Guys hi,

It's more a question on usage than an issue; I hope you can explain. I've read the Quick Start and Documentation, but since I am using several importers, one of which has two "modes" (credit card and checking), I can't understand how to apply the directions provided.

I have following folder structure:

/downloads/
/office/
	at.beancount
	at.import
	/importers/	
		__init__.py
		/paypal/
			__init__.py
		/chase/
			__init__.py

at.beancount looks like this
/paypal/__init__.py/ and chase/__init__.py like this

Using bean-extract -e at.beancount at.import ../Downloads/ > temp.beancount
gives me temp.beancount file similar to this

Then I manually put in the correct accounts and get this.

I'd like to automate this last manual part with smart_importer. As far as I understand, I don't need @PredictPayees(), only @PredictPostings(). But I can't understand in which importer file to insert them (in at.import, or in /chase/__init__.py and /paypal/__init__.py) and where exactly :) A Python programmer helped me with the importers, but now he is not available, so I have to figure it out on my own.

gracefully handle lack of training data

Add a unit test and gracefully handle the case when there are <= two accounts in the training data (I think that this would currently throw an error).
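A graceful fallback could look like the sketch below: if the filtered training data covers too few distinct accounts, log a warning and skip prediction instead of crashing. The function name and the exact threshold are illustrative.

```python
# Hypothetical guard: refuse to train when the training data does not
# contain enough distinct target accounts to learn anything useful.
import logging

logger = logging.getLogger(__name__)


def can_train(training_accounts) -> bool:
    distinct = set(training_accounts)
    if len(distinct) < 2:  # threshold is illustrative
        logger.warning(
            "Only %d distinct account(s) in training data; skipping prediction.",
            len(distinct))
        return False
    return True


print(can_train(["Expenses:Food"]))                      # → False
print(can_train(["Expenses:Food", "Expenses:Travel"]))   # → True
```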

Provide Autocomplete Suggestions to Text Editors like Emacs and Vim

Post by Martin @blais on the beancount mailinglist:

Another way I'm finding I'd like to invoke this is by invocation of an Emacs binding to auto-complete one particular transaction based on a stored model.
Basically put the cursor over an incomplete transaction and have it be completed by the ML classification.
Just an idea.

Only predict on transaction entries

(There is a bug: if there is a Balance entry, or any other non-Transaction entry, in the entries returned by the importer, the ML crashes with a cryptic traceback.)

Split machinelearning_helpers

smart_importer.machinelearning_helpers is currently quite a mixed bag, containing all sorts of helper functions and classes (not all of which have something to do with ML). I think this module should be split.

fix travis-ci

The migration of this repo from johannesjh/smart_importer to beancount/smart_importer broke the Travis CI continuous integration.

Include Account as a dimension

I have an importer which can return transactions for multiple accounts

e.g.
Assets:Foo:EUR
Assets:Foo:CHF
Assets:Foo:USD

so I can't filter the learning data for the specific account. The predictions should now take the account as an input dimension as it's unlikely that the same transaction that happens for Assets:Foo:EUR would also happen for Assets:Foo:CHF
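One minimal way to make the account an input dimension is to emit it as an extra token next to the narration tokens, so transactions on Assets:Foo:EUR and Assets:Foo:CHF stay distinguishable. The helper below is a sketch, not smart_importer's code.

```python
# Hypothetical sketch: append the source account as a feature token so the
# classifier can condition its prediction on which account is being imported.
def features(account: str, narration: str) -> list:
    return narration.lower().split() + ["acct:" + account]


print(features("Assets:Foo:EUR", "Broker fee"))
# → ['broker', 'fee', 'acct:Assets:Foo:EUR']
```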

Publish as Package on PyPI to ease installation

  • Read the Python packaging guide: https://packaging.python.org/tutorials/packaging-projects/#uploading-your-project-to-pypi
  • Setup Makefile targets similar to fava, using twine
  • Write a CHANGES file
  • Set the initial version to 0.1.0 in smart_importer/__init__.py
  • Tag v0.1.0 in git
  • Create a release v0.1.0 in github
  • Publish using twine on test.pypi.org and verify that it worked
  • Publish using twine on production pypi.org
  • Bump the version to 0.1.1-dev in smart_importer/__init__.py

Add data based tests

I think it would be great to have data based tests to cover lots of use cases.
I'm having some really bad predictions which I would like to narrow down with that.

@johannesjh, would you have an issue if I switch over to pytest, as it has some support for test generators?

Option to use a mapping file along with regexes to override or supersede smart_importer suggestions

This suggestion comes from a feature of icsv2ledger (https://github.com/quentinsf/icsv2ledger) that I found very useful. Basically, it was a file containing descriptions mapped to both payee and account (comma-separated values) that also allowed for the use of regular expressions.

To give you an example use case of how it can be used:
My preferred way to send and receive money with friends and family is via email money transfers. My bank uses the following as the description when I receive one: "ETF received Bob Smith". For the posting I would like to see money going into the applicable bank account and coming out of 'Accounts Receivable' for 'Bob Smith', or any other name of a person I receive one from. To do this I would use a regex with a capture buffer that I can use to formulate the payee and account.

Training data path should be relative to import config

When using the importer from Fava, a relative path for training data will be interpreted as being relative to the directory that fava is run from afaict.

I first tried to use the decorator without any training data, however this threw errors (in both Fava and bean-extract). Is this still supposed to be supported?

pylint error, cannot import beancount.core.data

This pylint error

smart_importer/pipelines.py:7:0: E0001: Cannot import 'beancount.core.data' due to syntax error 'misplaced type annotation (<unknown>, line 283)' (syntax-error)

...has been fixed by Martin in beancount:
https://bitbucket.org/blais/beancount/issues/343/pylint-error-in-beancountcoredata
but the fix has not yet been released.

Until that happens, we could tell pylint to ignore the import error, i.e.,

from beancount.core.data import Transaction # pylint: disable=import-error

Type annotations

Is there a plan to check these at some point? Otherwise it seems a bit useless to add them everywhere.

Bugs that could be caught with it aren't caught as is (like load_training_data not returning a list).

test failure due to invalid syntax

The tests fail due to:

E     File "/home/tbm/scratch/cvs/smart_importer/smart_importer/predictor.py", line 74
E       f"After filtering for account {self.account}, "
E                                                     ^
E   SyntaxError: invalid syntax

This makes it work but I'm not sure what the correct fix is:

-                f"After filtering for account {self.account}, "
-                f"the training data consists of {len(training_data)} entries.")
+                "After filtering for account {self.account}, "
+                "the training data consists of {len(training_data)} entries.")

@yagebu
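For context: f-strings were added in Python 3.6, so this SyntaxError usually indicates the tests ran under an older interpreter. Note that simply dropping the f prefix, as in the diff above, silences the error but also stops the placeholders from being interpolated. A pre-3.6-compatible equivalent (the values below are illustrative) would use str.format:

```python
# Equivalent log message without f-strings, compatible with Python < 3.6.
# account and training_data are illustrative stand-ins for the real values.
account = "Assets:US:BofA:Checking"
training_data = [1, 2, 3]

msg = ("After filtering for account {}, "
       "the training data consists of {} entries.").format(
           account, len(training_data))
print(msg)
```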

How to disable warnings and DEBUG output

Hi there! Thanks for the awesome project.
I've got a little question.
How could I disable this output with python 3.7?

/usr/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:17: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, defaultdict
DEBUG:smart_importer.decorator_baseclass:The Decorator was applied to a class.                                                   
DEBUG:smart_importer.decorator_baseclass:The Decorator was applied to a class.

Thanks!
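The two kinds of output have different sources and can be silenced separately: DeprecationWarnings via the warnings module, and the DEBUG lines via the logging level of smart_importer's loggers. A minimal sketch (to be run before invoking the importer):

```python
# Sketch: suppress DeprecationWarnings and raise smart_importer's log level
# so its DEBUG messages no longer appear.
import logging
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
logging.getLogger("smart_importer").setLevel(logging.WARNING)

# this DEBUG message is now filtered out:
logging.getLogger("smart_importer").debug("suppressed")
print("configured")
```

Alternatively, running Python with `-W ignore::DeprecationWarning` achieves the same for the warnings without code changes.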

Improve examples

Thank you @tarioch for adding the examples folder. I would like to discuss ways of making the example more realistic and easier to understand for novice users.

I think the example should illustrate the following use case: importing transactions from a debit or credit card, with automatic prediction of expense accounts:

  • A realistic CSV file (downloaded.csv currently only consists of the word "test").
  • The importer class should inherit from beancount.ingest.importers.csv.Importer and read the CSV file. There should be no hardcoded output in the importer class.
  • I would like to use different account names that go together with the use case of predicting expense categories. E.g., Assets:MyBank:MyAccount, Assets:Cash, Expenses:Groceries, Expenses:Travel...
  • I think we should provide shorter and simpler training data in example.beancount.

Any thoughts?

example use case for beancount standard csv.importer

I am trying to get this to work with the standard provided csv.importer of beancount without much success. To be honest I am fairly green with Python let alone decorators so I am sure it is something that I am doing or not doing...

import sys
from os import path
sys.path.insert(0, path.join(path.dirname(__file__)))

from beancount.ingest import extract
from beancount.ingest.importers import csv

from smart_importer.predict_postings import PredictPostings

Col = csv.Col

csv.Importer = PredictPostings(suggest_accounts=False)(csv.Importer)

CONFIG = [
     csv.Importer({Col.DATE: 'Date',
                  Col.PAYEE: 'Transaction Details',
                  Col.AMOUNT_DEBIT: 'Funds Out',
                  Col.AMOUNT_CREDIT: 'Funds In'},
                 'Assets:Simplii:Chequing-9875',
                 'CAD',
                 ['Filename: .*SIMPLII_.*\.csv',
                  'Contents:\n.*Date, Transaction Details, Funds Out, Funds In']
                 ),
    ]

Could somebody please point me in the right direction on what I am doing wrong? I am very interested in this, have some experience with ML, and hope to add to this project where I can once I get everything up and running.

Predict narration / compare importer entries with manually edited entries

I haven't actually used smart_importer yet (nor have I used any beancount importer) but I've started reading the source code.

What I was hoping smart_importer would allow me to do is this: as learning input, I want to give it the raw beancount entries as generated by the importer and I want to give it a matching beancount file (i.e. same entries) after I manually edited them.

This way, payee, postings and even the narration could be figured out with machine learning.

Let's say my CSV file has:

Sale,04/11/2018,04/12/2018,INTERSPAR DANKT,-17.85

and the CSV importer will generate a transaction with INTERSPAR DANKT as narration. I will then manually change this to payee Interspar with narration Supermarket.

So you'd get one beancount file with

2018-04-11 * "INTERSPAR DANKT"

and one with

2018-04-11 * "Interspar" "Supermarket"

In fact, it would contain several such entries, so smart importer could automatically figure out: oh, if the input is INTERSPAR DANKT we should map that to payee Interspar with narration Supermarket (and posting Expenses:Food).

Training data should be independent of order of postings

Right now when training it's expected that the postings always follow the same order. e.g.

2017-01-01 * "Foo"
    Assets:Foo
    Income:Bar

If there is now a transaction with

2017-01-01 * "Foo"
    Income:Bar
    Assets:Foo

It will not be trained correctly/ignored.
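A simple way to make training order-independent is to normalize the posting order before building training samples, e.g. by sorting the posting accounts. The tuple-of-account-names layout below is illustrative.

```python
# Sketch: normalize posting order so both orderings of the same transaction
# produce an identical training sample.
def normalized(posting_accounts):
    return tuple(sorted(posting_accounts))


a = normalized(["Assets:Foo", "Income:Bar"])
b = normalized(["Income:Bar", "Assets:Foo"])
print(a == b)  # → True
```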

limit predictions to missing second postings

The predict_postings decorator should only predict missing second postings, as opposed to also predicting third and fourth postings etc., which does not make sense for any use case I can think of.

Insert postings before existing ones

I think we should insert any new predicted postings before the list of postings - in Beancount (or at least in ledger) it is more or less convention that the "source" account of a transaction is the last one. I guess if one imports transactions from a bank, the one fixed account will be the source account and should come last.

default value for filter_training_data_by_account

The decorator could try to read importer.file_account() from the decorated importer instance and use this as default value for filter_training_data_by_account.

This would allow the decorator to be used without any arguments (training data can be retrieved from the existing_entries argument to the importer's extract function). The lack of a default value for filter_training_data_by_account is currently preventing the test_<...>_decoration_with_empty_arguments unit tests from passing; the unit tests are therefore currently skipped:

  • PredictPostingsDecorationTest#test_class_decoration_with_empty_arguments
  • PredictPostingsDecorationTest#test_method_decoration_with_empty_arguments
  • PredictPostingsDecorationTest#test_class_decoration_without_arguments
  • PredictPostingsDecorationTest#test_method_decoration_without_arguments

Also, to make the API easier to understand we could rename filter_training_data_by_account to known_account.
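The proposed default could be resolved roughly as sketched below. MyImporter and resolve_known_account are illustrative; in beancount, ImporterProtocol.file_account takes a file argument, which is simplified away here.

```python
# Hypothetical sketch: fall back to the decorated importer's file_account()
# when the user does not pass an explicit account to the decorator.
class MyImporter:
    def file_account(self, file=None):
        return "Assets:MyBank:Checking"


def resolve_known_account(importer, known_account=None):
    if known_account is None and hasattr(importer, "file_account"):
        return importer.file_account(None)
    return known_account


print(resolve_known_account(MyImporter()))
# → Assets:MyBank:Checking
```

An explicitly passed account would still win over the importer-derived default, so existing configurations keep working.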

Support existing entries as training data

Very recently, the ImporterProtocol has been extended to pass the existing transactions in.
It would be great to simply use this instead of specifying explicit training data.
