vallettea / koala

Transpose your Excel calculations into Python for better performance and scaling.

License: GNU General Public License v3.0

Python 100.00%

koala's Issues

Handle better A1:A10 cases when detecting alive

This happens when using get_arguments_from_ast() and get_volatiles_arguments_from_ast().
For a range such as A1:A10, only the first and last parents are considered.

For now we have removed these cases temporarily.
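
To consider every parent of a range rather than only its two endpoints, the range has to be expanded into individual addresses first. The sketch below shows one way to do that for same-sheet references. The helpers col_to_index, index_to_col and expand_range are hypothetical names used for illustration and are not taken from koala.

```python
import re


def col_to_index(col):
    """Convert a column label such as 'A' or 'AB' to a 1-based index."""
    index = 0
    for char in col:
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


def index_to_col(index):
    """Convert a 1-based column index back to its label."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def expand_range(ref):
    """List every address in a same-sheet range such as 'A1:A10', row by row."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref)
    if match is None:
        raise ValueError("unsupported reference: %r" % ref)
    col_start, row_start, col_end, row_end = match.groups()
    cols = range(col_to_index(col_start), col_to_index(col_end) + 1)
    rows = range(int(row_start), int(row_end) + 1)
    return [index_to_col(c) + str(r) for r in rows for c in cols]


# Every cell between the endpoints becomes a candidate parent.
parents = expand_range("A1:A10")
```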

Merge CellRange and Range classes

It might be worth merging CellRange and Range into a single class, for clarity.

https://github.com/anthill/koala/blob/master/koala/ast/excelutils.py#L15
https://github.com/anthill/koala/blob/master/koala/ast/Range.py#L118

Evaluation inconsistency

Evaluation does not always give the same result.

Example:
With InputData!G14 as 2018 in the .XLS,

print 'First evaluation', sp.evaluate('Cashflow!G187') # => outputs -2966.25862693
sp.set_value('InputData!G14', 0) # this is to avoid direct evaluation
sp.set_value('InputData!G14', 2025)
print 'Second evaluation', sp.evaluate('Cashflow!G187') # => outputs -3719.5504961

With InputData!G14 as 2025 in the .XLS,

print 'First evaluation', sp.evaluate('Cashflow!G187') # => outputs -2582.30664008
sp.set_value('InputData!G14', 0) # this is to avoid direct evaluation
sp.set_value('InputData!G14', 2025)
print 'Second evaluation', sp.evaluate('Cashflow!G187') # => outputs -2582.30663952

Check if Volatile Ranges are scanned during detect_alive()

When you prune your graph, the cellmap of the reduced graph has a smaller number of cells than the original cellmap. But Ranges have been created with the original cellmap, so they might hold a reference to a cell that doesn't exist in the reduced cellmap.
This problem gets solved by dumping/loading the graph, since Ranges are recreated from the reduced cellmap.

Still, this inconsistency should be addressed.
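
A cheap consistency check after pruning could list the references that no longer resolve. The sketch below assumes, for illustration only, that a Range can be reduced to a list of addresses and that the cellmap is a dict keyed by address; find_dangling_references is a hypothetical helper, not a koala function.

```python
def find_dangling_references(ranges, cellmap):
    """Map each range name to the addresses it references that are missing from cellmap."""
    dangling = {}
    for name, addresses in ranges.items():
        missing = [address for address in addresses if address not in cellmap]
        if missing:
            dangling[name] = missing
    return dangling


# Toy data: the range was built from the original cellmap, then B2 was pruned.
ranges = {"Sheet1!B1:B3": ["Sheet1!B1", "Sheet1!B2", "Sheet1!B3"]}
reduced_cellmap = {"Sheet1!B1": 1, "Sheet1!B3": 3}
report = find_dangling_references(ranges, reduced_cellmap)
```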

Manage Spreadsheet.set_value() on sparse Ranges

Ranges with empty cells end up being sparse Ranges (meaning they lack some Cell references).
Then using Spreadsheet.set_value() might result in setting incorrect values on Cells.
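
The following toy example illustrates how pairing values by position with a sparse Range can misplace them, and how pairing against the full address list avoids it. It is a sketch under the assumption that cells are stored in a dict keyed by address; both functions are hypothetical and do not reproduce koala's implementation.

```python
def set_values_by_position(cells, sparse_addresses, values):
    """Pair values with the cells that happen to exist (misaligns on sparse Ranges)."""
    for address, value in zip(sparse_addresses, values):
        cells[address] = value


def set_values_by_address(cells, full_addresses, values):
    """Pair values with every address of the Range, creating missing cells."""
    if len(full_addresses) != len(values):
        raise ValueError("a Range and its values must have the same size")
    for address, value in zip(full_addresses, values):
        cells[address] = value


# A1:A3 where A2 is empty, so the Range only knows A1 and A3.
naive = {"A1": 1, "A3": 3}
set_values_by_position(naive, ["A1", "A3"], [10, 20, 30])

safe = {"A1": 1, "A3": 3}
set_values_by_address(safe, ["A1", "A2", "A3"], [10, 20, 30])
```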

Offset doesn't work with Ranges

When you use the OFFSET height and width arguments so that the output is actually a Range rather than a single Cell, this output most probably doesn't exist in the cellmap, leading to errors.

Set up a detailed Benchmark

Related to #17.

We need to understand exactly where we gain performance and where we simplify the graphs.
A detailed benchmark is therefore needed.

The main 3 options we've added are:

volatile cleaning
pruning (inputs selection)
outputs selection

For each of these options, we want to know:

what is the size reduction of the graph (nodes, edges)?
what is the time reduction of gen_graph?
what is the time reduction of set_value?
what is the time reduction of evaluate?
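
A minimal measurement harness for these questions could look like the sketch below. It only uses the standard library; time_call and reduction are hypothetical helper names, and the koala calls to be measured (gen_graph, set_value, evaluate) would be passed in as callables.

```python
import time


def time_call(func, *args, repeat=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def reduction(baseline, optimized):
    """Relative reduction: 0.25 means the optimized figure is 25% lower."""
    if baseline == 0:
        return 0.0
    return (baseline - optimized) / baseline


# Works for sizes (node or edge counts) as well as for timings.
node_reduction = reduction(100, 75)
elapsed = time_call(sum, range(1000))
```

Taking the best of several runs rather than the mean is a design choice that reduces noise from other processes; a mean or median would also be reasonable.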

Add a ".XLSX" test to verify inputs/outputs

Repair Excel function tests

Add licence

Pycel is GPL.
Openpyxl is MIT.

Seems like Koala should be GPL.

Is the need_update check on a Range's cells necessary?

This part of the code was necessary at one point because the update of Ranges was not optimal.
This ended up in infinite loops for some files, but this might have been fixed by these lines.

We need to check that this is the case.

Use a single Tokenizer

Currently, 2 different tokenizers are used in Koala:

the main tokenizer, from Pycel, is used when constructing the graph (in koala/ast/tokenizer.pyx)
a secondary tokenizer, from Openpyxl, is used when reading cells of type range, to be able to translate the formulas (in koala/openpyxl/tokenizer.py).

We need to merge the two into one to reduce complexity.

Are Cells with indirect links correctly reset ?

Cells with a formula that implies a volatile have indirect links to other cells.
On set_value(), are these Cells correctly reset ?

Explore more calculation routes to ensure koala works

See https://github.com/anthill/engie/issues/42

No need to remove all index

We don't need to remove every INDEX: only those that return an address need to be removed, not those that return a value. For the moment, we remove them all.

Is the clean_volatiles() cache a source of bad evaluations ?

There is a cache dictionary in the Spreadsheet.clean_volatiles() function, whose purpose is to reduce the amount of expression calculated, when the formula is the same as one previously found.
The problem is that sometimes, the same formula, called from a different cell, will evaluate differently.

This might lead to bad evaluations, and might explain #44.
But performance might be impacted.
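
The failure mode can be reproduced with a toy evaluator in which ROW() stands in for any formula whose result depends on the calling cell. This is a sketch only: evaluate and cached_eval are hypothetical helpers and do not reproduce the actual clean_volatiles() cache.

```python
def evaluate(formula, address):
    """Toy evaluator: 'ROW()' returns the row number of the calling cell."""
    if formula == "ROW()":
        return int("".join(char for char in address if char.isdigit()))
    raise ValueError("unsupported formula: %r" % formula)


def cached_eval(cache, key, formula, address):
    """Evaluate once per cache key and reuse the stored result afterwards."""
    if key not in cache:
        cache[key] = evaluate(formula, address)
    return cache[key]


# Keyed by formula only: the second cell reuses the first cell's result.
by_formula = {}
first = cached_eval(by_formula, "ROW()", "ROW()", "A1")
stale = cached_eval(by_formula, "ROW()", "ROW()", "A5")

# Keyed by (formula, address): correct, but with fewer cache hits.
by_formula_and_cell = {}
correct = cached_eval(by_formula_and_cell, ("ROW()", "A5"), "ROW()", "A5")
```

Adding the address to the key trades cache hits for correctness, which matches the performance concern raised above.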

Possible conflict of counta and setvalue

As counta counts the number of cells whose value is not None, set_value with reset could potentially alter the result.
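
The concern can be shown with a toy counta, assuming that a reset sets a cell's value to None until it is recomputed. That assumption and the counta function below are illustrative and not taken from koala's code.

```python
def counta(values):
    """Count the values that are not None."""
    return sum(1 for value in values if value is not None)


cells = {"A1": 10, "A2": 20, "A3": 30}
before = counta(cells.values())

# Assumed reset behavior: the dependent cell is cleared until it is recomputed.
cells["A2"] = None
during_reset = counta(cells.values())
```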

Clean basic_evaluation.xlsx

Make it clearer

Repair basic_evaluation

Fix_cell() bug

After some experiments, fixing a cell in the middle of a calculation chain has proven to output fixed results. More investigation is needed.

Write a tool to detect files with volatile dependent inputs

When you have an input that affects the value of a volatile, your whole calculation will be broken if you use Spreadsheet.clean_volatiles().

Detecting in advance which files are affected would therefore be a useful feature.
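
Such a tool amounts to a reachability test on the dependency graph: an input is problematic if some volatile cell can be reached from it. The sketch below uses a plain dict of successors as a stand-in for the graph; reachable and inputs_affecting_volatiles are hypothetical helpers, and the cell names are made up.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start (breadth-first), start included."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def inputs_affecting_volatiles(graph, inputs, volatile_cells):
    """List the inputs from which at least one volatile cell can be reached."""
    volatile_cells = set(volatile_cells)
    return [cell for cell in inputs if reachable(graph, cell) & volatile_cells]


# Edges point from a cell to the cells that depend on it.
graph = {
    "In!A1": ["Calc!B1"],
    "Calc!B1": ["Calc!C1"],
    "In!A2": ["Calc!D1"],
}
risky = inputs_affecting_volatiles(graph, ["In!A1", "In!A2"], ["Calc!C1"])
```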

Improve perfs

See #16.

Merge ExcelCompiler and Spreadsheet into one single Class ?

Technically, ExcelCompiler reads the Excel file and initializes the Spreadsheet, not much more.
We could imagine doing all of this in the Spreadsheet class.

Set up automated tests before commit

Not urgent at all.
In the future, we will need to launch tests automatically before committing.
Before that, we need to structure our testing procedure a little.

Open the possibility to clean_volatiles() from Spreadsheet

Currently, it is necessary to call ExcelCompiler.clean_volatiles(), which will call the Spreadsheet equivalent.

But calling Spreadsheet.clean_volatiles() directly won't generate a new graph.
Opening this possibility requires rethinking how ast.__init__() works.

Are string values flattened?

excelutils.py l424, flatten method for cell values:
if isinstance(el, collections.Iterable) and not isinstance(el, basestring):

@iOiurson How do you feel about that?
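
For reference, the check quoted above keeps strings whole. The sketch below is a Python 3 rendering of the same idea (collections.abc.Iterable and str replace collections.Iterable and basestring); it is not koala's actual flatten function.

```python
from collections.abc import Iterable


def flatten(items):
    """Yield the leaves of a nested iterable, treating strings as single values."""
    for el in items:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el


flat = list(flatten([1, ["ab", [2, "c"]], "de"]))
```

Without the string exclusion, a one-character string would be iterated into itself indefinitely, so the exclusion is needed for termination as well as for keeping text values intact.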

Add util to check if a link exists between two cells

Excellib.irr() might not output correct values

We use the numpy version of Excel IRR function.

But for file 230, Calculations!CU273 does not output the expected value.
The final evaluation value seems correct, though.

Rename Volatiles

Volatile functions in Excel are functions that always trigger evaluation (see: http://www.decisionmodels.com/calcsecretsi.htm)

What we have called "volatiles" in our code are actually functions that output a reference to a cell, which is not the same thing.

For the sake of clarity, we need to rename what we call volatiles in our code.

Should we clean white spaces in formulas ?

White spaces in formulas are a problem:

if you clean them up, text values that include white spaces are corrupted
if you don't, the clean_volatiles() function might fail to replace parts of a formula, since revert_rpn() (which outputs the part of the formula to replace) returns a formula without white spaces.

The current setup does not strip white spaces.
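
A possible middle ground is to strip white spaces only outside quoted text. The sketch below treats double-quoted strings and single-quoted sheet names as protected; strip_spaces_outside_quotes is a hypothetical helper. One known limitation: in Excel a space between two references is the intersection operator, so this approach would change the meaning of formulas that rely on it.

```python
def strip_spaces_outside_quotes(formula):
    """Remove spaces from a formula, except inside "..." strings and '...' sheet names."""
    result = []
    quote = None  # the quote character currently open, if any
    for char in formula:
        if quote is None:
            if char in ('"', "'"):
                quote = char
            elif char == " ":
                continue  # drop spaces that are not inside quotes
        elif char == quote:
            quote = None
        result.append(char)
    return "".join(result)


cleaned = strip_spaces_outside_quotes('=IF( A1 > 0, "hello world", B1 )')
```

Doubled quotes used as escapes inside a string ("" or '') close and immediately reopen the quoted section, so they pass through unchanged.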

Set up an automatic Cython compiler on commit

VDB function with partials

excellib.vdb() doesn't output exactly the same result when using a partial start_period or end_period (that is, floats).

Find a faster way to read graphs

An idea is to use this kind of strategy:
https://axialcorps.com/2013/09/27/dont-slurp-how-to-read-files-in-python/

The problem seems to be that we handle gzip files. An alternative solution could be https://docs.python.org/3/library/zlib.html#zlib.decompressobj, but we need to know the 'window size' (wbits), which I don't think we know in advance.
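
On the wbits question: zlib accepts a value of 32 plus a window size, in which case the header format (zlib or gzip) is detected automatically, and using the maximum window size lets it decode any stream. The sketch below streams a gzip payload in chunks on that basis; iter_decompressed is a hypothetical helper and it handles a single gzip member only.

```python
import gzip
import io
import zlib


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from a gzip or zlib stream without reading it whole."""
    # MAX_WBITS | 32: largest window, header type detected automatically.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Round trip on an in-memory payload, read back in small chunks.
payload = b"node;A1;42\n" * 1000
compressed = gzip.compress(payload)
restored = b"".join(iter_decompressed(io.BytesIO(compressed), chunk_size=128))
```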

Explore the idea of pure functional evaluation

This is to avoid calling eval() at each node of the graph, which takes quite a long time.

Add function to create name ranges within generated graph

Update html documentation for gen_graph algorithm

excellib.match() needs an ExcelError as output

This line should output an ExcelError.

Adapt graph serialization to Range object

See #12.

Named ranges in different sheets with the same name

RangeCore.apply_all on Range with different sizes

Our current strategy is not to fill Ranges with empty cells.
But this might lead to apply_all operations on Ranges with different sizes, raising an Exception.

We might need to consider filling the missing cells values with zeros on such occasions.
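
Filling the gaps with zeros could be done just before the operation, leaving the stored Ranges sparse. The sketch below assumes a Range can be described by its full address list plus a dict of the values that exist; densify and apply_all are hypothetical helpers and not RangeCore's actual API.

```python
import operator


def densify(addresses, values, fill=0):
    """Return one value per address, using `fill` where the cell is missing."""
    return [values.get(address, fill) for address in addresses]


def apply_all(op, left, right):
    """Apply a binary operator element by element to two equally sized lists."""
    if len(left) != len(right):
        raise ValueError("cells and values in a Range must have the same size")
    return [op(a, b) for a, b in zip(left, right)]


# A1:A3 lacks A2, B1:B3 is complete.
left = densify(["A1", "A2", "A3"], {"A1": 1, "A3": 3})
right = densify(["B1", "B2", "B3"], {"B1": 10, "B2": 20, "B3": 30})
total = apply_all(operator.add, left, right)
```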

Fix behavior without clean_volatiles()

Formulas like A1:OFFSET(A1,0,1) lead to errors when you don't clean_volatiles().

This issue overruns #25.

Correct Evaluation

Plug Cell.value to Range.values when necessary

Authorize ":" tokenizer when you have inputs that influence 'INDEX' or 'OFFSET' formula

When you have inputs that can modify cells with formulas containing INDEX or OFFSET, you don't want to pre-parse your formulas to clean the volatiles.
So you need to be able to calculate your workbook entirely (even if it takes a great amount of time).

Currently, this is not possible and leads to evaluation errors due to bad parsing of ":" characters.
A generic mode addressing this case needs to be available.

Find_associated_values() doesn't work for partial Ranges

When a Range is partial due to empty cells, the find_associated_values() method won't work if the reference cell is associated with an empty cell, since this cell will be missing from the Range.

Spreadsheet.prune() takes too long

When using a certain number of inputs/outputs (say, 15 each), the loop appears to run forever.

False circular references

Formulas like:
=(totalDecom-SUM(INDEX(FA_RecCostsDecom;1;1):INDEX(FA_RecCostsDecom;1;CA_Periods-1)))*Deprec_UOPRates
trigger an infinite loop when calculated on a cell referenced as FA_RecCostsDecom.

This is because the koala algorithm currently reevaluates a Range each time it sees it in a formula.
A good way to handle this would be to store Ranges (in the koala sense) in a Spreadsheet.range_dict object, so that when koala encounters a Range it already knows, it can use the values directly without reevaluating the Range (thus avoiding the infinite loop).

Two problems, though:

this means the way Ranges are initialized must be adjusted so that a Range is created in the dict as soon as its first element is inserted (otherwise the previous formula wouldn't work either)
this might be a lot of effort for a few cases, since this might not happen that often
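
The idea can be sketched with a toy model in which each cell of a named Range depends only on the cells before it in the same Range, loosely mirroring the formula above. The entry in range_dict is created on the first lookup and filled cell by cell, so a cell never triggers a reevaluation of the whole Range. The Sheet class, its rate parameter (standing in for Deprec_UOPRates) and cell_value are hypothetical and do not reflect koala's Spreadsheet.

```python
class Sheet:
    """Toy spreadsheet with one self-referencing named Range."""

    def __init__(self, total, rate=0.5):
        self.total = total
        self.rate = rate
        self.range_dict = {}  # range name -> {index: value already computed}

    def cell_value(self, name, index):
        # The Range entry exists as soon as its first element is requested.
        values = self.range_dict.setdefault(name, {})
        if index not in values:
            # Toy formula: (total - SUM(range[0:index])) * rate
            previous = sum(self.cell_value(name, i) for i in range(index))
            values[index] = (self.total - previous) * self.rate
        return values[index]


sheet = Sheet(total=100)
third = sheet.cell_value("FA_RecCostsDecom", 2)
```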

Be more precise with ExcelError return values

Most of the Excel error codes I've used so far are "#N/A" or "#VALUE!"; we might need more precision.
We might need to find a way to be more explicit about the errors.

Investigate on an Eval error

Exception: Problem evalling: cells and values in a Range must have the same size
