lewisfogden / heavylight

A lightweight actuarial modelling framework for Python
Home Page: https://lewisfogden.github.io/heavylight/
License: MIT License
Heavylight generally favours simplicity of coding over performance; in particular, using caching classes to store results adds some overhead vs. alternatives.
Directly using dictionaries (example below) removes quite a bit of this overhead, but much more boilerplate code is needed, i.e. manual stores for each variable need to be initialised and then updated.
Version using heavylight:

```python
from heavylight import Model

class Policy(Model):
    def num_pols(self, t):
        if t == 0:
            return 1
        else:
            return self.num_pols(t - 1) - self.num_deaths(t - 1)

    def num_deaths(self, t):
        return self.num_pols(t) * 0.01

def run_pol():
    p = Policy(do_run=True, proj_len=400)
    return p.num_pols(399)
```
Version using dictionaries - runs faster but is much less readable:

```python
class Policy:
    def __init__(self, proj_len: int):
        self.proj_len = proj_len
        self._v1 = {}
        self._v2 = {}

    def num_pols(self, t):  # _v1
        if t in self._v1:
            return self._v1[t]
        else:
            if t == 0:
                value = 1
            else:
                value = self.num_pols(t - 1) - self.num_deaths(t - 1)
            self._v1[t] = value
            return value

    def num_deaths(self, t):  # _v2
        if t in self._v2:
            return self._v2[t]
        else:
            value = self.num_pols(t) * 0.01
            self._v2[t] = value
            return value

def run_pol():
    p = Policy(400)
    return p.num_pols(399)
```
Be warned: reading this might not be a productive use of your time.
Because everything is elementwise and there is no big matmul, the models are memory-bandwidth limited on GPU and are not utilizing all the FLOPS: https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#element-op
To get around the bottleneck we can use an ML compiler such as torch.compile.
Our models can't be compiled right now. The logic that checks whether values are in the cache, and similar Python code executed between tensor operations, is probably causing what the PyTorch people call a "graph break".
To avoid graph breaks, I believe we have to go around memoization/recursion and actually write a loop.
But writing out the model as a loop by hand would really suck, so we will have to generate the loop.
It would maybe be easier to just generate the unrolled loop, but 22 formulas across 277 timesteps would be around 6,000 lines of code, and you can't really edit that by hand in any productive way, so we will probably have to generate an actual loop just for ergonomics.
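As a rough illustration (not from the repo), here is the num_pols/num_deaths model from above written as an explicit loop: with no cache checks or recursion between tensor ops, there is nothing for torch.compile to graph-break on. Note that Dynamo unrolls the Python loop at trace time, so max_t is treated as static.

```python
import torch

@torch.compile
def run_pol(max_t: int, mort_rate: float = 0.01):
    # loop form of num_pols/num_deaths: no memoization, no recursion,
    # so there is no Python cache logic between tensor operations
    pols = torch.tensor(1.0)
    for _ in range(max_t):  # Dynamo unrolls this loop when tracing
        pols = pols - pols * mort_rate
    return pols

print(run_pol(10))
```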
I don't want to deal with parameters that aren't a single integer timestep, and will probably enforce that.
cache_graph.graph is currently unused; we should probably use that sort of thing.
Enforce that timesteps never go back more than 1. Enforce that the only argument is ever the timestep.
algorithm:
- Calls of the form func(t-1) go into the t_prev_list.
- At the end of each iteration, set func_t_prev = func_t for all func in the t_prev_list.
- The t == 0 branch supplies the initialization conditions (e.g. on pols_if at t = 1).
- Sort, compile.
- References to t-1 are going to reference func_t_prev.
- The whole function can be parameterized by t, and that will determine the number of iterations in the loop or something like that.

At the end of the day, the results of the compiler will be like this:
```python
class MyClass:
    def __init__(self):
        # same code as before
        mp = ...
        ...

    def run(self, max_t: int):
        # initialization (the t == 0 case)
        pols_if = mp.pols_if_init
        pols_death = pols_if * assume.mort_rate
        pols_if_prev = pols_if
        pols_death_prev = pols_death
        for _ in range(max_t):
            # recurrences: t-1 references become *_prev variables
            pols_if = pols_if_prev - pols_death_prev
            pols_death = pols_if * assume.mort_rate
            pols_if_prev = pols_if
            pols_death_prev = pols_death
```
If table has keys 'A', 'B' and 'C', then looking up table['AB'] returns the value for table['B'].
Cause: np.searchsorted places 'AB' between 'A' and 'B'.
Ideal behaviour: it should return np.nan or raise an exception if the key doesn't exist.
As keys and data should be aligned this shouldn't happen, but if incorrect data is passed in, it will not fail.
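A minimal reproduction of the false positive (hypothetical keys and values):

```python
import numpy as np

keys = np.array(['A', 'B', 'C'])
values = np.array([1.0, 2.0, 3.0])

idx = np.searchsorted(keys, 'AB')  # 'AB' sorts between 'A' and 'B' -> index 1
print(values[idx])                 # 2.0: silently returns the value for 'B'

# a bounds check catches the missing key:
if idx >= len(keys) or keys[idx] != 'AB':
    raise KeyError('AB')
```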
I think in the past you wanted to return np.nan, but then it changed the type of the array.
Currently it throws an error.
I almost think we shouldn't even throw the error. If someone says 999999, surely they mean np.inf anyway?
What is the best possible behavior?
Users would like methods to appear in .df results in the order they are defined (so that similar functions are grouped together). Currently the use of getmembers to get all the methods in the class means they are sorted alphabetically, per https://docs.python.org/3/library/inspect.html#inspect.getmembers
Using vars or accessing the underlying class __dict__ would avoid this issue.
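A quick demonstration of the difference (Example is a stand-in class):

```python
import inspect

class Example:
    def zeta(self, t):
        return 0

    def alpha(self, t):
        return 1

# getmembers sorts alphabetically, losing definition order:
print([n for n, _ in inspect.getmembers(Example, inspect.isfunction)])  # ['alpha', 'zeta']

# the class __dict__ preserves definition order:
print([n for n, v in vars(Example).items() if inspect.isfunction(v)])   # ['zeta', 'alpha']
```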
Oops, the badge is for the code coverage on my fork. It needs to be changed to track https://app.codecov.io/gh/lewisfogden/heavylight
A few topics:
- Inputs
- Outputs (pandas.DataFrame.agg style?)
- Examples: heavylight.demo.create_sample('numpy_template', 'path/to/folder')
I don't think the model you developed supports them. They are pretty annoying; should I stop supporting them as well? Should you start supporting them? Does it matter?
I was trying to see what was going on with this notebook, but running all cells doesn't work, giving the errors below:
The LightModel currently aggregates all methods using the storage function. This is not practical because some functions might return different data types; it becomes messy and we end up with if statements and such.
So try using decorators to apply storage functions at the method level. For benchmarking purposes, this allows a reduction in the number of floating point operations used to calculate results.
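A minimal sketch of the decorator idea; the store name and agg_func attribute are assumptions for illustration, not the library's API:

```python
import numpy as np

def store(agg_func):
    """Hypothetical decorator: attach a per-method storage/aggregation function."""
    def decorator(method):
        method.agg_func = agg_func
        return method
    return decorator

class TermModel:
    @store(np.sum)                 # aggregate this method's results
    def net_cf(self, t):
        return 100.0

    @store(lambda values: values)  # keep raw values, no aggregation
    def pols_if(self, t):
        return 1.0
```

The model runner would then consult each method's agg_func when deciding how to store its results, instead of pushing everything through one global storage function.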
I was seeing reductions in cache size of 80% on the lifelib Term_ME model, when larger reductions (> 99%) were expected.
A cached function pols_new_biz that is calculated and then cleared by another function expenses will still be called in the loop:

```python
for t in range(proj_len + 1):
    for func in model._single_param_timestep_funcs:
        func(t)
```

so the cleared entry is recomputed and re-enters the cache, which is why the observed reduction falls short of expectations.
From another issue:
I've found an issue with the indexing in Tables: if you have multiple keys, with an integer key going from 18-90 (say), and you look up 2, then you can get a false positive - it will find an earlier index. (2 would find the value for 90-16 for the prior key.)
Possible solution: resolve by bounds checking. Possibly make the bounds checking optional, like Table(safe=False), and I think it is best that safe=True by default: safe=True performs the important validations that take time, and safe=False doesn't.
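A minimal sketch of how the proposed safe flag might work, assuming a single integer key (the real Table supports multiple keys):

```python
import numpy as np

class Table:
    """Sketch only: optional bounds checking on a sorted integer-keyed table."""
    def __init__(self, keys, values, safe=True):
        self.keys = np.asarray(keys)
        self.values = np.asarray(values)
        self.safe = safe

    def __getitem__(self, key):
        idx = np.searchsorted(self.keys, key)
        if self.safe:
            if idx >= len(self.keys) or self.keys[idx] != key:
                raise KeyError(key)
        return self.values[idx]

band_table = Table(keys=range(18, 91), values=np.arange(18, 91) * 0.001)
band_table[20]   # valid lookup
# band_table[2]  # raises KeyError with safe=True; silent false positive with safe=False
```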
```python
import heavylight

class Non_T_Model(heavylight.Model):
    def get_a(self, a):
        return a * 10

ntm = Non_T_Model(proj_len=10)
ntm.get_a(5)
ntm.get_a(10)
ntm.get_a(7)
```
After this, ntm.get_a.values will return [50, 100, 70].
I don't think this is a big issue, as non-t functions are unlikely to have a meaningful order, however it might cause issues.
Aside: I've added a keys property, e.g. ntm.get_a.keys, which is aligned to the values.
If we optimize the model, it basically says: after executing a function, clear some other functions. This is sensitive to the order in which functions are executed, so it is possible that there will be cache misses. This won't happen if both calls are made through RunModel(c), where c is the same for the optimized and unoptimized model - the call orders are the same, so it is not terrible.
But still, this is a footgun. The generate_actions and execute_actions approach avoids this by making model optimizations happen only within a separate execute_actions context, rather than clearing the cache during normal model execution based on some internal state.
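A rough sketch of the two-phase idea; the function names come from the discussion above, but the signatures and the cache interface are assumptions:

```python
from typing import Callable, List, Tuple

Action = Tuple[str, Callable, int]  # ("call" | "clear", function, timestep)

def generate_actions(model, proj_len: int) -> List[Action]:
    """Hypothetical planning pass: record the call schedule, plus the points
    where a cache entry is provably dead and can be cleared."""
    actions: List[Action] = []
    for t in range(proj_len):
        for func in model._single_param_timestep_funcs:
            actions.append(("call", func, t))
            # a dependency analysis would also append ("clear", other_func, t_dead)
    return actions

def execute_actions(model, actions: List[Action]):
    """Execution pass: cache clearing happens only here, driven by the plan,
    never during normal (unoptimized) model execution."""
    for kind, func, t in actions:
        if kind == "call":
            func(t)
        else:  # "clear" - per-function cache dict is an assumed interface
            func.cache.pop(t, None)
```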
https://modelx.io/blog/2022/03/26/running-model-while-saving-memory/
The current approach in heavylight is that projections run when the instance is created, e.g. if the user model is:

```python
class MyModel(heavylight.Model):
    def <user_method>(self, t):
        return <stuff>
```
Then when this is run, e.g. proj1 = MyModel(do_run=True, proj_len=10), the model will be run and the results stored in the cache.
Once the projection is run, users can access the values from proj1.<user_method>(t) for individual values, proj1.<user_method>.values for an array, and proj1.ToDataFrame() as the optional way to pull all single-parameter values into a dataframe (handy for debugging/viewing, as it is easy to copy into Excel). There is also a sum method on the cache which returns the total, e.g. proj.<user_method>.sum().
The proj_len variable controls how much of the projection is pre-computed (from t = 0 to proj_len - 1); if the user requests a method result from after this, the model will run through all the intermediate calculations and cache these.
E.g. proj.<user_method>(20) would calculate a further 11 values and cache them (10, 11, ..., 20).
The rationale for doing the pre-computing is that the Python stack can overflow with a lot of recursion.
I initially allowed the cache to be cleared, however I found this risky (I use some proprietary software which doesn't always clear the cache correctly), and instead decided that if a new projection is needed, you should just create a new instance.
As new features have just shipped and I would like to use them, I need the new code to be published. Often this is done with a GitHub Action that publishes to PyPI on each new release.
Changes I'm thinking of - @MatthewCaseres, would like your views too:
- __init__ (i.e. Model behaviour rather than LightModel behaviour).
- A backend parameter in Model, covering:
  - t-2: a simple aggregator that applies agg_func on all methods taking t as a parameter.
  - graph: memory optimised.
- t-2 and graph both take the agg_func parameter, which defaults to np.sum.
- graph requires a pre-run to optimise; this samples 50 points (say) from the dataset, but could pass an optional optimiser_data parameter?
- backend allows us to add different optimisers in future (potentially even a compiler).

The caller might look like this:

```python
proj = Model(data=data, basis=basis, proj_len=120, backend='t-2', agg_func=np.mean)
```
heavylight is the name of a famous algorithm (heavy-light decomposition), so maybe this package can't be easily found online. Why not rename it to LifeInsurance, so people looking up "life insurance python" find out about it?
edit: another way to increase discoverability is making a PR to lifelib once the package is more stabilised
As stated in the README, functions that use future periods cause a recursion error. For example:

```python
import heavylight

class MyModel(heavylight.Model):
    def test(self, t):
        if t == 1440:
            return 1
        return self.test(t + 1)

model = MyModel(do_run=True, proj_len=1440)
model_cashflows = model.ToDataFrame()
print(model_cashflows)
```
results in:
RecursionError: maximum recursion depth exceeded while calling a Python object
If I change line 141 in heavylight.py from:

```python
for t in range(proj_len):
```

to:

```python
for t in range(proj_len, -1, -1):
```

then it works without any problems.
So maybe the package should check if the given function calls itself with argument t+... and, if so, run the for-loop backwards.
To check the argument of the function, the ast
package can be used.
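A sketch of that check, assuming the timestep argument is literally named t (a real implementation would need to handle more call shapes):

```python
import ast
import inspect
import textwrap

def calls_future_timestep(func) -> bool:
    """Return True if the function contains a call with an argument of the form t + <expr>."""
    source = textwrap.dedent(inspect.getsource(func))
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for arg in node.args:
                if (isinstance(arg, ast.BinOp)
                        and isinstance(arg.op, ast.Add)
                        and isinstance(arg.left, ast.Name)
                        and arg.left.id == "t"):
                    return True
    return False

# calls_future_timestep(MyModel.test) -> True, so the runner could loop backwards
```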
I see that the CI runs are failing. Please set up Codecov; I cannot do this because I cannot modify your repository secrets.
Tests:
- basic
- tables only
- combined
- advanced
Ideally, publishing of the site is automated with a GitHub Action.
I'm starting on some docs and it isn't clear how to motivate the difference between the models. From a historical perspective, I wanted to do my own thing and contribute to a group effort without breaking existing stuff.
How do we communicate this to users in a way that isn't confusing? Do I just say
"LightModel is like Model but it implements memory optimizations and automatic aggregations, and has a slightly different API"
and move on from it? Is there anything to really be done from a source code perspective here that would be easy to do?