tjhladish / abcsmc
Sequential Monte Carlo Approximate Bayesian Computation with Partial Least Squares parameter estimator
License: GNU General Public License v3.0
I'm doing a complete write-up of transformations as part of the refactor. What is currently supported, as I understand it:
// a = sum of transformed scale addends
// b = prod of transformed scale factors
// c = sum of untransformed scale addends
// d = prod of untransformed scale factors
// (n.b. the transformed / untransformed distinction only concerns when a-d are applied - the actual values summed / multiplied are all on the transformed scale)
// u = arbitrary untransform function (though limited selection from config files)
//
// rescale.1, rescale.2 = config-specified values
//
// x = value on the _fitting_ scale
// x' = value on the _model_ scale
//
// x' = (rescale.2-rescale.1) (u((x + a) * b) + c) * d + rescale.1
That grok as correct?
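For concreteness, the formula above could be sketched in C++ as follows. This is only an illustration of one reading of the expression, with ordinary operator precedence applied to the products; `ParTransform`, its field names, and `to_model_scale` are hypothetical and are not part of the AbcSmc API.

```cpp
#include <cmath>
#include <functional>

// Hypothetical container for the quantities named in the write-up above.
struct ParTransform {
    double a = 0.0;         // sum of transformed-scale addends
    double b = 1.0;         // product of transformed-scale factors
    double c = 0.0;         // sum of untransformed-scale addends
    double d = 1.0;         // product of untransformed-scale factors
    double rescale1 = 0.0;  // config-specified value rescale.1
    double rescale2 = 1.0;  // config-specified value rescale.2
    // Untransform function u; identity by default, e.g. std::exp for a log-scale fit.
    std::function<double(double)> u = [](double v) { return v; };

    // x is on the fitting scale; the result is on the model scale.
    double to_model_scale(double x) const {
        return (rescale2 - rescale1) * (u((x + a) * b) + c) * d + rescale1;
    }
};
```

With all defaults the mapping is the identity, which makes it easy to check that each of a-d and the rescale values enters where the write-up says it should.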
For bundled executables, we want a version that can easily run without a database, either by ingesting a row from the command line (à la the run-via-executable simulation) or by iterating over a vector of parameters.
@tjhladish any other design constraint thoughts from today's problem or others?
There is no dtor for the AbcSmc object. I believe that an explicit dtor is required in this case to address memory leaks related to the _model_pars and _model_mets class members. These pointers will not be cleaned up and deleted without an explicit dtor.
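A minimal sketch of such a dtor, assuming the two members are vectors of owning raw pointers (the issue does not state their exact types). `Parameter`, `Metric`, and `AbcSmcLike` are hypothetical stand-ins, not the real classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real parameter and metric types.
struct Parameter { double value = 0.0; };
struct Metric { double value = 0.0; };

// Sketch only: assumes the members are vectors of owning raw pointers.
class AbcSmcLike {
  public:
    AbcSmcLike() = default;
    // Copying an owner of raw pointers would lead to double deletion, so forbid it.
    AbcSmcLike(const AbcSmcLike&) = delete;
    AbcSmcLike& operator=(const AbcSmcLike&) = delete;

    // Explicit dtor: release everything the object owns.
    ~AbcSmcLike() {
        for (Parameter* p : _model_pars) { delete p; }
        for (Metric* m : _model_mets) { delete m; }
    }

    void add_parameter(Parameter* p) { _model_pars.push_back(p); }
    void add_metric(Metric* m) { _model_mets.push_back(m); }
    std::size_t npar() const { return _model_pars.size(); }

  private:
    std::vector<Parameter*> _model_pars;
    std::vector<Metric*> _model_mets;
};
```

Holding the elements as `std::unique_ptr` instead would give the same cleanup without a hand-written dtor.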
Currently, every pass with AbcSmc must ingest the entire history of complete jobs to process the latest set => propose next set. This is because the weights are computed iteratively: the latest set weights actually depend on the entire history of weights. We don't store those weights, nor the parameters for noising draws.
We should continue to store the history, but if we also start storing weights and doubled variances, then we can simplify the code that deals with filtering to just take the last complete step. This could allow us to substantially trim down the AbcSmc object itself, getting rid of all the vectors-of-stuff and the assorted look-up-by-set accessors.
At the moment, we have no fitting-algorithmic reason to use all the data, though we do look at it for diagnostic reasons. We don't generally need an AbcSmc object, however, to do the diagnostic steps.
For many of the set_... methods in AbcSmc, it could be useful to return the self pointer, which would enable using the library like:
AbcSmc* abc = (new AbcSmc())->parse_config(args.config_file)->set_simulator(simulator);
instead of the current
AbcSmc* abc = new AbcSmc();
abc->parse_config(args.config_file);
abc->set_simulator(simulator);
The advantage is enabling that sort of fluent/flow syntax (which some people like to use). The main disadvantage is losing the ability to have the methods return true/false "success" indicators.
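As a rough sketch of the chained style, the setters below return `this`. The class `FluentAbc` and the `ok()` accessor are hypothetical; accumulating a status flag is just one possible way to keep a success indicator, not something the library does today.

```cpp
#include <string>

// Hypothetical class illustrating chained setters; not the AbcSmc API.
class FluentAbc {
  public:
    using Simulator = double (*)(double);

    // Each setter returns this so that calls can be chained.
    FluentAbc* parse_config(const std::string& path) {
        _config = path;
        _ok = _ok && !path.empty();  // stand-in for real parsing
        return this;
    }
    FluentAbc* set_simulator(Simulator sim) {
        _sim = sim;
        _ok = _ok && (sim != nullptr);
        return this;
    }
    // Accumulated status stands in for the per-call true/false return.
    bool ok() const { return _ok; }

  private:
    std::string _config;
    Simulator _sim = nullptr;
    bool _ok = true;
};
```

A caller would chain the setters and then check `ok()` once at the end, at the cost of not knowing which step failed unless more detail is recorded.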
See tjhladish/PLS#10.
Our typical use case for post processing analysis (e.g., visualization, scenario evaluation) is R-based, using a relatively portable collection of tools.
I'm slowly but surely aggregating my big-stochastic-simulation-of-many scenarios set of generic tooling over here. AbcSmc specific stuff probably doesn't belong in that, but I think AbcSmc could come with an internal R package gathering up well-prescribed combinations of jsonlite + data.table + RSQLite functions that move from AbcSmc outputs to the kind of stuff that library wants.
The idea would be to provide a user interface that masks the json / SQL syntax + AbcSmc structural decisions with R functions that yield data.table objects.
reference eigen repo: https://gitlab.com/libeigen/eigen/-/tree/master/
We can basically purge the Eigen files currently kept in the repository, replace them with the submodule root, and then address the resulting path issues.
This would entail slightly more disk footprint when building AbcSmc (since it's the whole repo, vs just the headers), but that's pretty marginal. Longer term would simplify updating Eigen (i.e. just pull in the submodule, then update AbcSmc's record of where the submodule is).
Users may want a clean slate to re-run an experiment, i.e., delete the sqlite db in order to try again. (Currently, this has to be done manually: rm mydb.sqlite)
This can be dangerous, however, as the database could represent an arbitrary amount of compute time, and we don't want to make it too easy to delete dbs.
Suggest adding a command line argument for AbcSmc parser to specify that db should be truncated/deleted, but that this can only be done with interactive input from the user. The inconvenience is worth the precaution against losing a month of work.
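One way the interactive safeguard could look is sketched below. The function names are hypothetical, and requiring the user to retype the database file name (rather than a bare "y") is an assumption about how strict the check should be.

```cpp
#include <iostream>
#include <string>

// Deletion is confirmed only when the reply repeats the database file name exactly;
// a bare "y" or an empty line is not enough.
bool reply_confirms_delete(const std::string& dbname, const std::string& reply) {
    return !dbname.empty() && reply == dbname;
}

// Interactive wrapper: prompts on stdout and reads one line from stdin.
bool confirm_db_delete(const std::string& dbname) {
    std::cout << "This will permanently delete '" << dbname << "'.\n"
              << "Type the database file name to confirm: " << std::flush;
    std::string reply;
    if (!std::getline(std::cin, reply)) { return false; }  // EOF / non-interactive: refuse
    return reply_confirms_delete(dbname, reply);
}
```

Because a closed or empty stdin is treated as a refusal, a batch script cannot trigger the deletion by accident.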
For scenario analysis, we often want to see samples in progress. As in, we have, say, samples 1-10 for all scenarios, rather than finishing all the samples for a particular scenario, then moving on to the next.
This would be accomplished currently by using a "realization" pseudo parameter, and making that the last item in the parameter configuration file, and then effectively the outermost iteration loop when processing all the scenario elements.
Three points:
When using "Posterior" type parameters, they should be drawn from the data storage as unified rows.
This would simplify their representation as parameters (get a row at a time, which could eliminate a fair chunk of the iteration logic), and reduce their footprint in the database (a distinct posterior table, with par_vals pointing to its key rather than holding the whole table; this can still be materialized into a view when querying).
The config specification itself could then refer to an array of column names, shared ranks, etc.
There's a potential downside in precluding unlinked posterior iteration, but: that's already precluded and I think this approach actually enables that approach. Could define multiple posterior parameters (for different blocks of columns), which would be allowed independent traversal.
We have typically used AbcSmc in two modes: "fitting" and "project". However, configuration files generally have no features that unambiguously distinguish the two. Nor is there a core abc executable that takes a -f|--fitting vs -p|--project flag.
If we could know that we were in "project" mode, then lots of boilerplate options could be removed and/or set to defaults for the configuration file.
For example, all of the predictive prior / PLS fraction keys are irrelevant (maybe even error inducing). num_samples could default to all combinations.
There's a no-longer-maintained library of ranking operations in lib/ranker.h.
However, these are now likely superseded by the C++17 standard library.
We should eliminate this dependency.
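As an illustration of how little is needed, here is a sketch of ordinal ranking using only standard algorithms (all available since C++11). It does not claim to reproduce the tie-handling options of lib/ranker.h; `ordinal_rank` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Ordinal ranks (1 = smallest). Ties are broken by original position;
// other tie methods (e.g. average rank) would need extra handling.
std::vector<std::size_t> ordinal_rank(const std::vector<double>& values) {
    std::vector<std::size_t> order(values.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
    // Sort the indices by the values they point at.
    std::stable_sort(order.begin(), order.end(),
                     [&values](std::size_t i, std::size_t j) { return values[i] < values[j]; });
    std::vector<std::size_t> ranks(values.size());
    for (std::size_t r = 0; r < order.size(); ++r) { ranks[order[r]] = r + 1; }
    return ranks;
}
```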
postgresql library https://pqxx.org/development/libpqxx/
Quick start documentation instructs users to run make run_sql; make run_sql; make run_sql. This used to work, but run_sql is no longer a valid make target. Running make in the examples directory with the default target produces the following:
cmake -S . -B demo
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) in CMakeLists.txt:
No cmake_minimum_required command is present. A line of code such as
cmake_minimum_required(VERSION 3.22)
should be added at the top of the file. The version specified may be lower
if you wish to support older CMake versions for this project. For more
information run "cmake --help-policy CMP0000".
This warning is for project developers. Use -Wno-dev to suppress it.
-- Configuring done
CMake Warning (dev) at CMakeLists.txt:9 (add_library):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "dice" links to target "GSL::gsl" but the target was not found.
Perhaps a find_package() call is missing for an IMPORTED target, or an
ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:9 (add_library):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "dice" links to target "GSL::gslcblas" but the target was not found.
Perhaps a find_package() call is missing for an IMPORTED target, or an
ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:12 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "dice_game" links to target "GSL::gsl" but the target was not found.
Perhaps a find_package() call is missing for an IMPORTED target, or an
ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:12 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "dice_game" links to target "GSL::gslcblas" but the target was not
found. Perhaps a find_package() call is missing for an IMPORTED target, or
an ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "shared" links to target "GSL::gsl" but the target was not found.
Perhaps a find_package() call is missing for an IMPORTED target, or an
ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "shared" links to target "GSL::gslcblas" but the target was not
found. Perhaps a find_package() call is missing for an IMPORTED target, or
an ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "pseudo" links to target "GSL::gsl" but the target was not found.
Perhaps a find_package() call is missing for an IMPORTED target, or an
ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
Policy CMP0028 is not set: Double colon in target name means ALIAS or
IMPORTED target. Run "cmake --help-policy CMP0028" for policy details.
Use the cmake_policy command to set the policy and suppress this warning.
Target "pseudo" links to target "GSL::gslcblas" but the target was not
found. Perhaps a find_package() call is missing for an IMPORTED target, or
an ALIAS target is missing?
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Warning (dev) at CMakeLists.txt:37 (add_executable):
Policy CMP0003 should be set before this line. Add code such as
if(COMMAND cmake_policy)
cmake_policy(SET CMP0003 NEW)
endif(COMMAND cmake_policy)
as early as possible but after the most recent call to
cmake_minimum_required or cmake_policy(VERSION). This warning appears
because target "shared" links to some libraries for which the linker must
search:
abc, GSL::gsl, GSL::gslcblas
and other libraries with known full path:
/home/tjhladish/work/AbcSmc/examples/demo/libdice.so
CMake is adding directories in the second list to the linker search path in
case they are needed to find libraries from the first list (for backwards
compatibility with CMake 2.4). Set policy CMP0003 to OLD or NEW to enable
or disable this behavior explicitly. Run "cmake --help-policy CMP0003" for
more information.
This warning is for project developers. Use -Wno-dev to suppress it.
-- Generating done
-- Build files have been written to: /home/tjhladish/work/AbcSmc/examples/demo
cd demo && make direct
make[1]: Entering directory '/home/tjhladish/work/AbcSmc/examples/demo'
make[1]: *** No rule to make target 'direct'. Stop.
make: *** [Makefile:36: demo/direct] Error 2
From what I can tell re MPI, the intent is to support distributing runs out to various nodes, running the simulations, collecting the results, then running ABC-PLS on the collected results.
"resume" directory seems to serve a related purpose.
These seem like useful approaches to offer, but are currently deprecated clutter. Recommend we just excise them for now, then revisit if necessary in the future. At that point, my imagination of AbcSmc would enable implementing those in a stand-alone way as some mixture of simulator / storage classes.
For broadest use of AbcSmc, we should prefer to only require the minimal arguments, while flexibly supporting receiving more state as desired by the library user.
The minimal arguments might be just the parameter values. Arguably, if someone is using AbcSmc, they're doing stochastic work, so should also be receiving a seed. But we shouldn't force defining a simulator in terms of arguments that they immediately ignore (e.g. serial in some cases, _mp in basically all cases).
& and * set_sim* signatures
AbcConfig is a friend of AbcSmc (so can access private fields)
AbcSmc state
AbcConfig has a template, virtual parse method (possibly also a deserialize? takes plain string filename and inflates to the relevant type for parse?). Subclasses define this parse
AbcConfig has a build method (receives filename => deserialize => parse => check state => yields an AbcSmc ready to work)
AbcConfig static method to serve up relevant parser, based on sniffing filetype?
Being able to use AbcSmc without having to provide a configuration file would be useful.
This kind of approach is currently represented in examples/manual/main.cpp, though is no longer functional as such.
Low priority: make this work again.
With an established DSL for the configuration files, it should be possible to write a skeleton file which creates a plain-old-data object convenience converter, à la:
#include <vector>

struct SOME {
    // ...pertinent fields
    // ...some automated assignment of params to the pertinent fields
    explicit SOME(const std::vector<double>& params) {}
};
Basically, a static convenience tool - similar ones could be available for parsing scenarios, deparsing metrics, etc. Some of this is irrelevant if AbcSmc adopts an approach of passing around maps instead of vectors, but it would enable disentangling model variable choices from storage choices (i.e. these skeletons could be edited to do a slightly more complex transformation between the preferred model value and the preferred analysis / storage value).
Ideally, if someone wants to use AbcSmc in some simulator (and not do anything fancy with elements under the hood), they should be able to include a header or two and indicate only the path needed to find those.
However, currently the sqdb and jsoncpp libraries leak out interface-wise: if I'm just writing a simulator, I have to know the detailed structure inside AbcSmc to provide the path to those dependencies as well OR we have to write some gnarly bits for those within the abc headers.
Proposed solution: expose less of the interface publicly. If we remove the sqdb and jsoncpp elements from the signatures in the AbcSmc header and instead encapsulate talking to those dependencies in AbcSmc.cpp, that will hide (for build purposes) those dependencies from users.
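One common way to do this kind of hiding is the pointer-to-implementation idiom, sketched below in a single block for brevity. `Storage` is a hypothetical class, and a `std::map` stands in for the database handle; nothing here is taken from the actual AbcSmc headers.

```cpp
#include <map>
#include <memory>
#include <string>

// ---- what would live in the public header ----
// No sqdb or jsoncpp types appear here, so simulator authors need no include
// paths for them. Storage is a hypothetical name, not an AbcSmc class.
class Storage {
  public:
    explicit Storage(const std::string& dbname);
    ~Storage();  // defined where Impl is a complete type
    void put(const std::string& key, double value);
    double get(const std::string& key) const;

  private:
    struct Impl;  // forward declaration only
    std::unique_ptr<Impl> _impl;
};

// ---- what would live in the .cpp file ----
// The real dependency headers would be included here; a std::map stands in
// for the database handle in this sketch.
struct Storage::Impl {
    std::string dbname;
    std::map<std::string, double> rows;
};

Storage::Storage(const std::string& dbname) : _impl(new Impl{dbname, {}}) {}
Storage::~Storage() = default;
void Storage::put(const std::string& key, double value) { _impl->rows[key] = value; }
double Storage::get(const std::string& key) const { return _impl->rows.at(key); }
```

The trade-off is one extra allocation and indirection per object, in exchange for a header that compiles without the third-party include paths.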
Need to establish a specification for .json configuration files (more broadly, other modes of providing those).
Notes:
TODO: ... assorted top level keys, which required / conflicting, type for associated values
num_samples: a positive integer or array of positive integers. Optional in "projection" mode; when provided in "projection" mode, should always be a single integer.
predictive_prior_(fraction|size) keys: mutually exclusive options for specifying how many draws to include when generating a predictive prior (in "fitting" mode). For either fraction or size, may be either a single value or an array. If an array, then must either match the length of the num_samples array OR num_samples must be length 1.
"parameters" key
"metrics" key
To move AbcSmc toward being a compiled tool, we should be marking various methods for deprecation, particularly those that use AbcSmc in compiled-library-with-simulator mode.
The set_sim family are definitely in that category. The key issue for backwards compatibility will be whether the as-a-library mode can correctly find the simulator symbol. That seems eminently solvable: have a sensible "just works" default for finding the symbol if it's named simulator and has the right signature [plus some useful error messages? convenience checkers?], and maybe allow a configuration option to use a different symbol name if people are picky.
Like #28, but basically addressing the acceptable combinations of par1, par2, step, vals, ... other options?
Would be useful to delete branches which have been successfully merged. By my count:
I don't have access to purge them here, but have trimmed off all of these in my locals.
The current approach to organizing parameters is un-normalized, meaning that adding parameter levels (or additional parameters) typically entails substantial handwork. (Similar problem for metrics, or if we went to split parameters by fitted vs scenario settings, etc).
For any given interaction with a database (that is, for a one-off simulation analysis), the unnormalized approach is easier to work with. But for repeated development in a particular simulation (e.g. adapting scenarios, metrics), or for developing tools which are generic across simulations, a normalized approach would be more flexible.
So:
Relative to the prepare, harness, etc approach in this application, would prefer to have "installing" abc provide 1) a library install AND 2) command line tooling.
The command line tooling would then work roughly like
$ abcsmc (-p|--prepare) some_config.json [-o some_config.sqlite]
$ abcsmc (-s|--simulate) some_config.sqlite some_model.so
A possible solution would be to have abcsmc install 1) a shell script that handles parsing command line args and passes them to 2) binaries that are purpose-built for the relevant steps.
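The issue suggests a shell script for the dispatch step; purely to illustrate the argument handling, the same logic might look like this in C++. `Mode`, `Invocation`, and `parse_args` are hypothetical names, and only the flags shown in the usage lines above are recognized.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class Mode { Prepare, Simulate, Unknown };

struct Invocation {
    Mode mode = Mode::Unknown;
    std::vector<std::string> operands;  // e.g. config, database, model paths
    std::string output;                 // value of -o, if given
};

// Parse the arguments that follow the program name.
Invocation parse_args(const std::vector<std::string>& args) {
    Invocation inv;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        if (a == "-p" || a == "--prepare") { inv.mode = Mode::Prepare; }
        else if (a == "-s" || a == "--simulate") { inv.mode = Mode::Simulate; }
        else if (a == "-o" && i + 1 < args.size()) { inv.output = args[++i]; }
        else { inv.operands.push_back(a); }  // anything else is a positional operand
    }
    return inv;
}
```

A `main` would build the vector from `argv + 1` and then hand off to whichever purpose-built step matches `inv.mode`.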
Propose replacing the json parsing library.
This would entail internal rework, but should have no knock-on consequences (excepting potentially abcsmc + simulator compile approaches that rely on jsoncpp to parse non-abc-related json).
Desire to have a "fingerprint" capability for users.
Currently do that by including the CCRC32 library and relying on users to know its methods.
A more robust approach would be to encapsulate precisely the capability we're providing, then export it (e.g., via AbcUtil), document it, and so on.
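A sketch of what a self-contained fingerprint helper could look like is below. It implements the standard CRC-32 (the variant used by zlib and PNG); whether that matches the values produced by the bundled CCRC32 library is an assumption, and `fingerprint` is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG).
// Slow but dependency-free; a table-driven version would be faster.
std::uint32_t fingerprint(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right one bit; fold in the polynomial if the low bit was set.
            const std::uint32_t mask = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ mask;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Exposing a single function like this would let users fingerprint a serialized parameter set without knowing anything about the underlying checksum library.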
One of the core AbcSmc capabilities is management of scenario projections: AbcSmc will systematically build up a scenario run plan and then run that plan.
However, there are use-case gaps and pain points. AbcSmc currently only directly supports "all combinations" build, so excluded combinations then have to be manually pruned after construction. This is wasteful (when there's lots of pruning), error prone / onerous (because user has to write pruning code), and may not be particularly portable (e.g. if AbcSmc supports different backends in the future, users would have to write different pruning for each backend).
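To illustrate pruning at build time rather than after construction, here is a sketch that enumerates all combinations but keeps only those passing a user-supplied predicate. `Scenario`, `build_plan`, and the predicate interface are hypothetical; AbcSmc does not currently offer this.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Scenario = std::vector<double>;

// Enumerate the Cartesian product of the per-parameter levels, keeping only
// scenarios for which keep() returns true.
std::vector<Scenario> build_plan(const std::vector<std::vector<double>>& levels,
                                 const std::function<bool(const Scenario&)>& keep) {
    std::vector<Scenario> plan;
    if (levels.empty()) { return plan; }
    for (const auto& l : levels) { if (l.empty()) { return plan; } }
    std::vector<std::size_t> idx(levels.size(), 0);  // odometer over level indices
    while (true) {
        Scenario s(levels.size());
        for (std::size_t p = 0; p < levels.size(); ++p) { s[p] = levels[p][idx[p]]; }
        if (keep(s)) { plan.push_back(s); }
        // Advance the odometer: the first parameter varies fastest, the last slowest.
        std::size_t p = 0;
        while (p < idx.size() && ++idx[p] == levels[p].size()) { idx[p++] = 0; }
        if (p == idx.size()) { break; }  // every index wrapped around: done
    }
    return plan;
}
```

Excluded combinations never reach storage, so the exclusion rule is written once against scenario values instead of once per backend.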
AbcSmc also doesn't distinguish between parameters (fitted elements) versus settings (scenario assumptions). That smells a bit funny, though it's hard to specify the precise problems there. For future capabilities, we might want to have AbcSmc work either from a computed prior (i.e. the resulting posterior cloud from fitting) or from a sampling one (i.e. just draw a sample from this analytic distribution), and having distinct parameters versus settings seems a natural boundary to set which would make that swap easier.
So, in general, we ought to formally describe what the domain specific language (DSL) for AbcSmc is. This is currently captured ad-hoc in the json configuration files; these blend engineering settings, fitting parameters, targets, scenarios parameters, but also often neglect the actual desired outputs.
Fairly hot: the examples code needs to be more one-stop-shopping, showing all the moving parts (i.e. complete, isolated, minimal working examples). @pearsonca has been using these more as integration tests, which results in more abstraction / reduced code duplication, but tends to be too complex for self-contained MREs.
These two use cases need to be distinguished. Initial implementation can be a bit messy.
Per subject line: https://github.com/SRombauts/SQLiteCpp
Biggest pro currently is that it's actively maintained.
Unknown: how it specifically exposes the features we use
Con: seems to have a bit more build to it, but: does explicitly talk about how to "install" it as a submodule.