niconeureiter / sbayes

MCMC algorithms to identify linguistic contact zones

License: GNU General Public License v3.0

sbayes's Introduction

sBayes

This software package implements a Bayesian mixture model for reconstructing linguistic contact areas, as described in Contact-tracing in cultural evolution: a Bayesian mixture model to detect geographic areas of language contact (Ranacher P., Neureiter N., Van Gijn R., Sonnenhauser B., Escher A., Weibel R., Muysken P., Bickel B.). sBayes implements a custom MCMC sampler to generate contact areas according to the model. Here we describe the installation process and the basic commands needed to run an analysis. For more detailed instructions explaining each step in the analysis and the various settings, please consult the user manual (sbayes_documentation.pdf).

Installation

To run sBayes, you need Python (version >=3.7) and three system libraries: GEOS, GDAL, and PROJ. How you install these depends on your operating system. For example, on Linux (Ubuntu/Debian) you can use the following command:

sudo apt-get install -y libproj-dev proj-data proj-bin libgeos-dev

On macOS, the same can be done using the Homebrew package manager:

brew install proj geos gdal

On Windows the three libraries need to be installed manually from the corresponding sources.

Once these system libraries are installed, sBayes can be installed (along with the required Python libraries) using:

pip install --user git+https://github.com/derpetermann/sBayes.git

Running sBayes

sBayes can be used as a Python library or through a command line interface. Here we describe the command line interface, which offers a convenient way to run the standard workflow of an sBayes analysis. To run sBayes from the command line, simply call:

sbayes <path_to_config_file>

The config file is a JSON file which contains all the settings for the analysis. The results are written to CSV files, which can be visualized using the plotting functions. The format of the config file and details about the set-up and visualization are described in the user manual.

sbayes's Issues

Consuming CLDF: the NULL value defined in the metadata should be treated as Null

When CLDF becomes readable as an input, you should look at the "null" value in the metadata description for the datatype and import that as unknown data.

Visualization: Area lines should have configurable / exaggeratable widths

The lines that are used to draw areas should have exaggeratable / configurable widths. E.g., if one has a minimum posterior frequency of 0.7, then the differences between 0.7 and 0.9 should be visually salient, with 0.7 being thin and 0.9 being quite thick.
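A minimal sketch of how such a mapping could work, assuming a simple power-law scaling between a minimum and maximum width. The function name `line_width` and all of its parameters are hypothetical and not part of the sBayes plotting code.

```python
def line_width(frequency, min_frequency=0.7, min_width=0.5,
               max_width=4.0, exaggeration=1.0):
    """Map a posterior frequency in [min_frequency, 1] to a line width.

    All names and defaults here are illustrative assumptions.
    """
    if not 0.0 <= min_frequency < 1.0:
        raise ValueError("min_frequency must be in [0, 1)")
    # Rescale the frequency to [0, 1] and clamp values outside the range.
    t = (frequency - min_frequency) / (1.0 - min_frequency)
    t = min(max(t, 0.0), 1.0)
    # exaggeration == 1 is linear; values above 1 stretch the upper end of
    # the range, values below 1 stretch the lower end.
    return min_width + (max_width - min_width) * t ** exaggeration
```

With the defaults, a frequency equal to the minimum posterior frequency gets the thinnest line and a frequency of 1.0 the thickest; `min_width`, `max_width`, and `exaggeration` would be the configurable parts.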

Fix implementation of hierarchical prior for confounders

The "universal" option as a confounding effect prior was deactivated because it caused issues in the MCMC operators. In order to make the hierarchical prior for confounders usable the operators need to be adjusted.

Write stats out as the program advances (+hanging run while writing output)

We discussed a while ago the possibility of writing the output files as the algorithm advances, rather than waiting until the end. I was doing a run for n=9 areas, and the program hangs after writing "MCMC Statistics" to stdout and never finishes writing the stats file. I went to check Tracer to see if I could salvage the run, but it seems the algorithm still waits until a run is finished before it writes anything. Can we have it write to output as it goes? This will be important for monitoring lengthy runs.
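As a rough sketch of what incremental output could look like: each sample is written as soon as it is produced, and the stream is flushed at regular intervals so a partially finished run can still be inspected. The function name `stream_samples`, the tab-separated layout, and the column names in the usage note are assumptions for illustration, not the actual sBayes output format.

```python
import csv


def stream_samples(samples, stream, fieldnames, flush_every=100):
    """Write samples to an open text stream one row at a time.

    `samples` is any iterable of dicts (for example a generator driven by
    the sampler), so no sample has to be kept in memory after writing.
    """
    writer = csv.DictWriter(stream, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    n_written = 0
    for sample in samples:
        writer.writerow(sample)
        n_written += 1
        # Flush periodically so the file on disk reflects current progress.
        if n_written % flush_every == 0:
            stream.flush()
    stream.flush()
    return n_written
```

A caller would open the stats file once, for example `with open("stats.txt", "w", newline="") as f: stream_samples(sampler_output, f, ["step", "loglik"])`, where `sampler_output` and the column names are placeholders.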

The point at which the hang happens (has been ongoing for many hours), if it is useful for debugging:

Visualization crashes while generating plots. Error is "Contour levels must be increasing" (urgent)

On the OOA data from Salsa, the visualization crashes while generating the plots. It fails with the following error message:

Traceback (most recent call last):
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1761, in <module>
    main()
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1724, in main
    plot.plot_weights(file_name='weights_grid' + m)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1252, in plot_weights
    self.plot_weight(weights[f], feature=f, title=cfg_weights['graphic']['title'],
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1113, in plot_weight
    sns.kdeplot(x=x, y=y, shade=True, cut=30, n_levels=100,
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/seaborn/_decorators.py", line 46, in inner_f
    return f(**kwargs)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/seaborn/distributions.py", line 1742, in kdeplot
    p.plot_bivariate_density(
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/seaborn/distributions.py", line 1181, in plot_bivariate_density
    cset = contour_func(
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/__init__.py", line 1599, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/axes/_axes.py", line 6430, in contourf
    contours = mcontour.QuadContourSet(self, *args, **kwargs)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/contour.py", line 855, in __init__
    kwargs = self._process_args(*args, **kwargs)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/contour.py", line 1456, in _process_args
    x, y, z = self._contour_args(args, kwargs)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/contour.py", line 1527, in _contour_args
    self._contour_level_args(z, args)
  File "/Users/david/opt/miniconda3/lib/python3.8/site-packages/matplotlib/contour.py", line 1211, in _contour_level_args
    raise ValueError("Contour levels must be increasing")
ValueError: Contour levels must be increasing

Support for a Pacific-centered map

It is difficult to get a Pacific-centered map: offsetting the coordinates (e.g., -250 west) renders a blank map. It would be nice to have support for, or instructions on, a Pacific-centered map.
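For reference, re-centering longitudes is usually a matter of wrapping them into a 360-degree window around the desired central meridian rather than adding a fixed offset. The sketch below shows only that arithmetic; the helper `recenter_longitude` is hypothetical, and whether the sBayes plotting code accepts coordinates transformed this way is not established here.

```python
def recenter_longitude(lon, center=180.0):
    """Wrap a longitude (degrees) into [center - 180, center + 180).

    With center=180 the result lies in [0, 360), which places the Pacific
    in the middle of the range instead of splitting it at the date line.
    """
    return (lon - center + 180.0) % 360.0 + center - 180.0
```

For example, a location at 170 degrees west (-170) becomes 190 under a Pacific-centered convention, so it ends up next to locations at 170 degrees east instead of on the opposite edge of the map.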

The mathematical treatment of missing data

Following discussion with Nico, we think there needs to be a decision and justification for how to treat missing data:
(1) as it is treated now, i.e., not adding it; or
(2) splitting the probability equally across all possible states.
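To make the two options concrete, here is a small sketch of the log-likelihood contribution of a single feature for a single language. Option (2) is open to interpretation; the sketch assumes it means treating the missing observation as a fractional count of 1/K for each of the K states, which is only one possible reading. The function name and signature are hypothetical.

```python
import math


def feature_log_likelihood(state_probs, observed=None, strategy="skip"):
    """Log-likelihood contribution of one feature for one language.

    state_probs: probability of each state under the model.
    observed: index of the observed state, or None if the value is missing.
    """
    if observed is not None:
        return math.log(state_probs[observed])
    if strategy == "skip":
        # Option (1): a missing value contributes nothing (a factor of 1).
        return 0.0
    if strategy == "equal_split":
        # Option (2), under the fractional-count assumption: each state is
        # counted as observed with weight 1/K.
        k = len(state_probs)
        return sum(math.log(p) for p in state_probs) / k
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this reading, option (2) penalizes distributions that are far from uniform whenever a value is missing, while option (1) leaves the likelihood untouched; that difference is the kind of consequence the decision would need to justify.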

Correspondence in visualization duplicates each language N times

When the plotting function outputs the correspondence sets, it duplicates each language (in each color) N times, where N is the number of areas. This means that it is not very useful for determining which language is in which area, especially if there are overlapping languages or lines in the posterior map. (This is a fairly high-priority bug.)

Visualization script will not run to completion

After the recent update, there is a problem when running the visualization script:

Traceback (most recent call last):
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1762, in <module>
    main()
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1721, in main
    plot_map(plot, m)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1752, in plot_map
    iterate_or_run(
  File "/Users/david/Documents/sbayes/sbayes/cli.py", line 102, in iterate_or_run
    iterate_over_parameter(x, config_setter, function, print_message)
  File "/Users/david/Documents/sbayes/sbayes/cli.py", line 95, in iterate_over_parameter
    function(value)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1755, in <lambda>
    function=lambda x: plot.posterior_map(file_name=f'posterior_map_{m}_{x}')
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 912, in posterior_map
    self.visualize_base_map(extent, cfg_geo, cfg_graphic, ax)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 802, in visualize_base_map
    self.add_background_map(cfg_geo, cfg_graphic, ax)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 763, in add_background_map
    lon_0 = map_crs.coordinate_operation.params[0].value
AttributeError: 'NoneType' object has no attribute 'params'

I did some debugging and map_crs is appropriately set to EPSG:4326 (as per my config), but it looks like it has no dependent value coordinate_operation that the code is expecting. Something must have changed because this used to work.
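A possible explanation is that a geographic CRS such as EPSG:4326 carries no coordinate operation, so `coordinate_operation` is `None` and the attribute access fails. A defensive sketch, assuming that a central longitude of 0 is an acceptable fallback in that case (the helper name `central_longitude` is hypothetical, and the lookup of `params[0]` simply mirrors the failing line):

```python
def central_longitude(map_crs, default=0.0):
    """Return the first coordinate-operation parameter of a CRS-like object.

    Falls back to `default` when the CRS has no coordinate operation,
    instead of raising AttributeError.
    """
    operation = getattr(map_crs, "coordinate_operation", None)
    if operation is None or not operation.params:
        return default
    return operation.params[0].value
```

The line in `add_background_map` could then read `lon_0 = central_longitude(map_crs)`.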

Options for map visualizations

When a large number of languages are included in the input, the map can get too busy to read. It is impossible to see if a language in the middle of a split area is a member of the area or not -- and this is a significant problem with dense or large maps. There are two solutions I can think of:

(1) A toggle on showing or not showing the lines, so that you just color the points differently
(2) The "aquarella" solution, where the land mass on the map is colored according to the area, with a fade-out: 100% opacity at the point where the language is spoken, fading radially to 0% opacity. Where two colored bubbles overlap, the color values are mixed linearly according to their relative opacities.
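The arithmetic behind option (2) can be sketched as follows, assuming a linear radial fade and RGB colors given as tuples of floats. Both function names are hypothetical and independent of the existing plotting code.

```python
def radial_opacity(distance, radius):
    """Opacity 1.0 at the language location, fading linearly to 0.0 at `radius`."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    return max(0.0, 1.0 - distance / radius)


def mix_colors(color_a, opacity_a, color_b, opacity_b):
    """Mix two RGB colors linearly, weighted by their relative opacities.

    Returns None where neither bubble contributes any color.
    """
    total = opacity_a + opacity_b
    if total == 0.0:
        return None
    weight_a = opacity_a / total
    return tuple(weight_a * a + (1.0 - weight_a) * b
                 for a, b in zip(color_a, color_b))
```

For a pixel that is equally far into a red bubble and a blue bubble, `mix_colors` returns an even blend of the two; closer to one language, that area's color dominates.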

Dirichlet should be readable from file

We have complex Dirichlet priors for complex inputs. Dirichlet priors should be readable from a file, not only entered directly in the config file.
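One way this could look, as a sketch only: the prior lives in a separate JSON file that maps each feature to the concentration parameter of each of its states. This file layout and the function names `parse_dirichlet` and `load_dirichlet` are assumptions for illustration, not the format sBayes uses.

```python
import json


def parse_dirichlet(raw):
    """Validate a {feature: {state: concentration}} mapping.

    Returns the same structure with concentrations converted to floats.
    """
    parsed = {}
    for feature, states in raw.items():
        concentrations = {state: float(value) for state, value in states.items()}
        if not concentrations:
            raise ValueError(f"feature {feature!r} has no states")
        if any(value <= 0.0 for value in concentrations.values()):
            raise ValueError(f"feature {feature!r} has a non-positive concentration")
        parsed[feature] = concentrations
    return parsed


def load_dirichlet(path):
    """Read Dirichlet concentration parameters from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return parse_dirichlet(json.load(f))
```

The config file would then only need to hold the path to this file instead of the full parameter table.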

Configs in experiment (SA) not up to date

Running the SA experiment results in an error. The config.json file for this experiment does not seem to be up to date.
This is the error message:

(base) Administrators-MacBook-Pro:sBayes Anna$ python experiments/south_america/scripts/run_sa_exp.py
Fontconfig warning: ignoring UTF-8: not a valid region tag
Experiment: 2022-02-03_13-44
File location for results: None
Traceback (most recent call last):
  File "experiments/south_america/scripts/run_sa_exp.py", line 10, in <module>
    exp.load_config(config_file='experiments/south_america/config.json')
  File "/Users/anaconda3/lib/python3.8/site-packages/sbayes/experiment_setup.py", line 86, in load_config
    self.verify_config()
  File "/Users/anaconda3/lib/python3.8/site-packages/sbayes/experiment_setup.py", line 172, in verify_config
    raise NameError(f'{k} is not defined {self.config_file}')
NameError: n_areas is not defined /Users/Anna 1/sBayes/experiments/south_america/config.json

Plotting throws an exception if there are too many families in the input

Even with plot_families set to false, if there are too many families in the input, there is an exception in the plotting function, of the following type:

  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1733, in <lambda>
    function=lambda x: plot.posterior_map(file_name=f'posterior_map_{m}_{x}'),
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 898, in posterior_map
    self.color_families(locations_map_crs, cfg_graphic, cfg_legend, ax)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 607, in color_families
    family_color = cfg_graphic['families']['color'][i]

It does not seem to be possible to work around this, even by providing families > color with a massive array.

Map plotting runs into ValueError

Hi, I'm a student researcher at UCLA. I downloaded the sBayes 1.1 source code, which contains an experiment folder with Balkan data and scripts. I currently have an environment with Python 3.8 and sbayes 1.0. After changing the file paths in the scripts and config files, I was able to run run_balkan_exp.py with no issue, but plot_balkan_exp.py only outputs graphs for weights, probability grids, and DIC. No map was produced by the script. After some digging in the code, I realized that this is due to a ValueError in the function posterior_map, which calls the function compute_alpha_shapes. On line 485 of plot.py, the error is raised during unpacking. Could you please let me know if there's a way I could fix this issue? Thank you so much!

Visualization: "Contact Areas" legend cuts off map polygons

When the contact areas legend is enabled, the side of the screen on which it is rendered has the underlying polygons for the map (from top to bottom of the window) obliterated by the legend.

Better name for "cost_based" geoprior

Currently one of the enums for types is called COST_BASED, but Gereon and I think it would be better called "MST_BASED", because there are many conceivable costs but this one is based on the minimum spanning tree.
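To illustrate why the more specific name may be clearer, here is a sketch of the kind of quantity an MST-based cost refers to: the total edge length of the minimum spanning tree connecting a set of locations. It uses Prim's algorithm on planar Euclidean distances; the function name `mst_length` is hypothetical, and the actual sBayes geo-prior may compute its cost differently.

```python
import math


def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree of 2D points.

    Uses Prim's algorithm in O(n^2), which is adequate for small point sets.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    # best[i] is the cheapest known edge connecting point i to the tree.
    best = [math.inf] * n
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Add the point outside the tree that is cheapest to connect.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Update the connection costs of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.hypot(points[u][0] - points[v][0],
                               points[u][1] - points[v][1])
                if d < best[v]:
                    best[v] = d
    return total
```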

deprecation warning related to pip

During installation I ran into this deprecation warning:
DEPRECATION: sbayes is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at pypa/pip#8559

Scale Counts per feature

We discussed the possibility of having scale counts that differ per feature: i.e., a scale count of 10 may be appropriate for a feature with 2 states, but a higher scale count may be appropriate for a feature with more states.

Model logging should be suppressed when sampling from the prior

When sampling from the prior, there is an error because the obervation_lhs is None. But this is expected when sampling from the prior. So model logs should not be made when sampling from the prior.

Algorithm: inheritance prior that applies to all language families

There should be a way to give the same inheritance prior to all language families. If one has a large number of language families, it is quite tedious to enter them all separately. There should be a way to set the same prior for all of them, and it could be a uniform prior or a Dirichlet (e.g., a cornuto Dirichlet).
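A sketch of how a shared default could be expanded before the priors are used, assuming the prior settings for each family are held in a dictionary keyed by family name. The function name, the shape of the prior entries, and the family names in the usage note are all hypothetical.

```python
def expand_family_priors(families, default_prior, overrides=None):
    """Assign `default_prior` to every family, except where overridden.

    Each family receives its own copy, so later edits to one family's
    prior do not leak into the others.
    """
    overrides = overrides or {}
    return {family: dict(overrides.get(family, default_prior))
            for family in families}
```

For example, `expand_family_priors(["Arawak", "Tupi"], {"type": "uniform"})` yields a uniform prior entry for both families, and passing `overrides={"Tupi": {"type": "dirichlet"}}` replaces only the second one.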

Visualization: Default colors should be an expandable, generated list

Currently the default colors for language families are a hard-coded array. If there are more families than colors in this array, and the user does not provide their own array in the config file, the visualization program crashes with an out-of-bounds error. The family colors should instead be generated automatically by a color picker, so that this does not happen.
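A minimal sketch of a generated palette, assuming evenly spaced hues are distinct enough for this purpose. The function name `generate_colors` and its defaults are hypothetical; it relies only on the standard-library `colorsys` module.

```python
import colorsys


def generate_colors(n, saturation=0.65, value=0.9):
    """Return n hex colors spaced evenly around the hue wheel."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors
```

Because the list is built for whatever `n` is requested, indexing it by family (or area) number cannot run out of bounds. For very large `n`, neighboring hues become hard to tell apart, so varying saturation or value as well might be needed.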

Universal prior mean as basis for family prior in sbayes

We want to be able to derive family priors based on the universal prior plus a cornuto Dirichlet. But this will have to be done in the MCMC. This is to accommodate cases where we cannot provide a family prior (because the family is entirely within the area of concern) but still believe that there should be a family prior derived in some way from the universal prior.

Balkans experiment setup code not available in repo

Hi, thanks for this excellent package. I'm attempting to use it to find contact areas in languages which are all part of a single family with features that may or may not be inherited from a) the ancestral protolanguage and b) lower level protolanguages. I see in the supplementary material that --

we could also model relatedness on a different than the family level. In particular, we can incorporate closer relatedness by defining the likelihood of inheritance based on clades [...] We applied this idea in the case study on the Balkans, where we modelled a separate likelihood for the Slavic, Romance, Albanian, and Greek languages.

I would like to apply this, but the Balkans case study isn't in the experiments folder in the repo, so it's not clear how to do it. Would you simply include the clade in the "family" column in the input data? If this is the case, should I also try to determine features probably due to common inheritance at the family level and remove them from the data?

If you'd prefer to share the code for the Balkans example privately, you could email me, isaac underscore stead at eva dot mpg dot de

Cheers

Isaac

Allow dirichlet to be read from prior

Currently the code for reading Dirichlet priors from file is commented out; this needs to be restored.

Visualization: Pie charts should be left-aligned with text

The pie plots should be to the left of the text in the output pdf. Because the text is left-aligned in a table cell, it is difficult to read with the pie plots to their right (where they are closer to the text of the next cell).

Default should permit more than 5 areas

When trying to plot a map of K=6, I get the following error:

all_labels = self.visualize_areas(locations_map_crs, extent, cfg_content, cfg_graphic, cfg_legend, ax)

File "/Users/david/Documents/sbayes/sbayes/plot.py", line 564, in visualize_areas
current_color = cfg_graphic['areas']['color'][i]
IndexError: tuple index out of range

It appears that the default area colors are capped at 5. This should be expanded, as with families, so that the colors are generated dynamically.

bbox is still being called in add_background_map

When running visualization the following error is encountered:

Traceback (most recent call last):
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/david/opt/miniconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1768, in <module>
    main()
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1727, in main
    plot_map(plot, m)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 1755, in plot_map
    plot.posterior_map(file_name='posterior_map_' + m)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 941, in posterior_map
    self.add_overview_map(locations_map_crs, extent, cfg_geo, cfg_graphic, cfg_legend, ax)
  File "/Users/david/Documents/sbayes/sbayes/plot.py", line 716, in add_overview_map
    self.add_background_map(bbox, cfg_geo, cfg_graphic, axins)
TypeError: add_background_map() takes 3 positional arguments but 4 were given

"bbox" must be removed from that call.

test idea for visualization

Could we have an automatic check for the visualization with a given output file, so we make sure it is always running? Sorry if you already have this, but I just thought about it.

Dirichlet Distribution on Features for Family Prior

Nico has devised a method for allowing family priors to be given with a skewed dirichlet that is unique per feature. The skew is toward persistence of one state or another of the feature, deformed by the universal distribution.

Probability distribution of isolates

Currently, isolates lack a family entirely, so the weights for the universal and areal distributions are renormalized, effectively boosting those likelihoods. From the discussion on 12 May, we thought it might be better (for isolates and for languages that are the only representative of their family) to reassign their family weight entirely to the universal distribution. Otherwise, isolates will be slightly more likely to be in an area, which is perhaps not motivated.
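The difference between the two treatments can be sketched numerically, assuming each language has three mixture weights (universal, areal, family) that sum to 1. The function name and the strategy labels are hypothetical.

```python
def isolate_weights(universal, areal, family, strategy="renormalize"):
    """Return (universal, areal) weights for a language without a family.

    "renormalize": drop the family weight and rescale the other two.
    "reassign_to_universal": add the family weight to the universal weight.
    """
    if strategy == "renormalize":
        total = universal + areal
        return universal / total, areal / total
    if strategy == "reassign_to_universal":
        return universal + family, areal
    raise ValueError(f"unknown strategy: {strategy}")
```

With weights of 0.5, 0.25, and 0.25, renormalizing raises the areal weight from 0.25 to one third, whereas reassigning the family weight to the universal distribution leaves the areal weight at 0.25.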

Plotting should always generate probability grids, or be a flag in the config file

As currently implemented, probability grids have to be requested from within Python code, but they should probably be controlled from the config file with a flag for "generate them all" or "generate none." Then you can do more in Python if you prefer.

Visualization: Polygons wrapping around the dateline

There was an issue in an older version of the visualization software that caused map polygons to deform at the date line. We are unsure if this is currently recurring: This issue is to keep track of the known problem in case it currently happens.