TreeCorr

Home Page: http://rmjarvis.github.io/TreeCorr/

Introduction


TreeCorr is a package for efficiently computing 2-point and 3-point correlation functions.

  • The code is hosted at https://github.com/rmjarvis/TreeCorr
  • It can compute correlations of regular number counts, weak lensing shears, or scalar quantities such as convergence or CMB temperature fluctuations.
  • 2-point correlations may be auto-correlations or cross-correlations. This includes shear-shear, count-shear, count-count, kappa-kappa, etc. (Any combination of shear, kappa, and counts.)
  • 3-point correlations currently can only be auto-correlations. This includes shear-shear-shear, count-count-count, and kappa-kappa-kappa. The cross varieties are planned to be added in the near future.
  • Both 2- and 3-point functions can be done with the correct curved-sky calculation using RA, Dec coordinates, on a Euclidean tangent plane, or in 3D using either (RA,Dec,r) or (x,y,z) positions.
  • The front end is in Python, which can be used as a Python module or as a standalone executable using configuration files. (The executable is corr2 for 2-point and corr3 for 3-point.)
  • The actual computation of the correlation functions is done in C++ using ball trees (similar to kd trees), which make the calculation extremely efficient.
  • When available, OpenMP is used to run in parallel on multi-core machines.
  • Approximate running time for 2-point shear-shear is ~30 sec * (N/10^6) / core for a bin size b=0.1 in log(r). It scales as b^(-2). This is the slowest of the various kinds of 2-point correlations, so others will be a bit faster, but with the same scaling with N and b.
  • The running time for 3-point functions is highly variable, depending on the range of triangle geometries you are calculating. They are significantly slower than the 2-point functions, but many orders of magnitude faster than brute-force algorithms.
  • If you use TreeCorr in published research, please reference: Jarvis, Bernstein, & Jain, 2004, MNRAS, 352, 338. (I'm working on a new paper about TreeCorr, including some of the improvements I've made since then, but this will suffice as a reference for now.)
  • If you use the three-point multipole functionality of TreeCorr, please also reference Porth et al, 2023, arXiv:2309.08601
  • Record on the Astrophysics Source Code Library: http://ascl.net/1508.007
  • Developed by Mike Jarvis. Feel free to contact me with questions or comments at mikejarvis17 at gmail. Or post an issue (see below) if you have any problems with the code.

The code is licensed under a FreeBSD license. Essentially, you can use the code in any way you want, but if you distribute it, you need to include the file TreeCorr_LICENSE with the distribution. See that file for details.

Installation

The easiest ways to install TreeCorr are either with pip:

pip install treecorr

or with conda:

conda install -c conda-forge treecorr

If you have previously installed TreeCorr, and want to upgrade to a new released version, you should do:

pip install treecorr --upgrade

or:

conda update -c conda-forge treecorr

Depending on the write permissions of the python distribution for your specific system, you might need to use one of the following variants for pip installation:

sudo pip install treecorr
pip install treecorr --user

The latter installs the Python module into ~/.local/lib/python3.X/site-packages, which is normally already in your PYTHONPATH, but it puts the executables corr2 and corr3 into ~/.local/bin, which is probably not in your PATH. To use these scripts, you should add this directory to your PATH. If you would rather install into a different prefix than ~/.local, you can use:

pip install treecorr --install-option="--prefix=PREFIX"

This would install the executables into PREFIX/bin and the Python module into PREFIX/lib/python3.X/site-packages.

If you would rather download the tarball and install TreeCorr yourself, that is also relatively straightforward:

1. Download TreeCorr

You can download the latest tarball from:

https://github.com/rmjarvis/TreeCorr/releases/

Or you can clone the repository using either of the following:

git clone [email protected]:rmjarvis/TreeCorr.git
git clone https://github.com/rmjarvis/TreeCorr.git

which will start out in the current stable release branch.

Either way, cd into the TreeCorr directory.

2. Install dependencies

All required dependencies should be installed automatically for you by pip or conda, so you should not need to worry about these. But if you are interested, the dependencies are:

  • numpy
  • pyyaml
  • LSSTDESC.Coord
  • pybind11

They can all be installed at once by running:

pip install -r requirements.txt

or:

conda install -c conda-forge treecorr --only-deps

Note

Several additional modules are not required for basic TreeCorr operations, but are potentially useful.

  • fitsio is required for reading FITS catalogs or writing to FITS output files.
  • pandas will significantly speed up reading from ASCII catalogs.
  • pandas and pyarrow are required for reading Parquet files.
  • h5py is required for reading HDF5 catalogs.
  • mpi4py is required for running TreeCorr across multiple machines using MPI.

These are all pip installable:

pip install fitsio
pip install pandas
pip install pyarrow
pip install h5py
pip install mpi4py

But they are not installed with TreeCorr automatically.

Also, beware that many HPC machines (e.g. NERSC) have special instructions for installing mpi4py, so the above command may not work on such systems. If you have trouble, look for specialized instructions about how to install mpi4py properly on your system.

3. Install

You can then install TreeCorr from the local distribution. Typically this would be the command:

pip install .

If you don't have write permission in your python distribution, you might need to use:

pip install . --user

In addition to installing the Python module treecorr, this will install the executables corr2 and corr3 in a bin folder somewhere on your system. Look for a line like:

Installing corr2 script to /anaconda3/bin

or similar in the output to see where the scripts are installed. If the directory is not in your path, you will also get a warning message at the end letting you know which directory you should add to your path if you want to run these scripts.

4. Run Tests (optional)

If you want to run the unit tests, you can do the following:

pip install -r test_requirements.txt
cd tests
pytest

Two-point Correlations

This software is able to compute a variety of two-point correlations:

NN

The normal two-point correlation function of number counts (typically galaxy counts).

GG

Two-point shear-shear correlation function.

KK

Nominally the two-point kappa-kappa correlation function, although any scalar quantity can be used as "kappa". In lensing, kappa is the convergence, but this could be used for temperature, size, etc.

NG

Cross-correlation of counts with shear. This is what is often called galaxy-galaxy lensing.

NK

Cross-correlation of counts with kappa. Again, "kappa" here can be any scalar quantity.

KG

Cross-correlation of convergence with shear. Like the NG calculation, but weighting the pairs by the kappa values of the foreground points.

There are also additional combinations involving complex fields with spin other than 2 (shear is a spin-2 field). See Two-point Correlation Functions for more details.
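
As a concrete illustration of one of the cross-correlation cases, a count-shear (NG) measurement might look like the following sketch; the file and column names are placeholders, and the binning values are arbitrary:

>>> import treecorr
>>> lens_cat = treecorr.Catalog('lenses.fits', ra_col='RA', dec_col='DEC',
...                             ra_units='degrees', dec_units='degrees')
>>> source_cat = treecorr.Catalog('sources.fits', ra_col='RA', dec_col='DEC',
...                               ra_units='degrees', dec_units='degrees',
...                               g1_col='GAMMA1', g2_col='GAMMA2')
>>> ng = treecorr.NGCorrelation(min_sep=1., max_sep=100., bin_size=0.1,
...                             sep_units='arcmin')
>>> ng.process(lens_cat, source_cat)  # cross-correlation: counts x shear
>>> gamma_t = ng.xi                   # mean tangential shear around the lenses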

Three-point Correlations

This software is not yet able to compute three-point cross-correlations, so the only available three-point correlations are:

NNN

Three-point correlation function of number counts.

GGG

Three-point shear correlation function. We use the "natural components" called Gamma, described by Schneider & Lombardi (2003) (Astron.Astrophys. 397, 809) using the triangle centroid as the reference point.

KKK

Three-point kappa correlation function. Again, "kappa" here can be any scalar quantity.

See Three-point Correlation Functions for more details.
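
For instance, a GGG auto-correlation follows the same pattern as the two-point classes; this is a minimal sketch with placeholder file and column names:

>>> import treecorr
>>> cat = treecorr.Catalog('cat.fits', ra_col='RA', dec_col='DEC',
...                        ra_units='degrees', dec_units='degrees',
...                        g1_col='GAMMA1', g2_col='GAMMA2')
>>> ggg = treecorr.GGGCorrelation(min_sep=1., max_sep=100., nbins=20,
...                               sep_units='arcmin')
>>> ggg.process(cat)      # auto-correlation only; cf. the note above
>>> ggg.write('ggg.out')  # write the natural components Gamma to a file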

Running corr2 and corr3

The executables corr2 and corr3 each take one required command-line argument, which is the name of a configuration file:

corr2 config_file
corr3 config_file

A sample configuration file for corr2 is provided, called sample.params. See Configuration Parameters for the complete documentation about the allowed parameters.
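
For orientation, a minimal 2-point configuration in the same key = value format as sample.params might look like the following sketch; the file names, column numbers, and binning values are placeholders:

# Input catalog (for ASCII files, columns are given by number; by name for FITS)
file_name = file1.dat
ra_col = 1
dec_col = 2
g1_col = 3
g2_col = 4
ra_units = deg
dec_units = deg

# Binning in separation
min_sep = 1.
max_sep = 100.
bin_size = 0.1
sep_units = arcmin

# Output for the shear-shear (GG) correlation
gg_file_name = file1.out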

You can also specify parameters on the command line after the name of the configuration file. e.g.:

corr2 config_file file_name=file1.dat gg_file_name=file1.out
corr2 config_file file_name=file2.dat gg_file_name=file2.out
...

This can be useful when running the program from a script for lots of input files.

See Using configuration files for more details.

Using the Python module

The typical usage in Python is in four stages:

  1. Define one or more Catalogs with the input data to be correlated.
  2. Define the correlation function that you want to perform on those data.
  3. Run the correlation by calling process.
  4. Maybe write the results to a file or use them in some way.

For instance, computing a shear-shear correlation from an input catalog stored in a FITS file would look something like the following:

>>> import treecorr
>>> cat = treecorr.Catalog('cat.fits', ra_col='RA', dec_col='DEC',
...                        ra_units='degrees', dec_units='degrees',
...                        g1_col='GAMMA1', g2_col='GAMMA2')
>>> gg = treecorr.GGCorrelation(min_sep=1., max_sep=100., bin_size=0.1,
...                             sep_units='arcmin')
>>> gg.process(cat)
>>> xip = gg.xip  # The xi_plus correlation function
>>> xim = gg.xim  # The xi_minus correlation function
>>> gg.write('gg.out')  # Write results to a file

For more details, see our slightly longer Getting Started Guide.

Or for a more involved worked example, see our Jupyter notebook tutorial.

And for the complete details about all aspects of the code, see the Sphinx-generated documentation.

Reporting bugs

If you find a bug running the code, please report it at:

https://github.com/rmjarvis/TreeCorr/issues

Click "New Issue", which will open up a form for you to fill in with the details of the problem you are having.

Requesting features

If you would like to request a new feature, do the same thing. Open a new issue and fill in the details of the feature you would like added to TreeCorr. Or if there is already an issue for your desired feature, please add to the discussion, describing your use case. The more people who say they want a feature, the more likely I am to get around to it sooner than later.

treecorr's People

Contributors

arunkannawadi, beckermr, gogrean, joezuntz, pfleget, rmjarvis


treecorr's Issues

feature request: Max/min line of sight separation for summing pairs in new rperp metric

Adding an option to restrict the line-of-sight separation of pairs summed using the new rperp metric. For (rp, Pi) coordinates, this amounts to an integration limit in the Pi direction. In practice, it just excludes pairs where Delta Pi > los_max. This is useful in both limits: a maximum los separation (for IA) and a minimum los separation (for gg lensing).
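
For readers arriving at this issue later: recent TreeCorr releases appear to expose this via the min_rpar and max_rpar parameters together with the Rperp metric. Treat the snippet below as a hedged sketch against that newer interface, with placeholder values:

import treecorr

# min_rpar/max_rpar bound the signed line-of-sight separation Pi
# when using the Rperp metric.
dd = treecorr.NNCorrelation(min_sep=0.1, max_sep=50., nbins=20,
                            metric='Rperp',
                            min_rpar=-100., max_rpar=100.)  # i.e. |Pi| < 100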

Rlens geometry problem

I have forked TreeCorr to try to implement Rlens myself, which I have so far done by copying the Rperp code, the only differences being how DistSq() is defined and how we handle TooSmallDist() and TooLargeDist(). However, I am (embarrassingly) having a very hard time with the geometry of finding Rlens with which to write DistSq().

Knowing that we want to find an expression for Rlens^2 only in terms of r1, r2 and r (i.e. distance to position1, 2 and their separation) and NOT in terms of any trig functions (since those are computationally slow) it boils down to solving the following geometry problem:
Rlens_triangle.pdf

Rlens is labeled L, and r1, r2 and r are labeled a, b, and c respectively. The solution that Erika (Eduardo's other grad student) and I arrived at was the following (Mathematica-assisted):

(L^2 + a^2) * b^2 = (a^2 + Sqrt(c^2*a^2 + c^2*L^2 - a^2*L^2))^2.

Unfortunately, of the four solutions to this, none are real (for example, L = +-ia; the other two are similar).

Perhaps you have some insight into this, and have done the geometry already. I am worried that if we write DistSq() for Rlens with trig functions involved then TreeCorr will get heavily bogged down. Let me know what you think!
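
One trig-free route, assuming L is the perpendicular distance from point 1 to the line of sight toward point 2 (the reading consistent with the Rlens request elsewhere on this page): take the angle theta at the observer between the two lines of sight, get cos(theta) algebraically from the law of cosines, and use sin^2 = 1 - cos^2. No trig function ever needs to be evaluated:

cos(theta) = (a^2 + b^2 - c^2) / (2*a*b)            [law of cosines]
L^2 = a^2 * sin^2(theta)
    = a^2 * [1 - ((a^2 + b^2 - c^2) / (2*a*b))^2]

This involves only squared side lengths, multiplications, and one division.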

weights in NN correlations

Is the weight column okay to use with NN correlations? I'm setting a bunch of the weights to zero, and one thing I've noticed is that I get very different results if I remove all the zero weights. Maybe the weights are being ignored?
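
A quick way to test this, as a hedged sketch on synthetic data: compare the binned pair weights from a catalog that keeps the zero-weight rows against a catalog with those rows removed. If weights are honored, the two should agree.

import numpy as np
import treecorr

rng = np.random.default_rng(0)
ra = rng.uniform(0., 10., 2000)              # degrees, synthetic positions
dec = rng.uniform(-5., 5., 2000)
w = (rng.random(2000) > 0.5).astype(float)   # about half the weights are zero

cat_all = treecorr.Catalog(ra=ra, dec=dec, w=w,
                           ra_units='deg', dec_units='deg')
keep = w > 0
cat_cut = treecorr.Catalog(ra=ra[keep], dec=dec[keep], w=w[keep],
                           ra_units='deg', dec_units='deg')

dd1 = treecorr.NNCorrelation(min_sep=0.1, max_sep=2., nbins=10, sep_units='deg')
dd1.process(cat_all)
dd2 = treecorr.NNCorrelation(min_sep=0.1, max_sep=2., nbins=10, sep_units='deg')
dd2.process(cat_cut)
print(np.allclose(dd1.weight, dd2.weight))   # expect True if weights work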

feature request: enable the use of two sets of weights in computations

This request is for the ability to use two sets of weights in the computations. The first set would be used to do all tree building, traversal and pairings of cells. The second set would be used just to accumulate the quantities in each cell that contribute to the 2pt function.

can't find std::cerr symbol?

I'm having some trouble with TreeCorr on my laptop. I've poked around a bit but I'm still not sure what the problem is.

Here is the message I get when I try to run corr2:

Traceback (most recent call last):
  File "/Users/daniel/anaconda/bin/corr2", line 17, in <module>
    import treecorr
  File "/Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/__init__.py", line 18, in <module>
    from .config import read_config, set_omp_threads
  File "/Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/config.py", line 344, in <module>
    _treecorr = numpy.ctypeslib.load_library('_treecorr',os.path.dirname(__file__))
  File "/Users/daniel/anaconda/lib/python2.7/site-packages/numpy/ctypeslib.py", line 123, in load_library
    return ctypes.cdll[libpath]
  File "/Users/daniel/anaconda/lib/python2.7/ctypes/__init__.py", line 440, in __getitem__
    return getattr(self, name)
  File "/Users/daniel/anaconda/lib/python2.7/ctypes/__init__.py", line 435, in __getattr__
    dll = self._dlltype(name)
  File "/Users/daniel/anaconda/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/_treecorr.so, 6): Symbol not found: __ZSt4cerr
  Referenced from: /Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/_treecorr.so
  Expected in: dynamic lookup

demangled symbol name:

$ c++filt __ZSt4cerr
std::cerr

Shared library report:

$ otool -L /Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/_treecorr.so
/Users/daniel/anaconda/lib/python2.7/site-packages/treecorr/_treecorr.so:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1197.1.1)
    /usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 2577.0.0)

I'm using anaconda python on OS X 10.9.5 and I used pip to install TreeCorr.

Better detection of when OpenMP is available

@reikonakajima reported that one of her linux systems has cc = gcc, but the code was unable to detect it as such, and so it doesn't use OpenMP.

The output of cc --version is

cc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The problem is that this doesn't report itself as being gcc anywhere.

So I'll need to do something more clever to figure out that this compiler does support the -fopenmp flag. Probably just need to try it out and see if it works.

Move multiprocessing from OpenMP to python multiprocessing

Currently, TreeCorr uses OpenMP if available both for building the tree and for doing the correlation functions in parallel. (The latter is the much more important one, btw.) It would probably be worth moving that into python, and use the python multiprocessing package instead.

Improvements to the v3.1 documentation

This issue will cover any errors or omissions in the docs for version 3.1.

So far I have noticed the following:

  • The attribute logr of the various correlation classes is the natural log of r in units of sep_units. This isn't obvious from the docs
  • The correlation classes don't say what kwargs are expected if config is not given.
  • Neither does Catalog

If you notice any other items that need work, please add a comment below.

Rlens metric

It would be fabulous to have an Rlens metric where Rlens(x1,x2) = D1*theta where D1 = angular diameter distance to point x1, and theta = angular separation between x1 and x2.

This metric is ideal for stacked-lensing analysis of a given lens population.

Thanks!

Change default bin_slop if bin_size > 0.1

I generally recommend that people use a bin_size of around 0.1 or smaller. However, on a couple occasions recently I have seen people use fairly large bin sizes (both times implicitly defined using a range and number of bins) and finding results that weren't super accurate.

The algorithmic error from our approximations in the TreeCode algorithm is proportional to the difference in value between neighboring bins. (Maybe even the second derivative; I haven't really calculated a formula for the error.) And as the bin size goes up, these differences tend to increase, leading to inaccuracies in the estimated correlation function.

So my idea is to change the default bin_slop if bin_size > 0.1 to be 0.1 / bin_size. That way people would not inadvertently be adding this big algorithmic error if they decide to use fat bins. You could still set bin_slop = 1 if you want to, but the default would be to use an effective bin_size of 0.1 for the purpose of the tree code to avoid the larger algorithmic error.

Installation hangs

I attempted to install treecorr on a new anaconda installation that I have been using, and the treecorr installation hung at the line:
Running setup.py bdist_wheel for TreeCorr ... \

The installation was hung here for about a day.

AttributeError raised when performing NN cross correlation with multiple datasets

I received the following error when attempting to run corr2:
AttributeError: Cannot provide both file_name2 and rand_file_list2.

The following is a snippet from the configuration file I was using

file_name = dataset1.dat
file_name2 = dataset2.dat
rand_file_list = random_1_list.txt
rand_file_list2 = random_2_list.txt

where rand_file_list and rand_file_list2 contain a list of 10 files each.

I was able to (apparently!) fix the problem by changing

cat2 = treecorr.read_catalogs(config, 'file_name2', 'rand_file_list2', 1, logger)

to

cat2 = treecorr.read_catalogs(config, 'file_name2', 'file_list2', 1, logger)

mean(radius) for correlation function bins

For 2pt correlations, the separation value currently returned is meanlogr, i.e. the mean of the log separation of the pairs in the bin. It would be useful to have the actual (weighted) mean separation, since I believe this is the separation at which you want to compute the theoretical signal you're comparing to. For wide bins the difference can be significant.

Feature request: get indices of nearby points

Say we create a tree, TREE, from two arrays, RA and DEC. I think it would be useful to have a method "query_point" that returns the indices of points in the RA and DEC arrays that are within a given distance from the queried point. So, for instance, one could do:

RA0 = 90.
DEC0 = 0.
radius = 10.
indices = TREE.query_point(RA0, DEC0, radius)
RA_nearby, DEC_nearby = RA[indices], DEC[indices]

This is exactly analogous to the "query_ball_point" method in scipy.spatial.KDTree: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.query_ball_point.html#scipy.spatial.KDTree.query_ball_point.

Thanks!

feature request: halo ellipticity estimators

Similar to running an NG (tangential shear) correlation, but the user passes in an extra parameter, call it phi, for each object in the lens catalog. TreeCorr returns the result of two optimally weighted estimators of the elliptical lensing shear, and two corresponding cross-components. All four results are arrays of length N, where N is the number of bins in angle (or perhaps projected scale in Mpc/h).

This extra input parameter phi is used to rotate the shear field for each lens into a common Cartesian coordinate system (gamma_1 and gamma_2 instead of gamma_+ and gamma_x). Having performed this rotation, the desired estimators to calculate are
gamma_4theta = Sum_i (w_i * gamma_1i * cos(4theta) + w_i * gamma_2i * sin(4theta)) / Sum_i (w_i)
gamma_const = Sum_i (w_i * gamma_1i) / Sum_i (w_i)

gamma_4theta_x = Sum_i (w_i * gamma_1i * cos(4theta) - w_i * gamma_2i * sin(4theta)) / Sum_i (w_i)
gamma_const_x = Sum_i (w_i * gamma_2i) / Sum_i (w_i)

where the index i runs over all lens-source pairs, and w_i is the usual weight for gamma (for example, sum of squares of measurement error and shape noise).

Errors in w(theta) at large separations

Erika Wagoner and Oliver Friedrich both recently reported (by emails) that TreeCorr can give fairly inaccurate results for w(theta) at large scales using the default parameters. In both cases, they found that lowering bin_slop below the default helped improve matters, so this is a case of numerical inaccuracies due to the tree algorithm.

cf. this plot from Oliver for simulated data:
[plot omitted: w(theta) on simulated data, comparing two bin_slop values]
The blue points use bin_slop=0.6 (which is already a bit smaller than the default, since the bin_size is about 0.127, so the default would be 0.1/0.127 = 0.79), and the red points use bin_slop=0.3. Clearly there are big problems with the blue points, but less so with the red.

This issue is to make the discussion about this problem more public than the existing email threads and to discuss potential solutions to it. Pinging interested parties: @wagoner47 @erozo @danielgruen

verbose = 0 still gives output?

It appears that verbose = 0 still produces output to stdout or stderr. Is this the correct behavior? Is there any way to make the code completely silent?

Feature Request: Overloading -= and multiplying NNCorrelations

At present, the following code works:

NN1 = NNCorrelation(config)
NN1 += NN2

where NN2 is some other NNCorrelation object. This works because the "+=" operator (__iadd__) has been overloaded starting on line 256 of nncorrelation.py. However, when creating jackknifes this is not all we need.

To do jackknifing intelligently, we need to be able to have an overloaded "-=" and be able to multiply correlations. Essentially, we also need:

NN1 -= NN2
NN1 += 0.5 * NN2

This amounts to implementing functions very similar to __iadd__. If you want, I can do this myself and issue a pull request.
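
A hedged sketch of the kind of operator meant here, mirroring the pattern of __iadd__. The accumulator attributes used (weight, npairs, tot) are assumptions about the internals based on that method, not a documented API, and both operands are assumed to be un-finalized:

import treecorr

class NNWithSub(treecorr.NNCorrelation):
    # Hypothetical subclass; the attribute names are assumptions (see above).
    def __isub__(self, other):
        self.weight -= other.weight    # per-bin summed pair weights
        self.npairs -= other.npairs    # per-bin pair counts
        self.tot -= other.tot          # overall total used for normalization
        return self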

Support linear bin spacing

Currently, the binning in separation is done in log(r), which is usually appropriate given the kinds of applications we usually have in astronomy. However, some people have asked about linear spacing instead. I think it would require a bit of thinking about how this might change some of the underlying correlation code, since the log spacing is kind of hard-coded into that. But it's worth looking into.

If this is important to you, please chime in below and describe your use case. Otherwise, this will likely fall off my radar...

Mixing of file_list and file_name?

I've tried running w(theta) using the combined options:

file_list = [a file containing a list of filenames of galaxy ra/dec data]
rand_file_name = [a FITS file containing ra/dec of randoms, consistent with the contents of file_list]

This ran in the pre-V3.0.0 version, but seemed to have stopped working on the most recent one (V3.0.2).
Was this intended? I get an error message saying "Either file_name or file_list is required".

"plus" and "cross" shear correlation functions?

Are shear correlation functions for the plus and cross polarizations stored somewhere in the treecorr.GGCorrelation object? I could compute either one by setting all of the g1 or g2 zero, but I was wondering if I was overlooking something.

Thanks,
CS

Maximum separation

Hello Mike

Hope you are well, and thank you for making the TreeCorr package publicly available.

I was wondering whether there is an inbuilt restriction on the maximum value that the parameter max_sep can be(!). This is because I am using an all-sky survey, and so expect that the maximum separation is 180 degrees (i.e. directly-opposite sides of the celestial sphere). However when I use this I find that there is a cut off, in the number of pairs counted, at 128 degrees. Yet there are clearly pairs of objects that satisfy a separation between 128 and 180 degrees. I thought I would test it using a smaller region, defined over the equator from RA=135 deg to RA=225 deg (-20 < Dec/deg <20), setting max_sep to 90 deg. npairs (for the data-points) remains non-zero until 81.4 deg, yet I can pick a pair of objects whose separation is 88.1 deg.

Out of curiosity, what is the step used to store the ra and dec values in the catalogues for TreeCorr? As I had specified units='deg', I was puzzled that printing each of the cat_data.ra[i] and cat_data.dec[i] values showed that they did not match up to the input ra and dec. Instead, each needed to be scaled up by a factor of just over 57, and there seemed to be a larger offset as RA increased...

Hope to hear from you soon.
Many thanks,
Sarah

flip_g1 / flip_g2 doesn't always work?

I'm seeing cases where setting flip_g2 from false to true doesn't change the results.
I call "corr2" in sequence, calculating for shear-shear correlation, with only the flip_g2 line changed in the configuration file.

Details: Results from using file_name for a single pointing don't change. Results from file_list (which contains a list of all 42 of the file_names used) do change. If I use a file_list with a single entry, the results still do not change. If I use a file_name containing data from all 42 files, the results change. (As an independent test, I changed the bin size, and the results changed appropriately in all cases.)

Getting high-level outputs in Python layer

At the moment, it's possible to get the outputs of the correlation function in the Python layer:

xip = gg.xip

as you say in the documentation, and also to write out an array of quantities to a file including, say, the properly compensated estimators. Some of those have extra processing (I'm looking at writeNorm for the NG correlation function, as an example). I'd like to be able to get the outputs from something like the writeNorm function to use in a Python layer without a) having to duplicate your code or b) having to write the file to disk. Is there a way to do that now, or would it be possible to add a way to do that?

(At the moment I'm just doing b), which works fine for our current purposes, but for something where we're computing small correlation functions for many separate data sets--as we might do checking pipeline outputs on small regions--I start to worry about disk I/O.)

Can only run the Python Module in sudo

Hello, I downloaded the code and installed it without errors using sudo (and also without sudo), and when I tried importing it into Python, I got this error:

>>> import treecorr
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda2/lib/python2.7/site-packages/treecorr/__init__.py", line 30, in <module>
    _lib = _ffi.dlopen(lib_file)
  File "/home/user/anaconda2/lib/python2.7/site-packages/cffi/api.py", line 139, in dlopen
    lib, function_cache = _make_ffi_library(self, name, flags)
  File "/home/user/anaconda2/lib/python2.7/site-packages/cffi/api.py", line 769, in _make_ffi_library
    backendlib = _load_backend_lib(backend, libname, flags)
  File "/home/user/anaconda2/lib/python2.7/site-packages/cffi/api.py", line 758, in _load_backend_lib
    return backend.load_library(name, flags)
OSError: cannot load library /home/user/anaconda2/lib/python2.7/site-packages/treecorr/_treecorr.so: /home/user/anaconda2/lib/python2.7/site-packages/treecorr/_treecorr.so: undefined symbol: GOMP_parallel

However, this error doesn't show up when using sudo python, so I don't know what's going on here.

Error in unit tests

I just reinstalled and went to run the nosetests. On one of them I received an IOError. Here is the whole output related to that unit test:
Starting process NN auto-correlations for cat data/nn_list_rand2.dat.
Using 4 threads.
Starting 39 jobs.
.......................................
Done RR calculations.
Writing NN correlations to output/nn_list1.out
file_type assumed to be ASCII from the file name.
Traceback (most recent call last):
  File "/home/tom/anaconda2/bin/corr2", line 4, in <module>
    __import__('pkg_resources').run_script('TreeCorr==3.2.3', 'corr2')
  File "/home/tom/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/__init__.py", line 724, in run_script
  File "/home/tom/anaconda2/lib/python2.7/site-packages/setuptools-19.6.2-py2.7.egg/pkg_resources/__init__.py", line 1649, in run_script
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/corr2", line 105, in <module>
    main()
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/corr2", line 102, in main
    treecorr.corr2(config, logger)
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/treecorr/corr2.py", line 227, in corr2
    dd.write(config['nn_file_name'],rr,dr,rd)
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/treecorr/nncorrelation.py", line 523, in write
    file_name, col_names, columns, prec=prec, file_type=file_type, logger=self.logger)
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/treecorr/util.py", line 55, in gen_write
    gen_write_ascii(file_name, col_names, columns, prec=prec)
  File "/home/tom/anaconda2/lib/python2.7/site-packages/TreeCorr-3.2.3-py2.7-linux-x86_64.egg/treecorr/util.py", line 86, in gen_write_ascii
    numpy.savetxt(file_name, data, fmt=fmt, header=header)
  File "/home/tom/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1096, in savetxt
    fh = open(fname, 'w')
IOError: [Errno 2] No such file or directory: 'output/nn_list1.out'
EReading input file data/nn_perp_data.dat
file_type assumed to be ASCII from the file name.
nobj = 100000
nbins = 5, min,max sep = 20..32.9744, bin_size = 0.1

It looks like it couldn't find the output file. The rest of the unit tests proceeded to run after that; however, it eventually failed due to a seg fault. Here is the text related to that failure:

Starting process NNN cross-correlations
Using 4 threads.
Building NField
Starting 38 jobs.
..Segmentation fault (core dumped)

I ran this twice and it failed both times, and I then ran it again from a new terminal, and it also failed there. Let me know if you need any other information related to this.

A few updates for sample.params in V3.0.2?

The following, I believe, is no longer true (based on what Mike told me) but is still in sample.params:

# You can also have file_name and/or file_name2 be lists of files, in which case, it will
# do all the cross-correlations between the first list and the second.
#file_name = file1a.dat file1b.dat

Possibly can add the following line about weight specification for FITS file input:

# For FITS files, the columns are specified by name, not number:
#ra_col = RA
#dec_col = DEC
#g1_col = GAMMA1
#g2_col = GAMMA2
#w_col = WEIGHT

Feature request: Marked point bootstrap error estimator

This Loh (2008) paper describes an algorithm to estimate bootstrap errors on spatial correlation function estimators that is both faster and more accurate than bootstrap estimators that resample spatial blocks of the data (i.e., in sky regions that are subsets of the survey footprint).

To implement the marked point bootstrap, we need to be able to resample the pair counts in annuli around each galaxy or star (these are the 'marks' for the algorithm). From a cursory inspection of the code, it is not clear if such marks are easily accessible given the way the pair counts are computed.

Still, I think this error estimator would be valuable as an alternative to jackknife.

Support 3D positions

Currently, the two kinds of positions we support are coordinates on a flat sky and on a curved sky. But there are many applications (NN correlations especially) where 3D positions are needed.

Propose including both (x,y,z) and (ra,dec,r) formulations.

Put up real documentation on the wiki

The Guide on the wiki now is a decent primer on using the python module (imho), but I also need to have the full documentation available for people to look at. I want to try out using Sphinx, since I've heard that it is good for python docs, but I don't know much about it. So this issue is to look into that and hopefully get the documentation up on the wiki somewhere.

Figure out how to do 3 point function in spherical coordinates

The 3 point function that I coded up only deals with (x,y) coordinates. I think half of the work getting it to use (ra,dec) is already done by how we build the Cells for spherical coordinates. But there is probably some more thinking to be done about how to compute the Gammas for spherical triangles.

feature request: g1, g2 in addition to gtan, gcross

We had a request for this as part of Stile: one of our users looks at g1 and g2 as diagnostics of possible chip-frame shape measurement problems. I wasn't sure if this was something easy to add to TreeCorr, or if you had any plans to do something with it, so I thought I'd ask!

Allow YAML config files

I believe the current configuration file format is equivalent to the .ini format that is readable in python using ConfigParser. So I could probably replace my manual config parsing stuff with that.

In addition, it would be nice to allow users to use YAML files instead. It's kind of a nicer format and it has native support for vectors, so that may be clearer in some cases than the string -> vector thing we do now.
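
For illustration, the same kind of configuration written as key = value pairs elsewhere on this page would look like this in YAML (a sketch; the file names and values are placeholders):

file_name: file1.dat
gg_file_name: file1.out
ra_col: 1
dec_col: 2
g1_col: 3
g2_col: 4
ra_units: deg
dec_units: deg
min_sep: 1.0
max_sep: 100.0
bin_size: 0.1
sep_units: arcmin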

Library error with Python 3 and Anaconda

After installing v3.3.5 using the Anaconda pip and Python v3.3.4, I get the following error when importing treecorr:

OSError: cannot load library /mnt/hd/gogrean/anaconda3/lib/python3.4/site-packages/treecorr/_treecorr.so: /mnt/hd/gogrean/anaconda3/lib/python3.4/site-packages/treecorr/_treecorr.so: cannot open shared object file: No such file or directory

I could solve the problem by making a soft link for _treecorr.cpython-34m.so as _treecorr.so.

[screenshot omitted: the same error, raised while running one of the tests in Piff]

config File read in as string

Hi, when I try to follow the instructions in the guide to using the Python module of TreeCorr I seem to get a few errors. Specifically, when I have

import treecorr
config = treecorr.read_config("config.yaml")
treecorr.corr2(config)

I get an Attribute Error saying

AttributeError: 'str' object has no attribute 'get'

It seems that Python sees 'config' as a string and I'm not sure what it should be seeing it as. Is there something obvious that I'm missing here? Thanks for any help.

Failed any() check on numpy arrays

This line from catalog.py in the checkForNan() function:
if col is not None and any(numpy.isnan(col)):
fails with the error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
(which is nice and weird) on one of my systems. It's fine on the others.

The fix seems to be replacing any with numpy.any. I've done that locally, but thought it might be useful for others.
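
That is, the corrected check reads:

if col is not None and numpy.any(numpy.isnan(col)):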

Convert corr3 into the new API

I wrote the code to do 3 point correlations, but I haven't converted it over into the new API yet. For now, this would probably just be using the flat-sky coordinates. It's a separate project to convert that calculation to properly account for spherical coordinates.

Direct data input in python

When doing jackknife-type error estimations for large data sets, more than 80% of the time is spent on writing input data files to disk. It would be a very nice feature to allow passing Python arrays as input to the code without writing them out as FITS/ASCII files.
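
For reference, in current versions the Catalog constructor accepts numpy arrays directly (as used in the 3d correlation question below); a minimal sketch:

import numpy as np
import treecorr

ra, dec = np.loadtxt('positions.txt', unpack=True)   # placeholder input source
cat = treecorr.Catalog(ra=ra, dec=dec, ra_units='deg', dec_units='deg')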

compute projected correlation function wp(rp)?

How exactly does the 3d option work, where you specify ra, dec, and r? There seems to be very little documentation. Does it just use the Euclidean distance in 3d? It might be nice to have the option to compute the projected correlation function(s) in the geometry defined by wp(rp).

Backward compatibility lost with config files

Disclaimer: I do not know which version of TC broke this compatibility; I only know that it was broken.

Previously I used configuration files without an extension (e.g. config_file) that contained lines that looked like the following:

file_name = foobar.txt

I went to run some of my old scripts again and TC failed saying that it couldn't recognize the configuration file from the extension. I have moved over to using .yaml files based on the sample configuration file in the repo, but I wanted to raise an issue that backward compatibility was lost with one of the recent updates. This was using TC version 3.3.3.

Increase efficiency of bin_slop < 0.5

I had an idea about how to speed up the code when bin_slop < 0.5.

Currently, the criterion for when to stop traversing down the tree and just use two aggregate cells is that the largest possible separation between any two points is no more than (1+b) * r, where r = the separation between the two mean positions. (By symmetry of the ball nodes, the smallest separation is no less than (1-b) * r.)

In practice, b is usually the bin_size. So this means that the error in where the pairs are dropped is no more than 1 histogram bin. So there is an extra slight smoothing by this approximation. You can reduce this smoothing by setting bin_slop < 1, in which case b = bin_slop * bin_size. The code will then continue to split until the error is somewhat smaller than a full bin size.

But if bin_slop < 0.5, then there are times when we keep splitting even though all the pairs will actually fall in the same bin. If the distance between the mean positions happens to be near the center of a histogram bin, and the maximum deviation from it is about bin_size/2, then all the pairs will fall in the nominal bin. There is no approximation in stopping at this point and using the mean values. But if bin_slop < 0.5, the algorithm will currently keep splitting for a while more.

The effect is most pronounced when doing a brute force calculation with bin_slop = 0. We currently traverse everything down to the leaf nodes, never stopping early. But we don't have to do that. We could check for when the minimum and maximum separation both fall in the same bin, and when they do use the mean values rather than traversing further.

In general, I think the split check could become:

  1. Determine which bin we are considering based on the distance between the mean positions.
  2. Check if the minimum possible separation < left edge of this bin - bin_slop * this bin width
  3. Check if the maximum possible separation > right edge of this bin + bin_slop * this bin width
  4. If either 2 or 3 is true, then split. Else use the mean values.

I think changing to this kind of approach would also facilitate the linear binning (Issue #5). We could set up the bins along with their minimums (taking into account bin_slop) and maximums. Then as we traverse the tree, we could keep track of which range of bins are still in play. When that drops to only 1, we stop traversing.
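
A schematic Python version of that proposed check; the function and variable names are illustrative, not TreeCorr internals:

import bisect

def should_split(d_mean, d_min, d_max, edges, bin_slop):
    """Split two cells only if their pair separations could leak
    outside the bin implied by the mean separation (steps 1-4 above)."""
    i = bisect.bisect_right(edges, d_mean) - 1    # step 1: the nominal bin
    left, right = edges[i], edges[i + 1]
    slop = bin_slop * (right - left)
    too_low = d_min < left - slop                 # step 2
    too_high = d_max > right + slop               # step 3
    return too_low or too_high                    # step 4: else use the means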

Add option to read a Correlation object from a file.

Joseph has a use case that isn't easy to implement with the current TreeCorr functionality. He wants to split up his correlation calculation into multiple jobs run on different machines and then combine the results at the end into a single Correlation (NGCorrelation in his case) object.

There are three missing bits to make this work seamlessly. First, we need a way to write the partially complete Correlation data to disk. This would be handled by Issue #22, writing the Correlation data to a fits binary table. That should be efficient enough and preserves the full precision of the data.

The next thing is to be able to read this file back in to construct a Correlation object from that. Along the way, I'll probably add in an option to construct a Correlation object directly from existing numpy arrays, similar to how Catalog works, since that will be helpful in implementing the read command.

Finally, we need to be able to add up several partially complete Correlation objects and store the results into a single one. My proposal for that is ng.combineFrom(list_of_ng_objects). This would assume that the NGCorrelation objects in list_of_ng_objects are not yet finalized, so the data vectors just add together. Then you would call finalize on the final ng object.

3d correlation question

Hi Mike,
I've used your TreeCorr code to do some 3D 2pt correlation calculation, and the results are coming out strange:
[plot omitted: measured 3D 2pt correlation functions]
You can see that the sharp structure is showing up on the 1 deg^2 patches (the thin colored lines), as well as in the overall correlation function (heavy blue dots). Would you have any suggestions as to what I'm doing wrong? I've been using your 2D 2pt correlation calculation, and the output for those makes sense.

I'm using the following Python commands:

data_cat = treecorr.Catalog(ra=ra, ra_units='deg', dec=dec, dec_units='deg', r=r_comoving)
rand_cat = treecorr.Catalog(ra=rand_ra, dec=rand_dec, ra_units='deg', dec_units='deg', r=rand_r_comoving)
dd = treecorr.NNCorrelation(nn_3d_config_single)

where nn_3d_config_single is a file that specifies min_sep, max_sep, and nbins. I did not specify sep_units, since I assumed it just uses the units for r, as-is.

The TreeCorr version is v3.0.2.

Have the Correlation classes check for the num_threads parameter

Currently, the num_threads parameter is checked at the very beginning of the config file processing, which is fine for the corr2 executable usage, but it means that if you are running things from within python, it will just use the default number of threads (i.e. all of them).

My proposed change is to have any function for which this parameter matters (mostly the ones that calculate the correlation functions, but also the various Field constructions) check for the parameter and call treecorr.set_omp_threads(num_threads) if necessary.

In the meanwhile, the workaround is for the user to call this function manually to set the number of threads to use.
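
Concretely, that workaround is:

import treecorr
treecorr.set_omp_threads(4)   # set before calling process()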

Feature request : Integrated jack-knife error

Have the code internally compute the jackknife error.
Ideally, the code can split the sky into n regions itself, or the user can provide the regions in the input file.

example
ra(float) dec(float) e1(float) e2(float) region(int)
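
For readers on newer versions: TreeCorr later added patch-based covariance estimation that addresses this request; a hedged sketch of that newer interface (file names and values are placeholders):

import treecorr

cat = treecorr.Catalog('cat.fits', ra_col='RA', dec_col='DEC',
                       ra_units='deg', dec_units='deg',
                       g1_col='GAMMA1', g2_col='GAMMA2',
                       npatch=50)               # the code splits the sky itself
gg = treecorr.GGCorrelation(min_sep=1., max_sep=100., bin_size=0.1,
                            sep_units='arcmin', var_method='jackknife')
gg.process(cat)
cov = gg.cov                                    # jackknife covariance estimate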

add ability to carry other quantities through the computation

For instance, in order to properly use the lensfit-style shear calibrations, one must compute the normalization from the same galaxies as used in the 2pt function. One can just do this for the whole catalog at once, but it might be better to be able to do it on a bin-by-bin basis, by specifying an extra column (or maybe a vector, so we can do more than one thing at once) to carry through the computations, which gets summed as a scalar.
