momentoscope / hextof-processor

Code for preprocessing data from the HEXTOF instrument at FLASH, DESY in Hamburg (DE)

Home Page: https://hextof-processor.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: Jupyter Notebook 78.37%, Python 20.94%, C++ 0.50%, Cython 0.12%, Makefile 0.07%
Topics: arpes, photoemission, condensed-matter-physics, solid-state-physics, distributed-processing, dask-dataframes, dask, pes, free-electron-laser, materials-science

hextof-processor's Introduction


hextof-processor

This code is used to analyze data measured at FLASH using the HEXTOF (high energy X-ray time of flight) instrument. The HEXTOF uses a delay line detector (DLD) to measure the position and arrival time of single electron events.

The analysis of the data is based on "clean tables" of single events as dask dataframes. There are two dataframes generated in the data readout process. The main dataframe dd contains all detected electrons and can be binned according to the needs of the experiment. The second dataframe ddMicrobunches contains the FEL pulses and is commonly used for normalization.

The class DldProcessor contains the dask dataframes as well as the methods to perform binning in a parallelized fashion.

The DldFlashDataframeCreatorExpress class subclasses DldProcessor and is used for creating the dataframes from the hdf5 files generated by the DAQ system.
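A typical workflow then looks roughly like the following (a minimal sketch; the import path and the binning calls addBinning/computeBinnedData are assumptions and may differ from the actual API):

from processor.DldFlashDataframeCreatorExpress import DldFlashDataframeCreatorExpress

# create the processor and read a run into the dd / ddMicrobunches dataframes
processor = DldFlashDataframeCreatorExpress()
processor.runNumber = 12345        # hypothetical run number
processor.readData()

# define binning axes and compute the binned array in parallel
processor.addBinning('dldTime', 620, 670, 0.1)   # assumed signature: name, start, stop, step
result = processor.computeBinnedData()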

Installation

In this section we will walk you through all you need to get up and running with the hextof-processor.

For using this package with the old FLASH data structure, please refer to README_DEPR.md.

1. Python

If you don't have Python on your local machine yet, we suggest starting with Anaconda or Miniconda. Details about how to install them can be found here.

2. Install hextof-processor

Download the package by cloning the repository to a local folder:

$ git clone https://github.com/momentoscope/hextof-processor.git

2.1 Virtual environment

Create a clean new environment (we strongly suggest you always do so!).

If you are using conda:

$ conda env create -f environment.yml

Now, to activate your new environment (Windows):

$ conda activate hextof-express

If you are using Linux:

$ source activate hextof-express

2.2 Virtual environment in Jupyter Notebooks

To add the newly created environment to the Jupyter Notebook kernel list, install it as a new kernel:

(hextof-express)$ python -m ipykernel install --user --name=hextof-express

3. Local Setup

3.1 Initialize settings

Finally, you need to initialize your local settings. This can be done by running InitializeSettings.py from the repository folder:

(hextof-env)$ python InitializeSettings.py

This will create a file called SETTINGS.ini in the local repository folder. It is used to store the local settings as well as calibration values (this will change in the future) and other options.

3.2 Setting up local paths

In order to make sure your folders are in the right place, open this file and modify the paths in the [paths] section.

  • data_raw_dir - location where the raw h5 files from FLASH are stored
  • data_h5_dir - storage location for binned hdf5 files
  • data_parquet_dir - where the Apache Parquet files generated from the single-event tables are stored (we suggest using an SSD for this folder, since it greatly improves binning performance)
  • data_results_dir - folder where results (figures and binned arrays) are saved

If you are installing on Maxwell, we suggest setting the following paths:

[paths]
data_raw_dir =     /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/raw/
data_h5_dir =      /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/
data_parquet_dir = /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/parquet/
data_results_dir = /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/*USER_NAME*/binned/

Where YYYY is the current year and xxxxxxxx is the beamtime number.

3.3 Calculating sector_correction list

If you like, you can add the sector_correction list to the settings; it applies a per-sector shift to compensate for any misalignment between the sectors. At the very least, this should include the "bit stealing hack" correction, where the last bits of dldTime are set so that they encode dldSectorId. The list can be generated for you by the calibration.gen_sector_correction function, given the energy shifts you want.
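A minimal sketch of how this could be done (the import path and the exact signature of gen_sector_correction are assumptions):

from processor.utilities import calibration

# hypothetical per-sector energy shifts, one entry per DLD sector
shifts = [0.0, 0.2, -0.1, 0.15, 0.0, 0.1, -0.05, 0.2]

# generate the sector_correction list, including the "bit stealing hack" offsets
sector_correction = calibration.gen_sector_correction(shifts)

The resulting list can then be copied into the sector_correction entry of SETTINGS.ini.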

3.4 Installing the Gaussian-broadened Doniach-Sunjic fit routines

Please refer to XPSdoniachs/README.md for compilation instructions.

4. Test your installation

In order to test your local installation, we have provided a series of tutorial Jupyter Notebooks. You can find all the relevant material in the tutorial folder of the main repository. We suggest starting by testing Express data readout.ipynb.

Documentation

The documentation of the package can be found here.

Examples are available as Jupyter Notebooks. Some example data is provided together with the examples. More compatible data is being collected and will soon be added to online open-access repositories.

Citation and acknowledgments

If you use this software, please consider citing these two papers:

hextof-processor's People

Contributors

balerion, jus80687, michaelheber, realpolitix, steinnymir, yacremann, zain-sohail


hextof-processor's Issues

Reporting of electron fraction in binned volume

It would be nice if the binning function returned, at the end, a report of the fraction of electrons that fell within the binned volume (e.g. 15 Mcts binned out of a total of 20 Mcts in the run, i.e. 75%).
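A minimal sketch of what such a report could look like, assuming the binned result and the total electron count of the run are available as plain numbers:

import numpy as np

def report_binned_fraction(binned, total_electrons):
    # print how many electrons ended up inside the binned volume
    binned_counts = float(np.sum(binned))
    fraction = binned_counts / total_electrons
    print(f"{binned_counts / 1e6:.1f} Mcts of {total_electrons / 1e6:.1f} Mcts "
          f"binned ({fraction:.0%})")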

Improve binning performance

Binning using numpy.histogramdd works quite well but, as Laurenz pointed out, it creates a strong bottleneck when binning with a large number of cores. Since this is the case on Maxwell, where most of the analysis is done, we need to improve it.

One option would be to port the binning methods from MPES, especially the numba-based binning (a generic sketch of that approach is given below).
Another would be to make a third package incorporating the dataframe numerical handling for both environments, but this might be more of a long-term project.
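For reference, a generic numba-jitted 1D histogram (not the MPES implementation) could look like this; each dask partition would be binned with it and the partial histograms summed:

import numba
import numpy as np

@numba.njit
def hist1d(values, lo, hi, nbins):
    # simple fixed-width histogram over one column of a partition
    out = np.zeros(nbins, dtype=np.int64)
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            out[int((v - lo) / width)] += 1
    return out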

acquisition time and bin size normalization

Normalize the output data by the acquisition time (number of macrobunches / 10) and by the bin sizes along all dimensions, to get numbers that are independent of binning and of acquisition time. For a properly calibrated 3D cube this yields rates in Hz/eV/Å^-2, which integrate to rates in Hz within a given eV·Å^-1·Å^-1 volume.
At this step one loses the total count information, which is useful for error-bar calculations. So perhaps error-bar information should be calculated and stored before that step, and then normalized accordingly.
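A minimal sketch of such a normalization, assuming the binned result is a plain numpy array, the FLASH macrobunch repetition rate of 10 Hz, and known bin widths:

import numpy as np

def normalize_to_rates(binned, n_macrobunches, bin_widths):
    # acquisition time approximated as number of macrobunches / 10 (10 Hz)
    acq_time = n_macrobunches / 10.0
    # total bin volume, e.g. eV * Å^-1 * Å^-1 for a 3D cube
    bin_volume = np.prod(bin_widths)
    return binned / (acq_time * bin_volume)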

Basic testing framework

  • Create a testing framework in Pytest or Unittest (whichever works for the situation)
  • Test a core functionality

From what I have read, Pytest is easier to implement, its functionality is similar to Unittest, and it can also run Unittest tests if necessary. But I will try out both and confirm.
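A minimal pytest sketch of what such a core-functionality test could look like (self-contained, not tied to the current code):

# tests/test_binning.py -- run with: pytest tests/
import numpy as np

def test_binning_conserves_counts():
    # trivial sanity check: binning should conserve the number of events
    values = np.random.default_rng(0).uniform(0.0, 10.0, size=1000)
    counts, _ = np.histogram(values, bins=50, range=(0.0, 10.0))
    assert counts.sum() == len(values)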

possible improvement of laziness in createDataframePerElectron

In the electron dataframe creation there is a section that causes kernel crashes for very large datasets. I suspect this is because the whole dataframe is computed instead of being kept lazy:

self.daListResult = dask.compute(*daList)
a = np.concatenate(self.daListResult, axis=1)
da = dask.array.from_array(a.T, chunks=self.CHUNK_SIZE)
cols = self.ddColNames
self.dd = dask.dataframe.from_array(da, columns=cols)

Is there a way to do the same but using map_partition or similar? Is there a reason why this is not kept lazy?
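One possible way to keep this lazy (a sketch, assuming each element of daList can be wrapped in a delayed function that returns one pandas partition with the final columns):

import dask
import dask.dataframe
import pandas as pd

@dask.delayed
def make_partition(chunk, cols):
    # build one pandas partition; nothing is computed until the dataframe is used
    return pd.DataFrame(chunk.T, columns=cols)

cols = self.ddColNames
parts = [make_partition(chunk, cols) for chunk in daList]
meta = pd.DataFrame(columns=cols, dtype='float64')
self.dd = dask.dataframe.from_delayed(parts, meta=meta)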

Make repo independent + new name?

Currently this repo is a fork of Yves' repo. I suspect this means the idea would be to eventually merge the two with a pull request. Still, I think we should consider keeping the two completely separate, so that we keep Yves' repo as a legacy reference.

I am not sure I understand the consequences entirely, so if I am just asking for trouble, please let me know.

In any case, if we decide to go in this direction, it is possible to contact GitHub to let them know we want to make this repo independent, and it should be a straightforward process.

PumpProbeTime calibration

Test and improve the delay-stage to pump-probe-time calibration method. This should preferably be done while keeping backward compatibility with the old stage.

dataframes not reading correctly

When reading a dataframe without resetting the processor object, readData will not read a new run even if processor.runNumber has been updated.

Change from distutils to setuptools

It seems that setuptools is a better choice than distutils, since it is a requirement for PyPI. So switching to it seems like the right thing to do.

Clean up ddMicrobunches dataframe creator

Needs a lot of cleanup; there is some mess with the array creation for the different channels. Also, the dataframe should be generated with the channels selected in the SETTINGS.ini file.

Error reading settings files when package installed via pypi

The DldProcessor Class assumes a root folder of
root_folder = os.path.dirname(os.path.dirname(processor.__file__))

However, if the package is installed into an environment, this leads to searching in an incorrect folder. Perhaps this should be user-customizable in that case, or is there already a way?
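One possible workaround (a sketch, not the current behaviour): let an environment variable override the default location, e.g.

import os
import processor

# fall back to the package location, but let the user override it;
# HEXTOF_SETTINGS_DIR is a hypothetical variable name
default_root = os.path.dirname(os.path.dirname(processor.__file__))
root_folder = os.environ.get('HEXTOF_SETTINGS_DIR', default_root)
settings_file = os.path.join(root_folder, 'SETTINGS.ini')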

Add use of versions

This will make it possible to keep track of which version of the code was used in which notebook, and to see whether you are using the latest version with the new functionality.

Also, the version should be printed automatically on import, for easy visualization in Jupyter notebooks.
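A minimal sketch of how this could be done in the package's __init__.py (the distribution name passed to version() is an assumption):

# processor/__init__.py (sketch)
from importlib.metadata import version, PackageNotFoundError

try:
    __version__ = version("hextof-processor")
except PackageNotFoundError:
    __version__ = "unknown (not installed)"

print(f"hextof-processor {__version__}")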

KeyError: 'monochromatorEnergy'

In the calibrate function there is a useAvgMonochormatorEnergy parameter. Whether it is set to False or True, it raises KeyError: 'monochromatorEnergy' either way. I saw that this is in the TODO, but I couldn't figure out a way to bypass it in the function.

introduce calibration dictionary

A great improvement would be to define a dictionary with all calibration and correction parameters, which would be passed to the different calibration methods. This would allow calibration/correction parameters to be transferred more easily between datasets, and would make it possible to store them together with the data, increasing the reusability and transparency of the data treatment.

The idea would be to have, at the final stage, a single method to call: processor.calibrate(calibDict) which will run all the calibrations/corrections which have an entry in the dictionary.
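A minimal sketch of such a dispatch (the pumpProbeTime method name is hypothetical; energyCalibration is mentioned elsewhere in this issue list):

# illustrative sketch of processor.calibrate(calibDict)
def calibrate(self, calibDict):
    # run every calibration/correction that has an entry in the dictionary
    dispatch = {
        'energy': self.energyCalibration,
        'pumpProbeTime': self.pumpProbeTimeCalibration,   # hypothetical method name
    }
    for key, params in calibDict.items():
        if key in dispatch:
            dispatch[key](**params)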

Implement readout for "express" data format

The new data structure requires a full remake of processor.readData() and related methods.

The current method is extremely memory consuming and breaks when reading runs that are too large. We should take the chance to improve the method used, for example by using dask delayed functions.

Binning on subset of data

The option of binning only a small subset of the data could greatly improve the determination of calibration parameters by speeding up the binning process. Loading a single h5 file is a rather poor choice, since it misses the distortions induced by slow drifts (over minutes or hours).

Raw data origin folder

Allow searching in all online-X folders (X = 1, 2, 3) created in the different beamtime blocks.

Major bug in PAH

There is a major bug in the PAH module (inside h5filedataaccess.valuesFromInterval) that results in ~8% of the electrons being lost, mostly at the end of the macrobunch.
The h5 files come in chunks of ~3200 macrobunches, and each chunk has its own shape. However, all chunks are loaded as if their shape were the same as the first one's, resulting in most other chunks being cut significantly. This severely impacts the unpumped data, usually at the end of the macrobunches.
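A generic sketch of reading each chunk with its own shape (plain h5py rather than the PAH API, and the dataset path is hypothetical):

import h5py

def read_all_chunks(filenames, dataset='/raw/dldTime'):
    # read every file/chunk honouring its own shape, not the first chunk's
    parts = []
    for fname in filenames:
        with h5py.File(fname, 'r') as f:
            parts.append(f[dataset][...])
    return parts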

Add progress bar

A progress bar while loading or binning data would be very useful.

Boost Histogram

I've pushed a new branch implementing this rather slow method to give weighted axes rather than the mean of the bin edges.
I also changed the numpy histogram method to boost_histogram's.
Overall, it's slower because of the finer binning needed for the weighting. However, it would be helpful to get suggestions on how to improve this implementation.

PS: I also pushed the updated setup.py etc., which had not been merged into master before.
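For reference, a minimal boost_histogram usage sketch with weighted storage (independent of the branch in question):

import boost_histogram as bh
import numpy as np

# one regular axis; Weight storage keeps per-bin sums of weights and variances
hist = bh.Histogram(bh.axis.Regular(100, 600.0, 700.0), storage=bh.storage.Weight())
values = np.random.default_rng(0).normal(650.0, 10.0, size=10_000)
hist.fill(values, weight=np.ones_like(values))
counts = hist.view()['value']   # per-bin weighted counts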

Timing corrections

Implement BAM corrections together with streak camera corrections.

The BAM only looks at the timing of the FEL electron bunches relative to the master clock.
The streak camera instead looks at the interaction of the laser beam with the electron bunches; it is also refreshed at a much lower rate (every few macrobunches).

In order to get pumpProbeTime correctly, both corrections need to be accounted for, and NOT just summed.

A good run to test these corrections on could be 22122, where a 1 ps shift in both BAM and streak camera appeared.

Optionalize ddMicrobunches

I have been thinking of making the whole ddMicrobunches dataframe optional. Is it even necessary at all? We should have all the data in the dd dataframe anyway. If it is needed for debugging purposes, we could generate it optionally, or we could always generate it but leave it empty if not required.

Just a suggestion; no need to fuss about this if it's not necessary.

DldSectorId alignment

The different sectors of the DLD are not perfectly aligned. We require a method for clearly aligning the dldTime of the different sectors, in order to increase the energy resolution.

Clean Binning methods

The introduction of the new numpy-based binning method led to lots of functions doing the same things in different ways. It would be best to delete them and keep only a single efficient method.

Unravel Energy Calibration

The processor.energyCalibration method is growing quickly, with many different corrections applied together. These should be separated into different methods, so that they can be tested independently.

It might be a good idea to make these functions that do not live inside the processor class, but in processor.utilities.calibration. They should all follow the same structure, so that debugging them is easier.

Such a restructuring would be worth defining for all calibration methods, no matter how simple they might be (see the pumpProbeTime calibration).

large files still in git history?

I noticed that the repository, as it is now with LFS for the data files, still amounts to nearly a GB.
Is this because of residuals from the LFS commits, or do we really have that many changes in the code to make up for 616 MB in the .git folder?

LFS quota

So there appears to be a 1 GB LFS bandwidth quota:

Bandwidth quota: If you use more than 1 GB of bandwidth per month without purchasing a data pack, Git LFS support is disabled on your account until the next month.


This means that right now I cannot even run an LFS fetch. Is this the right approach after all?

Avoid progress bar spam

When running the code outside of IPython, the progress bar does not update in place; instead, it prints a new line for each iteration.

During the dataframe creation, the first iteration takes far longer (5 minutes) than all the others (10 seconds at most), and the progress bar prints a line every 0.1 seconds instead of once per cycle.

I don't know whether this is a bug or a setting that can be changed, but if it cannot be fixed it would be nice to make the progress bar optional.
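A minimal sketch of making the progress bar optional and better behaved outside IPython, using tqdm (a suggestion, not the current implementation):

from tqdm.auto import tqdm

def iterate(items, use_progress_bar=True):
    # tqdm.auto picks a notebook bar in Jupyter and a single-line bar in a terminal;
    # disable=True turns it off, mininterval throttles the update rate
    yield from tqdm(items, disable=not use_progress_bar, mininterval=1.0)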

Binning performance

In the discussion of issue #73 I showed the binning performance as a function of the number of cores used, which suggested poor scaling with an increasing number of cores. This turns out to be wrong, however, as it was tested on dataframes with too few electrons.

Working with larger datasets, on the order of 200M electrons, the binning performance scales much better, up to at least 40 cores. In this case it showed a speedup of at least a factor of 4 between 4 and 32 cores; 94 cores seemed slightly faster, but I did not make a quantitative study of it.

This should be investigated further, looking for a sweet spot in the number of cores; alternatively, choosing the number of cores based on the size of the dataset could be a good option.

TODO: add settings integrity test

We need to add an integrity test for the SETTINGS.ini file in case a new parameter is added in the code. This is necessary because the SETTINGS.ini file is not tracked.

Make tutorial functioning again

Tutorial notebooks are not running, as they refer to CAMP packages which are not available with the new express readout.

FLASH DAQ info link broken

The link to the DESY webpage with information on the DAQ channels is broken in the last section of the README. Does anyone have the correct link?
