momentoscope / hextof-processor

Code for preprocessing data from the HEXTOF instrument at FLASH, DESY in Hamburg (DE)

Home Page: https://hextof-processor.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Languages: Jupyter Notebook 78.37%, Python 20.94%, C++ 0.50%, Cython 0.12%, Makefile 0.07%
Topics: arpes, photoemission, condensed-matter-physics, solid-state-physics, distributed-processing, dask-dataframes, dask, pes, free-electron-laser, materials-science

hextof-processor's Introduction


hextof-processor

This code is used to analyze data measured at FLASH using the HEXTOF (high energy X-ray time of flight) instrument. The HEXTOF uses a delay line detector (DLD) to measure the position and arrival time of single electron events.

The analysis of the data is based on "clean tables" of single events as dask dataframes. There are two dataframes generated in the data readout process. The main dataframe dd contains all detected electrons and can be binned according to the needs of the experiment. The second dataframe ddMicrobunches contains the FEL pulses and is commonly used for normalization.

The class DldProcessor contains the dask dataframes as well as the methods to perform binning in a parallelized fashion.

The DldFlashDataframeCreatorExpress class subclasses DldProcessor and is used for creating the dataframes from the hdf5 files generated by the DAQ system.
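A typical workflow then looks roughly like the following (a minimal sketch; the import path and the binning calls addBinning/computeBinnedData are assumptions and may differ from the actual API):

from processor.DldFlashDataframeCreatorExpress import DldFlashDataframeCreatorExpress

# create the processor and read a run into the dd / ddMicrobunches dataframes
processor = DldFlashDataframeCreatorExpress()
processor.runNumber = 12345        # hypothetical run number
processor.readData()

# define binning axes and compute the binned array in parallel
processor.addBinning('dldTime', 620, 670, 0.1)   # assumed signature: name, start, stop, step
result = processor.computeBinnedData()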

Installation

In this section we will walk you through all you need to get up and running with the hextof-processor.

For using this package with the old FLASH data structure, please refer to README_DEPR.md.

1. Python

If you don't have Python on your local machine yet, we suggest starting with Anaconda or Miniconda. Details about how to install them can be found here.

2. Install hextof-processor

Download the package by cloning the repository to a local folder:

$ git clone https://github.com/momentoscope/hextof-processor.git

2.1 Virtual environment

Create a clean new environment (we strongly suggest you always do so!).

If you are using conda:

$ conda env create -f environment.yml

Now, to activate your new environment (Windows):

$ conda activate hextof-express

If you are using Linux:

$ source activate hextof-express

2.2 Virtual environment in Jupyter Notebooks

To add the newly created environment to the Jupyter Notebook kernel list, install it as a new kernel:

(hextof-express)$ python -m ipykernel install --user --name=hextof-express

3. Local Setup

3.1 Initialize settings

Finally, you need to initialize your local settings. This can be done by running InitializeSettings.py from the repository folder:

(hextof-env)$ python InitializeSettings.py

This will create a file called SETTINGS.ini in the local repository folder. It is used to store the local settings as well as calibration values (this will change in the future) and other options.

3.2 Setting up local paths

In order to make sure your folders are in the right place, open this file and modify the paths in the [paths] section.

  • data_raw_dir - location where the raw h5 files from FLASH are stored
  • data_h5_dir - storage location for binned hdf5 files
  • data_parquet_dir - where the Apache Parquet files generated from the single-event tables are stored (we suggest using an SSD for this folder, since it greatly improves binning performance)
  • data_results_dir - folder where results (figures and binned arrays) are saved

If you are installing on Maxwell, we suggest setting the following paths:

[paths]
data_raw_dir =     /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/raw/
data_h5_dir =      /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/
data_parquet_dir = /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/parquet/
data_results_dir = /asap3/flash/gpfs/pg2/YYYY/data/xxxxxxxx/processed/*USER_NAME*/binned/

Where YYYY is the current year and xxxxxxxx is the beamtime number.

3.3 Calculating sector_correction list

If you like, you can add the sector_correction list to the settings; it applies a per-sector shift to compensate for any misalignment between the sectors. At the very least, this should include the "bit stealing hack" correction, where the last bits of dldTime are set so that they encode dldSectorId. The list can be generated for you by the calibration.gen_sector_correction function, given the energy shifts you want.
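A minimal sketch of how this could be done (the import path and the exact signature of gen_sector_correction are assumptions):

from processor.utilities import calibration

# hypothetical per-sector energy shifts, one entry per DLD sector
shifts = [0.0, 0.2, -0.1, 0.15, 0.0, 0.1, -0.05, 0.2]

# generate the sector_correction list, including the "bit stealing hack" offsets
sector_correction = calibration.gen_sector_correction(shifts)

The resulting list can then be copied into the sector_correction entry of SETTINGS.ini.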

3.4 Installing the Gaussian-broadened Doniach-Sunjic fit routines

Please refer to XPSdoniachs/README.md for compilation instructions.

4. Test your installation

In order to test your local installation, we have provided a series of tutorial Jupyter Notebooks. You can find all the relevant material in the tutorial folder of the main repository. We suggest starting by testing Express data readout.ipynb.

Documentation

The documentation of the package can be found here.

Examples are available as Jupyter Notebooks. Some example data is provided together with the examples. More compatible data is being collected and will soon be added to online open-access repositories.

Citation and acknowledgments

If you use this software, please consider citing these two papers:

hextof-processor's People

Contributors

balerion, jus80687, michaelheber, realpolitix, steinnymir, yacremann, zain-sohail


hextof-processor's Issues

Reporting of electron fraction in binned volume

It would be nice if the binning function returned, at the end, a report of the fraction of electrons that fell within the binned volume (e.g. 15 Mcts binned out of a total of 20 Mcts in the run, i.e. 75%).
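A minimal sketch of what such a report could look like, assuming the binned result and the total electron count of the run are available as plain numbers:

import numpy as np

def report_binned_fraction(binned, total_electrons):
    # print how many electrons ended up inside the binned volume
    binned_counts = float(np.sum(binned))
    fraction = binned_counts / total_electrons
    print(f"{binned_counts / 1e6:.1f} Mcts of {total_electrons / 1e6:.1f} Mcts "
          f"binned ({fraction:.0%})")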

Improve binning performance

Binning using numpy.histogramdd works quite well but, as Laurenz pointed out, it creates a strong bottleneck when binning with a large number of cores. Since this is the case on Maxwell, where most of the analysis is done, we need to improve it.

One option would be to port the binning methods from MPES, especially the numba-based binning (a generic sketch of that approach is given below).
Another would be to make a third package incorporating the dataframe numerical handling for both environments, but this might be more of a long-term project.
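For reference, a generic numba-jitted 1D histogram (not the MPES implementation) could look like this; each dask partition would be binned with it and the partial histograms summed:

import numba
import numpy as np

@numba.njit
def hist1d(values, lo, hi, nbins):
    # simple fixed-width histogram over one column of a partition
    out = np.zeros(nbins, dtype=np.int64)
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            out[int((v - lo) / width)] += 1
    return out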

acquisition time and bin size normalization

Normalize the output data by the acquisition time (number of macrobunches / 10) and by the bin sizes along all dimensions, to get numbers that are independent of binning and of acquisition time. For a properly calibrated 3D cube this yields rates in Hz/eV/Å^-2, which integrate to rates in Hz within a given eV·Å^-1·Å^-1 volume.
At this step one loses the total count information, which is useful for error-bar calculations. So perhaps error-bar information should be calculated and stored before that step, and then normalized accordingly.
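A minimal sketch of such a normalization, assuming the binned result is a plain numpy array, the FLASH macrobunch repetition rate of 10 Hz, and known bin widths:

import numpy as np

def normalize_to_rates(binned, n_macrobunches, bin_widths):
    # acquisition time approximated as number of macrobunches / 10 (10 Hz)
    acq_time = n_macrobunches / 10.0
    # total bin volume, e.g. eV * Å^-1 * Å^-1 for a 3D cube
    bin_volume = np.prod(bin_widths)
    return binned / (acq_time * bin_volume)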

Basic testing framework

  • Create a testing framework in Pytest or Unittest (whichever works for the situation)
  • Test a core functionality

From what I have read, Pytest is easier to implement, its functionality is similar to Unittest, and it can also run Unittest tests if necessary. But I will try out both and confirm.
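A minimal pytest sketch of what such a core-functionality test could look like (self-contained, not tied to the current code):

# tests/test_binning.py -- run with: pytest tests/
import numpy as np

def test_binning_conserves_counts():
    # trivial sanity check: binning should conserve the number of events
    values = np.random.default_rng(0).uniform(0.0, 10.0, size=1000)
    counts, _ = np.histogram(values, bins=50, range=(0.0, 10.0))
    assert counts.sum() == len(values)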

possible improvement of laziness in createDataframePerElectron

In the electron dataframe creation there is a section that causes kernel crashes for very large datasets. I suspect this is because the whole dataframe is computed instead of being kept lazy:

self.daListResult = dask.compute(*daList)
a = np.concatenate(self.daListResult, axis=1)
da = dask.array.from_array(a.T, chunks=self.CHUNK_SIZE)
cols = self.ddColNames
self.dd = dask.dataframe.from_array(da, columns=cols)

Is there a way to do the same but using map_partition or similar? Is there a reason why this is not kept lazy?
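One possible way to keep this lazy (a sketch, assuming each element of daList can be wrapped in a delayed function that returns one pandas partition with the final columns):

import dask
import dask.dataframe
import pandas as pd

@dask.delayed
def make_partition(chunk, cols):
    # build one pandas partition; nothing is computed until the dataframe is used
    return pd.DataFrame(chunk.T, columns=cols)

cols = self.ddColNames
parts = [make_partition(chunk, cols) for chunk in daList]
meta = pd.DataFrame(columns=cols, dtype='float64')
self.dd = dask.dataframe.from_delayed(parts, meta=meta)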

Make repo independent + new name?

Currently this repo is a fork of Yves' repo. I suspect this means the idea would be to eventually merge the two with a pull request. Still, I think we should consider keeping the two completely separate, so that we keep Yves' repo as a legacy reference.

I am not sure I understand the consequences entirely, so if I am just asking for trouble, please let me know.

In any case, if we decide to go in this direction, it is possible to contact GitHub to let them know we want to make this repo independent, and it should be a straightforward process.

PumpProbeTime calibration

Test and improve the delay-stage to pump-probe-time calibration method. This should preferably be done while keeping backward compatibility with the old stage.

dataframes not reading correctly

When reading a dataframe without resetting the processor object, readData will not read a new run even if processor.runNumber has been updated.

Change from distutils to setuptools

It seems that setuptools is a better choice than distutils, since it is a requirement for PyPI. So switching to it seems like the right thing to do.

Clean up ddMicrobunches dataframe creator

Needs a lot of cleanup; there is some mess with the array creation for the different channels. Also, the dataframe should be generated with the channels selected in the SETTINGS.ini file.

Error reading settings files when package installed via pypi

The DldProcessor Class assumes a root folder of
root_folder = os.path.dirname(os.path.dirname(processor.__file__))

However, if the package is installed into an environment, this leads to searching in an incorrect folder. Perhaps this should be user-customizable in that case, or is there already a way?
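One possible workaround (a sketch, not the current behaviour): let an environment variable override the default location, e.g.

import os
import processor

# fall back to the package location, but let the user override it;
# HEXTOF_SETTINGS_DIR is a hypothetical variable name
default_root = os.path.dirname(os.path.dirname(processor.__file__))
root_folder = os.environ.get('HEXTOF_SETTINGS_DIR', default_root)
settings_file = os.path.join(root_folder, 'SETTINGS.ini')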

Add use of versions

This will make it possible to keep track of which version of the code was used in which notebook, and to see whether you are using the latest version with the new functionality.

Also, the version should be printed automatically on import, for easy visualization in Jupyter notebooks.
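A minimal sketch of how this could be done in the package's __init__.py (the distribution name passed to version() is an assumption):

# processor/__init__.py (sketch)
from importlib.metadata import version, PackageNotFoundError

try:
    __version__ = version("hextof-processor")
except PackageNotFoundError:
    __version__ = "unknown (not installed)"

print(f"hextof-processor {__version__}")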

KeyError: 'monochromatorEnergy'

In the calibrate function there is a useAvgMonochormatorEnergy parameter. Whether it is set to False or True, it raises KeyError: 'monochromatorEnergy' either way. I saw that this is in the TODO, but I couldn't figure out a way to bypass it in the function.

introduce calibration dictionary

A great improvement would be to define a dictionary with all calibration and correction parameters, which would be passed to the different calibration methods. This would allow calibration/correction parameters to be transferred more easily between datasets, and would make it possible to store them together with the data, increasing the reusability and transparency of the data treatment.

The idea would be to have, at the final stage, a single method to call: processor.calibrate(calibDict) which will run all the calibrations/corrections which have an entry in the dictionary.
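A minimal sketch of such a dispatch (the pumpProbeTime method name is hypothetical; energyCalibration is mentioned elsewhere in this issue list):

# illustrative sketch of processor.calibrate(calibDict)
def calibrate(self, calibDict):
    # run every calibration/correction that has an entry in the dictionary
    dispatch = {
        'energy': self.energyCalibration,
        'pumpProbeTime': self.pumpProbeTimeCalibration,   # hypothetical method name
    }
    for key, params in calibDict.items():
        if key in dispatch:
            dispatch[key](**params)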

Implement readout for "express" data format

The new data structure requires a full remake of processor.readData() and related methods.

The current method is extremely memory consuming and breaks when reading runs that are too large. We should take the chance to improve the method used, for example by using dask delayed functions.

Binning on subset of data

The option of binning only a small subset of the data could greatly improve the determination of calibration parameters by speeding up the binning process. Loading a single h5 file is a rather poor choice, since it misses the distortions induced by slow drifts (over minutes or hours).

Raw data origin folder

Allow searching in all online-X folders (X = 1, 2, 3) created in the different beamtime blocks.

Major bug in PAH

There is a major bug in the PAH module (inside h5filedataaccess.valuesFromInterval) that results in ~8% of the electrons being lost, mostly at the end of the macrobunch.
The h5 files come in chunks of ~3200 macrobunches, and each chunk has its own shape. However, all chunks are loaded as if their shape were the same as the first one's, resulting in most other chunks being cut significantly. This severely impacts the unpumped data, usually at the end of the macrobunches.
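A generic sketch of reading each chunk with its own shape (plain h5py rather than the PAH API, and the dataset path is hypothetical):

import h5py

def read_all_chunks(filenames, dataset='/raw/dldTime'):
    # read every file/chunk honouring its own shape, not the first chunk's
    parts = []
    for fname in filenames:
        with h5py.File(fname, 'r') as f:
            parts.append(f[dataset][...])
    return parts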

Add progress bar

A progress bar while loading or binning data would be very useful.

Boost Histogram

I've pushed a new branch implementing this rather slow method to give weighted axes rather than the mean of the bin edges.
I also changed the numpy histogram method to boost_histogram's.
Overall, it's slower because of the finer binning needed for the weighting. However, it would be helpful to get suggestions on how to improve this implementation.

PS: I also pushed the updated setup.py etc., which had not been merged into master before.
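For reference, a minimal boost_histogram usage sketch with weighted storage (independent of the branch in question):

import boost_histogram as bh
import numpy as np

# one regular axis; Weight storage keeps per-bin sums of weights and variances
hist = bh.Histogram(bh.axis.Regular(100, 600.0, 700.0), storage=bh.storage.Weight())
values = np.random.default_rng(0).normal(650.0, 10.0, size=10_000)
hist.fill(values, weight=np.ones_like(values))
counts = hist.view()['value']   # per-bin weighted counts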

Timing corrections

Implement BAM corrections together with streak camera corrections.

The BAM only looks at the timing of the FEL electron bunches relative to the master clock.
The streak camera instead looks at the interaction of the laser beam with the electron bunches; it is also refreshed at a much lower rate (every few macrobunches).

In order to get pumpProbeTime correctly, both corrections need to be accounted for, and NOT just summed.

A good run to test these corrections on could be 22122, where a 1 ps shift in both BAM and streak camera appeared.

Optionalize ddMicrobunches

I have been thinking of making the whole ddMicrobunches dataframe optional. Is it even necessary at all? We should have all the data in the dd dataframe anyway. If it is needed for debugging purposes, we could generate it optionally, or we could always generate it but leave it empty if not required.

Just a suggestion; no need to fuss about this if it's not necessary.

DldSectorId alignment

The different sectors of the DLD are not perfectly aligned. We require a method for clearly aligning the dldTime of the different sectors, in order to increase the energy resolution.

Clean Binning methods

The introduction of the new numpy-based binning method led to lots of functions doing the same things in different ways. It would be best to delete them and keep only a single efficient method.

Unravel Energy Calibration

The processor.energyCalibration method is growing quickly, with many different corrections applied together. These should be separated into different methods, so that they can be tested independently.

It might be a good idea to make these functions that do not live inside the processor class, but in processor.utilities.calibration. They should all follow the same structure, so that debugging them is easier.

Such a restructuring would be worth defining for all calibration methods, no matter how simple they might be (see the pumpProbeTime calibration).

large files still in git history?

I noticed that the repository, as it is now with LFS for the data files, still amounts to nearly a GB.
Is this because of residuals from the LFS commits, or do we really have that many changes in the code to make up for 616 MB in the .git folder?

LFS quota

So there appears to be a 1 GB LFS bandwidth quota:

Bandwidth quota: If you use more than 1 GB of bandwidth per month without purchasing a data pack, Git LFS support is disabled on your account until the next month.


This means that right now I cannot even run an LFS fetch. Is this the right approach after all?

Avoid progress bar spam

When running the code outside of IPython, the progress bar does not update in place; instead, it prints a new line for each iteration.

During the dataframe creation, the first iteration takes far longer (5 minutes) than all the others (10 seconds at most), and the progress bar prints a line every 0.1 seconds instead of once per cycle.

I don't know whether this is a bug or a setting that can be changed, but if it cannot be fixed it would be nice to make the progress bar optional.
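A minimal sketch of making the progress bar optional and better behaved outside IPython, using tqdm (a suggestion, not the current implementation):

from tqdm.auto import tqdm

def iterate(items, use_progress_bar=True):
    # tqdm.auto picks a notebook bar in Jupyter and a single-line bar in a terminal;
    # disable=True turns it off, mininterval throttles the update rate
    yield from tqdm(items, disable=not use_progress_bar, mininterval=1.0)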

Binning performance

In the discussion of issue #73 I showed the binning performance as a function of the number of cores used, which suggested poor scaling with an increasing number of cores. This turns out to be wrong, however, as it was tested on dataframes with too few electrons.

Working with larger datasets, on the order of 200M electrons, the binning performance scales much better, up to at least 40 cores. In this case it showed a speedup of at least a factor of 4 between 4 and 32 cores; 94 cores seemed slightly faster, but I did not make a quantitative study of it.

This should be investigated further, looking for a sweet spot in the number of cores; alternatively, choosing the number of cores based on the size of the dataset could be a good option.

TODO: add settings integrity test

We need to add an integrity test for the SETTINGS.ini file in case a new parameter is added in the code. This is necessary because the SETTINGS.ini file is not tracked.

Make tutorial functioning again

Tutorial notebooks are not running, as they refer to CAMP packages which are not available with the new express readout.

FLASH DAQ info link broken

The link to the DESY webpage with information on the DAQ channels is broken in the last section of the README. Does anyone have the correct link?
