
davidt3 / daxa

10 stars · 1 watcher · 0 forks · 6.62 MB

Democratising Archival X-ray Astronomy (DAXA) is an easy-to-use Python module for downloading multi-mission X-ray telescope data and processing it into usable archives. Users can acquire entire archives, or filter observations based on ID/positions/time. Full support for XMM; partial support for eROSITA, Chandra, NuSTAR, Swift, Suzaku, ASCA, ROSAT, and INTEGRAL.

License: BSD 3-Clause "New" or "Revised" License

Python 98.70% TeX 1.30%
astronomy astrophysics python x-ray-astronomy xmm chandra erosita xga archival-astronomy nustar

daxa's People

Contributors

davidt3, dependabot[bot], guptaagr, jessicapilling, tobywallage


daxa's Issues

Should design a DAXA-specific cleaning process at some point

This would be mission-agnostic, and ideally support any of the telescopes which DAXA ends up being able to reduce data for. This would be an alternative to the mission-specific methods I am implementing first (i.e. the SAS cleaning methods for XMM).

Process logging storage keys

Currently the logs, errors, processed errors, and warnings are stored in archives under either an ObsID or an ObsID+instrument+sub-exposure ID combo.

This is somewhat at odds with what the docstrings in the Archive class say, as they state either an ObsID or an ObsID+instrument key combo.

I should consider having lower-level instrument and then sub-exposure dictionaries to store the results/logs in, rather than ObsID+instrument+exposure ID keys. I intend to implement some sort of lookup method that can grab all results for an ObsID, or a specific ObsID+instrument combo, and that would probably be easier with more distinct layers of dictionaries.
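As a rough sketch (the key names and helper below are hypothetical, not DAXA's actual implementation), the nested layout might look like:

```python
# Hypothetical sketch of the nested layout proposed above, with distinct
# dictionary layers rather than compound ObsID+instrument+exposure keys.
process_logs = {
    '0001730401': {                      # ObsID layer
        'PN': {                          # instrument layer
            'S003': 'epchain log text'   # sub-exposure layer
        }
    }
}

def get_logs(logs: dict, obs_id: str, inst: str = None, exp_id: str = None):
    """Grab all logs for an ObsID, or narrow to an instrument/sub-exposure."""
    res = logs[obs_id]
    if inst is not None:
        res = res[inst]
        if exp_id is not None:
            res = res[exp_id]
    return res
```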

Add the basic structure to the documentation

Set up the general structure, with an installation guide, intro section, contact section, etc.

Don't need to make it perfect for this issue, or write any tutorials, but sketch out the framework.

Support the acquisition and reduction of proprietary data

What it says on the tin really. For XMM, for instance, you need to provide a login and password, and I'll also have to make sure that proprietary data belonging to a particular user are marked as usable in the fetch_obs_info method, as currently all proprietary observations are marked as unusable.

PN non-imaging mode sub-exposures

As mentioned in issue #34, DAXA cannot currently parse XMM ODF summary files. As such, it is difficult to efficiently identify which exposures are in which observing mode, and other details about them.

Currently only PN imaging-mode data will be processed by DAXA (though hopefully that will change at some point). However, as issue #34 is not implemented, and I don't want to be reading headers for thousands of FITS files if I can avoid it, epchain will currently attempt to process every sub-exposure as an imaging-mode observation.

This will cause an error for other data modes (timing, for instance).

At the moment I am just going to let them fail, and then catch them further down the line, rather than identifying them a priori and never running the commands in the first place.
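If header checks ever do become worthwhile for a subset of files, a minimal sketch (not DAXA code) might look like the following, assuming the standard SUBMODE header keyword found in processed EPIC event lists:

```python
# A minimal sketch of checking the observing mode of an EPIC event list via
# its SUBMODE header keyword, rather than letting epchain fail. PN imaging
# submodes all start with 'Prime' (e.g. 'PrimeFullWindow'); timing/burst
# modes are 'FastTiming'/'FastBurst'.
from astropy.io import fits

def is_imaging_mode(evt_path: str) -> bool:
    with fits.open(evt_path) as hdul:
        submode = hdul[0].header.get('SUBMODE', '')
    return submode.startswith('Prime')
```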

XMM scheduled and unscheduled observations

I want to ensure that any unscheduled (with U in their exposure identifier rather than S) PN observations are processed by epchain, but it's not clear to me whether that is true by default.

You can set the 'schedule' flag in epchain to S or U, but it only triggers if odfaccess=odf rather than oal, which is not explained...

To be honest, exactly what an 'unscheduled' observation is isn't really explained either.
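For reference, a sketch of how both invocations would be shelled out from Python; the schedule and odfaccess parameters are the ones named above, and the behaviour is untested, so verify against the epchain documentation:

```python
# Sketch of running epchain for scheduled and unscheduled PN exposures.
# Per the note above, the schedule flag reportedly only takes effect when
# odfaccess=odf. Requires a configured SAS environment to actually run.
import subprocess

for sched_flag in ('S', 'U'):
    subprocess.run(f"epchain odfaccess=odf schedule={sched_flag}",
                   shell=True, check=True)
```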

Setup convenience functions to easily set up Archives in particular circumstances

For example, the simplest could be 'process all available observations from XMM', or 'process all available observations from XMM, Chandra, and eROSITA' (once support for other telescopes is added).

That would provide an archive instance which could be passed into processing functions, both the telescope-specific processing which is generally provided by a particular telescope's software suite, and the planned mission-agnostic processing that I will eventually add to DAXA (issue #17).

Other examples of convenience functions like this could be ones that would assemble an archive from multiple telescopes for observations relevant to some particular sources.
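A hypothetical sketch of the simplest such convenience function (the function name is illustrative, and the Archive constructor call is assumed from the descriptions in these issues, not a confirmed signature):

```python
# Hypothetical convenience function: build an Archive containing every
# available observation from a mission. Names here are illustrative only.
from daxa.mission import XMMPointed
from daxa.archive import Archive

def full_xmm_archive(archive_name: str) -> Archive:
    xm = XMMPointed()   # no filtering applied, so all observations remain
    xm.download()       # acquire the raw data for every observation
    return Archive(archive_name, [xm])  # constructor signature assumed
```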

Add previous-process awareness to XMM processing tasks

Need to ensure that things are run in the right order - for instance cif_build must be run before everything, odf_ingest must be run before basically everything else etc.

Currently I just rely on the user doing that, but that won't be a permanent state of affairs - I'll make use of the process_success property of Archive to a) check whether dependencies have been run, and b) whether they were successful.
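A rough sketch of that check, assuming process_success maps a process name to per-ObsID success flags (the actual structure may well differ):

```python
# Illustrative dependency check built on the process_success property
# mentioned above; the assumed structure is {process name: {ObsID: bool}}.
def check_dependencies(archive, required=('cif_build', 'odf_ingest')):
    for proc in required:
        success_map = archive.process_success.get(proc)
        if success_map is None:
            raise RuntimeError(f"{proc} has not been run for this archive.")
        failed = [obs for obs, ok in success_map.items() if not ok]
        if failed:
            raise RuntimeError(f"{proc} failed for ObsIDs: {failed}")
```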

Add combined sky-coverage calculation capabilities

This should both be able to assess how much of the sky is covered by a particular set of data, and produce coverage maps which can be stored alongside the processed datasets to allow for the identification of data relevant to a particular source.
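One way to sketch the coverage-fraction half of this (not DAXA code; it assumes simple circular fields of view and uses healpy to rasterise them):

```python
# Minimal sky-coverage sketch: rasterise circular FoVs onto a HEALPix grid
# and report the fraction of the sky covered.
import numpy as np
import healpy as hp

def coverage_fraction(ras_deg, decs_deg, fov_radius_deg, nside=1024):
    covered = np.zeros(hp.nside2npix(nside), dtype=bool)
    for ra, dec in zip(ras_deg, decs_deg):
        vec = hp.ang2vec(ra, dec, lonlat=True)
        covered[hp.query_disc(nside, vec, np.radians(fov_radius_deg))] = True
    return covered.mean()

# e.g. two overlapping XMM-like (~15 arcmin radius) pointings
print(coverage_fraction([150.0, 150.1], [2.2, 2.2], 0.25))
```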

Downloading specific instruments for XMM currently downloads everything, then deletes irrelevant data

I intended the downloading of specific instruments to minimise disk/bandwidth usage by not downloading data that a user considers irrelevant to their use case, or that can't (yet) be processed by DAXA. Unfortunately for XMM, downloading ODFs (observation data files) for specific instruments using the AIO URLs (and thus the AstroQuery interface) is currently impossible: regardless of the specified instrument, all instrument ODFs are downloaded.

This is happening on the XSA end, and I've sent in a ticket asking whether this is intended behaviour; however, whatever the answer ends up being, I have to deal with it for the time being. As such, the XMMPointed class download behaviour will acquire all instrument data for a given observation.

Then (assuming that this doesn't break any pre-built data processing tasks downstream) it will delete those ODF files which relate to instruments that have NOT been selected by the user.
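A sketch of that cleanup step, assuming the standard XMM ODF naming convention (RRRR_OBSID_II..., with the two-character instrument code after the revolution and ObsID fields); this is illustrative, not DAXA's actual code:

```python
# Delete ODF files for instruments the user did not select. Assumes the
# standard ODF filename pattern RRRR_OOOOOOOOOO_IIUEEE..., in which the
# instrument code (PN, M1, M2, R1, R2, OM) sits at characters 16-17.
import os

def prune_odfs(odf_dir: str, keep=('PN', 'M1', 'M2')):
    inst_codes = {'PN', 'M1', 'M2', 'R1', 'R2', 'OM'}
    for fname in os.listdir(odf_dir):
        code = fname[16:18]
        if code in inst_codes and code not in keep:
            os.remove(os.path.join(odf_dir, fname))
```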

The documentation is not building on RTD

I will add more information as I explore the issue, but every build of the DAXA documentation on read the docs has failed thus far.

I think it's a dependency versioning problem.

Add an Archive class

Instances of the Archive class will be capable of storing and accessing multiple missions, and will probably be the most user-facing class of this module. They will contain a bunch of convenience methods, and probably the planned sky coverage generation capabilities (issue #18).
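The intended usage pattern might look something like this (a sketch based on the description above; the class names and constructor signature are assumptions, not a confirmed API):

```python
# Assumed usage of the planned Archive class with multiple missions; the
# eROSITACalPV mission class and the Archive signature are illustrative.
from daxa.mission import XMMPointed, eROSITACalPV
from daxa.archive import Archive

xm = XMMPointed()
er = eROSITACalPV()
# (each mission's data would be downloaded before archiving)
arch = Archive('my_archive', [xm, er])
```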

NOTE ON ExceptionGroup IN JUPYTER NOTEBOOK

Very limited parts of DAXA use a new Python feature (introduced in 3.11, backported by the exceptiongroup module) that allows me to raise a set of exceptions together.

Specifically, this is used when Python errors occur during the parallel tasks that run command-line SAS tools (and possibly other telescope-specific command-line tools in the future). To be clear, Python errors shouldn't happen in those parallelised tasks, but if they do, an ExceptionGroup is raised.

It seems that at the moment (this is true on my setup on the date this issue was created) Jupyter notebooks do not show the tracebacks properly for ExceptionGroup. For instance, in the notebook a test-raised ExceptionGroup gives this traceback:

ExceptionGroup: pythony errors (3 sub-exceptions)

Whereas in a script run from terminal this is what you get (and should get):

  + Exception Group Traceback (most recent call last):
    | File "/Users/dt237/code/test_daxa/testo.py", line 12, in
    | success, errors, outs = cif_build(arch)
    | ^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 209, in wrapper
    | raise ExceptionGroup("pythony errors", python_errors)
    | ExceptionGroup: pythony errors (3 sub-exceptions)
    +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    | File "/opt/anaconda3/envs/daxa_dev/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    | result = (True, func(*args, **kwds))
    | ^^^^^^^^^^^^^^^^^^^
    | File "/Users/dt237/code/DAXA/daxa/process/xmm/_common.py", line 89, in execute_cmd
    | print(boi)
    | ^^^
    | NameError: name 'boi' is not defined
    +------------------------------------

So just be aware of that!
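For anyone wanting to reproduce this, a minimal example of the raising pattern (with the backport import for older Pythons):

```python
# Minimal reproduction of the pattern: several worker errors are collected
# and raised together as one ExceptionGroup.
import sys

if sys.version_info < (3, 11):
    from exceptiongroup import ExceptionGroup  # pip install exceptiongroup

python_errors = [NameError("name 'boi' is not defined") for _ in range(3)]
raise ExceptionGroup("pythony errors", python_errors)
```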

Start work on the DAXA paper for JOSS

It won't need to be very long, and some liberties can be taken in terms of writing about features that don't exist yet, because this won't be going on arXiv until they do exist.

Add anomalous CCD state checking for MOS

I am basically following the eSAS guide at this point, but checking for CCDs in anomalous states is going to be a good idea.

This should enable filtering based on what states the user considers acceptable as well.

The choice of acceptable states should of course be recorded for the archive.

SAS v21's (upcoming, not yet released) eSAS implementation is quite different from previous versions

I am currently implementing an eSAS-based XMM processing method, and have accidentally found the eSAS v21 manual indexed on Google. It indicates that many of the eSAS tools have had their inputs changed considerably to better resemble normal SAS functions (i.e. they'll take arguments to point to specific event lists, etc.).

This is obviously great (more control is better), but it does mean that there will be a significant difference in behaviour. As I do not want to lock people into one specific version of SAS if it can be avoided (especially considering that version isn't even out yet), I will have to build two different approaches (though within the same Python function): one for SAS v21 and one for any lower SAS version (though I don't think I will allow any SAS version below v14).

Hopefully it won't be too difficult, considering I already identify the installed SAS version in the find_sas function; it'll just be some extra work.
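A sketch of the branching logic, using the version detection mentioned above (find_sas is the existing DAXA function; the comparison code here is illustrative, not DAXA's):

```python
# Illustrative version branching: pick an eSAS calling convention based on
# the detected SAS version, refusing anything below v14 as stated above.
from packaging.version import Version

def esas_interface(sas_version: str) -> str:
    if Version(sas_version) < Version("14.0"):
        raise ValueError("SAS versions below v14 are not supported.")
    if Version(sas_version) >= Version("21.0"):
        return "v21 argument style"
    return "legacy script style"

print(esas_interface("20.0"))  # -> legacy script style
```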

Small-window mode PN processing errors

When running epchain on small-window mode data without some extra configuration, it will throw errors when it finds that most of the CCD IME files are missing (small-window mode just uses one CCD). These errors aren't fatal to the epchain process, but they do contaminate the stderr output, which DAXA parses to try to find any truly fatal errors.

As such, we should identify which CCDs are available a priori and pass that list to the ccds parameter of epchain. Ideally this will eventually be done by parsing the SAS summary file (issue #34), but for now I think I can just search through files in the ODF directory.
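A sketch of that file search, assuming the ODF naming convention in which PN imaging event files carry the CCD number just before the 'IME' type code (e.g. ..._PNS00304IME.FIT); the space-separated format for the ccds parameter is also an assumption to verify:

```python
# Find which PN CCDs have imaging-mode (IME) event files in an ODF directory.
import os
import re

def available_pn_ccds(odf_dir: str, exp_id: str = 'S003') -> str:
    patt = re.compile(r'PN' + exp_id + r'(\d{2})IME')
    ccds = set()
    for fname in os.listdir(odf_dir):
        match = patt.search(fname)
        if match:
            ccds.add(int(match.group(1)))
    # Assumed space-separated format for epchain's ccds parameter
    return ' '.join(str(c) for c in sorted(ccds))
```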

XMM data weirdness

This isn't really a question to me, but just XMM in general.

On XSA the observation 0001730401 shows only RGS data being available - the quality report just indicates RGS as well, no EPIC data.

However, when I acquire the ODFs I find unscheduled PN exposures and scheduled MOS exposures - so what gives?

This is being left here mostly as a reminder to myself to try and solve this mystery.

Ensure that a new CCF is created if a different analysis date is used

In the case where CCFs already exist, but cif_build is run again with a different analysis date set, make sure they are overwritten. The date information should be stored somewhere as well.

This will be integrated into the backend database, I suspect, in some way that I have yet to figure out.

If a CCF is re-created, then presumably the reduction should be re-run to be completely valid?
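A simple sketch of the check (the marker file name 'ccf_date.txt' is a hypothetical stand-in for wherever the date ends up being recorded):

```python
# Decide whether cif_build needs re-running: rebuild if no CCF exists, or if
# the recorded analysis date differs from the requested one.
import os

def needs_new_ccf(ccf_dir: str, analysis_date: str) -> bool:
    marker = os.path.join(ccf_dir, 'ccf_date.txt')
    if not os.path.exists(os.path.join(ccf_dir, 'ccf.cif')):
        return True   # no CCF has been built yet
    if not os.path.exists(marker):
        return True   # no record of which date was used
    with open(marker) as stored:
        return stored.read().strip() != analysis_date  # date changed
```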

Implement a wrapper for the eSAS espfilter soft-proton filter function

Again following the example of the XMM eSAS manual, I will be using espfilter to find bad time intervals with high levels of soft proton flaring courtesy of the Sun.

In the currently released version of eSAS there is a script called PN-FILTER (and an equivalent MOS implementation) that calls espfilter, but the upcoming version of eSAS (per the unreleased manual I found) has removed it, so that eSAS adds functions rather than processing scripts to SAS.

I will be attempting to make DAXA compatible with as many versions of SAS/eSAS as possible (whilst remaining consistent) by not using PN-FILTER, and instead making an espfilter function for DAXA that supports both PN and MOS.

I should normalise how DAXA calls emchain and epchain as much as possible

Currently emchain will loop through all available sub-exposures, including unscheduled observations, without any extra intervention. As such the processing of an entire ObsID-MOSX set of data happens as one process.

As epchain has to have the sub-exposures manually specified, each sub-exposure of each observation is processed separately. As such it gets its own success/log/error entry in the Archive records - considerably more granular.

I think I should change emchain's behaviour in DAXA so it is more comparable to how epchain behaves. I can address separate sub-exposures by themselves in emchain (using the exposure argument) - this will also make it easier to check that a particular process for a particular sub-exposure worked when it comes to looking for anomalous CCD states in MOS observations. A sketch of the proposed loop is below.
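The exact emchain parameter names should be taken from the emchain documentation; 'instruments' and 'exposures' below are assumptions:

```python
# Sketch of the proposed per-sub-exposure emchain loop, mirroring how DAXA
# already drives epchain. Requires a configured SAS environment to run, and
# the parameter names are assumed, not verified.
import subprocess

for exp_id in ('S001', 'S002', 'U002'):  # sub-exposures found for this ObsID
    subprocess.run(f"emchain instruments=M1 exposures={exp_id}",
                   shell=True, check=True)
```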

cleaned_evt_lists fails for 0099280101 because of emanom and calclosed

Currently no checks are performed at any stage to identify what the filter value of a particular sub-exposure of an observation is, and as such everything is blindly thrown into emanom (if the user chooses to run it). This method will fail for any CalClosed filter data, which then carries through to cleaned_evt_lists, because DAXA tries to create cleaned versions of those event lists as well and expects there to be an emanom log file, even though CalClosed data are not useful observation data for us.
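A sketch of the missing check (the FILTER keyword is standard in XMM event list headers, with 'CalClosed' indicating calibration data; this is illustrative, not DAXA's code):

```python
# Skip CalClosed sub-exposures before they reach emanom / cleaned_evt_lists,
# by reading the standard FILTER keyword from an EPIC event list header.
from astropy.io import fits

def science_filter(evt_path: str) -> bool:
    with fits.open(evt_path) as hdul:
        filt = str(hdul[0].header.get('FILTER', ''))
    return filt.lower() != 'calclosed'
```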

Failed to find or open the following file: (ffopen) toto.in.mos[1]

This is happening during emchain runs, and at another point in the stderr output there is 'sh: lcurve: command not found'.

I suspect the two might be connected.

lcurve is part of the Xronos section of HEASoft, which I may not have selected for my laptop install of HEASoft. This could help me learn which parts of HEASoft are actually required for SAS to work in its entirety.

My ICER install of HEASoft is the whole thing, so I can test running emchain there to see whether the same problem pops up.
