
astrodbkit2's Introduction

Astronomical database handler code

Powered by Astropy
Test status: https://github.com/astrodbtoolkit/AstrodbKit2/workflows/Test%20Astrodbkit2/badge.svg?branch=main
Coverage: https://codecov.io/gh/astrodbtoolkit/AstrodbKit2/graph/badge.svg?token=LMKVN65D82

AstrodbKit2 is an astronomical database handler code built on top of SQLAlchemy. This is built to work with the SIMPLE database, though similarly constructed databases will work.

Documentation is available at https://astrodbkit2.readthedocs.io/en/latest/

License

This project is Copyright (c) David Rodriguez and licensed under the terms of the BSD 3-Clause license. This package is based upon the Astropy package template which is licensed under the BSD 3-clause license. See the licenses folder for more information.

Contributing

We love contributions! AstrodbKit2 is open source, built on open source, and we'd love to have you hang out in our community.

Imposter syndrome disclaimer: We want your help. No, really.

There may be a little voice inside your head that is telling you that you're not ready to be an open source contributor; that your skills aren't nearly good enough to contribute. What could you possibly offer a project like this one?

We assure you - the little voice in your head is wrong. If you can write code at all, you can contribute code to open source. Contributing to open source projects is a fantastic way to advance one's coding skills. Writing perfect code isn't the measure of a good developer (that would disqualify all of us!); it's trying to create something, making mistakes, and learning from those mistakes. That's how we all improve, and we are happy to help others learn.

Being an open source contributor doesn't just mean writing code, either. You can help out by writing documentation, tests, or even giving feedback about the project (and yes - that includes giving feedback about the contribution process). Some of these contributions may be the most valuable to the project as a whole, because you're coming to the project with fresh eyes, so you can see the errors and assumptions that seasoned contributors have glossed over.

Note: This disclaimer was originally written by Adrienne Lowe for a PyCon talk, and was adapted by AstrodbKit2 based on its use in the README file for the MetPy project.

astrodbkit2's People

Contributors

dr-rodriguez


astrodbkit2's Issues

Update documentation for database description

To clarify how the database is used and built, we should add explicit documentation through Sphinx that describes how this works. The key thing is to highlight the difference between data/object and reference tables.

Address unicode issues in database

When working with the database there are sometimes explicit representations of unicode characters (eg, \u2212) instead of what they should be (-). We need to investigate where this is happening; it is possible the input data is wrong (and thus a matter for the SIMPLE scripts) but we can probably force an encoding on the output JSON to ensure it doesn't happen on the file representation. The problem with having a mix of unicode characters is searching exact values can be much harder.
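As a sketch of the second option, forcing ensure_ascii=False (and optionally mapping known problem characters such as U+2212 to ASCII) when writing the JSON would keep escaped code points out of the file representation. The helper name and replacement map below are illustrative, not the current API:

```python
import json

def normalize_for_json(value):
    """Replace known problem characters (e.g., U+2212 MINUS SIGN) with
    ASCII equivalents before serializing. The replacement map is illustrative."""
    replacements = {"\u2212": "-"}  # unicode minus -> ASCII hyphen-minus
    if isinstance(value, str):
        for bad, good in replacements.items():
            value = value.replace(bad, good)
    return value

# ensure_ascii=False writes real UTF-8 characters instead of \uXXXX escapes
record = {"parallax": normalize_for_json("\u22125.2")}
text = json.dumps(record, ensure_ascii=False)
```

With both the replacement map and ensure_ascii=False in place, the file representation contains only the literal characters, which keeps exact-value searches simple.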

Make more robust Spex Prism loader

Currently, the Spex Prism data loader is very basic. It checks that spex is somewhere in the file and then assumes it's a Spex Prism file and proceeds from there. It should be a little more robust, perhaps checking the header contents instead.

Furthermore, the load_spex function needs to better handle cases of missing header keywords, invalid units, and such.

Finally, unit tests should be written for the spex-related functions in spectra.py
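A header-based identifier might look like the following sketch; the keywords checked (INSTRUME, GRAT) and the dict-like header interface are assumptions about typical SpeX files, not verified against the current loader:

```python
def identify_spex_prism(header):
    """Guess whether a FITS header belongs to a SpeX Prism spectrum.

    `header` is any dict-like mapping of FITS keywords (e.g., an
    astropy.io.fits Header). The INSTRUME/GRAT keywords used here are
    assumptions about typical SpeX files.
    """
    instrument = str(header.get("INSTRUME", "")).lower()
    grating = str(header.get("GRAT", "")).lower()
    return "spex" in instrument and "prism" in grating
```

Checking two keywords rather than a substring of the whole file makes false positives from unrelated files mentioning "spex" much less likely.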

Make new IRAF loader

Make a loader and identifier for an IRAF FITS file that has multiple spectra AND a linear wavelength solution which needs to be calculated. This is a slight modification of the existing wcs1d-fits identifier/loader in specutils, except these files have NAXIS > 1.

name = wcs1d_multispec
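The linear wavelength solution itself is just the standard FITS WCS arithmetic; a minimal sketch, assuming the usual CRVAL1/CDELT1/CRPIX1 keywords are present in the header:

```python
def linear_wavelength(header, npix):
    """Reconstruct a linear wavelength array from standard FITS WCS keywords.

    CRVAL1: wavelength at the reference pixel
    CDELT1 (or CD1_1): wavelength step per pixel
    CRPIX1: 1-indexed reference pixel
    """
    crval1 = header.get("CRVAL1", 0.0)
    cdelt1 = header.get("CDELT1", header.get("CD1_1", 1.0))
    crpix1 = header.get("CRPIX1", 1.0)
    # FITS pixels are 1-indexed, hence the (i + 1) term
    return [crval1 + cdelt1 * (i + 1 - crpix1) for i in range(npix)]
```

For the NAXIS > 1 case, the same solution would be shared across each spectrum row in the data array.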

Add method to save reference table

We currently output the reference tables as part of save_database, but we should have an explicit add_reference_table or similar to just handle things like Publications tables.

Crawling through inventory results: companions

There is a possibility of having a CompanionRelationship table with associated CompanionParameter results. It may be good to consider adding the ability to crawl through and fetch companion parameters for any companion that is found.
This may be a bit tricky, but probably not impossible.

An option I'm thinking is add a dictionary with the Relationship name and the Column to use for matching, then associated tables to use with that matched column value. This can allow for more flexibility if the name of the relationship table is different and there is more than one associated table of values.
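A sketch of that configuration dictionary, with all table and column names hypothetical:

```python
# Hypothetical crawl configuration: the relationship table name, the column
# used for matching, and the associated tables are all placeholders.
companion_crawl = {
    "CompanionRelationship": {
        "match_column": "companion_name",             # column holding the companion's name
        "associated_tables": ["CompanionParameter"],  # tables to fetch with the matched value
    }
}
```

Keying the dictionary on the relationship table name allows multiple relationship tables, and the list of associated tables covers the "more than one table of values" case.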

Consider unit handling for spectra

Units for spectra are, for the single Spex Prism example, extracted from the FITS header. This should be an optional parameter that can be passed to set it. Ideally, as either a column name or an actual string value.
Error handling should also be in place in case the unit can't be resolved by astropy.

Add parameter to control fuzzy search in search_object

Currently, search_object does a fuzzy search (ilike(f'%{n}%')) to match against provided names. In general, this is fine. However, when you have exact names you don't need a fuzzy search. We should add a flag there to disable the fuzziness so that we can better handle things like the Simbad query tests.
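A minimal sketch of the proposed toggle; the fuzzy keyword and the helper below are hypothetical and show only how the match pattern would change:

```python
def name_filter(name, fuzzy=True):
    """Return the (operator, pattern) pair used to match a source name.

    fuzzy=True mirrors the current ilike('%name%') behaviour; fuzzy=False
    falls back to an exact (but still case-insensitive) match. The `fuzzy`
    keyword is a proposed parameter, not the current API.
    """
    if fuzzy:
        return "ILIKE", f"%{name}%"
    return "ILIKE", name  # exact text, case-insensitive
```

With fuzzy=False, a query for 'twa 27' would no longer also match names like 'twa 27b', which is what the Simbad query tests need.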

Explore disabling foreign key constraints on load_database

The current implementation of load_database clears existing tables before loading data from JSON files. However, the existence of foreign key constraints can greatly complicate this as tables would have to be deleted in a specified order for it to work.
We should explore if these constraints can be disabled in this step, if there are other ways to fully delete all contents, or if we just need to provide a way for the user to supply a table deletion order.
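For SQLite backends specifically, enforcement can be toggled with a PRAGMA; a sketch using the stdlib sqlite3 module (other backends would need their own mechanism, and disabling enforcement leaves orphaned child rows behind until they are reloaded):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sources (name TEXT PRIMARY KEY);
    CREATE TABLE Names (source TEXT REFERENCES Sources(name));
    INSERT INTO Sources VALUES ('twa 27');
    INSERT INTO Names VALUES ('twa 27');
""")

conn.execute("PRAGMA foreign_keys = OFF")  # constraints no longer enforced
conn.execute("DELETE FROM Sources")        # would need a deletion order if enforced
conn.execute("PRAGMA foreign_keys = ON")   # re-enable after reloading
```

Note that SQLite actually ships with foreign-key enforcement off by default, so the PRAGMA mainly matters for connections that have turned it on; SQLAlchemy-level engines for other databases would need a different approach.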

PyCharm Complaints

A couple of things PyCharm started complaining about:
Class Database has an unresolved attribute "Sources"; e.g., in:

db = Database(db_file)
db.query(db.Sources).count()

Then, db.search_object is flagged as receiving an unexpected argument, e.g. db.search_object('twa 27', fmt='pandas').source.values.
The code itself runs fine, as expected.

Perform string search

Add a method to perform an arbitrary string search against a specified table (or list of tables, or the entire database?). This would need to first identify which columns in each table are string-typed.
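For a SQLite backend, the string-typed columns can be identified from the table metadata; a sketch using the stdlib sqlite3 module (a SQLAlchemy inspector would be the portable equivalent for other backends):

```python
import sqlite3

def string_columns(conn, table):
    """Return the names of TEXT-typed columns in `table` (SQLite only).

    PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
    """
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in info if "TEXT" in (row[2] or "").upper()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sources (name TEXT, ra REAL, comments TEXT)")
```

The string search itself would then OR together an ilike filter over each of the returned columns.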

Consider use of Astropy QTable

Astropy QTables are Tables with units enforced in them. The databases I have worked with thus far don't have units stored in them, but it may be that some will (perhaps if the data is in long-form) or that it can be added via some map.
We may want to consider adding a .qtable method that will return query output but do so in QTable form. The simplest implementation would probably require the user to provide a dictionary of column -> unit, but we can explore other options too.

Consider option to output results as astropy.Tables

Astropy Tables are widely used and could be a useful output format. A similar case could be made for pandas DataFrames.
Currently, Astrodbkit2 uses the native SQLAlchemy output, a list of named tuples, which can be readily transformed to other formats.

While this would increase our dependencies, it may be worthwhile to consider adding this in some fashion. The ideal way would be as an extra method available when querying but that may require some deeper investigation in how the Query class is constructed in SQLAlchemy.

Add ignore_tables list

When instantiating a Database object, there may be a need to provide a list of tables that are to be ignored. This can be useful to handle cases where those tables are created/managed separately from astrodbkit2.

Add unit tests

Unit tests will be needed, can use the example schema.py (which should be renamed as schema_example.py) to create a temporary in-memory DB to test against.

Use logger to add debug mode

When in debug mode,

  • print the filename being saved in the save_json function
  • print the filename being opened in load_json function.
  • do whatever verbose=True does in the search_object function

When load_database or save_database fails, this would help troubleshoot the file that is causing the problem.
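A minimal sketch of the debug toggle using the standard logging module; the logger name and the set_debug helper are assumptions, not the current API:

```python
import logging

logger = logging.getLogger("astrodbkit2")  # hypothetical logger name

def set_debug(enabled: bool):
    """Toggle debug logging for the package.

    With debug enabled, save_json/load_json would emit e.g.
    logger.debug('Saving %s', filename), so the file that causes a
    load_database/save_database failure is the last one logged.
    """
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)

set_debug(True)
```

Using a module-level logger rather than print statements also lets callers redirect or silence the output with standard logging configuration.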

Migrate to GitHub Actions

Set up GitHub Actions to run the unit tests so we can turn Travis off. The existing .travis.yml file is probably doing too much. Decide on OS coverage.

Add method to pivot results

In some situations, we may need to group results and shift values into individual columns.
The example that comes to mind is data sorted like this:

name     parameter   value
source1  C/O ratio     0.1
source1  velocity       -2
source1  mass           10
source2  C/O ratio     0.2
source2  mass            1

While this is fine, the user may want instead a table of the form:

name        C/O ratio  velocity  mass
source1     0.1        -2        10
source2     0.2        None      1

This can be done with some group by commands, but we may want to provide a method .pivot to handle this automatically for the user, with an option to pass along what column name should be the pivot.
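The pivot logic itself is simple; a dependency-free sketch (pandas pivot_table would be the obvious real implementation), using hypothetical column names that match the example above:

```python
def pivot(rows, index="name", columns="parameter", values="value"):
    """Pivot long-form rows (a list of dicts) into wide form.

    Missing (index, column) combinations are simply absent from the
    inner dicts, which stands in for the None entries in the example.
    """
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[columns]] = row[values]
    return wide

rows = [
    {"name": "source1", "parameter": "C/O ratio", "value": 0.1},
    {"name": "source1", "parameter": "velocity", "value": -2},
    {"name": "source2", "parameter": "C/O ratio", "value": 0.2},
]
table = pivot(rows)
```

A .pivot method on query results would wrap this kind of logic and take the pivot column name as a parameter.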

Handle byte type in `json_serializer`

The BLOB type turns into a Python bytes value, which the code doesn't expect.

Add an

if isinstance(obj, bytes):

to the json_serializer function. In all cases where I have encountered this, the intended value was None.
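A sketch of the proposed check, assuming (as noted above) that the bytes values encountered so far were intended to be None:

```python
import json

def json_serializer(obj):
    """Fallback for json.dumps(default=...).

    BLOB values arrive here as bytes; in the cases seen so far the intended
    value was None, so bytes are mapped to None (decoding to str would be
    the alternative if real binary payloads ever need to survive).
    """
    if isinstance(obj, bytes):
        return None
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

text = json.dumps({"local_spectrum": b"\x00"}, default=json_serializer)
```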

Investigate slowness of building in Windows

When loading the database from source in Windows, the execution time is much slower, easily 20x or so.
Some things to investigate:

  • Does running a Linux VM/container fix this (i.e., is this a Windows limitation)?
  • Are there notes on performance with SQLAlchemy or similar on Windows?
  • Can we leverage multiprocessing to speed this up?

Consider option to also copy DB contents via SQLAlchemy

In copy_database_schema, consider adding an option to also have it copy the database contents and not just the schema. This could then be used for databases that don't follow approximately the schema that Astrodbkit2 is built to handle in terms of JSON handling.

Investigate codecov coverage limits

GitHub will display a red X for commits that don't pass the codecov limits, but it's not clear to me where/how to set this. It's possible I'm missing a codecov.yml file or similar that specifies this.
This is something we should set explicitly, matching the coverage requirements of other astropy packages.

Add module for loading data

We should add a separate module for handling loading data into the database. This would work independently of the JSON serialization and would just be wrappers to do things like load CSV files. I envision a set of functions for each supported format taking as inputs the database object, the table name, and the file to process. Optionally a mapping dictionary between file column names and database column names could be provided, otherwise we can expect them to be the same.
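A sketch of one such wrapper for CSV files; the db.add_rows interface and the _MemoryDB stand-in are hypothetical, used here only to make the example self-contained (real code would insert via the database object's SQLAlchemy connection):

```python
import csv
import io

def load_csv(db, table, fileobj, column_map=None):
    """Load rows from a CSV file object into `table` on `db`.

    column_map optionally renames file columns to database columns;
    unmapped columns are assumed to match the database names.
    """
    column_map = column_map or {}
    rows = [{column_map.get(k, k): v for k, v in row.items()}
            for row in csv.DictReader(fileobj)]
    db.add_rows(table, rows)  # hypothetical insertion interface
    return rows

class _MemoryDB:
    """Stand-in for a Database object, for demonstration only."""
    def __init__(self):
        self.tables = {}
    def add_rows(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

db = _MemoryDB()
load_csv(db, "Sources", io.StringIO("obj_name,ra\ntwa 27,165.5\n"),
         column_map={"obj_name": "source"})
```

Each supported format (CSV, ECSV, FITS tables, ...) would get a function with this same (db, table, file, mapping) shape.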

Add option to pass spectrum format

Currently, Astrodbkit2 will attempt to decipher the correct format, but it may be useful to have that be a parameter the user can pass.

Generic string search

We should consider adding a method to perform a generic string search against all text fields in a specified table. This is probably not something that would be used in the short term, and it can be coded manually with existing queries, so it is of lower priority.
Searching for an object by name is already covered by #4

Funky Spectra Loading

Some spectra, e.g. of 2MASS J00325584-4405058, are causing the website to crash because specutils runs into a UnitConversionError when trying to load them. This isn't handled by astrodbkit2 (i.e., the spectrum is returned as a string), so the website crashes because it receives a string instead of a Spectrum1D object. cc @kelle

Write docstrings

Many docstrings are missing; spend some time populating them.

Test and implement example View

Database views can be very powerful tools to represent data. We should create an example view in the test schema to demonstrate how to do this and verify that Astrodbkit2 handles them properly.

Perform cone search

Add method to perform cone search. May require specifying what table/columns contain coordinate information and how astrodbkit2 will handle that.
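The separation test itself is standard spherical trigonometry; a dependency-free sketch assuming the table exposes ra/dec columns in degrees (astropy's SkyCoord.separation would normally do this in real code):

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (ra, dec) points,
    via the spherical law of cosines (adequate away from tiny separations)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

def cone_search(rows, ra, dec, radius):
    """Filter rows (dicts with hypothetical 'ra'/'dec' keys, in degrees)
    to those within `radius` degrees of (ra, dec)."""
    return [r for r in rows
            if angular_separation(r["ra"], r["dec"], ra, dec) <= radius]

rows = [{"name": "a", "ra": 10.0, "dec": 0.0},
        {"name": "b", "ra": 50.0, "dec": 0.0}]
nearby = cone_search(rows, 10.0, 0.0, 1.0)
```

In practice the method would also need to know which table/columns hold the coordinates, as noted above.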

Consider how spectra should be stored

Spectra will likely point to a repository (eg, Amazon Cloud, CUNY AW). Should Astrodbkit2 resolve that into a plottable object or should that be handled separately by the user/client?

Discussion: Move some ingest scripts to here (astrodbkit)

Per discussion with @arjunsavel, I still think that some of the most fundamental ingest scripts should live in this package. This makes it possible for those scripts to be improved upon without needing to modify folks' databases. Example scripts are ingest_publication, ingest_telescope, ingest_source.

Expand sql_query method to have format parameter

Currently, the sql_query method is a simple call that returns the output as a list of named tuples. Since the general query method can now output astropy Tables and pandas DataFrames, we should offer the same functionality in sql_query. Because sql_query is not constructed the same way as query, this would simply be handled by a format parameter on the call, which is sufficient.
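A sketch of the proposed signature against a raw DBAPI connection; only the default named-tuple path is implemented here, and the fmt values are assumptions mirroring the query method:

```python
import sqlite3
from collections import namedtuple

def sql_query(conn, query, fmt="default"):
    """Run raw SQL and return rows in the requested format.

    fmt='default' returns named tuples (the current behaviour); 'pandas'
    and 'astropy' would convert the same rows, as the query method does.
    """
    cur = conn.execute(query)
    names = [d[0] for d in cur.description]
    Row = namedtuple("Row", names)
    rows = [Row(*r) for r in cur.fetchall()]
    if fmt == "default":
        return rows
    raise NotImplementedError(f"format {fmt!r} not sketched here")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sources (source TEXT)")
conn.execute("INSERT INTO Sources VALUES ('twa 27')")
result = sql_query(conn, "SELECT * FROM Sources")
```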

Bug: shared attributes (ie, tables) between two databases

When connecting to two separate databases, I see the union of all tables listed as attributes on both, as opposed to each database instance containing only its own set of tables.

from astrodbkit2.astrodb import Database, copy_database_schema
from sqlalchemy import types  # for BDNYC column overrides

# Establish connection to databases

connection_string = 'sqlite:///../BDNYCdevdb/bdnycdev.db'
bdnyc = Database(connection_string,
                 reference_tables=['changelog', 'data_requests', 'publications', 'ignore', 'modes',
                                   'systems', 'telescopes', 'versions', 'instruments'],
                 primary_table='sources',
                 primary_table_key='id',
                 foreign_key='source_id',
                 column_type_overrides={'spectra.spectrum': types.TEXT(),
                                        'spectra.local_spectrum': types.TEXT()})

connection_string = 'sqlite:///SIMPLE.db'
db = Database(connection_string)

# This fails because the list of tables is the same for both
assert db.metadata.tables != bdnyc.metadata.tables

Note that it's not just the names: the table metadata itself is accessible from either database instance even though the actual data isn't.

On object creation we will need to work out how to set database metadata attributes to ensure their scope is within the class alone.

Search result without column names

I am using the to_pandas method because passing fmt='pandas' produces a DataFrame without column names:

results = db.search_object(query, fmt='astropy')  # get the results for that object

This sounds like a bug in search_object that needs to be investigated.

Fix coverage reports

Neither coveralls nor codecov is working. Coveralls receives something but returns 0% coverage despite travis calculating it. Codecov says no coverage report exists. My guess is the directory structure that tox uses (.tmp/envname) is messing with the configuration, though it's surprising to me this wouldn't have been fixed already or that it doesn't return a more meaningful error.
