Kowalski: a multi-survey data archive and alert broker for time-domain astronomy
License: MIT License
Feature Summary
There's quite a bit of duplicate code in those ingest scripts, so it would be useful to refactor them into a general Ingestor class plus subclasses for the custom logic.
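A minimal sketch of one possible shape for this refactoring (the class and method names here are made up, not from the codebase): the shared ingestion plumbing lives in a base class, and each script only overrides the catalog-specific preprocessing.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Ingestor(ABC):
    """Shared ingestion plumbing; subclasses supply catalog-specific logic."""

    def __init__(self, collection: str):
        self.collection = collection

    @abstractmethod
    def preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        """Catalog-specific cleaning and type conversions go here."""

    def ingest(self, df: pd.DataFrame) -> list:
        # shared code: preprocess, convert to documents, and (in the real
        # implementation) batch-insert into self.collection
        return self.preprocess(df).to_dict(orient="records")


class ZTFMatchfileIngestor(Ingestor):
    def preprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        df = df.copy()
        df["_id"] = df["_id"].astype(int)  # e.g. the _id type conversion
        return df
```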
Notes on implementation
_id type conversion: df["_id"] = df["_id"].astype(int)
Should be:
database:
  indexes:
    <catalog>:
      - name: <name>
        fields: <list of field names and directions>
        unique: <true|false>
To reduce the number of API calls to the TNS, we should use the public_timestamp parameter in our queries (see here for details). The weekly sync job could then be dropped.
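For illustration, a sketch of how the query payload might be built (the endpoint and parameter names follow the public TNS search API docs; treat them as assumptions to verify):

```python
import json


def build_tns_search_data(api_key: str, since: str) -> dict:
    """Build the form data for a TNS search request restricted to objects
    made public after `since` (UTC, "YYYY-MM-DD HH:MM:SS")."""
    # public_timestamp limits results to recently published objects, so we
    # no longer need to re-pull the whole catalog every week
    return {
        "api_key": api_key,
        "data": json.dumps({"public_timestamp": since}),
    }


# would be POSTed to https://www.wis-tns.org/api/get/search
payload = build_tns_search_data("API_KEY", "2021-03-01 00:00:00")
```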
Feature Summary
A script similar to ingest_ztf_matchfiles.py, but for the public CSV files with light curves.
Usage / behavior
Data should be ingested into a ZTF_sources_DR collection.
Additional context
See https://www.ztf.caltech.edu/page/dr5#12c for details.
In particular, SDSS's u band would bridge the SED between PS1 and GALEX.
Hi,
We had a discussion about whether filter modifications are immediately applied to subsequent alerts. You said that the filters are only loaded once per day. Would it be possible to reduce the delay, ideally to none at all? This would be very useful for debugging filters in case a problem was not caught by the Compass.
Cheers,
Steve
Feature Summary
Create several small matchfiles with public data only (from real matchfiles) and test ingesting them with ingest_ztf_matchfiles.py.
Feature Summary
Make the token expiration interval configurable, as it is currently baked into the API code.
Implementation details
Move the definition to config.yaml.
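For instance (the key name below is hypothetical; the API code would read it from here instead of hardcoding it):

```yaml
server:
  # hypothetical key: lifetime of issued API tokens
  token_expiration_days: 30
```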
Describe the bug
When testing a local version of Kowalski (using ./kowalski.py test), the test_ingester.py::TestIngester::test_ingester test fails with assert 0 == 313.
The cause of this could possibly be the following failures to connect to the Kafka cluster (see screenshot; full log attached):
Message delivery failed: KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
%3|1615087455.390|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT, 15 identical error(s) suppressed)
To Reproduce
./kowalski.py test
Expected behavior
Platform information:
Additional context
Full log of the tests
kowalski_test_log.log
Feature Summary
It would be useful if the PS1 STRM catalog could be ingested into Kowalski. The catalog is described, and can be downloaded, here. It provides photo-z and morphological classes (galaxy/star/QSO) for all point sources detected in the PS1 catalog.
Usage / behavior
The user should be able to query this table using ra, dec, and a cross-match radius.
Additional context
Currently I query this table using CasJobs, as shown here. If this table is ingested into Kowalski, the different groups on Fritz could cross-match their ZTF candidates with this catalog during filtering and add auto-annotations to the source page.
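For reference, a sketch of what such a query could look like through Kowalski's query API (the catalog name PS1_STRM and the projected field names z_phot/class are assumptions about how the ingested catalog would be laid out):

```python
def build_ps1_strm_cone_search(ra: float, dec: float, radius_arcsec: float) -> dict:
    # cone_search query in Kowalski's query format; catalog/field names assumed
    return {
        "query_type": "cone_search",
        "query": {
            "object_coordinates": {
                "cone_search_radius": radius_arcsec,
                "cone_search_unit": "arcsec",
                "radec": {"target": [ra, dec]},
            },
            "catalogs": {
                "PS1_STRM": {
                    "filter": {},
                    "projection": {"_id": 0, "z_phot": 1, "class": 1},
                }
            },
        },
    }
```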
Feature Summary
...otherwise, if the expiration date is not set in the config, the API will always generate the same token.
Alternative Solutions
Do nothing?
Implementation details
Add a "created_at" field or similar.
Feature Summary
Implement a brokering service for the Palomar Gattini IR (PGIR) alert stream.
Usage / behavior
The service should mimic the behavior of the ZTF alert brokering service.
Implementation details
config.defaults.yaml:
- add alerts_pgir: "PGIR_alerts" and alerts_pgir_aux: "PGIR_alerts_aux" to database.collections
- add database.indexes.PGIR_alerts
- add database.filters.PGIR_alerts
- xmatch section?
- add a dask_pgir section; use a different port for the dask scheduler
- add supervisor programs to supervisord.ingester: program:dask-cluster-pgir and program:alert-broker-pgir
ingester.Dockerfile:
- add kowalski/alert_broker_pgir.py and kowalski/dask_cluster_pgir.py to /app
kowalski/alert_broker_pgir.py and kowalski/dask_cluster_pgir.py:
- copy kowalski/alert_broker_ztf.py and kowalski/dask_cluster.py, keep the relevant parts of the code, and make the necessary adjustments.
tests/test_ingester_pgir.py:
- use tests/test_ingester.py as an example. The latter pulls sample alerts from a public Google Cloud bucket. You may either do the same or post several examples to a GH repo.
kowalski.py::test:
- use the section for ZTF alerts as an example.
Note that we will need to implement several things on the F side to enable filtering (create a stream, handle permissioning, PGIR alert page, etc.)
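The config.defaults.yaml additions from the first group of items might look like this (the key layout mirrors the existing ZTF entries; the port number is an arbitrary placeholder):

```yaml
database:
  collections:
    alerts_pgir: "PGIR_alerts"
    alerts_pgir_aux: "PGIR_alerts_aux"

dask_pgir:
  host: 127.0.0.1
  scheduler_port: 8788  # must differ from the ZTF dask scheduler port
```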
IPHAS DR2 has been superseded by the IGAPS catalog and the survey should be updated
Feature Summary
The IPHAS DR2 catalog has been superseded by IGAPS (http://www.star.ucl.ac.uk/IGAPS/survey.shtml). IGAPS combines IPHAS and UVEX and contains data in the U, g, r, i, z filters. The U-band data in particular is important for compact-object science.
Usage / behavior
This is simply an update of a catalog available for cross-matching.
Implementation details
This is simply an update of a catalog. If needed, some catalog columns can be omitted.
Additional context
IGAPS is a Galactic plane catalog; having it available soon would be nice since the Galactic plane is rising again in the sky.
Feature Summary
With broker mode activated, Kowalski should impose certain rate limits, e.g. not allow auto-saving more than 5,000 sources to SP in a 24-hour period.
Usage / behavior
Stop posting new sources to SP once the limit is reached.
Alternative Solutions
Limit rate on the SP side?
Implementation details
Current thinking: implement #18 and this on top of it, i.e. store the info on what has been posted to SP (objectId, candid, jd, passing filter ids, save/post state)
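With that bookkeeping in place, the limit check could be a simple count over the posted-to-SP collection (a sketch; the collection layout and the saved_at field name are assumptions):

```python
import datetime


def under_daily_save_limit(posted_to_sp, limit: int = 5000) -> bool:
    """Return True if fewer than `limit` sources were auto-saved to SP in
    the last 24 hours; `posted_to_sp` is a MongoDB-like collection that
    tracks what has been posted."""
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=24)
    n_recent = posted_to_sp.count_documents({"saved_at": {"$gte": cutoff}})
    return n_recent < limit
```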
When I'm using ./kowalski.py up to build, there is an error: failed to compute cache key: "/version.txt" not found: not found.
In fact, there is no version.txt file.
It may be useful to have code that quickly assesses the completeness of ingested ZTF source catalogs. This code in scope (https://github.com/ZwickyTransientFacility/scope/blob/dbf2c31719ec04cb7e86338d7fa6f9c0fd8a01cf/tools/generate_features_slurm.py#L51) could be adapted for this purpose. It iterates over ZTF field/ccd/quad combinations, performing find_one queries to identify whether any sources exist. Comparing the results to previous data releases is a quick way to see whether a new catalog is complete (if the new catalog has sources in fewer fields than an older one, something likely went wrong during ingestion).
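A rough sketch of the adapted check (duck-typed over a pymongo-style collection; the field/ccd/quad document keys are assumptions based on the scope code):

```python
def populated_combinations(collection, fields, ccds=range(1, 17), quads=range(1, 5)):
    """Return the set of (field, ccd, quad) combinations that contain at
    least one source, probed with cheap find_one queries."""
    found = set()
    for field in fields:
        for ccd in ccds:
            for quad in quads:
                doc = collection.find_one(
                    {"field": field, "ccd": ccd, "quad": quad}, {"_id": 1}
                )
                if doc is not None:
                    found.add((field, ccd, quad))
    return found
```

Comparing the size of this set between a new catalog and a previous data release would flag fields that went missing during ingestion.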
Adam Miller requested that we add median centered inverse hyperbolic sine scaling. However, currently LogNorm is hardcoded.
if ztftype != 'Difference':
    img[img <= 0] = np.median(img)
    plt.imshow(img, cmap=plt.cm.bone, norm=LogNorm(), origin='lower')
else:
    plt.imshow(img, cmap=plt.cm.bone, origin='lower')
With this scaling, we could potentially get rid of the if clause.
However, perhaps it would be simplest to leave these scalings to the frontend, paving the way for using JS9.
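A sketch of the requested scaling (a plain numpy implementation, not tied to any particular normalization API): arcsinh is linear near zero and logarithmic in the tails, so it tolerates the negative pixels of difference images without the special case.

```python
import numpy as np


def asinh_scale(img: np.ndarray) -> np.ndarray:
    """Median-centered inverse hyperbolic sine scaling."""
    return np.arcsinh(img - np.median(img))
```

Both branches could then call plt.imshow(asinh_scale(img), ...) uniformly.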
When ingesting a parquet file using ingest_catalog.py, this line casts all columns' dtypes to float, so the downstream checks for different dtypes do not work as intended.
Perhaps using itertuples instead would help by preserving the dtypes (at the expense of some reworking of the loop)? Note this could introduce a new issue where columns beginning with an underscore are renamed (e.g. _id, which gets mapped to _1). This can be avoided by setting name=None within itertuples.
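A small self-contained illustration of the behavior described above:

```python
import pandas as pd

df = pd.DataFrame({"_id": [1, 2], "mag": [18.2, 19.1]})

# df.values would upcast the int column to float; itertuples preserves each
# column's dtype. name=None yields plain tuples, so the leading-underscore
# column "_id" is not renamed to a positional "_1" namedtuple field.
rows = list(df.itertuples(index=False, name=None))
```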
As preparation for practicing combining ZTF with future surveys like LSST (and to facilitate long baseline period searches now), PTF would be a really nice value-added catalog for our period searches.
Kowalski currently only knows about a 201 response from ZTF's scheduler. However, 200 is another useful status code, indicating that the queue already exists. We should return that separately.
# (PUT should be idempotent)
if data['queue_name'] in request.app['scheduler'].queues:
    msg = f"Submitted queue {data['queue_name']} already exists"
    logging.info(msg)
    return web.Response(status=200, text=msg)
Feature Summary
Implement a new query type near with a footprint similar to that of cone_search, which would return the nearest object from the catalog(s).
Usage / behavior
The user will be able to specify min_distance and max_distance in radians in kwargs.
Implementation details
Use MongoDB's $nearSphere operator.
This should be discussed with the team to see what the implementation should look like.
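A sketch of the underlying MongoDB query (the coordinates.radec_geojson field name and the ra - 180 longitude convention are taken from how cone_search stores coordinates; verify against the actual schema). With legacy coordinate pairs, $minDistance/$maxDistance are interpreted in radians, matching the proposed kwargs:

```python
def build_near_query(ra: float, dec: float, min_distance: float, max_distance: float) -> dict:
    """Find the nearest object(s) to (ra, dec); distances in radians."""
    return {
        "coordinates.radec_geojson": {
            "$nearSphere": [ra - 180.0, dec],  # legacy [lon, lat] pair
            "$minDistance": min_distance,
            "$maxDistance": max_distance,
        }
    }
```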
Feature Summary
Implement a mode that would allow multiple instances of Kowalski to operate under one "umbrella".
Usage / behavior
The user would query a primary instance of K. If the latter does not find the requested catalog in its db, it would ask the secondary instances whether they have it, and query them if they do.
Alternative Solutions
Specify on the penquins side which instance to query.
Implementation details
In config.yaml, set up:
{
  primary_instance: <true|false>
  secondary_instances: <list of known secondary instances access info>
}
Feature Summary
Inspect the broker logs on a daily basis and generate and post a summary pdf report with plots to a Slack channel.
Implementation details
Run it every day or so; add a supervisor program to config.defaults.yaml in supervisord.ingester.
Use something like:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pathlib
from tqdm.auto import tqdm

p_base = "/data/logs"
start_date = datetime.datetime(2021, 2, 16)
end_date = datetime.datetime(2021, 2, 17)

log_dir = pathlib.Path(p_base)
log_files = list(log_dir.glob("dask_cluster*"))

actions = {
    "mongification": [],
    "mling": [],
    "ingesting": [],
    "xmatch": [],
    "clu_xmatch": [],
    "aux_updating": [],
    "filtering": [],
    "is_candidate": [],
    "is_source": [],
    "get_candidate": [],
    "make_photometry": [],
    "post_photometry": [],
    "post_annotation": [],
    "make_thumbnail": [],
    "post_thumbnail": [],
}

for log_file in tqdm(log_files):
    with open(log_file) as f_l:
        lines = f_l.readlines()
    for line in tqdm(lines):
        if len(line) > 5:
            try:
                tmp = line.split()
                t = datetime.datetime.strptime(tmp[0], "%Y%m%d_%H:%M:%S:")
                if start_date <= t <= end_date:
                    if ("mongification" in line) and ("took" in line):
                        actions["mongification"].append(float(tmp[-2]))
                    if ("mling" in line) and ("took" in line):
                        actions["mling"].append(float(tmp[-2]))
                    if ("Xmatch" in line) and ("took" in line):
                        actions["xmatch"].append(float(tmp[-2]))
                    if ("CLU xmatch" in line) and ("took" in line):
                        actions["clu_xmatch"].append(float(tmp[-2]))
                    if ("ingesting" in line) and ("took" in line):
                        actions["ingesting"].append(float(tmp[-2]))
                    if ("aux updating" in line) and ("took" in line):
                        actions["aux_updating"].append(float(tmp[-2]))
                    if ("Filtering" in line) and ("took" in line):
                        actions["filtering"].append(float(tmp[-2]))
                    if ("Checking if object is Candidate" in line) and ("took" in line):
                        actions["is_candidate"].append(float(tmp[-2]))
                    if ("Checking if object is Source" in line) and ("took" in line):
                        actions["is_source"].append(float(tmp[-2]))
                    if ("Getting candidate info" in line) and ("took" in line):
                        actions["get_candidate"].append(float(tmp[-2]))
                    if ("Making alert photometry" in line) and ("took" in line):
                        actions["make_photometry"].append(float(tmp[-2]))
                    if ("Posting photometry" in line) and ("took" in line):
                        actions["post_photometry"].append(float(tmp[-2]))
                    if ("Posting annotation" in line) and ("took" in line):
                        actions["post_annotation"].append(float(tmp[-2]))
                    if ("Making" in line) and ("thumbnail" in line) and ("took" in line):
                        actions["make_thumbnail"].append(float(tmp[-2]))
                    if ("Posting" in line) and ("thumbnail" in line) and ("took" in line):
                        actions["post_thumbnail"].append(float(tmp[-2]))
            except Exception:
                continue

for action, values in actions.items():
    actions[action] = np.array(values)
    print(action)
    print(
        f"median: {np.median(values):.5f}, "
        f"std: {np.std(values):.5f}, "
        f"min: {np.min(values):.5f}, "
        f"max: {np.max(values):.5f}"
    )
    print(f"total: {np.sum(values):.5f} seconds / {np.sum(values) / 60:.2f} minutes")
    fig = plt.figure(figsize=(8, 2), dpi=120)
    ax = fig.add_subplot(111)
    ax.hist(values, bins=100, range=(0, np.median(values) + 3 * np.std(values)))
    ax.grid(alpha=0.3)
    plt.show()
Feature Summary
Add a separate histogram for SkyPortal-specific actions within the performance report.
We can't build a Docker image containing this program and TensorFlow on a MacBook M1.
Feature Summary
ZTF switches to a 2-month data release schedule in Spring 2021. This means that the ZTF matchfiles that contain absolute photometry for over a billion sources in multiple filters will also be re-generated every 2 months. The matchfiles are pre-processed and ingested into Kowalski feeding many of the variable science and machine learning applications. The matchfiles take up well over 20 TB (in a binary format) and it takes quite a bit of time to pre-process and ingest that data. We need to refactor the pipeline so that it is faster and complies with the constraints that we have in terms of the available space and compute.
Usage / behavior
[to be filled in]
Implementation details
Currently, we are using a set of scripts:
https://github.com/dmitryduev/kowalski/blob/master/tools/ztf_mf2dump.py
Describe the bug
I should have paid more attention at how photometry.PUT and photometry.POST were reimplemented on SP to incorporate alert stream provenance (similar to how broadcasting in case of group is implemented).
Describe the bug
Front end source display shows "No matches found" for TNS, but when you click the TNS catalog button, a source is found.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Consistency between TNS match display and search results from catalog
Platform information:
Context
Originally reported on fritz as Issue #105
It'd be great if we could add milliquas to the crossmatches we do for alerts:
https://heasarc.gsfc.nasa.gov/W3Browse/all/milliquas.html
radius: 2 arcsec
fields to keep:
name
broad_type
rx_assoc_prob
qso_prob
VLASS is a useful point source catalog of radio sources:
https://cirada.ca/vlasscatalogueql0
Can be loaded as, for example:
import pandas as pd

df = pd.read_csv(catalog_file)
# drop duplicate components and poor-quality detections
df = df[df["Duplicate_flag"] < 2]
df = df[df["Quality_flag"] == 0]
# keep sources with sufficient total flux and flux SNR >= 5
df = df[df["Total_flux"] >= 1]
df = df[df["Total_flux"] / df["E_Total_flux"] >= 5]
The VSX catalog contains 2M variable sources. For automatic filtering and crossmatching, this catalogue is useful to have.
Feature Summary
Ingest the VSX catalogue from VizieR: http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=B%2Fvsx. The VizieR catalog is regularly updated, and we would like to keep the Kowalski version up to date too (this can be done manually).
Implementation details
Requires a Python script that takes the downloaded VizieR catalog and ingests it into Kowalski, similar to the other catalogs.
Feature Summary
Refactor the queries API to be more clean and concise.
Usage / behavior
Nothing should change from the end-user perspective.
Implementation details
Consider using odmantic.
Feature Summary
In order to increase the dev speed, we need a non-docker-based build system.
Implementation details
Use a virtual environment for the Python stuff. For devs, run the app with the reload parameter, or use gunicorn with the --reload option. Keep mongo and the java/kafka stuff containerized.
Avoid make at all costs!
Additional context
The docker stuff should not be scrapped since it'll still be used in prod.
Feature Summary
As suggested by @stefanv, add mypy to pre-commit hooks.
Implementation details
- repo: https://github.com/pre-commit/mirrors-mypy
  rev: 'v0.782'
  hooks:
    - id: mypy
I'm looking for a tutorial on what to do after adding a new stream to Kowalski. I am adding a new instrument called TURBO. I have followed the steps in the tutorial about adding WINTER to Kowalski, and I think I have things set up correctly, as I can successfully complete the test step using TURBO data. I now don't know how to have Kowalski ingest local TURBO data and then post candidates to SkyPortal.
Feature Summary
Ingestion script for ZTF light curve features produced by @mcoughlin.
Implementation details
Similar to other ingestion scripts found in tools.
Currently, in broker mode, K posts annotations to SP only the first time an alert from an object passes a filter. We need to add an option to update annotations in case a new alert from the same object passes the filter.
[As a side note: we do update the passing_alert_id and passed_at fields every time]
pd.read_fwf is apparently making a wrong guess, as reported by Przemek Mroz.
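By default, pd.read_fwf infers the column boundaries from the first rows of the file; passing explicit colspecs sidesteps the guessing entirely (the widths below are made up for illustration):

```python
import io

import pandas as pd

# two fixed-width columns: characters 0-3 and 3-8 (hypothetical layout)
data = "  1 12.3\n  2  4.5\n"
df = pd.read_fwf(
    io.StringIO(data),
    colspecs=[(0, 3), (3, 8)],
    names=["id", "mag"],
)
```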
Hi Dima,
I am using Python 3.6 and trying to install Kowalski. When I do ./kowalski.py up, I get the following error message:
Traceback (most recent call last):
  File "./kowalski.py", line 194, in <module>
    getattr(sys.modules[__name__], args.command)(args)
  File "./kowalski.py", line 32, in up
    print('Spinning up Kowalski \U0001f680')
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f680' in position 21: ordinal not in range(128)
How can I fix this problem?
Cheers,
Steve
Keep track of alerts that should have been posted to SP, but the operation failed for whatever reason. Retry posting next time the ingester pulls the alert in from the Kafka topic.
Store the failed candids in a dedicated collection.
Feature Summary
A GH action that would deploy K to a/the staging server on PR merges/commits to master.
Implementation details
Mimic Ansible: just ssh into, say, skipper, then cd ..., pull the latest master, check/fix the config if necessary, build images, and deploy.
Alternative Solutions
Build, upload, and deploy images similar to how it's done for Fritz?
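A workflow sketch along the lines of the first option (host, path, and secret names are placeholders; appleboy/ssh-action is one commonly used action for running remote ssh scripts):

```yaml
name: deploy-staging
on:
  push:
    branches: [master]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.STAGING_HOST }}
          username: ${{ secrets.STAGING_USER }}
          key: ${{ secrets.STAGING_SSH_KEY }}
          script: |
            cd kowalski
            git pull origin master
            ./kowalski.py up
```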
We need a way to handle this situation:
These are events detected close to the CCD edge. They are always centered on the longer side. We need to use the (x, y) position of the transient to pad the shorter side in such a way that it always lands in the center.
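One way to implement the padding (a numpy sketch; the function name and the assumption that (x, y) are pixel coordinates within the cutout are mine):

```python
import numpy as np


def center_on_transient(cutout: np.ndarray, x: float, y: float, size: int) -> np.ndarray:
    """Zero-pad an edge-clipped cutout so the transient at pixel (x, y)
    lands at the center of a size x size stamp."""
    stamp = np.zeros((size, size), dtype=cutout.dtype)
    # top-left corner of the cutout in stamp coordinates
    y0 = size // 2 - int(round(y))
    x0 = size // 2 - int(round(x))
    ys = slice(max(y0, 0), min(y0 + cutout.shape[0], size))
    xs = slice(max(x0, 0), min(x0 + cutout.shape[1], size))
    stamp[ys, xs] = cutout[
        ys.start - y0 : ys.stop - y0,
        xs.start - x0 : xs.stop - x0,
    ]
    return stamp
```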
A nice idea from @stefanv: make the thumbnails transparent where there is no data!
The relevant code lives here: https://github.com/dmitryduev/kowalski/blob/master/kowalski/alert_broker.py#L434.
For alert brokering: automatically save posted candidates to the filter's parent group.