datapao / dac

Databricks Admin Center

License: Apache License 2.0

Languages: Python 65.68%, HTML 31.88%, JavaScript 1.84%, Shell 0.32%, Dockerfile 0.27%
Topics: cost-control, cost-optimization, dashboard, databricks, monitoring, python, spark

dac's People

Contributors: fulibacsi, gulyasm, rgabo

dac's Issues

Scraper fails for job runs without cluster instance

A KeyError is raised when accessing `job_run_dict["cluster_instance"]`. According to the Databricks Jobs API docs, this field is not always present:

The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
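
A defensive lookup avoids the crash; a minimal sketch, assuming `job_run_dict` as above (`cluster_id` is our name for the stored value):

    # Sketch: tolerate runs whose cluster has not been requested yet;
    # cluster_instance only appears once the Jobs service assigns a cluster.
    cluster_instance = job_run_dict.get("cluster_instance")
    cluster_id = cluster_instance.get("cluster_id") if cluster_instance else None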

Scraper still fails on `result_state` for pending jobs

  File "dac/scraping/scraper.py", line 151, in scrape_job_run
    state_result_state=state["result_state"] if not failed_run else 'FAIL',
KeyError: 'result_state'

See the docs on the availability of `result_state`, which depends on the `life_cycle_state`.
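
A minimal sketch of a fallback, assuming `state` and `failed_run` as in the traceback; the `'PENDING'` placeholder is our assumption, not an API value:

    # Sketch: result_state only exists once the run reaches a terminal
    # life_cycle_state, so fall back to a placeholder for in-flight runs.
    state_result_state = 'FAIL' if failed_run else state.get("result_state", 'PENDING')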

Scraper fails on event type `INIT_SCRIPTS_STARTED`

The dac-scraper process dies with the following:

ValueError: Unkown event: { ..., 'type': 'INIT_SCRIPTS_STARTED', ... }

Recognized events are: ['INIT', 'CREATING', 'DID_NOT_EXPAND_DISK', 'EXPANDED_DISK', 'FAILED_TO_EXPAND_DISK', 'INIT_SCRIPTS_STARTING', 'INIT_SCRIPTS_FINISHED', 'STARTING', 'RESTARTING', 'TERMINATING', 'EDITED', 'RUNNING', 'RESIZING', 'UPSIZE_COMPLETED', 'NODES_LOST', 'DRIVER_HEALTHY', 'DRIVER_UNAVAILABLE', 'SPARK_EXCEPTION', 'DRIVER_NOT_RESPONDING', 'DBFS_DOWN', 'METASTORE_DOWN', 'AUTOSCALING_STATS_REPORT', 'NODE_BLACKLISTED', 'PINNED', 'UNPINNED']
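
Instead of raising, the scraper could warn and skip anything unrecognized; a minimal sketch (the `handle_event` helper is hypothetical, not dac's API):

    import logging

    logger = logging.getLogger(__name__)

    def handle_event(event, recognized_events):
        """Hypothetical helper: skip unknown event types instead of dying."""
        event_type = event.get('type')
        if event_type not in recognized_events:
            logger.warning("Skipping unrecognized event type: %s", event_type)
            return None
        return event_type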

Got `sqlite3.IntegrityError` during scraping

Stack trace:

Exception in thread scraping-loop-Thread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlite3.IntegrityError: NOT NULL constraint failed: cluster_types.type

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/app/scraping/scraper.py", line 440, in scraping_loop
    result = scrape(json_path, session)
  File "/app/scraping/scraper.py", line 462, in scrape
    instance_types = upsert_instance_types(session)
  File "/app/scraping/scraper.py", line 138, in upsert_instance_types
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 494, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 473, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2470, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2608, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2568, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1084, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: cluster_types.type
[SQL: INSERT INTO cluster_types (scrape_time, type, cpu, mem, dbu_light, dbu_job, dbu_analysis) VALUES (?, ?, ?, ?, ?, ?, ?)]
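
One way to keep a single bad row from aborting the whole commit is to filter before inserting; a minimal sketch, where `instance_type_rows`, `ClusterType`, and `logger` are our assumptions:

    # Sketch: drop instance types with a missing name so one malformed
    # entry cannot violate the NOT NULL constraint on cluster_types.type.
    valid_rows = [row for row in instance_type_rows if row.get("type")]
    skipped = len(instance_type_rows) - len(valid_rows)
    if skipped:
        logger.warning("Skipped %d instance types with no name", skipped)
    session.bulk_insert_mappings(ClusterType, valid_rows)
    session.commit()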

Scraper fails for job runs in `INTERNAL_ERROR` state

We have a job run where `result_state` is not present:

'state': {'life_cycle_state': 'INTERNAL_ERROR', 'state_message': 'Notebook not found: ***REDACTED***'}
  File "dac/scraping/scraper.py", line 145, in scrape_job_run
    state_result_state=state["result_state"],
KeyError: 'result_state'
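
One possible mapping, sketched under the assumption that the scraper wants to record a result for such runs (`'FAILED'` as the stand-in is our choice):

    # Sketch: INTERNAL_ERROR runs may never produce a result_state;
    # derive one from the life_cycle_state instead of indexing blindly.
    result_state = state.get("result_state")
    if result_state is None and state.get("life_cycle_state") == "INTERNAL_ERROR":
        result_state = "FAILED"  # assumed stand-in for runs that never finished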

Scraper fails for autoscale clusters

Autoscaling clusters have no `num_workers` key in the cluster dictionary; instead they carry an `autoscale` key, e.g. `'autoscale': {'min_workers': 1, 'max_workers': 2}`.

  File "dac/scraping/scraper.py", line 72, in scrape_cluster
    num_workers=cluster_dict["num_workers"],
KeyError: 'num_workers'
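
A minimal sketch of a branch on cluster shape; whether to record `min_workers`, `max_workers`, or both is a policy choice:

    # Sketch: fixed-size clusters carry num_workers; autoscaling clusters
    # carry an autoscale dict with min_workers/max_workers instead.
    if "num_workers" in cluster_dict:
        num_workers = cluster_dict["num_workers"]
    else:
        autoscale = cluster_dict.get("autoscale", {})
        num_workers = autoscale.get("max_workers")  # upper bound; a policy choice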

Scraper fails for jobs created on existing clusters

In that case, the `settings` object contains an `existing_cluster_id` key but no `new_cluster` key, which is permitted according to the Databricks Jobs API docs.

  File "dac/scraping/scraper.py", line 174, in scrape_jobs
    new_cluster=job_dict["settings"]["new_cluster"],
KeyError: 'new_cluster'
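
A minimal sketch that accepts both job shapes (names beyond `job_dict` are ours):

    # Sketch: a job references either a new_cluster spec or an existing
    # cluster id; treat both as optional and store whichever is present.
    settings = job_dict["settings"]
    new_cluster = settings.get("new_cluster")                  # None for existing clusters
    existing_cluster_id = settings.get("existing_cluster_id")  # None for new clusters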

Unique key violation in scraper

Recently the scraper stopped working for us, with the following problem:

Exception in thread scraping-loop-Thread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: cluster_states.user_id, cluster_states.cluster_id, cluster_states.timestamp, cluster_states.state
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/app/scraping/scraper.py", line 313, in scraping_loop
    result = scrape(json_path)
  File "/app/scraping/scraper.py", line 342, in scrape
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 494, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 473, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2470, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2608, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2568, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute

It's very strange, but it seems that the API may return the same cluster event more than once?
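
If duplicates from the API are the cause, de-duplicating on the unique key before the commit would sidestep this; a minimal sketch, where `cluster_states` stands for the ORM objects being flushed:

    # Sketch: drop rows that repeat the (user_id, cluster_id, timestamp,
    # state) unique key, in case the events API returns an event twice.
    seen = set()
    deduped = []
    for cs in cluster_states:
        key = (cs.user_id, cs.cluster_id, cs.timestamp, cs.state)
        if key not in seen:
            seen.add(key)
            deduped.append(cs)
    session.add_all(deduped)
    session.commit()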

Scraper fails for jobs created by deleted users

According to the Databricks Jobs API docs, `creator_user_name` is omitted from the response if the user has already been deleted. This causes a KeyError when the scraper accesses `job_dict["creator_user_name"]`.
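
A one-line guard covers it; a minimal sketch:

    # Sketch: creator_user_name is omitted for deleted users, so read it
    # defensively and store None (NULL) when it is gone.
    creator_user_name = job_dict.get("creator_user_name")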

Possible memory leak in scraper

We have been running DAC for a while now, and after around a week the operating system killed the scraper for exceeding its memory limit:

scripts/dac.sh: line 23:     6 Killed                  python main.py scrape
[Wed Apr  8 02:51:32 2020] Memory cgroup out of memory: Killed process 7871 (python) total-vm:1209768kB, anon-rss:334320kB, file-rss:33360kB, shmem-rss:0kB

Could there be a memory leak somewhere? (We didn't really give it too much memory, I know...)
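
To narrow this down, allocation growth across scrape iterations could be compared with the stdlib `tracemalloc` module; a minimal diagnostic sketch, not part of dac:

    import tracemalloc

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    # ... run a few scraping iterations here ...

    snapshot = tracemalloc.take_snapshot()
    # Print the ten source lines whose allocations grew the most.
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)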
