datapao / dac

Databricks Admin Center

License: Apache License 2.0

Languages: Python 65.68%, HTML 31.88%, JavaScript 1.84%, Shell 0.32%, Dockerfile 0.27%
Topics: cost-control, cost-optimization, dashboard, databricks, monitoring, python, spark

dac's People

Contributors: fulibacsi, gulyasm, rgabo

dac's Issues

Scraper fails for job runs without cluster instance

A KeyError is raised when accessing `job_run_dict["cluster_instance"]`. According to the Databricks Jobs API docs, this field is not always present:

The cluster used for this run. If the run is specified to use a new cluster, this field will be set once the Jobs service has requested a cluster for the run.
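
A defensive lookup avoids the crash; a minimal sketch, assuming `job_run_dict` as above (`cluster_id` is our name for the stored value):

    # Sketch: tolerate runs whose cluster has not been requested yet;
    # cluster_instance only appears once the Jobs service assigns a cluster.
    cluster_instance = job_run_dict.get("cluster_instance")
    cluster_id = cluster_instance.get("cluster_id") if cluster_instance else None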

Scraper still fails on `result_state` for pending jobs

  File "dac/scraping/scraper.py", line 151, in scrape_job_run
    state_result_state=state["result_state"] if not failed_run else 'FAIL',
KeyError: 'result_state'

See the docs on the availability of `result_state`, which depends on the `life_cycle_state`.
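
A minimal sketch of a fallback, assuming `state` and `failed_run` as in the traceback; the `'PENDING'` placeholder is our assumption, not an API value:

    # Sketch: result_state only exists once the run reaches a terminal
    # life_cycle_state, so fall back to a placeholder for in-flight runs.
    state_result_state = 'FAIL' if failed_run else state.get("result_state", 'PENDING')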

Scraper fails on event type `INIT_SCRIPTS_STARTED`

The dac-scraper process dies with the following:

ValueError: Unkown event: { ..., 'type': 'INIT_SCRIPTS_STARTED', ... }

Recognized events are: ['INIT', 'CREATING', 'DID_NOT_EXPAND_DISK', 'EXPANDED_DISK', 'FAILED_TO_EXPAND_DISK', 'INIT_SCRIPTS_STARTING', 'INIT_SCRIPTS_FINISHED', 'STARTING', 'RESTARTING', 'TERMINATING', 'EDITED', 'RUNNING', 'RESIZING', 'UPSIZE_COMPLETED', 'NODES_LOST', 'DRIVER_HEALTHY', 'DRIVER_UNAVAILABLE', 'SPARK_EXCEPTION', 'DRIVER_NOT_RESPONDING', 'DBFS_DOWN', 'METASTORE_DOWN', 'AUTOSCALING_STATS_REPORT', 'NODE_BLACKLISTED', 'PINNED', 'UNPINNED']
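
Instead of raising, the scraper could warn and skip anything unrecognized; a minimal sketch (the `handle_event` helper is hypothetical, not dac's API):

    import logging

    logger = logging.getLogger(__name__)

    def handle_event(event, recognized_events):
        """Hypothetical helper: skip unknown event types instead of dying."""
        event_type = event.get('type')
        if event_type not in recognized_events:
            logger.warning("Skipping unrecognized event type: %s", event_type)
            return None
        return event_type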

Got `sqlite3.IntegrityError` during scraping

Stack trace:

Exception in thread scraping-loop-Thread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlite3.IntegrityError: NOT NULL constraint failed: cluster_types.type

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/app/scraping/scraper.py", line 440, in scraping_loop
    result = scrape(json_path, session)
  File "/app/scraping/scraper.py", line 462, in scrape
    instance_types = upsert_instance_types(session)
  File "/app/scraping/scraper.py", line 138, in upsert_instance_types
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 494, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 473, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2470, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2608, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2568, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1084, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1253, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1473, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 152, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) NOT NULL constraint failed: cluster_types.type
[SQL: INSERT INTO cluster_types (scrape_time, type, cpu, mem, dbu_light, dbu_job, dbu_analysis) VALUES (?, ?, ?, ?, ?, ?, ?)]
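
One way to keep a single bad row from aborting the whole commit is to filter before inserting; a minimal sketch, where `instance_type_rows`, `ClusterType`, and `logger` are our assumptions:

    # Sketch: drop instance types with a missing name so one malformed
    # entry cannot violate the NOT NULL constraint on cluster_types.type.
    valid_rows = [row for row in instance_type_rows if row.get("type")]
    skipped = len(instance_type_rows) - len(valid_rows)
    if skipped:
        logger.warning("Skipped %d instance types with no name", skipped)
    session.bulk_insert_mappings(ClusterType, valid_rows)
    session.commit()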

Scraper fails for job runs in `INTERNAL_ERROR` state

We have a job run where `result_state` is not present:

'state': {'life_cycle_state': 'INTERNAL_ERROR', 'state_message': 'Notebook not found: ***REDACTED***'}
  File "dac/scraping/scraper.py", line 145, in scrape_job_run
    state_result_state=state["result_state"],
KeyError: 'result_state'
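
One possible mapping, sketched under the assumption that the scraper wants to record a result for such runs (`'FAILED'` as the stand-in is our choice):

    # Sketch: INTERNAL_ERROR runs may never produce a result_state;
    # derive one from the life_cycle_state instead of indexing blindly.
    result_state = state.get("result_state")
    if result_state is None and state.get("life_cycle_state") == "INTERNAL_ERROR":
        result_state = "FAILED"  # assumed stand-in for runs that never finished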

Scraper fails for autoscale clusters

Autoscaling clusters have no `num_workers` key in the cluster dictionary; instead they carry an `autoscale` key, e.g. `'autoscale': {'min_workers': 1, 'max_workers': 2}`.

  File "dac/scraping/scraper.py", line 72, in scrape_cluster
    num_workers=cluster_dict["num_workers"],
KeyError: 'num_workers'
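
A minimal sketch of a branch on cluster shape; whether to record `min_workers`, `max_workers`, or both is a policy choice:

    # Sketch: fixed-size clusters carry num_workers; autoscaling clusters
    # carry an autoscale dict with min_workers/max_workers instead.
    if "num_workers" in cluster_dict:
        num_workers = cluster_dict["num_workers"]
    else:
        autoscale = cluster_dict.get("autoscale", {})
        num_workers = autoscale.get("max_workers")  # upper bound; a policy choice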

Scraper fails for jobs created on existing clusters

In that case, the `settings` object contains an `existing_cluster_id` key but no `new_cluster` key, which is permitted according to the Databricks Jobs API docs.

  File "dac/scraping/scraper.py", line 174, in scrape_jobs
    new_cluster=job_dict["settings"]["new_cluster"],
KeyError: 'new_cluster'
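
A minimal sketch that accepts both job shapes (names beyond `job_dict` are ours):

    # Sketch: a job references either a new_cluster spec or an existing
    # cluster id; treat both as optional and store whichever is present.
    settings = job_dict["settings"]
    new_cluster = settings.get("new_cluster")                  # None for existing clusters
    existing_cluster_id = settings.get("existing_cluster_id")  # None for new clusters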

Unique key violation in scraper

Recently the scraper stopped working for us, with the following problem:

Exception in thread scraping-loop-Thread:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1229, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 577, in do_executemany
    cursor.executemany(statement, parameters)
sqlite3.IntegrityError: UNIQUE constraint failed: cluster_states.user_id, cluster_states.cluster_id, cluster_states.timestamp, cluster_states.state
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/app/scraping/scraper.py", line 313, in scraping_loop
    result = scrape(json_path)
  File "/app/scraping/scraper.py", line 342, in scrape
    session.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1027, in commit
    self.transaction.commit()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 494, in commit
    self._prepare_impl()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 473, in _prepare_impl
    self.session.flush()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2470, in flush
    self._flush(objects)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2608, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2568, in _flush
    flush_context.execute()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute

It's very strange, but it seems that the API may return the same cluster event more than once?
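
If duplicates from the API are the cause, de-duplicating on the unique key before the commit would sidestep this; a minimal sketch, where `cluster_states` stands for the ORM objects being flushed:

    # Sketch: drop rows that repeat the (user_id, cluster_id, timestamp,
    # state) unique key, in case the events API returns an event twice.
    seen = set()
    deduped = []
    for cs in cluster_states:
        key = (cs.user_id, cs.cluster_id, cs.timestamp, cs.state)
        if key not in seen:
            seen.add(key)
            deduped.append(cs)
    session.add_all(deduped)
    session.commit()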

Scraper fails for jobs created by deleted users

According to the Databricks Jobs API docs, `creator_user_name` is omitted from the response if the user has already been deleted. This causes a KeyError when the scraper accesses `job_dict["creator_user_name"]`.
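
A one-line guard covers it; a minimal sketch:

    # Sketch: creator_user_name is omitted for deleted users, so read it
    # defensively and store None (NULL) when it is gone.
    creator_user_name = job_dict.get("creator_user_name")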

Possible memory leak in scraper

We have been running DAC for a while now, and after around a week the operating system killed the scraper for exceeding its memory limit:

scripts/dac.sh: line 23:     6 Killed                  python main.py scrape
[Wed Apr  8 02:51:32 2020] Memory cgroup out of memory: Killed process 7871 (python) total-vm:1209768kB, anon-rss:334320kB, file-rss:33360kB, shmem-rss:0kB

Could there be a memory leak somewhere? (We didn't really give it too much memory, I know...)
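
To narrow this down, allocation growth across scrape iterations could be compared with the stdlib `tracemalloc` module; a minimal diagnostic sketch, not part of dac:

    import tracemalloc

    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()

    # ... run a few scraping iterations here ...

    snapshot = tracemalloc.take_snapshot()
    # Print the ten source lines whose allocations grew the most.
    for stat in snapshot.compare_to(baseline, "lineno")[:10]:
        print(stat)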
