fal-ai / dbt-fal Goto Github PK

do more with dbt. dbt-fal helps you run Python alongside dbt, so you can send Slack alerts, detect anomalies and build machine learning models.

Home Page: https://fal.ai/dbt-fal

License: Apache License 2.0

Python 86.35% JavaScript 1.12% CSS 0.33% Gherkin 10.96% Jupyter Notebook 1.11% Dockerfile 0.12%

dbt python pandas machine-learning machinelearning data-modeling analytics

dbt-fal's People

Contributors

Stargazers

Watchers

dbt-fal's Issues

`fal flow run` does not select scripts with certain selector

When using fal flow run with a selector, for example fal flow run -m "+final_model"
FAL does not select the scripe parent nodes that are directly after a DBT model node predecessor.

It seems to appear: https://github.com/fal-ai/fal/blob/0fe92da47472733eb7ee793b02b16b1f05f39371/src/fal/cli/selectors.py#L77

[Bug] Too many messages received before initialization

mmeasic: Hey, I get this log message on dbt version 0.21.0:

Traceback (most recent call last):
  File "/Users/mmeasic/.virtualenvs/bi-etl-dbt/lib/python3.8/site-packages/logbook/handlers.py", line 216, in handle
    self.emit(record)
  File "/Users/mmeasic/.virtualenvs/bi-etl-dbt/lib/python3.8/site-packages/dbt/logger.py", line 478, in emit
    assert len(self._msg_buffer) < self._bufmax, \
AssertionError: too many messages received before initilization!

jstrom40: did your job run after it gave you this error message? i have had this problem when i have had too many threads set up in dbt. i also had it when i tried to run the fal tool but my actual job still ran after it popped out this message

mmeasic: It did run.
I actually have 4 threads set for the target

Thread link

Pass command line variables to fal flow

Is there a way to pass command line variables to a fal flow run like we do for a dbt run?

$ dbt run --vars '{"key": "value"}'

Add `upsert` and `overwrite` behaviors for `write_to_source`

New features requested:

An ability to prevent duplication in certain table types.

mode argument for write_to_source() function to determine how the data will be written to source:
- mode='write': Write data to source with no deduplication measures
- mode='append': Append data with respect to timestamps to prevent duplication of data with the same timestamp
- mode='update': Delete the contents of the table on the source and rewrite it

Why is it needed?

The current write_to_source function just writes the data to the table via just adding it to the table. This creates an issue in certain tables; i.e time sensitive tables and tables that need to be updated. Deduplication measures can be taken to prevent the disruption to the table structures.

write_to_source needs to escape string values

some rows that contains statements like this one where created_on > '{{date_start}}' will raise and exeception

Running query
CREATE TABLE <my schema destination>.<my destination table> (
        id FLOAT(53), 
        sql TEXT, 
        tables TEXT, 
        validation BOOLEAN
)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 517, in catch_jinja
    yield
  File "/usr/local/lib/python3.9/site-packages/dbt/clients/jinja.py", line 544, in get_template
    return env.from_string(template_source, globals=ctx)
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 941, in from_string
    return cls.from_code(self, self.compile(source), globals, None)
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 638, in compile
    self.handle_exception(source=source_hint)
  File "/usr/local/lib/python3.9/site-packages/jinja2/environment.py", line 832, in handle_exception
    reraise(*rewrite_traceback_stack(source=source))
  File "/usr/local/lib/python3.9/site-packages/jinja2/_compat.py", line 28, in reraise
    raise value.with_traceback(tb)
  File "<unknown>", line 2046, in template
jinja2.exceptions.TemplateSyntaxError: expected token 'end of print statement', got 'start'
  line 2046
    and rgq.created_at >= ''{{ date start }}''::date

In this case, the sql column is for SQL statements, which of course has quotes all around.

[DevEx] Developers would like to see which scripts are about to run and why?

Log message in the beginning of fal run, which scripts are we running and why?

fal running with DBT 0.19.0

Hey folks, just a quick question, is it possible to run fal with dbt 0.19.0? if not, how long would it take to adapt fal to dbt 0.19.0?

I would like a helper method that works like `write_to_source` but instead writes to external sources

We need access to the credentials of the external source. Dbt profiles.yml might be a good place to read this information from.

Enable updates to a model from fal scripts

Initial proposal

The idea of this function is to be able to update a model table after it has been run (as an after-hook).

An example scenario would be:

-- models/tickets/tickets_with_sentiment.sql
SELECT
  *,
  -- NOTE: will be filled by fal in sentiment_analysis.py
  NULL AS label, 
  NULL AS score
FROM {{ ref('tickets') }}

Then, the after-hook:

# models/tickets/sentiment_analysis.py
ticket_data = ref(context.current_model)
ticket_descriptions = list(ticket_data.description)
classifier = pipeline("sentiment-analysis")
description_sentiment_analysis = classifier(ticket_descriptions)

rows = []
for id, sentiment in zip(ticket_data.id, description_sentiment_analysis):
    rows.append((int(id), sentiment["label"], sentiment["score"]))

records = np.array(rows, dtype=[("id", int), ("label", "U8"), ("score", float)])

sentiment_df = pd.DataFrame.from_records(records)

print("Uploading\n", sentiment_df)
write_to_model(
	dataframe=sentiment_df,
	# needed because function has no context of where it is being called from
	# we just have to document very well
	# (btw, what would happen if people used it "wrong"?)
	ref=context.current_model,
	id_column='id', # must be the same in df and table, used for knowing WHICH row to update
	columns=['label', 'score'] # defaults to ALL columns in dataframe?
)

How would the actual SQL statement look?

SQL does not match this kind of operation of inserting data on already existing rows very well. So you usually are updating data based on other database data or not doing it in big batches as we will.

The following SQL statement should work. However, more ideas may come up.

UPDATE {{ ref('tickets') }} _table
JOIN (
    SELECT 1 as id, 'positive' AS label, 0.8 AS score
    UNION ALL
    SELECT 1 as id, 'negative' AS label, 0.6 AS score
    UNION ALL
    SELECT 1 as id, 'neutral' AS label, 0.9 AS score
) _insert
ON _insert.id = _table.id
SET
	_table.label = _insert.label,
	_table.score = _insert.score;

Script context should include the dbt manifest

faldbt object already has access to the manifest; we need to pass it here in the exec method. We probably want to pass the dbt manifest directly rather than the fal wrapper as we dont want the fal wrapper to be our public api.

https://github.com/fal-ai/fal/blob/f9fad8d76af0a9bfd1e0472e9526cc6d5d6f265d/src/fal/fal_script.py#L228

Python script should be able to handle relative imports

I was trying execute a script using fal, it works fine when full code is in a single script but breaks down when I write down my script to different modules. Probably this is because fal is internally using python's exec builtins function to execute the script after reading the file. Would appreciate it very much if you guys can add this feature to fal as soon as possible. It is a great tool to work with dbt.! :D

Support Python 3.7 like dbt

  File "/Users/x/workspace/dbt-data-transformation/.venv/lib/python3.7/site-packages/faldbt/project.py", line 5, in <module>
    from typing import Dict, List, Any, Literal, Optional, Tuple, TypeVar, Sequence, Union
ImportError: cannot import name 'Literal' from 'typing' (/Users/x/.pyenv/versions/3.7.9/lib/python3.7/typing.py)

Writing big table to BigQuery fails after 10 minutes.

I have a script that makes a clustering model and when i'm trying to extract data to bigquery with the function write_to_model it fails after 10 minutes with the log:

log_message.txt

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connectionpool.py:398, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    397     else:
--> 398         conn.request(method, url, **httplib_request_kw)
    400 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
    401 # legitimately able to close the connection after sending a valid response.
    402 # With this behaviour, the received response is still readable.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connection.py:239, in HTTPConnection.request(self, method, url, body, headers)
    238     headers["User-Agent"] = _get_default_user_agent()
--> 239 super(HTTPConnection, self).request(method, url, body=body, headers=headers)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1330     body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1279     raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1079, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1077         chunk = f'{len(chunk):X}\r\n'.encode('ascii') + chunk \
   1078             + b'\r\n'
-> 1079     self.send(chunk)
   1081 if encode_chunked and self._http_vsn == 11:
   1082     # end chunked transfer

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1001, in HTTPConnection.send(self, data)
   1000 try:
-> 1001     self.sock.sendall(data)
   1002 except TypeError:

File ~\anaconda3\envs\dbt-venv\lib\ssl.py:1204, in SSLSocket.sendall(self, data, flags)
   1203 while count < amount:
-> 1204     v = self.send(byte_view[count:])
   1205     count += v

File ~\anaconda3\envs\dbt-venv\lib\ssl.py:1173, in SSLSocket.send(self, data, flags)
   1170         raise ValueError(
   1171             "non-zero flags not allowed in calls to send() on %s" %
   1172             self.__class__)
-> 1173     return self._sslobj.write(data)
   1174 else:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~\anaconda3\envs\dbt-venv\lib\site-packages\requests\adapters.py:440, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    439 if not chunked:
--> 440     resp = conn.urlopen(
    441         method=request.method,
    442         url=url,
    443         body=request.body,
    444         headers=request.headers,
    445         redirect=False,
    446         assert_same_host=False,
    447         preload_content=False,
    448         decode_content=False,
    449         retries=self.max_retries,
    450         timeout=timeout
    451     )
    453 # Send the request.
    454 else:

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    783     e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
    786     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    787 )
    788 retries.sleep()

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connectionpool.py:398, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    397     else:
--> 398         conn.request(method, url, **httplib_request_kw)
    400 # We are swallowing BrokenPipeError (errno.EPIPE) since the server is
    401 # legitimately able to close the connection after sending a valid response.
    402 # With this behaviour, the received response is still readable.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\urllib3\connection.py:239, in HTTPConnection.request(self, method, url, body, headers)
    238     headers["User-Agent"] = _get_default_user_agent()
--> 239 super(HTTPConnection, self).request(method, url, body=body, headers=headers)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1330     body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1279     raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1079, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1077         chunk = f'{len(chunk):X}\r\n'.encode('ascii') + chunk \
   1078             + b'\r\n'
-> 1079     self.send(chunk)
   1081 if encode_chunked and self._http_vsn == 11:
   1082     # end chunked transfer

File ~\anaconda3\envs\dbt-venv\lib\http\client.py:1001, in HTTPConnection.send(self, data)
   1000 try:
-> 1001     self.sock.sendall(data)
   1002 except TypeError:

File ~\anaconda3\envs\dbt-venv\lib\ssl.py:1204, in SSLSocket.sendall(self, data, flags)
   1203 while count < amount:
-> 1204     v = self.send(byte_view[count:])
   1205     count += v

File ~\anaconda3\envs\dbt-venv\lib\ssl.py:1173, in SSLSocket.send(self, data, flags)
   1170         raise ValueError(
   1171             "non-zero flags not allowed in calls to send() on %s" %
   1172             self.__class__)
-> 1173     return self._sslobj.write(data)
   1174 else:

ProtocolError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\api_core\retry.py:190, in retry_target(target, predicate, sleep_generator, deadline, on_error)
    189 try:
--> 190     return target()
    192 # pylint: disable=broad-except
    193 # This function explicitly must deal with broad exceptions.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\_http\__init__.py:482, in JSONConnection.api_request(self, method, path, query_params, data, content_type, headers, api_base_url, api_version, expect_json, _target_object, timeout, extra_api_info)
    480     content_type = "application/json"
--> 482 response = self._make_request(
    483     method=method,
    484     url=url,
    485     data=data,
    486     content_type=content_type,
    487     headers=headers,
    488     target_object=_target_object,
    489     timeout=timeout,
    490     extra_api_info=extra_api_info,
    491 )
    493 if not 200 <= response.status_code < 300:

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\_http\__init__.py:341, in JSONConnection._make_request(self, method, url, data, content_type, headers, target_object, timeout, extra_api_info)
    339 headers["User-Agent"] = self.user_agent
--> 341 return self._do_request(
    342     method, url, headers, data, target_object, timeout=timeout
    343 )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\_http\__init__.py:379, in JSONConnection._do_request(self, method, url, headers, data, target_object, timeout)
    348 """Low-level helper:  perform the actual API request over HTTP.
    349 
    350 Allows batch context managers to override and defer a request.
   (...)
    377 :returns: The HTTP response.
    378 """
--> 379 return self.http.request(
    380     url=url, method=method, headers=headers, data=data, timeout=timeout
    381 )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\auth\transport\requests.py:484, in AuthorizedSession.request(self, method, url, data, headers, max_allowed_time, timeout, **kwargs)
    483 with TimeoutGuard(remaining_time) as guard:
--> 484     response = super(AuthorizedSession, self).request(
    485         method,
    486         url,
    487         data=data,
    488         headers=request_headers,
    489         timeout=timeout,
    490         **kwargs
    491     )
    492 remaining_time = guard.remaining_timeout

File ~\anaconda3\envs\dbt-venv\lib\site-packages\requests\sessions.py:529, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    528 send_kwargs.update(settings)
--> 529 resp = self.send(prep, **send_kwargs)
    531 return resp

File ~\anaconda3\envs\dbt-venv\lib\site-packages\requests\sessions.py:645, in Session.send(self, request, **kwargs)
    644 # Send the request
--> 645 r = adapter.send(request, **kwargs)
    647 # Total elapsed time of the request (approximately)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    500 except (ProtocolError, socket.error) as err:
--> 501     raise ConnectionError(err, request=request)
    503 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

The above exception was the direct cause of the following exception:

RetryError                                Traceback (most recent call last)
File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:174, in BigQueryConnectionManager.exception_handler(self, sql)
    173 try:
--> 174     yield
    176 except google.cloud.exceptions.BadRequest as e:

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:549, in BigQueryConnectionManager._retry_and_handle(self, msg, conn, fn)
    548 with self.exception_handler(msg):
--> 549     return retry.retry_target(
    550         target=fn,
    551         predicate=_ErrorCounter(self.get_retries(conn)).count_error,
    552         sleep_generator=self._retry_generator(),
    553         deadline=None,
    554         on_error=reopen_conn_on_error)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\api_core\retry.py:190, in retry_target(target, predicate, sleep_generator, deadline, on_error)
    189 try:
--> 190     return target()
    192 # pylint: disable=broad-except
    193 # This function explicitly must deal with broad exceptions.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:378, in BigQueryConnectionManager.raw_execute.<locals>.fn()
    377 def fn():
--> 378     return self._query_and_results(client, sql, conn, job_params)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:534, in BigQueryConnectionManager._query_and_results(self, client, sql, conn, job_params, timeout)
    533 job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
--> 534 query_job = client.query(sql, job_config=job_config)
    535 iterator = query_job.result(timeout=timeout)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\bigquery\client.py:3390, in Client.query(self, query, job_config, job_id, job_id_prefix, location, project, retry, timeout, job_retry)
   3388         return query_job
-> 3390 future = do_query()
   3391 # The future might be in a failed state now, but if it's
   3392 # unrecoverable, we'll find out when we ask for it's result, at which
   3393 # point, we may retry.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\bigquery\client.py:3367, in Client.query.<locals>.do_query()
   3366 try:
-> 3367     query_job._begin(retry=retry, timeout=timeout)
   3368 except core_exceptions.Conflict as create_exc:
   3369     # The thought is if someone is providing their own job IDs and they get
   3370     # their job ID generation wrong, this could end up returning results for
   3371     # the wrong query. We thus only try to recover if job ID was not given.

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\bigquery\job\query.py:1298, in QueryJob._begin(self, client, retry, timeout)
   1297 try:
-> 1298     super(QueryJob, self)._begin(client=client, retry=retry, timeout=timeout)
   1299 except exceptions.GoogleAPICallError as exc:

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\bigquery\job\base.py:510, in _AsyncJob._begin(self, client, retry, timeout)
    509 span_attributes = {"path": path}
--> 510 api_response = client._call_api(
    511     retry,
    512     span_name="BigQuery.job.begin",
    513     span_attributes=span_attributes,
    514     job_ref=self,
    515     method="POST",
    516     path=path,
    517     data=self.to_api_repr(),
    518     timeout=timeout,
    519 )
    520 self._set_properties(api_response)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\cloud\bigquery\client.py:782, in Client._call_api(self, retry, span_name, span_attributes, job_ref, headers, **kwargs)
    779     with create_span(
    780         name=span_name, attributes=span_attributes, client=self, job_ref=job_ref
    781     ):
--> 782         return call()
    784 return call()

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\api_core\retry.py:283, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
    280 sleep_generator = exponential_sleep_generator(
    281     self._initial, self._maximum, multiplier=self._multiplier
    282 )
--> 283 return retry_target(
    284     target,
    285     self._predicate,
    286     sleep_generator,
    287     self._deadline,
    288     on_error=on_error,
    289 )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\google\api_core\retry.py:205, in retry_target(target, predicate, sleep_generator, deadline, on_error)
    204 if deadline_datetime <= now:
--> 205     raise exceptions.RetryError(
    206         "Deadline of {:.1f}s exceeded while calling target function".format(
    207             deadline
    208         ),
    209         last_exc,
    210     ) from last_exc
    211 else:

RetryError: Deadline of 600.0s exceeded while calling target function, last exception: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

During handling of the above exception, another exception occurred:

RuntimeException                          Traceback (most recent call last)
c:\Users\new user\Documents\dbt\dbt-models\fal_scripts\clustering.ipynb Cell 3' in <cell line: 1>()
----> 1 faldbt.write_to_model(df,'fct_superapp_clustering', mode='overwrite')

File ~\anaconda3\envs\dbt-venv\lib\site-packages\fal\telemetry\telemetry.py:338, in log_call.<locals>._log_call.<locals>.wrapper(*func_args, **func_kwargs)
    335 start = datetime.datetime.now()
    337 try:
--> 338     result = func(*func_args, **func_kwargs)
    339 except Exception as e:
    340     log_api(
    341         action=f"{action}_error",
    342         total_runtime=str(datetime.datetime.now() - start),
   (...)
    347         },
    348     )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\faldbt\project.py:506, in FalDbt.write_to_model(self, data, target_model_name, target_package_name, dtype, mode)
    496     lib.write_target(
    497         data,
    498         self.project_dir,
   (...)
    502         profile_target=self._profile_target,
    503     )
    505 elif mode.lower().strip() == WriteToSourceModeEnum.OVERWRITE.value:
--> 506     lib.overwrite_target(
    507         data,
    508         self.project_dir,
    509         self.profiles_dir,
    510         target_model,
    511         dtype,
    512         profile_target=self._profile_target,
    513     )
    515 else:
    516     raise Exception(f"write_to_model mode `{mode}` not supported")

File ~\anaconda3\envs\dbt-venv\lib\site-packages\faldbt\lib.py:185, in overwrite_target(data, project_dir, profiles_dir, target, dtype, profile_target)
    179     relation = _build_table_from_target(target)
    181 temporal_relation = _build_table_from_parts(
    182     relation.database, relation.schema, f"{relation.identifier}__f__"
    183 )
--> 185 results = _write_relation(
    186     data,
    187     project_dir,
    188     profiles_dir,
    189     temporal_relation,
    190     dtype,
    191     profile_target=profile_target,
    192 )
    193 try:
    194     _replace_relation(
    195         project_dir,
    196         profiles_dir,
   (...)
    199         profile_target=profile_target,
    200     )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\faldbt\lib.py:265, in _write_relation(data, project_dir, profiles_dir, relation, dtype, profile_target)
    259 _clean_cache(project_dir, profiles_dir, profile_target=profile_target)
    261 insert_stmt = Insert(alchemy_table, values=row_dicts).compile(
    262     bind=engine, compile_kwargs={"literal_binds": True}
    263 )
--> 265 _, result = _execute_sql(
    266     project_dir,
    267     profiles_dir,
    268     six.text_type(insert_stmt).strip(),
    269     profile_target=profile_target,
    270 )
    271 return result

File ~\anaconda3\envs\dbt-venv\lib\site-packages\faldbt\lib.py:88, in _execute_sql(project_dir, profiles_dir, sql, profile_target)
     86 result = None
     87 with adapter.connection_named(name):
---> 88     response, execute_result = adapter.execute(sql, auto_begin=True, fetch=True)
     90     table = ResultTable(
     91         column_names=list(execute_result.column_names),
     92         rows=[list(row) for row in execute_result],
     93     )
     95     result = RemoteRunResult(
     96         raw_sql=sql,
     97         compiled_sql=sql,
   (...)
    102         generated_at=datetime.utcnow(),
    103     )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\base\impl.py:225, in BaseAdapter.execute(self, sql, auto_begin, fetch)
    211 @available.parse(lambda *a, **k: ('', empty_table()))
    212 def execute(
    213     self, sql: str, auto_begin: bool = False, fetch: bool = False
    214 ) -> Tuple[Union[str, AdapterResponse], agate.Table]:
    215     """Execute the given SQL. This is a thin wrapper around
    216     ConnectionManager.execute.
    217 
   (...)
    223     :rtype: Tuple[Union[str, AdapterResponse], agate.Table]
    224     """
--> 225     return self.connections.execute(
    226         sql=sql,
    227         auto_begin=auto_begin,
    228         fetch=fetch
    229     )

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:389, in BigQueryConnectionManager.execute(self, sql, auto_begin, fetch)
    387 sql = self._add_query_comment(sql)
    388 # auto_begin is ignored on bigquery, and only included for consistency
--> 389 query_job, iterator = self.raw_execute(sql, fetch=fetch)
    391 if fetch:
    392     table = self.get_table_from_response(iterator)

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:380, in BigQueryConnectionManager.raw_execute(self, sql, fetch, use_legacy_sql)
    377 def fn():
    378     return self._query_and_results(client, sql, conn, job_params)
--> 380 query_job, iterator = self._retry_and_handle(msg=sql, conn=conn, fn=fn)
    382 return query_job, iterator

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:549, in BigQueryConnectionManager._retry_and_handle(self, msg, conn, fn)
    546         return
    548 with self.exception_handler(msg):
--> 549     return retry.retry_target(
    550         target=fn,
    551         predicate=_ErrorCounter(self.get_retries(conn)).count_error,
    552         sleep_generator=self._retry_generator(),
    553         deadline=None,
    554         on_error=reopen_conn_on_error)

File ~\anaconda3\envs\dbt-venv\lib\contextlib.py:137, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    135     value = typ()
    136 try:
--> 137     self.gen.throw(typ, value, traceback)
    138 except StopIteration as exc:
    139     # Suppress StopIteration *unless* it's the same exception that
    140     # was passed to throw().  This prevents a StopIteration
    141     # raised inside the "with" statement from being suppressed.
    142     return exc is not value

File ~\anaconda3\envs\dbt-venv\lib\site-packages\dbt\adapters\bigquery\connections.py:206, in BigQueryConnectionManager.exception_handler(self, sql)
    204 if BQ_QUERY_JOB_SPLIT in exc_message:
    205     exc_message = exc_message.split(BQ_QUERY_JOB_SPLIT)[0].strip()
--> 206 raise RuntimeException(exc_message)

RuntimeException: Runtime Error
  Deadline of 600.0s exceeded while calling target function, last exception: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

Hi

fal flow run only runs the dbt model but not the fal script inside meta

I was trying to run fal flow but it does not run the python script defined inside the meta tag. dbt run followed by a fal run works as expected.

> fal flow run --select netsuite_usd_exchange_rate_refined --target dev --experimental-flow
04:02:52  Found 422 models, 845 tests, 38 snapshots, 0 analyses, 441 macros, 0 operations, 0 seed files, 113 sources, 0 exposures, 0 metrics
Executing command: dbt --log-format json run --project-dir /Users/deepak.rout/code/dippr/dbt_pipelines --target dev --select netsuite_usd_exchange_rate_refined
Running with dbt=1.0.4
Found 422 models, 845 tests, 38 snapshots, 0 analyses, 441 macros, 0 operations, 0 seed files, 113 sources, 0 exposures, 0 metrics
Concurrency: 5 threads (target='dev')
1 of 1 START table model sandbox_staging.netsuite_usd_exchange_rate_refined..... [RUN]
1 of 1 OK created table model sandbox_staging.netsuite_usd_exchange_rate_refined [OK in 14.22s]
Finished running 1 table model in 38.77s.

`fal flow` should be able add python nodes in between two sql dags in `dbt`

Enable running python scripts in parallel

write_to_source to Redshift makes strings varchar(256)

As discussed here: https://discord.com/channels/908693336280432750/940966301680156683

https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html#r_Character_types-text-and-bpchar-types

You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters.
If you use the VARCHAR data type without a length specifier in a CREATE TABLE statement, the default length is 256. If used in an expression, the size of the output is determined using the input expression (up to 65535).

I don't believe you can instruct from the pandas dataframe to use VARCHAR(65535)

deleted

Upgrade to dbt=1.0.0

Upgrade to dbt=1.0.0 seems to break fal
fal tries to search for
source-paths: ["models"] in dbt_project.yml whereas this is deprecated in dbt=1
and replaced by
model-paths: ["models"]

[BUG] write_to_source fails silently when source doesn't exist or schema doesn't match

Describe the bug

When the source is missing or the schema doesn't match, write_to_source fails silently if mode is unspecified. Sometimes I will want to create tables with fal based on modeling, so this is important!

Your environment

fal version: 0.2.16
OS: Mac

How to reproduce

delete a source or modify the schema
use write_to_source function with no mode parameter

Expected behavior

I get an error telling me write to source failed. Bonus points if you can tell me that I need a mode parameter, the source is not there, or there's a mismatch on the schema.

I want to see this error even if I don't specify a mode. More importantly, I'd rather that fal create the table for me if it doesn't exist.

  002003 (42S02): SQL compilation error:
  Table 'DATABASE.SCHEMA."TABLE__f__"' does not exist or not authorized.

Actual behavior
fal script runs and completes as if it were a successful modeling run but when I go to my db, nothing is there

Additional context
n/a

[DevEx] Better error messaging when a dbt run is missing

When the developer doesn't run dbt run but have models that have fal scripts those models dont have any run results.. and we error but not very descriptive, I think if we dont find run results 99% of the time its a wrong dbt run.

[QUESTION]: Which database should write_to_source write to when testing locally?

When a dbt project has multiple profiles, which database should write_to_source write to?

Context: When implementing write to source, I ref-ed a source that was PROD_DB.PROD_SCHEMA.PROD_TABLE and I was testing on an ephemeral branch TEST_DB.EPHEMERAL_SCHEMA.EPHEMERAL_TABLE. Rather than going to either of those places, write_to_source wrote the data to TEST_DB.SAME_NAME_AS_PROD_SCHEMA.SAME_NAME_AS_PROD_TABLE.

On one hand, I'm relieved that while developing I do not contaminate my production data. On the other, having this data land in an unintuitive place made it difficult for me to debug and figure out what had happened.

Where should this data go? Is there something that can be done in the write_to_source function to make it more clear what will happen?

How to write dates and datetimes to database with write_to_source / write_to_model functions

I have a dataframe with date values and when writing it with write_to_model (same for write_to_source) it fails not knowing what to do with it. I tried some dtype values and had no success.

dataframe.dtypes

ds                                      datetime64[ns]
trend                                          float64
yhat_lower                                     float64
yhat_upper                                     float64

BigQuery error:

Error in script /Users/matteo/Projects/fal/jaffle_shop_with_fal/models/orders_forecast.py with model orders_forecast:
Traceback (most recent call last):
  File "/.../lib/python3.9/site-packages/dbt/adapters/bigquery/connections.py", line 174, in exception_handler
    yield
  File "/.../lib/python3.9/site-packages/dbt/adapters/bigquery/connections.py", line 549, in _retry_and_handle
    return retry.retry_target(
  File "/.../lib/python3.9/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/.../lib/python3.9/site-packages/dbt/adapters/bigquery/connections.py", line 378, in fn
    return self._query_and_results(client, sql, conn, job_params)
  File "/.../lib/python3.9/site-packages/dbt/adapters/bigquery/connections.py", line 535, in _query_and_results
    iterator = query_job.result(timeout=timeout)
  File "/.../lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1498, in result
    do_get_result()
  File "/.../lib/python3.9/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
    return retry_target(
  File "/.../lib/python3.9/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/.../lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1488, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/.../lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/.../lib/python3.9/site-packages/google/api_core/future/polling.py", line 137, in result
    raise self._exception
google.api_core.exceptions.BadRequest: 400 Syntax error: Expected ")" or "," but got string literal '2018-01-01T00:00:00.000000000' at [2:1965]

Location: US

TODO: add Postgres error

`fal` should be able to run pre-hook python scripts, dbt, and post-hook python scripts from a single cli command

Introducing fal flow, this command should handle all dbt cli params and run dbt and fal in the right order.
For example fal flow run should invoke fal run --before, dbt run and fal run in order with no regression in the dbt run behavior.

This should also work for all other dbt commands, such as dbt test, dbt build..

This command will be more powerful when used with model selectors, for example:

When the following command is invoked:
"""
fal flow run --select modela
"""
Then following plan is calculated: 
"""
dbt run --select modela && fal run
"""

fal flow command also works with script selectors.

Given the following dbt dag:

When the following command is invoked:
"""
fal flow run --select etl_script.py+
"""
Then following plan is calculated: 
"""
fal run --before etl_script.py &&
dbt run --select modela+  &&
fal run
"""

deleted

Adding the --target flag to `fal run`

It would be very useful to be able to use fal with a dbt profile other than the default. For instance, I would want to be able to do dbt run --target dev followed by fal run --target dev where dev is not the default profile.

The current behavior is that fal does not recognize the --target flag, so I can't do fal run --target dev and I haven't found another way to get fal to recognize models that were run with a profile that is not the default.

For example, if I try doing dbt run --target dev followed by fal run, when fal runs I get the message:
Unable to do partial parsing because config vars, config profile, or config target have changed
and then when the script tries to retrieve the current model using a command like faldbt.ref(current_model), the script crashes with the error:
Exception: Could not get relation for 'model.<project>.<model>'
even though that model was successfully run by dbt.

Is there a way to get fal to recognize models run with a non-default dbt profile that I'm not aware of? Or would that require the addition of the --target flag to fal?

Thank you!

AttributeError: 'ParsedSingularTestNode' object has no attribute 'test_metadata'

Since the latest release our fal task has started failing to run a python scripts and gives the following error:

Traceback (most recent call last):
  File "/usr/local/bin/fal", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 146, in cli
    run_fal(sys.argv)
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 129, in run_fal
    _run(
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 185, in _run
    faldbt = FalDbt(
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 218, in __init__
    self.models, self.tests = _map_nodes_to_models(self._run_results, self._manifest)
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 519, in _map_nodes_to_models
    tests = manifest.get_tests()
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 128, in get_tests
    return list(
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 132, in <lambda>
    lambda node: DbtTest(node=node), self.nativeManifest.nodes.values()
  File "<string>", line 4, in __init__
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 54, in __post_init__
    self.name = node.test_metadata.name
AttributeError: 'ParsedSingularTestNode' object has no attribute 'test_metadata'

fal should be able to report on dbt test success/failure status

Currently, I do something like dbt run then fal run, and one of the python scripts I have messages slack depending on the dbt model's success/failure status. I'd like to be able to do something similar with dbt tests

I would like to run my python scripts in dependency order

fal should know about dbt Cloud Artifacts

DRAFT: fal dbt Cloud Specs

As of today, dbt cloud doesn't allow any external cli commands to be invoked other than dbt commands. This might be due to a few reasons including dbt labs wanting to keep dbt cloud as a secure environment that can only run known executables, difficulty of maintaining Python (or other languages) environments etc.

In order to run fal alongside dbt Cloud, we currently propose an approach where the invocation of dbt Cloud Jobs and fal happens from an external CI/CD system (Github Actions, Jenkins etc.).

We recommend all dbt related commands to be configured within a dbt Cloud Job. The Schedule setting of the Trigger for these Jobs should be turned OFF, since we now have moved the scheduling responsibility to Github Actions.

The fal team will provide a Github Action to trigger these jobs from within Github Actions easily. We then execute a fal run command to run the Python scripts. An example workflow could look like:

 steps:
  - uses: actions/checkout@v2
  
  - uses: actions/setup-python@v2
     with:
       python-version: 3.7
  
  - name: Install dependencies
     run: |
       pip install -r requirements.txt
  
  - uses: fal-ai/fal-dbt-cloud@v1  # Exact name to be determined
     env:
       DBT_CLOUD_API_TOKEN: ${{ secrets.DBT_CLOUD_API_TOKEN }}   # This is to authenticate into dbt Cloud
       DBT_ACCOUNT_ID: 9876
     with:
       job-id: 1234 # This is the Job Id of the job that you want to trigger on dbt Cloud
     
     # This action outputs the run artifacts of this job
  
  - name: Run fal
    run: |
      fal run --profiles-dir . # fal reads the run artifacts from the previous step

Alternatives considered

`fal` should should support pre-hook python scripts

Why are we doing this?

Eventually we would like to enable adding python nodes to anywhere in a dbt dag, adding a pre-hook node is the best next step, and we already have users asking for it for simple EL use cases.

Similar to the fal run command today, fal users should be able to write python scripts that are configured to run before a dbt run.

Pre-hook scripts are called before scripts in fal terminology and marked as such in the configuration.

For example:

Given the following model configuration:

models:
  - name: zendesk_ticket_descriptions
    description: zendesk ticket descriptions
    config:
      materialized: table
    meta:
      owner: "@meder"
      fal:
         scripts:
	        before:
	          - fal_scripts/postgres.py
	        after: // if there are no before scripts after can be omited 
	          - fal_scripts/slack.py

When the following command is invoked:

fal run --before --select zendesk_ticket_descriptions

Then the following scripts are invoked:

fal_scripts/postgres.py

fal run --before command does not add any artifact details to the context (let us know if this would be something useful)

Make logging of insert statements for write_to_source optional or just remove it

cesarsantos#3303 in Discord asked for this probably to avoid logging sensible information or filling the logger with all the data there is.

Ability to load only incremental data with `ref()`

deleted

linear

Using fal with Dbt Cloud

How can we install and run fal in Dbt Cloud?

I would like to trigger fivetran and airbyte sync from fal

New features requested:

Ability to specify fivetran and airbyte (EL) connections in profiles.yml
Magic functions for starting and checking sync operations

Specifying EL connections in `profiles.yml`

fal_test:
  target: dev
  fal:
    el:
      - type: fivetran
        api_key: my_fivetran_key
        api_secret: my_fivetran_secret
        connectors:
          - name: fivetran_connector_1
            id: id_fivetran_connector_1
          - name: fivetran_connector_2
            id: id_fivetran_connector_2
      - type: airbyte
        host: http://localhost:8001
        connections:
          - name: airbyte_connection_1
            id: id_airbyte_connection_1
          - name: airbyte_connection_2
            id: id_airbyte_connection_2
  outputs:
    dev:
      type: postgres
      host: localhost
      user: pguser
      password: pass
      port: 5432
      dbname: test
      schema: dbt_fal
      threads: 4

Magic functions for sync operations

# Starts airbyte sync and waits until it's completed
airbyte_sync(connection_id: str)
airbyte_sync_all()

# Starts fivtran sync and waits until it's completed
fivetran_sync(connector_id: str)
fivetran_sync_all()

These functions should also be available in FalDbt class:

from fal import FalDbt
faldbt = FalDbt(profiles_dir="~/.dbt", project_dir="../my_project")

faldbt.airbyte_sync_all()
# etc

Alternatively, we could have a combined sync function:

el_sync(el_type: str, connection_id: str, all: bool)
# el_type can be 'airbyte' or 'fivetran'
# connection_id and all are optional arguments, where either one or the other has to be provided.

faldbt.el_sync(el_type, connection_id)

write_to_source throws on certain characters

We tried to fix this for fal-ai/fal#121 in fal-ai/fal#150 by simply escaping the known jinja syntax: { and }; but it comes back with other characters making it fail too:

image from @cesar-loadsmart in Discord

I think the real solution for this would be to try and disable Jinja processing altogether in write_to_source (and other functions, maybe?)

Internally, we call dbt (since it holds the credentials from the profiles) in this lib method
https://github.com/fal-ai/fal/blob/d76cf2cfdcb1d6bf6d44553e8665fa97ec792856/src/faldbt/lib.py#L103-L132

We could maybe control how adapter.execute is called to avoid processing Jinja in it.

does write_to_source work cross database?

e.g. is it able to use postgres as the source and snowflake as the destination?

Consider how to run fal in a fresh environment

Right now, we use dbt's run_results.json artifact to decide which models to run based on status.

A use case presented in the Discord channel is running in a fresh environment where there is no artifacts.

This happens by running each step in a different pod (Airflow):

PodOperator(dbt run -m model) >> \
  PodOperator(dbt test -m model) >> \
  PodOperator(fal run -m model -s script_name)

Notice the proposal has a -m that works like dbt's and -s for a specific script of the model.

Better handling of python dependencies in scripts

Run jupyter notebooks `.ipynb` just like `.py` files

Problem
fal can run python scripts but not jupyter notebooks natively
If I'm doing exploratory work in a notebook and want to automate that as part of the fal flow run, I have to refactor it as a python .py file. This adds redundant work that shouldn't be needed.

Solution
Allow .ipynb files to work

Allow users to use ref, source, and context within a Juputer notebook

Given that the user has set up fal with the meta attributes, as I user I would like to do something like

import fal

df = ref(my_model)
df.iloc[0]

Consider offering an agate version of `ref` and `source` functions

This is the way the data originally comes from the data warehouse, we transform it to pandas and we go back when doing write_to_source. Maybe it would be beneficial to offer not doing agate-pandas transformations at all for some tasks.

I would like to run dbt macros directly from a python script

Python after/before scripts should match how they are referenced in selection flags

If we write in the yml file:

models:
  - name: other_model
    meta:
      fal:
        scripts:
          after:
            - fal_scripts/complete_other_model.py

Then to select said script it should be (with any graph operator)

--select fal_scripts/complete_other_model.py

Right now it works with

--select complete_other_model.py

KeyError: 'column_name'

Latest issue when using v0.2.2 -

Logged from file <class 'networkx.utils.decorators.argmap'> compilation 5, line 5
Traceback (most recent call last):
  File "/usr/local/bin/fal", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 154, in cli
    run_fal(sys.argv)
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 136, in run_fal
    _run(
  File "/usr/local/lib/python3.9/site-packages/fal/cli.py", line 203, in _run
    faldbt = FalDbt(
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 221, in __init__
    self.models, self.tests = _map_nodes_to_models(self._run_results, self._manifest)
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 522, in _map_nodes_to_models
    tests = manifest.get_tests()
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 131, in get_tests
    return list(
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 135, in <lambda>
    lambda node: DbtTest(node=node), self.nativeManifest.nodes.values()
  File "<string>", line 4, in __init__
  File "/usr/local/lib/python3.9/site-packages/faldbt/project.py", line 61, in __post_init__
    self.column = node.test_metadata.kwargs['column_name']
KeyError: 'column_name'

fal installation instructions with dbt 1.0

Trying fal now after a recommendation through Slack. I installed dbt 1.0 by pip install dbt-core, as well as fal by pip install fal.
I ran into a few snags when installing fal in a clean python 3.9 environment, and it might be worth adding a note about dbt 1.0 to the fal readme.

fal currently doesn't install dbt. I assume this is because of the change from dbt to dbt-core.
fal relies on bigquery here, which means that one needs to run pip install dbt-bigquery. This should be explicitly stated on the readme, and preferably not be a requirement.

Pure Python models design

Introduce Python-only models to fal. This mirrors the dbt models for SQL. A Python model will be a script with a call to write_to_model that creates the table in the database. Python models can be referenced from other dbt models as well as python models in the project using the ref function.

Python models can be used interchangeably with any dbt model in the fal flow graph selectors
Python models appear in the dbt lineage graph
Python models can be documented the same way dbt models can in the schema.yml file

Until now we have only attached Python scripts to existing SQL models and have done some creative workarounds to get the data back into the data warehouse. If we want to go closer to what seems to be the dbt ideas on how a transformation layer should work, we need to be able to create a new dbt model with the data from other models instead of modifying existing ones. Doing this enables us to be confident about parallelizations because no 2 scripts could be writing to the same model (race conditions).

Previous work

Based on the post from Andy Reagan, we can see that if we generate a dbt model with

{{ config(materialized='ephemeral') }} 

/*
List your dbt dependencies here:

{{ ref('customers') }}
{{ ref('orders') }}
*/

select * {{ target.schema }}.{{ model.name }}

We can make dbt believe that the table is there and be able to reference it from other models (read more on the ephemeral config to understand how this hack works), while not really running the code for it.

Proposal

The usage would be:

realize there are some Python scripts that are to be treated as models
a. understand dependencies the Python script has to other models
write the ephemeral file for each of the models (with dependencies as a comment)
run the models
delete the ephemeral file for each of the models
a. consider how this would affect dbt docs?

Let's assume the user has the following project

my_project/
├── dbt_project.yml
├── fal_scripts
│   └── after.py
└── models
    ├── some_model.sql
    ├── my_model.py
    └── schema.yml

Where the script looks like this:

file models/my_model.py

df = ref('some_model') # how do we realize there is a dependency?

s = source('app', 'a_source')

# use the df...

write_to_model(df)

Finding Python models

Mimic dbt

Leverage the model-paths configuration to look for .py files and assume these are Python models.

The user would just write a model in one of the model paths.

CON: the user may have other .py files that should not be interpreted as models.

Specify in the schema.yml

The user would be explicit about models fal should pick up:

file models/schema.yml

version: 2

models:
  - name: some_model
  - name: my_model
    meta:
      fal:
        model: true

sources:
  - name: app
    tables:
      - name: a_source

Finding dependencies between Python models and other models

Mimic dbt

We could do some parsing of the Python file. Probably very naive parsing is good enough for now.

Specify in the schema.yml

A dependency could be specified in the schema.yml:

file models/schema.yml

version: 2

models:
  - name: some_model
  - name: my_model
    meta:
      fal:
        model: true
        deps:
          - ref('some_model')
          - source('app', 'a_source')

sources:
  - name: app
    tables:
      - name: a_source

This is about FEA-20

[Design Doc] fal-dbt feature store

What are we building?

A feature store is data system that facilitates managing the data transformations centrally for predictive analysis and ML models in production.

fal-dbt feature store is a feature store implementation that consists of a dbt package and a python library.

Why are we doing this?

Empower analytics engineer: ML models and analytics operate on the same data. Analytics engineers know this data inside out. They are the ones setting up metrics, ensuring data quality and freshness. Why shouldn’t they be the ones responsible for the predictive analysis? With the rise of open source modelling libraries most of the work that goes into an ML model is done on the data processing side.

Leverage the Warehouse: Warehouses are secure, scalable and relatively cheap environments to do data transformation. Doing transformations in other environments is at least an order of magnitude more complicated. Warehouse should be part of the ML engineer toolkit especially for batch predictions. dbt is the best tool out there to do transformations with the warehouse. dbt feature store will make ML workflows leverage all the advantages of the modern data warehouses.

Strategy

The first building block for the fal feature store is the fal-dbt cli tool. Using fal-dbt cli, dbt users are able perform various tasks via python scripts after their dbt workflows.

✅ Milestone 1: Add ability to read feature store config from dbt ymls

✅ Milestone 2: Run create_dataset from the fal dbt python client

✅ Milestone 3: Move feature to online store and provide online store client

Aready Possible with fal-dbt cli

✅ Milestone 4: Add ability to etl data from a fal script

✅ Milestone 5: Model Monitoring

Stretch Goals

⭐️ Milestone: Logged real time models

Online/Offline Predictions vs Logged Features

There are roughly 3 types of ML systems in terms of complexity; offline predictions, online predictions with batch features and online predictions with real-time features. Most of the use cases we saw were also in the same order, "online predictions with real-time features" being the least common.

A warehouse can handle all the feature calculations for offline use cases, combined with the firestore reverse etl we can also handle online predictions with batch features. This leaves out "online predictions with real-time features" which is out of scope for the initial implementation. We plan on tackling that with logged features as a stretch goal.

Implementation

Feature Definitions

Feature store configurations are added under model configurations as part of the fal meta tag. Each feature is required to have an entity_id and a timestamp field.

entity_id and timestamp fields are later used for the point in time join of a list of features and a label.

Optionally feature definitions can include fal scripts for downstream workflows. For example the dbt model below includes a make_avaliable_online.py (link to example) script. A typical etl step that moves the latest values of features from the data warehouse to an OLTP database.

## schema.yml
models:
  - name: bike_duration
    columns:
      - name: trip_count_last_week
      - name: trip_duration_last_week
      - name: user_id
      - name: start_date
    meta:
      fal:
        feature_store:
          entity_id: user_id
          timestamp: start_date
        scripts:
          - make_avaliable_online.py

A label is also a defined as a feature using the configuration above. fal-dbt feature doesn’t have any requirements or assumptions on what constitutes a label.

Create Dataset

A feature store configuration doesn’t have any effect on your infrastructure unless it is used in a dataset calculation. A dataset in fal-dbt feature store is a dataframe that includes all the features and the label for the machine learning model being built.

There are two ways to create a dataset.

Creating a dataset with dbt macro:

// dataset_name.sql
SELECT
    *
FROM
    {{ feature_store.create_dataset(
        features = ["total_transactions", "credit_score"],
        label_name = "credit_decision"
    ) }}

This model can later be referred in a fal script:

df = ref("dataset_name")

Creating a dataset with python:

from fal.ml.feature_store import FeatureStore

store = FeatureStore(creds="/../creds.json") // path to service account

ds = store.create_dataset(
	name="dataset_name",
  features=["total_transactions", "credit_score"], 
  label="credit_decision"
)

df = ds.get_pandas_dataframe()

Python Client

class FeatureStore

	def create_dataset(dataset_name: str, features: List[str], label: str)
		
	def get_dataset(dataset_name: str)

@dataclass
class OnlineClient
	client_config: ClientConfig

	def get_feature_vector(dbt_model: str, feature_name: str)

	def get_feature_vectors(feature_list: List[Tuple[str, str]])

Scheduling

Scheduling is usually an afterthought in existing feature store implementations. It is left to the users to handle using tools like Airflow. fal-dbt feature store’s close integration with dbt offloads scheduling responsibilities to the dbt scheduler.

Incremental Calculations

dbt incremental calculations make sure feature calculations are not wasteful, they can be incrementally calculated, and always fresh if scheduled properly with the dbt scheduler. In fal-dbt feature store there are no lazy feature calculations all features are assumed to be fresh.

Stretch Goals

Logged Features

We have talked about this before but we never had a clear design on how we would achieve this. This fits very well with the "do the simple thing first" tenant mentioned above. Logged features achieve real time transformations by transforming the data with the application code and then storing the transformed version in the data warehouse for training. This enables transformation logic to live in just one place (application code) and not duplicated in the warehouse and application. Not only does it live in the application code, it's also written with the web stack where applying business logic is easier with the help of an ORM or similar.

This is almost too good to be true but problems start to emerge when the transformation code starts changing over time. Once a change is made in the application code, the training data still has the shape of the older data. Model has to be retrained, but also older data needs to be back-filled - just one time- to apply the new transformation. This is not ideal but better than maintaining two code-bases.

How can we build tools to make this easier?

Make back-filling easier
Make writing application code with warehouse SQL easier

fal-ai / dbt-fal Goto Github PK

dbt-fal's People

Contributors

Stargazers

Watchers

Forkers

dbt-fal's Issues

New features requested:

Why is it needed?

Initial proposal

How would the actual SQL statement look?

DRAFT: fal dbt Cloud Specs

Alternatives considered

Why are we doing this?

For example:

New features requested:

Specifying EL connections in profiles.yml

Magic functions for sync operations

Previous work

Proposal

Finding Python models

Mimic dbt

Specify in the schema.yml

Finding dependencies between Python models and other models

Mimic dbt

Specify in the schema.yml

What are we building?

Why are we doing this?

Strategy

Online/Offline Predictions vs Logged Features

Implementation

Feature Definitions

Create Dataset

Python Client

Scheduling

Incremental Calculations

Stretch Goals

Logged Features

Recommend Projects

Recommend Topics

Recommend Org

Jobs

Specifying EL connections in `profiles.yml`