
Comments (5)

ximenesuk commented on September 15, 2024

There are some initial commits in https://github.com/BritishGeologicalSurvey/etlhelper/tree/handle-load-error

The test for log_and_continue highlights the problem here (it currently outputs the list of ids in the table written to). The failures happen at chunk level: if one row fails, all subsequent rows in that chunk are not inserted, so any continue needs to pick up the missed rows. In the case of the simple test, this means slightly different results depending on the chunk size. In addition, I have added one possible exception that might be raised to the Postgres exceptions; there may be a longer list of exceptions needed here, so perhaps a superclass could be used.

Are there any arguments that can be passed to the upstream helper functions to handle this?

If not, then log_and_continue would need to invoke (or flag) running execute over the failed batch, or re-chunk using progressively smaller sizes, i.e. recursively. Or, using the exception details, drop some rows and try again, and again, recursively. Both of these could prove expensive if, say, the whole of a large batch contains duplicate rows.
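
A rough, hedged sketch of the recursive re-chunking idea (insert_chunk here is a placeholder for a single executemany call over the given rows, not an etlhelper function):

def insert_with_retry(insert_chunk, rows, min_size=1):
    # Illustrative only: split a failed batch into halves until the
    # problem rows are isolated, then handle them individually.
    try:
        insert_chunk(rows)
    except Exception as exc:  # in practice, the helper's IntegrityError/DataError
        if len(rows) <= min_size:
            # A single row still fails: log or collect it and move on
            print(f"Failed row: {rows[0]!r} ({exc})")
            return
        mid = len(rows) // 2
        insert_with_retry(insert_chunk, rows[:mid], min_size)
        insert_with_retry(insert_chunk, rows[mid:], min_size)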

In either case where etlhelper handles the retrying, the raise_and_continue function either needs to live somewhere else or needs to do something different, perhaps return a value?

ximenesuk commented on September 15, 2024

I've added a branch, https://github.com/BritishGeologicalSurvey/etlhelper/tree/recursive-many, which is illustrative but handles chunks containing duplicate rows recursively. It solves the problem but is fairly ugly. It would also need to handle other exceptions caused by problem rows, so it may need to use IntegrityError and DataError more generally. I have not confirmed whether the processed count is correct.

volcan01010 commented on September 15, 2024

It's complicated. I'd been thinking about it for a while and am glad to have your opinion, too. I think that going through the data row-by-row is unavoidable in some settings.

New thoughts from today are:

I can think of four different scenarios that we need to address:

  • User just wants to fail quickly on the first error in the chunk (current behaviour)
  • User wants to retry row-by-row and only fail on the problem row, so that they can get full data
  • User wants to retry row-by-row and log the problem rows, so that they can get a list of them
  • User wants to retry row-by-row and call a custom function on problem rows e.g. so they can write them to a file or post them to some queue

If we have an ErrorHandler class, it could have an error function that gets called e.g.

class BaseErrorHandler:
    handler_function = None

    def __init__(self, sql, chunk, exc, logger, ...):
        self.sql = sql
        self.chunk = chunk
        self.logger = logger
        ...

    def handle_error_rows(self):
        # Retry each row of the failed chunk individually and pass any
        # failures to the configured handler function
        for row in self.chunk:
            try:
                execute(self.sql, row, ...)
            except Exception as exc:
                self.handler_function(row, exc, self.logger, ...)

class LogAndContinueErrorHandler(BaseErrorHandler):
    handler_function = log_and_continue

class CustomErrorHandler(BaseErrorHandler):
    handler_function = my_error_function

I'm not sure how much that helps, but it may reduce repetition of code between different handler functions.
If we wanted to get really recursive, we could make chunksize a parameter of executemany and then call it again with chunksize=1 to retry the rows.

In the executemany code we could do:

try:
    helper.executemany(...)
except helper.Exceptions as exc:
    if on_error_handler is None:
        # Current behaviour - no handler specified
        raise
    else:
        handler = on_error_handler(sql, chunk, exc, logger, ...)
        handler.handle_error_rows()

volcan01010 commented on September 15, 2024

I had a simpler idea. Either we fail on the first error, as now, or we have a catch_failures option that retries the individual rows and then returns the failed ones with their exceptions at the end. Users can do what they want with them afterwards.

The benefit is simplicity; the downside is that the list of failed data could get very big.
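
A minimal sketch of how that might look from the user's side (catch_failures is a hypothetical parameter name, not part of the current API):

# Hypothetical: with catch_failures set, executemany would retry failed
# chunks row by row and return the failed rows paired with their
# exceptions instead of raising.
failures = executemany(sql, conn, rows, catch_failures=True)

for row, exc in failures:
    print(f"Could not insert {row!r}: {exc}")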

volcan01010 commented on September 15, 2024

I've been thinking even more! I think that the best compromise will be:

try:
    # call executemany on the chunk
except:
    # Roll back failed executemany
    conn.rollback()

    if on_error:  # If we have been given a function to call
        # Retry the chunk one row at a time, capturing a list of failed rows and their errors
        bad_rows_and_errors = execute_by_row()
        on_error(bad_rows_and_errors)
    else:
        # Raise the error from the failed executemany
        raise

The execute_by_row function will be something like:

def execute_by_row(sql, conn, chunk, helper):
    bad_rows_and_errors = []

    for params in chunk:
        try:
            execute(sql, conn, params)
        except helper.exception as exc:
            conn.rollback()
            bad_rows_and_errors.append((params, exc))

    return bad_rows_and_errors

The benefits of this are:

  • It's simple, with no need for handler classes
  • on_error function gets called once per chunk, so there is less risk of overloading memory
  • on_error function gets called once per chunk, so there can be feedback along the way
  • bad_rows_and_errors is a list of tuples, which makes it easy for users to write functions to work with it
  • Users don't have to worry about handling exceptions

Example on_error functions could be as simple as:

# Collect all the errors to deal with at the end
errors = []
executemany(sql, conn, on_error=errors.extend)


# Log the errors to the etlhelper logger
from etlhelper import logger

def log_errors(bad_rows_and_errors):
    for row, error in bad_rows_and_errors:
        logger.error(error)

executemany(sql, conn, on_error=log_errors)

# Write the failed ids to a file
def write_bad_ids(bad_rows_and_errors):
    with open('bad_ids.txt', 'at') as out_file:
        for row, error in bad_rows_and_errors:
            out_file.write(f"{row.id}\n")

executemany(sql, conn, on_error=write_bad_ids)
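
Following the same pattern, the "post them to some queue" scenario from the list above is just another small function. This sketch uses the standard library queue.Queue; the on_error parameter itself is still the proposed API:

# Put failed rows on an in-process queue for another worker to deal with
from queue import Queue

failed_queue = Queue()

def queue_errors(bad_rows_and_errors):
    for row, error in bad_rows_and_errors:
        failed_queue.put((row, error))

executemany(sql, conn, on_error=queue_errors)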
