
Comments (5)

ximenesuk commented on September 15, 2024

There are some initial commits in https://github.com/BritishGeologicalSurvey/etlhelper/tree/handle-load-error

The test for log_and_continue highlights the problem here (it currently outputs the list of ids in the table written to). The failures happen at chunk level: if one row fails, all subsequent rows in that chunk are not inserted, so any continue needs to pick up the missed rows. In the case of the simple test, this means slightly different results depending on the chunk size. In addition, I have added one possible exception that might be raised to the Postgres exceptions; there may be a longer list of exceptions needed here, so perhaps a superclass could be used.

Are there any arguments that can be passed to the upstream helper functions to handle this?

If not, then log_and_continue would need to invoke (or flag) running execute over the failed batch, or re-chunk using progressively smaller sizes, i.e. recursively. Or, using the exception details, drop some rows and try again, and again, recursively. Both of these could prove expensive if, say, the whole of a large batch contains duplicate rows.
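
A rough, hedged sketch of the recursive re-chunking idea (insert_chunk here is a placeholder for a single executemany call over the given rows, not an etlhelper function):

def insert_with_retry(insert_chunk, rows, min_size=1):
    # Illustrative only: split a failed batch into halves until the
    # problem rows are isolated, then handle them individually.
    try:
        insert_chunk(rows)
    except Exception as exc:  # in practice, the helper's IntegrityError/DataError
        if len(rows) <= min_size:
            # A single row still fails: log or collect it and move on
            print(f"Failed row: {rows[0]!r} ({exc})")
            return
        mid = len(rows) // 2
        insert_with_retry(insert_chunk, rows[:mid], min_size)
        insert_with_retry(insert_chunk, rows[mid:], min_size)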

In either case where etlhelper handles the retrying, the raise_and_continue function either needs to live somewhere else or needs to do something different, perhaps return a value?

ximenesuk commented on September 15, 2024

I've added a branch, https://github.com/BritishGeologicalSurvey/etlhelper/tree/recursive-many, which is illustrative but handles chunks containing duplicate rows recursively. It solves the problem but is fairly ugly. It would also need to handle other exceptions caused by problem rows, so it may need to use IntegrityError and DataError more generally. I have not confirmed whether the processed count is correct.

volcan01010 commented on September 15, 2024

It's complicated. I'd been thinking about it for a while and am glad to have your opinion, too. I think that going through the data row-by-row is unavoidable in some settings.

New thoughts from today are:

I can think of four different scenarios that we need to address:

  • User just wants to fail quickly on the first error in the chunk (current behaviour)
  • User wants to retry row-by-row and only fail on the problem row, so that they can get full data
  • User wants to retry row-by-row and log the problem rows, so that they can get a list of them
  • User wants to retry row-by-row and call a custom function on problem rows e.g. so they can write them to a file or post them to some queue

If we have an ErrorHandler class, it could have an error function that gets called e.g.

class BaseErrorHandler:
    handler_function = None

    def __init__(self, sql, chunk, exc, logger, ...):
        self.sql = sql
        self.chunk = chunk
        self.logger = logger
        ...

    def handle_error_rows(self):
        # Retry each row of the failed chunk individually and pass any
        # failures to the configured handler function
        for row in self.chunk:
            try:
                execute(self.sql, row, ...)
            except Exception as exc:
                self.handler_function(row, exc, self.logger, ...)

class LogAndContinueErrorHandler(BaseErrorHandler):
    handler_function = log_and_continue

class CustomErrorHandler(BaseErrorHandler):
    handler_function = my_error_function

I'm not sure how much that helps, but it may reduce repetition of code between different handler functions.
If we wanted to get really recursive, we could make chunksize a parameter of executemany and then call it again with chunksize=1 to retry the rows.

In the executemany code we could do:

try:
    helper.executemany(...)
except helper.Exceptions as exc:
    if on_error_handler is None:
        # Current behaviour - no handler specified
        raise
    else:
        handler = on_error_handler(sql, chunk, exc, logger, ...)
        handler.handle_error_rows()

volcan01010 commented on September 15, 2024

I had a simpler idea. Either we fail on the first error, as now, or we have a catch_failures option that retries the individual rows and then returns the failed ones with their exceptions at the end. Users can do what they want with them afterwards.

The benefit is simplicity; the downside is that the list of failed data could get very big.
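
A minimal sketch of how that might look from the user's side (catch_failures is a hypothetical parameter name, not part of the current API):

# Hypothetical: with catch_failures set, executemany would retry failed
# chunks row by row and return the failed rows paired with their
# exceptions instead of raising.
failures = executemany(sql, conn, rows, catch_failures=True)

for row, exc in failures:
    print(f"Could not insert {row!r}: {exc}")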

volcan01010 commented on September 15, 2024

I've been thinking even more! I think that the best compromise will be:

try:
    # call executemany on the chunk
except:
    # Roll back failed executemany
    conn.rollback()

    if on_error:  # If we have been given a function to call
        # Retry the chunk one row at a time, capturing a list of failed rows and their errors
        bad_rows_and_errors = execute_by_row()
        on_error(bad_rows_and_errors)
    else:
        # Raise the error from the failed executemany
        raise

The execute_by_row function will be something like:

def execute_by_row(sql, conn, chunk, helper):
    bad_rows_and_errors = []

    for params in chunk:
        try:
            execute(sql, conn, params)
        except helper.exception as exc:
            conn.rollback()
            bad_rows_and_errors.append((params, exc))

    return bad_rows_and_errors

The benefits of this are:

  • It's simple, with no need for handler classes
  • on_error function gets called once per chunk, so there is less risk of overloading memory
  • on_error function gets called once per chunk, so there can be feedback along the way
  • bad_rows_and_errors is a list of tuples, which makes it easy for users to write functions to work with it
  • Users don't have to worry about handling exceptions

Example on_error functions could be as simple as:

# Collect all the errors to deal with at the end
errors = []
executemany(sql, conn, on_error=errors.extend)


# Log the errors to the etlhelper logger
from etlhelper import logger

def log_errors(bad_rows_and_errors):
    for row, error in bad_rows_and_errors:
        logger.error(error)

executemany(sql, conn, on_error=log_errors)

# Write the failed ids to a file
def write_bad_ids(bad_rows_and_errors):
    with open('bad_ids.txt', 'at') as out_file:
        for row, error in bad_rows_and_errors:
            out_file.write(f"{row.id}\n")

executemany(sql, conn, on_error=write_bad_ids)
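
Following the same pattern, the "post them to some queue" scenario from the list above is just another small function. This sketch uses the standard library queue.Queue; the on_error parameter itself is still the proposed API:

# Put failed rows on an in-process queue for another worker to deal with
from queue import Queue

failed_queue = Queue()

def queue_errors(bad_rows_and_errors):
    for row, error in bad_rows_and_errors:
        failed_queue.put((row, error))

executemany(sql, conn, on_error=queue_errors)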
