Comments (5)
There are some initial commits in https://github.com/BritishGeologicalSurvey/etlhelper/tree/handle-load-error
The test for `log_and_continue` highlights the problem here (it currently outputs the list of ids in the table written to). The failures are at a chunk level: if one row fails, all subsequent rows in the chunk are not inserted, so any continue needs to pick up the missed rows. In the case of the simple test, this means slightly different results depending on the chunk size. In addition, I have added one possible exception that might be raised to the Postgres exceptions; there might be a longer list of exceptions needed here, so there may be a superclass that could be used.
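For reference, the DB-API 2.0 hierarchy that psycopg2 implements does provide such a superclass: `IntegrityError` and `DataError` are both subclasses of `DatabaseError`, so catching that would cover both without listing each one. A minimal check:

```python
import psycopg2

# DB-API 2.0 hierarchy: Error -> DatabaseError -> {IntegrityError, DataError, ...}
# so catching DatabaseError covers both without listing each subclass
assert issubclass(psycopg2.IntegrityError, psycopg2.DatabaseError)
assert issubclass(psycopg2.DataError, psycopg2.DatabaseError)
```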
Are there any arguments that can be passed to the upstream helper functions to handle this?
If not, then `log_and_continue` would need to invoke (or flag) running `execute` over the failed batch, or re-chunking using progressively smaller sizes - i.e. recursive! Or, using the exception details, drop some rows and try again - and again, recursively. Both of these could prove expensive if, say, the whole of a large batch contains duplicate rows.
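To illustrate the recursive option, a binary-split retry could look like the sketch below; `run_many` and `on_bad_row` are hypothetical callables, not etlhelper API, and the worst case (every row bad) costs O(n log n) row executions:

```python
def retry_shrinking(run_many, rows, on_bad_row):
    # `run_many(rows)` is an assumed callable that executes a batch and
    # raises on any failure; split failed batches in half until single
    # rows isolate the culprits
    try:
        run_many(rows)
    except Exception as exc:
        if len(rows) == 1:
            on_bad_row(rows[0], exc)  # found a problem row
        else:
            mid = len(rows) // 2
            retry_shrinking(run_many, rows[:mid], on_bad_row)
            retry_shrinking(run_many, rows[mid:], on_bad_row)
```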
In either case where etlhelper handles the retrying, the function `raise_and_continue` either needs to be somewhere else or needs to do something else - return a value?
from etlhelper.
I've added a branch https://github.com/BritishGeologicalSurvey/etlhelper/tree/recursive-many which is illustrative, but it handles chunks containing duplicate rows recursively. This solves the problem but is fairly ugly. It would need to handle other exceptions caused by problem rows, so may need to use `IntegrityError` and `DataError` more generally. I have not confirmed whether the count `processed` is correct.
from etlhelper.
It's complicated. I'd been thinking about it for a while and am glad to have your opinion, too. I think that going through the data row-by-row is unavoidable in some settings.
New thoughts from today are:
I can think of four different scenarios that we need to address:
- User just wants to fail quickly on the first error in the chunk (current behaviour)
- User wants to retry row-by-row and only fail on the problem row, so that they can get full data
- User wants to retry row-by-row and log the problem rows, so that they can get a list of them
- User wants to retry row-by-row and call a custom function on problem rows e.g. so they can write them to a file or post them to some queue
If we have an ErrorHandler class, it could have an error function that gets called, e.g.:
```python
class BaseErrorHandler:
    handler_function = None

    def __init__(self, sql, chunk, exc, logger, ...):
        self.sql = sql
        self.chunk = chunk
        ...

    def handle_error_rows(self):
        # Retry the failed chunk one row at a time, passing any row that
        # fails again to the handler function
        for row in self.chunk:
            try:
                execute(self.sql, row, ...)
            except Exception as exc:
                self.handler_function(row, exc, self.logger, ...)


class LogAndContinueErrorHandler(BaseErrorHandler):
    # staticmethod stops Python binding the plain function as a method
    handler_function = staticmethod(log_and_continue)


class CustomErrorHandler(BaseErrorHandler):
    handler_function = staticmethod(my_error_function)
```
I'm not sure how much that helps, but it may reduce repetition of code between different handler functions.
If we wanted to get really recursive, we could make `chunksize` a parameter of `executemany` and then call it again with `chunksize=1` to retry the rows. In the `executemany` code we could do:
```python
try:
    helper.executemany(...)
except helper.Exceptions as exc:
    if on_error_handler is None:
        # Current behaviour - no handler specified
        raise
    else:
        handler = on_error_handler(sql, chunk, exc, logger, ...)
        handler.handle_error_rows()
```
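For illustration, a self-contained sketch of that recursive `chunksize=1` idea; `run_chunk` and `chunked` are assumed names here, not etlhelper API:

```python
from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of up to `size` rows
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def executemany(run_chunk, rows, chunksize=5000):
    # `run_chunk(list_of_rows)` is an assumed callable that executes the
    # rows in one round trip and raises on any failure
    for chunk in chunked(rows, chunksize):
        try:
            run_chunk(chunk)
        except Exception:
            if chunksize == 1:
                raise  # a single row failed: this is the problem row
            # Retry this chunk with chunksize=1 so each row runs alone
            executemany(run_chunk, chunk, chunksize=1)
```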
from etlhelper.
I had a simpler idea. Either we fail on the first error, as now, or we have a `catch_failures` option that retries the individual rows and then returns the failed ones with their exceptions at the end. Users can do what they want with them afterwards.
The benefit is simplicity; the downside is that the list of failed data could get very big.
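Usage might look something like this (`catch_failures` is the proposed option, not existing etlhelper API):

```python
# Hypothetical usage of the proposed catch_failures option: rather than
# raising, executemany retries rows individually and returns the ones
# that still failed, paired with their exceptions
failures = executemany(sql, conn, rows, catch_failures=True)
for row, exc in failures:
    print(f"Failed row {row}: {exc}")
```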
from etlhelper.
I've been thinking even more! I think that the best compromise will be:
```python
try:
    # Call executemany on the chunk
    ...
except Exception:
    # Roll back failed executemany
    conn.rollback()
    if on_error:  # If we have been given a function to call
        # Retry the chunk one row at a time, capturing a list of
        # failed rows and their errors
        bad_rows_and_errors = execute_by_row(sql, conn, chunk, helper)
        on_error(bad_rows_and_errors)
    else:
        # Raise the error from the failed executemany
        raise
```
The `execute_by_row` function will be something like:
```python
def execute_by_row(sql, conn, chunk, helper):
    bad_rows_and_errors = []
    for params in chunk:
        try:
            execute(sql, conn, params)
        except helper.exception as exc:
            conn.rollback()
            bad_rows_and_errors.append((params, exc))
    # Return the failed rows paired with their exceptions
    return bad_rows_and_errors
```
The benefits of this are:
- It's simple, with no need for handler classes
- The `on_error` function gets called once per chunk, so there is less risk of overloading memory
- The `on_error` function gets called once per chunk, so there can be feedback along the way
- `bad_rows_and_errors` is a list of tuples that will be easy for users to define functions to work with
- Users don't have to worry about handling exceptions
Example `on_error` functions could be as simple as:
```python
# Collect all the errors to deal with at the end
errors = []
executemany(sql, conn, on_error=errors.extend)
```

```python
# Log the errors to the etlhelper logger
from etlhelper import logger

def log_errors(bad_rows_and_errors):
    for row, error in bad_rows_and_errors:
        logger.error(error)

executemany(sql, conn, on_error=log_errors)
```

```python
# Write the failed ids to a file
def write_bad_ids(bad_rows_and_errors):
    with open('bad_ids.txt', 'at') as out_file:
        for row, error in bad_rows_and_errors:
            out_file.write(f"{row.id}\n")

executemany(sql, conn, on_error=write_bad_ids)
```
from etlhelper.