Comments (5)
Thank you for reporting and finding the root cause.
If we remove the wrapper, what is your plan on supporting pickle on HDFS?
There are a few cases where we need to replace the original file object to support certain functionality, for example reading as text from HDFS and zip, and the internal zip for Python < 3.7.
I am thinking about moving the logic in __init__ of the file object, which is currently used to determine the proper file object type for opening, into a separate function to be called by the open decorator. That way, we can stop routing the subsequent I/O calls through the file object wrapper while still being able to solve issues like text reads on HDFS and the internal zip.
from pfio.
To make it clearer, I am thinking about creating a file_object_maker as a replacement for __init__ in the current fileobject.py, to cover the cases where some functionality is not supported by the original file objects, such as text reads on HDFS and zip.
The current implementation of open_wrapper returns the FileObject defined in fileobject.py or one of its derived classes. The file object replacement takes place in the __init__ of FileObject, hence the replacement is bound to the file object.
If we extract the __init__ from FileObject and make it a function (e.g. file_object_maker), we can still control which kind of file object is returned to the user, without always returning a wrapping FileObject. That would work around the issues we have now.
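A minimal sketch of what such a function could look like, only to illustrate the idea. The signature, the fs_type strings and the UTF-8 default are assumptions for this example, not the actual pfio code:

```python
import io


def file_object_maker(raw, mode="rb", fs_type="posix"):
    """Hypothetical factory: wrap `raw` only when the backend needs it.

    `fs_type` and its values are illustrative names, not pfio's API.
    """
    # POSIX and HTTP file objects are returned untouched, so every
    # later read()/write() goes straight to the original object.
    if fs_type in ("posix", "http"):
        return raw

    # Backends that only hand out binary streams (assumed here for
    # HDFS and zip) get a text layer when text mode was requested.
    if "b" not in mode:
        return io.TextIOWrapper(raw, encoding="utf-8")

    # Binary mode on other backends: nothing to fix, no wrapper.
    return raw
```

With this shape, the decision is made once at open time and the caller gets either the original object or a standard io wrapper, never an always-present FileObject layer.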
Wrapping where we need it is fine, but FileObject in fileobject.py is unnecessary for now, especially for the "posix" filesystem and the zip container. Aligning the behaviour of other filesystems and file objects with "posix" is probably the best way to prevent potential performance issues like this one, at least until we have a profiler.
I think we need to wrap the zip container for text reading and for the internal zip.
The only two that don't need a wrapper are POSIX and HTTP.
And how about pickle on HDFS? A simple plan would be to extract the content and put it into an io.BufferedReader in the file_object_maker.
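As a rough sketch of that plan (the helper name and the stand-in reader class are made up for this example; a real HDFS file object would take the place of MinimalReader):

```python
import io
import pickle


def buffered_copy(raw):
    """Hypothetical helper: read all of `raw` and re-expose the bytes
    through a standard io.BufferedReader."""
    data = raw.read()
    return io.BufferedReader(io.BytesIO(data))


class MinimalReader:
    """Stand-in for a backend file object that only offers read()."""

    def __init__(self, data):
        self._data = data

    def read(self, size=-1):
        # Returns everything on the first call, b"" afterwards.
        out, self._data = self._data, b""
        return out


payload = pickle.dumps({"a": 1})
restored = pickle.load(buffered_copy(MinimalReader(payload)))
```

The obvious cost is that the whole file content is held in memory, so this is probably only acceptable as a simple first step.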
For POSIX filesystems, this issue is addressed by #38.