pfnet / pfio
IO library to access various filesystems with unified API
Home Page: https://pfio.readthedocs.io/
License: MIT License
Problem statement
Getting default Principal name fails when "KRB5_KTNAME" is given
How to reproduce
>>> import os
>>> import chainerio
>>> os.getenv("KRB5_KTNAME")
'/etc/keytab/user.keytab'
>>> chainerio.exists("hdfs:///user/tianqi")
kinit: Keytab contains no suitable keys for user_xxx@NAMESERVICE while getting initial credentials
Cause
https://github.com/chainer/chainerio/blob/9ef197461fd715f1fe90b10bf0090b9a0c1849c2/chainerio/filesystems/hdfs.py#L52
`klist -c` reads the credential cache, not the keytab.
We should also create an interface for users to specify their username instead of always using the default.
krbticket was introduced to update the Kerberos ticket periodically, but I noticed the HDFS Java client also has similar functionality, so krbticket may no longer be needed.
Note that the HDFS client can use other environment variables to specify the keytab, ccache path, and principal. This is available in Hadoop 3.0 releases or later.
Currently, we have two issues regarding the trailing slash of directories:
The first issue: for a recursive list of a zip, we simply yield the contents of `zipfile.namelist()`, so all the directories carry a trailing slash. For other filesystems and the non-recursive zip listing, however, no trailing slash is added, which creates inconsistent behavior.
>>> with chainerio.open_as_container("outside.zip") as container:
... print(list(container.list()))
... print(list(container.list(recursive=True)))
...
['testdir']
['testdir/', 'testdir/testfile', 'testdir/nested.zip']
>>> list(chainerio.list("."))
['bench.py', 'tests', 'outside.zip', '.gitignore', '.python-version', 'docs', 'test.py', 'chainerio.egg-info', 'examples', '.mypy_cache', 'setup.cfg', 'setup.py', 'dist', '.pytest_cache', 'tox.ini', 'nested.py', '.git', 'LICENSE', 'README.md', '.tox', 'test', 'test.sh', '.pfnci', 'chainerio']
>>> list(chainerio.list("tests", recursive=True))
['chainer_extensions_test', 'chainer_extensions_test/test_snapshot.py', 'chainer_extensions_test/__pycache__', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-37-PYTEST.pyc', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-36-pytest-5.0.1.pyc', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-37-pytest-5.0.1.pyc', '.python-version', 'cache_tests', 'cache_tests/test_cache.py', 'cache_tests/test_mt.py', 'cache_tests/__pycache__', 'cache_tests/__pycache__/test_file_cache.cpython-36-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_mt.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_cache.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_file_cache.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_mt.cpython-36-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_file_cache.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_mt.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_cache.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_cache.cpython-36-pytest-5.0.1.pyc', 'cache_tests/test_file_cache.py', 'profiler_tests', 'profiler_tests/__pycache__', 'profiler_tests/__pycache__/test_naive_profile_writer.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_naive_profiler.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_chrome_profile_writer.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_profiler.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_chrome_profiler.cpython-37-PYTEST.pyc', 'test_context.py', '__pycache__', '__pycache__/test_context.cpython-36-pytest-5.0.1.pyc', '__pycache__/test_context.cpython-37-PYTEST.pyc', 'filesystem_tests', 'filesystem_tests/test_posix_handler.py', 'filesystem_tests/__pycache__', 'filesystem_tests/__pycache__/test_http_handler.cpython-37-PYTEST.pyc', 'filesystem_tests/__pycache__/test_http_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_hdfs_handler.cpython-37-PYTEST.pyc', 
'filesystem_tests/__pycache__/test_hdfs_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-37-PYTEST.pyc', 'filesystem_tests/__pycache__/test_http_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_hdfs_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/test_hdfs_handler.py', 'container_tests', 'container_tests/__pycache__', 'container_tests/__pycache__/test_zip_container.cpython-36-pytest-5.0.1.pyc', 'container_tests/__pycache__/test_zip_container.cpython-37-pytest-5.0.1.pyc', 'container_tests/__pycache__/test_zip_container.cpython-37-PYTEST.pyc', 'container_tests/test_zip_container.py']
The second issue: we merged the functionality of `os.walk()` into `list()`, but unlike `os.walk()`, we do not differentiate files and directories in the return value. With no trailing slash, users need to call `is_file` to distinguish files from directories, which introduces unnecessary filesystem accesses.
There are two solutions: consistently add trailing slashes to directories, or provide an `os.walk()`-style API that separates files from directories.
There are some out-of-date tests which should be removed:
https://github.com/chainer/chainerio/blob/9ef197461fd715f1fe90b10bf0090b9a0c1849c2/tests/filesystem_tests/test_hdfs_handler.py#L56
The "keytab_path" parameter was removed in #91.
The libhdfs Hadoop client takes the user name from various sources, with the Kerberos ticket taking priority. It should not be taken only from the operating system as is done here: https://github.com/chainer/chainerio/blob/master/chainerio/filesystems/hdfs.py#L24
Also a nit: uid is not used and should be removed. https://github.com/chainer/chainerio/blob/master/chainerio/filesystems/hdfs.py#L25
Background
A nested zip is a zip inside a zip. For example, we can have `file` in `nested.zip` in `outside.zip` as follows:
outside.zip
| - nested.zip
| - | - file
The following code is an example of how ChainerIO actually accesses such a nested zip file:
import zipfile

with zipfile.ZipFile("outside.zip") as outside_zip:
    with outside_zip.open("nested.zip") as inside_f:
        with zipfile.ZipFile(inside_f) as inside_zip:
            with inside_zip.open("file") as f:
                f.read()
Both a file path and a file object can be passed to create a `zipfile.ZipFile`.
Upon creating the `zipfile.ZipFile`, the zipfile module reads the header of the file to check whether the given file is a zip file. During this check, a "seek" is performed on the given file/file object, which requires it to be seekable.
Problem
In Python < 3.7, the `ZipExtFile` returned by `ZipFile.open` (i.e. `inside_f` in the code above) is never seekable, which leads to a failure when creating the `ZipFile` of the nested zip.
The problem was fixed in Python 3.7 (see the CPython code).
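On older Pythons, a common workaround is to buffer the inner archive into a seekable `io.BytesIO` instead of passing the non-seekable `ZipExtFile`. The following self-contained sketch builds the fixture in memory rather than using a real `outside.zip`:

```python
import io
import zipfile

# Build a small in-memory fixture: outside.zip containing nested.zip,
# which in turn contains "file" (names mirror the example above).
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("file", b"hello")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("nested.zip", inner.getvalue())

# Workaround for Python < 3.7: read the whole nested archive into a
# seekable io.BytesIO before handing it to zipfile.ZipFile.
with zipfile.ZipFile(outer) as outside_zip:
    data = outside_zip.read("nested.zip")
    with zipfile.ZipFile(io.BytesIO(data)) as inside_zip:
        content = inside_zip.read("file")
```

The trade-off is that the entire nested archive is held in memory, which may be unacceptable for large files.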
Problem Statement
A recursive `list` in a zip ignores the specified path and just dumps all the files in the zip.
How to reproduce
tianqi@host:~$ unzip -l source-to-image.zip
Archive: source-to-image.zip
Length Date Time Name
--------- ---------- ----- ----
0 2019-08-15 16:36 source-to-image/
7213952 2018-05-15 04:16 source-to-image/sti
7213952 2018-05-15 04:16 source-to-image/s2i
435401 2019-07-24 19:43 profile
--------- -------
14863305 4 files
>>> import chainerio
>>> handler = chainerio.open_as_container('source-to-image.zip')
>>> list(handler.list('source-to-image', recursive=True))
['source-to-image/', 'source-to-image/sti', 'source-to-image/s2i', 'profile']
Related to #61 and #66, we need to accept directories without a trailing slash in zip operations.
For example:
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> handler.isdir('test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 130, in isdir
stat = self.stat(file_path)
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 101, in stat
return self.zip_file_obj.getinfo(path)
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 1395, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'test' in the archive"
>>> handler.isdir('test/')
True
List of Affected Operations:
The default context is currently implemented with a few raw global variables. Such an implementation causes many troubles: it is difficult to switch contexts while keeping everything consistent, and there is no context-manager support for local configuration. We should use a configuration mechanism like the one in Chainer.
import _hashlib
import mysql.connector
import pfio
dest_path = 'hdfs:///my/hdfs/file'
with pfio.open(dest_path, 'wb') as file_out:
file_out.write(b'data')
In my environment, the above code results in a segmentation fault:
python: Relink `/usr/local/lib/python3.6/site-packages/pyarrow/libarrow.so.17' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
Segmentation fault (core dumped)
The crash seems related to loading `_hashlib` together with libraries that internally depend on `hashlib`, including pfio itself. Some environment info:
$ which python
/usr/local/bin/python
$ python --version
Python 3.6.9
$ pip freeze | grep mysql
mysql-connector-python==8.0.21
$ pip freeze | grep pfio
pfio==1.0.0
$ ldd /usr/local/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffd4e7ce000)
libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007fa2138dc000)
libpython3.6m.so.1.0 => /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0 (0x00007fa213231000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa212e40000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa212c3c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa212a1d000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007fa2127eb000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fa2125ce000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa2123cb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa21202d000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa213fae000)
@yuyu2172 reported that unpickling a fairly large file on NFS with ChainerIO is much slower than with Python's built-in IO system. The difference was 10x or even more.
Here's complete benchmark results and code: https://gist.github.com/kuenishi/d8d93847e0705c110501d68101fe5f53
Unpickling a long list `l = [0.1] * 10000000` takes 12 seconds with ChainerIO and 0.75 seconds with `io`.
The main difference in the profile result was the number of read calls: about 20M calls vs almost zero (doesn't appear in cProfile).
This is because ChainerIO's file object doesn't have a `peek` method, while Python's `io.BufferedReader` (a `BufferedIOBase` implementation) has it. The relevant CPython `_pickle` code:
/* Prefetch some data without advancing the file pointer, if possible */
if (self->peek && n < PREFETCH) {
len = PyLong_FromSsize_t(PREFETCH);
if (len == NULL)
return -1;
data = _Pickle_FastCall(self->peek, len);
With this patch
diff --git a/chainerio/fileobject.py b/chainerio/fileobject.py
index df7fedd..da31ad9 100644
--- a/chainerio/fileobject.py
+++ b/chainerio/fileobject.py
@@ -43,6 +43,9 @@ class FileObject(io.IOBase):
def fileno(self) -> int:
return self.base_file_object.fileno()
+ def peek(self, size: int = 0) -> bytes:
+ return self.base_file_object.peek(size)
+
def read(self, size: Optional[int] = None) -> Union[bytes, str]:
return self.base_file_object.read(size)
The number of read calls comes back to sanity and the performance has improved:
$ python b.py chainerio
chainerio <class 'chainerio.fileobject.FileObject'> haspeek: True
chainerio <module 'chainerio' from '/home/kuenishi/src/chainerio/chainerio/__init__.py'> : 0.8312304019927979 sec
156254 function calls in 0.831 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.428 0.428 0.831 0.831 {built-in method _pickle.load}
39063 0.016 0.000 0.295 0.000 fileobject.py:49(read)
39063 0.279 0.000 0.279 0.000 {method 'read' of '_io.BufferedReader' objects}
39063 0.016 0.000 0.109 0.000 fileobject.py:46(peek)
39063 0.093 0.000 0.093 0.000 {method 'peek' of '_io.BufferedReader' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
I would suggest retreating from the current wrapper strategy on file objects until we truly support profiling. This is mainly because we don't have a perfect solution for now. For example, none of these options are good, or they are even nonsense: 1) adding a `peek` method is a hacky workaround; 2) wrapping ChainerIO's file object again with `io.BufferedReader` or `io.BufferedWriter` is like a matryoshka doll, and crazy.
Notes: `peek` is not part of `io.BufferedIOBase` or `io.IOBase`; `zipfile.ZipExtFile` has had `peek` since at least Python 3.5.
Problems
A few `chainerio.xxx` functions (e.g. `chainerio.exists`) do not follow `set_root`.
How to reproduce
hdfs dfs -ls /home/
ls: `/home/': No such file or directory
$ python
Python 3.6.8 (default, Oct 7 2019, 09:28:08)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chainerio
>>> chainerio.set_root("hdfs")
>>> chainerio.exists("/home/")
True
Cause
These functions call an old function which ignores the `default_handler`:
https://github.com/chainer/chainerio/blob/ca79de952f21cc216cc70c1a0b87afd6d07426cd/chainerio/__init__.py#L267
https://github.com/chainer/chainerio/blob/ca79de952f21cc216cc70c1a0b87afd6d07426cd/chainerio/__init__.py#L43
When I access HDFS using pfio, the following DeprecationWarning occurs.
/..../pfio/pfio/filesystems/hdfs.py:153: DeprecationWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
pfio==1.0.1
pyarrow==2.0.0
Suggested by @kuenishi:
Not now, but I'd rather move on to newer and more popular formatters like black.
Deep learning frameworks support multi-process data loading, such as num_worker
option of DataLoader
in PyTorch, MultiprocessIterator
in Chainer, etc.
They use multiprocessing module to launch worker processes using fork
by default (in Linux).
When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied to the child processes. It is eventually destroyed when one of the workers completes its work (this happens at the end of each epoch in the PyTorch DataLoader). The remaining worker processes still want to talk to HDFS, but since the connection has been unexpectedly and uncontrollably closed, they break.
As far as I know, the actual error message or phenomenon that users face differs depending on the situation (freezing, strange errors like `RuntimeError: threads can only be started once`, etc.), which makes troubleshooting even more difficult.
The workaround for this issue is to set the multiprocessing start method to forkserver before any access to HDFS.
Due to a similar reason (prevent MPI context being broken after fork), ChainerCV and Chainer examples apply the same workaround, and it works for PFIO+HDFS case, too.
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
Describe the abstract API like we do in ChainerMN. That would help developers understand how it behaves.
We need to test whether automatic filesystem detection also works on containers.
For example
files = ["posix.zip", "hdfs://hdfs.zip"]
for file in files:
    with chainerio.open_as_container(file) as f:
        print(f.read())
Currently, we have only checked the detection on the normal `open` operation.
As we added support for Python 3.5.2 again, we need to add Python 3.5.2 tests to CI, since the workaround needs to be checked for future PRs.
We should add a project description to avoid this empty page: https://pypi.org/project/chainerio/
https://chainerio.readthedocs.io/en/latest/reference.html#chainerio.list just says "Please note that the meaning of $HOME depends on each filesystem." It would be nice to clarify the differences between filesystems that cannot be absorbed by the library.
I found that the current pfio ZipContainer cannot handle a Zip file with DOS-compatible external attributes.
# Creation of hello_dos.zip is explained later.
$ zipinfo hello_dos.zip
Archive: hello_dos.zip
Zip file size: 318 bytes, number of entries: 2
drwx--- 2.0 fat 0 bx stor 20-Sep-09 16:34 FOO/
-rw---- 2.0 fat 6 tx stor 20-Sep-09 16:34 FOO/HELLO.TXT
2 files, 6 bytes uncompressed, 6 bytes compressed: 0.0%
$ python
>>> import pfio
>>> container = pfio.open_as_container('hello_dos.zip')
>>> container.isdir('FOO')
False # <--------- Supposed to be True because FOO is a directory in the zip
# Since it cannot recognize the directory, it cannot list the contents of the directory either
>>> list(container.list('FOO/'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "xxxxxxxxxxxxxxxxxxxxx/pfio/containers/zip.py", line 190, in list
"{} is not a directory".format(path_or_prefix))
NotADirectoryError: FOO is not a directory
...
Assuming Linux (Ubuntu 18.04) environment.
$ mkdir foo
$ echo hello > foo/hello.txt
# OK case: Unix attributes (using the UNIX zip command)
$ zip -r hello_unix.zip foo
adding: foo/ (stored 0%)
adding: foo/hello.txt (stored 0%)
# NG case: DOS attributes
$ zip -rk hello_dos.zip foo
adding: FOO/ (stored 0%)
adding: FOO/HELLO.TXT (stored 0%)
Here, the `-k` option to the `zip` command stands for storing DOS-like attributes in the external attribute field of a ZIP file.
-k
--DOS-names
Attempt to convert the names and paths to conform to MSDOS, store only the MSDOS attribute (just the user write attribute from Unix), and mark the entry as made under MSDOS (even though it was not); for compatibility with PKUNZIP under MSDOS which cannot handle certain names such as those with two dots.
What actually happens is explained in the ZIP file format specification:
If the external file attributes are compatible with MS-DOS and can be read by PKZIP for DOS version 2.04g then this value will be zero
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
In `zipfile`, a ZIP file created in the second way has a different `external_attr`, making the `ZipInfo` object slightly different.
>>> zipfile.ZipFile('hello_unix.zip').getinfo('foo/')
<ZipInfo filename='foo/' filemode='drwxrwxr-x' external_attr=0x10>
>>> zipfile.ZipFile('hello_dos.zip').getinfo('FOO/')
<ZipInfo filename='FOO/' external_attr=0x10>
Note that we cannot see `filemode` in the `ZipInfo` object for the DOS zip: `filemode` is parsed from the upper 2 bytes of `external_attr` (cf. zipfile.py#L393-L396), but in the DOS zip those bytes are filled with zero.
Since pfio (`ZipContainer.stat`) relies on the external attributes to decide whether a specified name is a directory, it cannot properly handle the directory.
>>> pfio.open_as_container('hello_unix.zip').stat('foo')
<ZipFileStat filename="foo/" mode="drwxrwxr-x">
>>> pfio.open_as_container('hello_unix.zip').isdir('foo')
True
>>> pfio.open_as_container('hello_dos.zip').stat('FOO')
<ZipFileStat filename="FOO/" mode="?---------">
>>> pfio.open_as_container('hello_dos.zip').isdir('FOO')
False
The directory check is done by simply checking the trailing '/' in CPython and in previous pfio versions, but I (yes, I did!) introduced the dependency on the external attributes field in #114...
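For reference, CPython's own `ZipInfo.is_dir()` (available since Python 3.6) uses only the trailing-slash check, so it works regardless of external attributes; a small self-contained demonstration:

```python
import io
import zipfile

# Build an in-memory zip with an explicit directory entry, mimicking
# the FOO/ entry from hello_dos.zip above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("FOO/", b"")
    zf.writestr("FOO/HELLO.TXT", b"hello\n")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("FOO/")
    # ZipInfo.is_dir() checks only the trailing slash, so it is
    # robust to DOS-style external attributes.
    assert info.is_dir()
    assert not zf.getinfo("FOO/HELLO.TXT").is_dir()
```

Falling back to this check would restore the pre-#114 behavior for DOS zips.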
We need an API to make an archive (e.g. zip) on a target filesystem.
The current `list` in zip only lists the first level of files in the zip with the given prefix, to match the behavior of POSIX. However, listing all the files in a zip is needed in many cases, e.g. reading every file in a zip. We need a way to list all the files.
There are two bugs when listing a directory in a zip.
The ZIP:
tianqi@host:tianqi$ unzip -l test.zip
Archive: test.zip
Length Date Time Name
--------- ---------- ----- ----
0 2019-08-16 15:11 test/
12 2019-07-24 13:13 test/file
0 2019-08-16 15:11 test/dir/
0 2019-08-16 15:11 test/dir/nested_dir/
170 2019-07-24 13:16 test/file.zip
167 2019-01-16 16:53 python.py
--------- -------
349 6 files
Bug 1
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> list(handler.list('test/'))
['file', 'dir', 'file.zip', 'python.py']
where `python.py` is not in `test/`.
Bug 2
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> list(handler.list('test'))
['python.py']
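For reference, first-level listing under a prefix can be sketched as follows; `list_dir` is a hypothetical helper, not chainerio's actual implementation, and the name list mirrors the test.zip above:

```python
# Hypothetical helper: first-level listing of zip entry names under a
# prefix, handling both 'test' and 'test/' consistently.
def list_dir(names, prefix):
    prefix = prefix.rstrip('/') + '/'
    seen = []
    for name in names:
        if name.startswith(prefix) and name != prefix:
            # Keep only the first path component below the prefix.
            first = name[len(prefix):].split('/')[0]
            if first and first not in seen:
                seen.append(first)
    return seen

names = ['test/', 'test/file', 'test/dir/', 'test/dir/nested_dir/',
         'test/file.zip', 'python.py']
assert list_dir(names, 'test') == ['file', 'dir', 'file.zip']
assert list_dir(names, 'test/') == ['file', 'dir', 'file.zip']
```

With this approach, `python.py` is never returned for the `test` prefix, and the trailing slash no longer changes the result.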
Problem Statement
Massive calls to the global functions, such as `chainerio.open` and `chainerio.list`, against HDFS might break the Kerberos ticket and cause the code to fail. Access through a handler, however, greatly reduces the probability of hitting the problem.
Examples of the Problem Code
while True:
# generating massive calls to chainerio.open
# we have massive updaters inside
with chainerio.open('hdfs:///user/README.txt', 'rb') as f:
# fails after the ticket is broken
f.read()
Error Message
20/03/17 11:51:51 WARN security.UserGroupInformation: Exception encountered while running the renewal command for xxxxxxx. (TGT end time:1584453111000, renewalFailures: 0, renewalFailuresTotal: 1)
ExitCodeException exitCode=1: kinit: Internal credentials cache error (filename: /tmp/krb5cc_xxxx) when
initializing cache
at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
at org.apache.hadoop.util.Shell.run(Shell.java:902)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1321)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1303)
at org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupI
nformation.java:1077)
at org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroup
Information.java:988)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Root Cause
Each call to a global function, such as `chainerio.open` or `chainerio.list`, creates an HDFS handler with a Kerberos ticket updater inside, which runs in a separate thread and periodically updates the Kerberos ticket cache. Massive calls result in a huge number of updaters running in parallel. Moreover, the Java Hadoop client also performs its own kind of ticket update. The lack of race control among the updaters can break the ticket data inside the Java Hadoop client and cause it to fail.
Currently, we still rely on our own ticket updater because, before Hadoop 3.0, the Java Hadoop client cannot update the ticket from the keytab set through the environment variable `KRB5_KTNAME`.
How to avoid
Avoid using the global functions when you need to call them massively. Instead, create a handler:
# one handler, one updater
handler = chainerio.create_handler("hdfs")
while True:
    with handler.open('/user/README.txt', 'rb') as f:
        # far less likely to fail
        f.read()
Please NOTE: using a handler CANNOT eliminate this issue; it only significantly reduces the probability by running a single updater instead of massive numbers of them.
tianqi:pfio$ flake8 pfio
pfio/__init__.py:68:48: F821 undefined name 'IOBase'
pfio/io.py:19:55: F821 undefined name 'IOBase'
pfio/io.py:82:47: F821 undefined name 'IOBase'
pfio/io.py:90:64: F821 undefined name 'IOBase'
pfio/io.py:109:52: F821 undefined name 'IOBase'
pfio/containers/zip.py:119:47: F821 undefined name 'IOBase'
pfio/containers/zip.py:127:64: F821 undefined name 'IOBase'
pfio/filesystems/hdfs.py:180:47: F821 undefined name 'IOBase'
pfio/filesystems/hdfs.py:188:64: F821 undefined name 'IOBase'
tianqi:pfio$ flake8 --version
3.8.2 (mccabe: 0.6.1, pycodestyle: 2.6.0, pyflakes: 2.2.0) CPython 3.7.4 on
Linux
tianqi:pfio$ git branch
bump_v_1_0_0
bump_version_0_1_2
* master
pr_114
pr_121
pr_125
refactor_zip_test
rename_to_pfio
tianqi:pfio$
A related issue to #36: as `peek` is not in `HdfsFile`, the same issue also happens when using HDFS. Unlike #36, where `peek` in POSIX was just hidden by the `FileObject`, `peek` needs to be added in the HDFS case.
$ python test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-simple.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
19/08/16 14:53:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
chainerio: 186.07044076919556
$ cat test.py
import time
import pickle
import chainerio
chainerio.set_root("hdfs")
cache_path = 'a_large_file.pkl'
start = time.time()
with chainerio.open(cache_path, 'rb') as f:
data = pickle.load(f)
print('chainerio: ', time.time() - start)
Wed Jul 24 19:43:47 2019 profile
466895442 function calls (466887441 primitive calls) in 161.088 seconds
Ordered by: internal time
List reduced from 2941 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 74.242 74.242 161.008 161.008 {built-in method _pickle.load}
233293409 46.669 0.000 84.175 0.000
xxx/versions/3.7.2/lib/python3.7/site-packages/chainerio/fileobject.py:50(read)
233293409 37.505 0.000 37.505 0.000 {method 'read' of '_io.BufferedReader' objects}
1452 0.384 0.000 0.388 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
When tests fail, files are not cleaned up well.
try:
> mkdir(name, mode)
E FileExistsError: [Errno 17] File exists: 'testlsdir/nested_dir1'
/usr/local/lib/python3.7/os.py:221: FileExistsError
=========================================================================== warnings summary ============================================================================
/home/kota/.local/lib/python3.7/site-packages/_pytest/mark/structures.py:324
/home/kota/.local/lib/python3.7/site-packages/_pytest/mark/structures.py:324: PytestUnknownMarkWarning: Unknown pytest.mark.gpu - is this a typo? You can register cus
tom marks to avoid this warning - for details, see https://docs.pytest.org/en/latest/mark.html
PytestUnknownMarkWarning,
-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================= 3 failed, 89 passed, 8 xfailed, 1 warnings in 12.61s ==========================================================
~/chainerio$ ls
LICENSE README.md chainerio chainerio.egg-info docs examples setup.cfg setup.py testlsdir tests tox.ini
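One common fix (a sketch, not necessarily how this repository's tests should be structured) is to create test directories with `tempfile.TemporaryDirectory`, which removes them even when the test body raises:

```python
import os
import tempfile

# Create test directories under a TemporaryDirectory: they are removed
# even if the test body raises, so no 'testlsdir' leftovers remain.
with tempfile.TemporaryDirectory() as d:
    nested = os.path.join(d, "testlsdir", "nested_dir1")
    os.makedirs(nested)
    assert os.path.isdir(nested)

# After the with-block the whole tree is gone.
assert not os.path.exists(d)
```

pytest's `tmp_path` fixture achieves the same isolation per test function.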
Problem statement
Errors occur when multiple child processes read from an implicitly shared `ZipContainer` that was opened by the parent process before forking.
How to reproduce
import chainerio
from multiprocessing import Process
container = chainerio.open_as_container("dummy.zip")
# open a file inside the container to actually open the zip
# the zip will be implicitly shared after forking
with container.open("dummy_data") as f:
f.read()
def func():
# accessing the shared container
with container.open("dummy_data") as f:
f.read()
p1 = Process(target=func)
p2 = Process(target=func)
p1.start()
p2.start()
p1.join()
p2.join()
Error Message
Process Process-2:
Traceback (most recent call last):
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 10, in func
with container.open("dummy_data") as f:
File "tianqi/repository/chainerio/chainerio/io.py", line 20, in wrapper
errors, newline, closefd, opener)
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 88, in open
nested_file = self.zip_file_obj.open(file_path, "r")
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 1488, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
Process Process-1:
Traceback (most recent call last):
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 11, in func
f.read()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 885, in read
buf += self._read1(self.MAX_N)
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 975, in _read1
data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid block type
Root Cause
As mentioned in the code comments, the root cause is that multiple child processes access the same zip container that was opened by their parent process before forking. The underlying fd was copied when forking, while the actual file struct, which lives outside the process heap, was not. Hence, all the child processes read through the same fd of the shared zip and advance the shared file offset, causing all the processes to read broken data.
Why we need to prevent this in the ChainerIO layer
The problem can be eliminated if the user avoids opening the container before forking. However, such behavior can be hard to detect when the code gets complicated with classes and frameworks.
For example:
import chainerio
from multiprocessing import Process
class SharedZip(object):
def __init__(self, name_of_container):
self.container = None
self.name_of_container = name_of_container
def open_container(self):
self.container = chainerio.open_as_container(self.name_of_container)
def read_file(self, file_in_container):
if self.container is None:
self.open_container()
with self.container.open(file_in_container) as f:
return f.read()
-------
# from codes uses other frameworks
self.shared_zip = SharedZip("dummy.zip")
-------
# then, called by some routines
read_some_data = self.shared_zip.read_file("dummy_data")
------
# later on, after forking
# in child processes
# error occurs here
read_other_data = self.shared_zip.read_file("other_dummy_data")
If we can prevent it at the ChainerIO layer, such problems can be eliminated completely without any attention from users.
Solution 1
Detect that the process has been forked by comparing the `pid`:
- record the `pid` when opening the `zip_file_obj`
- check the `pid` before every call to the `zip_file_obj`, and reopen it if the `pid`s do not match
As the `pid` will be cached, the overhead of `os.getpid()` can be small:
http://man7.org/linux/man-pages/man2/getpid.2.html
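Solution 1 can be sketched as follows; `ForkSafeZip` and its members are hypothetical names, not ChainerIO's actual code:

```python
import os
import zipfile

class ForkSafeZip:
    """Hypothetical sketch of Solution 1; not ChainerIO's actual code."""

    def __init__(self, path):
        self.path = path
        self.pid = os.getpid()  # remembered at open time
        self.zip_file_obj = zipfile.ZipFile(path)

    def _check_fork(self):
        # Called before every access: if the pid changed, we are in a
        # forked child whose fd shares its offset with the parent, so
        # reopen the archive to get a private fd.
        if os.getpid() != self.pid:
            self.zip_file_obj = zipfile.ZipFile(self.path)
            self.pid = os.getpid()

    def open(self, name):
        self._check_fork()
        return self.zip_file_obj.open(name)
```

Closing the stale inherited `ZipFile` in the child is omitted here for brevity.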
Solution 2
Close the `zip_file_obj` when forking, e.g. via `__getstate__` / `__copy__` hooks.
The same solution might be applied to the HDFS connection.
ToDo
There is a lot of code redundancy in the filesystem and container tests. We need to refactor the tests to reduce the redundancy. Since the aim of ChainerIO is to unify the APIs and behaviors across different filesystems and containers, the tests should also be unified.
For containers like zip, it is a common need to list all the files, as discussed in #43.
To support such operations in a general way, i.e. at the API level, we have decided to add a `recursive` flag to `list`. When set, `list` returns a generator that walks through all the files and directories under the given path, similar to `os.walk`.
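For a POSIX directory, the intended behavior is roughly a flattened `os.walk()` yielding paths relative to the given root; a sketch:

```python
import os

# Sketch: a recursive listing for a local directory, analogous to
# flattening os.walk(); yields paths relative to the given root.
def recursive_list(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)
```

For zip containers, the same shape of output can be produced directly from the entry names.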
To-do
We need to remove Python 3.5.2 from our supported version list.
For more details:
https://stackoverflow.com/questions/42942867/optionaltypefoo-raises-typeerror-in-python-3-5-2
Memo for the requirements
In the purpose of filesystem/container abstraction, it is a common idiom for pfio:
if path.endswith('.zip'):
    fs = pfio.open_as_container(path)
else:
    fs = pfio
    pfio.set_root(path)  # <--- In some cases (especially when abstracting zip and filesystem) we need this.
However, `pfio.set_root` changes the global state and thus can sometimes be troublesome.
fs = PosixFileSystem(root='/path/to/dir')
would be a better way to solve this issue.
The small problem here is that `PosixFileSystem` is currently only used internally: it is not documented, and we need to spell out the fully qualified path (`pfio.filesystems.posix.PosixFileSystem`) to use this class. I'd like this class to be documented and accessible directly under the `pfio` module.
Another approach to realize the same thing is to support POSIX/HDFS directories in `open_as_container`, which would internally behave the same way (create a `PosixFileSystem` object with the given root).
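A standalone sketch of the proposed public interface: a handler scoped to a root directory, with no global `set_root()` state involved. pfio's real `PosixFileSystem` implements the full filesystem API; this cut-down version only illustrates the root-scoping idea:

```python
import os

class PosixFileSystem:
    """Cut-down illustrative sketch of a root-scoped POSIX handler;
    pfio's real class has many more methods."""

    def __init__(self, root='.'):
        self.root = root

    def open(self, path, mode='r'):
        # every path is resolved relative to this handler's root,
        # so two handlers with different roots cannot interfere
        return open(os.path.join(self.root, path), mode)

    def list(self):
        return iter(os.listdir(self.root))
```

Because the root lives on the instance rather than in module-level state, code that juggles a zip container and a directory can hold two independent handlers side by side.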
Problem Statement
Heimdal is a Kerberos variant other than MIT, and it is the one shipped on macOS. Heimdal's klist has a different output format, which breaks our parsing of the authentication information.
for example:
Line 30 in 29e5c6d
Error Message
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/krbticket/ticket.py in parse_from_klist(config, output)
140 file = lines[0].split(':')[2]
141 principal = lines[1].split(':')[1].strip()
--> 142 starting, expires, service_principal = lines[4].strip().split(' ')
143 if len(lines) > 5:
144 renew_expires = lines[5].strip().replace('renew until ', '')
ValueError: too many values to unpack (expected 3)
Output of klist of Heimdal
xxx@xxx[~] klist -c /tmp/krb5cc_xxx
Credentials cache: FILE:/tmp/krb5cc_xxx
Principal: xxxx@xxxxxxxxx
Issued Expires Principal
Aug 5 16:06:47 2020 Aug 12 16:06:47 2020 krbtgt/xxxxxxx@xxxxxxxxxx
Output of klist of MIT Kerberos
xxxxx@xxxxxxx:/opt/$ klist -c /tmp/krb5cc_xxxx
Ticket cache: FILE:/tmp/krb5cc_xxxx
Default principal: xxxxx@xxxxx
Valid starting Expires Service principal
08/04/20 12:33:52 08/11/20 12:33:52 krbtgt/xxxxxx@xxxxxx
renew until 09/01/20 12:33:52
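Comparing the two outputs above, Heimdal prints dates as four whitespace-separated tokens (`Aug 5 16:06:47 2020`) while MIT uses two (`08/04/20 12:33:52`), so a fixed `split(' ')` cannot handle both. A hedged sketch of format-aware parsing follows; the function name and token counts are assumptions drawn from the samples, not krbticket's actual API:

```python
def parse_ticket_line(line, heimdal):
    """Split one klist ticket line into (start, expiry, principal).
    Assumption from the sample outputs above: Heimdal dates span 4
    whitespace-separated tokens, MIT dates span 2."""
    tokens = line.split()          # split() collapses runs of spaces
    n = 4 if heimdal else 2
    start = ' '.join(tokens[:n])
    expiry = ' '.join(tokens[n:2 * n])
    principal = tokens[2 * n]
    return start, expiry, principal
```

Detecting which variant produced the output could key off the first header line (`Credentials cache:` for Heimdal vs `Ticket cache:` for MIT), per the samples above.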
Problem
Currently, once created, handlers are cached internally and reused. As a handler contains per-instance data such as `root`, operations on that data interfere across handlers. Hence, a brand-new handler should be created each time `create_handler` is called.
Example
>>> h1 = chainerio.create_handler("boo")
>>> h1
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
>>> h1.root
'boo'
>>> h2 = chainerio.create_handler("foo")
>>> h2
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
# the creation of h2 changes the data in h1, as they are actually the same object
>>> h1.root
'foo'
>>>
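The bug and the fix can be contrasted in a standalone sketch. The caching logic below is a guess at the cause inferred from the observed behavior, not pfio's actual code:

```python
class PosixFileSystem:
    # stand-in for pfio.filesystems.posix.PosixFileSystem
    def __init__(self, root=''):
        self.root = root

_cache = {}

def create_handler_current(root):
    # observed (buggy) behavior: one cached instance is reused and
    # mutated, so every "new" handler shares the same root
    handler = _cache.setdefault('posix', PosixFileSystem())
    handler.root = root
    return handler

def create_handler_fixed(root):
    # desired behavior: a brand-new handler on every call
    return PosixFileSystem(root=root)
```

With the cached version, creating `h2` silently rewrites `h1.root`, exactly as in the example above; the fixed version keeps each handler's state independent.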
Problem
We need to check whether the given scheme is supported; this check is currently missing.
Example
# 'boo' is not a supported scheme
# The call returns h1 as a POSIX (default) handler
# and the root is set to "boo"
>>> h1 = chainerio.create_handler('boo')
>>> h1
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
>>> print(h1.root)
boo
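A sketch of the missing check; the supported-scheme set and function name are assumptions for illustration:

```python
SUPPORTED_SCHEMES = {'posix', 'hdfs'}  # assumed set; pfio's may differ

def check_scheme(scheme):
    """Reject unknown schemes instead of silently falling back to a
    POSIX handler whose root is set to the scheme string."""
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported scheme: {!r}".format(scheme))
    return scheme
```

Raising early turns the confusing `root == 'boo'` behavior above into an immediate, explicit error.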
Problem statement
If HDFS calls were made in the main process before forking, the program freezes at any future HDFS call.
For example:
import chainerio
import multiprocessing
def func(hdfs_handler):
    # freezes here (2)
    print(list(hdfs_handler.list()))

hdfs_handler = chainerio.create_handler("hdfs")
# creates an HDFS connection internally (1)
print(list(hdfs_handler.list()))
p = multiprocessing.Process(target=func, args=(hdfs_handler, ))
p.start()
p.join()
Cause
ChainerIO uses the pyarrow module to access HDFS internally, and pyarrow in turn uses the HDFS Java client. The HDFS connection is pooled inside; if the connection is first created (implicitly through calls to HDFS, like (1)) and the process then forks, the pooling breaks and future HDFS calls freeze.
Solution
Fork before the creation of the HDFS connection, i.e. fork before any call to HDFS such as (1).
`set_root()` accepts a URI or a handler, but the behavior of the following examples is unclear.
>>> chainerio.set_root('')
>>> chainerio.set_root('foo')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/kuenishi/Library/Python/3.7/lib/python/site-packages/chainerio/__init__.py", line 172, in set_root
default_context.set_root(uri_or_handler)
File "/Users/kuenishi/Library/Python/3.7/lib/python/site-packages/chainerio/_context.py", line 82, in set_root
raise RuntimeError("the URI does not point to a directory")
RuntimeError: the URI does not point to a directory
>>> chainerio.set_root('hdfs')
>>>
- `''` is not a path, but no error is reported.
- Is `'foo'` supposed to mean a local directory relative to the "root" path, or relative to the current working directory?
- `'hdfs'` is supposed to mean a relative path to a local directory, but ChainerIO seems to interpret it as the `hdfs:///` URI.

As Chainer has entered its maintenance phase, we are planning to rename this library to `pfio`.
But to reduce the porting effort in user code, we can keep the `chainerio` name and let users import it under both names.
This is a note for an offline discussion: a new design for the open wrapper to replace the current `fileobject`.
In the current implementation, a decorator `open_wrapper` wraps the original file object returned by open into a `fileobject`.
During `__init__`, the `fileobject` further wraps the original file object according to the underlying filesystem, adding any missing functionality needed to make it behave like a built-in POSIX file object.
Such a `fileobject` is also a base for profiling.
However, this design causes several problems:
- Because of the `peek` function in `fileobject`, pickle shows a huge performance degradation (#36).
- Because of the missing `seekable` function, the nested zip container does not work (#34).
- Without `peek` in HDFS, the same problem as in #36 would happen when using HDFS.

We have decided on the 2nd of the two solutions discussed: do not wrap everything with `fileobject`; instead, wrap only when it is necessary. The 1st solution would have been a huge amount of work, and it is unnecessary for now as the profiling module is not ready yet.
Concretely: extract the functionality that wraps the original file object out of `fileobject.__init__` into a function called by the open wrapper. By doing so, we can still wrap the file object when necessary while avoiding wrapping everything with `fileobject`.
This new design is supposed to fix #36 and #34.
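The wrap-only-when-necessary design can be sketched as a free function consulted by the open wrapper; `maybe_wrap` and `_SeekableShim` are hypothetical names, and the single capability checked here stands in for the fuller set pfio cares about:

```python
class _SeekableShim:
    """Hypothetical minimal shim: adds a seekable() method to file-like
    objects that lack one, delegating everything else to the raw object."""

    def __init__(self, raw):
        self._raw = raw

    def seekable(self):
        return hasattr(self._raw, 'seek')

    def __getattr__(self, name):
        return getattr(self._raw, name)

def maybe_wrap(fileobj):
    # Wrap only when a capability is missing; a file object that already
    # behaves like a built-in one is returned untouched, so pickle and
    # peek-related overheads (#36) never apply to it.
    if hasattr(fileobj, 'seekable'):
        return fileobj
    return _SeekableShim(fileobj)
```

Returning the raw object unchanged in the common case is what avoids the blanket `fileobject` overhead the issues above complain about.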
It seems the handler set via `set_root` is ignored in `pfio.mkdir`.

>>> import pfio
>>> pfio.set_root(pfio.create_handler('hdfs'))  # or pfio.set_root('hdfs')
>>> pfio.mkdir('test/test')  # this is treated as a POSIX path
>>> pfio.open('test.txt')  # this is treated as an HDFS path
(I was just playing around these APIs so this issue is low-priority (at least for me))
We want to add `chmod` to support setting file modes.