pfnet / pfio
IO library to access various filesystems with unified API
Home Page: https://pfio.readthedocs.io/
License: MIT License
Problem statement
Getting default Principal name fails when "KRB5_KTNAME" is given
How to reproduce
>>> import os
>>> import chainerio
>>> os.getenv("KRB5_KTNAME")
'/etc/keytab/user.keytab'
>>> chainerio.exists("hdfs:///user/tianqi")
kinit: Keytab contains no suitable keys for user_xxx@NAMESERVICE while getting initial credentials
Cause
https://github.com/chainer/chainerio/blob/9ef197461fd715f1fe90b10bf0090b9a0c1849c2/chainerio/filesystems/hdfs.py#L52
`klist -c` reads the credential cache, not the keytab.
We should also create an interface for users to specify their username instead of always using the default.
krbticket was introduced to update the Kerberos ticket periodically, but I noticed the HDFS Java client also has similar functionality, so krbticket may no longer be needed.
Note that the HDFS client can use other environment variables to specify the keytab, ccache path, and principal. This is available in Hadoop 3.0 releases or later.
Currently, we have two issues regarding the trailing slash of directories:
The first issue: for a recursive list of a zip, we simply yield the contents of `zipfile.namelist()`, so all the directories carry a trailing slash. For other filesystems and the non-recursive zip listing, however, no trailing slash is added, which creates inconsistent behavior.
>>> with chainerio.open_as_container("outside.zip") as container:
... print(list(container.list()))
... print(list(container.list(recursive=True)))
...
['testdir']
['testdir/', 'testdir/testfile', 'testdir/nested.zip']
>>> list(chainerio.list("."))
['bench.py', 'tests', 'outside.zip', '.gitignore', '.python-version', 'docs', 'test.py', 'chainerio.egg-info', 'examples', '.mypy_cache', 'setup.cfg', 'setup.py', 'dist', '.pytest_cache', 'tox.ini', 'nested.py', '.git', 'LICENSE', 'README.md', '.tox', 'test', 'test.sh', '.pfnci', 'chainerio']
>>> list(chainerio.list("tests", recursive=True))
['chainer_extensions_test', 'chainer_extensions_test/test_snapshot.py', 'chainer_extensions_test/__pycache__', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-37-PYTEST.pyc', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-36-pytest-5.0.1.pyc', 'chainer_extensions_test/__pycache__/test_snapshot.cpython-37-pytest-5.0.1.pyc', '.python-version', 'cache_tests', 'cache_tests/test_cache.py', 'cache_tests/test_mt.py', 'cache_tests/__pycache__', 'cache_tests/__pycache__/test_file_cache.cpython-36-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_mt.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_cache.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_file_cache.cpython-37-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_mt.cpython-36-pytest-5.0.1.pyc', 'cache_tests/__pycache__/test_file_cache.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_mt.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_cache.cpython-37-PYTEST.pyc', 'cache_tests/__pycache__/test_cache.cpython-36-pytest-5.0.1.pyc', 'cache_tests/test_file_cache.py', 'profiler_tests', 'profiler_tests/__pycache__', 'profiler_tests/__pycache__/test_naive_profile_writer.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_naive_profiler.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_chrome_profile_writer.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_profiler.cpython-37-PYTEST.pyc', 'profiler_tests/__pycache__/test_chrome_profiler.cpython-37-PYTEST.pyc', 'test_context.py', '__pycache__', '__pycache__/test_context.cpython-36-pytest-5.0.1.pyc', '__pycache__/test_context.cpython-37-PYTEST.pyc', 'filesystem_tests', 'filesystem_tests/test_posix_handler.py', 'filesystem_tests/__pycache__', 'filesystem_tests/__pycache__/test_http_handler.cpython-37-PYTEST.pyc', 'filesystem_tests/__pycache__/test_http_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_hdfs_handler.cpython-37-PYTEST.pyc', 
'filesystem_tests/__pycache__/test_hdfs_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_posix_handler.cpython-37-PYTEST.pyc', 'filesystem_tests/__pycache__/test_http_handler.cpython-36-pytest-5.0.1.pyc', 'filesystem_tests/__pycache__/test_hdfs_handler.cpython-37-pytest-5.0.1.pyc', 'filesystem_tests/test_hdfs_handler.py', 'container_tests', 'container_tests/__pycache__', 'container_tests/__pycache__/test_zip_container.cpython-36-pytest-5.0.1.pyc', 'container_tests/__pycache__/test_zip_container.cpython-37-pytest-5.0.1.pyc', 'container_tests/__pycache__/test_zip_container.cpython-37-PYTEST.pyc', 'container_tests/test_zip_container.py']
The second issue: we merged the functionality of `os.walk()` into `list()`, but unlike `os.walk()`, we do not differentiate files and directories in the return value. With no trailing slash, users need to call `is_file` to distinguish files from directories, which introduces unnecessary filesystem accesses.
There are two solutions: consistently add trailing slashes to directories, or provide an `os.walk()`-style API that separates files from directories.
There are some out-of-date tests which should be removed:
https://github.com/chainer/chainerio/blob/9ef197461fd715f1fe90b10bf0090b9a0c1849c2/tests/filesystem_tests/test_hdfs_handler.py#L56
The "keytab_path" parameter was removed in #91.
The libhdfs Hadoop client takes the user name from various sources, with the Kerberos ticket taking priority. It should not be taken only from the operating system as is done here: https://github.com/chainer/chainerio/blob/master/chainerio/filesystems/hdfs.py#L24
Also a nit: uid is not used and should be removed. https://github.com/chainer/chainerio/blob/master/chainerio/filesystems/hdfs.py#L25
Background
A nested zip is a zip inside a zip. For example, we can have `file` in `nested.zip` in `outside.zip` as follows:
outside.zip
| - nested.zip
| - | - file
The following code is an example of how ChainerIO actually accesses such a nested zip file:
import zipfile

with zipfile.ZipFile("outside.zip") as outside_zip:
    with outside_zip.open("nested.zip") as inside_f:
        with zipfile.ZipFile(inside_f) as inside_zip:
            with inside_zip.open("file") as f:
                f.read()
Both a file path and a file object can be passed to create a `zipfile.ZipFile`.
Upon creating the `zipfile.ZipFile`, the zipfile module reads the header of the file to check whether the given file is a zip file. During this check, a "seek" is performed on the given file/file object, which requires it to be seekable.
Problem
In Python < 3.7, the `ZipExtFile` returned by `ZipFile.open` (i.e. `inside_f` in the code above) is never seekable, which leads to a failure when creating the `ZipFile` of the nested zip.
The problem was fixed in Python 3.7 (see the CPython code).
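On older Pythons, a common workaround is to buffer the inner archive into a seekable `io.BytesIO` instead of passing the non-seekable `ZipExtFile`. The following self-contained sketch builds the fixture in memory rather than using a real `outside.zip`:

```python
import io
import zipfile

# Build a small in-memory fixture: outside.zip containing nested.zip,
# which in turn contains "file" (names mirror the example above).
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("file", b"hello")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("nested.zip", inner.getvalue())

# Workaround for Python < 3.7: read the whole nested archive into a
# seekable io.BytesIO before handing it to zipfile.ZipFile.
with zipfile.ZipFile(outer) as outside_zip:
    data = outside_zip.read("nested.zip")
    with zipfile.ZipFile(io.BytesIO(data)) as inside_zip:
        content = inside_zip.read("file")
```

The trade-off is that the entire nested archive is held in memory, which may be unacceptable for large files.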
Problem Statement
A recursive `list` in a zip ignores the specified path and just dumps all the files in the zip.
How to reproduce
tianqi@host:~$ unzip -l source-to-image.zip
Archive: source-to-image.zip
Length Date Time Name
--------- ---------- ----- ----
0 2019-08-15 16:36 source-to-image/
7213952 2018-05-15 04:16 source-to-image/sti
7213952 2018-05-15 04:16 source-to-image/s2i
435401 2019-07-24 19:43 profile
--------- -------
14863305 4 files
>>> import chainerio
>>> handler = chainerio.open_as_container('source-to-image.zip')
>>> list(handler.list('source-to-image', recursive=True))
['source-to-image/', 'source-to-image/sti', 'source-to-image/s2i', 'profile']
Related to #61 and #66, we need to accept directories without a trailing slash in zip operations.
For example:
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> handler.isdir('test')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 130, in isdir
stat = self.stat(file_path)
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 101, in stat
return self.zip_file_obj.getinfo(path)
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 1395, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'test' in the archive"
>>> handler.isdir('test/')
True
List of Affected Operations:
The default context is currently implemented with a few raw global variables. Such an implementation causes many troubles: it is difficult to switch contexts while keeping everything consistent, and there is no context-manager support for local configuration. We should use a configuration mechanism like the one in Chainer.
import _hashlib
import mysql.connector
import pfio
dest_path = 'hdfs:///my/hdfs/file'
with pfio.open(dest_path, 'wb') as file_out:
file_out.write(b'data')
In my environment, the above code results in a segmentation fault:
python: Relink `/usr/local/lib/python3.6/site-packages/pyarrow/libarrow.so.17' with `/lib/x86_64-linux-gnu/librt.so.1' for IFUNC symbol `clock_gettime'
Segmentation fault (core dumped)
The crash seems related to loading `_hashlib` together with libraries that internally depend on `hashlib`, including pfio itself. Some environment info:
$ which python
/usr/local/bin/python
$ python --version
Python 3.6.9
$ pip freeze | grep mysql
mysql-connector-python==8.0.21
$ pip freeze | grep pfio
pfio==1.0.0
$ ldd /usr/local/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
linux-vdso.so.1 (0x00007ffd4e7ce000)
libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007fa2138dc000)
libpython3.6m.so.1.0 => /usr/lib/x86_64-linux-gnu/libpython3.6m.so.1.0 (0x00007fa213231000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa212e40000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa212c3c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa212a1d000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007fa2127eb000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fa2125ce000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa2123cb000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa21202d000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa213fae000)
@yuyu2172 reported that unpickling a fairly large file on NFS with ChainerIO is much slower than with Python's built-in IO system. The difference was 10x or even more.
Here's complete benchmark results and code: https://gist.github.com/kuenishi/d8d93847e0705c110501d68101fe5f53
Unpickling a long list `l = [0.1] * 10000000` takes 12 seconds with ChainerIO and 0.75 seconds with `io`.
The main difference in the profile result was the number of read calls: about 20M calls vs almost zero (doesn't appear in cProfile).
This is because ChainerIO's file object doesn't have a `peek` method, while Python's `io.BufferedReader` (a `BufferedIOBase` implementation) has it. The relevant CPython `_pickle` code:
/* Prefetch some data without advancing the file pointer, if possible */
if (self->peek && n < PREFETCH) {
len = PyLong_FromSsize_t(PREFETCH);
if (len == NULL)
return -1;
data = _Pickle_FastCall(self->peek, len);
With this patch
diff --git a/chainerio/fileobject.py b/chainerio/fileobject.py
index df7fedd..da31ad9 100644
--- a/chainerio/fileobject.py
+++ b/chainerio/fileobject.py
@@ -43,6 +43,9 @@ class FileObject(io.IOBase):
def fileno(self) -> int:
return self.base_file_object.fileno()
+ def peek(self, size: int = 0) -> bytes:
+ return self.base_file_object.peek(size)
+
def read(self, size: Optional[int] = None) -> Union[bytes, str]:
return self.base_file_object.read(size)
The number of read calls comes back to sanity and the performance has improved:
$ python b.py chainerio
chainerio <class 'chainerio.fileobject.FileObject'> haspeek: True
chainerio <module 'chainerio' from '/home/kuenishi/src/chainerio/chainerio/__init__.py'> : 0.8312304019927979 sec
156254 function calls in 0.831 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.428 0.428 0.831 0.831 {built-in method _pickle.load}
39063 0.016 0.000 0.295 0.000 fileobject.py:49(read)
39063 0.279 0.000 0.279 0.000 {method 'read' of '_io.BufferedReader' objects}
39063 0.016 0.000 0.109 0.000 fileobject.py:46(peek)
39063 0.093 0.000 0.093 0.000 {method 'peek' of '_io.BufferedReader' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
I would suggest retreating from the current wrapper strategy on file objects until we truly support profiling. This is mainly because we don't have a perfect solution for now. For example, none of these options are good, or they are even nonsense: 1) adding a `peek` method is a hacky workaround; 2) wrapping ChainerIO's file object again with `io.BufferedReader` or `io.BufferedWriter` is like a matryoshka doll, and crazy.
Notes: `peek` is not part of `io.BufferedIOBase` or `io.IOBase`; `zipfile.ZipExtFile` has had `peek` since at least Python 3.5.
Problems
A few `chainerio.xxx` functions (e.g. `chainerio.exists`) do not follow `set_root`.
How to reproduce
hdfs dfs -ls /home/
ls: `/home/': No such file or directory
$ python
Python 3.6.8 (default, Oct 7 2019, 09:28:08)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import chainerio
>>> chainerio.set_root("hdfs")
>>> chainerio.exists("/home/")
True
Cause
These functions call an old function which ignores the `default_handler`:
https://github.com/chainer/chainerio/blob/ca79de952f21cc216cc70c1a0b87afd6d07426cd/chainerio/__init__.py#L267
https://github.com/chainer/chainerio/blob/ca79de952f21cc216cc70c1a0b87afd6d07426cd/chainerio/__init__.py#L43
When I access HDFS using pfio, the following DeprecationWarning occurs.
/..../pfio/pfio/filesystems/hdfs.py:153: DeprecationWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
pfio==1.0.1
pyarrow==2.0.0
Suggested by @kuenishi:
Not now, but I'd rather move on to newer and more popular formatters like black.
Deep learning frameworks support multi-process data loading, such as num_worker
option of DataLoader
in PyTorch, MultiprocessIterator
in Chainer, etc.
They use multiprocessing module to launch worker processes using fork
by default (in Linux).
When using PFIO, if an HDFS connection is established before the fork, the connection state is also copied to the child processes. It is eventually destroyed when one of the workers completes its work (this happens at the end of each epoch in the PyTorch DataLoader). The remaining worker processes still want to talk to HDFS, but since the connection has been unexpectedly and uncontrollably closed, they break.
As far as I know, the actual error message or phenomenon that users face differs depending on the situation (freezing, strange errors like `RuntimeError: threads can only be started once`, etc.), which makes troubleshooting even more difficult.
The workaround for this issue is to set the multiprocessing start method to forkserver before any access to HDFS.
Due to a similar reason (prevent MPI context being broken after fork), ChainerCV and Chainer examples apply the same workaround, and it works for PFIO+HDFS case, too.
https://github.com/chainer/chainercv/blob/master/examples/classification/train_imagenet_multi.py#L96-L100
https://github.com/chainer/chainer/blob/df53bff3f36920dfea6b07a5482297d27b31e5b7/examples/chainermn/imagenet/train_imagenet.py#L145-L148
Describe the abstract API like we do in ChainerMN. That would help developers understand how it behaves.
We need to test whether automatic filesystem detection also works on containers.
For example
files = ["posix.zip", "hdfs://hdfs.zip"]
for file in files:
    with chainerio.open_as_container(file) as f:
        print(f.read())
Currently, we have only checked the detection on the normal `open` operation.
As we added support for Python 3.5.2 again, we need to add Python 3.5.2 tests to CI, since the workaround needs to be checked for future PRs.
We should add a project description to avoid this empty page: https://pypi.org/project/chainerio/
https://chainerio.readthedocs.io/en/latest/reference.html#chainerio.list just says "Please note that the meaning of $HOME depends on each filesystem." It would be nice to clarify the differences between filesystems that cannot be absorbed by the library.
I found that the current pfio ZipContainer cannot handle a Zip file with DOS-compatible external attributes.
# Creation of hello_dos.zip is explained later.
$ zipinfo hello_dos.zip
Archive: hello_dos.zip
Zip file size: 318 bytes, number of entries: 2
drwx--- 2.0 fat 0 bx stor 20-Sep-09 16:34 FOO/
-rw---- 2.0 fat 6 tx stor 20-Sep-09 16:34 FOO/HELLO.TXT
2 files, 6 bytes uncompressed, 6 bytes compressed: 0.0%
$ python
>>> import pfio
>>> container = pfio.open_as_container('hello_dos.zip')
>>> container.isdir('FOO')
False # <--------- Supposed to be True because FOO is a directory in the zip
# Since it cannot recognize the directory, it cannot list the contents of the directory either
>>> list(container.list('FOO/'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "xxxxxxxxxxxxxxxxxxxxx/pfio/containers/zip.py", line 190, in list
"{} is not a directory".format(path_or_prefix))
NotADirectoryError: FOO is not a directory
...
Assuming Linux (Ubuntu 18.04) environment.
$ mkdir foo
$ echo hello > foo/hello.txt
# OK case: Unix attributes (using the UNIX zip command)
$ zip -r hello_unix.zip foo
adding: foo/ (stored 0%)
adding: foo/hello.txt (stored 0%)
# NG case: DOS attributes
$ zip -rk hello_dos.zip foo
adding: FOO/ (stored 0%)
adding: FOO/HELLO.TXT (stored 0%)
Here, the `-k` option to the `zip` command stands for storing DOS-like attributes in the external attribute field of a ZIP file.
-k
--DOS-names
Attempt to convert the names and paths to conform to MSDOS, store only the MSDOS attribute (just the user write attribute from Unix), and mark the entry as made under MSDOS (even though it was not); for compatibility with PKUNZIP under MSDOS which cannot handle certain names such as those with two dots.
What actually happens is explained in the ZIP file format specification:
If the external file attributes are compatible with MS-DOS and can be read by PKZIP for DOS version 2.04g then this value will be zero
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
In `zipfile`, a ZIP file created in the second way has a different `external_attr`, making the `ZipInfo` object slightly different.
>>> zipfile.ZipFile('hello_unix.zip').getinfo('foo/')
<ZipInfo filename='foo/' filemode='drwxrwxr-x' external_attr=0x10>
>>> zipfile.ZipFile('hello_dos.zip').getinfo('FOO/')
<ZipInfo filename='FOO/' external_attr=0x10>
Note that we cannot see `filemode` in the `ZipInfo` object for the DOS zip: `filemode` is parsed from the upper 2 bytes of `external_attr` (cf. zipfile.py#L393-L396), but in the DOS zip those bytes are filled with zero.
Since pfio (`ZipContainer.stat`) relies on the external attributes to decide whether a specified name is a directory, it cannot properly handle the directory.
>>> pfio.open_as_container('hello_unix.zip').stat('foo')
<ZipFileStat filename="foo/" mode="drwxrwxr-x">
>>> pfio.open_as_container('hello_unix.zip').isdir('foo')
True
>>> pfio.open_as_container('hello_dos.zip').stat('FOO')
<ZipFileStat filename="FOO/" mode="?---------">
>>> pfio.open_as_container('hello_dos.zip').isdir('FOO')
False
The directory check is done by simply checking the trailing '/' in CPython and in previous pfio versions, but I (yes, I did!) introduced the dependency on the external attributes field in #114...
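For reference, CPython's own `ZipInfo.is_dir()` (available since Python 3.6) uses only the trailing-slash check, so it works regardless of external attributes; a small self-contained demonstration:

```python
import io
import zipfile

# Build an in-memory zip with an explicit directory entry, mimicking
# the FOO/ entry from hello_dos.zip above.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("FOO/", b"")
    zf.writestr("FOO/HELLO.TXT", b"hello\n")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("FOO/")
    # ZipInfo.is_dir() checks only the trailing slash, so it is
    # robust to DOS-style external attributes.
    assert info.is_dir()
    assert not zf.getinfo("FOO/HELLO.TXT").is_dir()
```

Falling back to this check would restore the pre-#114 behavior for DOS zips.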
We need an API to make an archive (e.g. zip) on a target filesystem.
The current `list` in zip only lists the first level of files in the zip with the given prefix, to match the behavior of POSIX. However, listing all the files in a zip is needed in many cases, e.g. reading every file in a zip. We need a way to list all the files.
There are two bugs when listing a directory in a zip.
The ZIP:
tianqi@host:tianqi$ unzip -l test.zip
Archive: test.zip
Length Date Time Name
--------- ---------- ----- ----
0 2019-08-16 15:11 test/
12 2019-07-24 13:13 test/file
0 2019-08-16 15:11 test/dir/
0 2019-08-16 15:11 test/dir/nested_dir/
170 2019-07-24 13:16 test/file.zip
167 2019-01-16 16:53 python.py
--------- -------
349 6 files
Bug 1
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> list(handler.list('test/'))
['file', 'dir', 'file.zip', 'python.py']
where `python.py` is not in `test/`.
Bug 2
>>> import chainerio
>>> handler = chainerio.open_as_container('test.zip')
>>> list(handler.list('test'))
['python.py']
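For reference, first-level listing under a prefix can be sketched as follows; `list_dir` is a hypothetical helper, not chainerio's actual implementation, and the name list mirrors the test.zip above:

```python
# Hypothetical helper: first-level listing of zip entry names under a
# prefix, handling both 'test' and 'test/' consistently.
def list_dir(names, prefix):
    prefix = prefix.rstrip('/') + '/'
    seen = []
    for name in names:
        if name.startswith(prefix) and name != prefix:
            # Keep only the first path component below the prefix.
            first = name[len(prefix):].split('/')[0]
            if first and first not in seen:
                seen.append(first)
    return seen

names = ['test/', 'test/file', 'test/dir/', 'test/dir/nested_dir/',
         'test/file.zip', 'python.py']
assert list_dir(names, 'test') == ['file', 'dir', 'file.zip']
assert list_dir(names, 'test/') == ['file', 'dir', 'file.zip']
```

With this approach, `python.py` is never returned for the `test` prefix, and the trailing slash no longer changes the result.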
Problem Statement
Massive calls to the global functions, such as `chainerio.open` and `chainerio.list`, against HDFS might break the Kerberos ticket and cause the code to fail. Access through a handler, however, greatly reduces the probability of hitting the problem.
Examples of the Problem Code
while True:
# generating massive calls to chainerio.open
# we have massive updaters inside
with chainerio.open('hdfs:///user/README.txt', 'rb') as f:
# fails after the ticket is broken
f.read()
Error Message
20/03/17 11:51:51 WARN security.UserGroupInformation: Exception encountered while running the renewal command for xxxxxxx. (TGT end time:1584453111000, renewalFailures: 0, renewalFailuresTotal: 1)
ExitCodeException exitCode=1: kinit: Internal credentials cache error (filename: /tmp/krb5cc_xxxx) when
initializing cache
at org.apache.hadoop.util.Shell.runCommand(Shell.java:1009)
at org.apache.hadoop.util.Shell.run(Shell.java:902)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1227)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1321)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:1303)
at org.apache.hadoop.security.UserGroupInformation$TicketCacheRenewalRunnable.relogin(UserGroupI
nformation.java:1077)
at org.apache.hadoop.security.UserGroupInformation$AutoRenewalForUserCredsRunnable.run(UserGroup
Information.java:988)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Root Cause
Each call to a global function, such as `chainerio.open` or `chainerio.list`, creates an HDFS handler with a Kerberos ticket updater inside, which runs in a separate thread and periodically updates the Kerberos ticket cache. Massive calls result in a huge number of updaters running in parallel. Moreover, the Java Hadoop client also performs its own kind of ticket update. The lack of race control among the updaters can break the ticket data inside the Java Hadoop client and cause it to fail.
Currently, we still rely on our own ticket updater because, before Hadoop 3.0, the Java Hadoop client cannot update the ticket from the keytab set through the environment variable `KRB5_KTNAME`.
How to avoid
Avoid using the global functions when you need to call them massively. Instead, create a handler:
# one handler, one updater
handler = chainerio.create_handler("hdfs")
while True:
    with handler.open('/user/README.txt', 'rb') as f:
        # far less likely to fail
        f.read()
Please NOTE: using a handler CANNOT eliminate this issue; it only significantly reduces the probability by running a single updater instead of massive numbers of them.
tianqi:pfio$ flake8 pfio
pfio/__init__.py:68:48: F821 undefined name 'IOBase'
pfio/io.py:19:55: F821 undefined name 'IOBase'
pfio/io.py:82:47: F821 undefined name 'IOBase'
pfio/io.py:90:64: F821 undefined name 'IOBase'
pfio/io.py:109:52: F821 undefined name 'IOBase'
pfio/containers/zip.py:119:47: F821 undefined name 'IOBase'
pfio/containers/zip.py:127:64: F821 undefined name 'IOBase'
pfio/filesystems/hdfs.py:180:47: F821 undefined name 'IOBase'
pfio/filesystems/hdfs.py:188:64: F821 undefined name 'IOBase'
tianqi:pfio$ flake8 --version
3.8.2 (mccabe: 0.6.1, pycodestyle: 2.6.0, pyflakes: 2.2.0) CPython 3.7.4 on
Linux
tianqi:pfio$ git branch
bump_v_1_0_0
bump_version_0_1_2
* master
pr_114
pr_121
pr_125
refactor_zip_test
rename_to_pfio
tianqi:pfio$
A related issue to #36: as `peek` is not in `HdfsFile`, the same issue also happens when using HDFS. Unlike #36, where `peek` in POSIX was just hidden by the `FileObject`, `peek` needs to be added in the HDFS case.
$ python test.py
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/java/slf4j-simple.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
19/08/16 14:53:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
chainerio: 186.07044076919556
$ cat test.py
import time
import pickle
import chainerio
chainerio.set_root("hdfs")
cache_path = 'a_large_file.pkl'
start = time.time()
with chainerio.open(cache_path, 'rb') as f:
data = pickle.load(f)
print('chainerio: ', time.time() - start)
Wed Jul 24 19:43:47 2019 profile
466895442 function calls (466887441 primitive calls) in 161.088 seconds
Ordered by: internal time
List reduced from 2941 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 74.242 74.242 161.008 161.008 {built-in method _pickle.load}
233293409 46.669 0.000 84.175 0.000
xxx/versions/3.7.2/lib/python3.7/site-packages/chainerio/fileobject.py:50(read)
233293409 37.505 0.000 37.505 0.000 {method 'read' of '_io.BufferedReader' objects}
1452 0.384 0.000 0.388 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
When tests fail, files are not cleaned up well.
try:
> mkdir(name, mode)
E FileExistsError: [Errno 17] File exists: 'testlsdir/nested_dir1'
/usr/local/lib/python3.7/os.py:221: FileExistsError
=========================================================================== warnings summary ============================================================================
/home/kota/.local/lib/python3.7/site-packages/_pytest/mark/structures.py:324
/home/kota/.local/lib/python3.7/site-packages/_pytest/mark/structures.py:324: PytestUnknownMarkWarning: Unknown pytest.mark.gpu - is this a typo? You can register cus
tom marks to avoid this warning - for details, see https://docs.pytest.org/en/latest/mark.html
PytestUnknownMarkWarning,
-- Docs: https://docs.pytest.org/en/latest/warnings.html
========================================================= 3 failed, 89 passed, 8 xfailed, 1 warnings in 12.61s ==========================================================
~/chainerio$ ls
LICENSE README.md chainerio chainerio.egg-info docs examples setup.cfg setup.py testlsdir tests tox.ini
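One common fix (a sketch, not necessarily how this repository's tests should be structured) is to create test directories with `tempfile.TemporaryDirectory`, which removes them even when the test body raises:

```python
import os
import tempfile

# Create test directories under a TemporaryDirectory: they are removed
# even if the test body raises, so no 'testlsdir' leftovers remain.
with tempfile.TemporaryDirectory() as d:
    nested = os.path.join(d, "testlsdir", "nested_dir1")
    os.makedirs(nested)
    assert os.path.isdir(nested)

# After the with-block the whole tree is gone.
assert not os.path.exists(d)
```

pytest's `tmp_path` fixture achieves the same isolation per test function.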
Problem statement
Errors occur when multiple child processes read from an implicitly shared `ZipContainer` that was opened by the parent process before forking.
How to reproduce
import chainerio
from multiprocessing import Process
container = chainerio.open_as_container("dummy.zip")
# open a file inside the container to actually open the zip
# the zip will be implicitly shared after forking
with container.open("dummy_data") as f:
f.read()
def func():
# accessing the shared container
with container.open("dummy_data") as f:
f.read()
p1 = Process(target=func)
p2 = Process(target=func)
p1.start()
p2.start()
p1.join()
p2.join()
Error Message
Process Process-2:
Traceback (most recent call last):
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 10, in func
with container.open("dummy_data") as f:
File "tianqi/repository/chainerio/chainerio/io.py", line 20, in wrapper
errors, newline, closefd, opener)
File "tianqi/repository/chainerio/chainerio/containers/zip.py", line 88, in open
nested_file = self.zip_file_obj.open(file_path, "r")
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 1488, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
Process Process-1:
Traceback (most recent call last):
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 11, in func
f.read()
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 885, in read
buf += self._read1(self.MAX_N)
File "tianqi/.pyenv/versions/3.7.2/lib/python3.7/zipfile.py", line 975, in _read1
data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid block type
Root Cause
As mentioned in the code comments, the root cause is that multiple child processes access the same zip container that was opened by their parent process before forking. The underlying fd was copied when forking, while the actual file struct, which lives outside the process heap, was not. Hence, all the child processes read through the same fd of the shared zip and advance the shared file offset, causing all the processes to read broken data.
Why we need to prevent this in the ChainerIO layer
The problem can be eliminated if the user avoids opening the container before forking. However, such behavior can be hard to detect when the code gets complicated with classes and frameworks.
For example:
import chainerio
from multiprocessing import Process
class SharedZip(object):
def __init__(self, name_of_container):
self.container = None
self.name_of_container = name_of_container
def open_container(self):
self.container = chainerio.open_as_container(self.name_of_container)
def read_file(self, file_in_container):
if self.container is None:
self.open_container()
with self.container.open(file_in_container) as f:
return f.read()
-------
# from codes uses other frameworks
self.shared_zip = SharedZip("dummy.zip")
-------
# then, called by some routines
read_some_data = self.shared_zip.read_file("dummy_data")
------
# later on, after forking
# in child processes
# error occurs here
read_other_data = self.shared_zip.read_file("other_dummy_data")
If we can prevent it at the ChainerIO layer, such problems can be eliminated completely without any attention from users.
Solution 1
Detect that the process has been forked by comparing the `pid`:
- record the `pid` when opening the `zip_file_obj`
- check the `pid` before every call to the `zip_file_obj`, and reopen it if the `pid`s do not match
As the `pid` will be cached, the overhead of `os.getpid()` can be small:
http://man7.org/linux/man-pages/man2/getpid.2.html
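Solution 1 can be sketched as follows; `ForkSafeZip` and its members are hypothetical names, not ChainerIO's actual code:

```python
import os
import zipfile

class ForkSafeZip:
    """Hypothetical sketch of Solution 1; not ChainerIO's actual code."""

    def __init__(self, path):
        self.path = path
        self.pid = os.getpid()  # remembered at open time
        self.zip_file_obj = zipfile.ZipFile(path)

    def _check_fork(self):
        # Called before every access: if the pid changed, we are in a
        # forked child whose fd shares its offset with the parent, so
        # reopen the archive to get a private fd.
        if os.getpid() != self.pid:
            self.zip_file_obj = zipfile.ZipFile(self.path)
            self.pid = os.getpid()

    def open(self, name):
        self._check_fork()
        return self.zip_file_obj.open(name)
```

Closing the stale inherited `ZipFile` in the child is omitted here for brevity.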
Solution 2
Close the `zip_file_obj` when forking, e.g. via `__getstate__` / `__copy__` hooks.
The same solution might be applied to the HDFS connection.
ToDo
There is a lot of code redundancy in the filesystem and container tests. We need to refactor the tests to reduce the redundancy. Since the aim of ChainerIO is to unify the APIs and behaviors across different filesystems and containers, the tests should also be unified.
For containers like zip, it is a common need to list all the files, as discussed in #43.
To support such operations in a general way, i.e. at the API level, we have decided to add a `recursive` flag to `list`. When set, `list` returns a generator that walks through all the files and directories under the given path, similar to `os.walk`.
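For a POSIX directory, the intended behavior is roughly a flattened `os.walk()` yielding paths relative to the given root; a sketch:

```python
import os

# Sketch: a recursive listing for a local directory, analogous to
# flattening os.walk(); yields paths relative to the given root.
def recursive_list(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)
```

For zip containers, the same shape of output can be produced directly from the entry names.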
To-do
We need to remove Python 3.5.2 from our supported version list.
For more details:
https://stackoverflow.com/questions/42942867/optionaltypefoo-raises-typeerror-in-python-3-5-2
Memo for the requirements
In the purpose of filesystem/container abstraction, it is a common idiom for pfio:
if path.endswith('.zip'):
    fs = pfio.open_as_container(path)
else:
    fs = pfio
    pfio.set_root(path)  # <--- In some cases (especially when abstracting zip and filesystem) we need this.
However, `pfio.set_root` changes the global state and thus can sometimes be troublesome.
fs = PosixFileSystem(root='/path/to/dir')
would be a better way to solve this issue.
The small problem here is that `PosixFileSystem` is currently only used internally: it is not documented, and we need to spell out the fully qualified path (`pfio.filesystems.posix.PosixFileSystem`) to use this class. I'd like this class to be documented and accessible directly under the `pfio` module.
Another approach to realize the same thing is to support POSIX/HDFS directories in `open_as_container`, which would internally behave the same way (create a `PosixFileSystem` object with the given root).
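A standalone sketch of the proposed public interface: a handler scoped to a root directory, with no global `set_root()` state involved. pfio's real `PosixFileSystem` implements the full filesystem API; this cut-down version only illustrates the root-scoping idea:

```python
import os

class PosixFileSystem:
    """Cut-down illustrative sketch of a root-scoped POSIX handler;
    pfio's real class has many more methods."""

    def __init__(self, root='.'):
        self.root = root

    def open(self, path, mode='r'):
        # every path is resolved relative to this handler's root,
        # so two handlers with different roots cannot interfere
        return open(os.path.join(self.root, path), mode)

    def list(self):
        return iter(os.listdir(self.root))
```

Because the root lives on the instance rather than in module-level state, code that juggles a zip container and a directory can hold two independent handlers side by side.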
Problem Statement
Heimdal is a Kerberos variant other than MIT, and it is the one shipped on macOS. Heimdal's klist has a different output format, which breaks our parsing of the authentication information.
for example:
Line 30 in 29e5c6d
Error Message
~/.pyenv/versions/3.6.8/lib/python3.6/site-packages/krbticket/ticket.py in parse_from_klist(config, output)
140 file = lines[0].split(':')[2]
141 principal = lines[1].split(':')[1].strip()
--> 142 starting, expires, service_principal = lines[4].strip().split(' ')
143 if len(lines) > 5:
144 renew_expires = lines[5].strip().replace('renew until ', '')
ValueError: too many values to unpack (expected 3)
Output of klist of Heimdal
xxx@xxx[~] klist -c /tmp/krb5cc_xxx
Credentials cache: FILE:/tmp/krb5cc_xxx
Principal: xxxx@xxxxxxxxx
Issued Expires Principal
Aug 5 16:06:47 2020 Aug 12 16:06:47 2020 krbtgt/xxxxxxx@xxxxxxxxxx
Output of klist of MIT Kerberos
xxxxx@xxxxxxx:/opt/$ klist -c /tmp/krb5cc_xxxx
Ticket cache: FILE:/tmp/krb5cc_xxxx
Default principal: xxxxx@xxxxx
Valid starting Expires Service principal
08/04/20 12:33:52 08/11/20 12:33:52 krbtgt/xxxxxx@xxxxxx
renew until 09/01/20 12:33:52
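Comparing the two outputs above, Heimdal prints dates as four whitespace-separated tokens (`Aug 5 16:06:47 2020`) while MIT uses two (`08/04/20 12:33:52`), so a fixed `split(' ')` cannot handle both. A hedged sketch of format-aware parsing follows; the function name and token counts are assumptions drawn from the samples, not krbticket's actual API:

```python
def parse_ticket_line(line, heimdal):
    """Split one klist ticket line into (start, expiry, principal).
    Assumption from the sample outputs above: Heimdal dates span 4
    whitespace-separated tokens, MIT dates span 2."""
    tokens = line.split()          # split() collapses runs of spaces
    n = 4 if heimdal else 2
    start = ' '.join(tokens[:n])
    expiry = ' '.join(tokens[n:2 * n])
    principal = tokens[2 * n]
    return start, expiry, principal
```

Detecting which variant produced the output could key off the first header line (`Credentials cache:` for Heimdal vs `Ticket cache:` for MIT), per the samples above.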
Problem
Currently, once created, handlers are cached internally and reused. As a handler contains per-instance data such as `root`, operations on that data interfere across handlers. Hence, a brand-new handler should be created each time `create_handler` is called.
Example
>>> h1 = chainerio.create_handler("boo")
>>> h1
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
>>> h1.root
'boo'
>>> h2 = chainerio.create_handler("foo")
>>> h2
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
# the creation of h2 changes the data in h1, as they are actually the same object
>>> h1.root
'foo'
>>>
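The bug and the fix can be contrasted in a standalone sketch. The caching logic below is a guess at the cause inferred from the observed behavior, not pfio's actual code:

```python
class PosixFileSystem:
    # stand-in for pfio.filesystems.posix.PosixFileSystem
    def __init__(self, root=''):
        self.root = root

_cache = {}

def create_handler_current(root):
    # observed (buggy) behavior: one cached instance is reused and
    # mutated, so every "new" handler shares the same root
    handler = _cache.setdefault('posix', PosixFileSystem())
    handler.root = root
    return handler

def create_handler_fixed(root):
    # desired behavior: a brand-new handler on every call
    return PosixFileSystem(root=root)
```

With the cached version, creating `h2` silently rewrites `h1.root`, exactly as in the example above; the fixed version keeps each handler's state independent.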
Problem
We need to check whether the given scheme is supported; this check is currently missing.
Example
# 'boo' is not a supported scheme
# The call returns h1 as a POSIX (default) handler
# and the root is set to "boo"
>>> h1 = chainerio.create_handler('boo')
>>> h1
<chainerio.filesystems.posix.PosixFileSystem object at 0x2b2848e58eb8>
>>> print(h1.root)
boo
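A sketch of the missing check; the supported-scheme set and function name are assumptions for illustration:

```python
SUPPORTED_SCHEMES = {'posix', 'hdfs'}  # assumed set; pfio's may differ

def check_scheme(scheme):
    """Reject unknown schemes instead of silently falling back to a
    POSIX handler whose root is set to the scheme string."""
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported scheme: {!r}".format(scheme))
    return scheme
```

Raising early turns the confusing `root == 'boo'` behavior above into an immediate, explicit error.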
Problem statement
If HDFS calls were made in the main process before forking, the program freezes at any future HDFS call.
For example:
import chainerio
import multiprocessing
def func(hdfs_handler):
    # freezes here (2)
    print(list(hdfs_handler.list()))

hdfs_handler = chainerio.create_handler("hdfs")
# creates an HDFS connection internally (1)
print(list(hdfs_handler.list()))
p = multiprocessing.Process(target=func, args=(hdfs_handler, ))
p.start()
p.join()
Cause
ChainerIO uses the pyarrow module to access HDFS internally, and pyarrow in turn uses the HDFS Java client. The HDFS connection is pooled inside; if the connection is first created (implicitly through calls to HDFS, like (1)) and the process then forks, the pooling breaks and future HDFS calls freeze.
Solution
Fork before the creation of the HDFS connection, i.e. fork before any call to HDFS such as (1).
`set_root()` accepts a URI or a handler, but the behavior of the following examples is unclear.
>>> chainerio.set_root('')
>>> chainerio.set_root('foo')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/kuenishi/Library/Python/3.7/lib/python/site-packages/chainerio/__init__.py", line 172, in set_root
default_context.set_root(uri_or_handler)
File "/Users/kuenishi/Library/Python/3.7/lib/python/site-packages/chainerio/_context.py", line 82, in set_root
raise RuntimeError("the URI does not point to a directory")
RuntimeError: the URI does not point to a directory
>>> chainerio.set_root('hdfs')
>>>
- `''` is not a path, but no error is reported.
- Is `'foo'` supposed to mean a local directory relative to the "root" path, or relative to the current working directory?
- `'hdfs'` is supposed to mean a relative path to a local directory, but ChainerIO seems to interpret it as the `hdfs:///` URI.

As Chainer has entered its maintenance phase, we are planning to rename this library to `pfio`.
But to reduce the porting effort in user code, we can keep the `chainerio` name and let users import it under both names.
This is a note for an offline discussion: a new design for the open wrapper to replace the current `fileobject`.
In the current implementation, a decorator `open_wrapper` wraps the original file object returned by open into a `fileobject`.
During `__init__`, the `fileobject` further wraps the original file object according to the underlying filesystem, adding any missing functionality needed to make it behave like a built-in POSIX file object.
Such a `fileobject` is also a base for profiling.
However, this design causes several problems:
- Because of the `peek` function in `fileobject`, pickle shows a huge performance degradation (#36).
- Because of the missing `seekable` function, the nested zip container does not work (#34).
- Without `peek` in HDFS, the same problem as in #36 would happen when using HDFS.

We have decided on the 2nd of the two solutions discussed: do not wrap everything with `fileobject`; instead, wrap only when it is necessary. The 1st solution would have been a huge amount of work, and it is unnecessary for now as the profiling module is not ready yet.
Concretely: extract the functionality that wraps the original file object out of `fileobject.__init__` into a function called by the open wrapper. By doing so, we can still wrap the file object when necessary while avoiding wrapping everything with `fileobject`.
This new design is supposed to fix #36 and #34.
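The wrap-only-when-necessary design can be sketched as a free function consulted by the open wrapper; `maybe_wrap` and `_SeekableShim` are hypothetical names, and the single capability checked here stands in for the fuller set pfio cares about:

```python
class _SeekableShim:
    """Hypothetical minimal shim: adds a seekable() method to file-like
    objects that lack one, delegating everything else to the raw object."""

    def __init__(self, raw):
        self._raw = raw

    def seekable(self):
        return hasattr(self._raw, 'seek')

    def __getattr__(self, name):
        return getattr(self._raw, name)

def maybe_wrap(fileobj):
    # Wrap only when a capability is missing; a file object that already
    # behaves like a built-in one is returned untouched, so pickle and
    # peek-related overheads (#36) never apply to it.
    if hasattr(fileobj, 'seekable'):
        return fileobj
    return _SeekableShim(fileobj)
```

Returning the raw object unchanged in the common case is what avoids the blanket `fileobject` overhead the issues above complain about.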
It seems the handler set via `set_root` is ignored in `pfio.mkdir`.

>>> import pfio
>>> pfio.set_root(pfio.create_handler('hdfs'))  # or pfio.set_root('hdfs')
>>> pfio.mkdir('test/test')  # this is treated as a POSIX path
>>> pfio.open('test.txt')  # this is treated as an HDFS path
(I was just playing around these APIs so this issue is low-priority (at least for me))
We want to add `chmod` to support setting file modes.