megvii-research / megfile

Megvii FILE Library - Working with files in Python the same way as the standard library

Home Page: http://megvii-research.github.io/megfile

License: Apache License 2.0

Languages: Makefile 0.15%, Python 99.85%
Topics: python, file, s3, streaming, hdfs, oss, sftp

megfile's Introduction

megfile - Megvii FILE library


megfile provides a silky operation experience with different backends (currently including the local file system and s3), which enables you to focus on the logic of your own project instead of the question "Which backend is used for this file?"

megfile provides:

  • An almost unified file system operation experience. A target path can easily be moved from the local file system to s3 (see the sketch after this list).
  • Complete boundary-case handling. megfile helps you easily handle even the most difficult boundary conditions, including ones you might never think of.
  • Perfect type hints and built-in documentation, so you can enjoy your IDE's auto-completion and static checking.
  • Semantic versioning and an upgrade guide, which let you enjoy the latest features easily.
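
A minimal sketch of that portability, using smart_copy (the bucket and paths below are hypothetical):

from megfile import smart_copy

# the same call works regardless of backend; only the path string changes
smart_copy('/tmp/model.pkl', '/backup/model.pkl')      # local -> local
smart_copy('/tmp/model.pkl', 's3://bucket/model.pkl')  # local -> s3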

megfile's advantages are:

  • smart_open can open resources that use various protocols. In particular, the s3 reader / writer in megfile is multi-threaded, which makes it faster than known competitors.
  • smart_glob is available on the majority of protocols, and it supports the zsh extended pattern syntax of {}, e.g. s3://bucket/video.{mp4,avi}.
  • All-inclusive functions like smart_exists / smart_stat / smart_sync. If you can't find the function you want, submit an issue.
  • Compatible with the pathlib.Path interface; see SmartPath and the other protocol classes like S3Path.

Support Protocols

  • fs (local filesystem)
  • s3
  • sftp
  • http
  • stdio
  • hdfs: pip install 'megfile[hdfs]'

Quick Start

A path string in megfile generally has the form protocol://path/to/file, for example s3://bucketA/key. The sftp path format is a little different: sftp://[username[:password]@]hostname[:port]//file_path; for a relative path, replace //file_path with /file_path. Here's an example of writing a file to s3 / fs, syncing it to local, reading it and finally deleting it.
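
For instance (the hostname, username and file paths below are hypothetical):

sftp://user:pass@host.example.com:22//data/file.txt   # absolute path: /data/file.txt
sftp://user@host.example.com/data/file.txt            # relative path: data/file.txt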

Functional Interface

from megfile import smart_open, smart_exists, smart_sync, smart_remove, smart_glob

# open a file in s3 bucket
with smart_open('s3://playground/megfile-test', 'w') as fp:
    fp.write('megfile is not silver bullet')

# test if file in s3 bucket exist
smart_exists('s3://playground/megfile-test')

# or in local file system
smart_exists('/tmp/playground/megfile-test')

# copy files or directories
smart_sync('s3://playground/megfile-test', '/tmp/playground/megfile-test')

# remove files or directories
smart_remove('s3://playground/megfile-test')

# glob files or directories in s3 bucket
smart_glob('s3://playground/megfile-?.{mp4,avi}')
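
Continuing the example above, you can read back the local copy produced by smart_sync (a minimal sketch reusing the path and content written above):

from megfile import smart_open

# read back the local copy produced by smart_sync
with smart_open('/tmp/playground/megfile-test', 'r') as fp:
    assert fp.read() == 'megfile is not silver bullet'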

SmartPath Interface

SmartPath has an interface similar to pathlib.Path.

from megfile.smart_path import SmartPath

path = SmartPath('s3://playground/megfile-test')
if path.exists():
    with path.open() as f:
        result = f.read(7)
        assert result == b'megfile'

Command Line Interface

$ megfile --help  # see what you can do

$ megfile ls s3://playground/
$ megfile ls -l -h s3://playground/

$ megfile cat s3://playground/megfile-test

$ megfile cp s3://playground/megfile-test /tmp/playground/megfile-test

Installation

PyPI

pip3 install megfile

You can specify the megfile version as well:

pip3 install "megfile~=0.0"

Build from Source

megfile can be installed from source:

git clone git@github.com:megvii-research/megfile.git
cd megfile
pip3 install -U .

Development Environment

git clone git@github.com:megvii-research/megfile.git
cd megfile
pip3 install -r requirements.txt -r requirements-dev.txt

Configuration

Using s3 as an example, the following describes the configuration methods. For more details, please refer to Configuration.

You can use environment variables or a configuration file; environment variables take precedence over the configuration file.

Use environment variables

You can use environment variables to set up authentication credentials for your s3 account:

  • AWS_ACCESS_KEY_ID: access key
  • AWS_SECRET_ACCESS_KEY: secret key
  • OSS_ENDPOINT / AWS_ENDPOINT_URL_S3 / AWS_ENDPOINT_URL: endpoint url of s3
  • AWS_S3_ADDRESSING_STYLE: addressing style
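
A minimal sketch of setting these from Python (the credential values and endpoint are placeholders; set the variables before megfile first creates its s3 client):

import os

# placeholder credentials -- replace with your own
os.environ['AWS_ACCESS_KEY_ID'] = 'accesskey'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'secretkey'
os.environ['OSS_ENDPOINT'] = 'http://oss-cn-hangzhou.aliyuncs.com'

# import after the environment is configured
from megfile import smart_exists
smart_exists('s3://playground/megfile-test')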

Use command

You can easily update the config file with the megfile command: megfile config s3 [OPTIONS] AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY

$ megfile config s3 accesskey secretkey

# for aliyun
$ megfile config s3 accesskey secretkey \
    --addressing-style virtual \
    --endpoint-url http://oss-cn-hangzhou.aliyuncs.com

The resulting configuration can be found in ~/.aws/credentials, like:

[default]
aws_access_key_id = accesskey
aws_secret_access_key = secretkey

s3 =
    addressing_style = virtual
    endpoint_url = http://oss-cn-hangzhou.aliyuncs.com

How to Contribute

  • We welcome everyone to contribute code to the megfile project, but the contributed code should meet the following conditions as much as possible:

    You can submit code even if it doesn't meet all the conditions; the project members will evaluate it and assist you in making changes

    • Code format: Your code needs to pass the code format check. megfile uses yapf as its lint tool, with the version locked at 0.27.0. The version lock may be removed in the future

    • Static check: Your code needs complete type hints. megfile uses pytype as its static check tool. If pytype fails, use # pytype: disable=XXX to suppress the error, and please tell us why you disabled it.

      Note: Because pytype doesn't support variable type annotations, the variable type hint syntax introduced in Python 3.6 cannot be used.

      i.e. variable: int is invalid; replace it with variable  # type: int

    • Test: Your code needs complete unit test coverage. megfile uses pyfakefs and moto as the local file system and s3 virtual environments in its unit tests. Newly added code should have complete unit tests to ensure its correctness; see the sketch after this list

  • You can help to improve megfile in many ways:

    • Write code.
    • Improve documentation.
    • Report or investigate bugs and issues.
    • If you find any problem or have any suggestion for improvement, submit a new issue as well. We will reply as soon as possible and evaluate whether to adopt it.
    • Review pull requests.
    • Star megfile repo.
    • Recommend megfile to your friends.
    • Any other form of contribution is welcomed.
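
A minimal sketch of such a unit test (assuming pytest conventions and a moto 4.x-style mock_s3 decorator; the bucket name is hypothetical, and whether megfile's internally cached s3 client picks up the mock may depend on versions):

import os

import boto3
from moto import mock_s3

# fake credentials so boto3 never touches real AWS
os.environ.setdefault('AWS_ACCESS_KEY_ID', 'testing')
os.environ.setdefault('AWS_SECRET_ACCESS_KEY', 'testing')


@mock_s3
def test_smart_exists_on_s3():
    # create a bucket and an object inside moto's virtual s3 environment
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='playground')
    s3.put_object(Bucket='playground', Key='megfile-test', Body=b'megfile')

    # import under the mock so megfile's client is created against moto
    from megfile import smart_exists
    assert smart_exists('s3://playground/megfile-test')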

megfile's People

Contributors

bbtfr, hugoahoy, leavers, loveeatcandy, today-debug, xyb, yujianpeng66


megfile's Issues

Behavior of CLI cp

Currently, the CLI copy behavior is that megfile cp src dst copies src to dst; if dst is a directory, it raises an "is a directory" error, which differs from cp.
The expected behavior is to be consistent with cp:

  1. If the destination is a directory, copy the file into it under its basename
  2. -T / --no-target-directory keeps the current behavior

mv appears to behave the same way

The st_mode of the object returned by smart_stat is None

I replaced os.lstat and os.stat with megfile.smart_stat via a hack, but found that because the result of smart_stat on an s3 file has no st_mode, tarfile cannot be used

~/anaconda3/lib/python3.8/tarfile.py in gettarinfo(self, name, arcname, fileobj)
   1838                                 
   1839         stmd = statres.st_mode
-> 1840         if stat.S_ISREG(stmd):                                                   
   1841             inode = (statres.st_ino, statres.st_dev)
   1842             if not self.dereference and statres.st_nlink > 1 and \
                                                       
TypeError: an integer is required           

In [5]: megfile.smart_stat('s3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-10-13_43_01.47646')
Out[5]: StatResult(size=16479269, ctime=0.0, mtime=1678783895.183064, isdir=True, islnk=False, extra=None)

In [6]: megfile.smart_stat('/tmp')
Out[6]: StatResult(size=8049368, ctime=1677743737.520099, mtime=1679212536.7508314, isdir=True, islnk=False, extra=os.stat_result(st_mode=17407, st_ino=1, st_dev=47, st_nlink=18, st_uid=0, st_gid=0, st_size=380, st_atime=1679209864, st_mtime=1679212536, st_ctime=1679212536))

patch botocore _make_api_call

_make_request does not take effect for ClientError errors such as "An error occurred (503) when calling the HeadObject operation (reached max retries: 4): Service Unavailable"

feat request: support http/https download function

For example,

import megfile

path = "https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl"
megfile.smart_copy(path, local_path)
# or new interface like: megfile.smart_cache(path, local_path)

The result returned by megfile.smart_scandir(".") does not support the with statement

import os, megfile
with os.scandir("/tmp/") as itr:
    print(list(itr))
# prints normally

with megfile.smart_scandir("/tmp/") as itr:
    print(list(itr))

# raises an error:
Traceback (most recent call last):

  File "/tmp/ipykernel_3560/2555088509.py", line 1, in <module>
    with megfile.smart_scandir("/tmp/") as itr:

AttributeError: __enter__

ValueError: unacceptable mode: 'w+b'

from megfile import smart_open
with smart_open('/tmp/test-open-w-p', 'w+b') as f:
    f.write(b"test")
# runs normally

with smart_open('s3://yl-share/tmp/tmp/test-open-w-p', 'w+b') as f:
    f.write(b"test")
# ValueError: unacceptable mode: 'w+b'

smart_open is incompatible with pipes opened via open

After setting builtins.open = smart_open, I found that smart_open doesn't support pipes. Minimal reproduction:

import os, sys
from megfile import smart_open as open

print ("The child will write text to a pipe and ")
print ("the parent will read the text written by child...")

# file descriptors r, w for reading and writing
r, w = os.pipe() 

processid = os.fork()
if processid:
    # parent process
    # close file descriptor w
    os.close(w)
    r = open(r)
    print ("Parent reading")
    str = r.read()
    print ("text =", str)
    # sys.exit(0)
else:
    # child process
    os.close(r)
    w = open(w, 'w')
    print ("Child writing")
    w.write("Text written by child...")
    w.close()
    print ("Child closing")
    sys.exit(0)
Traceback (most recent call last):

  File "/home/dl/megvii/project/ai_asrs/jinyu_data_code/analysis_domain_gap.py", line 26, in <module>
    r = open(r)

  File "/home/dl/mygit/megfile/megfile/smart.py", line 436, in smart_open
    return SmartPath(path).open(mode, **options)

  File "/home/dl/mygit/megfile/megfile/smart_path.py", line 37, in __init__
    pathlike = self._create_pathlike(path)

  File "/home/dl/mygit/megfile/megfile/smart_path.py", line 64, in _create_pathlike
    protocol, path_without_protocol = cls._extract_protocol(path)

  File "/home/dl/mygit/megfile/megfile/smart_path.py", line 59, in _extract_protocol
    raise ProtocolNotFoundError('protocol not found: %r' % path)

ProtocolNotFoundError: protocol not found: 67

handling gzip encoded http responses?

A lot of content on the web is hosted with gzip encoding.

For example, the following command produces a lot of binary output, since GitHub responds with gzip encoding by default.

megfile head https://raw.githubusercontent.com/huyng/table-bench/main/table.csv

To see the raw text, you have to pipe it through gunzip like so:

megfile head https://raw.githubusercontent.com/huyng/table-bench/main/table.csv | gunzip

Do you plan on supporting this gzip streaming in your library? Thank you for creating this amazing tool.

Add performance benchmarks

To prove that the multi-threaded s3 reader / writer in megfile is faster than known competitors.

megfile sync cannot sync directories

CLI

megfile sync s3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-07-13_39_37.35595 /tmp/asdf

[S3IsADirectoryError] Is a directory: 's3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-07-13_39_37.35595/'

API

smart_sync('s3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-07-13_39_37.35595', '/tmp/megfile-test')


~/anaconda3/lib/python3.8/site-packages/megfile/s3_path.py in s3_download(src_url, dst_url, followlinks, callback)
    909 
    910     if not S3Path(src_url).is_file():
--> 911         raise S3IsADirectoryError('Is a directory: %r' % src_url)
    912 
    913     dst_url = fspath(dst_url)

S3IsADirectoryError: Is a directory: 's3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-07-13_39_37.35595/'

Cannot use s3 related functions

Sysinfo:

(base) ubuntu@ip-10-53-8-252:~$ pip show megfile
Name: megfile
Version: 2.2.9.post3
Summary: Megvii file operation library
Home-page: https://github.com/megvii-research/megfile
Author: megvii
Author-email: [email protected]
License: 
Location: /home/ubuntu/miniconda3/lib/python3.10/site-packages
Requires: boto3, botocore, paramiko, pyyaml, requests, tqdm
Required-by: 

(base) ubuntu@ip-10-53-8-252:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

Issue:

Tried using it in the CLI and here's what happened:

(base) ubuntu@ip-10-53-8-252:~$ aws s3 ls s3://dataset-ingested/user-preference/
                           PRE pref_100k_min513x768/
                           PRE pref_100k_min513x768_YIELD/
                           PRE sd-human-ft/
                           PRE sd-user-pref-50k-ft-gpt/
                           PRE sd-user-pref-75k-ft/
                           PRE sd-user-pref-v2-large-full/
2023-07-04 00:41:59          0 
2023-07-11 06:16:49       1444 README.md
(base) ubuntu@ip-10-53-8-252:~$ 
(base) ubuntu@ip-10-53-8-252:~$ megfile ls s3://dataset-ingested/user-preference/

[S3UnknownError] Unknown error encountered: 's3://dataset-ingested/user-preference/', error: botocore.exceptions.ClientError('An error occurred (PermanentRedirect) when calling the ListObjectsV2 operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.'), endpoint: 'https://s3.amazonaws.com'
(base) ubuntu@ip-10-53-8-252:~$ 

The AWS credentials have been configured with aws configure and work with the AWS CLI.

Also tried using it in Python:

from megfile import smart_walk

s3_directory = 's3://dataset-ingested/user-preference/'

# Walking through the directory
for root, dirs, files in smart_walk(s3_directory):
    print(f"Current directory: {root}")
    print(f"Subdirectories: {dirs}")
    print(f"Files: {files}")
    print("-" * 20)

error message:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
File ~/miniconda3/lib/python3.10/site-packages/megfile/s3_path.py:1534, in S3Path.is_dir(self, followlinks)
   1533 try:
-> 1534     resp = self._client.list_objects_v2(
   1535         Bucket=bucket, Prefix=prefix, Delimiter='/', MaxKeys=1)
   1536 except Exception as error:

File ~/miniconda3/lib/python3.10/site-packages/botocore/client.py:535, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    534 # The "self" in this scope is referring to the BaseClient.
--> 535 return self._make_api_call(operation_name, kwargs)

File ~/miniconda3/lib/python3.10/site-packages/botocore/client.py:980, in BaseClient._make_api_call(self, operation_name, api_params)
    979     error_class = self.exceptions.from_code(error_code)
--> 980     raise error_class(parsed_response, operation_name)
    981 else:

ClientError: An error occurred (PermanentRedirect) when calling the ListObjectsV2 operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

The above exception was the direct cause of the following exception:

S3UnknownError                            Traceback (most recent call last)
/home/ubuntu/dev/data-processings/ingested_gdl_twitter_processings.ipynb Cell 18 line 6
      3 s3_directory = 's3://dataset-ingested/user-preference/'
      5 # Walking through the directory
----> 6 for root, dirs, files in smart_walk(s3_directory):
      7     print(f"Current directory: {root}")
      8     print(f"Subdirectories: {dirs}")

File ~/miniconda3/lib/python3.10/site-packages/megfile/s3_path.py:2040, in S3Path.walk(self, followlinks)
   2037 if not bucket:
   2038     raise UnsupportedError('Walk whole s3', self.path_with_protocol)
-> 2040 if not self.is_dir():
   2041     return
   2043 stack = [key]

File ~/miniconda3/lib/python3.10/site-packages/megfile/s3_path.py:1540, in S3Path.is_dir(self, followlinks)
   1537     error = translate_s3_error(error, self.path_with_protocol)
   1538     if isinstance(error,
   1539                   (S3UnknownError, S3ConfigError, S3PermissionError)):
-> 1540         raise error
   1541     return False
   1543 if not key:  # bucket is accessible

S3UnknownError: Unknown error encountered: 's3://dataset-ingested/user-preference/', error: botocore.exceptions.ClientError('An error occurred (PermanentRedirect) when calling the ListObjectsV2 operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.'), endpoint: 'https://s3.amazonaws.com'

The bucket I'm trying to access is in the same region as my configuration (from aws configure). Any clue on what happened? Thanks.

Release wheel packages including `libjfs.so`

The wheel package released on PyPI is missing the libjfs.so file. Having people install the .so file manually is not friendly, especially as JuiceFS does not provide a way to install it system-wide either. Therefore, it would be useful if megfile shipped them all together in one package.

megfile refactoring and optimization

  • Move the s3, fs, http and stdio functions into classes, and auto-generate s3.py, fs.py, http.py, stdio.py
  • Auto-generate smart_path.py and smart.py; once smart knows which protocol is in use, raise an error when the arguments are invalid
  • Optimize s3 requests: add methods that take a cache parameter, so the cache can be reused within a single method

Support symlinks on s3

Currently megfile treats a symlink on the file system as a regular file, and symlinks do not exist on s3 at all.
Expected change: symlink behavior on the file system should be consistent with the standard library, and on s3 a symlink should be marked with a special key in the object headers, mimicking file-system symlink behavior

Request: an official implementation of smart_lstat

~/anaconda3/lib/python3.8/tarfile.py in gettarinfo(self, name, arcname, fileobj)                
   1830         if fileobj is None:                                                             
   1831             if not self.dereference:                                                    
-> 1832                 statres = os.lstat(name)                                                
   1833             else:                                                                       
   1834                 statres = os.stat(name)                                                 
                                                                                                
FileNotFoundError: [Errno 2] No such file or directory: 's3://yl-dataset/carton/2301_huafeng/raw_2scan/2023-03-07-13_39_37.35595'


megfile.smart_glob missing s3 profile

When I use megfile.smart_glob with a profile:

megfile.smart_glob('s3+profile://data/*')

megfile/s3_path.py:463 will access the wrong bucket, because top_dir does not include the profile

Improve the performance of s3_glob

The current s3_glob implementation performs poorly on patterns such as

  • a/{b,c}
  • a/{b,c}*
  • {a,b}/c

and it needs improvement

smart_copy raises an error on v1.0.0

The following script raises an error on v1.0.0 but runs fine on v0.1.2:

import megfile
url = 'https://data.megengine.org.cn/models/weights/resnet50_fbaug_76254_4e14b7d1.pkl'
dst = './xxx.pkl'
megfile.smart_copy(url, dst)

