alluxio / alluxio-py

Alluxio Python client - Access Any Data Source with Python

License: Apache License 2.0

alluxio-py's Introduction

Alluxio Python Library

This repo contains the Alluxio Python API for interacting with Alluxio servers, bridging the gap between computation frameworks and underlying storage systems. The module provides a convenient interface for file system operations such as reading, writing, and listing files in an Alluxio cluster.

Features

  • Directory listing and file status fetching
  • Putting data into the Alluxio system cache and reading it back (including range reads)
  • Alluxio system load operations with progress tracking
  • Dynamic Alluxio worker membership (periodic refresh from ETCD, or manually specified worker hosts)

Limitations

The Alluxio Python library only reads data that is already cached in Alluxio. The data must first be either

  • loaded into Alluxio servers via load operations, or
  • put into Alluxio servers via the write_page operation.

If you need to read from storage systems directly with Alluxio's on-demand caching capabilities, please use alluxiofs instead.

Installation

Install from source

cd alluxio-python-library
python setup.py sdist bdist_wheel
pip install dist/alluxio_python_library-0.1-py3-none-any.whl

Usage

Initialization

Import and initialize the AlluxioFileSystem class:

# Assumed import path: the class is taken to be exported by the top-level package
from alluxio import AlluxioFileSystem

# Minimum setup for Alluxio with ETCD membership service
alluxio_fs = AlluxioFileSystem(etcd_hosts="localhost")

# Minimum setup for Alluxio with user-defined worker list
alluxio_fs = AlluxioFileSystem(worker_hosts="worker_host1,worker_host2")

# Minimum setup for Alluxio with self-defined page size
alluxio_fs = AlluxioFileSystem(
    etcd_hosts="localhost",
    options={"alluxio.worker.page.store.page.size": "20MB"},
)

# Minimum setup for Alluxio with ETCD membership service with username/password
options = {
    "alluxio.etcd.username": "my_user",
    "alluxio.etcd.password": "my_password",
    "alluxio.worker.page.store.page.size": "20MB",  # Any other options should be included here
}
alluxio_fs = AlluxioFileSystem(
    etcd_hosts="localhost",
    options=options,
)
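
The option dictionaries above can also be assembled by a small helper. The following is a minimal sketch: build_alluxio_options is a hypothetical name that is not part of the library, and the rule that the ETCD username and password must be given together is an assumption. Only the three property keys come from the examples above.

```python
from typing import Dict, Optional


def build_alluxio_options(
    page_size: Optional[str] = None,
    etcd_username: Optional[str] = None,
    etcd_password: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble an options dict from the property keys shown above.

    Hypothetical helper, not part of alluxio-py.
    """
    # Assumption: credentials only make sense as a pair.
    if (etcd_username is None) != (etcd_password is None):
        raise ValueError("etcd username and password must be set together")

    options: Dict[str, str] = {}
    if etcd_username is not None and etcd_password is not None:
        options["alluxio.etcd.username"] = etcd_username
        options["alluxio.etcd.password"] = etcd_password
    if page_size is not None:
        options["alluxio.worker.page.store.page.size"] = page_size
    return options


options = build_alluxio_options(
    page_size="20MB",
    etcd_username="my_user",
    etcd_password="my_password",
)
# The result would then be passed as
# AlluxioFileSystem(etcd_hosts="localhost", options=options).
```

Keeping the keys in one place avoids typos in the long property names when several clients are configured.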

Load Operations

Dataset metadata and data that live in the Alluxio under storage must be loaded into the Alluxio system cache before end users can read them. Run the load operations before executing the read commands.

# Start a load operation
load_success = alluxio_fs.load('s3://mybucket/mypath/file')
print('Load successful:', load_success)

# Check load progress
progress = alluxio_fs.load_progress('s3://mybucket/mypath/file')
print('Load progress:', progress)

# Stop a load operation
stop_success = alluxio_fs.stop_load('s3://mybucket/mypath/file')
print('Stop successful:', stop_success)
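
Because a load runs asynchronously, a caller may want to wait for it before reading. The sketch below polls load_progress until a caller-supplied predicate says the load is finished. wait_for_load and is_done are hypothetical names, and the format of the value returned by load_progress is not specified here, which is why its interpretation is left to the caller.

```python
import time
from typing import Any, Callable


def wait_for_load(
    fs: Any,
    path: str,
    is_done: Callable[[Any], bool],
    poll_interval: float = 1.0,
    timeout: float = 60.0,
) -> Any:
    """Poll fs.load_progress(path) until is_done(progress) is true.

    Hypothetical helper: `fs` is any object with a load_progress(path)
    method, and `is_done` encodes how the caller interprets the result.
    """
    deadline = time.monotonic() + timeout
    while True:
        progress = fs.load_progress(path)
        if is_done(progress):
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError(f"load of {path} did not finish within {timeout}s")
        time.sleep(poll_interval)
```

The helper always checks progress at least once, and it raises TimeoutError rather than returning a partial result, so a read is never attempted on a load that was not confirmed complete.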

(Advanced) Page Write

The Alluxio system cache can be used as a key-value cache. As an alternative to load operations, data can be written to the cache with the write_page command and then read back from it.

success = alluxio_fs.write_page('s3://mybucket/mypath/file', page_index, page_bytes)
print('Write successful:', success)
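
To write a whole buffer, the data has to be cut into pages first. The sketch below does this with two hypothetical helpers, iter_pages and write_all_pages. It assumes that page indices start at 0, that page_size is given in bytes and matches the page size configured on the cluster, and that write_page returns a truthy value on success, as the example above suggests.

```python
from typing import Any, Iterator, Tuple


def iter_pages(data: bytes, page_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (page_index, page_bytes) pairs; the last page may be shorter."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for index, start in enumerate(range(0, len(data), page_size)):
        yield index, data[start:start + page_size]


def write_all_pages(fs: Any, path: str, data: bytes, page_size: int) -> bool:
    """Write every page through fs.write_page, stopping at the first failure.

    Hypothetical helper: `fs` is any object exposing
    write_page(path, page_index, page_bytes).
    """
    for page_index, page_bytes in iter_pages(data, page_size):
        if not fs.write_page(path, page_index, page_bytes):
            return False
    return True
```

Stopping at the first failed page keeps the caller from assuming a file is fully cached when only a prefix was written.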

Directory Listing

List the contents of a directory:

"""
contents = alluxio_fs.listdir('s3://mybucket/mypath/dir')
print(contents)

Get File Status

Retrieve the status of a file or directory:

status = alluxio_fs.get_file_status('s3://mybucket/mypath/file')
print(status)

File Reading

Read the entire content of a file:

"""
Reads a file.

Args:
    file_path (str): The full ufs file path to read data from

Returns:
    file content (str): The full file content
"""
content = alluxio_fs.read('s3://mybucket/mypath/file')
print(content)

Read a specific range of a file:

content = alluxio_fs.read_range('s3://mybucket/mypath/file', offset, length)
print(content)
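
Range reads make it possible to process a large file piece by piece instead of holding it in memory at once. The sketch below is a hypothetical helper, iter_ranges, that walks a file in fixed-size chunks. It assumes that offset and length are byte counts and that the caller already knows the total file length, since the format of the file status is not described here.

```python
from typing import Any, Iterator


def iter_ranges(fs: Any, path: str, total_length: int, chunk_size: int) -> Iterator[Any]:
    """Yield consecutive chunks via fs.read_range(path, offset, length).

    Hypothetical helper: `fs` is any object exposing
    read_range(path, offset, length).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < total_length:
        # The final chunk is shortened so the read never passes the end.
        length = min(chunk_size, total_length - offset)
        yield fs.read_range(path, offset, length)
        offset += length
```

A caller could, for example, feed each chunk to a hash or a streaming parser as it arrives.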

Development

See Contributions for guidelines around making new contributions and reviewing them.

alluxio-py's People

Contributors

chunxutang, gilesalderson, jja725, luqqiu, pedersen, psolomin, rastogiasr, reiz, soliverr, ssyssy

alluxio-py's Issues

Provide support for using tox for automated tests

This ticket is to allow an easy way to get automated tests for the various python versions that are needed. And the issue is being made specifically to give me a branch number/name for my pull request I'm going to submit.

Alluxio client as context manager that closes session.

Ideally, shouldn't the Alluxio Client be used as a context manager that opens the requests.Session() as a context manager and closes this session properly (closing all the connection adapters) when the Client exits?

Currently the session object is kept open.
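
The requested behaviour could look roughly like the sketch below. ManagedClient is a hypothetical wrapper, not the library's client. It takes a session factory so that it works with any object that has a close() method, such as requests.Session.

```python
from typing import Any, Callable, Optional


class ManagedClient:
    """Hypothetical wrapper that owns a session and closes it on exit."""

    def __init__(self, session_factory: Callable[[], Any]) -> None:
        self._session_factory = session_factory
        self.session: Optional[Any] = None

    def __enter__(self) -> "ManagedClient":
        # Open the session only when the with-block is entered.
        self.session = self._session_factory()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Close the session even if the block raised.
        if self.session is not None:
            self.session.close()
            self.session = None
        return False  # never suppress exceptions from the block
```

With the requests package installed, usage would be `with ManagedClient(requests.Session) as client: ...`, and the session's connection adapters are closed when the block ends.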

Open file in text mode: open(<file>, 'rt')

Currently the open() method opens the file in "byte" mode, which causes the following error when you try to iterate over the file contents:

 Error: iterator should return strings, not bytes (did you open the file in text mode?)

It would be very useful to allow the caller to specify that the file be opened in "text" mode so that the following code would work without error:

import sys
import alluxio
from alluxio import option
import csv

with alluxio_client.open('/tmp/customers.csv', 'rt') as alluxio_file:
    csvreader = csv.reader(alluxio_file, delimiter=',', quotechar='|')
    for row in csvreader:
        print (', '.join(row))
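
Until text mode is supported, one possible workaround is to wrap the byte stream in io.TextIOWrapper from the standard library. The sketch below uses io.BytesIO as a stand-in for the file object returned by open(). Whether the real object implements enough of the binary file interface for TextIOWrapper is an assumption, and read_csv_rows is a hypothetical helper name.

```python
import csv
import io


def read_csv_rows(binary_file, encoding="utf-8"):
    """Decode a binary file object on the fly and parse it as CSV."""
    # newline="" is what the csv module recommends for the files it reads.
    text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
    rows = list(csv.reader(text_file, delimiter=",", quotechar="|"))
    # Detach so the wrapper does not close the caller's binary file.
    text_file.detach()
    return rows


# io.BytesIO stands in for the byte stream returned by the client.
sample = io.BytesIO(b"1,Alice,|Smith, Jr.|\n2,Bob,Jones\n")
rows = read_csv_rows(sample)
for row in rows:
    print(", ".join(row))
```

Detaching the wrapper leaves the underlying stream open, so an enclosing with-block remains responsible for closing it.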

Error while connecting to Alluxio client through Python

I get the error "No JSON object could be decoded" when I do the following steps:

import alluxio
from alluxio import option
client = alluxio.client(hostname, port)
root_stats = client.list_status('/')
for stat in root_stats:
    print pretty_json(stat.json())

This is as per the Python client documentation.

Issue in REST API endpoints on server

I am using https://alluxio-py.readthedocs.io. When configured and run locally, it works fine. For example, creating a folder through the API locally works. But when I try to create a directory through the REST API on the server, it does not work: the generated path URL on the server returns an HTTP 404 error. Also, the REST API documentation does not provide a sample URL: https://docs.alluxio.io/os/restdoc/stable/master/index.html#-1568692663

Which License?

Under which license is this code published?
I checked out the license.txt file but could not really recognise a license. Just a bunch of XML tags.

Setup.py Installation Bug

I've submitted a PR for this issue at #7

We can't add alluxio-py to any requirements.txt because of the way that the modules are structured. I've done a minimally intrusive update to setup.py to keep the version string where it is and allow setup.py to find it. If we could merge this, and do a new release, it would help out everybody who uses Alluxio and Python.
