mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.

Home Page: https://mars-project.readthedocs.io

License: Apache License 2.0

Languages: Python 96.96%, Cython 1.67%, JavaScript 1.03%, C++ 0.20%, Shell 0.13%, C 0.01%, Dockerfile 0.01%, HTML 0.01%
Topics: python, numpy, tensor, pandas, machine-learning, scikit-learn, tensorflow, pytorch, xgboost, lightgbm

mars's Introduction


Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and many other libraries.

Documentation (English and Chinese)

Installation

Mars is easy to install with pip:

pip install pymars

Installation for Developers

To contribute code to Mars, follow the instructions below to install it for development:

git clone https://github.com/mars-project/mars.git
cd mars
pip install -e ".[dev]"

More details about installing Mars can be found in the installation section of the Mars documentation.

Architecture Overview

[Architecture diagram]

Getting Started

Start a new runtime locally via:

>>> import mars
>>> mars.new_session()

Or connect to a Mars cluster that is already initialized:

>>> import mars
>>> mars.new_session('http://<web_ip>:<ui_port>')

Mars Tensor

Mars tensor provides a familiar interface like NumPy.

NumPy:

import numpy as np
N = 200_000_000
a = np.random.uniform(-1, 1, size=(N, 2))
print((np.linalg.norm(a, axis=1) < 1)
      .sum() * 4 / N)

3.14174502
CPU times: user 11.6 s, sys: 8.22 s, total: 19.9 s
Wall time: 22.5 s

Mars tensor:

import mars.tensor as mt
N = 200_000_000
a = mt.random.uniform(-1, 1, size=(N, 2))
print(((mt.linalg.norm(a, axis=1) < 1)
        .sum() * 4 / N).execute())

3.14161908
CPU times: user 966 ms, sys: 544 ms, total: 1.51 s
Wall time: 3.77 s

Mars can leverage multiple cores, even on a laptop, and can be even faster in a distributed setting.

Mars DataFrame

Mars DataFrame provides a familiar interface like pandas.

pandas:

import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.rand(100000000, 4),
    columns=list('abcd'))
print(df.sum())

CPU times: user 10.9 s, sys: 2.69 s, total: 13.6 s
Wall time: 11 s

Mars DataFrame:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(
    mt.random.rand(100000000, 4),
    columns=list('abcd'))
print(df.sum().execute())

CPU times: user 1.21 s, sys: 212 ms, total: 1.42 s
Wall time: 2.75 s

Mars Learn

Mars learn provides a familiar interface like scikit-learn.

Scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

Mars learn:

from mars.learn.datasets import make_blobs
from mars.learn.decomposition import PCA
X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

Mars learn also integrates with many libraries, including TensorFlow, PyTorch, XGBoost, and LightGBM; a minimal sketch of one such integration follows.
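
A minimal sketch of the XGBoost integration, assuming a scikit-learn-style XGBClassifier under mars.learn.contrib.xgboost (the import path and exact API are assumptions; check the Mars learn documentation):

import mars.tensor as mt
from mars.learn.contrib.xgboost import XGBClassifier  # assumed import path

X = mt.random.rand(1000, 10, chunk_size=250)
y = (mt.random.rand(1000, chunk_size=250) > 0.5).astype(int)

clf = XGBClassifier(n_estimators=10)
clf.fit(X, y)                    # training consumes Mars-managed chunks
print(clf.predict(X).execute())  # predictions come back as a Mars tensor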

Mars remote

Mars remote allows users to execute functions in parallel.

Vanilla function calls:

import numpy as np

def calc_chunk(n, i):
    rs = np.random.RandomState(i)
    a = rs.uniform(-1, 1, size=(n, 2))
    d = np.linalg.norm(a, axis=1)
    return (d < 1).sum()

def calc_pi(fs, N):
    return sum(fs) * 4 / N

N = 200_000_000
n = 10_000_000

fs = [calc_chunk(n, i)
      for i in range(N // n)]
pi = calc_pi(fs, N)
print(pi)

3.1416312
CPU times: user 32.2 s, sys: 4.86 s, total: 37.1 s
Wall time: 12.4 s

Mars remote:

import numpy as np
import mars.remote as mr

def calc_chunk(n, i):
    rs = np.random.RandomState(i)
    a = rs.uniform(-1, 1, size=(n, 2))
    d = np.linalg.norm(a, axis=1)
    return (d < 1).sum()

def calc_pi(fs, N):
    return sum(fs) * 4 / N

N = 200_000_000
n = 10_000_000

fs = [mr.spawn(calc_chunk, args=(n, i))
      for i in range(N // n)]
pi = mr.spawn(calc_pi, args=(fs, N))
print(pi.execute().fetch())

3.1416312
CPU times: user 616 ms, sys: 307 ms, total: 923 ms
Wall time: 3.99 s

DASK on Mars

Refer to DASK on Mars for more information.
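
A minimal sketch of running a dask task graph on Mars, assuming the mars_scheduler adapter from the DASK on Mars documentation (treat the import path as an assumption and verify it against the docs):

import dask
import mars
from mars.contrib.dask import mars_scheduler  # assumed import path

mars.new_session()

def inc(x):
    return x + 1

task = dask.delayed(inc)(1)
# Hand the dask graph to Mars for execution instead of dask's own scheduler.
print(task.compute(scheduler=mars_scheduler))  # 2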

Eager Mode

Mars supports an eager mode, which makes it friendly for development and easy to debug.

Users can enable eager mode through options; set them at the beginning of the program or console session:

>>> from mars.config import options
>>> options.eager_mode = True

Or use a context:

>>> from mars.config import option_context
>>> with option_context() as options:
...     options.eager_mode = True
...     # eager mode is on only within the with block
...     ...

If eager mode is on, a tensor, DataFrame, etc. is executed immediately in the default session as soon as it is created:

>>> import mars.tensor as mt
>>> import mars.dataframe as md
>>> from mars.config import options
>>> options.eager_mode = True
>>> t = mt.arange(6).reshape((2, 3))
>>> t
array([[0, 1, 2],
       [3, 4, 5]])
>>> df = md.DataFrame(t)
>>> df.sum()
0    3
1    5
2    7
dtype: int64

Mars on Ray

Mars also has deep integration with Ray: it can run on Ray efficiently and interact with the large ecosystem of machine learning and distributed systems built on top of Ray core.

Start a new Mars on Ray runtime locally via:

import mars
mars.new_session(backend='ray')
# Perform computation

Interact with Ray Dataset:

import mars.tensor as mt
import mars.dataframe as md
df = md.DataFrame(
    mt.random.rand(10_000_000, 4),
    columns=list('abcd'))
# Convert mars dataframe to ray dataset
ds = md.to_ray_dataset(df)
print(ds.schema(), ds.count())
ds.filter(lambda row: row["a"] > 0.5).show(5)
# Convert ray dataset to mars dataframe
df2 = md.read_ray_dataset(ds)
print(df2.head(5).execute())

Refer to Mars on Ray for more information.

Easy to scale in and scale out

Mars can scale in to a single machine, and scale out to a cluster with thousands of machines. It's fairly simple to migrate from a single machine to a cluster to process more data or gain better performance.

Bare Metal Deployment

Mars is easy to scale out to a cluster: start the different components of the Mars distributed runtime on different machines in the cluster.

One node can be selected as the supervisor, which integrates a web service, leaving the other nodes as workers. The supervisor can be started with the following command:

mars-supervisor -h <host_name> -p <supervisor_port> -w <web_port>

Workers can be started with the following command:

mars-worker -h <host_name> -p <worker_port> -s <supervisor_endpoint>

After all Mars processes are started, users can run:

>>> import mars
>>> sess = mars.new_session('http://<web_ip>:<ui_port>')
>>> # perform computation

Kubernetes Deployment

Refer to Run on Kubernetes for more information.
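
A minimal sketch, assuming the mars.deploy.kubernetes entry point described in that documentation (the import path and arguments are assumptions; consult the docs for the exact API):

from kubernetes import config
from mars.deploy.kubernetes import new_cluster  # assumed import path

# Build the cluster from the local kubeconfig; sizes are illustrative.
cluster = new_cluster(config.new_client_from_config(),
                      worker_num=2, worker_mem=4 * 1024 ** 3)
# cluster.session can then be used to run tensors and DataFrames.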

Yarn Deployment

Refer to Run on Yarn for more information.

Getting involved

Thank you in advance for your contributions!

mars's People

Contributors

acezen, aeinrw, catch-bull, cclauss, chaokunyang, dependabot[bot], dlee992, fernadoo, fyrestone, hekaisheng, hoarjour, keyile, lipengsh, loopyme, mmm311, odidev, perfumescent, qinxuye, qxzhou1010, randomy-2, rg070836rg, shantam-8, sighingnow, simoncqk, vcfgv, wdkwyf, wjsi, yoshierahuang, yuyiming, zhongchun


mars's Issues

[BUG][TENSOR] mt.random.rand returns different results when executed in different sessions

Describe the bug

A random tensor returns different results when executed in different sessions.

To Reproduce

In [2]: import mars.tensor as mt                                                

In [3]: a = mt.random.rand(10, 20)                                              

In [4]: from mars.session import new_session                                    

In [5]: session = new_session()                                                 

In [6]: session.run(a)                                                          
Out[6]: 
array([[0.40063535, 0.45796697, 0.36712569, 0.57410564, 0.03316188,
        0.39214044, 0.58855336, 0.04936224, 0.62870674, 0.89812263,
        0.0950565 , 0.94972637, 0.01124284, 0.79766443, 0.53592006,
        0.63557538, 0.25438726, 0.95038912, 0.26640497, 0.94912579],
       [0.31029473, 0.47018138, 0.93268746, 0.24141035, 0.36309559,
        0.74774755, 0.83962496, 0.69642224, 0.66627376, 0.83162516,
        0.06235467, 0.95071957, 0.67219839, 0.54850925, 0.8226961 ,
        0.88564641, 0.78807562, 0.94937921, 0.73962081, 0.57448563],
       [0.92392073, 0.25436821, 0.46914224, 0.00534222, 0.06011203,
        0.06211908, 0.68319277, 0.34030356, 0.26849195, 0.05883898,
        0.37259568, 0.85177104, 0.53141152, 0.11695296, 0.73013186,
        0.05868005, 0.19118145, 0.92374115, 0.40840235, 0.6835809 ],
       [0.0803077 , 0.1893075 , 0.70055443, 0.48284417, 0.20130932,
        0.52467516, 0.54734497, 0.36079865, 0.32631029, 0.07459611,
        0.34592097, 0.50444219, 0.37569038, 0.35744269, 0.18466475,
        0.08775113, 0.46061581, 0.90114707, 0.94049296, 0.23930423],
       [0.52935223, 0.49590164, 0.35146561, 0.00771866, 0.29229495,
        0.68584935, 0.1255136 , 0.19971982, 0.73346201, 0.51213047,
        0.46768733, 0.90407055, 0.82762451, 0.12175357, 0.89542905,
        0.07258344, 0.22948567, 0.96548473, 0.22674003, 0.31870152],
       [0.55410522, 0.87324508, 0.79980585, 0.64095256, 0.29683337,
        0.07802534, 0.45419463, 0.36569806, 0.16964579, 0.41355839,
        0.35727195, 0.13078792, 0.08463963, 0.64743208, 0.64189296,
        0.35168302, 0.39599695, 0.19812248, 0.91877611, 0.23550386],
       [0.65416422, 0.21367134, 0.38747314, 0.82068465, 0.12929007,
        0.5461681 , 0.71207599, 0.71685563, 0.77510339, 0.30145219,
        0.71298617, 0.97620726, 0.23433513, 0.13521878, 0.68018848,
        0.09735912, 0.3985609 , 0.95363321, 0.32997877, 0.31743948],
       [0.76982216, 0.0395714 , 0.11986159, 0.13023568, 0.74617636,
        0.78373296, 0.30350781, 0.30281778, 0.91149265, 0.69775049,
        0.56690621, 0.900408  , 0.87716403, 0.96302196, 0.56072209,
        0.95010627, 0.34613038, 0.24320805, 0.77750347, 0.76634203],
       [0.74276879, 0.86975402, 0.63704401, 0.62308415, 0.48459096,
        0.41607366, 0.02747562, 0.52673396, 0.53733691, 0.17104523,
        0.07152666, 0.83350753, 0.19309586, 0.70668821, 0.08741014,
        0.75735304, 0.89656358, 0.00696518, 0.18004327, 0.0669692 ],
       [0.0069334 , 0.44147947, 0.53509785, 0.16366487, 0.65910071,
        0.24187519, 0.447849  , 0.66905146, 0.1523497 , 0.35651314,
        0.32655428, 0.7419353 , 0.69160677, 0.29974135, 0.73811682,
        0.21814573, 0.96069355, 0.23094692, 0.30193065, 0.16580655]])

In [7]: session2 = new_session()                                                

In [8]: session2.run(a)                                                         
Out[8]: 
array([[0.06278953, 0.92966713, 0.15261774, 0.588857  , 0.05788972,
        0.71180712, 0.62486666, 0.47462451, 0.99727786, 0.15202203,
        0.72452653, 0.98535792, 0.0346885 , 0.64527834, 0.83797195,
        0.83395962, 0.39824545, 0.8934134 , 0.541806  , 0.83978902],
       [0.63635538, 0.18294176, 0.69331702, 0.52763233, 0.3229291 ,
        0.62241287, 0.68414848, 0.97999275, 0.55168412, 0.4603575 ,
        0.82990232, 0.90552774, 0.97451053, 0.40686349, 0.54823289,
        0.4247938 , 0.23535826, 0.55458135, 0.55258465, 0.24617388],
       [0.44944288, 0.04010546, 0.55743127, 0.45643243, 0.16292647,
        0.38589025, 0.97295906, 0.2309159 , 0.36286914, 0.6497375 ,
        0.57341974, 0.09344122, 0.01967838, 0.98499235, 0.40975712,
        0.57559919, 0.11186148, 0.74766026, 0.51243854, 0.7151745 ],
       [0.59554684, 0.27307507, 0.37105536, 0.23731401, 0.07928558,
        0.30030247, 0.11912166, 0.30569325, 0.44055106, 0.66682974,
        0.70194368, 0.19419587, 0.33057051, 0.1957851 , 0.91913741,
        0.39932088, 0.39925625, 0.21255173, 0.10834017, 0.40510625],
       [0.58570478, 0.63999024, 0.60711114, 0.35790321, 0.03935359,
        0.29713775, 0.23494106, 0.18004252, 0.65265073, 0.02935006,
        0.56979132, 0.12347523, 0.17526049, 0.61784638, 0.95982497,
        0.64096918, 0.71338032, 0.0979669 , 0.01520631, 0.15262291],
       [0.3135371 , 0.44314403, 0.42788622, 0.96726161, 0.07888723,
        0.60221491, 0.53075983, 0.71256098, 0.5216132 , 0.11446414,
        0.93905272, 0.39946166, 0.00228463, 0.74673268, 0.68813588,
        0.15785701, 0.37980197, 0.07209914, 0.91218897, 0.8224859 ],
       [0.66014014, 0.32113028, 0.90041553, 0.54046677, 0.68700057,
        0.69033259, 0.7842881 , 0.88329396, 0.58093067, 0.81175768,
        0.19481369, 0.01494406, 0.15493862, 0.37604864, 0.348721  ,
        0.34257649, 0.62440358, 0.6258676 , 0.35766471, 0.71645263],
       [0.10860085, 0.18229978, 0.87975443, 0.12793471, 0.0596285 ,
        0.41011133, 0.3194171 , 0.15745823, 0.27606328, 0.19873117,
        0.03263342, 0.13748292, 0.69313661, 0.35918474, 0.4565669 ,
        0.85733087, 0.44626454, 0.10169915, 0.12741476, 0.78491404],
       [0.81891845, 0.00973453, 0.61652424, 0.71913264, 0.34067129,
        0.44458781, 0.52092245, 0.67830463, 0.47626831, 0.56424497,
        0.76968304, 0.73731452, 0.71749442, 0.69947051, 0.46067184,
        0.41743385, 0.56679369, 0.27563535, 0.96041557, 0.20614126],
       [0.92617988, 0.766046  , 0.39547086, 0.48542568, 0.2632229 ,
        0.39583571, 0.71995094, 0.02390021, 0.24162666, 0.69003697,
        0.98268602, 0.62513413, 0.32820124, 0.51544208, 0.17805757,
        0.24546812, 0.15784564, 0.27552066, 0.77633782, 0.88856292]])

Conventional schedulers (Slurm/PBS/Loadleveler) compatibility?

Is your feature request related to a problem? Please describe.
Most supercomputers in the world use one of the few schedulers available, like the ones mentioned in the title. Those usually don't play well with other schedulers.

Describe the solution you'd like
To be able to run it directly from a Slurm session.

Describe alternatives you've considered
In many SPMD environments, one can submit a single program to run on a number of compute nodes. So this program should be able to run as a master on one node and as a worker on all the other nodes; something like a front-end to Mars.

Additional context
Supercomputer schedulers are quite simple in operation. The user submits a job to the batch system, which waits until the requested amount of resources is available. Then it runs the code in all the processes. It's that simple.
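
For illustration, a hypothetical sketch of such a front-end as a Slurm batch script; the mars-supervisor and mars-worker commands are taken from the Bare Metal Deployment section of the README, while the ports and script structure are assumptions:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# srun runs the quoted body once per node; the first node becomes the
# supervisor (master), all the others become workers.
srun bash -c '
  NODES=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
  HEAD=${NODES[0]}
  if [ "$(hostname)" = "$HEAD" ]; then
      mars-supervisor -h "$HEAD" -p 7103 -w 7104
  else
      mars-worker -h "$(hostname)" -p 7105 -s "$HEAD:7103"
  fi
'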

[BUG][TENSOR] mt.split got an error when running in cluster

Describe the bug
When split tensors are submitted to a cluster, a ValueError is raised on the scheduler side.

To Reproduce

In [1]: from mars.deploy.local import new_cluster                                                        

In [2]: cluster = new_cluster(scheduler_n_process=2, worker_n_process=3, web=True)                       

In [3]: import mars.tensor as mt                                                                         

In [4]: t = mt.random.rand(10, 10)

In [5]: mt.split(t, 5).execute()                                                                         
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "mars/actors/pool/gevent_pool.pyx", line 88, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 91, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 102, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 96, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/core.pyx", line 108, in mars.actors.core.FunctionActor.on_receive
  File "mars/actors/core.pyx", line 110, in mars.actors.core.FunctionActor.on_receive
  File "/Users/hekaisheng/Documents/mars_dev/mars/mars/scheduler/graph.py", line 527, in get_executable_operand_dag
    kws=kws)
  File "/Users/hekaisheng/Documents/mars_dev/mars/mars/tensor/expressions/indexing/getitem.py", line 58, in new_chunks
    chunk, indexes = inputs
ValueError: not enough values to unpack (expected 2, got 1)
2018-12-25T02:23:13Z <Greenlet "Greenlet-0" at 0x11a9f47b8: <built-in method fire_run of mars.actors.pool.gevent_pool.ActorExecutionContext object at 0x11b154458>> failed with ValueError

Expected behavior
It should run as it does with the thread-based scheduler on a single machine:

In [1]: import mars.tensor as mt                                                                         

In [2]: mt.split(mt.random.rand(10, 10), 5).execute()                                                    
Out[2]: 
[array([[0.97707867, 0.65788903, 0.6474784 , 0.14588068, 0.66482111,
         0.41569224, 0.37568813, 0.54747152, 0.55119096, 0.75261828],
        [0.92090163, 0.33438114, 0.33748729, 0.85393778, 0.85343273,
         0.03576545, 0.25153039, 0.84161614, 0.47162213, 0.44469732]]),
 array([[0.19063479, 0.82669046, 0.80276226, 0.69243507, 0.409638  ,
         0.80037753, 0.06239842, 0.48988705, 0.70018657, 0.08936353],
        [0.45298261, 0.97734578, 0.46953004, 0.6408237 , 0.27882975,
         0.71627508, 0.97158351, 0.24061975, 0.39703134, 0.82116613]]),
 array([[0.08595908, 0.98163836, 0.31229413, 0.9138722 , 0.69957077,
         0.55984328, 0.65914406, 0.0157759 , 0.967717  , 0.63201471],
        [0.89268134, 0.4228981 , 0.49237109, 0.12342419, 0.17950104,
         0.05108572, 0.39138997, 0.75485722, 0.64511224, 0.78004265]]),
 array([[0.45108104, 0.96795335, 0.02066064, 0.14883567, 0.60484485,
         0.27167651, 0.66651133, 0.11039622, 0.32972396, 0.27723386],
        [0.5275256 , 0.16618801, 0.97140139, 0.53816923, 0.23349218,
         0.94030826, 0.38716731, 0.88068514, 0.69808024, 0.22292709]]),
 array([[0.52674368, 0.96576199, 0.212461  , 0.70662301, 0.65729539,
         0.9263901 , 0.48157777, 0.81997881, 0.8481292 , 0.7402317 ],
        [0.84808454, 0.2683585 , 0.21626634, 0.68554189, 0.17512773,
         0.84865071, 0.42345878, 0.7005781 , 0.13699969, 0.8125516 ]])]

Use ChunkMetaActor to replace KVStoreActor

Is your feature request related to a problem? Please describe.
KVStoreActor is a single point of failure and contention, and it introduces the cost of path formatting and resolution. Its responsibilities should be separated to reduce the load on a single scheduler.

Describe the solution you'd like
Use ChunkMetaActor to replace KVStoreActor for storing chunk meta. Graph meta can be handled by a GraphMetaActor, and worker resources by a ResourceActor. A broadcast-and-cache strategy can be used, as the storage locations and consumers of chunks can be determined before execution.

Optimize preparation of huge graphs

Is your feature request related to a problem? Please describe.
When running huge graphs (~80k nodes) in Mars, users have to wait for a long time (~20 min) before the graph is tiled into chunks and actually starts running. Therefore optimization is needed.

Describe the solution you'd like
All time-consuming steps and possible optimizations:

  • graph tiling (~10min -> 1.5min by applying #107)
  • initial placement (~6min)
  • creating and distributing operands (~2min) -> eliminated by upgrading to v0.7

These steps can be optimized one by one.

Pre-push operands whose worker can be fixed

Is your feature request related to a problem? Please describe.
Scheduling ready operands onto a worker in advance reduces the cost for operands on that worker to execute. It also mitigates the case where already-submitted, low-priority operands are scheduled before unsubmitted, high-priority ones, which leads to much larger intermediate storage.

Describe the solution you'd like
The basic intuition is to reduce the cost of scheduling subsequent nodes onto workers. When all predecessors of an operand are being executed, the worker for that operand can be fixed with high probability, so we can pre-push the operand to that worker with higher priority. When there are free resources, the worker will schedule the operands with higher priority first.

[BUG][TENSOR] Cannot run a tensor tuple by web session

Describe the bug
When a web session is created to submit tensors to a cluster, it raises TypeError: run got unexpected key arguments n_parallel, which may be caused by the ExecutableTuple class.

To Reproduce
I start a local cluster with the web service enabled, and run SVD on a random matrix.

In [1]: from mars.deploy.local import new_cluster

In [2]: cluster = new_cluster(scheduler_n_process=2, worker_n_process=3, web=True)

In [3]: import mars.tensor as mt

In [4]: t = mt.random.rand(10, 10)

In [5]: mt.linalg.svd(t).execute()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-7615968c966e> in <module>()
----> 1 mt.linalg.svd(t).execute()

~/Documents/mars_dev/mars/mars/tensor/core.py in execute(self, session, n_parallel)
    466         if session is None:
    467             session = Session.default_or_local()
--> 468         return session.run(*self, n_parallel=n_parallel)
    469 
    470 

~/Documents/mars_dev/mars/mars/session.py in run(self, *tensors, **kw)
     81 
     82         tensors = tuple(mt.tensor(t) for t in tensors)
---> 83         result = self._sess.run(*tensors, **kw)
     84         self._executed_keys.update(t.key for t in tensors)
     85         for t in tensors:

~/Documents/mars_dev/mars/mars/deploy/local/session.py in run(self, *tensors, **kw)
     47         timeout = kw.pop('timeout', -1)
     48         if kw:
---> 49             raise TypeError('run got unexpected key arguments {0}'.format(', '.join(kw.keys())))
     50 
     51         graph = DirectedGraph()

TypeError: run got unexpected key arguments n_parallel

Add log reader page on web

Is your feature request related to a problem? Please describe.
Currently it is hard for users to get logs for diagnosing problems.

Describe the solution you'd like
Add simple log pages showing the tails of logs. The size of the contents can be configured via URL.

Add PEP8 check

As titled. Some proposed criteria are below.

# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics
# exit-zero treats all errors as warnings.  The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --statistics

Config file .flake8:

[flake8]
max-complexity = 10
max-line-length = 127
exclude =
    *_pb2.py
    __init__.py
    __pycache__
    mars/lib/enum.py
    mars/lib/six.py
    mars/lib/futures/*

[BUG][TENSOR] tensor of singular values calculated by svd tiles chunk with 2-d chunk index

Describe the bug

The tensor of singular values calculated by svd should be a vector (1-d), but after tiling, its chunks have a 2-d chunk index, which is wrong.

To Reproduce

In [13]: import mars.tensor as mt 
    ...:  
    ...: a = mt.random.rand(9, 6, chunks=(3, 6)) 
    ...: U, s, V = mt.linalg.svd(a) 
    ...:  
    ...: s.tiles()                                                              
Out[13]: <mars.tensor.core.Tensor at 0x118462888>

In [14]: s.chunks[0].index                                                      
Out[14]: (0, 0)

In [15]: s.shape                                                                
Out[15]: (6,)

Expected behavior

Chunk index should also be 1-d.

[BUG][DISTRIBUTED] client will hang if submit the same graph again

Describe the bug
If we resubmit the same tensor via a session, the client will hang. At the same time, the scheduler raises an ActorAlreadyExist error, which I think can be solved by pruning existing operands.

To Reproduce
I created a distributed environment with 1 worker.

In [1]: import mars.tensor as mt

In [2]: from mars.session import new_session

In [3]: session = new_session('http://0.0.0.0:33335')

In [4]: a = mt.ones((10, 10)) + 1

In [5]: session.run(a)
Out[5]: 
array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]])

In [6]: session.run(a)

The client hangs when executing In [6], and the scheduler log shows:

Creating operand actors for graph 93ba7372-e748-40ab-af14-6c250f845ba4
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "mars/actors/pool/gevent_pool.pyx", line 88, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 91, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 102, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 96, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/core.pyx", line 108, in mars.actors.core.FunctionActor.on_receive
  File "mars/actors/core.pyx", line 110, in mars.actors.core.FunctionActor.on_receive
  File "/Users/hekaisheng/Documents/mars/mars/scheduler/graph.py", line 156, in execute_graph
    self.create_operand_actors()
  File "/Users/hekaisheng/Documents/mars/mars/scheduler/graph.py", line 569, in create_operand_actors
    [future.result() for future in itertools.chain(six.itervalues(op_refs), [write_future])]
  File "/Users/hekaisheng/Documents/mars/mars/scheduler/graph.py", line 569, in <listcomp>
    [future.result() for future in itertools.chain(six.itervalues(op_refs), [write_future])]
  File "src/gevent/event.py", line 457, in gevent._event.AsyncResult.result
  File "src/gevent/event.py", line 381, in gevent._event.AsyncResult.get
  File "src/gevent/event.py", line 399, in gevent._event.AsyncResult.get
  File "src/gevent/event.py", line 379, in gevent._event.AsyncResult._raise_exception
  File "/Users/hekaisheng/miniconda3/lib/python3.6/site-packages/gevent/_compat.py", line 47, in reraise
    raise value.with_traceback(tb)
  File "mars/actors/pool/gevent_pool.pyx", line 758, in mars.actors.pool.gevent_pool.Communicator._create_local_actor
  File "mars/actors/pool/gevent_pool.pyx", line 237, in mars.actors.pool.gevent_pool.LocalActorPool.create_actor
mars.actors.errors.ActorAlreadyExist: Actor s:operator$2724de06-fc4a-11e8-8d24-47a7eeef77d5$c84e3ef10df7fabaf2dc028149609284 already exist, cannot create

Provide a common api for mars cluster

Is your feature request related to a problem? Please describe.
Currently, when creating a Mars cluster, we must start Mars web and interact via the RESTful API; however, this is unnecessary in cases where we don't need the web UI.

Describe the solution you'd like
I think a common API should be provided so that we can interact with the scheduler through it, like:

from mars import MarsApi

api = MarsApi(scheduler_ip)
api.create_session(session_id)
api.submit_graph(session_id, graph)

Additional context
It may be useful when creating a local cluster.

[BUG] gevent is not required for mars client but we wrongly import a module use gevent

Describe the bug

gevent is not required for the Mars client, but we wrongly import a module that uses gevent, which leads to a ModuleNotFoundError.

To Reproduce

Start a fresh environment via conda create. After pip install pymars, run:

>>> import mars.tensor as mt
>>> a = mt.random.rand(100, 100)
>>> a.sum().execute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/qinxuye/Workspace/mars/mars/tensor/core.py", line 441, in execute
    from ..session import Session
  File "/Users/qinxuye/Workspace/mars/mars/session.py", line 22, in <module>
    from .api import MarsAPI
  File "/Users/qinxuye/Workspace/mars/mars/api.py", line 18, in <module>
    from .cluster_info import ClusterInfoActor
  File "/Users/qinxuye/Workspace/mars/mars/cluster_info.py", line 20, in <module>
    from . import kvstore
  File "/Users/qinxuye/Workspace/mars/mars/kvstore.py", line 2, in <module>
    from gevent.event import Event
ModuleNotFoundError: No module named 'gevent'

Expected behavior

Either we shouldn't import these modules, or we should catch the ImportError.

Additional context

We must add tests for this case, as proposed in #52.

[BUG][client] web session has a different signature with the local session

Describe the bug

The web module has its own session, but its run method has a different signature from the local session's, which leads to a bug: when a user submits more than one tensor, only the first tensor is executed.

To Reproduce

In [14]: from mars.session import new_session                                   

In [15]: sess = new_session('http://0.0.0.0:49911')                             

In [16]: a = mt.ones((2, 2))                                                    

In [17]: sess.run(a, a+1)                                                       
Out[17]: 
[array([[1., 1.],
        [1., 1.]])]

[BUG] IndexError raised when summing slices with same index

Describe the bug
When submitting

import numpy as np
import operator
from functools import reduce

import mars.tensor as mt  # needed for mt.array below

base_arr = np.random.random((100, 100))
a = mt.array(base_arr)
sumv = reduce(operator.add, [a[:10, :10] for _ in range(10)])

GraphActor reports

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "mars/actors/pool/gevent_pool.pyx", line 88, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 91, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 102, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/pool/gevent_pool.pyx", line 96, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
  File "mars/actors/core.pyx", line 108, in mars.actors.core.FunctionActor.on_receive
  File "mars/actors/core.pyx", line 110, in mars.actors.core.FunctionActor.on_receive
  File "/Users/wenjun.swj/Code/mars/mars/scheduler/graph.py", line 421, in prepare_graph
    for n in tensor_to_tiled[tk][-1].chunks:
IndexError: list index out of range

[TENSOR] Support fetch=False for tensor.execute and session.run

Is your feature request related to a problem? Please describe.

For now, executing a tensor or calling session.run triggers fetching data, which may put a huge burden on the client's memory, so we should provide a fetch parameter. With fetch=False, we only trigger execution without fetching data.

Describe the solution you'd like

Add fetch parameter to Tensor.execute and Session.run.
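
A sketch of the proposed usage (fetch is the parameter being requested here, not an existing one):

import mars.tensor as mt

a = mt.random.rand(10000, 10000)
s = a.sum()
s.execute(fetch=False)  # proposed: run the computation, keep the result in the cluster
result = s.execute()    # fetch later, only when the data is needed on the client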

[BUG][TENSOR] part of svd results cannot be submitted if other results are garbage collected

Describe the bug

If users only run part of the svd results, like _, s, _ = mt.linalg.svd(a), U and V are garbage collected and the submission will fail; local threaded execution will also fail.

To Reproduce

In [16]: a = mt.random.rand(20, 10, chunk_size=10) 

In [19]: _, s, _ = mt.linalg.svd(a)                                             

In [20]: s.build_graph(tiled=False)                                             
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-77797d4ed4a0> in <module>
----> 1 s.build_graph(tiled=False)

~/Workspace/mars/mars/tensor/core.py in build_graph(self, graph, cls, tiled, compose)
    298             if not graph.contains(chunk):
    299                 graph.add_node(chunk)
--> 300             children = chunk.inputs or []
    301             for c in children:
    302                 if not graph.contains(c):

AttributeError: 'NoneType' object has no attribute 'inputs'

Add scheduler list on web page

Is your feature request related to a problem? Please describe.
In the current version, scheduler information is shown in the dashboard, which is not an elegant design, as there can be multiple schedulers. The dashboard should be used to show overall statistics of the cluster.

Describe the solution you'd like
Add a scheduler list and scheduler detail pages.

[BUG] Assertion error in testCumReduction

Please check whether some corner cases are being hit.

____________________________ Test.testCumReduction _____________________________
self = <mars.tensor.execution.tests.test_reduction_execute.Test testMethod=testCumReduction>
    def testCumReduction(self):
        raw = np.random.randint(5, size=(8, 8, 8))
    
        arr = tensor(raw, chunks=3)
    
        res1 = self.executor.execute_tensor(arr.cumsum(axis=1), concat=True)
        res2 = self.executor.execute_tensor(arr.cumprod(axis=1), concat=True)
        expected1 = raw.cumsum(axis=1)
        expected2 = raw.cumprod(axis=1)
        self.assertTrue(np.array_equal(res1[0], expected1))
        self.assertTrue(np.array_equal(res2[0], expected2))
    
        raw = sps.random(8, 8, density=.1)
    
        arr = tensor(raw, chunks=3)
    
        res1 = self.executor.execute_tensor(arr.cumsum(axis=1), concat=True)
        res2 = self.executor.execute_tensor(arr.cumprod(axis=1), concat=True)
        expected1 = raw.A.cumsum(axis=1)
        expected2 = raw.A.cumprod(axis=1)
>       self.assertTrue(np.array_equal(res1[0], expected1))
E       AssertionError: False is not true
mars/tensor/execution/tests/test_reduction_execute.py:360: AssertionError

[BUG] Cannot receive any chunks on a multi-process scheduler

Describe the bug
Serialization error in return values from GraphActor.get_tiled_tensor() led to failure in fetching results from a scheduler with multiple processes.

To Reproduce

  1. Start a multi-process scheduler
  2. Run any expression and you'll get IndexError
  3. In the scheduler log you can see the serialization error caused by weakref objects.

[TENSOR] Update tensor's shape if it's executed especially for those with unknown shape

Is your feature request related to a problem? Please describe.

Some ops, like boolean indexing, create a tensor with an unknown shape, while other ops, like reshape, split, and so forth, require that the input tensor's shape be known.

In this case, the user may execute the tensor (with or without fetching data) so that the process can continue. However, for now, we do not update the tensor's shape even after it is executed. We need to fix this.

Describe the solution you'd like

Update the tensor's shape once the tensor is executed.
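
For illustration, a sketch of the desired behaviour (the boolean-indexing example here is hypothetical):

import mars.tensor as mt

a = mt.random.rand(10, 10)
b = a[a > 0.5]  # boolean indexing: shape is unknown before execution
b.execute()     # desired: b.shape is updated from the executed chunks here,
                # so shape-dependent ops such as b.reshape(...) can proceed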

[BUG][DISTRIBUTED] run parts of svd results will cause serialize error

Describe the bug

Some tensor ops can generate multiple outputs, e.g. svd. Assume u, s, v are the result tensors of an SVD decomposition; in the distributed setting, if we only run some of the tensors, for example u and s, a KeyError is raised, which I guess is due to a serialization issue.

To Reproduce

I started one scheduler, one worker, and the web service on my laptop.

On the client side:

In [3]: from mars.session import new_session

In [4]: session  = new_session('http://0.0.0.0:61378')

In [5]: a = mt.random.rand(10, 20)

In [6]: dec = np.linalg.svd(a)

In [7]: dec = mt.linalg.svd(a)

In [8]: session.run(dec[:2])

The client will hang.

On the scheduler side, we can get the error:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "mars/actors/pool/gevent_pool.pyx", line 88, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    cpdef object fire_run(self):
  File "mars/actors/pool/gevent_pool.pyx", line 91, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    with self.lock:
  File "mars/actors/pool/gevent_pool.pyx", line 102, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    raise
  File "mars/actors/pool/gevent_pool.pyx", line 96, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    res = actor.on_receive(message_ctx.message)
  File "mars/actors/core.pyx", line 108, in mars.actors.core.FunctionActor.on_receive
    cpdef on_receive(self, message):
  File "mars/actors/core.pyx", line 110, in mars.actors.core.FunctionActor.on_receive
    return getattr(self, method)(*args, **kwargs)
  File "/Users/qinxuye/Workspace/mars/mars/scheduler/graph.py", line 153, in execute_graph
    self.prepare_graph()
  File "/Users/qinxuye/Workspace/mars/mars/scheduler/graph.py", line 207, in prepare_graph
    tensor_graph = deserialize_graph(self._serialized_tensor_graph)
  File "/Users/qinxuye/Workspace/mars/mars/utils.py", line 403, in deserialize_graph
    return DirectedGraph.from_json(json_obj)
  File "mars/graph.pyx", line 415, in mars.graph.DirectedGraph.from_json
    return cls.deserialize(SerialiableGraph.from_json(json_obj))
  File "mars/serialize/core.pyx", line 537, in mars.serialize.core.Serializable.from_json
    return cls.deserialize(provider, obj)
  File "mars/serialize/core.pyx", line 510, in mars.serialize.core.Serializable.deserialize
    [cb(key_to_instance) for cb in callbacks]
  File "mars/serialize/jsonserializer.pyx", line 176, in mars.serialize.jsonserializer.JsonSerializeProvider._deserialize_list.cb.inner
    o = subs[v.key, v.id]
KeyError: ('9e2478de2695b727435601490ebaa999', '5219323976')
2018-12-07T11:14:35Z <Greenlet "Greenlet-0" at 0x138f5a448: <built-in method fire_run of mars.actors.pool.gevent_pool.ActorExecutionContext object at 0x139002cc8>> failed with KeyError

Pass timestamp inside cluster instead of date strings

Is your feature request related to a problem? Please describe.
When differences in timezone between server and client are considered, it is better to pass timestamps instead of date strings between the Mars scheduler and workers.

Describe the solution you'd like
Date strings passed as status should be replaced with timestamps.

Provide a way to start a complete distributed environment in a single machine

Is your feature request related to a problem? Please describe.

Currently it's quite painful to start a complete distributed environment on a single machine: we have to start the scheduler, start the worker, then start the web.

Describe the solution you'd like

I hope there is a simple way to do this with a single line of code. For example:

from mars.deploy.local import new_cluster

cluster = new_cluster()

and then we can create a session with:

from mars.session import new_session

session = new_session(cluster.endpoint)

Of course this is just an example.

Describe alternatives you've considered

I suggest we start the scheduler and worker in an actor pool, and start the web in a separate process.

Additional context

It's also useful for using in our unittests.

[BUG] Deleting broadcasted meta may cause item in kvstore be deleted multiple times

Describe the bug
Deleting broadcasted meta may cause items in the kvstore to be deleted multiple times.

To Reproduce
In multi-worker settings, errors are reported like:

Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 716, in gevent._greenlet.Greenlet.run
  File "mars/actors/pool/gevent_pool.pyx", line 88, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    cpdef object fire_run(self):
  File "mars/actors/pool/gevent_pool.pyx", line 91, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    with self.lock:
  File "mars/actors/pool/gevent_pool.pyx", line 102, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    raise
  File "mars/actors/pool/gevent_pool.pyx", line 96, in mars.actors.pool.gevent_pool.ActorExecutionContext.fire_run
    res = actor.on_receive(message_ctx.message)
  File "mars/actors/core.pyx", line 108, in mars.actors.core.FunctionActor.on_receive
    cpdef on_receive(self, message):
  File "mars/actors/core.pyx", line 110, in mars.actors.core.FunctionActor.on_receive
    return getattr(self, method)(*args, **kwargs)
  File "/home/admin/wenjun.swj/mars/mars/scheduler/kvstore.py", line 54, in delete
    return self._store.delete(key, dir=dir, recursive=recursive)
  File "/home/admin/wenjun.swj/mars/mars/kvstore.py", line 265, in delete
    raise KeyError(key)
KeyError: '/sessions/c3c9d252-09bc-11e9-bae9-97dedca03eb9/chunks/6f02ffa9ca5cf0b82362b37418d29ccc'
2018-12-27T10:17:46Z <Greenlet "Greenlet-0" at 0x7fed5e0f6268: <built-in method fire_run of mars.actors.pool.gevent_pool.ActorExecutionContext object at 0x7fed5f8bbf48>> failed with KeyError

[BUG] Precision error on linalg.norm

Describe the bug
When running unit tests on macOS, the following errors are shown:

____________________________ Test.testNormExecution ____________________________
self = <mars.tensor.execution.tests.test_linalg_execute.Test testMethod=testNormExecution>
    def testNormExecution(self):
        d = np.arange(9) - 4
        d2 = d.reshape(3, 3)
    
        ma = [tensor(d, chunks=2), tensor(d, chunks=9),
              tensor(d2, chunks=(2, 3)), tensor(d2, chunks=3)]
    
        for i, a in enumerate(ma):
            data = d if i < 2 else d2
            for ord in (None, 'nuc', np.inf, -np.inf, 0, 1, -1, 2, -2):
                for axis in (0, 1, (0, 1)):
                    for keepdims in (True, False):
                        try:
                            expected = np.linalg.norm(data, ord=ord, axis=axis, keepdims=keepdims)
                            t = norm(a, ord=ord, axis=axis, keepdims=keepdims)
                            concat = t.ndim > 0
                            res = self.executor.execute_tensor(t, concat=concat)[0]
    
                            np.testing.assert_allclose(res, expected, atol=.0001)
                        except ValueError:
                            continue
    
        m = norm(tensor(d))
        expected = self.executor.execute_tensor(m)[0]
        res = np.linalg.norm(d)
        self.assertEqual(expected, res)
    
        d = uniform(-0.5, 0.5, size=(500, 2), chunks=50)
        inside = (norm(d, axis=1) < 0.5).sum().astype(float)
        t = inside / 500 * 4
        res = self.executor.execute_tensor(t)[0]
>       self.assertAlmostEqual(res, 3.14, delta=0.3)
E       AssertionError: 2.808 != 3.14 within 0.3 delta
mars/tensor/execution/tests/test_linalg_execute.py:328: AssertionError

To Reproduce
It never happens in tests under Linux. On macOS it occurs only in some of the runs.

Additional context
The test uses the built-in Python 2.7.15 on macOS.

[BUG][DISTRIBUTED] client may hang if something wrong happens in the scheduler

Describe the bug

If some errors occur in the distributed runtime, the client may hang; #8 and #17 are examples.

To Reproduce

As examples in #8 and #17.

Expected behavior

Either we deliver the error to the client, or we hide the real error and deliver an InternalServerError to the client.

Additional context

We should separate system errors from user-caused errors. We should also add some tests.

Rename chunk to chunk_size for all tensor expressions

chunks is used in tensor expressions to specify the chunk size of the given tensor, but users may misunderstand it as the number of chunks.

To clear up the ambiguity, we can rename chunks to chunk_size.

Documentation also needs to be updated.
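
For illustration, how the rename would read (mt.ones is used here just as an example expression):

import mars.tensor as mt

# Before: does 30 mean the size of each chunk, or 30 chunks?
t = mt.ones((100, 100), chunks=30)

# After the proposed rename, the intent is unambiguous:
t = mt.ones((100, 100), chunk_size=30)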

Test tensor execution in distributed mode

Is your feature request related to a problem? Please describe.
Currently, our expression test cases all run on the thread-based executor; however, the behavior changes when running in a cluster, and only a few cases in test_api and test_main work end to end.

Several issues (#19, #64, #8) are related to this; maybe we need some tests to cover them in the future.

Describe the solution you'd like
At least, we can add cases in testGraphActor to test GraphActor with different input tensors.

[BUG] Assertion Error in testArgReduction

Describe the bug
When running unit tests on macOS, the following errors are shown:

____________________________ Test.testArgReduction _____________________________
self = <mars.tensor.execution.tests.test_reduction_execute.Test testMethod=testArgReduction>
    def testArgReduction(self):
        raw = np.random.random((20, 20, 20))
    
        arr = tensor(raw, chunks=3)
    
        self.assertEqual(raw.argmax(),
                         self.executor.execute_tensor(arr.argmax())[0])
        self.assertEqual(raw.argmin(),
                         self.executor.execute_tensor(arr.argmin())[0])
    
        self.assertTrue(np.array_equal(raw.argmax(axis=0),
                        self.executor.execute_tensor(arr.argmax(axis=0), concat=True)[0]))
        self.assertTrue(np.array_equal(raw.argmin(axis=0),
                        self.executor.execute_tensor(arr.argmin(axis=0), concat=True)[0]))
    
        raw = sps.random(20, 20, density=.1)
    
        arr = tensor(raw, chunks=3)
    
        self.assertEqual(raw.argmax(),
                         self.executor.execute_tensor(arr.argmax())[0])
        self.assertEqual(raw.argmin(),
>                        self.executor.execute_tensor(arr.argmin())[0])
E       AssertionError: 1 != 2
mars/tensor/execution/tests/test_reduction_execute.py:277: AssertionError

Maybe some corner cases should be dealt with.

Additional context
Linux, Python 3.7

[BUG] for python2.7, raise ImportError if mock is not installed

Describe the bug
For Python 2.7, if the mock library is not installed, an ImportError is raised.

To Reproduce

If mock is not installed:

>>> import mars.tensor as mt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mars/tensor/__init__.py", line 18, in <module>
    from .expressions.datasource import tensor, array, asarray, scalar, \
  File "mars/tensor/expressions/__init__.py", line 17, in <module>
    from . import rechunk
  File "mars/tensor/expressions/rechunk/__init__.py", line 17, in <module>
    from .rechunk import rechunk
  File "mars/tensor/expressions/rechunk/rechunk.py", line 24, in <module>
    from ....config import options
  File "mars/config.py", line 22, in <module>
    from .compat import six
  File "mars/compat/__init__.py", line 198, in <module>
    import mock
ImportError: No module named mock

[BUG][Actor] client in a different process will hang when call actor pool

Describe the bug

When create an actor client in a different process with the actor pool, the call to the actor pool will hang.

To Reproduce

Start a pool.

from mars.actors import create_actor_pool  

pool = create_actor_pool('127.0.0.1:33445') 

Create a client in a different process.

from mars.actors import new_client

client = new_client()  
client.has_actor(client.actor_ref('0.0.0.0:33445', 's:KVStoreActor'))

The client will hang.

If we run the same code in the actor pool's own process, the hang does not occur.

[TENSOR] eager mode support

Is your feature request related to a problem? Please describe.

Currently, tensors can only be executed in graph mode: we build a tensor graph and submit it to the cluster when users call the execute method.

Although graph mode is designed for performance, it's quite unfriendly for development and hard to debug. Thus, I suggest we provide an eager mode.

Describe the solution you'd like

We can enable eager mode via an option.

Enable eager mode globally:

from mars.config import options

options.eager_mode = True

Or use a context.

from mars.config import option_context

with option_context() as options:
    options.eager_mode = True
    # the eager mode is on only for the with statement

Since every Tensor object is created by the new_tensors method defined in tensor/expressions/core.py, we can trigger graph submission there when eager mode is on, making sure the tensor is executed eagerly.

Additional context

Maybe we can abstract a new_entities method that is not only for tensors, in preparation for the coming dataframe module.

[BUG] mt.random.rand got an unexpected result

example:

In [1]: import mars.tensor as mt

In [2]: r = mt.random.rand((2, 3))

In [3]: r.execute()
Out[3]: array([0.86173081, 0.0367415 ])

This interface should raise a TypeError if the argument is a shape tuple, like NumPy does:

In [1]: import numpy as np

In [2]: np.random.rand((2, 3))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-e49eb55bb286> in <module>()
----> 1 np.random.rand((2, 3))

mtrand.pyx in mtrand.RandomState.rand()

mtrand.pyx in mtrand.RandomState.random_sample()

mtrand.pyx in mtrand.cont0_array()

TypeError: 'tuple' object cannot be interpreted as an integer

Refactor transfer.py to reduce complexity

Is your feature request related to a problem? Please describe.
The implementations of SenderActor and ReceiverActor use long and nested methods, which makes these actors difficult to maintain.

Describe the solution you'd like
Move nested functions into actor methods. Actor-level state storage can be extended to hold variables previously shared via closures. Test cases should also be added.

Loading and saving external data

This might be a silly question, but how do you load and save data with mars?

With dask.array, the main APIs are dask.array.from_array and dask.array.store. These allow for a great deal of flexibility: users can pass in arbitrary Python objects that support the __getitem__ and __setitem__ protocols.

I am intrigued by the idea of testing mars with xarray, but these APIs would be pretty essential for us. In xarray, we use them to support IO to various file formats (e.g., HDF5, netCDF, zarr).
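
For reference, a minimal sketch of the dask pattern referred to above, using only dask.array's public from_array/store API:

import numpy as np
import dask.array as da

source = np.random.random((1000, 1000))       # stands in for an HDF5/netCDF/zarr dataset
x = da.from_array(source, chunks=(100, 100))  # lazy view over anything with __getitem__
target = np.empty_like(source)
da.store(x + 1, target)                       # writes back via __setitem__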

[BUG][TENSOR] Different slice on the same tensor generate same operand key

Describe the bug

Different slices of the same tensor generate the same operand key, but the keys must be different.

To Reproduce

In [22]: base_arr = np.random.random((4000, 2))                                 

In [23]: a = mt.array(base_arr)                                                 

In [24]: a[:400].op.key                                                         
Out[24]: '52efbfe8840294880d82da453025296b'

In [25]: a[400: 800].op.key                                                     
Out[25]: '52efbfe8840294880d82da453025296b'

Memory copy is not necessary when IndexSetValue operand is composed

Is your feature request related to a problem? Please describe.
When executing an IndexSetValue operand, we copy the input value from the context. However, when the IndexSetValue is executed inside a fuse operand, the memory copy can be skipped, because an operand in a fusion won't be referenced by operands outside the current fuse operand.

Describe the solution you'd like
We can set a fuse flag on the execution thread and only copy when IndexSetValue is not in a fusion.
