Comments (5)
Hi @Dzg0309. You should upload your data to a filesystem like S3 and use Xorbits interfaces like read_csv or read_parquet (depending on your data format) to read it, then use the deduplication operation to complete what you want.
Here's an example:
- First start your Xorbits cluster and get the cluster endpoint.
import xorbits
import xorbits.pandas as pd

xorbits.init("<your xorbits cluster endpoint>")
# adjust the reader to your dataset format; this step processes your data in a distributed way
df = pd.read_csv("s3://xxx")
# then use Xorbits operators to complete your task
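To finish the example, here is a minimal sketch of the deduplication step (the dedup parameters mirror the ones used later in this thread, and the output path is illustrative):

from xorbits.experimental import dedup

# MinHash-based near-duplicate removal on the text column
res = dedup(df, col="content", method="minhash", threshold=0.7, num_perm=128,
            min_length=5, ngrams=5, seed=42)
# a * in the path lets each worker write its own chunk in parallel
res.to_parquet("s3://xxx/output/*.parquet")
xorbits.shutdown()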
> You should upload your data to a filesystem like S3 and use Xorbits interfaces like read_csv or read_parquet (depending on your data format) to read it,
Is it necessary to use the S3 filesystem? Or can we use HDFS, or read the data files directly from local disk? I tried putting all the data under the supervisor, and when the cluster started, the other workers reported file-not-found errors. When the data is split across all nodes (with each node's data directory kept identical), I still get a file-does-not-exist error during the operation, which confuses me a lot.
Below is my code:
import xorbits
import xorbits.pandas as pd
from xorbits.experimental import dedup
import xorbits.datasets as xdatasets
xorbits.init(address='http://xx.xx.xxx.xx:xxxx',
             session_id='xorbits_dedup_test_09',
             )
ds = xdatasets.from_huggingface("/passdata/xorbits_data/mnbvc_test", split='train', cache_dir='/passdata/.cache')
df = ds.to_dataframe()
res = dedup(df, col="content", method="minhash", threshold=0.7, num_perm=128, min_length=5, ngrams=5, seed=42) # for 'minhash' method
res.to_parquet('/passdata/xorbits_data/output')
xorbits.shutdown()
If the data is in a local directory, then each worker should have the same copy of the data at the same local path. Or, you can put the data in an S3 directory; then each worker gets its partition from S3 directly.
If your data is in CSV or Parquet format, you can try the read_parquet API: https://doc.xorbits.io/en/stable/reference/pandas/generated/xorbits.pandas.read_parquet.html#xorbits.pandas.read_parquet Such APIs allow for more flexible slicing of data, but local data still needs to be copied to each worker in advance.
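For example, a minimal sketch assuming the dataset has been copied to the same path on every worker (the paths and endpoint are illustrative):

import xorbits
import xorbits.pandas as pd

xorbits.init("<your xorbits cluster endpoint>")
# every worker must have this exact local path; an s3:// URI avoids the copying step
df = pd.read_parquet("/passdata/xorbits_data/mnbvc_test")
print(df.shape)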
OK, I tried copying the data to each node and it worked, but two other problems came up:
- Can Xorbits' read_json directly take the path of a data folder and load the JSON files in parallel?
- After deduplication, I wanted to use to_parquet('/passdata/xorbits_data/output') to save the data. I found it very slow, and it only saved a 0.parquet file to one of the nodes, which gives me a headache. I want it to save to multiple nodes in parallel to speed up saving. What should I do?
- Currently, the read_json API is not implemented, so it falls back to pandas read_json, which is not distributed. If your JSON data is in JSONL format (each line is a JSON string), then we can schedule a PR to implement a distributed read_json.
- to_parquet accepts a path containing * to write each chunk to the node that generated it. If your to_parquet only saves the data to one node, you may want to check the following (see the sketch after this list):
  - Is * in your save path? If not, add a * to it.
  - Is the data being written already tiled? If not, rechunk it or use a distributed data source, e.g. read_csv.
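For instance, a minimal sketch of the suggested fix (the output path is illustrative):

# the * lets each node write its own chunk in parallel instead of
# funneling everything into a single 0.parquet file
res.to_parquet('/passdata/xorbits_data/output/*.parquet')
# if the DataFrame came from a non-distributed source (e.g. the pandas
# read_json fallback), re-read it with a distributed reader such as
# read_parquet so it is tiled across workers before writing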