
optimization-of-pandas-for-large-csv's Introduction

optimization-of-pandas-for-large-CSV

Pandas is one of the most widely used Python libraries for data manipulation and analysis. Loading data (for example with pd.read_csv) is an unavoidable first step, but for large CSV files it can take a lot of memory and time, which becomes very inefficient whenever the raw data has to be reloaded during an analysis. Dataquest.io published a tutorial on reducing the memory footprint of pandas DataFrames: with nothing more than simple data-type conversions, the memory usage of a baseball dataset was cut by nearly 90%, and the compressed storage formats that pandas supports out of the box let us read the data back quickly.
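The whole trick is dtype conversion. A minimal sketch of that idea in plain pandas (this is not the code from this repo; the 0.5 category threshold and the file name are illustrative assumptions):

import pandas as pd

def downcast_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Shrink a DataFrame's memory footprint with simple dtype conversions."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Smallest integer type that still holds every value.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif series.dtype == object:
            # Object columns with few distinct values are far smaller as 'category'.
            if series.nunique(dropna=False) / max(len(series), 1) < 0.5:
                out[col] = series.astype("category")
    return out

# df = pd.read_csv("your_large_file.csv")
# df.info(memory_usage="deep")                       # memory before
# downcast_dataframe(df).info(memory_usage="deep")   # memory after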

1. Method introduction

Blog: see the References section below.

2. Using the class

Step 1, import: from Reduce_fastload import reduce_fastload
Step 2, instantiate: process = reduce_fastload('your path', use_HDF5=True/False, use_feather=True/False)
Step 3, optimize the raw data in memory: process.reduce_data()
Step 4, load the optimized data: process_data = process.reload_data() (full sketch below)
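Putting the four steps together, a minimal end-to-end sketch ('your_large_file.csv' is a placeholder; the timing lines are only there to make the speed-up visible):

import time
from Reduce_fastload import reduce_fastload

# Steps 1-2: point the helper at the original CSV and pick HDF5 as the on-disk format.
process = reduce_fastload('your_large_file.csv', use_HDF5=True, use_feather=False)

# Step 3: downcast dtypes and write the optimized copy to disk.
process.reduce_data()

# Step 4: reload the optimized data and check how long it takes.
start = time.time()
process_data = process.reload_data()
print(f'reload_data took {time.time() - start:.2f} s')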

Example

(Screenshot: memory usage and loading time before and after optimization.)
As the screenshot shows, the original CSV occupies 616.95 MB in memory, while the optimized version occupies only 173.9 MB. Compared with the 7.7 s that pd.read_csv needs for the original file, reading the preprocessed, optimized file is much faster.
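To reproduce numbers like these on your own data, the before/after figures can be measured with standard pandas calls. 'processed_data.h5' and the 'preprocessed_df' key are the names reduce_data() appears to use (see the traceback in the Issues section); treat the file names as assumptions:

import time
import pandas as pd

# Baseline: load the raw CSV and report its in-memory size.
start = time.time()
df = pd.read_csv('your_large_file.csv')
print(f'read_csv: {time.time() - start:.1f} s, '
      f'{df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB')

# Optimized copy: load the HDF5 file written by reduce_data().
start = time.time()
df_opt = pd.read_hdf('processed_data.h5', 'preprocessed_df')
print(f'read_hdf: {time.time() - start:.1f} s, '
      f'{df_opt.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB')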

Reference

[1] https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
[2] https://zhuanlan.zhihu.com/p/56541628
[3] https://blog.csdn.net/weiyongle1996/article/details/78498603


optimization-of-pandas-for-large-csv's Issues

Sorry, I ran into an error

Hello, while using your code I got the following error:
import time
from reduce_fastload.Reduce_fastload import *
process1 = reduce_fastload(target_path+'icgc_tcga_uniprotID_20201026.tsv',use_HDF5=True)
process1.reduce_data()
process_data1 = process1.reload_data()

bug : Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
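A likely, though unconfirmed, cause: the input is a tab-separated .tsv, while pd.read_csv defaults to sep=',', so rows containing a varying number of commas trigger exactly this tokenizing error. A quick check with plain pandas, outside the helper class:

import pandas as pd

path = 'icgc_tcga_uniprotID_20201026.tsv'  # the file from the report above

# With the default comma separator, a TSV whose rows contain stray commas can raise
# pandas.errors.ParserError: "Error tokenizing data. C error: Expected N fields ...".
try:
    df = pd.read_csv(path)
except pd.errors.ParserError as exc:
    print('comma separator failed:', exc)

# Passing the tab separator explicitly usually parses a .tsv cleanly.
df = pd.read_csv(path, sep='\t')
print(df.shape)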

reduce_fastload('xxxxcsv', use_HDF5=True): use_HDF5 raises an error


TypeError Traceback (most recent call last)
in
1 process1=reduce_fastload('titanic_training_set-1.csv',use_HDF5=True)
----> 2 process1.reduce_data()
3 print('读入 h5 file: ')
4 process1_data=process1.reload_data()

~/Documents/pandas-optimate/Reduce_fastload.py in reduce_data(self)
63 data_store = pd.HDFStore('processed_data.h5')
64 # Store object in HDFStore
---> 65 data_store.put('preprocessed_df', df, format='table')
66
67 data_store.close()

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times, dropna)
1090 format = get_option("io.hdf.default_format") or "fixed"
1091 format = self._validate_format(format)
-> 1092 self._write_to_group(
1093 key,
1094 value,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
1740
1741 # write the object
-> 1742 s.write(
1743 obj=value,
1744 axes=axes,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, track_times)
4251 # validate the axes and set the kinds
4252 for a in table.axes:
-> 4253 a.validate_and_set(table, append)
4254
4255 # add the rows

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in validate_and_set(self, handler, append)
2103 self.validate_attr(append)
2104 self.validate_metadata(handler)
-> 2105 self.write_metadata(handler)
2106 self.set_attr()
2107

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in write_metadata(self, handler)
2193 """ set the meta data """
2194 if self.metadata is not None:
-> 2195 handler.write_metadata(self.cname, self.metadata)
2196
2197

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in write_metadata(self, key, values)
3415 """
3416 values = Series(values)
-> 3417 self.parent.put(
3418 self._get_metadata_path(key),
3419 values,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times, dropna)
1090 format = get_option("io.hdf.default_format") or "fixed"
1091 format = self._validate_format(format)
-> 1092 self._write_to_group(
1093 key,
1094 value,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
1740
1741 # write the object
-> 1742 s.write(
1743 obj=value,
1744 axes=axes,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in write(self, obj, data_columns, **kwargs)
4543 name = obj.name or "values"
4544 obj = obj.to_frame(name)
-> 4545 return super().write(obj=obj, data_columns=obj.columns.tolist(), **kwargs)
4546
4547 def read(

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, track_times)
4216
4217 # create the axes
-> 4218 table = self._create_axes(
4219 axes=axes,
4220 obj=obj,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
3885
3886 new_name = name or f"values_block_{i}"
-> 3887 data_converted = _maybe_convert_for_string_atom(
3888 new_name,
3889 b,

~/git_repo/miniconda3/lib/python3.8/site-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
4885 # we cannot serialize this data, so report an exception on a column
4886 # by column basis
-> 4887 for i in range(len(block.shape[0])):
4888 col = block.iget(i)
4889 inferred_type = lib.infer_dtype(col, skipna=False)

TypeError: object of type 'int' has no len()
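The last frames show the failure is inside pandas itself: _maybe_convert_for_string_atom calls len(block.shape[0]), and block.shape[0] is an int, hence the TypeError. That points at the installed pandas version rather than at this repo, so upgrading pandas is the first thing to try. As an unverified workaround sketch, the frame can also be written with format='fixed' instead of format='table'; note that the fixed format cannot store category columns, so this only applies while the columns are still plain numeric/object dtypes:

import pandas as pd

df = pd.read_csv('titanic_training_set-1.csv')  # the file from the report above

# Workaround sketch (assumption, not verified against this repo): the 'fixed'
# format skips the per-column string-atom conversion where the traceback ends.
with pd.HDFStore('processed_data.h5') as store:
    store.put('preprocessed_df', df, format='fixed')

df_back = pd.read_hdf('processed_data.h5', 'preprocessed_df')
print(df_back.shape)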
