mm-mansour / fast-pandas

967.0 36.0 87.0 1.58 MB

Benchmark for different operations in pandas against various dataframe sizes.

Python 100.00%

fast-pandas's Issues

Potential typo in section 1.3 of readme

Hey there,

Thanks for all the work you put into this, it will be a great reference in the future.

I noticed something that could be either a typo or an overlooked error - or I could just be misunderstanding. In section 1.3 of the readme, it seems like you've swapped the contents of the query_selection and bracket_selection functions:

def query_selection(df):
    return df[(df["A"] > 0) & (df["A"] < 100)]

def bracket_selection(df):
    return df.query("A > 0 and A < 100")

It seems like query_selection should be making a call to the df.query() method, and bracket_selection should be using [] to select data. If this isn't the case, then I'd argue you should probably use different or more explicit function names, as I definitely got the wrong impression from your current names. =)
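For reference, here is a minimal sketch of the naming this issue suggests, with each function body matching its name. The test frame, its size, and the value range are assumptions made for illustration; only the column name "A" and the two filter expressions come from the readme snippet quoted above.

```python
import numpy as np
import pandas as pd


def bracket_selection(df):
    # Boolean-mask selection using square brackets.
    return df[(df["A"] > 0) & (df["A"] < 100)]


def query_selection(df):
    # String-expression selection using DataFrame.query.
    return df.query("A > 0 and A < 100")


# Hypothetical test frame (not from the repository).
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(-50, 150, size=1_000)})

# Both functions should select exactly the same rows.
same = bracket_selection(df).equals(query_selection(df))
print(same)
```

Since both functions return the same rows, a swap in the write-up would not change any results, only which timing is attributed to which method.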

Finally, if this was unintentional, I'd be curious as to whether the benchmarks were actually reversed, or if this was just a typo in creating the write-up.

Thanks again!

Use standard deviation to have significant analysis

Reporting only the mean isn't a good way to evaluate these results. Plot the standard deviation as well, or present the timings as box plots. You should also perform a hypothesis test (t-test or z-test) to check whether the differences are significant.
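One way the suggestion above could look in practice is sketched below, using only the standard library. The two workloads `method_a` and `method_b` are hypothetical stand-ins for whichever pandas operations are being compared, and `time_samples` and `welch_t` are illustrative helpers, not part of the repository. Welch's t statistic is used here as one possible test; timing data is often skewed, so treat the statistic as a rough indicator.

```python
import math
import statistics
import timeit


def time_samples(func, repeat=10, number=100):
    """Return one per-call timing (in seconds) for each of `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return [total / number for total in totals]


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical stand-in workloads; replace with the operations under test.
def method_a():
    return sum(range(1_000))


def method_b():
    return sum(list(range(1_000)))


a = time_samples(method_a)
b = time_samples(method_b)
for name, sample in (("method_a", a), ("method_b", b)):
    mean, std = statistics.mean(sample), statistics.stdev(sample)
    print(f"{name}: mean={mean:.3e}s  std={std:.3e}s")
print(f"Welch t = {welch_t(a, b):.2f}")
```

The per-run samples returned by `time_samples` can also be passed directly to a box plot function if a plotting library is available.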

Improving the timing even more

I don't know the implementation details of np.mean and df["A"].mean(), but I guess that np.mean(df["A"]) is slow because NumPy has to convert the pd.Series in some way.
Using np.mean(df["A"].values) is probably always faster, especially for large arrays.
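The guess above is easy to check. The following sketch times the three variants on a hypothetical frame (the size and random data are assumptions) and confirms they compute the same value. It makes no claim about which one wins; that may depend on the pandas and NumPy versions installed.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: one float column named "A".
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100_000)})

candidates = {
    'df["A"].mean()': lambda: df["A"].mean(),
    'np.mean(df["A"])': lambda: np.mean(df["A"]),
    'np.mean(df["A"].values)': lambda: np.mean(df["A"].values),
}

# All three variants should agree up to floating-point rounding.
results = {label: fn() for label, fn in candidates.items()}

for label, fn in candidates.items():
    # Best of 5 runs, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, repeat=5, number=10)) / 10
    print(f"{label:<26} {best * 1e3:.3f} ms")
```

In current pandas, `df["A"].to_numpy()` is the recommended way to get the underlying array and could be timed in place of `.values`.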

benchmarks for pivot, groupby, resampling

These microbenchmarks are informative, but when I use pandas for TensorFlow data preparation on a larger dataset (2.5 million events), pandas ventures into swap territory (for me).

Maybe there are expert pandas techniques for this, and more than one way to do things.
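As a starting point for the benchmarks requested here, a minimal timing sketch for the three operations might look like the following. The frame layout, column names (`key`, `cat`, `value`), and sizes are all assumptions chosen for illustration. It measures run time only and does not address the memory pressure described above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows at one-minute intervals.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 100, size=n),
        "cat": rng.integers(0, 10, size=n),
        "value": rng.random(n),
    },
    index=pd.date_range("2020-01-01", periods=n, freq="min"),
)

operations = {
    # Mean of "value" per key.
    "groupby": lambda: df.groupby("key")["value"].mean(),
    # Mean of "value" per (key, cat) pair, laid out as a table.
    "pivot_table": lambda: df.pivot_table(
        index="key", columns="cat", values="value", aggfunc="mean"
    ),
    # Daily mean of "value" from minute-level data.
    "resample": lambda: df["value"].resample("D").mean(),
}

for name, op in operations.items():
    best = min(timeit.repeat(op, repeat=3, number=1))
    print(f"{name:<12} {best * 1e3:.2f} ms")
```

Increasing `n` toward the 2.5 million rows mentioned above would show how each operation scales on a given machine.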

Either legend is wrong or bullet points are wrong

In the selection section these bullet points don't match the graph according to the legend.

loc and query selections are identical in performance.
Square bracket selection is the slowest method.

According to the graph it looks like loc and square brackets are identical in performance. It looks like query is the slowest method. I'm not sure if the issue is with the legend in the graph or the bullet points.
