mm-mansour / fast-pandas
Benchmark for different operations in pandas against various dataframe sizes.
Hey there,
Thanks for all the work you put into this, it will be a great reference in the future.
I noticed something that could be either a typo or an overlooked error, or I could just be misunderstanding. In section 1.3 of the readme, it seems like you've swapped the contents of the query_selection and bracket_selection functions:
def query_selection(df):
    return df[(df["A"] > 0) & (df["A"] < 100)]

def bracket_selection(df):
    return df.query("A > 0 and A < 100")
It seems like query_selection should be making a call to the df.query() method, and bracket_selection should be using [] to select data. If this isn't the case, then I'd argue you should probably use different or more explicit function names, as I definitely got the wrong impression from your current names. =)
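For reference, here is a minimal sketch of what I would have expected the two functions to look like if the names are meant to match the methods. The small example DataFrame is my own and is not taken from the readme:

```python
import pandas as pd


def bracket_selection(df):
    # Boolean-mask selection with the [] operator.
    return df[(df["A"] > 0) & (df["A"] < 100)]


def query_selection(df):
    # The same filter expressed through DataFrame.query().
    return df.query("A > 0 and A < 100")


# Hypothetical data, only to check that both functions select the same rows.
df = pd.DataFrame({"A": [-5, 1, 50, 99, 100, 250]})
assert bracket_selection(df).equals(query_selection(df))
```

Both versions should return the same rows, so the only open question is which timing belongs to which name.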
Finally, if this was unintentional, I'd be curious as to whether the benchmarks were actually reversed, or if this was just a typo in creating the write-up.
Thanks again!
Using only the mean isn't a good way to evaluate these results. Plot the standard deviation or do the analysis using a box plot. You should also perform a hypothesis test (t-test or z-test).
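As a rough sketch of what I mean, using only the standard library: collect several timings per candidate, report mean and standard deviation, and compute a Welch t statistic between the two samples. The helper names (summarize, welch_t) and the two toy workloads are my own, not something from this repository, and turning the statistic into a p-value would still need a t-distribution (for example from SciPy):

```python
import math
import statistics
import timeit


def summarize(samples):
    # Mean and sample standard deviation of repeated timings.
    return statistics.mean(samples), statistics.stdev(samples)


def welch_t(a, b):
    # Welch's t statistic for two samples with possibly unequal variances.
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    std_err = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    if std_err == 0:
        # No spread in either sample: the statistic is 0 or unbounded.
        if mean_a == mean_b:
            return 0.0
        return math.copysign(math.inf, mean_a - mean_b)
    return (mean_a - mean_b) / std_err


# Two toy workloads standing in for the benchmarked pandas operations.
data = list(range(10_000))
builtin_times = timeit.repeat(lambda: sum(data), number=20, repeat=10)
generator_times = timeit.repeat(lambda: sum(x for x in data), number=20, repeat=10)

for label, times in [("sum(data)", builtin_times), ("sum(generator)", generator_times)]:
    mean, std = summarize(times)
    print(f"{label}: mean={mean:.6f}s std={std:.6f}s")
print(f"Welch t statistic: {welch_t(builtin_times, generator_times):.2f}")
```

The raw timing lists can also be passed straight to a box plot, which shows the spread that a single average hides.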
I don't know the implementation details of np.mean and df["A"].mean(), but I guess that np.mean(df["A"]) is slow because numpy has to cast the pd.Series in some way.
Using np.mean(df["A"].values) is probably always faster, especially for large arrays.
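A quick way to check this guess would be something like the sketch below. The array size, the seed, and the repeat counts are arbitrary choices of mine, and I'm not claiming any particular result, since timings depend on the machine and the pandas/numpy versions:

```python
import math
import timeit

import numpy as np
import pandas as pd

# Arbitrary test data; the size is an assumption, not taken from the benchmark.
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(100_000)})

candidates = {
    'df["A"].mean()': lambda: df["A"].mean(),
    'np.mean(df["A"])': lambda: np.mean(df["A"]),
    'np.mean(df["A"].values)': lambda: np.mean(df["A"].values),
}

# All three spellings should agree on the value up to floating-point error.
results = {name: float(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of several repeats reduces noise from other processes.
    best = min(timeit.repeat(fn, number=20, repeat=5))
    print(f"{name}: {best / 20 * 1e6:.1f} microseconds per call")
```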
These microbenchmarks are informative, but when supporting TensorFlow data prep for a larger dataset (2.5 million events), pandas ventures into swap territory (for me).
Maybe there are expert pandas techniques and more than one way to do things.
In the selection section, the bullet points don't match the graph according to its legend.
According to the graph it looks like loc and square brackets are identical in performance. It looks like query is the slowest method. I'm not sure if the issue is with the legend in the graph or the bullet points.