GithubHelp home page GithubHelp logo

fabianschuetze / dataframe Goto Github PK

View Code? Open in Web Editor NEW
21.0 4.0 4.0 27.97 MB

A Pandas-like DataFrame in c++

C++ 99.62% CMake 0.38%
cpp cpp17 data dataframes heterogeneous-data template-metaprogramming data-science stl-algorithms

dataframe's Introduction

codecov Build Status C++17

A pandas like dataframe in c++

Copy-on-Write

A strong motivation to write this form of DataFrames, was Wes McKinney's blog post about (his new project) Apache Arrow and the "10 Things I Hate About pandas" My Kinney says that keeping track of pandas memory is difficult as data can either be appropriated by pandas or the underlying numpy arrays. He would instead use pointers to individuals columns whenever possible and create a unique column when the column is modified and more than one DataFrame references it. I implemented such a copy-on-write idiom here:

vector<string> col = {"a", "b"};
vector<int> idx({1, 2});
vector<string> name({"column_a"});
DataFrame df1(Index(idx), name, col);
cout << df1.use_count("column_a") << endl; 
1
DataFrame df2 = df1;
cout << df1.use_count("column_a") << endl;
2
DataFrame::iterator<string> it = df1.begin<string>("column_a");
DataFrame::iterator<string>  e = df1.end<string>("column_a");
transform(it, e, it, [](auto& d) {return  d + "_modify";});
cout << df1.use_count("column_a") << endl;
1

Assignments

One of the things I like most about pandas is the ease with which one can assign new columns. Thanks to Scott Meyers More Effective C++ I could use the ProxyClass idiom to assign new columns too:

vector<double> first({30});
vector<string> names = {"first_col"};
DataFrame df1 = DataFrame(names, first);
 -----------------
 |   | first_col |
 -----------------
 | 0 | 30.000000 |
 -----------------

vector<string> replace({"a"});
df1["first_col"] = replace;
 -----------------
 |   | first_col |
 -----------------
 | 0 | a         |
 -----------------

vector<vector<double>> expand_by_two({{-100, -200}, {400, 600}});
df1[vector<string>({"third_col", "fourth_col"})] = expand_by_two;
 --------------------------------------------
 |   | first_col | fourth_col | third_col   |
 --------------------------------------------
 | 0 | a         | 400.000000 | -100.000000 |
 --------------------------------------------
vector<bool> other({true, false});
DataFrame df2 = DataFrame(names, other);
df1["first_col"] = df2;
 --------------------------------------------
 |   | first_col | fourth_col | third_col   |
 --------------------------------------------
 | 0 | 1         | 400.000000 | -100.000000 |
 --------------------------------------------

STL Integration

Each column of the dataframe can be invoked as (bi-directional) iterators. The iterators can thus cooperate with the STL algorithms, std::algortihm. Examples are:

c++
vector<double> data({1, 2, 3});
vector<string> idx({"a", "b", "c"});
DataFrame df = DataFrame(Index(idx), {"first_col"}, data);
DataFrame::iterator<double> it( df.begin<double>("first_col"));
DataFrame::iterator<double>  e(df.end<double>("first_col"));
auto res = find(it, e, 2); \\finds the value
transform(it, e, it, [](auto& d) {return sqrt(d);});
cout << df << endl;
 -----------------
 |   | first_col |
 -----------------
 | a | 1.000000  |
 | b | 1.414214  |
 | c | 1.732051  |
 -----------------

SummaryStatistics

One can calculate various summary statistics across columns.

vector<string> key1({"a", "a", "b", "b", "a"});
vector<double> data1({1, 1, 3, 4, 5});
vector<double> data2({11, 12, 13, 14, 15});
DataFrame df({"key1", "data1", "data2"}, key1, data1, data2);
head t;
Statistic *p = &t;
cout << df.summarize(p) << endl;
 --------------------------------------
 |      | data1    | data2     | key1 |
 --------------------------------------
 | head | 1.000000 | 11.000000 | a    |
 --------------------------------------

 ---------------------------
 |             | data2     |
 ---------------------------
 | a, 1.000000 | 11.500000 |
 | a, 5.000000 | 15.000000 |
 | b, 3.000000 | 13.000000 |
 | b, 4.000000 | 14.000000 |
 ---------------------------

Groupby

The DataFrame can be grouped by values in one or multiple columns.

vector<string> key1({"a", "a", "b", "b", "a"});
vector<double> data1({1, 1, 3, 4, 5});
vector<double> data2({11, 12, 13, 14, 15});
DataFrame df({"key1", "data1", "data2"}, key1, data1, data2);
mean t;
Statistic *p = &t;
DataFrame::Grouper<string, double> grouper =
df.groupby<string, double>(df.cbegin<string>("key1"),
                           df.cbegin<double>("data1"));
cout << grouper.summarize(p) << endl;
 ---------------------------
 |             | data2     |
 ---------------------------
 | a, 1.000000 | 11.500000 |
 | a, 5.000000 | 15.000000 |
 | b, 3.000000 | 13.000000 |
 | b, 4.000000 | 14.000000 |
 ---------------------------

Iterators

The columns of the iterators provide const and non_const (bi-directional) iterators. These iterators traverse in the direction of the DataFrame index. They can be used in conjunction with the STL algorithms.

vector<string> key1({"a", "a", "b", "b", "f"});
vector<double> data1({1, 2, 3, 4, 5});
std::vector<std::string> idx = {"1", "2", "4", "5", "6"};
DataFrame df(Index(idx), {"key1", "data1"}, key1, data1);
DataFrame::iterator<string> it(df.begin<string>("key1"));
DataFrame::iterator<string> e(df.end<string>("key1"));
vector<string>res(it, e);
std::reverse_copy(it, e, it);

CSV Reader

A DataFrame can be constructed from a CSV file. It is assumed that the first two lines of the CSV file describe the schema of the table: The first line contains the column names and the second line the datatypes.

ifstream infile("csv_example1.csv");
DataFrame df1(infile);
 ---------------------------------------
 |            | first_col | second_col |
 ---------------------------------------
 | 2019/02/02 | 10.000000 | india      |
 ---------------------------------------

Addition

Addition relates the different column and indices of both DataFrames automatically:

DataFrame dfa(Index(vector<int>({1, 1, 3})), {"colOne"},
              vector<double>({-6, 0, 10}));
DataFrame dfb(Index(vector<int>({1})), {"colOne"}, vector<double>({8}));
DataFrame sum = dfa + dfb;
std::cout << sum << std::endl;
 ----------------
 |   | colOne   |
 ----------------
 | 1 | 2.000000 |
 | 1 | 8.000000 |
 | 3 | nan      |
 ----------------

NA Values

One can work with na values. NA values can be dropped or created when adding dataframes with different indices

ifstream infile("contains_nan.csv");
DataFrame df1(infile);
cout << df1 << endl;
 ---------------------------------------
 |            | first_col | second_col |
 ---------------------------------------
 | 2019/02/02 | 10.000000 | nan        |
 | 2019/02/03 | nan       | 50.000000  |
 | 2019/02/04 | 10.000000 | -10.000000 |
 ---------------------------------------
 df1.dropna();
 ---------------------------------------
 |            | first_col | second_col |
 ---------------------------------------
 | 2019/02/04 | 10.000000 | -10.000000 |
 ---------------------------------------

dataframe's People

Contributors

fabianschuetze avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.