
gota's Introduction

Gota: DataFrames, Series and Data Wrangling for Go

Meet us on Slack: gophers.slack.com #go-gota (invite)

This is an implementation of DataFrames, Series and data wrangling methods for the Go programming language. The API is still in flux so use at your own risk.

DataFrame

The term DataFrame typically refers to a tabular dataset that can be viewed as a two-dimensional table. Often the columns of this dataset refer to a list of features, while the rows represent a number of measurements. Since real-world data is not perfect, DataFrame supports missing values (NaN elements).

Common examples of DataFrames can be found in Excel sheets, CSV files or SQL database tables, but this data can come in a variety of other formats, like a collection of JSON objects or XML files.

The utility of DataFrames resides in the ability to subset them, merge them, summarize the data for individual features or apply functions to entire rows or columns, all while keeping column type integrity.

Usage

Loading data

DataFrames can be constructed by passing Series to the dataframe.New constructor function:

df := dataframe.New(
	series.New([]string{"b", "a"}, series.String, "COL.1"),
	series.New([]int{1, 2}, series.Int, "COL.2"),
	series.New([]float64{3.0, 4.0}, series.Float, "COL.3"),
)

You can also load data directly from other formats. The base loading function takes records in the form [][]string and returns a new DataFrame built from them:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)

You can also create DataFrames by loading a slice of arbitrary structs:

type User struct {
	Name     string
	Age      int
	Accuracy float64
	ignored  bool // ignored since unexported
}
users := []User{
	{"Aram", 17, 0.2, true},
	{"Juan", 18, 0.8, true},
	{"Ana", 22, 0.5, true},
}
df := dataframe.LoadStructs(users)

By default, the column types will be auto-detected, but this can be configured. For example, if we want the default type to be Float but columns A and D to be String and Bool respectively:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
    dataframe.DetectTypes(false),
    dataframe.DefaultType(series.Float),
    dataframe.WithTypes(map[string]series.Type{
        "A": series.String,
        "D": series.Bool,
    }),
)

Similarly, you can load data stored in a []map[string]interface{}:

df := dataframe.LoadMaps(
    []map[string]interface{}{
        map[string]interface{}{
            "A": "a",
            "B": 1,
            "C": true,
            "D": 0,
        },
        map[string]interface{}{
            "A": "b",
            "B": 2,
            "C": true,
            "D": 0.5,
        },
    },
)

You can also pass an io.Reader to the ReadCSV/ReadJSON functions and, given that the data is well formed, it will work as expected:

csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
df := dataframe.ReadCSV(strings.NewReader(csvStr))
jsonStr := `[{"COL.2":1,"COL.3":3},{"COL.1":5,"COL.2":2,"COL.3":2},{"COL.1":6,"COL.2":3,"COL.3":1}]`
df := dataframe.ReadJSON(strings.NewReader(jsonStr))
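
Since parsing can fail on malformed input, it is worth checking the DataFrame's Err field (described under Chaining operations below) right after loading; a minimal sketch:

df := dataframe.ReadCSV(strings.NewReader(csvStr))
if df.Err != nil {
    log.Fatalf("could not parse CSV: %v", df.Err)
}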

Subsetting

We can subset our DataFrames with the Subset method. For example, if we want the first and third rows, we can do the following:

sub := df.Subset([]int{0, 2})

Column selection

If instead of subsetting the rows we want to select specific columns, we can do so by column index or column name:

sel1 := df.Select([]int{0, 2})
sel2 := df.Select([]string{"A", "C"})

Updating values

In order to update the values of a DataFrame we can use the Set method:

df2 := df.Set(
    []int{0, 2},
    dataframe.LoadRecords(
        [][]string{
            []string{"A", "B", "C", "D"},
            []string{"b", "4", "6.0", "true"},
            []string{"c", "3", "6.0", "false"},
        },
    ),
)

Filtering

For more complex row subsetting we can use the Filter method. For example, if we want the rows where column "A" is equal to "a" or column "B" is greater than 4:

fil := df.Filter(
    dataframe.F{"A", series.Eq, "a"},
    dataframe.F{"B", series.Greater, 4},
)

filAlt := df.FilterAggregation(
    dataframe.Or,
    dataframe.F{"A", series.Eq, "a"},
    dataframe.F{"B", series.Greater, 4},
) 

Filters inside Filter are combined as OR operations; alternatively, we can use df.FilterAggregation with dataframe.Or, as shown above.

If we want to combine filters with AND operations, we can use df.FilterAggregation with dataframe.And.

fil := df.FilterAggregation(
    dataframe.And, 
    dataframe.F{"A", series.Eq, "a"},
    dataframe.F{"D", series.Eq, true},
)

To combine AND and OR operations, we can chain filters:

// combine filters with OR
fil := df.Filter(
    dataframe.F{"A", series.Eq, "a"},
    dataframe.F{"B", series.Greater, 4},
)
// apply AND for fil and fil2
fil2 := fil.Filter(
    dataframe.F{"D", series.Eq, true},
)

Filtering is based on predefined comparison operators:

  • series.Eq
  • series.Neq
  • series.Greater
  • series.GreaterEq
  • series.Less
  • series.LessEq
  • series.In
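
For example, series.In keeps the rows whose value is contained in a set of candidates; a short sketch, assuming In accepts a slice as the comparando:

fil := df.Filter(
    dataframe.F{"A", series.In, []string{"a", "b"}},
)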

However, if these filter operations are not sufficient, we can use user-defined comparators. We pass series.CompFunc together with a user-defined function with the signature func(series.Element) bool to provide user-defined filters to df.Filter and df.FilterAggregation.

hasPrefix := func(prefix string) func(el series.Element) bool {
    return func(el series.Element) bool {
        if el.Type() == series.String {
            if val, ok := el.Val().(string); ok {
                return strings.HasPrefix(val, prefix)
            }
        }
        return false
    }
}

fil := df.Filter(
    dataframe.F{"A", series.CompFunc, hasPrefix("aa")},
)

This example filters rows based on whether they have a cell value starting with "aa" in column "A".

GroupBy && Aggregation

groups := df.GroupBy("key1", "key2") // Group by column "key1" and column "key2"
aggre := groups.Aggregation(
    []dataframe.AggregationType{dataframe.Aggregation_MAX, dataframe.Aggregation_MIN},
    []string{"values", "values2"},
) // Maximum of column "values", minimum of column "values2"

Arrange

With Arrange a DataFrame can be sorted by the given column names:

sorted := df.Arrange(
    dataframe.Sort("A"),    // Sort in ascending order
    dataframe.RevSort("B"), // Sort in descending order
)

Mutate

If we want to modify a column, or add a new one at the end, based on a given Series, we can use the Mutate method:

// Change column C with a new one
mut := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "C"),
)
// Add a new column E
mut2 := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "E"),
)

Joins

Different join operations are supported (InnerJoin, LeftJoin, RightJoin, CrossJoin). In order to use these methods, you have to specify the key columns used to join the DataFrames:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)
df2 := dataframe.LoadRecords(
    [][]string{
        []string{"A", "F", "D"},
        []string{"1", "1", "true"},
        []string{"4", "2", "false"},
        []string{"2", "8", "false"},
        []string{"5", "9", "false"},
    },
)
join := df.InnerJoin(df2, "D")

Function application

Functions can be applied to the rows or columns of a DataFrame, casting the types as necessary:

mean := func(s series.Series) series.Series {
    floats := s.Float()
    sum := 0.0
    for _, f := range floats {
        sum += f
    }
    return series.Floats(sum / float64(len(floats)))
}
df.Capply(mean)
df.Rapply(mean)
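
Note that Capply and Rapply return a new DataFrame rather than modifying the receiver, so in practice the result is assigned:

colMeans := df.Capply(mean)
rowMeans := df.Rapply(mean)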

Chaining operations

DataFrames support a number of methods for wrangling data: filtering, subsetting, selecting columns, adding new columns or modifying existing ones. All these methods can be chained one after another, and at the end of the procedure we can check whether any errors occurred via the DataFrame's Err field. If any method in the chain returns an error, the remaining operations in the chain become a no-op.

a = a.Rename("Origin", "Country").
    Filter(dataframe.F{"Age", "<", 50}).
    Filter(dataframe.F{"Origin", "==", "United States"}).
    Select("Id", "Origin", "Date").
    Subset([]int{1, 3})
if a.Err != nil {
    log.Fatal("Oh noes!")
}

Print to console

fmt.Println(flights)

> [336776x20] DataFrame
> 
>     X0    year  month day   dep_time sched_dep_time dep_delay arr_time ...
>  0: 1     2013  1     1     517      515            2         830      ...
>  1: 2     2013  1     1     533      529            4         850      ...
>  2: 3     2013  1     1     542      540            2         923      ...
>  3: 4     2013  1     1     544      545            -1        1004     ...
>  4: 5     2013  1     1     554      600            -6        812      ...
>  5: 6     2013  1     1     554      558            -4        740      ...
>  6: 7     2013  1     1     555      600            -5        913      ...
>  7: 8     2013  1     1     557      600            -3        709      ...
>  8: 9     2013  1     1     557      600            -3        838      ...
>  9: 10    2013  1     1     558      600            -2        753      ...
>     ...   ...   ...   ...   ...      ...            ...       ...      ...
>     <int> <int> <int> <int> <int>    <int>          <int>     <int>    ...
> 
> Not Showing: sched_arr_time <int>, arr_delay <int>, carrier <string>, flight <int>,
> tailnum <string>, origin <string>, dest <string>, air_time <int>, distance <int>, hour <int>,
> minute <int>, time_hour <string>

Interfacing with gonum

A gonum/mat.Matrix, or any object that implements the dataframe.Matrix interface, can be loaded as a DataFrame using the LoadMatrix() method. To convert a DataFrame to a mat.Matrix, it is necessary to create a small wrapper struct. Since DataFrame already implements the Dims() (r, c int) method, only the At and T methods need implementations:

type matrix struct {
	dataframe.DataFrame
}

func (m matrix) At(i, j int) float64 {
	return m.Elem(i, j).Float()
}

func (m matrix) T() mat.Matrix {
	return mat.Transpose{m}
}
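
With this wrapper in place, a DataFrame can be handed to gonum functions that expect a mat.Matrix; a brief sketch, assuming a fully numeric df and the gonum.org/v1/gonum/mat package:

m := matrix{df}
fmt.Println(mat.Sum(m)) // sums every element through the At method above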

Series

Series are essentially vectors of elements of the same type with support for missing values. Series are the building blocks for DataFrame columns.

Four types are currently supported:

  • Int
  • Float
  • String
  • Bool
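
Series can be built with series.New, as in the first example of this document, or with the typed shorthand constructors; a small sketch:

s := series.New([]int{1, 2, 3}, series.Int, "nums")
t := series.Strings([]string{"a", "b", "c"})
fmt.Println(s.Float()) // converted to []float64: [1 2 3]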

For more information about the API, make sure to check the package documentation.

License

Copyright 2016 Alejandro Sanchez Brotons

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

gota's People

Contributors

adjulbic, arjunmahishi, beaquant, benjmarshall, chrmang, chrmang-jambit, chrstphlbr, danicat, fredericlemoine, gautamdoulani, gmarcais, jfussion, julienrbrt, kellrott, kmrx, kniren, limads, mcolosimo-p4, michaelbironneau, prithvipal, prliu, shixzie, szaydel, typeless, wanglong001


gota's Issues

Bug: Non sequential column names when auto-renaming

When using automatic renaming of column names, the numeric suffix is not assigned sequentially if a repeated column name appears more than twice. This means that this works as expected:

     b := LoadRecords(
         [][]string{
             []string{"A", "B", "B", "C", "D"},
             []string{"1", "4", "4", "5.1", "true"},
             []string{"1", "4", "4", "6.0", "true"},
             []string{"2", "3", "3", "6.0", "false"},
         },
     )
     fmt.Print(b.Names())

> [A B_0 B_1 C D]

But this won't:

     b := LoadRecords(
         [][]string{
             []string{"A", "B", "B", "B", "C", "D"},
             []string{"1", "4", "4", "4", "5.1", "true"},
             []string{"1", "4", "4", "4", "6.0", "true"},
             []string{"2", "3", "3", "3", "6.0", "false"},
         },
     )
     fmt.Print(b.Names())

> [A B_1 B_3 B_5 C D] // Expected [A B_0 B_1 B_2 C D]

Split-Apply-Combine methods for DataFrames

The Split-Apply-Combine data analysis paradigm focuses on separating the data rows into groups, applying a function over each group's rows/columns, and then combining the results into a single table.

Split/Group

The grouping could be done by first splitting the rows by a given factor Series:

func (d DataFrame) Split(factor Series) ([]DataFrame, error) { ... }

It could also be done by storing the grouping factor inside the DataFrame object and then delegating the responsibility of using this grouping to the functions that need it. This is more similar to what dplyr does, and it could facilitate chained operations.

func (d DataFrame) GroupBy(factor Series) DataFrame { ... }
func (d DataFrame) Split() ([]DataFrame, error) { ... } // Uses the stored GroupBy groups

Maybe instead of passing GroupBy a Series, we could rely on the column name and use one of the existing columns instead. This would help a lot when subsetting.

Apply

We want to be able to apply functions to both rows and columns of a DataFrame. The dimensions of the returned Series should be compatible with each other. Additionally, when applying functions over rows, since we can't expect the columns to all be of the same type, we will have to cast the types.

The API should be pretty straightforward:

func (d DataFrame) RApply(f func(Series) Series) DataFrame { ... }
func (d DataFrame) CApply(f func(Series) Series) DataFrame { ... }

With the implementation of Apply operations we will have a powerful aggregation mechanism that doesn't have to depend on data splitting but can work on its own.

Combine

The easiest of the bunch. The main decision is whether we want to try to preserve the original order or just concatenate the results of all DataFrame group operations.

The example doesn't add up.

The function application example in the README:

mean := func(s series.Series) series.Series {
    floats := s.Float()
    sum := 0.0
    for _, f := range floats {
        sum += f
    }
    return series.Floats(sum / float64(len(floats)))
}
df.Cbind(mean)
df.Rbind(mean)

CBind and RBind seem to receive a DataFrame value rather than a function value. (?)

Error when DataFrame has 1000 records

I run a dataframe with 1000 records and it shows the error:
Error: indexer greater than 1000

df = df.Set(
    indexer,
    dataframe.LoadRecords(
        [][]string{
            []string{"Name", "Total"},
            []string{keyword, total},
        },
    ),
)

Hope somebody can help, please.
Thank you

Add Donation Link

Would it be possible to add a donation link? I would like to buy you a beer/coffee for all your hard work.

Usage with binary data ?

This is useful for me. On the roadmap, have you considered binary data (images, etc.) and tensors?
I do a fair bit of ETL and FBP-style work in Go, and so might be able to contribute.

Support bitwise left-shift and right-shift

I was wondering how you feel about series of Integers supporting bitwise operations like shifts? I actually realized that I could benefit from this in my own work, where for example I have a whole bunch of numbers that are in kilobytes, but I really want to operate on bytes instead of KBs.

Thanks!

Which license applies to this repo

It looks like quite an interesting project and you may find some people to collaborate, but without a specified LICENSE it is pretty hard for people to decide to contribute.

The LICENSE can be any; some people may not like to collaborate depending on which one, but by at least specifying one, everybody will be clear on the terms, rather than having to fall back to whatever the default law says.

Thanks for considering

Allowing type specification through a map rather than a variadic string argument would be more flexible

Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would add the possibility to specify types only for the columns you want to and fallback to auto typing for the other columns.

It could possibly also shorten the code in ReadRecords by simply checking if the column name is in the map and if not fallback to findType.

What do you think?
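
This is essentially what the dataframe.WithTypes option shown in the Loading data section provides: a map from column name to series.Type, with auto-detection falling back for the remaining columns. A brief sketch, assuming ReadCSV accepts the same load options:

df := dataframe.ReadCSV(
    strings.NewReader(csvStr),
    dataframe.WithTypes(map[string]series.Type{
        "Age": series.Int,
    }),
)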

Is this project abandoned?

Are there any similar packages, or forks of this package, that would help us work well with dataframes?

Please at least reply if you are planning to drop the package.

New index?

pandas has the ability to set your own data as a new index for a dataframe. gota doesn't seem to have this?

Is it possible to add a method to lazyread a csv?

I'm doing this kind of thing:

	fmt.Println("Reading csv...")	
	csv, err := os.Open(myfile)  //myfile is 200M or so, takes awhile to read
	if err != nil {
		fmt.Print(err)
		os.Exit(1)
	}

	fmt.Println("Make it a df...")			
	df := dataframe.ReadCSV(csv)

	fmt.Println("Sorting, filtering df...")		
	fil := df.Filter(
		dataframe.F{"colA", series.Eq, "VARIABLE"},
	)

Would be very cool if my filtering could start happening as the initial lines are read.
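
Until a lazy reader exists, one workaround is to pre-filter the raw stream with encoding/csv and load only the surviving rows; a rough sketch (standard-library csv, io, log and os imports assumed, and assuming colA is the first column):

f, err := os.Open(myfile)
if err != nil {
    log.Fatal(err)
}
defer f.Close()

r := csv.NewReader(f)
header, err := r.Read() // keep the header row for LoadRecords
if err != nil {
    log.Fatal(err)
}
records := [][]string{header}
for {
    rec, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    if rec[0] == "VARIABLE" { // keep only matching rows
        records = append(records, rec)
    }
}
df := dataframe.LoadRecords(records)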

ReadCSV should take a Reader as input rather than a string

I've started looking at reimplementing some of the functionality of https://github.com/tobgu/qcache in Go (currently python and pandas) and found gota. Looks good, keep it up!

I think it would be better to take a Reader as input to ReadCSV than a string. As it is right now I have to read all bytes from a reader into a byte buffer that then has to be converted to a string. That string is then immediately converted into a Reader in ReadCSV.

Do you agree?

typo in README.md

Cbind and Rbind need to be updated to Capply and Rapply respectively.

The documentation clearly states it is Capply, but the readme might throw off newcomers.

Thanks

Allow the modification of DataFrames/Series values

As it currently stands, when a Series is created, its elements should not be able to change except through subsetting operations. However, we might want to modify elements of a Series once it has been initialised (not necessarily by modifying the Series in situ, but by returning a new Series with the updated values).

An appropriate API should be designed for this purpose, and Series modifications should then be integrated directly into DataFrame operations.

Delete/Drop Function

Have you considered the addition of a delete/remove/drop type function? I've started to use your library and have come across a use case where it would be advantageous to be able to explicitly identify a column to be removed. Currently I would have to get all of the column names, find which index I want to remove, and then generate a series of indexes excluding that one to perform a Select operation.

I am happy to have a first pass at adding this type of function if there is a consensus it would be helpful.
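
In the meantime the selection can be inverted with a small helper; a sketch (dropColumn is a hypothetical name):

// dropColumn returns a copy of df without the named column.
func dropColumn(df dataframe.DataFrame, colname string) dataframe.DataFrame {
    keep := []string{}
    for _, name := range df.Names() {
        if name != colname {
            keep = append(keep, name)
        }
    }
    return df.Select(keep)
}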

colnames get lost after calling Rapply()

Hi alex, thank you for your great work.

I noticed that the column names get lost after calling Rapply during my tests, also the detected types.

test codes:

package main

import (
	"log"

	"github.com/kniren/gota/dataframe"
	"github.com/kniren/gota/series"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			[]string{"A", "B", "C", "D"},
			[]string{"a", "4", "5.1", "true"},
			[]string{"k", "5", "7.0", "true"},
			[]string{"k", "4", "6.0", "true"},
			[]string{"a", "2", "7.1", "false"},
		},
	)

	applied := df.Rapply(func(s series.Series) series.Series {
		return s
	})

	log.Println(df)
	log.Println(applied)
}

output:

2017/11/01 17:38:32 [4x4] DataFrame

    A        B     C        D
 0: a        4     5.100000 true
 1: k        5     7.000000 true
 2: k        4     6.000000 true
 3: a        2     7.100000 false
    <string> <int> <float>  <bool>

2017/11/01 17:38:32 [4x4] DataFrame

    X0       X1       X2       X3
 0: a        4        5.100000 true
 1: k        5        7.000000 true
 2: k        4        6.000000 true
 3: a        2        7.100000 false
    <string> <string> <string> <string>

Comparing equality of two DataFrames

How does one compare two DataFrames? Is the functionality of directly comparing the equality of two DataFrames desirable?

A first check of dimensions, column types and column names could be useful to quickly reject equality. Column order is important, so two DataFrames that contain the same data but with the column order switched should be marked as not equal.

When comparing columns of different types we can compare element by element, row by row, or we could consider hashing the rows and/or columns and comparing the hashes. If the hashes are stored, this approach would allow for faster comparisons when a DataFrame is compared multiple times.

func (d DataFrame) Eq(b DataFrame) bool { ... }
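
Until such a method exists, a naive check over the string records works; a sketch (equal is a hypothetical helper, treating column order as significant and ignoring type differences, which would need a separate Types() comparison):

// equal reports whether two DataFrames have identical records,
// including the header row with the column names.
func equal(a, b dataframe.DataFrame) bool {
    if a.Nrow() != b.Nrow() || a.Ncol() != b.Ncol() {
        return false
    }
    ra, rb := a.Records(), b.Records()
    for i := range ra {
        for j := range ra[i] {
            if ra[i][j] != rb[i][j] {
                return false
            }
        }
    }
    return true
}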

Add support for more types

In addition to the four main types (Strings, Int, Float, Bool), the next candidates for new types are:

  • Date (time.Time)
  • Decimal (big.Int/Rational for currency analyses)
  • Complex (complex64/complex128)

The pros and cons of these additions have to be taken into account, since every new type increases the complexity of the library significantly.

custom sort?

What's the easiest way to do a custom sort of a dataframe? For example, I have a column with string values like 10/04/2014 04:10:10 p.m. and I would like to sort by the date each one represents (ascending).

If this is not easily possible, consider this a feature request. Thanks for a useful package.
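
One workaround with the current API is to derive a sortable key column, sort on it, and drop it afterwards; a sketch, assuming a day-first layout, a column named "date", and the standard strings and time packages (buildSortKey is a hypothetical helper):

// buildSortKey turns "10/04/2014 04:10:10 p.m." into the sortable
// string "2014-04-10 16:10:10"; unparseable values are left as-is.
func buildSortKey(raw string) string {
    clean := strings.NewReplacer("a.m.", "AM", "p.m.", "PM").Replace(raw)
    t, err := time.Parse("02/01/2006 03:04:05 PM", clean)
    if err != nil {
        return raw
    }
    return t.Format("2006-01-02 15:04:05")
}

keys := df.Col("date").Records()
for i, raw := range keys {
    keys[i] = buildSortKey(raw)
}
sorted := df.
    Mutate(series.New(keys, series.String, "sortkey")).
    Arrange(dataframe.Sort("sortkey"))

The helper column can afterwards be removed with Select.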

Add support for GroupBy and Summarize

A fundamental feature of dataframes is grouping by column(s) and summarizing (mean, median, max, min, etc.) other column(s). Are you thinking about implementing this functionality?

Filter dataframe return indexer

I have a data frame like this

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)

I want to filter column B for a value of 5, and I want the return value to be the int 2.

2 being the index of that row in df.
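
With the current API this can be approximated by scanning the column and collecting matching indices; a sketch:

indices := []int{}
col := df.Col("B")
for i := 0; i < col.Len(); i++ {
    if v, err := col.Elem(i).Int(); err == nil && v == 5 {
        indices = append(indices, i)
    }
}
// indices == []int{1} here: Elem indices are 0-based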

ReadAll has a bad failure condition

Was using the project and noticed a weird situation on low-memory machines where, on different runs, data ended up missing from the bottom rows of CSV files.

I'm pretty sure this is the culprit; there's a nice tangential article on why ReadAll-style functions are considered bad.

Looking at the csv.ReadAll function, it will allocate up to max memory and then just drop records on the floor. Due to the interface provided by gota, there's no way to pass a Reader-style interface that would let us work around it.

Any thoughts on fixing it?

Implement a DataFrame Summary method

Sometimes it is really useful to quickly get a summary of the data contained in a DataFrame, with dimensions, counts or quartile information depending on the type of the column. In R this can be done with something like summary(df). Perhaps we should try to mimic this functionality and expand upon it for quick data summarization.

Improve DataFrame Stringer interface behaviour for large tables

Consider modifying DataFrame.String() by limiting the column length to a number of characters. Likewise, it could be interesting to wrap columns in separate lines if the combined length is too large.

Additionally one might want to summarise this information if the number of rows is very high. We could use something like what dplyr or data.table are doing, showing only the first and last 10/20 rows instead of the whole table.

An alternative could be to leave DataFrame.String() as is and move these suggested modifications to a separate function. That way, if we want to print the entire table instead of a summary of it we will still be able to do it.

Allow encoding of certain columns to reduce memory usage

Memory consumption is an issue when dealing with large data sets.

Similar in-memory columnar stores, like Python's pandas and Microsoft's proprietary VertiPaq engine for its SSAS products, have the ability to minimize memory usage by using techniques such as:

  1. Value encoding - for numbers, VertiPaq will calculate a number it can subtract from every row to lower the number of bits required.

  2. Categorical encoding (dictionary encoding) - for strings, pandas and VertiPaq will create a lookup table and use integers to represent the data, thereby reducing the number of bits.

More info: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3

This is a feature request for similar functionality.
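
As an illustration of the dictionary-encoding idea, independent of gota's internals; a sketch:

// dictEncode replaces each distinct string with a small integer code,
// returning the codes plus the lookup table needed to decode them.
func dictEncode(values []string) (codes []int, dict []string) {
    index := map[string]int{}
    codes = make([]int, len(values))
    for i, v := range values {
        code, ok := index[v]
        if !ok {
            code = len(dict)
            index[v] = code
            dict = append(dict, v)
        }
        codes[i] = code
    }
    return codes, dict
}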

How to get SQL database data into gota

Can you please describe or detail a method to get the data collected from SQL into gota?

Something like

package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq"
)

const (
	host     = "localhost"
	port     = 5432
	user     = "postgres"
	password = "Gurgaon@65"
	dbname   = "vikram"
)

func main() {
	psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
		"password=%s dbname=%s sslmode=disable",
		host, port, user, password, dbname)
	db, err := sql.Open("postgres", psqlInfo)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	err = db.Ping()
	if err != nil {
		panic(err)
	}
	row, _ := db.Query("SELECT * from salesdata limit 10")

	println(row)
}
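
One way to bridge database/sql and gota today is to flatten the rows into [][]string and pass them to dataframe.LoadRecords; a rough sketch (rowsToDataFrame is a hypothetical helper):

// rowsToDataFrame reads every row, rendering all values as strings
// and letting gota's type detection take over from there.
func rowsToDataFrame(rows *sql.Rows) (dataframe.DataFrame, error) {
    cols, err := rows.Columns()
    if err != nil {
        return dataframe.DataFrame{}, err
    }
    records := [][]string{cols}
    raw := make([]sql.RawBytes, len(cols))
    ptrs := make([]interface{}, len(cols))
    for i := range raw {
        ptrs[i] = &raw[i]
    }
    for rows.Next() {
        if err := rows.Scan(ptrs...); err != nil {
            return dataframe.DataFrame{}, err
        }
        rec := make([]string, len(cols))
        for i, b := range raw {
            rec[i] = string(b)
        }
        records = append(records, rec)
    }
    return dataframe.LoadRecords(records), rows.Err()
}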

add WriteOption to write CSV with all values quoted

Hi,

I am reading from csv files where all the values are double quoted, even if they do not contain a comma (or whatever the delimiter is).

I read these into a DataFrame and then I do some transformation on it and write it back out to a CSV file. The resulting CSV file's values are not double-quoted unless they contain a comma.

I'd like to be able to pass a WriteOption to WriteCSV() that would force the quoting of all values written, even if they do not contain a delimiter, just to have consistency between my input and output files.

If this request sounds weird, I will explain my use case. My input files are medical study data that contain personally identifiable information such as name and birth date. My code basically takes this information and changes it to random strings of characters that resemble the original but are no longer identifiable as a specific person. I take the resulting csv files and use them as test fixture data to test another code base. This fixture data can be checked into a public GitHub repository because it no longer contains identifying information. I would like the files to be identical in all respects to the original files (except for the identifying information) so that I can have confidence that my passing tests mean the code will also work with real data. That's why I want the csv files to have all fields double-quoted even if it does not seem necessary or is not called for by the CSV spec.

Does that make sense?

Thanks for a nice package.

Is using sort.Sort safe when specifying more than one order column?

The Go stdlib documentation states that sort.Sort does not guarantee stability of the sorted results (sort.Stable does). Isn't stability a requirement for correctness when sorting the dataframe by the contents of multiple columns, the way dataframe.Arrange is currently implemented?

Profiling, memory usage and performance review

So far there has not been a study of the performance of this library in terms of speed and memory consumption. I'm prioritising those features that impact users directly, since the API design is still in flux, but this should be addressed in the near future.

Records method but for float64

Hello,

First of all I would like to show my appreciation for this library, it does a lot of redundant heavy-lifting.

For a machine learning project I'm using gota to load a CSV file and feed the data into an algorithm. The thing is, I need to cast a DataFrame to a [][]float64 slice of slices. I noticed there is a DataFrame.Records method to cast the DataFrame to a slice of slices of strings. Would it be possible to do the same thing for float64? I think this would be really practical because it is a common use case for machine learning applications.

Regards.
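
As a workaround, each column can be converted with the Series Float() method used in the Function application section; a sketch producing a column-major [][]float64 (transpose it if your algorithm expects rows), with toFloats as a hypothetical helper:

// toFloats converts every column to []float64; non-numeric
// cells become NaN under Float()'s forced conversion.
func toFloats(df dataframe.DataFrame) [][]float64 {
    out := make([][]float64, 0, df.Ncol())
    for _, name := range df.Names() {
        out = append(out, df.Col(name).Float())
    }
    return out
}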

Question about Type Accessor/conversion method

So while this is intentional, I wanted to make sure I understand the reasoning before deciding how to work with this. Whereas Float(), for example, does not have two return parameters, Int() and Bool() do. What was the thinking there? Is the idea that some of these conversions are more fallible than others?

// Accessor/conversion methods
Copy() Element     // FIXME: Returning interface is a recipe for pain
Val() ElementValue // FIXME: Returning interface is a recipe for pain
String() string
Int() (int, error)
Float() float64
Bool() (bool, error)

Thanks a lot, sorry if I am being dense...

Golang + Dataframes + Arrow

I stumbled upon your repo while searching around to see if anybody is using the Go bindings for the Apache Arrow library.

I read an article recently from Wes McKinney about his involvement with Arrow and how he's stoked to provide a more flexible pandas API that would support parallelism out of the box with shared memory, rather than the default pickled approach.

I'd love to see this implemented in Go. I imagine that using the Plasma store API it would be pretty easy.

How do I add a record to an existing dataframe?

Hello Kniren!
How do I add a record to an existing dataframe?

df := dataframe.LoadRecords(
		[][]string{
			[]string{"A", "B", "C", "D"},
			[]string{"a", "4", "5.1", "true"},
			[]string{"k", "5", "7.0", "true"},
			[]string{"k", "4", "6.0", "true"},
			[]string{"a", "2", "7.1", "false"},
		},
	)

I want to add one record at the end of df.
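
Assuming the DataFrame RBind method (row-wise binding of two DataFrames sharing column names), appending a record can be sketched as:

newRow := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"b", "7", "8.2", "true"},
    },
)
df = df.RBind(newRow)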

Changing types of an existing DataFrame

We might want to change the type of a column or columns of a DataFrame. To do so, we could enable two methods: one for parsing a given column to the desired type and another for changing all of them at the same time.

func (d DataFrame) ChangeType(colname string, newtype SeriesType) DataFrame { ... }
func (d DataFrame) ChangeTypes(newtype []SeriesType) DataFrame { ... }

Incorrect output when sorting on multiple columns using DataFrame.Arrange().

I'm currently having an issue when attempting to sort by multiple columns.

Given the following code (I'll explain the commented lines in a moment.):

package main

import (
	"fmt"

	"github.com/kniren/gota/dataframe"
)

func main() {
	df := dataframe.LoadRecords(
		[][]string{
			{"A", "B"},
			{"0.346", "662"},
			{"0.331", "725"},
			// { "0.33", "561"},
			// { "0.322", "593"},
			// { "0.322", "543"},
			// { "0.32", "707"},
			// { "0.32", "568"},
			// { "0.318", "671"},
			// {"0.318", "645"},
			// { "0.314", "540"},
			// { "0.312", "679"},
			{"0.31", "682"},
			{"0.309", "680"},
			{"0.308", "695"},
			{"0.307", "514"},
			{"0.306", "530"},
			// { "0.306", "507"},
			// { "0.305", "597"},
			{"0.304", "675"},
			{"0.304", "718"},
			// { "0.303", "576"},
			// { "0.303", "515"},
			// { "0.301", "605"},
			// { "0.3", "645"},
			// { "0.3", "566"},
			{"0.299", "564"},
			{"0.297", "665"},
			{"0.297", "689"},
			{"0.297", "507"},
			{"0.295", "665"},
			// { "0.295", "613"},
			{"0.294", "577"},
			{"0.293", "577"},
			{"0.293", "586"},
			{"0.293", "675"},
			{"0.29", "589"},
			{"0.288", "568"},
			{"0.288", "630"},
			{"0.288", "645"},
			{"0.288", "573"},
		},
	)

	fmt.Println(df.Arrange(dataframe.Sort("A"), dataframe.Sort("B")))
}

I get a correct output of:

[23x2] DataFrame

    A        B
 0: 0.288000 568
 1: 0.288000 573
 2: 0.288000 630
 3: 0.288000 645
 4: 0.290000 589
 5: 0.293000 577
 6: 0.293000 675
 7: 0.293000 586
 8: 0.294000 577
 9: 0.295000 665
    ...      ...
    <float>  <int>

Now comes the reason for the commented out lines.

If I uncomment any of the commented lines, I get the following output.

[24x2] DataFrame

    A        B
 0: 0.288000 645
 1: 0.288000 568
 2: 0.288000 573
 3: 0.288000 630
 4: 0.290000 589
 5: 0.293000 577
 6: 0.293000 675
 7: 0.293000 586
 8: 0.294000 577
 9: 0.295000 665
    ...      ...
    <float>  <int>

The order is no longer correct. Please note the "B" column.

Since I don't yet know what combination of values is causing the incorrect sorting, I've left them all commented out in the data. This is in the hopes of someone seeing something in the values that might trigger this incorrect behavior.

Any thoughts on what might be happening?

Improve error handling and reporting

Error handling could be improved by using errors.Wrap and errors.Unwrap for more descriptive error messages. Also, error handling for Series should be managed the same way it is done for DataFrames: by reading from the Series.Err() method to retrieve the error message.

Ideally if we have a pipe of DataFrame operations we want to be able to track at which point it failed. Maybe to do so we have to store some piping information inside the DataFrame structure to know what the pipe operation looks like.

Replace comparator strings with an enum

When comparing Series or filtering with dataframe.F, the comparator should be moved to a string/int enum for maintainability, clarity and better type safety.

type Comparators string
const (
    Eq Comparators = "eq"
    In Comparators = "in"
    ...
)

Set value by filter dataframe

I have a dataframe

df := dataframe.LoadRecords(
    [][]string{
        []string{"Name", "Total"},
        []string{"ABC", "4"},
        []string{"XYX", "5"},
        []string{"MNK", "4"},
        []string{"OPP", "2"},
    },
)

I filter the Name column for the OPP value, and I want to change the Total column from 2 to 3.
Hoping for some help.
Thank you
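
With the existing API this can be approximated by locating the row index first and then using Set, as in the Updating values section; a sketch:

// find the 0-based row indices where Name == "OPP"
idx := []int{}
for i, name := range df.Col("Name").Records() {
    if name == "OPP" {
        idx = append(idx, i)
    }
}
df2 := df.Set(
    idx,
    dataframe.LoadRecords(
        [][]string{
            []string{"Name", "Total"},
            []string{"OPP", "3"},
        },
    ),
)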

UNION two dataframes

Hi - I am trying to do a UNION ALL statement on 2 dataframes and am wondering if this is possible. I have loaded up 2 dataframes successfully but want a way to merge them together.

Thanks

--
Update - never mind, I just saw OuterJoin :-)

Implement an Arrange method for DataFrames

One should be able to sort the DataFrame by one or several of its columns.

func (d DataFrame) Arrange(keys string...) DataFrame { ... }

A possible implementation could start by enabling each Series to return an []int array containing its sorted order. For example:

a := Strings("b", "c", "a")
var b []int = a.Order() // b == []int{2,3,1}

In case we have NA elements we should decide what to do with them. Maybe they all have the same order index and appear at the end?

a := Strings("b", nil, "c", nil, "a")
var b []int = a.Order() // b == []int{2,4,3,4,1}

In any case, once we have an []int array for each key column we could calculate the new row order array and use it to sort the DataFrame.
