juliadata / dataframes.jl
In-memory tabular data in Julia
Home Page: https://dataframes.juliadata.org/stable/
License: Other
This fails:
df2 = DataFrame(quote
a = shuffle(reverse([1:5]))
b2 = ["A","B","C"][randi(3,5)]
v2 = randn(3) # test unequal lengths in the constructor
end)
julia> df2[1:2,:] = df2[4:5,:]
no method convert(Type{DataFrame},Int64)
in method_missing at base.jl:70
in DataVec at /home/tshort/julia/JuliaData/src/datavec.jl:161
in assign at /home/tshort/julia/JuliaData/src/dataframe.jl:455
in assign at /home/tshort/julia/JuliaData/src/dataframe.jl:466
We should decide if this is a bug or a feature. The main question is how much we require column groupings to propagate. To me this is low priority, but since it touches basics, it may need some thought now.
The existing nafilter/naFilter and similar methods, and their flags in the DataVecs, seem limited when it comes to working with DataFrames. I like the way that different columns can have different behaviors (replace/filter modes), but it's not clear how to combine them. For example, if building a model matrix for an OLS model, do you do a complete_cases operation? The naFilter iterator generator doesn't really work usefully in that context.
One option would be to have a filter_nas() method that generates a SubDataFrame without any rows that contained an NA in a column with filtering mode set. The result could then be iterated over row-wise, with NAs being replaced in any columns in replace mode. Other options and variations are certainly possible.
See also #4.
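One possible shape for the proposed filter_nas() semantics can be sketched in Python/NumPy terms. This is purely illustrative: the per-column "filter"/"replace" modes and the function name are hypothetical, NaN stands in for NA, and nothing here is DataFrames.jl API.

```python
import numpy as np

def filter_nas(cols, modes, replacements):
    """Hypothetical sketch: drop rows with an NA in any 'filter'-mode
    column, then fill NAs in 'replace'-mode columns."""
    n = len(next(iter(cols.values())))
    keep = np.ones(n, dtype=bool)
    for name, mode in modes.items():
        if mode == "filter":          # rows with NA here are dropped
            keep &= ~np.isnan(cols[name])
    out = {name: v[keep].copy() for name, v in cols.items()}
    for name, mode in modes.items():
        if mode == "replace":         # NAs here are replaced in place
            v = out[name]
            v[np.isnan(v)] = replacements[name]
    return out

cols = {"x": np.array([1.0, np.nan, 3.0]),
        "y": np.array([np.nan, 2.0, 3.0])}
res = filter_nas(cols, {"x": "filter", "y": "replace"}, {"y": 0.0})
# the row where x is NA is dropped; the remaining NA in y becomes 0.0
```

The result could then be iterated row-wise with no NAs remaining in filtered or replaced columns, as described above.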
Currently only a very simple inner join is implemented.
Using the latest I get the following. Am I not importing something?
tests("test/data.jl");tests("test/dataframe.jl"); tests("test/formula.jl")
................................................
In Data types and NAs / DataVec to something else
sshow(dvint)=="[1,2,NA,4]" FAILED
Nothing with args:
1: Nothing
2: Nothing
3: Nothing
Exception: sshow not defined
NaN seconds
Function that converts a DF to a single Float64 matrix, with other numerical types promoted and dummy variables created for strings and other types.
for prettier output and alignment
Currently we have refs::Vector{UInt16}, which is intentionally short to save space, but there's no overflow check if you add more than 2^16 distinct items. Presumably there should be a mechanism to detect this and, at a minimum, throw an error, and ideally convert itself to a DataVec.
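The detect-and-widen behavior described above can be sketched in Python/NumPy. The helper name and the widening table are hypothetical, and uint8 appears only to make the overflow easy to demonstrate; the actual proposal concerns UInt16 refs.

```python
import numpy as np

# Hypothetical sketch of ref-type widening for a pooled column:
# codes start narrow and are promoted once the pool would overflow.
_WIDER = {np.dtype(np.uint8): np.uint16, np.dtype(np.uint16): np.uint32}

def encode_pooled(values, dtype=np.uint16):
    pool = {}
    for v in values:
        if v not in pool:
            pool[v] = len(pool)
    # widen until every code fits in the ref type
    while len(pool) - 1 > np.iinfo(dtype).max:
        dtype = _WIDER[np.dtype(dtype)]
    refs = np.array([pool[v] for v in values], dtype=dtype)
    return refs, pool

refs, pool = encode_pooled(list(range(300)), dtype=np.uint8)
# 300 levels cannot fit in uint8, so refs come back as uint16
```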
Functionality to take a DF in wide form and make it long, and vice-versa
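The wide-to-long direction can be sketched with plain Python containers. The id/variable/value column names are hypothetical choices, not a proposed API.

```python
# Minimal sketch of wide-to-long reshaping ("melt") over a dict of
# columns: each (row, value-column) pair becomes one long-form row.
def melt(df, id_col, value_cols):
    long = {"id": [], "variable": [], "value": []}
    for i, key in enumerate(df[id_col]):
        for col in value_cols:
            long["id"].append(key)
            long["variable"].append(col)
            long["value"].append(df[col][i])
    return long

wide = {"name": ["a", "b"], "x": [1, 2], "y": [3, 4]}
long = melt(wide, "name", ["x", "y"])
# long["value"] == [1, 3, 2, 4]
```

The inverse ("cast"/unstack) would group long-form rows back by id and pivot the variable column into column names.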
Possible construction modes include pushing data out from a central process, or having individual nodes load chunks from a CSV file or another source. Rows could be split by row-number or by value, depending on the application. The DDF would be resident in RAM, and read/write operations could be performed, as well as distributed statistical operations.
Built-in serialization may possibly work?
Currently it's just a single column of summary tables.
Per #18, allow the user to define groups of columns by name (perhaps hierarchically). E.g., assume the columns are y1, y2, x1, x2, x3, x4. Then allow the user to define responses = ["y1", "y2"], odd_predictors = ["x1", "x3"], and predictors = ["odd_predictors", "x2", "x4"]. You could then reference columns à la df["odd_predictors"] and write formulae à la :(responses ~ predictors), with automatic expansion in intuitive ways.
This would be part of a DataFrame, presumably, and the methods would need discussion.
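The hierarchical expansion could work roughly as follows, sketched in Python; the function name and group table are hypothetical.

```python
# Sketch of hierarchical column-group expansion: a name that appears
# in the group table expands recursively; anything else is a column.
def expand_group(name, groups):
    if name not in groups:
        return [name]              # a plain column name
    cols = []
    for member in groups[name]:
        cols.extend(expand_group(member, groups))
    return cols

groups = {
    "responses": ["y1", "y2"],
    "odd_predictors": ["x1", "x3"],
    "predictors": ["odd_predictors", "x2", "x4"],
}
expand_group("predictors", groups)  # -> ["x1", "x3", "x2", "x4"]
```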
Should we leave room for metadata on structures? Frank Harrell's Hmisc package allows units and labels to be attached to data.frame columns.
People may want to attach other metadata like experimenter name or a DataFrame comment.
We could add a meta Dict to the DataFrame, the colindex, and/or at the DataVec level.
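At the column level, the idea amounts to carrying a free-form Dict alongside the data. A minimal Python sketch, with hypothetical field names in the spirit of Hmisc's units and labels:

```python
# Sketch: a column that carries an open-ended metadata dict
# (units, label, comment, experimenter name, ...).
class MetaColumn:
    def __init__(self, data, **meta):
        self.data = data
        self.meta = dict(meta)

col = MetaColumn([1.2, 3.4], units="kg", label="body mass")
col.meta["units"]  # -> "kg"
```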
The current 3 fields are duplicated in DV and PDV and should probably be replaced with a better-constrained type.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition .==(AbstractDataVec{T},T) is ambiguous with .==(Any,AbstractArray{T,N}).
Make sure .==(AbstractDataVec{AbstractArray{T,N}},AbstractArray{T,N}) is defined first.
...
I think these can mostly be made to disappear by clever re-ordering, but it might need a few new methods too.
The motivation: in some instances PooledDataVecs might have many levels (i.e., more than 65,000). The plan is to automatically convert the refs to Uint32 if we are about to overflow. One option is to make PooledDataVec a parametric type depending both on the element type of the vector and on the type of the reference, e.g. PooledDataVec{U, T}.
See #55 for a discussion.
This will make a lot of functions work, including most functions in statistics.jl (mean, median, etc.).
It will take some work to go through and cut down on warnings and tweak things. Also, the default functions won't pay attention to the nafilter or nareplace indicators, but methods supporting those can be added as we go along.
y ~ 1
y ~ x + 0
(probably not y ~ x - 1 -- seems superfluous)
There's an argument that DataVecs should not be unitary vectors, but instead should be blocked arrays. This would allow relatively common data-cleaning operations such as "delete 17 random rows from this billion-row DataFrame" to be very fast instead of very slow. Basically, columns would be stored in chunks of, say, 4KB (often the size of disk blocks), and single row deletions would require only a shuffling of a single block. Blocks would merge when two adjacent blocks were less than half full. A block structure might also be convenient for indexing, and likely for memory-mapping too. On the other hand, other operations would be more complex.
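The block idea can be sketched in Python. The block size and the merge policy here are illustrative, not the proposal's numbers, and the class is hypothetical.

```python
# Sketch of block-chunked column storage: rows live in small blocks,
# so deleting one row only shuffles the block that contains it.
class BlockedColumn:
    def __init__(self, data, block=4):
        self.block = block
        self.blocks = [list(data[i:i + block])
                       for i in range(0, len(data), block)]

    def delete(self, idx):
        for b in self.blocks:
            if idx < len(b):
                del b[idx]               # touches a single block
                if not b:
                    self.blocks.remove(b)
                return
            idx -= len(b)

    def to_list(self):
        return [x for b in self.blocks for x in b]

col = BlockedColumn(range(10), block=4)
col.delete(5)
col.to_list()  # -> [0, 1, 2, 3, 4, 6, 7, 8, 9]
```

With billion-row columns, each deletion costs O(block size + number of blocks) rather than O(rows), which is the speedup the argument above is after.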
Currently, start/next/done iterates over columns of an AbstractDataFrame. It seems to me that they should instead iterate over rows, as do these functions for DataStreams. The next() return value should presumably be a 1-row SubDataFrame.
Are there any current functions that depend on the current behavior?
@tshort, GitHub says I should blame you for this. :)
As a complement to the feature demo...
This request has been shot down before when we were working on Harlan's branch. I'm repeating it here "for the record".
I think we should add NA support to floating-point vectors using NaNs. It only takes the ten lines of code below, which will let us use regular vectors, and immediately most floating-point columns will "just work" (arithmetic, mean, quantile, ...). That will save us a lot of work trying to support every floating-point function folks create for DataVecs. I don't think it's a problem to have multiple NA types as long as it appears pretty consistent from the user's point of view.
NA_Float64 = NaN
NA_Float32 = NaN32
na(::Type{Float64}) = NA_Float64
na(::Type{Float32}) = NA_Float32
convert{T <: Float}(::Type{T}, x::NAtype) = na(T)
promote_rule{T <: Float}(::Type{T}, ::Type{NAtype} ) = T
isna{T <: Float}(x::T) = isnan(x)
isna{T <: Float}(x::AbstractVector{T}) = x .!= x
nafilter{T <: Float}(v::AbstractVector{T}) = v[!isna(v)]
nareplace{T <: Float}(v::AbstractVector{T}, r::T) = [isna(v)[i] ? r : v[i] for i = 1:length(v)]
What I'm proposing is to add NA support for Vector{Float} by using NaNs as NAs for use as DataFrame columns. This is along the lines of what R and pandas do, and it's one of the options NumPy is moving toward (the other is a masking approach like DataVec).
Arrays are Julia's fundamental type, so if we can support the use of Array{Float, 1}, our DataFrame columns will be more robust.
The #1 reason (by far) for using Vectors with NaNs as NAs is the reduced support and development burden for the DataFrame team. With just these 10 lines of code, Vectors have pretty good NA support while still supporting all Julia functions that operate on Arrays. Here are some examples:
julia> srand(1)
julia> v = randn(10)
10-element Float64 Array:
0.00422471
0.0636925
1.41376
-1.09879
0.503439
1.75336
-0.202676
-0.458741
0.526426
1.60172
julia> dv = DataVec(v)
[0.004224711539662927,0.0636925153577793,1.413764097398493,-1.0987858271983026,0.5034390675674981,1.753360709001194,-0.20267565554863,-0.4587414680524865,0.5264259022023546,1.601723433618217]
julia> quartile(v)
3-element Float64 Array:
-0.150951
0.283566
1.19193
julia> quartile(dv) # This is what I worry will be too common.
no method sort(DataVec{Float64},)
in method_missing at base.jl:60
in quantile at statistics.jl:356
in quartile at statistics.jl:369
julia> mean(v)
0.41064274858857797
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia>
julia> v[[1,5]] = NA
NA
julia> dv[[1,5]] = NA
NA
julia> mean(v)
NaN
julia> mean(dv)
no method mean(DataVec{Float64},)
in method_missing at base.jl:60
julia> mean(nafilter(v))
0.44984546334732733
julia> mean(nafilter(dv))
0.44984546334732733
julia> v + dv
no method +(Array{Float64,1},DataVec{Float64})
in method_missing at base.jl:60
julia> v * 2
10-element Float64 Array:
NaN
0.127385
2.82753
-2.19757
NaN
3.50672
-0.405351
-0.917483
1.05285
3.20345
julia> dv * 2
no method *(DataVec{Float64},Int64)
in method_missing at base.jl:60
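The behavior argued for in the transcript above mirrors what plain NumPy arrays already do. A rough Python/NumPy analogue of the ten-line proposal (illustrative only; the names echo the Julia sketch but none of this is DataFrames.jl code):

```python
import numpy as np

# NaN stands in for NA, so ordinary array operations keep working
# and only explicit filtering/replacing needs helpers.
isna = np.isnan
nafilter = lambda v: v[~np.isnan(v)]
nareplace = lambda v, r: np.where(np.isnan(v), r, v)

v = np.array([1.0, np.nan, 3.0])
v * 2                        # NaN propagates; no "no method" error
float(np.mean(nafilter(v)))  # -> 2.0
```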
With DataVecs, the most common FAQ will likely be: why doesn't function xyz() work on DataFrame columns? The answer will be: wrap the column in nafilter() or write a method for xyz() to support DataVecs.
Especially since DataVecs cannot inherit from AbstractArrays, there is a lot of work to be done to support even the functions in base. Once packages proliferate, it'll be even worse. Most of these will be fairly simple wrappers, but it'll still be work to do initially and to maintain going forward.
I also worry that forcing DataVecs will create more of a division between users coming from R and Matlab backgrounds. If DataFrames support Vectors, then it'll be easier for the Matlab folks to use and for R folks to use functions from the Matlab folks that work on Arrays.
In the groupby testing I did, the bare array columns were faster.
By supporting AbstractVectors, new Julia data types that are AbstractVectors can automatically be used in DataFrames.
Here is an example of using mmap_arrays as columns of a DataFrame:
julia> s = open("bigdataframe.bin","w+")
IOStream(<file bigdataframe.bin>)
julia> N = 1000000000
1000000000
julia> v1 = mmap_array(Float64,(N,),s)
julia> v2 = mmap_array(Float64,(N,),s, numel(v1)*sizeof(eltype(v1)))
julia> d = DataFrame({v1, v2})
DataFrame (1000000000,2)
x1 x2
[1,] 0.0 0.0
[2,] 0.0 0.0
[3,] 0.0 0.0
:
[999999998,] 0.0 0.0
[999999999,] 0.0 0.0
[1000000000,] 0.0 0.0
julia> d["x1"][[2,5,N]] = NA
NA
julia> d["x2"][[1,3,6,9]] = pi
3.141592653589793
julia> head(d)
DataFrame (6,2)
x1 x2
[1,] 0.0 3.14159
[2,] NaN 0.0
[3,] 0.0 3.14159
[4,] 0.0 0.0
[5,] NaN 0.0
[6,] 0.0 3.14159
julia> sum(d[11:15, "x1"])
0.0
julia> sum(d[1:5, "x1"])
NaN
The biggest downside to using vectors is that they won't have built-in filtering/replacing. The use of filter and replace Bools in DataVecs is a cool idea. It's much better than R's options()$na.action -- I never use that because it makes code less portable. By attaching the na.action flag to the data, code is more portable while allowing the user to avoid wrapping things with nafilter.
I think the advantages of using Vectors outweigh losing built-in filter/replace for most applications of floating-point columns.
As it is now, NAs become NaNs, and they are displayed that way. It is probably possible to modify the show() functions to overcome this, and to distinguish NaNs from NAs by using a different bit pattern for each. With support from Julia core, output could be modified for all arrays.
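The bit-pattern idea works because IEEE 754 doubles have many distinct NaN encodings, so one can be reserved for NA. A Python sketch; the payload chosen here is arbitrary and hypothetical, not the pattern R or Julia actually uses.

```python
import struct

# Reserve one quiet-NaN bit pattern as NA; any other NaN stays an
# ordinary NaN. Round-tripping through struct exposes the raw bits.
NA_BITS = 0x7FF8000000000001          # hypothetical reserved payload
NA = struct.unpack("<d", struct.pack("<Q", NA_BITS))[0]

def isna(x):
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits == NA_BITS
```

Arithmetic treats NA like any NaN, so only display code and isna() need to inspect the payload.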
New features like indexing cannot be added automatically to all column types. For many features, this could be handled by API's rather than by type inheritance. For features where this is not possible, DataVecs can be used.
This approach cannot be used for these types because they don't have a NaN. R's approach of picking a bit pattern to serve as NA for each type would not work unless it was integrated into Julia's core (not likely, and probably not a good idea). In any case, DataVecs or PooledDataVecs are more appropriate here. Also, the universe of functions that needs to work on these types is smaller than the set of functions operating on Vector{Float}.
I'm not proposing that we ditch DataVecs. I'm proposing that we have an alternative. One may work better for the user in some situations than the other.
More work is needed to sort all of this out, including deciding on the default for column assignment and for reading from CSVs. Here's what I would choose:
We probably need an AsIs type for overriding defaults.
Another area is promotion. What type should x::DataVec{Float64} + v::Vector{Float64} produce?
As far as DataVec{Float} support, I think we should continue to support it, but the list of functions we support at the start could be a lot smaller. If it gains wide use, folks will add methods to support it.
Currently merge(a, b, bycol, jointype) always calls join_idx to create the left and right indexes needed for joining two DataFrames. It would be nice if this could take advantage of columns that are Index types.
It would be nice to add support for the following forms.
dvfloat = DataVec[1., NA, 2, 3]
dvfloat2 = DataVec[1., 3, 5, NA]
dvint = DataVec[NA, 4, 5, 6]
dvint2 = DataVec[1, 4, 5, 6]
vfloat = [4., 2, 1, 2]
vint = [4, 2, 1, 2]
1.0 .< dvfloat
0.0 .< dvfloat .< 1.0
dvfloat .< 1
dvint .< 1.0
dvfloat .< dvfloat2
dvfloat .< dvint
dvfloat .< vfloat
dvfloat .< vint
dvint .< vfloat
dvint .< vint
dvfloat .< NA
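The intended NA-propagating semantics of these comparisons can be sketched in Python/NumPy, with NaN standing in for NA and None marking an NA result. The helper is hypothetical and purely illustrative.

```python
import numpy as np

# Sketch of elementwise .< with NA propagation: comparing anything
# against NA yields NA (rendered here as None).
def dv_less(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na = np.isnan(a) | np.isnan(b)
    out = a < b
    return [None if m else bool(x) for m, x in zip(na, out)]

dv_less([1.0, np.nan, 2.0], [4.0, 2.0, 1.0])  # -> [True, None, False]
```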
There should be a way to enforce a fixed set of pool items in a DV, and to optionally flag the ordering as important. It may also be useful to have meta-data for contrast construction -- or maybe this isn't the appropriate place for it (cf. R).
Note that we've defined oftype(ASCIIString) to allow zero(ASCIIString) to work. But any type without a zero will fail with a weird error during DataVec construction.
something like:
rename!(df, quote
x1 = "cat"
x2 = "dog"
end)
rename!(df, {"x1"="cat", "x2"="dog"})
rename!(df, {1="cat", 2="dog"})
rename!(df, ["cat", "dog"]) # must be same length as ncol(df)
rename!(df, function) # allow functional generation of new colname from old one
Note that column name groups should be updated appropriately.
I've taken a first shot at implementing bitstypes with missing value checking for integers, booleans, and floating point numbers. Implementing missing values as bitstypes has a number of advantages:
This is an expansion on issue #22 (NA support for Vector{Float}).
Bitstype NA arrays and DataVecs and PooledDataVecs seem to inter-operate well. Having both a masking approach (DataVecs) and a bit-pattern approach will give users more options to best handle missing data based on their problem. Here is a specification document showing how I think everything fits together:
https://github.com/tshort/JuliaData/blob/bitstypeNA/spec/MissingValues.md
Here is the code:
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/boolNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/intNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/src/floatNA.jl
Some tests are also available:
https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA.jl
https://github.com/tshort/JuliaData/blob/bitstypeNA/test/bitstypeNA2.jl
Using bitstypes is a good showcase for Julia's capability. It was surprisingly easy to create these new types. I started with integers and Jeff's suggestion [1], and from there, it was mostly a matter of adapting int.jl. I think I had something working in 1.5 hours one evening earlier this week. Bools and floats each took about the same amount of time.
[1] https://groups.google.com/d/msg/julia-dev/n3ntT4M0gwo/xpPuTgwSpb0J
I didn't submit this as a pull request. It's in a branch under my fork of JuliaData. We should probably have discussions on what and how this should be incorporated. Some options are:
Currently, things seem to work pretty well. Vectors work well in DataFrames. Indexing with IntNA's and BoolNA's with and without NA's seems to work. Conversion to DataVecs and inter-operability with DataVecs seems good. It's still not well tested, so I'm sure there are many bugs and missing features, especially in conversion. I have not run into any big gotchas that would require language changes or other action by Julia core developers.
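The core of the bitstype approach for integers -- reserving one bit pattern as NA, as R does with INT_MIN for its integer type -- can be sketched in Python/NumPy. The names are hypothetical and this is only an analogue, not the branch's implementation.

```python
import numpy as np

# Sentinel-based NA for Int64: the minimum value is reserved as NA,
# so ordinary integer storage and arithmetic need no separate mask.
INT_NA = np.iinfo(np.int64).min

def isna_int(v):
    return v == INT_NA

def nafilter_int(v):
    return v[v != INT_NA]

v = np.array([1, INT_NA, 3], dtype=np.int64)
int(nafilter_int(v).sum())  # -> 4
```

The trade-off, as noted above, is that arithmetic must be wrapped to keep the sentinel from behaving like an ordinary value, which is what the IntNA bitstype methods handle.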
cbind(df1, df2) currently makes copies of both df1 and df2 (expensive)
cbind!(df1, df2) currently makes no copies and modifies df1
We need something that doesn't copy columns but doesn't modify df1 or df2. I'd rather have:
cbind(df1, df2) be equivalent to DataFrame([df1.columns, df2.columns], newcolnames)
That doesn't change df1 or df2, but they share columns. If we have that, I don't see too much need for cbind!.
Creating a DataFrame from existing columns isn't expensive.
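The non-copying cbind described above amounts to building a new frame that references the same column objects. A minimal Python sketch over dicts of columns (hypothetical structure, not the DataFrames.jl implementation):

```python
# Sketch of a non-copying cbind: the result shares column objects
# with its inputs, so no data is duplicated.
def cbind(df1, df2):
    out = dict(df1)
    out.update(df2)
    return out

a = {"x": [1, 2]}
b = {"y": [3, 4]}
c = cbind(a, b)
c["x"].append(5)   # mutation is visible through a["x"]: shared column
```

This shows both the benefit (cheap construction) and the caveat (mutating a shared column is visible through every frame that holds it), which is why cbind! adds little once cbind itself avoids copies.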
Currently csvDataFrame uses the core csvread to read into a matrix of Anys and then parses each column. Preferred features include:
Something like index!(df, "colname") would create an index for the column and store it inside the DataFrame object.
Currently this just converts from a DataVec, which is silly.
to match the current Julia standard library behavior
There are several options here, from indexing CSV rows to allow random access to giant CSV files, to creating a new binary file format for storing DataVecs that supports mmap-style direct memory mapping.
Did I miss any obvious requirements?
Tom says: Make ModelFrame hold an AbstractDataFrame instead of a DataFrame, and modify expand and others to support this. I looked through the code, and I didn't see anything that would prevent this. It'd be nice to be able to form a DesignMatrix from a SubDataFrame.
This requires two things. The function should work on a DataVec (currently log() does, see datavec.jl), and the ModelFrame (or maybe ModelMatrix) constructor should evaluate the function.
Using : to index DataFrame columns returns an empty DataFrame.
d = DataFrame()
d["y"] = [1:4]
d["x1"] = [5:8]
d["x2"] = [9:12]
julia> d[1:4,:] # wrong answer
DataFrame (0,0)
julia> d[:,1:3]
DataFrame (4,3)
y x1 x2
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
julia> d[1:4,1:3]
DataFrame (4,3)
y x1 x2
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
Such a call occurs in the first line of model_matrix:
df = mfarg.df[complete_cases(mfarg.df),:]
currently probably breaks if you mix ASCII and UTF-8?
It'd be nice to know things like how much overhead does the NA masking have over a raw vector, and how much memory does a PooledDataVec save, etc. The core Julia team is thinking about this too, and also about storing benchmarks over time as the language matures. JuliaLang/julia#1073
One aspect of missing data that JuliaData does not support is dense arrays with missing data. Extending DataVec and PooledDataVec to a DataArray type that can handle missing data for an array of arbitrary dimension seems like a useful addition to the package. The semantics of a DataArray should be the same as normal Arrays, with the addition that functions that operate on them should have the option of excluding missing data. Additionally, slicing a DataArray would return a DataArray with the proper number of dimensions. In principle a 1d DataArray would be a DataVec; however, there may be compelling reasons to keep the DataVec implementation separate. The proposed implementation of DataArray will be such that a 1d DataArray will behave exactly as a current DataVec. Having a special type DataMatrix for 2d data would also be useful. The nafilter/naFilter and nareplace/naReplace methods would return flattened versions of the objects. I'm sure there are behaviors I have not specified or have not been clear about, so any thoughts on the design and implementation of DataArray are appreciated.
One other special case that deserves attention is float arrays with missing data. Here I think it is worth implementing something similar to the approach in issue #22 for arbitrary arrays of floats: using NaN to indicate missing data. In this special case NaN has the correct semantics for missing data and does not require a separate mask, and it is straightforward to implement this behavior. This type could then be sub-typed to allow named rows and columns via the Index type already in JuliaData, but just the added functionality for Float arrays would be very useful for machine learning and statistics algorithms that operate on float arrays. Again, any thoughts are appreciated.
I am planning on implementing these ideas as time permits, but help is welcome if anyone wants to run with the ideas.