fslaborg / deedle

Easy to use .NET library for data and time series manipulation and for scientific programming

Home Page: http://fslab.org/Deedle/

License: BSD 2-Clause "Simplified" License

Shell 0.01% F# 94.55% C# 0.92% HTML 2.17% Batchfile 0.01% PowerShell 0.03% Jupyter Notebook 1.66% Forth 0.66%

deedle's People

tpetricek colinbull kos59125 andy-p kziemski utlww bohdanszymanik adamklein cdrnet forki tonyabell vsmida yukitos calwi stic atifaziz atwoodtm tcopple johannh-zz biswapanda zuiwanting dcharbon applied-duality codingday r0k3 marklam tush1r linearregression devjuice hijeenu dsyme buybackoff ammachado ascjones dilico patrickmcdonald alexracoon casbby terencecraig jeyoor chris-b1 modulexcite telefunkenvf14 ovatsus tleviathan filmor dependencies augustoproiete-forks evilpepperman dsimba rikace robertpi denmerc esirola transformersprimeabcxyz yagilofir chrisharding smartcaveman yenyenx aabbcczz msitt croland sandboxorg arshbucks jweibel22 kimserey benkalegin soundarkarunagaran kflu davydovpv ingted romanshestakov philipjadler sigino mjul ahorjia intellibrain munik kostrse rmunn huguojunsy rjshaver spreads cutelittle marioquillas yukibonji valmac andrewrothstein georgemasonopensource stoneflyop1 deepakkumar1984 jlw109 sebhofer huangzhengyong shalokshalom arvidjb sksundaram-learning kblohm erisonliang erikaleblanc88

deedle's Issues

Improve testing

Add more tests for series, etc. based on F# interactive scripts.

Customizable question mark

Automatically casting to Series<'K, float> when writing frame?ColName is not ideal for all applications. Perhaps make it possible to change the behaviour by opening a different namespace?

The font used in the docs for the tooltips has a problem: 0 and o are very hard to distinguish (tested both in Chrome and IE on Windows).
For example, in http://bluemountaincapital.github.io/FSharp.DataFrame/tutorial.html, in the second code block, in the first (...), where the tooltip is seq { for i in 0 .. (count - 1) ->
I was honestly confused and looking for the definition of the o variable :)
I suggest the same font from the code is used instead.

Deedle DataFrame with sliced columns to R conversion exception.

I get an exception when converting a Deedle data frame to R (a frame with sliced columns).
This is the script:

#I "..\\packages\\Deedle.0.9.11-beta"
#I "..\\packages\\RProvider.1.0.4"
#load "RProvider.fsx"
#load "Deedle.fsx"

open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets

let mtcars : Frame<string, string> = R.mtcars.GetValue()
let mtcars' = mtcars.Columns.[["vs";"am";"gear";"carb"]]
R.as_data_frame(mtcars)  // works
R.as_data_frame(mtcars') // fails

Error message:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Exception: No converter registered for type System.Object[] or any of its base types
   at [email protected](String message) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 102
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at <StartupCode$Deedle-RProvider-Plugin>.$Exports.RProvider-IConvertToR-1-Convert@21.Deedle-IFrameOperation`1-Invoke[a,b](Frame`2 )
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at [email protected](a engine, b value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 98
   at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
   at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
   at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
   at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
   at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
   at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
   at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
   at RProvider.RInterop.call(String packageName, String funcName, String serializedRVal, Object[] namedArgs, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 375
   at <StartupCode$FSI_0008>.$FSI_0008.main@()

Join DF and Series

Add overload taking series...

.NET 4.0 support

This is a great and long awaited library, but could it target .NET 4.0?

It is quite trivial to replace IReadOnlyList by ReadOnlyCollection (done here https://github.com/buybackoff/FSharp.DataFrame/commit/7e0b84c096ab3a27a55fc0658c832555cd65f269, all tests pass).

However there are modules FrameUtils and FrameExtentions that are tightly coupled with FSharp.Data.DesignTime for type inference from TextReader. Then the method ReadCSV is used from tests, but the data supplied is a .csv file. As I understand, runtime FSharp.Data could infer types from sample files, but in FrameUtils the data is supplied as TextReader.

This SO question says one doesn't need the DesignTime reference and could delete it, but that does not work in this case. http://stackoverflow.com/questions/19214044/is-fsharp-data-designtime-net-4-5-only

Probably .CSV parsing utility and extensions should not be a part of the DataFrame itself, but reside in tests or samples? I am quite happy with Frame constructor only and could easily construct columns myself and use the constructor like on the last line in FrameUtils: Frame(rowIndex, columnIndex, Vector.ofValues columns).

Index that avoids duplicates

Add IIndex type that automatically avoids duplicate-key errors (e.g. when appending data frames that have an ordinal index, the index should be re-calculated)

Plugin for R provider

Consider adding query builder

We could support something like this:

frame { for r in frame do
        indexRowsString "Name"
        shift 1
        window 5 into win 
        select ... }

Type provider

Add type provider that can be used for creating statically typed data frames.

Interpolation functions

Add functions for interpolating missing values in a series (or better, build a function that can calculate values for keys not in series - and also can be used in FillMissing)
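
A minimal sketch of the "calculate values for keys not in the series" idea, using linear interpolation. It works on a plain array of key/value pairs sorted by key as a stand-in for a series; `interpolateAt` is a hypothetical name, not an existing Deedle function.

```fsharp
// Hypothetical helper: linear interpolation over (key, value) points.
// Assumption: `points` is sorted by key in ascending order.
let interpolateAt (points: (float * float)[]) (key: float) : float option =
  match points |> Array.tryFind (fun (k, _) -> k = key) with
  | Some (_, v) -> Some v // the key exists, no interpolation needed
  | None ->
    // Nearest neighbours on both sides of the requested key.
    let before = points |> Array.tryFindBack (fun (k, _) -> k < key)
    let after = points |> Array.tryFind (fun (k, _) -> k > key)
    match before, after with
    | Some (k0, v0), Some (k1, v1) ->
      Some (v0 + (v1 - v0) * (key - k0) / (k1 - k0))
    | _ -> None // the key lies outside the known range

let points = [| 1.0, 10.0; 3.0, 30.0 |]
let middle = interpolateAt points 2.0
```

A function of this shape could in principle also serve as the fill strategy for FillMissing, since it only needs the neighbouring observations.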

Throw when using Lookup with Join on unordered series

(because that does not make sense)

Naming

Don suggests renaming Series.ofObservations to something else (like Series.ofPairs). I think "observations" is a bit too long, so I agree, but I am not entirely sure what the best name would be. "pair" sounds okay, but maybe not ideal.

Frame.ReadCsv throws an exception with web streams

I tried to retrieve data from the web, but got an error.

open System.Net
open Deedle

let irisDataUri = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
let iris =
   let request = WebRequest.Create (irisDataUri)
   use response = request.GetResponse ()
   use stream = response.GetResponseStream ()
   Frame.ReadCsv (stream, false)

Unhandled Exception: System.NotSupportedException: This stream does not support seek operations.
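
A possible workaround, assuming the exception is raised only because the reader needs a seekable stream: buffer the response into a MemoryStream first. `toSeekable` is a hypothetical helper that uses nothing beyond System.IO.

```fsharp
open System.IO

// Hypothetical helper: copy a forward-only stream into memory so that
// the result supports seeking. The whole payload is held in memory.
let toSeekable (source: Stream) : Stream =
  let buffer = new MemoryStream()
  source.CopyTo buffer
  buffer.Position <- 0L // rewind so the consumer starts at the beginning
  buffer :> Stream

// Intended use with the snippet above (not verified here):
//   Frame.ReadCsv (toSeekable stream, false)
```

This trades memory for convenience, so it is only reasonable for files that fit comfortably in memory.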

Query support

Provide query builder for series/frame.

Not entirely clear what could be supported, but we could certainly add something!

Diff data frames/series

Given two series, we want to know how they differ. That is, find which keys are available in one, but not in the other and find the keys for which they both have values, but the values differ.

This would be very useful for interactive exploration - when you get two data frames or two series and want to quickly check how they differ (e.g. when they represent two versions of the same data set).

For example, say we have the following two series:

let s1 = series [ 1 => 1.0; 2 => 2.0; 3 => 3.0 ]
let s2 = series [ 1 => 10.0; 2 => 2.0; 4 => 4.0 ]

The difference could be described using a simple discriminated union, something like this:

type Diff<'T> = 
  | Change of 'T * 'T 
  | Remove of 'T 
  | Add of 'T
  override x.ToString() =
    match x with 
    | Change(a, b) -> sprintf "%A -> %A" a b 
    | Remove v -> sprintf "-%A" v | Add v -> sprintf "+%A" v

And Series.compare a b would return something like this:

series [1 => Change(1.0, 10.0); 3 => Remove 3.0; 4 => Add 4.0 ]

Comparing frames could work in a similar way...
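
A rough sketch of the comparison logic, using F# Map as a stand-in for series so that it runs without Deedle. `compareMaps` is a hypothetical name; the proposed Series.compare does not exist yet.

```fsharp
// Diff type as proposed above (ToString override omitted for brevity).
type Diff<'T> =
  | Change of 'T * 'T
  | Remove of 'T
  | Add of 'T

// Hypothetical helper: report keys whose values differ between two maps.
let compareMaps (a: Map<'K, 'T>) (b: Map<'K, 'T>) : Map<'K, Diff<'T>> =
  // Union of the keys from both sides.
  let keys =
    Seq.append (Map.toSeq a |> Seq.map fst) (Map.toSeq b |> Seq.map fst)
    |> Seq.distinct
  keys
  |> Seq.choose (fun k ->
      match Map.tryFind k a, Map.tryFind k b with
      | Some x, Some y when x = y -> None // unchanged, not reported
      | Some x, Some y -> Some (k, Change(x, y))
      | Some x, None -> Some (k, Remove x)
      | None, Some y -> Some (k, Add y)
      | None, None -> None)
  |> Map.ofSeq

let s1 = Map.ofList [ 1, 1.0; 2, 2.0; 3, 3.0 ]
let s2 = Map.ofList [ 1, 10.0; 2, 2.0; 4, 4.0 ]
let diff = compareMaps s1 s2
```

For the two inputs above, `diff` contains Change(1.0, 10.0) for key 1, Remove 3.0 for key 3 and Add 4.0 for key 4, matching the result described earlier.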

Float to string conversion when framing a series

Hi,
I am new to F#. I bumped into a strange issue while experimenting with Deedle, which I posted here. I'd appreciate it if you could help me out with this.

http://stackoverflow.com/questions/19795949/map-to-deedle-frame/19796225#19796225

Thanks.

Support loading a CSV directly from a URL into a Frame

It would be nice to be able to do:

let frame = Frame.ReadCsv "http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozone.csv"

similarly to what can be done with the CsvProvider

Missing FSharp.Data dependency on .nuspec

Frame.ReadCsv depends on it

Improve exceptions

It would be helpful if duplicate key exceptions reported the duplicate key; otherwise it's painful to figure out what happened.

System.ArgumentException: Duplicate keys are not allowed in the index.
Parameter name: keys
at Microsoft.FSharp.Core.Operators.Raise[T](Exception exn)
at FSharp.DataFrame.Indices.Linear.LinearIndex`1..ctor(IEnumerable`1 keys, IIndexBuilder builder, FSharpOption`1 ordered) in C:\dev\FSharp.DataFrame\src\Indices\LinearIndex.fs:line 50

View source button in docs

... to go to the fsx file from the generated documentation.

Add Frame.empty and Series.empty

Suggestion: when using Frame.ofRecords and there's a single record field of type Date, automatically use that as the key

Handle CSV files with missing column keys

For example, given the following CSV file:

a,,
1,2,3 
1,2,3
1,2,3

The Frame.ReadCsv function fails. It should instead generate some names for the unlabeled columns.
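
One way the name generation could look, as a sketch only. The `Column<N>` naming scheme and the `fillHeaderNames` helper are assumptions made for illustration, and the naive comma split ignores quoted fields.

```fsharp
// Hypothetical helper: replace empty header cells with generated names.
// Assumption: names follow the pattern "Column<position>", 1-based.
let fillHeaderNames (header: string) : string[] =
  header.Split ','
  |> Array.mapi (fun i name ->
      if System.String.IsNullOrWhiteSpace name then sprintf "Column%d" (i + 1)
      else name.Trim())

let names = fillHeaderNames "a,,"
```

A real implementation would also have to make sure that a generated name does not collide with a column that is explicitly labeled the same way.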

Generated docs from XML comments

The comments are written in Markdown, so this needs to be transformed first. Then we need to generate nice doc page from it...

Improve behaviour of `nestBy`

In the current version, the function takes just a projection:

df |> Frame.nestBy fst

Given Frame<R1 * R2, C>, this produces Series<R1, Frame<R1 * R2, C>> but it would be more reasonable to produce Series<R1, Frame<R2, C>>. To do that, we would have to take a pair of functions rather than fst.

Also, rename this to nestRowsBy and add nestColsBy.

C# documentation

Provide some examples of using data frame from C#

Expose types of columns

Internally, columns are stored as values of type IVector<V> and the type of V matters sometimes (e.g. when passing data to R provider).

The Print operation should say what the types are and we need some functions to convert those if they are incorrect.

(But also, slicing should preserve these types...)

Add packages.config for samples

Should get F# Charting....

Consider better story for building frames

Do we need some sort of computation builder for creating frames & series?

R plugin - data frame columns

This commit (ef65df7) tried to fix an error where passing Deedle data frame to R would fail.

However, the problem isn't the size of the data frame, but instead, the column keys - the operation R.data_frame fails when the column names are not valid R identifiers. The $<- operation can handle that, because it takes the name as a string (not as a named param).

We should probably build data frames using $<- unless that is slower.

Frame.sum should not fail

Frame.sum on a frame that contains columns with non-numeric data should not throw. It should return a series with missing values or drop the columns.

Finish walkthroughs & sample scripts

Add Series.fold, Series.reduce, Series.scan etc.

Fix and document functions

Like this one:

let inline filterCols f (frame:Frame<'TColumnKey, 'C>) = 
  frame.Columns |> Series.filter f |> FrameUtils.fromColumns

Series and Frames for real-time streaming data

What would be the right way to use Series in a real-time environment where new data arrive asynchronously?

I have found a question (and probably a part of an answer) that describes exactly the idea. http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data

The answers on SO suggest using the FSharpx.Collections.Vector<T> data structure instead of arrays. Another answer on SO (http://stackoverflow.com/a/19520214/801189) by @tpetricek explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is on fixed, existing data series and frames, a workflow much like R's. But if the data length is fixed, then performance is less important than in a real-time environment.

For streaming data we need to append new value(s) to an existing series and use the new series. With the current array implementation, that requires copying the whole old array into a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus DateTime's 8 bytes), or around 80 MB per instrument. With e.g. 100 instruments, copying all arrays many times per second is probably not the best option.

Simplest use case
For stock price A with 1 second interval we calculate 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.

For a new price point we create a new series object by appending to the old object (for a very large data set, copying the array is slow).
Then we take the last 60 values from the new series object and calculate the new MA value (the crucial point is to avoid recalculating all MA values and to use only the last 60-point window from A).
We append the new MA value to MA_A_60.
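
The incremental part of these steps can be sketched independently of Deedle. The class below is a hypothetical illustration, not part of the library: it keeps only the current window and a running sum, so each new point costs constant time regardless of how long the full series is.

```fsharp
open System.Collections.Generic

// Hypothetical sketch: moving average that is updated one point at a time.
type MovingAverage(window: int) =
  let values = Queue<float>()
  let mutable sum = 0.0
  // Adds a point and returns the average of the last `window` points.
  // Until the window is full, the average covers the points seen so far.
  member this.Add(x: float) : float =
    values.Enqueue x
    sum <- sum + x
    if values.Count > window then
      sum <- sum - values.Dequeue() // drop the point that left the window
    sum / float values.Count

let ma = MovingAverage(3)
let averages = [ 1.0; 2.0; 3.0; 4.0 ] |> List.map ma.Add
```

With a window of 60 this corresponds to the MA_A_60 case. Over very long runs the running sum can accumulate floating-point error, so periodically recomputing it from the window may be worthwhile.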

Will the current implementation be suitable for such a workflow with hundreds of instruments, multiple calculated values for each one, and sub-second frequency?

Will an implementation of Deedle's IVector with FSharpx.Collections.Vector be more suitable for such a use case? (I know one should run some tests in a situation like this, but there is no second implementation to compare with.)

I would love to have Deedle's abstraction and API for such a use case!

P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A and add the new values to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA using the transformation function f. In turn, there could be some seriesC = f2 seriesB, and so on. I am not sure that this functionality should be inside the library, but that is what I hope to achieve.

Simplify setting index column.

Using frame.WithRowIndex<T>(...) is ugly.

Support Frame.indexWithDate "foo" and Frame.indexWithInt "foo" (and a few standard things) - and similarly for member methods (maybe)?

IFsiFormattable

Have you considered instead of having the IFsiFormattable interface and then registering an fsi printer for it, just turning the Format method into a property and use [<StructuredFormatDisplay("{Format}")>]?

That way we wouldn't need to have #load "FSharp.DataFrame.fsx", just #r "FSharp.DataFrame.dll".

This would be especially helpful with the new "Send to FSI" command in VS2013, making it a better experience: just reference the dll and be ready to go.

The problem with this is that the Format member would appear in IntelliSense, but it could be hidden from C# by using [<EditorBrowsable(EditorBrowsableState.Never)>] and from F# by using [<CompilerMessage("This method is intended to be used only by FSI Printer", 10002, IsHidden=true, IsError=false)>]

(Another simpler option would be just to use the ToString, which would be even nicer for C# users)

Stack & unstack

Review and do the standard R thing

Rows with missing values

Add function to get rows with some missing values (useful for diagnostics...)

Output CSV/TSV & Serialization in general

Support serializing data frames & series and writing them to CSV/TSV.

Series.Join with JoinKind.Outer ignores lookup option

Currently the lookup option only works for left or right joins on series, but not with an outer join. The more expected behavior would be to look up values on both sides respectively. Could we do this?

An example use case could be ticks for two stocks that arrive at different times, while the ratio of the two should be updated with every new piece of information.

Another example is exchange schedules in Israel/Gulf countries (closed on Fri, Sat) and Western exchanges (closed on Sat, Sun). On Fri one would want to get Thu data from Israel/GCC, but on Sun the Fri data from the West.

In both cases one would need outer join with Lookup.NearestSmaller.
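
The requested behavior can be sketched on plain sorted arrays, without Deedle. Both function names are hypothetical, and the sketch assumes that the lookup means "the value at the key itself or, failing that, at the nearest smaller key".

```fsharp
// Hypothetical helper: value at `key`, or at the nearest smaller key.
// Assumption: `series` is sorted by key in ascending order.
let lookupSmaller (series: (int * float)[]) (key: int) : float option =
  series |> Array.tryFindBack (fun (k, _) -> k <= key) |> Option.map snd

// Hypothetical outer join: for every key present on either side,
// look up a value on both sides.
let outerJoinWithLookup (a: (int * float)[]) (b: (int * float)[]) =
  Array.append (Array.map fst a) (Array.map fst b)
  |> Array.distinct
  |> Array.sort
  |> Array.map (fun k -> k, lookupSmaller a k, lookupSmaller b k)

let left = [| 1, 1.0; 3, 3.0 |]
let right = [| 2, 20.0; 3, 30.0 |]
let joined = outerJoinWithLookup left right
```

At key 2 the left side has no exact value, so the sketch carries forward the value from key 1. At key 1 the right side has nothing earlier and stays missing.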

Some time ago I did a small comparison of some R code (https://gist.github.com/ovatsus/5354187#file-original-r), with the equivalent code using CsvProvider (https://gist.github.com/ovatsus/5354187#file-csvprovider-fsx) and using the untyped CsvFile (https://gist.github.com/ovatsus/5354187#file-csvfile-fsx)

I did the same for DataFrame (https://gist.github.com/ovatsus/5354187#file-dataframe-fsx), and I have some feedback:

Line 45: I needed to filter the data to where Ozone > 31 and Temp > 90, but some rows didn't have the Ozone value. I was forced to fill the missing values with zeros to work around it. Is there a better way to do this? Maybe the dynamic operator on a series could return Double.NaN when the value is missing?
Line 52: I was expecting frame |> Frame.getCol "Solar.R" to be the same as frame.["Solar.R"], but the first one gives this error: Type constraint mismatch when applying the default type 'int' for a type inference variable. The type 'int' does not support the operator 'DivideByInt' Consider adding further type constraints
Line 61: Is there any better way to do this than iris |> Frame.filterRowValues (fun x -> x.GetAs<string>("Species") = "virginica")? It feels very long for such a simple operation
