fslaborg / deedle
Easy to use .NET library for data and time series manipulation and for scientific programming
Home Page: http://fslab.org/Deedle/
License: BSD 2-Clause "Simplified" License
Add more tests for series, etc. based on F# interactive scripts.
Automatically casting to Series<'K, float> when writing frame?ColName is not ideal for all applications. Perhaps make it possible to change the behaviour by opening a different namespace?
flatten nested data structures
The font used in the docs for the tooltips has a problem: 0 and o are very hard to distinguish (tested both in Chrome and IE on Windows).
For example, in http://bluemountaincapital.github.io/FSharp.DataFrame/tutorial.html, in the second code block, in the first (...), where the tooltip is seq { for i in 0 .. (count - 1) ->
I was honestly confused and looking for the definition of the o variable :)
I suggest the same font from the code is used instead.
I get an exception when converting a Deedle data frame to R (a frame with sliced columns).
This is the script:
#I "..\\packages\\Deedle.0.9.11-beta"
#I "..\\packages\\RProvider.1.0.4"
#load "RProvider.fsx"
#load "Deedle.fsx"
open Deedle
open RDotNet
open RProvider
open RProvider.``base``
open RProvider.datasets
let mtcars : Frame<string, string> = R.mtcars.GetValue()
let mtcars' = mtcars.Columns.[["vs";"am";"gear";"carb"]]
R.as_data_frame(mtcars) // works
R.as_data_frame(mtcars') // fails
Error message:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.Exception: No converter registered for type System.Object[] or any of its base types
at [email protected](String message) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 102
at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
at <StartupCode$Deedle-RProvider-Plugin>.$Exports.RProvider-IConvertToR-1-Convert@21.Deedle-IFrameOperation`1-Invoke[a,b](Frame`2 )
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at [email protected](a engine, b value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 98
at RProvider.RInteropInternal.REngine.SetValue(REngine this, Object value, FSharpOption`1 symbolName) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 212
at RProvider.RInteropInternal.toR(Object value) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 225
at RProvider.RInterop.passArg@312(List`1 tempSymbols, Object arg) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 326
at [email protected](IEnumerable`1& next) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 334
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.MoveNextImpl()
at Microsoft.FSharp.Core.CompilerServices.GeneratedSequenceBase`1.System-Collections-IEnumerator-MoveNext()
at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
at RProvider.RInterop.callFunc(String packageName, String funcName, IEnumerable`1 argsByName, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 331
at RProvider.RInterop.call(String packageName, String funcName, String serializedRVal, Object[] namedArgs, Object[] varArgs) in c:\dev\git\RProvider\src\RProvider\RInterop.fs:line 375
at <StartupCode$FSI_0008>.$FSI_0008.main@()
Add overload taking series...
This is a great and long awaited library, but could it target .NET 4.0?
It is quite trivial to replace IReadOnlyList by ReadOnlyCollection (done here https://github.com/buybackoff/FSharp.DataFrame/commit/7e0b84c096ab3a27a55fc0658c832555cd65f269, all tests pass).
However there are modules FrameUtils and FrameExtentions that are tightly coupled with FSharp.Data.DesignTime for type inference from TextReader. Then the method ReadCSV is used from tests, but the data supplied is a .csv file. As I understand, runtime FSharp.Data could infer types from sample files, but in FrameUtils the data is supplied as TextReader.
This SO question says one doesn't need the DesignTime reference and could delete it, but that does not work in this case. http://stackoverflow.com/questions/19214044/is-fsharp-data-designtime-net-4-5-only
Probably .CSV parsing utility and extensions should not be a part of the DataFrame itself, but reside in tests or samples? I am quite happy with Frame constructor only and could easily construct columns myself and use the constructor like on the last line in FrameUtils: Frame(rowIndex, columnIndex, Vector.ofValues columns).
Add IIndex type that automatically avoids duplicate-key errors (e.g. when appending data frames that have an ordinal index, the index should be re-calculated).
We could support something like this:
frame { for r in frame do
          indexRowsString "Name"
          shift 1
          window 5 into win
          select ... }
Add type provider that can be used for creating statically typed data frames.
Add functions for interpolating missing values in a series (or better, build a function that can calculate values for keys not in the series and that can also be used in FillMissing).
(because that does not make sense)
Don suggests renaming Series.ofObservations to something else (like Series.ofPairs). I think "observations" is a bit too long, so I agree, but I am not entirely sure what the best name would be. "Pair" sounds okay, but maybe not ideal.
I tried to retrieve data from the web, but I got an error.
open System.Net
open Deedle
let irisDataUri = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
let iris =
    let request = WebRequest.Create (irisDataUri)
    use response = request.GetResponse ()
    use stream = response.GetResponseStream ()
    Frame.ReadCsv (stream, false)
Unhandled Exception: System.NotSupportedException: This stream does not support seek operations.
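One possible workaround, sketched here under the assumption that the reader only needs a seekable stream, is to buffer the response into a MemoryStream first. The helper name toSeekable is hypothetical and is not part of Deedle.

```fsharp
open System.IO

/// Copies any readable stream into a MemoryStream, which supports seeking.
/// (Hypothetical helper, not part of Deedle.)
let toSeekable (source: Stream) : MemoryStream =
    let buffer = new MemoryStream()
    source.CopyTo buffer
    // Rewind so that the consumer starts reading from the beginning.
    buffer.Position <- 0L
    buffer
```

In the script above, the buffered copy would be passed to Frame.ReadCsv in place of the raw response stream. The whole response is held in memory, which is fine for a small file such as the iris data but may not suit large downloads.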
Provide query builder for series/frame.
Not entirely clear what could be supported, but we could certainly add something!
Given two series, we want to know how they differ. That is, find which keys are available in one, but not in the other and find the keys for which they both have values, but the values differ.
This would be very useful for interactive exploration - when you get two data frames or two series and want to quickly check how they differ (e.g. when they represent two versions of the same data set).
For example, say we have the following two series:
let s1 = series [ 1 => 1.0; 2 => 2.0; 3 => 3.0 ]
let s2 = series [ 1 => 10.0; 2 => 2.0; 4 => 4.0 ]
The difference could be described using a simple discriminated union, something like this:
type Diff<'T> =
  | Change of 'T * 'T
  | Remove of 'T
  | Add of 'T
  override x.ToString() =
    match x with
    | Change(a, b) -> sprintf "%A -> %A" a b
    | Remove v -> sprintf "-%A" v
    | Add v -> sprintf "+%A" v
And Series.compare a b would return something like this:
series [1 => Change(1.0, 10.0); 3 => Remove 3.0; 4 => Add 4.0 ]
Comparing frames could work in a similar way...
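To make the proposal concrete, here is a minimal sketch of the comparison logic. It works on plain F# Map values rather than Deedle series, and the name compareMaps is an assumption used only for illustration; a real Series.compare would need to handle missing values as well.

```fsharp
type Diff<'T> =
    | Change of 'T * 'T
    | Remove of 'T
    | Add of 'T

/// Compares two keyed collections and returns only the keys that differ.
/// (Stand-in for the proposed Series.compare, using Map instead of Series.)
let compareMaps (a: Map<'K, 'T>) (b: Map<'K, 'T>) : Map<'K, Diff<'T>> =
    // Union of the keys found on either side.
    let keys =
        Seq.append (Map.toSeq a |> Seq.map fst) (Map.toSeq b |> Seq.map fst)
        |> Seq.distinct
    keys
    |> Seq.choose (fun k ->
        match Map.tryFind k a, Map.tryFind k b with
        | Some x, Some y when x <> y -> Some (k, Change (x, y))
        | Some x, None -> Some (k, Remove x)
        | None, Some y -> Some (k, Add y)
        | _ -> None)   // present on both sides with equal values
    |> Map.ofSeq

let m1 = Map.ofList [ 1, 1.0; 2, 2.0; 3, 3.0 ]
let m2 = Map.ofList [ 1, 10.0; 2, 2.0; 4, 4.0 ]
let diff = compareMaps m1 m2
// diff contains: 1 -> Change (1.0, 10.0); 3 -> Remove 3.0; 4 -> Add 4.0
```

Key 2 is omitted because both sides hold the same value, which matches the expected result shown above.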
Hi,
I am new to F#. I bumped into a strange issue while experimenting with Deedle, which I posted here. I'd appreciate it if you could help me out with this.
http://stackoverflow.com/questions/19795949/map-to-deedle-frame/19796225#19796225
Thanks.
It would be nice to be able to do:
let frame = Frame.ReadCsv "http://faculty.washington.edu/heagerty/Books/Biostatistics/DATA/ozone.csv"
similarly to what can be done with the CsvProvider
Frame.ReadCsv depends on it
It would be helpful if duplicate key exceptions reported the duplicate key; otherwise it's painful to figure out what happened.
System.ArgumentException: Duplicate keys are not allowed in the index.
Parameter name: keys
at Microsoft.FSharp.Core.Operators.Raise[T](Exception exn)
at FSharp.DataFrame.Indices.Linear.LinearIndex`1..ctor(IEnumerable`1 keys, IIndexBuilder builder, FSharpOption`1 ordered) in C:\dev\FSharp.DataFrame\src\Indices\LinearIndex.fs:line 50
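As an illustration of the kind of message that would help, the following sketch collects the offending keys before raising. The function names findDuplicates and ensureUnique are hypothetical and this is not how LinearIndex is implemented; it only shows the idea on a plain sequence of keys.

```fsharp
/// Returns the keys that occur more than once, in order of first appearance.
let findDuplicates (keys: seq<'K>) : 'K list =
    keys
    |> Seq.countBy id
    |> Seq.filter (fun (_, count) -> count > 1)
    |> Seq.map fst
    |> List.ofSeq

/// Raises ArgumentException naming the duplicated keys, if there are any.
let ensureUnique (keys: seq<'K>) : unit =
    match findDuplicates keys with
    | [] -> ()
    | dups ->
        invalidArg "keys"
            (sprintf "Duplicate keys are not allowed in the index. Duplicates: %A" dups)
```

Counting every key costs an extra pass over the data, so a real implementation might do this only on the failure path.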
... to go the fsx file in generated documentation.
For example, given the following CSV file:
a,,
1,2,3
1,2,3
1,2,3
The Frame.ReadCsv function fails. It should instead generate some names for the unlabeled columns.
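A minimal sketch of one way to name the unlabeled columns follows. The "ColumnN" naming scheme and the function name fillHeaderNames are assumptions for illustration; the issue does not say which names should be generated.

```fsharp
open System

/// Replaces blank header cells with a generated, position-based name.
/// (The "ColumnN" scheme is an assumption, not Deedle's behaviour.)
let fillHeaderNames (headers: string list) : string list =
    headers
    |> List.mapi (fun i h ->
        if String.IsNullOrWhiteSpace h then sprintf "Column%d" (i + 1) else h)

let names = "a,,".Split(',') |> List.ofArray |> fillHeaderNames
// names = ["a"; "Column2"; "Column3"]
```

A fuller version would also need to avoid clashing with a column that is already called, say, Column2.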
The comments are written in Markdown, so this needs to be transformed first. Then we need to generate nice doc page from it...
In the current version, the function takes just a projection:
df |> Frame.nestBy fst
Given Frame<R1 * R2, C>, this produces Series<R1, Frame<R1 * R2, C>>, but it would be more reasonable to produce Series<R1, Frame<R2, C>>. To do that, we would have to take a pair of functions rather than fst.
Also, rename this to nestRowsBy and add nestColsBy.
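The intended shape of the result can be sketched with plain F# maps standing in for frames and series. Here Map<'R1 * 'R2, 'V> plays the role of a frame with a two-level row key, and nestRows is a hypothetical name; the outer key is taken from the first component and the inner key from the second.

```fsharp
/// Groups entries by the first key component and keeps only the second
/// component as the key of each nested map.
let nestRows (rows: Map<'R1 * 'R2, 'V>) : Map<'R1, Map<'R2, 'V>> =
    rows
    |> Map.toSeq
    |> Seq.groupBy (fun ((r1, _), _) -> r1)
    |> Seq.map (fun (r1, items) ->
        r1, items |> Seq.map (fun ((_, r2), v) -> r2, v) |> Map.ofSeq)
    |> Map.ofSeq

let flat = Map.ofList [ ("a", 1), 10.0; ("a", 2), 20.0; ("b", 1), 30.0 ]
let nested = nestRows flat
// nested maps "a" to {1 -> 10.0; 2 -> 20.0} and "b" to {1 -> 30.0}
```

The nested maps are keyed by the second component only, which corresponds to producing Series<R1, Frame<R2, C>> rather than Series<R1, Frame<R1 * R2, C>>.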
Provide some examples of using data frame from C#
Internally, columns are stored as values of type IVector<V>, and the type of V matters sometimes (e.g. when passing data to R provider).
The Print operation should say what the types are, and we need some functions to convert those if they are incorrect.
(But also, slicing should preserve these types...)
Should get F# Charting....
Do we need some sort of computation builder for creating frames & series?
This commit (ef65df7) tried to fix an error where passing Deedle data frame to R would fail.
However, the problem isn't the size of the data frame but the column keys: the operation R.data_frame fails when the column names are not valid R identifiers. The $<- operation can handle that, because it takes the name as a string (not as a named param).
We should probably build data frames using $<- unless that is slower.
Frame.sum on a frame that contains columns with non-numeric data should not throw. It should return a series with missing values or drop the columns.
Like this one:
let inline filterCols f (frame:Frame<'TColumnKey, 'C>) =
    frame.Columns |> Series.filter f |> FrameUtils.fromColumns
What would be the right way to use Series in a real-time environment where new data arrive asynchronously?
I have found a question (and probably a part of an answer) that describes exactly the idea. http://stackoverflow.com/questions/17941932/f-immutable-data-structures-for-high-frequency-real-time-streaming-data
The answers on SO suggest using the FSharpx.Collections.Vector<T> data structure instead of arrays. Another answer on SO by @tpetricek (http://stackoverflow.com/a/19520214/801189) explains why arrays are faster than lists for fixed data, and I believe that was one of the reasons for the initial implementation of Vector as ArrayVector in Deedle. I think the current focus of Deedle is to deal with fixed, existing data series and frames, a workflow much like R's. But if the data length is fixed, then performance is less important than in a real-time environment.
For streaming data we need to append new value(s) to an existing series and use the new series. With the current array implementation, that requires copying the whole old array to a new, resized array. In the first question the author mentions 5 million data points per instrument per day (let's assume an 8-byte double plus 8 bytes for DateTime), or around 80 MB per instrument. With e.g. 100 instruments, copying all arrays many times per second is probably not the best option.
Simplest use case
For stock price A with 1 second interval we calculate 60-second moving average and store it in a series MA_A_60. We update all vectors as new data points arrive.
Will the current implementation be suitable for such workflow for hundreds of instruments, multiple calculated values for each one and sub-second frequency?
Will an implementation of Deedle's IVector with FSharpx.Collections.Vector be more suitable for such a use case? (I know one should run some tests in a similar situation, but there is no second implementation to compare with.)
I would love to have Deedle's abstraction and API for such use case!
P.S. An abstraction of the workflow: if seriesB = f seriesA, then we could somehow link series B to series A, watch for new values in A and add the new values to B (applying the function f only to the incremental data). For this we would need some projection object that keeps seriesB always synchronized with seriesA using the transformation function f. In turn, there could be some seriesC = f2 seriesB, and so on. I am not sure that this functionality should be inside the library, but that is what I hope to achieve.
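As a rough sketch of the incremental idea for the moving-average use case, the class below keeps only the current window and a running sum, so each new data point costs constant work and nothing is copied. The type name MovingAverage is hypothetical and the code does not use Deedle; it only illustrates applying f to the incremental data.

```fsharp
open System.Collections.Generic

/// Incrementally maintained moving average over the last `window` values.
type MovingAverage(window: int) =
    let values = Queue<float>()
    let mutable sum = 0.0

    /// Adds a new observation and returns the updated average.
    member this.Add(x: float) : float =
        values.Enqueue x
        sum <- sum + x
        // Drop the oldest value once the window is full.
        if values.Count > window then
            sum <- sum - values.Dequeue()
        sum / float values.Count

let ma = MovingAverage 3
let averages = [ 1.0; 2.0; 3.0; 4.0 ] |> List.map ma.Add
// averages = [1.0; 1.5; 2.0; 3.0]
```

Until the window is full, the average is taken over the values seen so far. A long-running version might periodically recompute the sum to limit floating-point drift.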
Using frame.WithRowIndex<T>(...) is ugly.
Support Frame.indexWithDate "foo" and Frame.indexWithInt "foo" (and a few standard things), and similarly for member methods (maybe)?
Have you considered, instead of having the IFsiFormattable interface and then registering an fsi printer for it, just turning the Format method into a property and using [<StructuredFormatDisplay("{Format}")>]?
That way we wouldn't need to have #load "FSharp.DataFrame.fsx", just #r "FSharp.DataFrame.dll".
This would be especially helpful with the new "Send to FSI" command in VS2013, making it a better experience: just reference the dll and be ready to go.
The problem with this is that the Format member would appear in IntelliSense, but it could be hidden from C# by using [<EditorBrowsable(EditorBrowsableState.Never)>] and from F# by using [<CompilerMessage("This method is intended to be used only by FSI Printer", 10002, IsHidden=true, IsError=false)>]
(Another simpler option would be just to use the ToString, which would be even nicer for C# users)
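A small sketch of the StructuredFormatDisplay suggestion, using a made-up Table type in place of the real frame and series types:

```fsharp
/// Toy type standing in for a frame; only the display mechanism matters here.
[<StructuredFormatDisplay("{Format}")>]
type Table(rows: (string * float) list) =
    /// Property picked up by %A formatting and by F# Interactive.
    member this.Format =
        rows
        |> List.map (fun (key, value) -> sprintf "%s -> %g" key value)
        |> String.concat "\n"

let table = Table [ "a", 1.0; "b", 2.5 ]
let shown = sprintf "%A" table   // formatted through the Format property
```

With the attribute in place, no printer has to be registered through a separate script: referencing the assembly is enough for F# Interactive to use the property.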
Review and do the standard R thing
Add function to get rows with some missing values (useful for diagnostics...)
Support serializing data frames & series and writing them to CSV/TSV.
Currently lookup option only works for left or right joins on series, but doesn't work with outer join. The more expected behavior would be to lookup values on both sides respectively. Could we do this?
An example use case could be ticks for two stocks that arrive at different times, where the ratio of the two should be updated with every new piece of information.
Another example is exchange schedules in Israel/Gulf countries (closed on Fri, Sat) and Western exchanges (closed on Sat, Sun). On Fri one would want to get the Thu data from Israel/GCC, but on Sun the Fri data from the West.
In both cases one would need outer join with Lookup.NearestSmaller.
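The requested behaviour can be sketched on plain F# maps. The names below are hypothetical, and the lookup is assumed to mean "the value at the greatest key that is less than or equal to the requested key"; whether Lookup.NearestSmaller includes exact matches is an assumption here.

```fsharp
/// Value at the greatest key <= `key`, if any (Map.toSeq is ordered by key).
let tryNearestSmaller (key: 'K) (s: Map<'K, 'V>) : 'V option =
    s
    |> Map.toSeq
    |> Seq.takeWhile (fun (k, _) -> k <= key)
    |> Seq.tryLast
    |> Option.map snd

/// Outer join over the union of keys, looking up each side independently.
let outerJoinNearestSmaller (a: Map<'K, 'A>) (b: Map<'K, 'B>) =
    Seq.append (Map.toSeq a |> Seq.map fst) (Map.toSeq b |> Seq.map fst)
    |> Seq.distinct
    |> Seq.map (fun k -> k, (tryNearestSmaller k a, tryNearestSmaller k b))
    |> Map.ofSeq

let stockA = Map.ofList [ 1, 10.0; 3, 30.0 ]
let stockB = Map.ofList [ 2, 2.0; 3, 3.0 ]
let joined = outerJoinNearestSmaller stockA stockB
// 1 -> (Some 10.0, None); 2 -> (Some 10.0, Some 2.0); 3 -> (Some 30.0, Some 3.0)
```

The linear scan in tryNearestSmaller keeps the sketch short; an ordered index would use a binary search instead.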
Make sure the CSV reader is fast... (pandas default CSV reader can handle some 10k rows, but more is slow?)
The operation is currently only available as a member (mutating the frame). We should add non-mutating module operation.
They should cover the same functionality.
On series & frame for getting/adding values and series. Also on Series builder, tests currently fail.
Consider & add support for hierarchical indexing...
Should not need midpoint argument.
Some time ago I did a small comparison of some R code (https://gist.github.com/ovatsus/5354187#file-original-r), with the equivalent code using CsvProvider (https://gist.github.com/ovatsus/5354187#file-csvprovider-fsx) and using the untyped CsvFile (https://gist.github.com/ovatsus/5354187#file-csvfile-fsx)
I did the same for DataFrame (https://gist.github.com/ovatsus/5354187#file-dataframe-fsx), and I have some feedback:
I expected frame |> Frame.getCol "Solar.R" to be the same as frame.["Solar.R"], but the first one gives this error: Type constraint mismatch when applying the default type 'int' for a type inference variable. The type 'int' does not support the operator 'DivideByInt' Consider adding further type constraints
iris |> Frame.filterRowValues (fun x -> x.GetAs<string>("Species") = "virginica")? It feels very long for such a simple operation.