
nietras / sep

735 stars · 7 watchers · 29 forks · 447 KB

World's Fastest .NET CSV Parser. Modern, minimal, fast, zero-allocation reading and writing of separated values (`csv`, `tsv`, etc.). Cross-platform, trimmable and AOT/NativeAOT compatible.

Home Page: http://nietras.com

License: MIT License

PowerShell 0.40% C# 99.60%
csharp csv csv-parser csv-reader csv-writer dotnet performance simd

sep's People

Contributors: dependabot[bot], nietras


sep's Issues

Namespace nietras.SeperatedValues does not conform to naming guidelines

@nietras
Not really an issue, more a suggestion. The namespace naming guidelines suggest using PascalCasing (https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/names-of-namespaces).
You might also want to reserve the "Nietras" prefix on NuGet and name the package with that prefix as well, so instead of "Sep" use "Nietras.Sep". Or remove the "nietras" namespace prefix from the source like some other packages do (such as Serilog) and then reserve "Sep" as a prefix on NuGet.

Case Sensitivity on row[string]

Hi, @nietras

Been trying out the library and love it; brilliant work.

One issue I have had is that the case of the CSV file headers does not always match the case used when referencing the row key.

This isn't an issue with my test data, but the files come from an external source and can easily use the wrong case. Would it be possible to have an option for case-insensitive lookup?
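Until such an option exists, a common workaround (a standalone sketch using only plain .NET, no Sep types; the header values are hypothetical) is to read the header once and build your own case-insensitive name-to-index map:

```csharp
using System;
using System.Collections.Generic;

// Build a case-insensitive header-name -> column-index map once per file
// and resolve columns through it instead of through the raw header strings.
string[] header = { "Name", "AGE", "City" }; // hypothetical header row

var colIndex = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
for (int i = 0; i < header.Length; i++)
    colIndex[header[i]] = i;

Console.WriteLine(colIndex["name"]); // 0
Console.WriteLine(colIndex["city"]); // 2
```

Columns can then be accessed by the looked-up index, which also avoids repeating string comparisons per row.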

No async support

Hi @nietras thanks for the library

I was really excited to try it; however, it's not async compatible, because Row is a ref struct:

foreach (SepReader.Row row in reader) // Error CS4012: Parameters or locals of type 'SepReader.Row' cannot be declared in async methods or async lambda expressions.
{
    await DoStuff();
}

Any workarounds? (other than "read everything into a huge buffer and do your thing")
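One workaround that avoids a huge buffer (a sketch, not Sep-specific): keep all ref struct access inside a synchronous helper that copies the needed values into ordinary heap types, and only await on the copies. `ReadOnlySpan<char>` stands in for `SepReader.Row` below, since both are ref structs with the same restriction:

```csharp
using System;
using System.Threading.Tasks;

// Keep all ref struct access inside a synchronous helper that returns
// ordinary heap types; the async method only ever touches the copies.
static string[] ExtractValues(ReadOnlySpan<char> row) // Span is a ref struct too
    => row.ToString().Split(';');

string[] values = ExtractValues("alice;42".AsSpan()); // sync: ref struct is fine here
await DoStuffAsync(values);                           // async: only heap types cross the await

static async Task DoStuffAsync(string[] values)
{
    await Task.Yield();
    Console.WriteLine(string.Join(",", values)); // prints "alice,42"
}
```

Applied to Sep, the synchronous helper would call MoveNext and copy each row's columns out before control returns to the async caller, so only one row is materialized at a time.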

Question about batch write

Hi, first of all, thank you very much for this library; it does an excellent job.

I am trying to write around 600M rows to CSV. I have tried several options but couldn't find the sweet spot between writing in batches and writing a single row at a time. What would be your suggestion on where to focus?

Essentially a bunch of work to get the data into the array, and then:

using var row = writer.NewRow();
for (int i = 0; i < array.Length; i++)
{
    row[i.ToString()].Format(array[i]);
}

array is always an array of long values
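One low-hanging optimization in the snippet above: `i.ToString()` allocates a fresh string per column on every row and forces a name lookup each time. A sketch (plain C#; the Sep call is shown only as a comment) of hoisting the column-name strings out of the row loop:

```csharp
using System;
using System.Linq;

// i.ToString() allocates a new string per column per row; precompute the
// column-name strings once and reuse them across all rows.
int colCount = 4;                                  // hypothetical column count
string[] colNames = Enumerable.Range(0, colCount)
                              .Select(i => i.ToString())
                              .ToArray();

long[] array = { 10, 20, 30, 40 };                 // one row's values
for (int i = 0; i < array.Length; i++)
{
    // row[colNames[i]].Format(array[i]);          // the Sep call from above
    Console.WriteLine($"{colNames[i]}={array[i]}");
}
```

For 600M rows this removes hundreds of millions of transient string allocations from the hot loop.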

Assembly not StrongNamed signed

@nietras the Sep.dll assembly isn't strong-name signed, so it cannot be referenced by assemblies that are strong-name signed.

It looks like Sep's dependency csFastFloat isn't signed either.

Getting System.ArgumentOutOfRangeException when creating reader with large stream

Hi,
I'm getting:

System.ArgumentOutOfRangeException: minimumLength ('-688018613') must be a non-negative value. (Parameter 'minimumLength')
Actual value was -688018613.
at System.ArgumentOutOfRangeException.ThrowNegative[T](T value, String paramName)
at System.ArgumentOutOfRangeException.ThrowIfNegative[T](T value, String paramName)
at System.Buffers.SharedArrayPool`1.Rent(Int32 minimumLength)
at nietras.SeparatedValues.SepReader..ctor(SepReaderOptions options, TextReader reader)
at nietras.SeparatedValues.SepReaderExtensions.From(SepReaderOptions options, TextReader reader)
at nietras.SeparatedValues.SepReaderExtensions.From(SepReaderOptions options, Stream stream)

when creating a Sep Reader with a very large Stream object:

using var reader = Sep.Reader().From(stream);

Is there a limit to how big the stream can be?
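A plausible (unconfirmed) explanation for the negative minimumLength: somewhere a length derived from the stream wraps around Int32. The value in the exception is exactly what an unchecked 32-bit truncation of a roughly 3.6 GB length (a hypothetical figure chosen to match) produces:

```csharp
using System;

// Hypothetical stream length just over 2^31 bytes; truncating it to Int32
// wraps to exactly the negative value reported in the exception.
long length = 3_606_948_683;
int truncated = unchecked((int)length);
Console.WriteLine(truncated); // -688018613
```

If that is the cause, streams whose length fits in Int32 should work, and anything larger would need chunked reading.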

Implement SepWriterOptions.WriteHeader (false)

Currently SepWriter does not support writing without a header. Implement this via a new option WriteHeader, defaulting to true.

  • Add SepWriterOptions.WriteHeader
  • Allow adding columns by col index, e.g. if an index is one past the last index, add a new column.
  • Still allow named columns, but fail when accessing a column by name if the column was only added by index.

Support for unescaping column values

When I heard about Sep in a Microsoft community update I was immediately intrigued and tried some experiments, replacing my existing use of CsvHelper. I am very happy with the functionality of CsvHelper, as it does exactly what I need:
take some text (string) and parse it into a string[][] of rows and columns.

After adjusting to Sep's different way of doing things, I hit a wall when I noticed that there is no unescaping of column values at all. E.g. A,"B" returns A, "B" instead of A, B.
To me this felt like an oversight (I know it's 0.1 and I do not want to blame anybody).

So my question is: will there ever be a version of Sep that returns the real content of a column, or will it always be raw CSV that needs further parsing?
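For reference, "unescaping" here means the RFC 4180 rules: strip the surrounding quotes and collapse doubled quotes inside them. A minimal standalone sketch (not Sep's implementation):

```csharp
using System;

// Minimal RFC 4180-style unescape: drop the surrounding quotes and
// collapse doubled quotes inside them. A sketch, not Sep's code.
static string Unescape(string col) =>
    col.Length >= 2 && col[0] == '"' && col[^1] == '"'
        ? col[1..^1].Replace("\"\"", "\"")
        : col;

Console.WriteLine(Unescape("\"B\""));              // B
Console.WriteLine(Unescape("\"say \"\"hi\"\"\"")); // say "hi"
Console.WriteLine(Unescape("A"));                  // A
```

So for the example above, `A,"B"` would yield the columns A and B after unescaping.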

Error CS4012 Parameters or locals of type 'SepReader.Row' cannot be declared in async methods or async lambda expressions.

@nietras

I'm sure this is a general .NET issue, but I haven't been able to find a good answer by searching for the problem. Maybe you can help me work through it.

My main looks like this: static async Task Main(string[] args) since I am calling async methods elsewhere.

When I try to use either:

while (reader.MoveNext())
{
   var row = reader.Current;
}

or

foreach (var row in reader)
{
}

The error occurs at the definition of the row: CS4012 Parameters or locals of type 'SepReader.Row' cannot be declared in async methods or async lambda expressions.

Suggestions are welcome.

Specified argument was out of the range of valid values. at nietras.SeparatedValues.SepReader.ParseNewRows()

@nietras first a compliment. This library is insanely fast!!!

I'm running into a problem parsing a comma separated CSV file with lots of double quoted strings with embedded double quotes. I get 143,651 rows into the file and then it crashes with the Specified argument was out of the range of valid values. at nietras.SeparatedValues.SepReader.ParseNewRows() error. The total row count in the file is around 650,000 and there are lots of columns.

I tried creating a small file with just the column header row and the block of rows -10 to +10 around the place where it crashes on the larger file. That doesn't crash.

I also experimented with the reader options. I found that I can flip the Unescaped setting and then it no longer crashes. That leaves me with another problem, as I'm also using the SepWriter to write out a subset of that much larger input file. This subset becomes formatted differently with the Unescaped setting at a non-default value (most of my double quotes are now missing).

Since the tiny -10 to + 10 row test file didn't crash it is almost like some quote escape variable is running out of range after dealing with a ton of these column/cell values with lots of embedded quotes.

Here is the code stripped down to the basics:

using (SepReader reader = Sep.New(',').Reader(o => o with { HasHeader = true, DisableColCountCheck = true }).FromFile(sFilename))
{
    using (SepWriter writer = reader.Spec.Writer().ToFile(sListingFilename))
    {
        foreach (SepReader.Row row in reader)
        {
            nLine++;

            string sName = row["Name"].ToString();
            if (sName.IndexOf("lookingfor", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                using SepWriter.Row writeRow = writer.NewRow(row);
            }
        }
    }
}

Ignore double quotes in data

Hi @nietras
is it possible to ignore double quotes in data? I have a CSV line (the delimiter is ,):
0237,02,0000021594,000833738000041,RCSM-B1"/65,0008337380000,2022-11-01 00:30:16

and this line results in this exception:
Buffer or row has reached maximum supported length of 16777216. If no such row should exist ensure quotes " are terminated.

Thanks
grsgrs

How to parse whole object from csv having different column names?

We have a CSV file with column names "PERSON_F_NAME", "PERSON_L_NAME", "PERSON_AGE", etc.

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

This is just for one file; there are other files with different column names, each with an associated class using those column names.
I want to write a generic way to map CSV to objects.
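One generic approach (a standalone sketch, not something Sep provides out of the box): annotate each property with its CSV column name and fill the object via reflection. The `CsvColumnAttribute` and `Map` helper below are hypothetical names, and a header-to-value dictionary stands in for one parsed row:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// A header -> value dictionary stands in for one parsed CSV row.
var row = new Dictionary<string, string>
{
    ["PERSON_F_NAME"] = "Ada",
    ["PERSON_L_NAME"] = "Lovelace",
    ["PERSON_AGE"] = "36",
};

var person = Map<Person>(row);
Console.WriteLine($"{person.FirstName} {person.LastName} {person.Age}"); // Ada Lovelace 36

// Fill any T whose properties carry a CsvColumn attribute (hypothetical helper).
static T Map<T>(IReadOnlyDictionary<string, string> row) where T : new()
{
    var obj = new T();
    foreach (var p in typeof(T).GetProperties())
    {
        var attr = p.GetCustomAttribute<CsvColumnAttribute>();
        if (attr is null || !row.TryGetValue(attr.Name, out var text)) continue;
        p.SetValue(obj, Convert.ChangeType(text, p.PropertyType));
    }
    return obj;
}

[AttributeUsage(AttributeTargets.Property)]
sealed class CsvColumnAttribute : Attribute
{
    public CsvColumnAttribute(string name) => Name = name;
    public string Name { get; }
}

sealed class Person
{
    [CsvColumn("PERSON_F_NAME")] public string FirstName { get; set; } = "";
    [CsvColumn("PERSON_L_NAME")] public string LastName { get; set; } = "";
    [CsvColumn("PERSON_AGE")] public int Age { get; set; }
}
```

Reflection per row is slow; a production version would cache the property/attribute pairs per type, but the mapping idea is the same.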

Add support for trimming quoted csv headers and values?

Currently, trying to parse a CSV with quotes expands this:
DateTime current_start_time = csv_reader.Current["Time Stamp"].Parse<DateTime>();
All the way out to this:
DateTime current_start_time = DateTime.Parse(csv_reader.Current["\"Time Stamp\""].ToString().Trim('"'));
Maybe there's a better or more optimized way to trim headers and values in-place before parsing? The main problem with the solution above is that a CSV with no quotes around the Time Stamp column in the header row would not be parsed correctly.
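A simple normalization helper (a sketch, not part of Sep's API) handles the quoted and unquoted header cases the same way, which addresses the concern above:

```csharp
using System;

// Normalize a header whether or not it was quoted, so "\"Time Stamp\""
// and "Time Stamp" resolve to the same key. A sketch, not Sep API.
static string NormalizeHeader(string s) => s.Trim().Trim('"').Trim();

Console.WriteLine(NormalizeHeader("\"Time Stamp\"")); // Time Stamp
Console.WriteLine(NormalizeHeader("  Time Stamp "));  // Time Stamp
```

Applying this once to each header when building a name lookup avoids per-access trimming of values.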

Add support for skipping empty lines?

Is it possible to add support to the parser for skipping empty lines? They should still be counted when reporting issues such as an InvalidDataException.

Tagging @nietras for notification as instructed.
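A standalone sketch of the requested behavior: skip empty lines (whitespace-only counts as empty here, an assumption) while still counting them, so reported line numbers match the physical file:

```csharp
using System;
using System.Collections.Generic;

// Skip empty lines (whitespace-only treated as empty, an assumption)
// while still tracking physical line numbers for error reporting.
string[] lines = { "a,b", "", "c,d", "   ", "e,f" };

var rows = new List<(int LineNumber, string Content)>();
for (int n = 0; n < lines.Length; n++)
{
    if (string.IsNullOrWhiteSpace(lines[n])) continue; // skipped but still counted
    rows.Add((n + 1, lines[n]));                       // 1-based physical line number
}

foreach (var (num, content) in rows)
    Console.WriteLine($"line {num}: {content}"); // lines 1, 3 and 5
```

Carrying the physical line number alongside each parsed row is what lets a later InvalidDataException point at the right place.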

Sep does not support async only streams (such as IBrowserFile)

When trying to use Sep.Reader().From(stream) with a stream returned from IBrowserFile.OpenReadStream in the context of a user's browser in a .NET Blazor WASM application, it fails because synchronous reads are not supported. To my understanding there are no other methods Sep exposes that would let it read asynchronously from the supplied stream.

The stack trace will look like the following inside Blazor WASM:

crit: Microsoft.AspNetCore.Components.WebAssembly.Rendering.WebAssemblyRenderer[100]
      Unhandled exception rendering component: Synchronous reads are not supported.
System.NotSupportedException: Synchronous reads are not supported.
   at Microsoft.AspNetCore.Components.Forms.BrowserFileStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.StreamReader.ReadBuffer(Span`1 userBuffer, Boolean& readToUserBuffer)
   at System.IO.StreamReader.ReadSpan(Span`1 buffer)
   at System.IO.StreamReader.Read(Span`1 buffer)
   at nietras.SeparatedValues.SepReader.CheckCharsAvailableDataMaybeRead(Int32 paddingLength)
   at nietras.SeparatedValues.SepReader.EnsureInitializeAndReadData(Boolean endOfFile)
   at nietras.SeparatedValues.SepReader.MoveNext()
   at nietras.SeparatedValues.SepReader..ctor(Info info, SepReaderOptions options, TextReader reader)
   at nietras.SeparatedValues.SepReaderExtensions.FromWithInfo(Info info, SepReaderOptions options, TextReader reader)
   at nietras.SeparatedValues.SepReaderExtensions.From(SepReaderOptions options, Stream stream)

Note that BrowserFileStream belongs to AspNetCore.Components, so I suspect this might also happen in other frameworks like MVC, but I haven't tested. The docs link for the method returning the async-only stream is here (IBrowserFile.OpenReadStream(Int64, CancellationToken) Method).

Having had a good look over Sep's docs, I understand that async stream compatibility might be tricky and out of scope for the intended use of Sep. If that is the conclusion, it would be appreciated if the Sep docs were updated to mention it.
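In the meantime, a workaround that works today (a sketch; a MemoryStream below stands in for the async-only browser stream): asynchronously buffer the whole stream first, then hand the fully buffered, synchronously readable copy to Sep. Note this costs memory proportional to the file size:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Stand-in for the async-only browser stream; real code would use
// IBrowserFile.OpenReadStream(...) here.
using var asyncOnly = new MemoryStream(new byte[] { (byte)'A', (byte)',', (byte)'B' });

// Buffer asynchronously, then rewind: the MemoryStream supports the
// synchronous reads Sep performs.
var buffered = new MemoryStream();
await asyncOnly.CopyToAsync(buffered);
buffered.Position = 0;

// using var reader = Sep.Reader().From(buffered); // now safe (sync reads only)
using var text = new StreamReader(buffered);
Console.WriteLine(text.ReadToEnd()); // A,B
```

For large browser uploads the buffering cost may be unacceptable, which is why true async support would still be the better fix.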

Thinking about what async support might look like with Sep, I thought of the following.

// SepAsyncReader implements the IAsyncEnumerable pattern. 
using SepAsyncReader reader = Sep.Reader().FromAsync(stream);

await foreach (SepReader.Row row in reader)
{
   // Normal synchronous indexers at this point. 
   string myString = row["Column1"].ToString();
}

Cheers.

GetOrAddCol(colIndex) is actually just Get(colIndex)

@nietras Is this intended?

public Col this[int colIndex] => new Col(_writer.GetOrAddCol(colIndex));
public Col this[string colName] => new Col(_writer.GetOrAddCol(colName));

vs.

internal ColImpl GetOrAddCol(int colIndex)
{
	return _cols[colIndex]; // This does NOT check the length
}

internal ColImpl GetOrAddCol(string colName)
{
	if ((uint)_cacheIndex < (uint)_colNameCache.Count)
	{
		// [...] This code does
	}
	return value;
}

Option to leave stream open after writing

Hello @nietras and thanks for this library!

It appears there is no option to leave a stream open after writing to it with SepWriter; would it be feasible to add one?

My use case: I am processing a file, applying some transformations to certain column values, which will then be sent off to another API. So I am writing into a MemoryStream which will be attached to the API call as file content.

Without an option to leave the stream open, my options are either to not Dispose the SepWriter (which I assume may leave some internal objects undisposed), or to create a new copy of the MemoryStream that I can then re-use (which involves copying the whole file over in memory again).
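For comparison, the BCL already has a convention for exactly this: the leaveOpen constructor parameter on StreamWriter. A standalone sketch of the desired behavior, with StreamWriter standing in for SepWriter:

```csharp
using System;
using System.IO;
using System.Text;

// StreamWriter's leaveOpen parameter is the BCL convention: dispose the
// writer (flushing its buffers), keep the underlying stream usable.
var stream = new MemoryStream();
using (var writer = new StreamWriter(stream, Encoding.UTF8, bufferSize: 1024, leaveOpen: true))
{
    writer.WriteLine("A;B"); // stands in for SepWriter output
}

// The stream is still open: rewind and attach it to the API call.
stream.Position = 0;
Console.WriteLine(stream.CanRead); // True
```

An analogous leaveOpen flag on whatever method binds SepWriter to a stream would cover the MemoryStream-as-attachment use case described above.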

Column with header mismatch exception

Hi, is there any way to make the code not break when the column-mismatch-with-header exception is thrown after we invoke reader.MoveNext()? Or to set some parameter when the reader is initialized to ignore columns that have null values, something like that.
Also, if we ignore the exception and continue, the iterator doesn't advance to the next row and never exits the reader.MoveNext() loop; it just keeps repeating the same row over and over, only incrementing the ID value.
