dotnet / datalab

This repo is for experimentation and exploring new ideas involving ADO.NET, EF Core, and other areas related to .NET data.

License: MIT License

C# 100.00%

datalab's Introduction

.NET Data Lab

This repo is archived. See Woodstar experiment summary for more information.

Current projects

SqlServer.Core (Project Woodstar)

Microsoft.Data.SqlClient is a fully-featured ADO.NET database provider for SQL Server. It supports a broad range of SQL Server features on both .NET Core and .NET Framework. However, it is also a large and old codebase with many complex interactions between its behaviors. This makes it difficult to investigate the potential gains that could be made using newer .NET Core features. Therefore, we are starting this experiment in collaboration with the community to determine what potential there is for a highly performant SQL Server driver for .NET.

Important! Investment in Microsoft.Data.SqlClient is not changing. It will continue to be the recommended way to connect to SQL Server and SQL Azure, both with and without EF Core. It will continue to support new SQL Server features as they are introduced.

License

This project is licensed under the MIT license.

.NET Foundation

This project is a part of the .NET Foundation.

Other .NET data projects on GitHub

If you're interested in making .NET data better, then consider contributing to one of the many open-source repos hosted on GitHub.

Microsoft repos

.NET Runtime - ADO.NET lives here in the .NET BCL
EF Core - Entity Framework Core (SQL Server/Sqlite/Cosmos) and Microsoft.Data.Sqlite
Microsoft.Data.SqlClient - ADO.NET provider for SQL Server

Community repos

Feel free to send a pull request to add your .NET data related GitHub repo to this list.

datalab's Issues

Hack simple query scenario to get upper bound for TDS/SQL Server

The idea here is to take the simple scenario implemented in #11 and hack together a raw implementation using low-level .NET constructs to hard-code some TDS. This gives us an upper bound on how much perf is potentially on the table, at least in the very simple case.

Folder Structure

What should the folder structure look like for some rough starting-place code? I am pretty busy but am hoping to carve out a little time to write some basic "connect, do a query, get a result" code, and I would like to avoid having to rename and refolder everything after I have made it.

I am proposing a folder under the "datalab" folder:

datalab/SqlServer.Core

and then a solution file in that folder with the same name, and then two subfolders under that:

/src
/test

For now that is all I would really need, with the "projects" mostly under src.

Implement a simple query scenario using Microsoft.Data.SqlClient

This will give us an initial baseline for current performance.

The scenario should be really simple.

Guidance on an Apache Arrow layer

I'm working off a comment by yzorg regarding integration of Apache Arrow into this project, and the answer was that this would be implemented at a higher layer than driver level.

I was after some pointers, really. My aim here is to intercept the ODBC driver's storage mechanism and store the data in "feather" format instead (a SIMD memory-optimized file format).

Can you recommend:

materials to skill up on the current ODBC driver
a good place to try to insert what I'm doing in the current ODBC driver
other example projects that do a similar thing?

Thanks!

Woodstar experiment summary

First, thank you for being so patient waiting for news here; we probably should have provided more timely updates on the project state. This was partially a result of us simply not knowing what's going to happen with Woodstar and SqlClient (e.g. the recently-started SqlClientX - see below), and partially a result of us simply being very overloaded with other things.

Woodstar's original idea was an exploratory, greenfield SQL Server (TDS) driver; the goal was to use modern, high-performance .NET techniques, liberated from SqlClient's technical debt, and to see where that would lead in terms of performance. Specifically, we were interested in seeing what kind of performance gains we would see on the TechEmpower Fortunes benchmark, compared to SqlClient. There was no clear future for Woodstar as an actually supported product that's usable in production - it was purely a technical experiment.

The main work actually done was initial experimentation/prototyping by @NinoFloris (a core contributor on Npgsql) and myself; we built a minimal TDS client that could support TechEmpower Fortunes, and nothing more; for example, parameters were not yet supported, nor were many other features. The experiment was async-only, did not implement ADO.NET, and used System.IO.Pipelines for I/O. The prototype source code is available on this repo, as-is: it really is just an exploratory prototype, nothing more.

For the very simple TechEmpower Fortunes scenario, the prototype did not provide meaningful performance improvements over SqlClient. This does not mean that SqlClient has no performance issues: it certainly does (see this discussion) - just not in the very narrow TechEmpower Fortunes usage scenario. Our exploration did yield some valuable conclusions; two important ones are the following:

We gained some interesting insights around TDS and its processing that impact both the client and the server side, and so we engaged internally with the SQL Server org. This has been a positive engagement and various things are happening behind the scenes.
System.IO.Pipelines (with SequenceReader) works great when parsing relatively large payloads; but in a client-side database driver scenario, the user repeatedly calls in to parse very small values in the resultset (e.g. an int). In that kind of usage, reinstantiating a SequenceReader (ref struct) each time is too much overhead. Similarly, continuously slicing ReadOnlySequence for each tiny value was costly, so there was no good way for us to cheaply store the current position.
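To make the second point concrete, here is a minimal sketch of the per-value access pattern described above. It is not taken from the Woodstar prototype: the class name, method name, and sample bytes are hypothetical, and it only illustrates the shape of the problem (a new SequenceReader per call, plus a re-slice of the ReadOnlySequence to remember the position), not its measured cost.

```csharp
using System;
using System.Buffers;

static class SmallValueReads
{
    // Reads one little-endian Int32 from the front of the buffer.
    // Each call constructs a fresh SequenceReader (a ref struct that cannot be
    // stored on the heap between calls) and then re-slices the sequence so the
    // caller can keep track of the current position.
    public static int ReadInt32(ref ReadOnlySequence<byte> buffer)
    {
        var reader = new SequenceReader<byte>(buffer);
        if (!reader.TryReadLittleEndian(out int value))
            throw new InvalidOperationException("Not enough data buffered.");

        // Persist the position by slicing; this is the repeated slicing the text refers to.
        buffer = buffer.Slice(reader.Position);
        return value;
    }

    public static void Main()
    {
        // Hypothetical payload: three little-endian Int32 values (1, 2, 3).
        byte[] payload = { 1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0 };
        var buffer = new ReadOnlySequence<byte>(payload);

        while (!buffer.IsEmpty)
            Console.WriteLine(ReadInt32(ref buffer)); // prints 1, 2, 3 on separate lines
    }
}
```

In a driver, a call like ReadInt32 would sit behind every column read of every row, so the per-call setup is paid once per tiny value rather than once per large payload.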

Further work on Woodstar did not continue, simply because we had other, more important things that got prioritized over this. However, the lessons learned from the experiment were quite valuable, shared with relevant parties internally, and are present in discussions with the SqlClient team.

On the SqlClient side, the SqlClientX effort has recently begun - this is a project to reimplement the I/O layer and pooling implementation inside SqlClient, allowing users to opt into the new experimental implementation and eventually switching to it as the default. In a way, SqlClientX is the spiritual successor to Woodstar; although SqlClientX is not a greenfield driver, since it's being done within SqlClient and must respect backwards compat, the goals of the two projects are the same - arrive at a modern, efficient SQL Server driver without all the technical debt, and which is able to evolve safely and quickly. The future of SqlClientX is also much clearer, being owned and maintained by the SqlClient team itself, whereas Woodstar was purely an experiment with no clear path to becoming a supported product at any point.

For now, we will be archiving this GitHub repo, as work in this area is not happening here.

Use pipelines?

There has been some discussion about whether use of data pipelines is the way to go for highly async, highly performant binary communication such as Tabular Data Stream (TDS) to and from SQL Server.

@davidfowl Thoughts? (I believe you discussed this with @roji already.)

@JamesNK We were wondering what gRPC uses?
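For readers unfamiliar with the approach being discussed, the following is a rough sketch of what a System.IO.Pipelines read loop over TDS packets could look like. It is an illustration only, not code from this repo: the class and method names are hypothetical, it assumes the System.IO.Pipelines package is referenced, and it relies on the 8-byte TDS packet header with a big-endian total length at offset 2, as described in the MS-TDS specification.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;

static class TdsPacketCounter
{
    const int HeaderLength = 8; // TDS packet header size

    // Parses the packet type and total packet length (header included)
    // from the start of the buffer, if a full header is available.
    public static bool TryReadHeader(in ReadOnlySequence<byte> buffer, out byte type, out ushort length)
    {
        type = 0;
        length = 0;
        if (buffer.Length < HeaderLength) return false;

        Span<byte> header = stackalloc byte[HeaderLength];
        buffer.Slice(0, HeaderLength).CopyTo(header);
        type = header[0];
        length = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(2, 2));
        return true;
    }

    // Counts the complete packets available from the reader.
    public static async Task<int> CountPacketsAsync(PipeReader reader)
    {
        int packets = 0;
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            while (TryReadHeader(buffer, out _, out ushort length))
            {
                if (length < HeaderLength) throw new InvalidDataException("Invalid TDS packet length.");
                if (buffer.Length < length) break; // wait for the rest of this packet
                buffer = buffer.Slice(length);     // skip header and payload
                packets++;
            }

            // Everything before buffer.Start is consumed; the remainder has been examined.
            reader.AdvanceTo(buffer.Start, buffer.End);
            if (result.IsCompleted) break;
        }
        await reader.CompleteAsync();
        return packets;
    }

    public static async Task Main()
    {
        // Hypothetical bytes: one 10-byte packet (2-byte payload) and one 8-byte packet.
        byte[] data = { 0x04, 0x01, 0x00, 0x0A, 0, 0, 1, 0, 0xAA, 0xBB,
                        0x04, 0x01, 0x00, 0x08, 0, 0, 2, 0 };
        var reader = PipeReader.Create(new MemoryStream(data));
        Console.WriteLine(await CountPacketsAsync(reader)); // prints 2
    }
}
```

The pipe owns the buffers and the loop only reports how much it consumed, which suits framing at the packet level; whether it also suits reading many tiny values per packet is the open question in this thread.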

Communication and collaboration

This issue tracks ways we will communicate and collaborate on Woodstar (SqlServer.Core). Please let us know what works for you, and also comment with your own ideas.

Some initial ideas:

Discord/Slack channel or similar
GitHub discussions on this repo
Teams/Zoom meetings with some/all collaborators
Live streaming

@ErikEJ @Wraith2 @NickCraver @mgravell @Drawaes

/cc @davidfowl

Set up build and perf testing infrastructure

We need a build/test environment that will allow us to run perf tests easily. We may also want simple functional/unit tests.

Another Tds client

Four years ago I did an almost complete rewrite of SqlClient for fun and to do some performance tests. The bulk insert was fast, but for small queries the improvement was small. Most of the time was spent in SQL Server and in communication.
Async was implemented incorrectly, and probably much more was wrong, but a lot could be simplified.

The repo can be found here.

The old repo was based on a data reader and writer. I skipped that part and did the reading and writing directly to a POCO. I think with EF Core this should also be possible. There is some mapping between SQL fields and POCO properties, and there are some value converters.

Directly converting JSON strings from the input stream will skip the copy part.

SqlServer.Core: Performance-oriented SQL Server .NET driver

An experiment in collaboration with the community to determine what potential there is for modern .NET features in a highly performant SQL Server driver.

Progress?

It was mentioned on an EF livestream that progress was being made, but the only commit is from 2021. Is there code being worked on? If so, can we see it in case we can help out?