icsharpcode / nullabilityinference

Global type inference for C# 8 nullable reference types

License: MIT License

C# 100.00%

csharp dotnet dotnetcore c-sharp-8 nullable-reference-types tool

nullabilityinference's Introduction

C# 8 nullability inference

This is a prototype for an algorithm that modifies C# code in order to minimize the number of warnings caused by enabling C# 8.0 nullable reference types. If this ever gets out of the prototype stage, it might be a useful tool when migrating existing C# code to C# 8.0.

Note: this is a work in progress. Many C# constructs will trigger a NotImplementedException.

Usage

1. Update your project to use C# 8.0: <LangVersion>8.0</LangVersion>
2. Enable nullable reference types: <Nullable>enable</Nullable>
3. If possible, update referenced libraries to newer versions that have nullability annotations.
4. Compile the project and notice that you get a huge bunch of nullability warnings.
5. Run InferNull myproject.csproj. This modifies your code by inserting ? in various places.
6. Compile your project again. You should get a smaller (and hopefully manageable) number of nullability warnings.

Tips+Tricks:

The inference tool will only add/remove ? annotations on nullable reference types. It can also add the [NotNullWhen] attribute. It will never touch your code in any other way.
- Existing ? annotations on nullable reference types are discarded and inferred again from scratch.
Unconstrained generic types are not reference types, and thus will never be annotated by the tool.
Apart from [NotNullWhen], the inference tool will not introduce any of the advanced nullability attributes.
- However, if these attributes are used in the input code, the tool will in some cases use them for better inference results.
- It can be useful to annotate generic code with these attributes before running the inference tool.
The inference tool acts on one project (.csproj) at a time. For best results, any referenced assemblies should already use nullability annotations.
- If using the tool on multiple projects, apply the tool in the build order.
- For the .NET base class library, use .NET Core 3 (or later), or use ReferenceAssemblyAnnotator.
- For third-party libraries, consider upgrading to a newer version of the library if that adds nullability annotations.
You can use #nullable enable to mark code that you have finished manually reviewing.
- The tool will never touch any code after #nullable enable or #nullable disable. It only modifies code prior to those directives and code after #nullable restore.
- You can use this to add nullability annotations to your project file-by-file:
  - Don't use <Nullable>enable</Nullable> on the project level
  - Use the InferNull --add-nullable-enable command-line option to let the inference tool add the directive to all files
  - Use git to revert all changes made by the tool except those to a subset of the files.
  - Make code changes to that subset of files to fix the remaining warnings.
  - Commit, then later re-run InferNull --add-nullable-enable to work on the next batch of files.
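As a rough illustration of the directive behavior described in these tips, the file below has three regions. The class and method names are made up for this sketch; the comments restate which regions the tool is described as modifying.

```csharp
// No directive yet: this region is still open to inference,
// so InferNull may add or remove '?' on these types.
public static class Unreviewed
{
    public static int LengthOrMinusOne(string text)
    {
        return text == null ? -1 : text.Length;
    }
}

#nullable enable
// Manually reviewed: per the tips above, the tool leaves code
// after '#nullable enable' untouched.
public static class Reviewed
{
    public static string Describe(string? text)
    {
        return text ?? "(none)";
    }
}
#nullable restore

// After '#nullable restore' the project-level setting applies again,
// and the tool is described as modifying this code once more.
public static class AlsoUnreviewed
{
    public static string Echo(string text)
    {
        return text;
    }
}
```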

The algorithm

Let's start with a simple example:

class C
{
    string key;   // #1
    string value; // #2

    public C(string key, string value)  // key#3, value#4
    {
        this.key = key;
        this.value = value;
    }

    public override int GetHashCode()
    {
        return key.GetHashCode();
    }

    public static int Main()
    {
        C c = new C("abc", null); // #5
        return c.GetHashCode();
    }
}

We will construct a global "nullability flow graph". For each appearance of a reference type in the source code that could be made nullable, we create a node in the graph. If there's an assignment a = b, we create an edge from b's type to a's type. If there's an assignment b = null, we create an edge from a special nullable node to b's type. On a dereference a.M();, we create an edge from a's type to a special nonnull node (unless the dereference is protected by if (a != null)).

Clearly, everything reachable from the nullable node should be marked as nullable. Similarly, everything that can reach the nonnull node should be marked as non-nullable.

Thus, in the example, key is inferred to be non-nullable, while value is inferred to be nullable.
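The reachability argument can be sketched in a few lines. This is not the tool's implementation: the edge list is a hand-written reading of the example, the node names are strings chosen for illustration, and edges starting at non-null expressions such as "abc" or new C(...) are left out because they do not constrain anything.

```csharp
using System.Collections.Generic;

public static class FlowGraphSketch
{
    // Edges of the example, written as (from, to).
    public static readonly (string From, string To)[] Edges =
    {
        ("key#3", "key#1"),        // this.key = key;
        ("value#4", "value#2"),    // this.value = value;
        ("key#1", "<nonnull>"),    // key.GetHashCode() dereferences key
        ("<nullable>", "value#4"), // null is passed as the 'value' argument
        ("c#5", "<nonnull>"),      // c.GetHashCode() dereferences c
    };

    // Follows edges forward from 'start' (or backward when 'reverse' is set)
    // and returns every node reached along the way.
    public static HashSet<string> Reach(string start, bool reverse)
    {
        var seen = new HashSet<string>();
        var work = new Queue<string>();
        work.Enqueue(start);
        while (work.Count > 0)
        {
            string current = work.Dequeue();
            foreach (var edge in Edges)
            {
                string origin = reverse ? edge.To : edge.From;
                string target = reverse ? edge.From : edge.To;
                if (origin == current && seen.Add(target))
                    work.Enqueue(target);
            }
        }
        return seen;
    }
}
```

Reach("<nullable>", false) yields value#4 and value#2, the nodes to mark nullable. Reach("<nonnull>", true) yields key#1, key#3 and c#5, the nodes that must stay non-nullable.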

Implementation Overview

Nullability inference works essentially in these steps:

1. Initially, modify the program to mark every reference type as nullable. (AllNullableSyntaxRewriter)
2. Create nodes for the nullability flow graph. (NodeBuildingSyntaxVisitor)
3. Create edges for the nullability flow graph. (EdgeBuildingSyntaxVisitor + EdgeBuildingOperationVisitor)
4. Assign nullabilities to nodes in the graph. (NullCheckingEngine)
5. Modify the program to mark reference types with the inferred nullabilities. (InferredNullabilitySyntaxRewriter)

The nullability graph

The fundamental idea is to do something similar to C#'s nullability checks. The C# compiler deals with types annotated with concrete nullabilities and emits a warning when a nullable type is used where a non-nullable type is expected. The EdgeBuildingOperationVisitor instead annotates types with nullability nodes, and creates an edge when node#1 is used where node#2 is expected.

While in simple examples the resulting graphs can look like data flow graphs, that's not always an accurate view. An edge from node#1 to node#2 really only represents a constraint "if node#1 is nullable, then node#2 must also be nullable".

To build this graph, the EdgeBuildingOperationVisitor assigns a TypeWithNode to every expression in the program. For example, the field access this.key has the type-with-node string#1, where #1 is the node that was constructed for the declaration of the key field. The TypeWithNode can also represent generic types like IEnumerable#x<string#1>. With generics, there's a top-level node #x for the generic type, but there's also a separate node for each type argument.
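To make the notation concrete, here is a simplified stand-in for a type-with-node. The class TypeWithNodeSketch and its members are hypothetical and only mirror the description above; the real TypeWithNode in the tool is likely to look different (for instance, it presumably refers to Roslyn type symbols rather than strings).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model: a type name, the nullability node
// attached to it, and one nested type-with-node per type argument.
public sealed class TypeWithNodeSketch
{
    public string TypeName { get; }
    public string Node { get; }
    public IReadOnlyList<TypeWithNodeSketch> TypeArguments { get; }

    public TypeWithNodeSketch(string typeName, string node, params TypeWithNodeSketch[] typeArguments)
    {
        TypeName = typeName;
        Node = node;
        TypeArguments = typeArguments;
    }

    // Renders the notation used in the text, e.g. IEnumerable#x<string#1>.
    public override string ToString()
    {
        string args = TypeArguments.Count == 0
            ? ""
            : "<" + string.Join(", ", TypeArguments.Select(a => a.ToString())) + ">";
        return TypeName + "#" + Node + args;
    }
}
```

For example, new TypeWithNodeSketch("IEnumerable", "x", new TypeWithNodeSketch("string", "1")) renders as IEnumerable#x<string#1>.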

Minimizing the number of compiler warnings

If the graph contains a path from the nullable node to the nonnull node, we will be unable to create nullability annotations that allow compiling the code without warning: no matter how we assign nullabilities to nodes along the path, there will be at least one edge where a nullable node points to a non-nullable node. This violates the constraint represented by the edge, and thus causes a compiler warning.

If we cannot assign nullabilities perfectly (without causing any compiler warnings), we would like to minimize the number of warnings instead. We do this by using the Ford-Fulkerson algorithm to compute the minimum cut (=minimum set of edges to be removed from the graph) so that the nonnull node is no longer reachable from the nullable node. This separates the graph into essentially three parts:

nodes reachable from nullable --> must be made nullable
nodes that reach nonnull --> must not be made nullable
remaining nodes --> either choice would work

The removed edges correspond to the constraints that will produce warnings after we insert ? for the types inferred as nullable. Thus the minimum cut ends up finding a solution that minimizes the number of constraints violated. If the constraints represented in our graph accurately model the C# compiler, this minimizes the number of compiler warnings.

For the remaining nodes where either choice would work, we mark all nodes occurring in "input positions" (e.g. parameters) as nullable. Then we propagate this nullability along the outgoing edges. Any nodes that still remain indeterminate after that are marked as non-nullable.
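The minimum-cut step can be sketched as follows. This is an illustration under assumptions, not the tool's code: it uses the breadth-first (Edmonds-Karp) variant of Ford-Fulkerson, gives every constraint edge capacity 1, and represents nodes as strings. MinCutSketch and its method names are invented for this example. A minimum cut is not always unique; this sketch returns the one closest to the nullable node.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MinCutSketch
{
    // Returns the nodes still reachable from 'source' in the residual graph
    // once no augmenting path to 'sink' remains.
    public static HashSet<string> SourceSide(
        IEnumerable<(string From, string To)> edges, string source, string sink)
    {
        var capacity = new Dictionary<(string, string), int>();
        var neighbors = new Dictionary<string, HashSet<string>>();
        void Link(string a, string b)
        {
            if (!neighbors.TryGetValue(a, out var set))
                neighbors[a] = set = new HashSet<string>();
            set.Add(b);
            if (!capacity.ContainsKey((a, b)))
                capacity[(a, b)] = 0;
        }
        foreach (var (src, dst) in edges)
        {
            Link(src, dst);
            Link(dst, src);             // reverse residual edge
            capacity[(src, dst)] += 1;  // each constraint has capacity 1
        }

        while (true)
        {
            // Breadth-first search for an augmenting path.
            var parent = new Dictionary<string, string> { [source] = source };
            var queue = new Queue<string>();
            queue.Enqueue(source);
            while (queue.Count > 0)
            {
                string u = queue.Dequeue();
                if (!neighbors.TryGetValue(u, out var next)) continue;
                foreach (string v in next)
                {
                    if (capacity[(u, v)] > 0 && !parent.ContainsKey(v))
                    {
                        parent[v] = u;
                        queue.Enqueue(v);
                    }
                }
            }
            if (!parent.ContainsKey(sink))
                return new HashSet<string>(parent.Keys);

            // Push one unit of flow along the path that was found.
            for (string v = sink; v != source; v = parent[v])
            {
                capacity[(parent[v], v)] -= 1;
                capacity[(v, parent[v])] += 1;
            }
        }
    }

    // The cut: constraint edges leading from the source side to the rest.
    public static List<(string From, string To)> ViolatedConstraints(
        IReadOnlyList<(string From, string To)> edges, string source, string sink)
    {
        var side = SourceSide(edges, source, sink);
        return edges.Where(e => side.Contains(e.From) && !side.Contains(e.To)).ToList();
    }
}
```

With the edges <nullable>->a, a->b, b-><nonnull>, <nullable>->c and d-><nonnull>, the source side is {<nullable>, c} and the single violated constraint is <nullable>->a. In terms of the three parts above, c must be made nullable, while a, b and d reach <nonnull> and stay non-nullable.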

More Examples

if (x != null)

Consider this program:

 1: class Program
 2: {
 3:     public static int Test(string input) // input#1
 4:     {
 5:         if (input == null)
 6:         {
 7:             return -1;
 8:         }
 9:         return input.Length;
10:     }
11: }

input has the type-with-node string#1. A member access like .Length normally causes us to generate an edge to the special nonnull node, to encode that the C# compiler will emit a "Dereference of a possibly null reference." warning. However, in this example the static type-based view is not appropriate: the C# compiler performs control flow analysis, and notices that input cannot be null at the dereference due to the null test earlier.

So for this example, we must not generate any edges, so that the input parameter can be made nullable. Instead of re-implementing the whole C# nullability analysis, we solve this problem by simply asking Microsoft.CodeAnalysis for the NullableFlowState of the expression we are analyzing. This works because prior to our analysis, we used the AllNullableSyntaxRewriter to mark everything as nullable -- if despite that the C# compiler still thinks something is non-nullable, it must be protected by a null check.

For the use of input in line 9, it has NullableFlowState.NotNull, so we represent its type-with-node as string#nonnull instead of string#1. This way the dereference due to the .Length member access creates a harmless edge nonnull->nonnull. This edge is then discarded because it is not a useful constraint. Thus this method does not result in any edges being added to the graph. Without any edge constraining input, it will be inferred as nullable due to occurring in input position.
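Following that reasoning, the annotated output for this method would be expected to look roughly like the sketch below. The class is renamed to NullCheckExample for this illustration and #nullable enable is added only so the snippet stands on its own.

```csharp
#nullable enable

public static class NullCheckExample
{
    // 'input' becomes nullable: its only dereference is guarded by the
    // null test, so nothing forces it to be non-nullable.
    public static int Test(string? input)
    {
        if (input == null)
        {
            return -1;
        }
        // The compiler's flow analysis knows input is not null here,
        // so this dereference produces no warning.
        return input.Length;
    }
}
```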

Generic method invocations

class Program
{
    public static void Main()
    {
        string n = null; // n#1
        string a = Identity<string>(n); // a#3, type argument is #2
        string b = Identity<string>("abc"); // b#5, type argument is #4
    }
    public static T Identity<T>(T input) => input;
}

With generic methods, we do not create nodes for the type T, as that cannot be marked nullable without additional constraints ("CS8627: A nullable type parameter must be known to be a value type or non-nullable reference type. Consider adding a 'class', 'struct', or type constraint."). Instead, any occurrences of T in the method signature are replaced with the type-with-node of the type arguments used to call the method. Thus, the example above results in the following graph:

Thus, n#1, the type argument #2 and a#3 are all marked as nullable. But b#5 and the type argument #4 can remain non-nullable.

If the type arguments are not explicitly specified but inferred by the compiler, nullability inference will create additional "helper nodes" for the graph that are not associated with any syntax. This allows us to construct the edges for the calls in the same way.
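Based on the nullabilities stated above, the annotated result for this example would presumably look like the following. The wrapper class GenericInvocationExample and the Run method are only there to make the snippet self-contained.

```csharp
#nullable enable

public static class GenericInvocationExample
{
    // T itself is never annotated; only the type arguments at call sites are.
    public static T Identity<T>(T input) => input;

    public static (string? A, string B) Run()
    {
        string? n = null;                    // n#1: nullable
        string? a = Identity<string?>(n);    // a#3 and type argument #2: nullable
        string b = Identity<string>("abc");  // b#5 and type argument #4: non-nullable
        return (a, b);
    }
}
```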

Generic Types

 1: using System.Collections.Generic;
 2: class Program
 3: {
 4:     List<string> list = new List<string>();
 5: 
 6:     public void Add(string name) => list.Add(name);
 7:     public string Get(int i) => list[i];
 8: }

In this graph, you can see how generic types are handled: The type of the list field generates two nodes:

list#3 represents the nullability of the list itself.
list!0#2 represents the nullability of the strings within the list.

Similarly, new!0#1 represents the nullability of the string type argument in the new List<string> expression. Because the type parameter of List is invariant, the field initialization in line 4 creates a pair of edges (in both directions) between the new!0#1 and list!0#2 nodes. This forces both type arguments to have the same nullability.

The resulting graph expresses that the nullability of the return type of Get (represented by Get#5) depends on the nullability of the name parameter in the Add method (node name#4). Whether these types will be inferred as nullable or non-nullable will depend on whether the remainder of the program passes a nullable type to Add, and on the existence of code that uses the return value of Get without null checks.
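The dependency between name#4 and Get#5 can be checked with a small reachability sketch. The edge list is an assumption reconstructed from the description (the pair of edges for the invariant type argument, one edge for list.Add(name), one for return list[i]); the edges for the list#3 node itself are omitted.

```csharp
using System.Collections.Generic;

public static class GenericTypeGraphSketch
{
    public static readonly (string From, string To)[] Edges =
    {
        ("new!0#1", "list!0#2"),  // field initializer, invariance: one direction...
        ("list!0#2", "new!0#1"),  // ...and the other
        ("name#4", "list!0#2"),   // list.Add(name)
        ("list!0#2", "Get#5"),    // return list[i]
    };

    // True if 'target' can be reached from 'start' by following edges forward.
    public static bool Reaches(string start, string target)
    {
        var seen = new HashSet<string> { start };
        var work = new Queue<string>();
        work.Enqueue(start);
        while (work.Count > 0)
        {
            string current = work.Dequeue();
            if (current == target) return true;
            foreach (var edge in Edges)
            {
                if (edge.From == current && seen.Add(edge.To))
                    work.Enqueue(edge.To);
            }
        }
        return false;
    }
}
```

Reaches("name#4", "Get#5") is true, so a nullable argument to Add makes the return type of Get nullable. The reverse query is false: a nullable Get does not force name to be nullable.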

Flow-analysis

01: using System.Collections.Generic;
02: 
03: class Program
04: {
05:     public string someString = "hello";
06: 
07:     public bool TryGet(int i, out string name)
08:     {
09:         if (i > 0)
10:         {
11:             name = someString;
12:             return true;
13:         }
14:         name = null;
15:         return false;
16:     }
17: 
18:     public int Use(int i)
19:     {
20:         if (TryGet(i, out string x))
21:         {
22:             return x.Length;
23:         }
24:         else
25:         {
26:             return 0;
27:         }
28:     }
29: }

The TryGet function involves a common C# code pattern: the nullability of an out parameter depends on the boolean return value. If the function returns true, callers can assume the out variable was assigned a non-null value. But if the function returns false, the value might be null.

Using our own flow-analysis, the InferNull tool can handle this case and automatically infer the [NotNullWhen(true)] attribute!

For the name parameter (in general: for any out-parameters in functions returning bool), we create not only the declared type name#2, but also the name_when_true and name_when_false nodes. These extra helper nodes represent the nullability of out string name in the cases where TryGet returns true/false.

Within the body of TryGet, we track the nullability of name based on the previous assignment as the "flow-state". After the assignment name = someString; in line 11, the nullability of name is the same as the nullability of someString. We represent this by saving the nullability node someString#1 as the flow-state of name. On the return true; statement in line 12, we connect the current flow-state of the out parameters with the when_true helper nodes, resulting in the someString#1-><name_when_true#1> edge. Similarly, the return false; statement in line 15 results in an edge from <nullable> to <name_when_false#2>, because the name = null; assignment has set the flow-state of name to <nullable>.

In the Use method, we also employ flow-state: even though x itself needs to be nullable, the then-branch of the if uses the <name_when_true#1> node as flow-state for the x variable. This causes the x.Length dereference to create an edge starting at <name_when_true#1>, rather than x's declared type (x#3).

This allows inference to succeed (there is no path from <nullable> to <nonnull>). In the inference result, name#2 and <name_when_false#2> are nullable, but <name_when_true#1> is non-nullable. The difference in nullabilities between the when_false and when_true cases causes the tool to emit a [NotNullWhen(true)] attribute:

using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
class Program
{
    public string someString = "hello";

    public bool TryGet(int i, [NotNullWhen(true)] out string? name)
    {
        if (i > 0)
        {
            name = someString;
            return true;
        }
        name = null;
        return false;
    }

    public int Use(int i)
    {
        if (TryGet(i, out string? x))
        {
            return x.Length;
        }
        else
        {
            return 0;
        }
    }
}
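The final decision step can be pictured as a small function over the three inferred nullabilities. This is a guess at the shape of the rule, not the tool's code: NotNullWhenSketch and ChooseAttribute are hypothetical names, and the symmetric [NotNullWhen(false)] case is an assumption that the text above does not confirm.

```csharp
#nullable enable

public static class NotNullWhenSketch
{
    // Returns the attribute text for an out parameter of a bool-returning
    // method, or null when no attribute is needed.
    public static string? ChooseAttribute(
        bool declaredNullable, bool whenTrueNullable, bool whenFalseNullable)
    {
        // A non-nullable declared type needs no extra attribute.
        if (!declaredNullable) return null;

        // Non-null only when the method returns true (the case in the text).
        if (!whenTrueNullable && whenFalseNullable) return "[NotNullWhen(true)]";

        // Assumed symmetric case: non-null only when the method returns false.
        if (whenTrueNullable && !whenFalseNullable) return "[NotNullWhen(false)]";

        // Nullable in both cases: the plain '?' annotation is all we can say.
        return null;
    }
}
```

For TryGet above, the inputs are (declared: nullable, when_true: non-nullable, when_false: nullable), which selects [NotNullWhen(true)].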


nullabilityinference's Issues

Reference assemblies lack annotations

Even on .NET Core 3.1, not all system libraries are annotated.

For example, the System.Linq methods are lacking annotations.
This means a call like collection.FirstOrDefault() will:

incorrectly allow nullable collections
incorrectly allow the return value to be non-nullable

This can cause significantly wrong annotations being inferred (and then accepted by the compiler without warning).

Maybe we should somehow include the .NET 5 annotations with the inference tool, so that it can produce useful results on projects targeting .NET Core 3 or even .NET Framework 4.x?
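A small example of why the missing annotations matter: FirstOrDefault returns null for an empty sequence of strings, so the correct result type is string?. The helper name FirstName is made up for this sketch.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

public static class FirstOrDefaultExample
{
    // The correct annotation for the result is 'string?'. Against unannotated
    // reference assemblies, declaring it as plain 'string' would also be
    // accepted without a warning, which is the problem described above.
    public static string? FirstName(IEnumerable<string> names)
    {
        return names.FirstOrDefault();
    }
}
```

Calling FirstName(new string[0]) returns null, even though every element type in the call is non-nullable.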

Flow analysis

Currently our inference re-uses Roslyn's flow-analysis.

However, this has some fundamental problems.

9:	Dictionary<T, Node> mapping = new Dictionary<T, Node>();

11:	Node? GetNode(T element)
	{
13:		Node? node;
14:		if (!mapping.TryGetValue(element, out node))
15:		{
16:			node = new Node();
17:			mapping.Add(element, node);
18:		}
19:		return node;
	}

There are no edges created for line 16/17 because here Roslyn knows that node is non-null.
Line 14 creates two edges: one from <nullable> because TryGetValue will assign null when returning false; the other from mapping!1#2 (the mapping field's Node type argument) when TryGetValue returns true.

The return statement creates an edge from the variable's type, because Roslyn's flow analysis can't guarantee us that the variable is non-null -- our Roslyn code analysis runs under the pessimistic assumption that all types-to-be-inferred might end up nullable, so it considers mapping to be Dictionary<T, Node?>, which leaves open the possibility that GetNode returns null.

However, after our inference decides that mapping!1#2 is non-null, it would be correct to also indicate that the GetNode return value is non-null. After all, if no node exists yet, the function will create one.

The issue here is that Roslyn's flow analysis isn't aware of our types-to-be-inferred.
It would be better if, instead of using Roslyn's flow analysis, we had our own that keeps track of node's nullability.

The idea would be to create additional "helper" graph nodes for the different flow states of a local variable of reference type.
After TryGetValue initializes node, its flow-state would be (true: mapping!1#2, false: <nullable>). Within the if body, the flow-state would initially be <nullable>, but after line 16 would change to <nonnull>.
After the if, the flow-state from both alternatives could be re-combined by creating a new node "node-after-line-18" and edges from the nodes from the two if-branches -- in this case <nonnull> from the then-branch and mapping!1#2 from the else branch.
Then the return statement would create an edge from this "node-after-line-18" instead of the node for the variable's declared type.
All flow-state nodes associated with a variable would have an edge to the variable's declared type node.
We'd end up with a graph somewhat like this:

Thus in the end, node would be inferred as nullable, but the GetNode return type would only depend on mapping!1#2 and thus can be inferred depending on whether there's a mapping.Add(x, null) access somewhere else in the program.
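For reference, this is the annotation the proposal aims for, assuming no other code stores null in mapping. The wrapper class NodeMap<T> and the notnull constraint are additions for this sketch so that it compiles on its own against an annotated base class library.

```csharp
#nullable enable
using System.Collections.Generic;

public sealed class Node { }

public sealed class NodeMap<T> where T : notnull
{
    private readonly Dictionary<T, Node> mapping = new Dictionary<T, Node>();

    // The local variable stays nullable, but the return type does not:
    // on every path reaching the return, node has been assigned a non-null value.
    public Node GetNode(T element)
    {
        Node? node;
        if (!mapping.TryGetValue(element, out node))
        {
            node = new Node();
            mapping.Add(element, node);
        }
        return node;
    }
}
```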

[return: NotNullIfNotNull(paramName)]

[return: NotNullIfNotNull(paramName)] is a semi-common attribute to use, especially in some code bases that like to use:
if (input == null) return null;
at the start of many functions.

Unlike [NotNullWhen(bool)] for out parameters, I don't see a clean way to infer NotNullIfNotNull with our current algorithm.
But it would be valuable to figure something out, so I'm creating this issue to collect some cases of [NotNullIfNotNull] methods and their constraint graphs.
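A minimal case of the pattern in question, as a starting point for that collection. TrimOrNull is an invented example method; the attribute itself is the one from System.Diagnostics.CodeAnalysis (available in .NET Core 3.0 and later).

```csharp
#nullable enable
using System.Diagnostics.CodeAnalysis;

public static class NotNullIfNotNullExample
{
    // The result is null only if 'input' is null, so callers passing a
    // non-null argument can use the result without a null check.
    [return: NotNullIfNotNull("input")]
    public static string? TrimOrNull(string? input)
    {
        if (input == null) return null;
        return input.Trim();
    }
}
```

The difficulty for inference is visible here: both the parameter and the return type are nullable when viewed as plain graph nodes, and the link between them only holds per call site.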

Inference status: ICSharpCode.Decompiler

The primary use-case (how I ended up starting this project) was trying to annotate ILSpy's decompiler engine, which quickly felt like a task that could be automated.

Timeline:

March 2010 (yes, over 10 years ago): I had an idea for a null checking analysis as an analysis on IL code. I implemented some of that idea, but it didn't work well enough for my liking and I gave up on the idea. The original idea of the nullability constraint graph (back then I called it "nullability subtyping graph") is from back then. So is the idea of using the minimum cut for minimizing the number of warnings (back then: errors reported by the analysis tool).
Early May 2020: I wanted to annotate ILSpy's decompiler engine (ICSharpCode.Decompiler) with nullable reference types. But it felt like a bunch of monotonous work that ought to be automated. I realized that the C# 8 nullable reference type system is somewhat similar to what I did 10 years earlier, and that my ideas may be applicable to an inference tool. An inference tool doesn't need to be perfect to be useful; it just needs to automate the vast majority of the work.
2020-05-09: I started building the NullabilityInference prototype.
2020-05-17: The prototype can handle some individual code files from ICSharpCode.Decompiler.Utils
2020-06-08: The prototype finally supports enough language features to run over the whole ICSharpCode.Decompiler without crashing with a NotImplementedException
- Use .NET Core 3 + enable NRT, but not using inference --> 2714 warnings
- After running InferNull --> 1134 warnings.
- Some bugfixes reduce this to 1120 warnings.
2020-06-13: Implementing flow analysis (#5) gets us down to 1079 warnings.
2020-06-21: [NotNullWhen(true)]-inference finally works correctly --> 722 warnings.

In the remaining warnings, I see some categories of problems occurring repeatedly:

unconstrained generics: we can't infer [AllowNull] yet
generics: we can't infer T: notnull constraints yet
uninitialized fields: ILSpy has many classes where fields are initialized not in the constructor, but by other methods (e.g. a single public method serves as an entry point for a class and initializes a bunch of fields; with private methods relying on the fields already being initialized)
Roslyn doesn't realize that fields are initialized when the ctor calls a property setter

Crashing on documentation comment types

Hello!
I tried running InferNull on the SharpZipLib source and it failed with a duplicate key exception.
I added some debugging output to see what is going on, and it seems like it chokes on the documentation comment types:

Perhaps this is a known issue? I will try substituting the missing SyntaxTypes to continue testing.

The initial duplicate key exception was thrown in ICSharpCode.NullabilityInference.SyntaxToNodeMapping.CreateNewNode() and the missing key was in ICSharpCode.NullabilityInference.SyntaxToNodeMapping[TypeSyntax syntax]

Distribute as dotnet tool

It would be more convenient to install and use this tool as a dotnet tool.
