Recently I’ve been optimising LINQ to SQL queries running against an SQL CE 3.5 database in order to stop them taking over 5 minutes to execute and bringing them down to only a few seconds. In this post I’m going to go into two of the biggest offenders I’ve seen so far in terms of killing query performance. Credit must go to my fellow Readifarian Colin Savage with whom I worked to discover and find solutions for the offending LINQ query expressions.

In this post, I’m going to be using the following demonstration LINQ to SQL classes. Note that these are just demo classes and aren’t recommended practice or even fully fleshed out. Note that Game.DeveloperID has an index against it in the database.

[Table(Name = "Developer")]
public class Developer
{
    [Column(DbType = "int NOT NULL IDENTITY", IsPrimaryKey=true, IsDbGenerated=true)]
    public int ID { get; set; }
    
    [Column(DbType = "nvarchar(50) NOT NULL")]
    public string Name { get; set; } 
}

[Table(Name = "Game")]
public class Game
{
    [Column(DbType = "int NOT NULL IDENTITY", IsPrimaryKey=true, IsDbGenerated=true)]
    public int ID { get; set; } 
    
    [Column(DbType = "nvarchar(50) NOT NULL")]
    public string Name { get; set; } 
    
    [Column(DbType = "int NOT NULL")]
    public int DeveloperID { get; set; } 
    
    [Column(DbType = "int NOT NULL")]
    public int Ordinal { get; set; } 
}

Outer Joining and Null Testing Entities

The first performance killer Colin and I ran into is where you’re testing whether an outer joined entity is null. Here’s a query that includes the performance killer expression:

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
                on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Inefficient
                Name = g != null ? g.Name : String.Empty,
            };

The above query is doing a left outer join of Game against Developer, and in the case that the developer doesn’t have any games, it’s setting the Name property of the anonymous projection object to String.Empty. Seems like a pretty reasonable query, right?

Wrong. In SQL CE, this is the SQL generated and the query plan created:

SELECT [t0].[ID], [t0].[Name], 
(CASE 
    WHEN [t2].[test] IS NOT NULL THEN [t2].[Name]
    ELSE CONVERT(NVarChar(50),'')
 END) AS [Name2]
FROM [Developer] AS [t0]
LEFT OUTER JOIN (
    SELECT 1 AS [test], [t1].[Name], [t1].[DeveloperID]
    FROM [Game] AS [t1]
) AS [t2] ON [t0].[ID] = [t2].[DeveloperID]
Inefficient join between two table scans on SQL CE 3.5

Inefficient join between two table scans on SQL CE 3.5

The problem is that the “SELECT 1 AS [test]” subquery in the SQL is causing SQL CE to do a join between two table scans, which on tables with lots of data is very very slow. Thankfully for those using real SQL Server (I tested on 2008 R2), it seems to be able to deal with this query form and generates an efficient query plan, as shown below.

Efficient join between two clustered index scans on SQL Server 2008 R2

Efficient join between two clustered index scans on SQL Server 2008 R2

So, what can we do to eliminate that subquery from the SQL? Well, we’re causing that subquery by performing a null test against the entity object in the LINQ expression (LINQ to SQL looks at the [test] column to see if there is a joined entity there; if it’s 1, there is, if it’s NULL, there isn’t). So how about this query instead?

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
               on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Efficient
                Name = g.ID != 0 ? g.Name : String.Empty,
            };

Success! This generates the following SQL and query plan against SQL CE 3.5:

SELECT [t0].[ID], [t0].[Name], 
    (CASE 
        WHEN [t1].[ID] <> 0 THEN CONVERT(NVarChar(50),[t1].[Name])
        ELSE NULL
     END) AS [Name2]
FROM [Developer] AS [t0]
LEFT OUTER JOIN [Game] AS [t1] ON [t0].[ID] = [t1].[DeveloperID]
Efficient join between a table scan and an index seek on SQL CE 3.5

Efficient join between a table scan and an index seek on SQL CE 3.5

The subquery has been removed from the SQL and the query plan reflects this; it now uses an index seek instead of a second table scan in order to do the join. This is much faster!

Okay, that seems like a simple fix. So when we use it like below, putting g.Name into another object, it should keep working correctly, right?

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
                on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Broken
                Name = g.ID != 0 ? new Wrap { Str = g.Name } : null
            };

Unfortunately, no. You may get an InvalidOperationException at runtime (depending on the data in your DB) with the confusing message “The null value cannot be assigned to a member with type System.Boolean which is a non-nullable value type.”

If we look at the SQL generated by this LINQ query and the data returned from the DB, we can see what’s causing this problem:

SELECT [t0].[ID], [t0].[Name], 
    (CASE 
        WHEN [t1].[ID] <> 0 THEN 1
        WHEN NOT ([t1].[ID] <> 0) THEN 0
        ELSE NULL
     END) AS [value], [t1].[Name] AS [Str]
FROM [Developer] AS [t0]
LEFT OUTER JOIN [Game] AS [t1] ON [t0].[ID] = [t1].[DeveloperID]
The data returned by the broken query

The data returned by the broken query

It’s probably a fair to make the assumption that LINQ to SQL is using the [value] column internally to evaluate the “g.ID != 0” part of the LINQ query, but you’ll notice that in the data the value is NULL for one of the rows. This seems to be what is causing the “can’t assign a null to a bool” error we’re getting. I think this is a bug in LINQ to SQL, because as far as I can tell as this is pretty unintuitive behaviour. Note that this SQL query form, that causes this problem with its CASE, WHEN, WHEN, ELSE expression, is only generated when we project the results into another object, not when we just project the results straight into the main projection object. I don’t know why this is.

So how can we work around this? Prepare to vomit just a little bit, up in the back of your mouth:

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
                on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Fixed! (WTF??)
                Name = (int?)g.ID != null ? new Wrap { Str = g.Name } : null
            };

Mmm, tasty! :S Yes, that C# doesn’t even make sense and a good tool like ReSharper would tell you to remove that pointless int? cast, because ID is already an int and casting it to an int? and checking for null is entirely pointless. But this query form forces LINQ to SQL to generate the SQL we want:

SELECT [t0].[ID], [t0].[Name], 
    (CASE 
        WHEN ([t1].[ID]) IS NOT NULL THEN 1
        ELSE 0
     END) AS [value], [t1].[Name] AS [Str]
FROM [Developer] AS [t0]
LEFT OUTER JOIN [Game] AS [t1] ON [t0].[ID] = [t1].[DeveloperID]
The data returned by the fixed query

The data returned by the fixed query

Note that the query now returns the expected 0 value instead of NULL in the last row.

Outer Joining and Projecting into an Object Constructor

The other large performance killer Colin and I ran into is where you project into an object constructor. Here’s an example:

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
               on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Inefficient
                Name = (int?)g.ID != null ? new Wrap(g) : null
            };

In the above query we’re passing the whole Game object into Wrap’s constructor, where it’ll copy the Game’s properties to its properties. This makes for neater queries, instead of having a massive object initialiser block where you set all the properties on Wrap with properties from Game. Too bad it reintroduces our little subquery issue back into the SQL:

SELECT [t0].[ID], [t0].[Name], 
    (CASE 
        WHEN ([t2].[ID]) IS NOT NULL THEN 1
        ELSE 0
     END) AS [value], [t2].[test], [t2].[ID] AS [ID2], [t2].[Name] AS [Name2], [t2].[DeveloperID], [t2].[Ordinal]
FROM [Developer] AS [t0]
LEFT OUTER JOIN (
    SELECT 1 AS [test], [t1].[ID], [t1].[Name], [t1].[DeveloperID], [t1].[Ordinal]
    FROM [Game] AS [t1]
    ) AS [t2] ON [t0].[ID] = [t2].[DeveloperID]

Unfortunately, the only way to get rid of the subquery again is to ditch the constructor and manually initialise the object with an object initialiser, making your queries much longer and noisy when there are a lot of properties:

var query = from d in context.GetTable<Developer>()
            join g in context.GetTable<Game>()
                on d.ID equals g.DeveloperID into tempG
            from g in tempG.DefaultIfEmpty()
            select new 
            { 
                d,
        
                //Efficient
                Name = (int?)g.ID != null ? new Wrap { Str = g.Name } : null
            };

This gives us back our efficient (on SQL CE 3.5) SQL:

SELECT [t0].[ID], [t0].[Name], 
    (CASE 
        WHEN ([t1].[ID]) IS NOT NULL THEN 1
        ELSE 0
     END) AS [value], [t1].[Name] AS [Str]
FROM [Developer] AS [t0]
LEFT OUTER JOIN [Game] AS [t1] ON [t0].[ID] = [t1].[DeveloperID]

Conclusion

For those using LINQ to SQL against real SQL Server, these LINQ contortions are unnecessary to get your performance as real SQL Server is able to make efficient query plans for the form of query that LINQ to SQL creates. However, SQL CE 3.5 can’t deal with these queries and so you need to munge your LINQ queries a bit to get them to perform, which is frustrating. Heading into the future, this won’t be a problem (hopefully) because SQL CE 4 doesn’t support LINQ to SQL and hopefully Entity Framework 4 doesn’t write queries like this (or maybe SQL CE 4 can just deal with it properly). For those on that software stack, it’s probably worth checking out what EF and SQL CE 4 is doing under the covers, as these problems highlight the need for software developers to watch what their LINQ providers are writing for them under the covers to make sure it’s performant.

Edit: For a way to help clean up the verbose hacked up LINQ queries that you end up with when working around these performance problems, check out this post.