Tao Te KaChing
Workin' the cash register of the Great Tao

Linq, the Postage Stamp problem, and Me...

I had at one point been farting around with algorithms to "solve" efficiently a problem where, given a dartboard with R regions of values V0, V1, ... VR, find the set of regions with the greatest minimum sum you cannot get with N darts.  This is basically the Postage Stamp problem (see also the Coin and Knapsack problems).


One of the solutions I was working on used some Linq.  A central component to the solution was to find that least sum.  So I used Linq to find all “throwable” sums (say set ST), then compare that set against a set SP of all possible sums (VR * N, assuming VR is the largest value of a values in R), to get the smallest possible.
// query 1
int best =
    (from i in all
     where
     !(from set_1 in ints
       from set_2 in ints
       from set_3 in ints
       select set_1 + set_2 + set_3).Distinct().Contains(i)
     select i).Min();

"all" in the outer query's from clause is a set of integers from 1 to the maximum region value times the number of darts (in this case 3 darts).  The from set_1, set_2, and set_3 inner query created a cartesian join of all sums of the regions of the dartboard.  The distinct list of these are compared with "all" to find the lowest sum not found in there.

This query took a loooooooooong time.  I'd come to use Linq frequently, in particular nested joins, and had not seen such a performance hit before.  My only guess was that the inner query was not a cached set and was actually executing with each assessment of the where clause.  I tried separating out the inner query:
// query 2
var sums =
    (from set_1 in ints
     from set_2 in ints
     from set_3 in ints
     select set_1 + set_2 + set_3).Distinct();

int best =
    (from i in all
     where !(sums.Contains(i))
     select i).Min();

No improvement.  Stepping through the code showed the second query calling the first query; ergo the first query was really not executed to that point (1st hint of deferred execution), ergo no "cached" set of values.  I decided to try forcing the issue:
// query 3
int[] sums =
    (from set_1 in ints
     from set_2 in ints
     from set_3 in ints
     select set_1 + set_2 + set_3).Distinct().ToArray();

int best =
    (from i in all
     where !(a_sums.Contains(i))
     select i).Min();

Way way WAYYYYY faster.  This called for some delicious Excel graphs:

graph1

The vertical axis is the number of seconds to complete the sums query, the horizontal the number of integers in ints.  The blue trendline are queries 1 and 2 above (legend lines “all in one” and “sums out”, respectively), the tan query 3 (“sums arrayed” legend line).  Query 3 I know is not ultra super efficient by any means, but Holy Booyashaka!  I compared query 3 to it's gross O(n3) distant relative, query 4:

// query 4
for (int i = 0; i < __c; i++)
{
    for (int j = 0; j < __c; j++)
    {
        for (int k = 0; k < __c; k++)
        {
            if (!__i.Contains(i + j + k))
                __i.Add(i + j + k);
        }
    }
}
int[] sums = __i.ToArray();
best =
    (from i in all
     where !(sums.Contains(i))
     select i).Min();

graph3

Lovely.  Our ancestor actually performs better.  The axes (yes, this is the plural of axis) represent the same as our first graph.  At this point, I'm not really concerned with solving Big-O efficiency in my Postage Stamp problem; rather, I realized I had been presuming Linq sets were “immediate”(my term here); that is, I presumed a Linq query was executed and cached when the statement was “finalized,” say with a select clause.  Nein.

Linq queries are translated into sequences of methods to invoke to get the result inferred by the query; that is, a Linq statement is evaluated into a set of equivalent delegate methods which return the like result.  I knew that much, but had assumed that, say in query 1 for instance, the .Distinct() would “execute” it's predecessor, and so on down the chain.  Wrong.  Here's the ugly b'yotch as seen through Reflector:


private static void Linq0(List<int> ints, List<int> all)
{
      int best = all.Where<int>(delegate (int i) {
          return !ints
              .SelectMany(((Func<int, IEnumerable<int>>) (set_1 => ints)), (set_1, set_2) => new { set_1 = set_1, set_2 = set_2 })
              .SelectMany(((Func<<>f__AnonymousType0<int, int>, IEnumerable<int>>) (<>h__TransparentIdentifier14 => ints)), (<>h__TransparentIdentifier14, set_3) => ((<>h__TransparentIdentifier14.set_1 + <>h__TransparentIdentifier14.set_2) + set_3))
          .Distinct<int>().Contains<int>(i);
      }).Min(); }

Ergo, the nested queries are executing nested groups of methods.  This equals Big-Oh-Shit!  Compare to query 2's rendering:


private static void Linq2(List<int> ints, List<int> all)
{
    int[] sums =
        ints
        .SelectMany(delegate (int set_1) { return ints; }, delegate (int set_1, int set_2) { return new { set_1 = set_1, set_2 = set_2 }; })
        .SelectMany(delegate (<>f__AnonymousType0<int, int> <>h__TransparentIdentifierb) { return ints; }, delegate (<>f__AnonymousType0<int, int> <>h__TransparentIdentifierb, int set_3) { return ((<>h__TransparentIdentifierb.set_1 + <>h__TransparentIdentifierb.set_2) + set_3); }).Distinct<int>().ToArray<int>();
    int best = all.Where<int>(delegate (int i) {
        return !sums.Contains<int>(i);
    }).Min();
}

So, I guess it is safe enough to suppose that separating the nested query out was itself the solution to our problem?  Not so.  We forget we did that first and still saw no performance gain.  Let's look at query 1 through Reflector:


private static void Linq1(List<int> ints, List<int> all)
{
    IEnumerable<int> sums =
        ints
        .SelectMany(delegate (int set_1) { return ints; }, delegate (int set_1, int set_2) { return new { set_1 = set_1, set_2 = set_2 }; })
        .SelectMany(delegate (<>f__AnonymousType0<int, int> <>h__TransparentIdentifier0) { return ints; }, delegate (<>f__AnonymousType0<int, int> <>h__TransparentIdentifier0, int set_3) { return ((<>h__TransparentIdentifier0.set_1 + <>h__TransparentIdentifier0.set_2) + set_3); }).Distinct<int>();
    int best = all.Where<int>(delegate (int i) {
        return !sums.Contains<int>(i);
    }).Min();
}

Correct me if I'm wrong here, but it looks like the only difference is that sums is defined as IEnumerable<int> and that it is not assigned an int[] array as it is in query 2.  So WTF?  We have apparently run into an instance of deferred execution, and the way to “force” immediate execution where the Linq is defined is to make sure the query returns a value that does not inherit IEnumerable<T> or IQueryable<T>.  Deferred execution is well known to veterans of Linq.  See this excellent article for more info.  Don't feel like a knucklehead (like I do) for not knowing this.  I'd been doing Linq for some time now and just simply never ran into this.  Perhaps that's just because my Linq code is so superb…or something…

…but I digress.

It would be fantastic if the compilation process could include the means to optimize nested queries so that they are run once to avoid deferred execution and obtain a re-usable set, say through a compiler option.  Perhaps even include a means of caching sets to disk via disk based data structures if the sets are known in advance to be large.  When did you find out about deferred execution?  And do you agree it would be worth the while to optimize the nested queries for local Linq objects?

~ZagNut

,,,,,

COMMENTS