The fixed point combinator and memoization in Haskell

October 7, 2016

The Fibonacci numbers F_n for n \in \mathbb{N}_0 are defined by
F_n = n if n \le 1 and F_n = F_{n-1} + F_{n-2} otherwise.
In Haskell this becomes

fib n | n <= 1    = n
      | otherwise = fib (n-1) + fib (n-2)

The translation is almost verbatim. Unfortunately, as a show-case for functional programming, the definition of fib is pretty terrible. The problem can be seen from the evaluation tree for fib 5 below.
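Sketched in indented form, with each call listing its two recursive calls beneath it, the tree is:

fib 5
    fib 4
        fib 3
            fib 2
                fib 1
                fib 0
            fib 1
        fib 2
            fib 1
            fib 0
    fib 3
        fib 2
            fib 1
            fib 0
        fib 1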

The tree has F_6 = 8 leaves. More generally, one can show by induction that the tree for fib n has F_{n+1} leaves. Clearly computing F_n by adding up F_{n+1} numbers, of which F_n are 1 and F_{n-1} are 0, is not time efficient.

A more minor objection is that the domain of fib is not \mathbb{N}_0. In fact, thanks to Haskell’s polymorphism, fib is defined on values of any numeric type. For example fib 1.5 evaluates to fib 0.5 + fib (-0.5), which, since both arguments are \le 1, evaluates to 0.5-0.5 = 0. It seems remarkable that, without any thought (in fact, with the deliberate absence of thought), we have extended the domain of the Fibonacci function to all numbers understood by Haskell. Anyway, this objection is easily addressed:

fib' :: Integer -> Integer
fib' n | n <  0    = undefined
       | n <= 1    = n
       | otherwise = fib' (n-1) + fib' (n-2)

Here undefined is the error value, also known as ‘bottom’, that silently inhabits every type.


To address the serious problem, we need to tell Haskell to remember, or memoize the values of fib, rather than recompute them each time. The following idiom works well and is amenable to generalization.

fibM = \n -> values !! n
    where values = [fibAux m | m <- [0..]]
          fibAux n | n <= 1    = n
                   | otherwise = fibM (n-2) + fibM (n-1)

To understand why this works, one has to have some understanding of Haskell’s evaluation model. Haskell is lazy. Roughly put: no value is evaluated unless it is necessary for the computation to proceed. Thus if fibM is evaluated at 30 (say), the apparently infinite list [fibAux m | m <- [0..]] is only evaluated as far as its 31st member. Moreover, since values appears at the top level of the ‘where’ in the definition of fibM, it is shared between all invocations of fibM. (To use the correct terminology, it is a ‘constant applicative form’.) Sharing a list means sharing all its members, so each list member, and so each value of fibM, is computed at most once.
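One way to see the sharing concretely is to add a trace to fibAux. The following is a sketch using Debug.Trace; the traced variant fibMT is otherwise identical to fibM, and the trace calls are only for illustration.

import Debug.Trace (trace)

fibMT :: Int -> Integer
fibMT = \n -> values !! n
    where values = [fibAux m | m <- [0..]]
          fibAux m = trace ("fibAux " ++ show m) result
            where result | m <= 1    = fromIntegral m
                         | otherwise = fibMT (m-2) + fibMT (m-1)

Evaluating fibMT 10 prints each line ‘fibAux m’ at most once, no matter how often the corresponding value is reused.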

This point is maybe made more clearly by considering a simpler definition that fails to memoize correctly.

fibM' n | n <= 1    = n
        | otherwise = values !! (n-1) + values !! (n-2)
    where values = [fibM' m | m <- [0..]]

The problem is that the list values could potentially depend on n (consider replacing [0..] with [0..n]), and so cannot be shared between all invocations of fibM'. The difference with fibM is that the closure for fibM includes values; in contrast, each evaluation of fibM' n creates a new closure, with a new values list. (So fibM' uses quadratic storage, as well as exponential time—what a triumph!)

Writing fibM needs some care. We have to introduce an auxiliary ‘part-memoized’ function, and in turn we have to be careful in fibAux to use memoized values, rather than recursive calls to fibAux. Maybe we could make all this a one-time effort, by writing a function memoize :: (a -> a) -> (a -> a), such that memoize fib evaluates to a memoized version of fib?

A bit of thought suggests this is too ambitious. For example, what if the type a does not support testing for equality? Then we cannot even know if we are evaluating the memoized function on an argument already seen. But surely, we can write a memoizer for functions with domain and range the natural numbers less than 2^{63}, with type signature memoizeInt :: (Int -> Int) -> (Int -> Int)?

I don’t understand Haskell or functional programming well enough to be completely confident here, but my belief is that no such memoizer can be defined. Consider the following failed attempt.

memoizeInt :: (Int -> Int) -> (Int -> Int)
memoizeInt f = \m -> values !! m
    where values = [f m | m <- [0..]]

This is a useful memoizer for functions such as fac n = product [1..n] that may require a lot of work for each evaluation, but do not call themselves recursively. For instance, in ghci 7.8.4 on my laptop, evaluating fac (10^7) takes 4.5s; after defining facM = memoizeInt fac, the first evaluation of facM (10^7) takes 11s, and subsequent evaluations of facM (10^7) are instantaneous. (The result is in every case 0, because of the limited range of the Int type.)

The problem in fibM'' = memoizeInt fib is that when Haskell evaluates fibM'' n, the recursive evaluations fib (n-2) and fib (n-1) call the original fib function, not the memoized function fibM''. We would like to somehow ‘reach into’ the definition of fib so that the recursive calls are redirected, but the immutability of Haskell values makes this impossible.

Recursive functions as fixed points

The difficulty seems to be with functions that call themselves. Ignoring memoization, how do compilers cope with such functions? We digress to consider a strategy that sheds some light on the memoization problem. Consider the function fibBuilder below.

fibBuilder :: (Int -> Int) -> (Int -> Int)
fibBuilder f = g
   where g n | n <= 1    = f n
             | otherwise = f (n-2) + f (n-1) 

The input of fibBuilder is a function. The output is another function which we shall see is a ‘better approximation’ to a correct Fibonacci function. Consider for example

[id m | m <- [0,1,2,3]]
[g m | m <- [0,1,2,3]] where g = fibBuilder id
[h m | m <- [0,1,2,3]] where h = (fibBuilder (fibBuilder id))

The output for the identity function id is clearly [0,1,2,3]. Chasing through the evaluations of g we get g 2 \rightarrow id 0 + id 1 \rightarrow 0 + 1 \rightarrow 1 and g 3 \rightarrow id 1 + id 2 \rightarrow 1 + 2 \rightarrow 3. Hence h 3 \rightarrow g 1 + g 2 where g = fibBuilder id evaluates to 1 + 1, and so to 2. Thus g agrees with fib on 0, 1, 2 and h agrees with fib on 0, 1, 2, 3.
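These agreements are easy to check mechanically. The following is a sketch (approx is a hypothetical helper; iterate f x is the infinite list x, f x, f (f x), and so on):

approx :: Int -> (Int -> Int)
approx r = iterate fibBuilder id !! r

[[approx r m | m <- [0..4]] | r <- [0..3]]

The last expression evaluates to [[0,1,2,3,4],[0,1,1,3,5],[0,1,1,2,4],[0,1,1,2,3]]: the approximation with r calls to fibBuilder agrees with fib on \{0,1,\ldots,r+1\}, as claimed below.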

An easy inductive argument shows that a function defined like g and h, but with r calls to fibBuilder, agrees with fib on \{0,1,\ldots, r+1\}. We want to somehow ‘pass to the limit’, and get the function fibInf defined by calling fibBuilder infinitely many times. So fibInf should be the same as fibBuilder fibInf. By analogy with the infinite list ones = 1 : ones, we boldly define

fibInf = fibBuilder fibInf

The result is disappointing: no evaluation of fibInf ever terminates. For example, fibInf 3 \rightarrow (fibBuilder fibInf) 3 \rightarrow fibInf 1 + fibInf 2 \rightarrow fibInf 1 + (fibBuilder fibInf) 2 \rightarrow fibInf 1 + fibInf 0 + fibInf 1. Now the evaluation gets stuck, since each summand evaluates to itself. But there is a simple solution: extend the builder function so that it knows the initial conditions for the Fibonacci numbers, as well as the pattern of recursion. (This amounts to deleting one f in the original definition.)

fibBuilder' :: (Int -> Int) -> (Int -> Int)
fibBuilder' f = g
   where g n | n <= 1    = n
             | otherwise = f (n-2) + f (n-1) 

fibInf' = fibBuilder' fibInf'

This works. While non-obvious to define, neither of the functions above calls itself recursively. So, at least in this case, we’ve seen that a compiler that can cope with functions that return other functions can cope with a truly recursive function.

An alternative definition, emphasising that fibInf' and fibBuilder' fibInf' are the same, is

fibInf' = fix fibBuilder' where fix k = k (fix k)

Thus the defining property of fibInf' also serves as a working Haskell definition. For more theory on the fixed point combinator see the Wikipedia page. If there were a prize for ‘function that seems the most useless, and is in fact the most useful’, fix would be a strong contender.
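The same trick converts any builder function into its recursive fixed point; fix is also available ready-made from Data.Function, and is essentially the function defined above. A sketch (facBuilder and facInf are made-up names):

import Data.Function (fix)

facBuilder :: (Integer -> Integer) -> (Integer -> Integer)
facBuilder f n | n <= 1    = 1
               | otherwise = n * f (n-1)

facInf :: Integer -> Integer
facInf = fix facBuilder

Here facInf 5 evaluates to 120, even though neither facBuilder nor facInf calls itself.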

Memoizing builder functions

In the same spirit of ‘defining property as working definition’, we might try to define a memoized version of fib by

fibMInf = fibBuilder' (memoizeInt fibMInf)

Remarkably enough, this works. For example, chasing through the evaluation of fibMInf 3 we get fibMInf 3 \rightarrow fibBuilder' (memoizeInt fibMInf) 3 \rightarrow f 1 + f 2 where f = memoizeInt fibMInf. The closure for f contains the list values, which here is [fibMInf m | m <- [0..]]. So, as if by magic, we now call a memoized function in the recursive evaluations.

So although we haven’t been able to define a good memoizer with type (Int -> Int) -> (Int -> Int), we can use the naive attempt above to write a good memoizer for builder functions.

memoizeBuilder :: ((Int -> Int) -> (Int -> Int)) -> (Int -> Int)
memoizeBuilder builder = fix (builder . memoizeInt)

fibMInf' = memoizeBuilder fibBuilder'
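As a quick sanity check that the assembled definitions behave like fib (a sketch; main is only for illustration and assumes the definitions above are in scope):

main :: IO ()
main = print (map fibMInf' [0..10])   -- prints [0,1,1,2,3,5,8,13,21,34,55]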

Differences of permutation characters of symmetric groups

July 13, 2016

Here are a few small observations on permutation characters motivated by discussions at the Herstmonceux conference Algebraic combinatorics and group actions. Let \chi^\lambda denote the irreducible character of the symmetric group S_n labelled by the partition \lambda of n. Let \pi^\lambda denote the permutation character of S_n acting on the cosets of the Young subgroup S_\lambda.

  • Since \pi^\lambda = \chi^\lambda + \phi where \phi is an integral linear combination of characters \chi^\mu for partitions \mu strictly dominating \lambda, any irreducible character of S_n is an integral linear combination of the \pi^\lambda.

  • This can be made more explicit as follows. Let \delta = (n-1,\ldots,1,0). For \sigma \in S_n, let \lambda \cdot \sigma = (\lambda + \delta)\sigma - \delta, where \sigma acts on \lambda + \delta by place permutation. The Jacobi–Trudi formula (a small example is worked out after these bullet points) states that

    \chi^\lambda = \sum_{\sigma \in S_n} \pi^{\lambda \cdot \sigma} \mathrm{sgn}(\sigma).

  • It is a special property of symmetric groups that any character is a difference of permutation characters. For example, the cyclic group C_n has \mathrm{d}(n) (the number of divisors of n) distinct transitive permutation characters, but n irreducible characters, so unless n \le 2, the additive group generated by the permutation characters is a proper subgroup of the integral character ring. And clearly no irreducible character of a finite group taking a non-rational value can be an integral linear combination of permutation characters.

  • Since a transitive permutation character contains the trivial character exactly once, the trivial character is not a difference of two transitive permutation characters. A further example is given below.
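Returning to the Jacobi–Trudi formula, here is the promised small example, with the usual convention that a term \pi^\mu vanishes if the composition \mu has a negative part. Take n = 3 and \lambda = (2,1,0), so that \delta = (2,1,0) and \lambda + \delta = (4,2,0). The only permutations \sigma giving non-negative parts are the identity, with \lambda \cdot \sigma = (2,1,0), and the transposition of the first two places, with \lambda \cdot \sigma = (0,3,0). Hence the formula reads

\chi^{(2,1)} = \pi^{(2,1)} - \pi^{(3)} = \pi^{(2,1)} - 1_{S_3},

an instance of the identity \chi^{(n-1,1)} = \pi^{(n-1,1)} - 1_{S_n} used in the proof of the Claim below.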

Claim. \chi^{(n-1,1)} + \chi^{(1^n)} is not a difference of two transitive permutation characters of S_n.

Proof. Let \pi_K denote the permutation character of S_n acting on the cosets of the subgroup K of S_n. Note that the sign character, \chi^{(1^n)}, appears in \pi_K if and only if K \le A_n.

Let s(K) be the number of orbits of K on \{1,\ldots, n\}. Since

\chi^{(n-1,1)} = \pi^{(n-1,1)} - 1_{S_n},

it follows from Frobenius reciprocity and Mackey’s Formula that \langle \pi_K, \chi^{(n-1,1)} \rangle = s(K)-1. Consider \chi^{(2,1^{n-2})}. Multiplying the displayed equation above by \mathrm{sgn}, we get

\chi^{(2,1^{n-2})} = \mathrm{sgn}_{S_{n-1}}\uparrow^{S_n} - \mathrm{sgn}_{S_n}.

By Frobenius reciprocity

\langle \pi_K, \mathrm{sgn}_{S_{n-1}}\uparrow^{S_n} \rangle = \langle \pi_K \downarrow_{S_{n-1}}, \mathrm{sgn}_{S_{n-1}} \rangle.

By Mackey’s Formula, the right-hand side is

\sum \langle 1_{K^g \cap S_{n-1}}\uparrow^{S_{n-1}}, \mathrm{sgn}_{S_{n-1}} \rangle,

where the sum is over distinct double cosets K g S_{n-1}. It follows that \langle \pi_K, \chi^{(2,1^{n-2})} \rangle \le s(K)-1, with equality when K \le A_n.

Suppose that H and G are subgroups of S_n such that \pi_H - \pi_G contains \chi^{(n-1,1)} and \chi^{(1^n)}. From \chi^{(1^n)} we see that H \le A_n and G \not\le A_n. From \chi^{(n-1,1)}, we see that s(G) = s(H) -1. By the previous paragraph,

\langle \pi_H, \chi^{(2,1^{n-2})} \rangle = s(H)-1,

and

\langle \pi_G, \chi^{(2,1^{n-2})} \rangle \le s(G)-1 < s(H)-1.

Hence \chi^{(2,1^{n-2})} also appears in \pi_H -\pi_G. Therefore

\chi^{(n-1,1)} + \chi^{(1^n)}

is not a difference of two transitive permutation characters. \Box

  • Since \pi^{(n-r,r)} - \pi^{(n-r+1,r-1)} = \chi^{(n-r,r)} for 1 \le r \le n/2 (this is a special case of the Jacobi–Trudi formula) each \chi^{(n-r,r)} is a difference of two transitive permutation characters. Moreover, \chi^{(1^n)} = \pi_{A_n} - 1_{S_n}. An exhaustive computer algebra search over all subgroups of S_n supports the conjecture that, provided n \ge 9, these are the only irreducible characters of S_n that are the difference of two transitive permutation characters.
  • The behaviour of S_6 is interesting: all irreducible characters except for \chi^{(3,2,1)} are the difference of two transitive permutation characters. This is explained in part by the outer automorphism that exists in this case.

What price Brexit?

June 18, 2016

Next Thursday it looks like I may well have to abandon at least one of the following somewhat-cherished beliefs: (1) democracy basically works; (2) leaving the EU will be terrible, economically, socially and culturally, for the UK; (3) my own beliefs on this matter have no gaping inconsistencies. Still, at least the campaign will be over. I thought nothing could be more painful than the pro-Brexit leaflet I was sent last week, including a truly-beyond-parody map of Europe, suitably enlarged to include Iraq and Syria, coloured in a threatening shade of blood, until I saw Farage’s latest poster (no link).

At the time of writing (Saturday 18 June), the FT poll-of-polls has Remain on 43% and leave on 48%, with the remainder undecided. The best odds available from conventional bookmakers are 1:2 for remain and 7:4 for leave, corresponding to 66.7% and 36.4% (with a 3% profit margin). The Betfair prices are 1.51 for remain and 2.94 for leave, corresponding to 66.2% and 34.0%.

Here is a highly inexpert attempt to work out the implied odds from option prices. The GBP/USD exchange rate closed on Friday at 1.4358, having bounced back slightly from a low of 1.4114 on Tuesday. According to the minutes of the MPC meeting on Thursday, if the UK votes to leave the EU then ‘Sterling is also likely to depreciate further, perhaps sharply’. Maybe there is a code, known to experts, that translates phrases such as ‘depreciate further, perhaps sharply’, ‘drop significantly’, ‘cause the end of civilisation as we know it’ into approximate percentages. For this post, I’ll assume that a leave vote means the rate is 1.44(1-e), and a remain vote means the rate is at least 1.44, at any time when this matters. Let p be the probability of a vote to leave. From now on, all prices are in dollars.

The most obvious way to hedge Brexit is to buy a put option with strike price 1.44, expiring after the referendum. According to these prices, a put option with strike 1.44 expiring in July costs 0.0536. So for about 0.05, I (or rather, an investment bank) can purchase the right to sell 1 GBP for 1.44 at some date (I’m not sure when) in July. This option is worthless after a vote to remain. By assumption, it pays off

1.44 - 1.44(1-e) = 1.44e

after a vote to leave. No-arbitrage implies that 1.44 ep = 0.0536, giving p = 0.0372/e. Taking e = 0.1, corresponding to a post-Brexit rate of 1.296, gives p = 0.372.

A slightly more sophisticated hedge also sells a put option at a lower price. By assumption, the rate does not go below 1.44(1-e), so selling a put option at this price gives some free extra cash. (Of course this contradicts no-arbitrage, but never mind.) Assuming that e \le 0.1, the pay-off is maximised by selling a put option with strike 1.295, netting 0.0069. No-arbitrage now implies that

1.44 ep = 0.0536-0.0069 = 0.0467,

giving p = 0.0324/e. The new hedging strategy makes Brexit more profitable, so the implied odds go down: if e =0.1 then p=0.324.

Going the other way, if any of this is to be believed, the betting markets predict a 10% drop in the pound on Brexit, and the opinion polls predict a 6% drop (0.061425 = 0.0324 \times (48 + 43)/48).
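For what it’s worth, the two-way relation between the assumed drop e and the implied probability p is simple enough to package as a pair of one-liners (a sketch in Haskell; 0.0467 is the hedged premium computed above, and the function names are made up):

impliedProbability :: Double -> Double
impliedProbability e = 0.0467 / (1.44 * e)

impliedDrop :: Double -> Double
impliedDrop p = 0.0467 / (1.44 * p)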

Post Brexit update, 9th July. At least one somewhat-cherished belief has duly been abandoned. The pound/dollar rate is now 1.2953, almost exactly 10% down.

Access structures and CNF versus DNF

June 9, 2016

Suppose there are n people, numbered from 1 up to n. Some subsets of these people are authorized to launch a nuclear strike. We suppose that the access structure is monotone: that is, if Y is an authorized set and Z contains Y, then Z is also authorized. Thus the set \Gamma of authorized sets is determined by its subset \Delta of minimal elements. For example, if n=4 and any set of at least three people is authorized, then \Delta = \bigl\{ \{1,2,3\}, \{1,2,4\}, \{1,3,4\}, \{2,3,4\} \bigr\} and \Gamma = \Delta \cup \bigl\{ \{1,2,3,4\} \bigr\}. If, in addition, \{1,2\} is an authorized set, then \Delta = \bigl\{ \{1,2\}, \{1,3,4\}, \{2,3,4\} \bigr\}.

Let x_i be true if Person i is present and false otherwise. Let S be the proposition ‘a nuclear strike can be launched’. Considering each minimal authorized set in turn, and assuming that any person present is happy to play their role in an epoch-defining nuclear strike, we see that

\displaystyle S \longleftrightarrow \bigvee_{X \in \Delta} \bigl( \bigwedge_{i \in X} x_i \bigr).

This expresses S in disjunctive normal form: a disjunction in which each term is a conjunction in the variables x_i and \neg x_i. In this example, corresponding to the monotonicity of the access structure, negated variables never appear.

It is an interesting, and maybe not completely obvious fact, that any proposition is logically equivalent to one in disjunctive normal form. Roughly, one proof goes as follows: consider all true/false assignments to the variables x_1, \ldots, x_n that make a given proposition P true. Then P is logically equivalent to ‘at least one of these assignments holds’, which is easily written in DNF.

For instance, in the first example above, the assignments that make S true are TTTT, FTTT, TFTT, TTFT, TTTF, and S is logically equivalent to

S' = Q \vee P_1 \vee P_2 \vee P_3 \vee P_4

where Q is x_1 \wedge x_2 \wedge x_3 \wedge x_4, P_1 is \neg x_1 \wedge x_2 \wedge x_3 \wedge x_4, P_2 is x_1 \wedge \neg x_2 \wedge x_3 \wedge x_4, and so on. A disjunction, like S' above, in which every clause is of the form y_1 \wedge y_2 \wedge \cdots \wedge y_n, where each y_i is either x_i or \neg x_i, is said to be in full disjunctive normal form. Since there are 2^n such clauses, and taking the disjunction over different subsets of them gives logically inequivalent propositions, there are exactly 2^{2^n} propositional formulae in the variables x_i, up to logical equivalence. (The empty disjunction is false by convention.)

The Dedekind number D_n is the number of such formulae not involving any \neg x_i. Thus D_n is the number of monotone boolean functions in n variables, or the number of monotone access structures with n people, or (by the minimal authorized subsets interpretation) the number of antichains in the poset of subsets of \{1,2,\ldots, n\}, ordered by inclusion. The OEIS link gives several further combinatorial interpretations. The sequence of Dedekind numbers, for n \in \mathbb{N}_0, begins 2, 3, 6, 20, 168, 7581, 7828354. No exact formula is known (or likely ever to be known), but the antichain interpretation makes it plausible that

\displaystyle \log_2 D_n \sim \binom{n}{n/2} \sim \sqrt{\frac{2}{\pi}} \frac{2^{n}}{\sqrt{n}},

and this was first proved by Kleitman.
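The first few Dedekind numbers are small enough to recompute by brute force. Here is a sketch in Haskell (dedekind is a hypothetical helper; truth tables are indexed by subsets of \{0,\ldots,n-1\} encoded as bitmasks, and the n = 4 case already takes a noticeable time):

import Data.Bits ((.&.))

dedekind :: Int -> Int
dedekind n = length [ table | table <- allTables, monotone table ]
  where
    size      = 2 ^ n
    allTables = sequence (replicate size [False, True])
    monotone table = and [ not (table !! a) || (table !! b)
                         | a <- [0 .. size - 1]
                         , b <- [0 .. size - 1]
                         , a .&. b == a ]     -- a is a subset of b

Then map dedekind [0..4] evaluates to [2,3,6,20,168].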

Negating formulae in DNF, it follows from de Morgan’s laws that any proposition is equivalent to a conjunction of clauses, each of the form y_1 \vee y_2 \vee \cdots \vee y_n, where each y_i is either x_i or \neg x_i. (The empty conjunction is true by convention.) For a monotone boolean function, negating its positive disjunctive form in this way gives a conjunctive normal form for its negation in which the variables only appear negated. The access structure interpretation makes an alternative conjunctive form apparent: for each non-authorized set Y \not\in \Gamma, we require that at least one person in its complement is present (clearly it suffices to consider the maximal such Y). Thus

\displaystyle S \longleftrightarrow \bigwedge_{Y \not\in \Gamma} \bigl( \bigvee_{j \not\in Y} x_j \bigr).

Taking the special case where the minimal authorized sets are all r-subsets of the n people, we get the appealing

\displaystyle \bigvee_{X \subseteq \{1,\ldots, n\} \atop |X| = r} \bigl( \bigwedge_{i \in X} x_i \bigr) \longleftrightarrow \bigwedge_{Y \subseteq \{1,\ldots, n\} \atop |Y| = n-r+1} \bigl( \bigvee_{j \in Y} x_j \bigr).

For a direct proof, just note that at least r people are present if and only if at most n-r people are absent, that is, if and only if at least one person from every subset of size n-r+1 is present.
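The equivalence is also easy to verify exhaustively for small n (a sketch in Haskell; thresholdEquiv and choose are hypothetical helpers):

import Data.List (subsequences)

choose :: Int -> [a] -> [[a]]
choose 0 _      = [[]]
choose _ []     = []
choose k (x:xs) = map (x:) (choose (k-1) xs) ++ choose k xs

thresholdEquiv :: Int -> Int -> Bool
thresholdEquiv n r = and [ lhs present == rhs present
                         | present <- subsequences [1..n] ]
  where
    lhs present = any (all (`elem` present)) (choose r [1..n])        -- some r-subset all present
    rhs present = all (any (`elem` present)) (choose (n-r+1) [1..n])  -- every (n-r+1)-subset meets present

Then and [thresholdEquiv n r | n <- [1..5], r <- [1..n]] evaluates to True.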

LMS Education Day 2016

May 24, 2016

The day started with Jane White and Paola Iannone on Assessment practices and learning mathematics. There was some discussion of the rationale for different forms of assessment, and then Paola reported on an interesting survey that asked students to rate different forms of assessment by (a) their personal preference; (b) their perceived ability to discriminate. For mathematics students there was some correlation. The traditional closed book exam was the most preferred option, and also regarded as having the greatest discriminatory power.

From interviews with students several other themes emerged: students dislike coursework, knowing how easy it is to plagiarise (it was quickly pointed out that plagiarising successfully is not so easy), and group assessment, perhaps because of the ‘free-rider’ problem. They value fairness highly, and were surprisingly open to unusual forms of assessment. Several suggested short oral exams would give them a fair chance to show what they could do. Paola reported that vivas are used instead of one problem sheet in a course at UEA. The workload is roughly equivalent, but shifts from Ph.D. markers to the lecturer.

The next talk was Kevin Houston and Mike Robinson on The flipped classroom. Kevin talked about flipping a first term ‘Number Systems’ course at Leeds. I lectured a similar course at Royal Holloway for three years, with Kevin’s book as required additional reading, and most recently gave a ‘partially flipped’ follow-on course on Matrix Algebra, so naturally my ears pricked up. Before flipping, Kevin’s course had 10 weeks of 1 hour lectures, with a 1 hour weekly tutorial in groups of 10 to 13. After flipping, it has two 1 hour lectures, on Mondays and Fridays. On Monday he lectured in his usual style; on Friday students were expected to have worked through written notes in advance, and the hour was spent solving problems and reviewing the key ideas in proofs. (I’m not sure if Kevin fully adopted Eric Mazur’s peer instruction cycle.) He used socrative to run quizzes in lectures, relying on students to own a smartphone: they all did. I’d be worried about students getting distracted, but there are clearly advantages over clickers: for instance, students can submit short text responses, so questions do not have to be purely multiple choice.

Students were encouraged to do the preliminary work by an assessed online quiz, available before each Friday lecture. Full exam credit was available for any result above 50%, with multiple attempts permitted — this seems a neat way to provide some additional motivation, while keeping an essentially formative character to the assessment. Exam results were ‘comparable to previous years on a slightly harder course’. Student engagement was significantly improved. Kevin commented on the high workload in preparing the new course for the first time, and the need for very accurate timing to get students to the right point on Monday evening.

Mike then talked about his experience flipping a second year course ‘Dynamical systems and Fourier analysis’. Before flipping, this course had a 1 hour lecture, and 1 hour practical workshop in a PC Lab. Post flipping, it had a single 2 hour class, held in a room with plentiful laptop computers. The preliminary work was to watch short online videos prepared by Mike. The advantage of online videos over lectures, that the viewer can pause at any moment, was appreciated by the students. They also observed, however, that there was no immediate opportunity to ask questions to the lecturer or their peers. Mike reported on a survey, addressed only to serial non-attenders at this course: a typical non-attender missed classes for reasons unrelated to the courses, but a few, who admitted not doing the preliminary work, felt their lack of preparation would be exposed. Another survey showed the new course was very popular with students, with many seeing only advantages to the new format.

I would like to know whether strong or weak students benefit more from the flipped classroom. My guess is that flipping can help weak but hard-working students to get somewhere, rather than nowhere, but is maybe of less benefit to stronger students. And under any system there will be a core of students who’ll fail no matter what we do. Kevin made the interesting point that many students adopt a pragmatic attitude and aim to do just enough in any given course. So it is unrealistic to expect huge improvements in exam performance, even from successful interventions.

All speakers were enthusiastic about their experience; both Kevin and Mike felt they ‘hadn’t flipped enough’. A recurring comment was that, post flipping, they had more meaningful interaction with the students. Paola commented that it was often surprising how much students turned out to know. I agree. Going the other way, I’m now only rarely surprised by the many essential things (definitions, for example) that pass them by, causing endless stumbling blocks. The flipped classroom may create — at least in the short term — more work, and is probably not suited to everyone, but it offers a way to tackle these problems head on.

The final talk was by Peter Giblin and Alice Rogers on The Teaching Excellence Framework, HE White Paper, and Quality Assurance. Peter gave an excellent summary of the white paper (he said slides would be available from the LMS website), highlighting the somewhat strange way in which the TEF is expected to be introduced, at the institutional level very soon (but with everyone who agrees to play guaranteed at least the minimum ‘Meets Expectation’ rating), and at the discipline level only by 2019/2020. Peter quoted one particularly discouraging requirement, buried in the additional technical documentation, that submissions should ‘avoid focusing on successful but highly localised practices’. The whole game seems to be a bureaucrat’s dream and an academic’s nightmare.

There was much uncertainty about the way departments would be assessed, and considerable debate about what, if any, benchmarks could fairly be used. On this note, the day ended.

Variance of an estimate for the second moment

May 4, 2016

Suppose there are u urns, numbered from 1 up to u, containing in total b balls, numbered from 1 up to b. Let p_i be the proportion of balls in urn i. Let s be the probability that if we pick two numbers in \{1,\ldots,b\} uniformly and independently at random, then the balls with these numbers lie in the same urn. Clearly s = \sum_{i=1}^u p_i^2.

The honest way to estimate s follows the definition by repeatedly picking pairs of balls, and counting the number of ‘hits’ where the two balls in a pair lie in the same urn. For example, suppose that each urn contains equally many balls, so p_i = 1/u for each i. Then our estimate for s = 1/u after N samples is H/N, where the number of hits H has distribution \mathrm{Bin}(N,1/u). As expected, \mathbb{E}[H/N] = 1/u = s.

An alternative is to choose N ball numbers uniformly at random, and for each ball, note down its urn number. For each i \in \{1,\ldots, u\}, let B_i be the random variable counting the number of balls lying in urn i. We then estimate s by the statistic

S = \sum_{i=1}^u \bigl( \frac{B_i}{N} \bigr)^2.

By the well known formula

\mathrm{Var} X = \mathbb{E}[(X -\mathbb{E}X)^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2,

we get

\begin{aligned} \mathbb{E}[S] &= \frac{1}{N^2} \Bigl( \mathbb{E}\bigl[ \sum_{i=1}^u (B_i-\mathbb{E}[B_i])^2\bigr] + \sum_{i=1}^u \mathbb{E}[B_i]^2 \Bigr) \\ &= \frac{1}{N^2} \sum_{i=1}^u \mathbb{E}\bigl[(B_i - \mathbb{E}[B_i])^2\bigr]  + s\\ &= \frac{1}{N^2} \sum_{i=1}^u \mathrm{Var} B_i + s. \end{aligned}

Thus S is a biased estimator for s. Since B_i has distribution \mathrm{Bin}(N, p_i), we have \mathrm{Var}(B_i) = Np_i(1-p_i) and so

\mathbb{E}[S] = s + \frac{1}{N^2} \sum_{i=1}^u Np_i(1-p_i) = s + \frac{1}{N} - \frac{s}{N}.

Since s is small compared to 1, a good correction for the systematic error can be made by subtracting 1/N. All the same, it might well seem that sampling by pairs is the simplest and best approach. But this ignores the variability in our estimates.

Again consider the case where p_i = 1/u for each i. Then, as already seen, H has distribution \mathrm{Bin}(N,1/u) and so \mathrm{Var}[H/N] = \mathrm{Var}[H]/N^2 = N(1/u)(1-1/u)/N^2 \approx 1/Nu. The standard deviation in H/N is therefore 1/\sqrt{Nu}. In particular, to improve the accuracy of our estimate by a factor f, we must increase the number of samples by f^2.

For S, we extend the ‘orthogonality trick’ used earlier to express \mathbb{E}[S] in terms of the variances of the B_i, to get

\begin{aligned} S &= \sum_{i=1}^u \frac{B_i^2}{N^2} \\ &= \sum_{i=1}^u \Bigl( \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2B_i}{Nu} - \frac{1}{u^2} \Bigr) \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{2}{u} - \frac{1}{u} \\ &= \sum_{i=1}^u \bigl( \frac{B_i}{N} - \frac{1}{u} \bigr)^2 + \frac{1}{u} \end{aligned}
where the penultimate line uses \sum_{i=1}^u B_i = N.

By the Central Limit Theorem, \sqrt{uN}\bigl( \frac{B_i}{N} - \frac{1}{u} \bigr) is approximately normally distributed with mean 0 and variance uN \mathrm{Var}[B_i/N] \approx 1. While the B_i are not independent, it still seems reasonable to approximate

\hat{S} = uN \sum_{i=1}^u \bigl(  \frac{B_i}{N} - \frac{1}{u} \bigr)^2

as the sum of the squares of u standard normal variables, i.e. as the \chi^2 distribution with u degrees of freedom. This distribution has variance 2u, so we have \mathrm{Var}[\hat{S}] \approx 2u. Undoing the scaling we get

\mathrm{Var}[S] \approx \frac{2u}{u^2N^2} = \frac{2}{uN^2} .

The standard deviation in S is then \frac{1}{N} \sqrt{\frac{2}{u}}. Therefore increasing the number of samples by a factor of f reduces the standard deviation by the same factor. Moreover, we use fewer samples: N rather than 2N.
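A minimal sketch of this kind of simulation, for the uniform case p_i = 1/u, is given below; it assumes the standard random and containers packages, and the helper names are made up. (In the uniform case, picking a ball uniformly at random and noting its urn is the same as picking an urn uniformly at random.)

import Control.Monad (replicateM)
import qualified Data.Map.Strict as Map
import System.Random (randomRIO)

-- pair sampling: estimate s by the proportion of n pairs landing in the same urn
estimateByPairs :: Int -> Int -> IO Double
estimateByPairs u n = do
  hits <- replicateM n $ do
    a <- randomRIO (1, u)
    b <- randomRIO (1, u)
    return (if a == b then 1 else 0 :: Int)
  return (fromIntegral (sum hits) / fromIntegral n)

-- urn counting: draw n balls, count B_i for each urn, return S = sum (B_i / n)^2
estimateByCounts :: Int -> Int -> IO Double
estimateByCounts u n = do
  urns <- replicateM n (randomRIO (1, u))
  let counts = Map.fromListWith (+) [ (i, 1 :: Int) | i <- urns ]
  return (sum [ (fromIntegral b / fromIntegral n) ^ 2 | b <- Map.elems counts ])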


Data from simulation suggest these approximations work well in practice. For instance, with u = 100 urns and N = 1000 samples, the mean of H/N over 1000 repetitions of the sampling experiment was 0.01014, with standard deviation 3.1354 \times 10^{-3}; the mean of S was 0.01099 \approx 1/u + 1/N, with standard deviation 1.4101 \times 10^{-4}. Increasing the number of samples to 10000 reduced the standard deviations for H/N and S to 1.0114 \times 10^{-3} and 1.4176 \times 10^{-5}, respectively, agreeing with the prediction above. For more skewed distributions of balls into urns the standard deviation of S is still far less than for H/N, but the improvement on increasing the number of samples is less dramatic.

Modular representations that look ordinary

March 23, 2016

Let G be a finite group and let (K, \mathcal{O}, k) be a p-modular system for G. Thus K is a field of characteristic zero sufficiently large that if V is any irreducible representation of G then \mathrm{End}_{KG}(V) \cong K and K contains all the eigenvalues of the matrices representing elements of G in their action on V, \mathcal{O} is an integrally closed subring of K containing all these eigenvalues, and k is a field of prime characteristic p obtained by quotienting \mathcal{O} by a maximal ideal \mathfrak{m}, necessarily lying above \langle p \rangle \unlhd \mathcal{O}. By localizing if necessary, we may assume that all elements of \mathcal{O} not in \mathfrak{m} are invertible. So \mathcal{O} is a local ring with unique maximal ideal \mathfrak{m}.

This is a standard setup for modular representation theory. The main purpose of this post is to prove Theorem 2 below, which gives a precise description of a family of representations that look substantially the same, whether defined over K, \mathcal{O} or k.

As a warm-up we prove the following result on ordinary characters.

Proposition 1. Let \chi be an irreducible character of G such that \chi(1) is divisible by the highest power of p dividing |G|. If g \in G is a non-identity p-element then \chi(g) = 0.
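For example, if G = S_3 and p = 2 then the only irreducible character of even degree is \chi^{(2,1)}, of degree 2, and indeed \chi^{(2,1)} vanishes on the three transpositions, which are the non-identity 2-elements of S_3.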

We need the following fact: the central primitive idempotent in KG corresponding to \chi is

e_\chi = \frac{\chi(1)}{|G|} \sum_{g \in G} \chi(g^{-1}) g.

Note that, since the highest power of p dividing |G| divides \chi(1), we have \chi(1)/|G| \in \mathcal{O}, and so the coefficients of e_\chi are contained in \mathcal{O}.

Proof of Proposition 1. Let P be a Sylow p-subgroup of G containing g. Since

(1) \qquad \mathcal{O}G =  e_\chi \mathcal{O}G \oplus (1-e_\chi) \mathcal{O}G,

considered as a right \mathcal{O}G-module, e_\chi \mathcal{O}G is projective. Hence the restriction of e_\chi \mathcal{O}G to \mathcal{O}P is also projective. Now \mathcal{O}P is indecomposable as a right \mathcal{O}P-module, since any direct sum decomposition induces a direct sum decomposition of the indecomposable kP-module kP by reduction modulo \mathfrak{m}. Hence the unique projective indecomposable \mathcal{O}P-module is \mathcal{O}P itself and e_\chi \mathcal{O}G\downarrow_P \cong \mathcal{O}P \oplus \cdots \oplus \mathcal{O}P. Restricting further we get

(2)\qquad e_\chi \mathcal{O}G \downarrow_{\langle g \rangle} \cong \mathcal{O} \langle g \rangle^{\oplus \chi(1)^2/\mathrm{ord}(g)}.

From the canonical basis for the right-hand side, which is permuted without fixed points by g, we see that the character of g acting on e_\chi \mathcal{O}G is zero. But, by Wedderburn’s Theorem, e_\chi KG is isomorphic to the direct sum of \chi(1) copies of a simple KG-module with character \chi. Hence \chi(1)\chi(g) = 0.\qquad\Box

I find it quite striking that the modular theory gives a short proof of a result stated entirely in the language of ordinary (i.e. characteristic zero) representation theory. Moreover, from (2), we can easily get a stronger result: \chi(g) = 0 whenever g has order divisible by p.

Outline proof. Let g have order p^a c where c is not divisible by p, let g_p = g^c, and let g_{p'} = g^{p^a}. Take a direct sum decomposition of e_\chi KG into eigenspaces for the action of g_{p'}. By (2), g_p has trace zero in its action on each eigenspace, hence so does g.

Proposition 1 is also a corollary of the following theorem. My first version of the proof did not make it entirely clear that e_\chi \mathcal{O}G is a direct sum of isomorphic \mathcal{O}G-modules, each simple as a KG-module. To repair the gap, I borrowed the proof of (1) \implies (2) on page 162 of this recent book by Peter Webb.

Theorem 2. Let \chi be an irreducible character of G such that \chi(1) is divisible by the highest power of p dividing |G|. There is a projective \mathcal{O}G-module M with ordinary character \chi such that M/\mathfrak{m}M is a simple projective kG-module.

Proof. Let M be a \mathcal{O}G-module, free of finite rank as an \mathcal{O}-module, such that M_K = M \otimes_\mathcal{O} K is a simple KG-module with character \chi. By Wedderburn’s Theorem, e_\chi KG is isomorphic to \mathrm{End}_K(M_K), thought of as the complete matrix algebra of \chi(1) \times \chi(1) matrices with coefficients in K. Let \rho : KG \rightarrow \mathrm{End}_K(M_K) be the corresponding ring homomorphism, with kernel (1-e_\chi)KG. The restriction of \rho to \mathcal{O}G has kernel (1-e_\chi)\mathcal{O}G. So it is sufficient to prove that the image of \rho : \mathcal{O}G \rightarrow \mathrm{End}_K(M_K) is \mathrm{End}_\mathcal{O}(M), thought of as the complete matrix ring of \chi(1) \times \chi(1) matrices with coefficients in \mathcal{O}. Since M is a \mathcal{O}G-module, the matrices \rho(x) for x \in \mathcal{O}G, taken with respect to an \mathcal{O}-basis of M, have coefficients in \mathcal{O}; that is, \rho(\mathcal{O}G) \subseteq \mathrm{End}_\mathcal{O}(M).

The surjectivity of the restriction of \rho follows from the following ‘averaging’ formula, which expresses an arbitrary \phi \in \mathrm{End}_\mathcal{O}(M) as the image under \rho of an element of \mathcal{O}G (note that, exactly as for the coefficients of e_\chi, the scalars \frac{\chi(1)}{|G|} \mathrm{Tr}_M\bigl( \rho(g^{-1}) \phi \bigr) appearing below lie in \mathcal{O}):

\phi = \frac{\chi(1)}{|G|} \sum_{g \in G} \mathrm{Tr}_M\bigl( \rho(g^{-1})  \phi\bigr) \rho(g).

Subproof. Since the ring homomorphism \rho: KG \rightarrow \mathrm{End}_K(M) is surjective, it is sufficient to prove this when \phi = \rho(x) with x \in G; the right-hand side is

\rho(x) \frac{\chi(1)}{|G|} \sum_{g\in G}  \chi(g^{-1} x) \rho(x^{-1} g) = \rho(x) \rho(e_\chi) = \rho(x),

as required.

We can now think of the right \mathcal{O}G-module M very concretely, as those d \times d matrices with coefficients in \mathcal{O} that are zero outside their first row, where d = \chi(1). In particular, since M^{\oplus d} \cong e_\chi \mathcal{O}G is a direct summand of \mathcal{O}G, the module M is projective. Reducing mod \mathfrak{m}, we obtain M/\mathfrak{m}M; this is the unique simple module (up to isomorphism) for \mathrm{Mat}_{d \times d}(k). Hence M/\mathfrak{m}M is a simple projective kG-module, as claimed. \qquad\Box