The purpose of this post is to give the simplest way I’ve found to get parallel execution to work in ghc 8.10.1.

#### The unhelpful cabal

I’ll begin by getting rid of the trivial difficulties. I’m obliged to use Stack (and the related cabal system), as it’s the only way I can find to get `ghc` onto recent versions of Mac OS X. I am unable to make `stack` work in any simple way. Here is the complicated way I’ve settled on. For each project:

1. Make a directory with all the `.hs` source files.
2. Create a `.cabal` file as in the example below.
3. Run `stack init`.
4. Edit `stack.yaml` so that the resolver field reads `resolver: lts-18.3`.
5. Run `stack ghci`.
6. Build with `stack build`.
7. Run the executable with `stack exec Main <arguments>`.
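After step 4, a minimal `stack.yaml` looks something like this (my sketch: `stack init` also writes a good deal of explanatory commentary that I have omitted):

```
resolver: lts-18.3
packages:
- .
```
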
For (1) use a separate `Main.hs` module declaring `main`. Do not expect the ghc option `-main-is` to work. A suitable `.cabal` file for (2) is

```
name:          ParTest
version:       0.0.0.0
cabal-version: >= 1.22
build-type:    Simple

executable Main
  main-is:          Main.hs
  other-modules:    ParTest
  build-depends:    base >= 4.14.1.0, array, deepseq, parallel
  default-language: Haskell98
  ghc-options:      -threaded -O2 -with-rtsopts=-N
```

Note that the `build-depends` field is separate from any import declarations in the `.hs` files and uses a different naming convention. For instance the package `parallel` above corresponds to `import Control.Parallel` and `import Control.Parallel.Strategies` in the module file. Fortunately if you omit a required package from `build-depends` then there is a very helpful error message that includes the package name you need. The purpose of (4) is to prevent `stack` from downloading a more recent version of `ghc` than it supports: ‘Stack has not been tested with GHC versions above 8.10, and using 9.0.2, this may fail’; indeed, fail it does. (Thanks to `stack`’s insistence on downloading an entirely fresh `ghc` for each project, I suspect I have now downloaded over 30GB of copies of `ghc` trying to chase down this glitch.) At least once I’ve found that (6) fails unless one first does (5): I have no idea why.

More positively, while I’ve had no success at all in using `--ghc-options="-threaded"` to pass options to `ghc`, the `ghc-options` field in the cabal file automates all of this, with the relevant run-time option included to boot. And I will concede that despite all the fuss of using `cabal`, it does in my experience avoid the dreaded ‘dependency hell’.

Note that `ghc` has to be explicitly told to compile threaded code. If one omits `-threaded` then `par` and the related constructs from `Control.Parallel` appear to have no effect. (When testing this, I discovered that it’s often essential to run `stack clean` after editing the `.cabal` file to make the changes ‘stick’.) This behaviour of `ghc` seems unnecessarily unhelpful to me: why can’t the compiler issue a warning when it sees `par` without threading turned on?

#### Benchmarking with lazy evaluation

A recurring issue with Haskell benchmarks is that lazy evaluation makes it very easy to write code that executes quickly, but only because it doesn’t do what strict imperative languages such as C might lead one to expect. For a simple example, suppose `expensive :: Integer -> Integer` is a function such that `expensive k` takes a long time to compute. We can build a list of the first `n` values of `expensive` using

```haskell
testExpensive n = length [expensive k | k <- [1..n]]
```

Irrespective of how `expensive` is defined, evaluating `testExpensive n` at the Haskell prompt takes essentially no time. The reason is that under lazy evaluation, only the existence of each entry of the list has to be established; each `expensive k` sits within the list as an unevaluated thunk. This can be seen interactively using the very useful `sprint` command in `ghci`. Try for instance

```haskell
let xs = [expensive k | k <- [1..n]]
```

for the `expensive` function of your choice (for instance `collatzSum` below), followed by `:sprint xs`, `length xs`, `:sprint xs`, `sum xs`, `:sprint xs`.
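
With `collatzSum` from below and a small bound, the session looks like this (a transcript of mine; the exact rendering of thunks can vary slightly between `ghci` versions):

```
ghci> let xs = [collatzSum k | k <- [1..5]] :: [Integer]
ghci> :sprint xs
xs = _
ghci> length xs
5
ghci> :sprint xs
xs = [_,_,_,_,_]
ghci> sum xs
91
ghci> :sprint xs
xs = [0,2,48,6,35]
```

Note how `length` evaluates only the spine of the list, leaving every element a thunk; only `sum` forces the elements themselves.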

Warning: if you do this interactively, you must avoid giving `xs` a polymorphic type, as will probably happen if you omit all type annotations. For instance if `xs` has type `Num a => [a]` then each evaluation will refer to a separate copy of `xs`. Thus with `let xs = map (\x -> x + 1) [1..10]` you’ll always get `_` from `:sprint xs`, whereas `let xs = map (\x -> x + 1) [1..10] :: [Int]` will work as expected.
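
The two behaviours side by side, in a transcript of mine (again, thunk display may differ slightly across `ghci` versions):

```
ghci> let ys = map (\x -> x + 1) [1..10]
ghci> length ys
10
ghci> :sprint ys
ys = _
ghci> let zs = map (\x -> x + 1) [1..10] :: [Int]
ghci> length zs
10
ghci> :sprint zs
zs = [_,_,_,_,_,_,_,_,_,_]
```

The polymorphic `ys` still shows as a single unevaluated `_` because `length` ran on a freshly defaulted copy, not on `ys` itself.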

#### An expensive function the compiler can’t magic away

Open mathematical problems are a rich source of such functions. Here I’ll go for the Collatz conjecture, on which Paul Erdős said “Mathematics is not yet ripe enough for such questions”; despite an astonishing partial result due to Terence Tao, his assessment still stands today. Let us define:

```haskell
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum $ collatzList n
```
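
As a quick sanity check of these definitions: the trajectory of 6 is 6, 3, 10, 5, 16, 8, 4, 2, so `collatzSum 6` should be 54. A self-contained check (repeating the definitions above so the file compiles on its own):

```haskell
-- the post's definitions, repeated so this file stands alone
collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum (collatzList n)

main :: IO ()
main = do
  print (collatzList 6)  -- [6,3,10,5,16,8,4,2]
  print (collatzSum 6)   -- 54
```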

While `ghc` is a deeply impressive piece of software, I think we can safely assume that it does not have, deeply embedded in its internals, an effective proof of the Collatz conjecture that gives a short-circuited evaluation of `collatzSum`. We can therefore take `collatzSum` as the `expensive` function we require.

#### Parallelising the Collatz sum of sums

Evaluating `sum [collatzSum k | k <- [1..1000000]]`, using the module `Main.hs` at the end of this post, takes about 9.5 seconds on my computer. The code below parallelises this computation by splitting the list in two and evaluating the sum of each half in parallel. Its running time is about 5 seconds.

```haskell
collatzSumOfSumsP :: Integer -> Integer
collatzSumOfSumsP r = sA `par` (sB `par` sA + sB)
  where
    sA = sum [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    sB = sum [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```

Here `par` is a Haskell function with the type `a -> b -> b`. The unusual type is a clue that it breaks the normal rules of evaluation. (Indeed, the reasoning used in Wadler’s remarkable paper Theorems for free! shows that the only *strict* function with this type must discard the first argument and return the second unmodified.) The effect of `par` is to spark evaluation of its first argument to weak head normal form (WHNF) and then, in parallel, but otherwise following the usual rules of Haskell evaluation, evaluate the second argument. (Compare `seq :: a -> b -> b`, which does much the same, but in series.) In particular, since `sA` is defined in the `where` clause, it is evaluated exactly once, in parallel with `sB`.

In fact the simpler definition `collatzSumOfSumsP r = sA `par` sA + sB` also works: the runtime is smart enough to avoid the ‘not-really-parallel’ trap in which it computes `sA` and `sA + sB` in parallel, but hangs on `sA + sB` waiting for `sA` rather than getting on with `sB`.
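
The same shape can be seen in a compilable toy (my example, not from the post). Here `par` is imported from `GHC.Conc` in `base`, which is what `Control.Parallel` re-exports, so this compiles even without the `parallel` package; and without `-threaded` the spark simply never fires, with the answer unchanged:

```haskell
import GHC.Conc (par)  -- Control.Parallel re-exports this par

-- spark the first half-sum, then demand both
parSum :: Integer -> Integer
parSum r = sA `par` (sA + sB)
  where
    sA = sum [1 .. r]
    sB = sum [r + 1 .. 2 * r]

main :: IO ()
main = print (parSum 1000)  -- 2001000, i.e. sum [1..2000]
```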

The variation below evaluates the lists in parallel and *then* takes the sum. Its running time is (up to random noise) the same as `collatzSumOfSumsP`. For `rpar` import `Control.Parallel.Strategies` and for `force` import `Control.DeepSeq`.

```haskell
interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

parallelInterleave :: (NFData a) => [a] -> [a] -> [a]
parallelInterleave xs ys = runEval $ do
    xsP <- rpar (force xs)
    ysP <- rpar (force ys)
    return $ interleave xsP ysP

collatzSumOfSumsPL :: Integer -> Integer
collatzSumOfSumsPL r = sum $ parallelInterleave ssA ssB
  where
    ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]
    ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]
```

This version is more typical of the kind of problem I really want to parallelise, in which a huge collection of candidate solutions is generated, and then some further computation is done on this collection to decide which solutions work.
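
On the purely functional side, `interleave` on its own just merges its arguments alternately, which a quick self-contained check confirms:

```haskell
-- the post's interleave, repeated so this file stands alone
interleave :: [a] -> [a] -> [a]
interleave xs [] = xs
interleave [] ys = ys
interleave (x : xs) (y : ys) = x : y : interleave xs ys

main :: IO ()
main = print (interleave [0, 2, 4] [1, 3 :: Int])  -- [0,1,2,3,4]
```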

#### Parallel interleave

The engine driving `collatzSumOfSumsPL` is the function `parallelInterleave`, which uses `rpar` from `Control.Parallel.Strategies` and `force` from `Control.DeepSeq` to evaluate the lists in parallel.

The use of `force` is essential because of a very well-known Haskell ‘gotcha’. Consider for instance `let xs = collatzList 10` followed by `xs `seq` "Now xs has been evaluated?!"`. One can check using `:sprint` that `xs` is indeed evaluated, but only to WHNF: the output is `xs = 10 : _`. The correct implementation of `force` for lists is recursive:

```haskell
forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = let xsF = forceList xs
                     -- force the element as well as the rest of the spine
                     in x `seq` xsF `seq` (x : xsF)
```

This can be used as a drop-in replacement for `force` in `parallelInterleave`. But it’s all fiddly enough that I think it’s best to leave it to the experts who wrote the `deepseq` package and use `force` as above. (The first version of this post even had a version that accidentally erased the list, while still failing to force non-head terms.)
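
The WHNF-versus-deep distinction can also be checked in a compiled program. This sketch of mine (using only `base`) plants `undefined` in a list’s tail: `seq` never reaches it, while a spine-forcing `forceList` does:

```haskell
import Control.Exception (SomeException, evaluate, try)

forceList :: [a] -> [a]
forceList []       = []
forceList (x : xs) = let xsF = forceList xs
                     in x `seq` xsF `seq` (x : xsF)

main :: IO ()
main = do
  let xs = 1 : undefined :: [Int]
  whnf <- try (evaluate (xs `seq` "ok")) :: IO (Either SomeException String)
  deep <- try (evaluate (forceList xs `seq` "ok")) :: IO (Either SomeException String)
  print whnf                -- Right "ok": seq stops at the first cons cell
  putStrLn $ case deep of   -- forceList walks the spine and hits undefined
    Left _  -> "forceList hit the undefined tail"
    Right _ -> "unexpected"
```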

The problem with `collatzSumOfSumsPL` is that since it evaluates both the lists `ssA = [collatzSum n | n <- [1..r], n `mod` 2 == 0]` and `ssB = [collatzSum n | n <- [1..r], n `mod` 2 == 1]` in full *before* taking the sum of each list (and then the sum of the sums), it uses memory proportional to the running time of the entire program, rather than memory bounded by the largest value of `collatzSum n` as in `collatzSumOfSumsP`. Our use of `par` and `seq` has forced us to confront the messy reality that semantically equivalent programs can have very different space requirements.

#### More on space

The code below computes `collatzSumOfSums`

by using an array, storing the value `collatzSum n `

for each relevant `n`

. Since Haskell arrays are strict, this means each call `collatzSumOfSums r`

has to fully allocate an array of size .

```haskell
hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum $ elems arr

collatzSumOfSumsA :: Integer -> Integer
collatzSumOfSumsA r = sumArray $ hugeCollatz r
```
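
As a sanity check that the array route agrees with the plain list sum (repeating the Collatz definitions so the snippet stands alone; `Data.Array` comes from the `array` package listed in the cabal file):

```haskell
import Data.Array (Array, array, elems)

collatzList :: Integer -> [Integer]
collatzList 0 = error "collatzList: 0"
collatzList 1 = []
collatzList n
  | n `mod` 2 == 0 = n : collatzList (n `div` 2)
  | otherwise      = n : collatzList (3 * n + 1)

collatzSum :: Integer -> Integer
collatzSum n = sum (collatzList n)

hugeCollatz :: Integer -> Array Integer Integer
hugeCollatz r = array (1, r) [(i, collatzSum i) | i <- [1..r]]

sumArray :: Array i Integer -> Integer
sumArray arr = sum (elems arr)

main :: IO ()
main = print (sumArray (hugeCollatz 6) == sum [collatzSum n | n <- [1..6]])  -- True
```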

Using `collatzSumOfSumsA` as the `expensive` function in our benchmarks, the extra space required to interleave the strictly evaluated lists before taking the sum can easily be seen using `top`.

```haskell
collatzSumOfSumOfSums :: Integer -> Integer
collatzSumOfSumOfSums t = sum [collatzSumOfSums r | r <- [1..t]]

collatzSumOfSumOfSumsAPHuge :: Integer -> Integer
collatzSumOfSumOfSumsAPHuge t = sum [sumArray arr | arr <- arrays]
  where
    arraysA = [hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    arraysB = [hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    arrays  = parallelInterleave arraysA arraysB

collatzSumOfSumOfSumsAPSmall :: Integer -> Integer
collatzSumOfSumOfSumsAPSmall t = sum ss
  where
    ssA = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 0]
    ssB = [sumArray $ hugeCollatz r | r <- [1..t], r `mod` 2 == 1]
    ss  = parallelInterleave ssA ssB
```

On my machine, evaluating these three functions for `t = 2500` takes respectively:

- 14 seconds, memory 20Mb
- 8 seconds, memory growing to 320Mb
- 8 seconds, memory 20Mb

The saving in `collatzSumOfSumOfSumsAPSmall` from summing the arrays as they are produced is shown in the diagram below.

#### Conclusion

Despite my initial struggles, I think these experiments show that it’s possible to get a substantial gain from parallelism in Haskell without any very expert knowledge or recondite tricks. But to get the full benefit one has to parallelise *all* the code, not just the first stage in the pipeline. This is of course the more complicated option: my hope for my intended applications is that functions such as `rdeepseq` can automate the plumbing.

#### Further reading

I discovered the `Eval` monad used by `rpar :: Strategy a` (equivalently, `rpar :: a -> Eval a`) from Chapter 2 of *Parallel and Concurrent Programming in Haskell* by Simon Marlow. Chapter 3 goes into much more detail. See also Chapter 24 of *Real World Haskell* by Bryan O’Sullivan, Don Stewart, and John Goerzen.

#### Appendix: Main module

For the record, here is the `Main.hs` module I used to get these timings.

```haskell
module Main where

import System.Environment
import ParTest

-- Tests with r = 1000000
testSumOfSums :: IO ()
testSumOfSums = do
    putStrLn "testSumOfSums"
    rStr : _ <- getArgs
    let r = read rStr :: Integer
    -- putStrLn $ show $ collatzSumOfSums r    -- 316% cpu, 9.582 total
    -- putStrLn $ show $ collatzSumOfSumsP r   -- 304% cpu, 4.962 total
    putStrLn $ show $ collatzSumOfSumsPL r     -- 342% cpu, 4.998 total

-- Tests with t = 2500
testSumOfSumOfSums :: IO ()
testSumOfSumOfSums = do
    tStr : _ <- getArgs
    let t = read tStr :: Integer
    -- putStrLn $ show $ collatzSumOfSumOfSums t        -- 312% cpu, 13.837 total, memory 20Mb
    -- putStrLn $ show $ collatzSumOfSumOfSumsAPHuge t  -- 435% cpu, 7.902 total, memory to 320Mb
    putStrLn $ show $ collatzSumOfSumOfSumsAPSmall t    -- 377% cpu, 7.077 total, memory 20Mb

main :: IO ()
main = testSumOfSums
-- main = testSumOfSumOfSums
```