======================[ Laziness: handle with care ]======================

The idea of lazy computation as the default mode of execution for a
programming language was revolutionary. Lisp and Scheme started with
eager evaluation; it was not until functional programming was already
mature that the idea of lazy computation as its natural companion
emerged---and quickly took over in a progression of languages that led
to Haskell. This story is told in "A History of Haskell: Being Lazy
With Class" by the co-creators of Haskell and modern functional
programming. The central place of laziness is apparent even from the
title---and you will find its beginnings described in the first few
pages:

http://haskell.cs.yale.edu/wp-content/uploads/2011/02/history.pdf

You can trace the origins of this idea to one of the key papers that
argued for it, "CONS should not evaluate its arguments",
http://www.cs.indiana.edu/pub/techreports/TR44.pdf . Observe how the
jargon of PL has changed since this paper---and what remains the same
to this day.

For applications of laziness in JavaScript, Scala, Swift, and Racket
(a popular teaching dialect of Scheme) see
http://matt.might.net/articles/implementing-laziness/

Laziness is in near-perfect correspondence with pattern matching: we
need to evaluate the matched data structure only as far as the pattern
requires. In a world where typed values can only be built by
constructors, and every function that works on them must start by
taking them apart with patterns, this correspondence can save us the
runtime effort of evaluating most of these constructors.

--------------------[ An example: lazy lists ]--------------------

Lists are perhaps the simplest example. Recall that lists come with
the built-in constructors : and [] , and any other way to create or
handle them is "syntactic sugar" that reduces to applying these
constructors and/or pattern-matching on them.

For example, "head (1:2:3:4:5:[])" need not construct the full list;
head's definition, "head (x:_) = x", can make do with just one : ,
ignoring the rest. Moreover, the rest of the list may not even be
defined:

Prelude> head (1:undefined)
1

This ability to evaluate an expression whose parts may be undefined is
referred to as "non-strict semantics". Non-strictness is strongly tied
to laziness, but the two are not quite synonymous, as
https://wiki.haskell.org/Lazy_vs._non-strict explains.

Observe that the above does even less work than printing in the REPL:
an attempt to print 1:undefined fails with an exception,

Prelude> 1:undefined
[1*** Exception: Prelude.undefined

whereas the call to head above succeeds.

Of course, the same idea is behind Haskell's infinite lists like
[1..]. Printing one leads to an infinite loop in the REPL,

Prelude> [1..]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,^C
^C

whereas evaluating functions on such lists makes for all kinds of neat
tricks like those in
http://www.techrepublic.com/article/infinite-list-tricks-in-haskell/

----------------------[ Thunks ]----------------------

Unlike in LISP or Scheme, the default "product" of a Haskell function
or operator such as "+" or ":" is not a value. Rather, it is a
*thunk*: a recipe for computing that value, which will be evaluated
further only if that value is actually needed.

You can think of a thunk as an "environment" data structure paired
with the generic code that computes that kind of value; when called,
the environment guides that generic code to compute this particular
value. This should remind you of continuations, a closely related idea
in which a captured environment data structure guides postponed
execution---except that in Haskell this postponed execution can also
be pattern-matched on.
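You can watch thunks being created and forced in GHCi: its :sprint
command prints a binding's evaluated part and shows each unevaluated
thunk as _ . A small demonstration (the :: [Int] annotation merely
avoids type-defaulting noise):

Prelude> let xs = map (*2) [1..5] :: [Int]
Prelude> :sprint xs
xs = _
Prelude> head xs
2
Prelude> :sprint xs
xs = 2 : _

Printing head xs forced exactly one cons cell and its first element;
everything after it is still the single unevaluated thunk shown as _ .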
----------------------[ Surprises ]----------------------

You should remember that thunks will not just "do what you mean". They
will happily accumulate and consume memory even if you don't intend
them to. For example(*)(**), the simplest kind of recursive sum
function,

(*) this example and some of the further ones are from
    http://eax.me/lazy-evaluation/ (a Russian programmer blog)

(**) the code for this and further examples can be found in hs/lazy/ ,
     and the commands to build and run them in hs/lazy/Makefile (read
     it from the bottom up: I kept adding new targets at the top of
     the file, so that plain "make"---which by default processes the
     first target---runs the newest target's commands)

$ cat mysum.hs
mysum [] = 0
mysum (x:xs) = x + mysum xs

main = putStrLn $ show $ mysum $ take 100000 $ [1..]

will happily accumulate thunks of "+" that consume megabytes of
memory; they are forced only at print time, and then
garbage-collected. This can be seen by compiling the above program
with memory statistics enabled:

$ cat Makefile
mysum: mysum.hs
	ghc -rtsopts mysum.hs
	./mysum +RTS -sstderr

$ make
ghc -rtsopts mysum.hs
[1 of 1] Compiling Main             ( mysum.hs, mysum.o )
Linking mysum ...
./mysum +RTS -sstderr
5000050000
    21,813,256 bytes allocated in the heap
    12,048,632 bytes copied during GC
     6,571,152 bytes maximum residency (5 sample(s))
        25,472 bytes maximum slop
            11 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0        29 colls,     0 par    0.01s    0.01s     0.0003s    0.0006s
Gen  1         5 colls,     0 par    0.01s    0.01s     0.0021s    0.0039s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.01s  (  0.01s elapsed)
GC      time    0.02s  (  0.02s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.02s  (  0.03s elapsed)

%GC     time      71.5%  (71.5% elapsed)

Alloc rate    3,989,256,766 bytes per MUT second

Productivity  28.2% of total user, 22.7% of total elapsed

Note that this simple program uses up 11MB of RAM---certainly not what
you'd expect of a simple integer-summing loop!

You might think that this is due to accumulation of stack frames,
since the mysum function above is not tail-recursive. However, tail
recursion doesn't in fact help:

$ cat mysum-tr.hs
mysum acc [] = acc
mysum acc (x:xs) = mysum (acc+x) xs

main = putStrLn $ show $ mysum 0 $ take 100000 $ [1..]

$ cat Makefile
mysum-tr: mysum-tr.hs
	ghc -rtsopts mysum-tr.hs
	./mysum-tr +RTS -sstderr

$ make
ghc -rtsopts mysum-tr.hs
[1 of 1] Compiling Main             ( mysum-tr.hs, mysum-tr.o )
Linking mysum-tr ...
./mysum-tr +RTS -sstderr
5000050000
    20,994,040 bytes allocated in the heap
    18,621,440 bytes copied during GC
     5,915,616 bytes maximum residency (5 sample(s))
        91,232 bytes maximum slop
            11 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0        35 colls,     0 par    0.01s    0.01s     0.0003s    0.0006s
Gen  1         5 colls,     0 par    0.01s    0.01s     0.0021s    0.0044s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.00s  (  0.01s elapsed)
GC      time    0.02s  (  0.02s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.02s  (  0.03s elapsed)

%GC     time      74.7%  (74.0% elapsed)

Alloc rate    4,797,541,133 bytes per MUT second

Productivity  25.0% of total user, 20.0% of total elapsed

If anything, tail recursion leads to _more_ garbage collection, not
less: ~18MB copied during GC rather than ~12MB! The culprit is still
the accumulating thunks of "+", despite the tail-recursion savings on
stack frames (about ~1MB of overall allocation).
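To see why tail recursion alone cannot help, unfold a few steps of
mysum-tr by hand (a sketch of the reduction, not GHC's literal
evaluation order). Nothing demands the accumulator's value until the
very end, so it grows into a chain of suspended additions:

mysum 0 (1:2:3:...)
  =>  mysum (0+1) (2:3:...)
  =>  mysum ((0+1)+2) (3:...)
  =>  mysum (((0+1)+2)+3) (...)

Each pending "+" in that chain is a heap-allocated thunk. Only when
show finally demands the result do all 100000 additions run, after
which the entire chain becomes garbage, which explains the GC numbers
above.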
----------------------[ Forcing strictness with seq ]----------------------

This problem is so common that Haskell includes the function seq to
force evaluation of thunks. seq evaluates an expression only until it
reaches the first (outermost) constructor, such as that of a cons cell
or a pair, and goes no further.
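The source of mysum-seq.hs is not reproduced here; a minimal version
consistent with its Makefile target would force the accumulator at
every step, along these lines (my reconstruction, not necessarily the
file's exact text):

-- force each new accumulator value before recursing
mysum acc [] = acc
mysum acc (x:xs) = let acc' = acc + x
                   in acc' `seq` mysum acc' xs

main = putStrLn $ show $ mysum 0 $ take 100000 $ [1..]

Here acc' `seq` ... evaluates acc' down to a plain number before the
recursive call, so a chain of "+" thunks never forms. The effect shows
up immediately in the statistics: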
$ cat Makefile
mysum-seq: mysum-seq.hs
	ghc -rtsopts mysum-seq.hs
	./mysum-seq +RTS -sstderr

$ make
ghc -rtsopts mysum-seq.hs
[1 of 1] Compiling Main             ( mysum-seq.hs, mysum-seq.o )
Linking mysum-seq ...
./mysum-seq +RTS -sstderr
5000050000
    14,451,704 bytes allocated in the heap
        16,768 bytes copied during GC
        44,312 bytes maximum residency (2 sample(s))
        21,224 bytes maximum slop
             1 MB total memory in use (0 MB lost due to fragmentation)

                                  Tot time (elapsed)  Avg pause  Max pause
Gen  0        26 colls,     0 par    0.00s    0.00s     0.0000s    0.0001s
Gen  1         2 colls,     0 par    0.00s    0.00s     0.0005s    0.0008s

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    0.00s  (  0.00s elapsed)
GC      time    0.00s  (  0.00s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time    0.00s  (  0.01s elapsed)

%GC     time       8.0%  (18.5% elapsed)

Alloc rate    5,112,028,298 bytes per MUT second

Productivity  90.3% of total user, 63.1% of total elapsed

Much better---this cut the garbage-collected thunks to mere kilobytes!
Note that the list [1..100000] is still costing us a lot (see this by
reducing 100000 to 100 and rerunning), but in this example our target
is the thunks, not the data structure that the list creates.

----------------------[ -XBangPatterns ]----------------------

Besides seq, there are a number of ways to enforce strict evaluation.
The simplest---not in standard Haskell, but available as a language
extension---is marking an argument with ! in the function's pattern
match, to indicate that the argument expression should be evaluated
strictly:

$ cat mysum-bang.hs
-- Uncomment the following to have the code compile without
-- the -XBangPatterns option to ghc:
--{-# LANGUAGE BangPatterns #-}
mysum !acc [] = acc
mysum !acc (x:xs) = mysum (acc+x) xs

main = putStrLn $ show $ mysum 0 $ take 100000 $ [1..]

(run this and see the GC stats)

----------------------[ -O2 ]----------------------

Compiling with -O2 will catch some cases where strict evaluation is
called for. See the mysum-tr-o2 target in hs/lazy/Makefile .

However, -O2 strictness analysis is not magic. The very next target,
mysum-rest, fails to get optimized. In mysum-rest the mysum function
returns a tuple (pair) of the list's sum up to its third argument and
the object representing the rest of the list (in our example, an
infinite list!). The compiler fails to deduce the desired strictness
of the first argument, and the megabytes of GC come back. This example
can be fixed by explicitly specifying strict patterns, as in
mysum-rest-bang , or by using seq, as in mysum-rest-seq .

More about Haskell's strictness analysis:
https://wiki.haskell.org/Performance/Strictness

----------------------[ seq vs deepseq ]----------------------

seq is O(1): it stops at the first constructor. Thus seq works for
arithmetic operators, but stops at, e.g., a pair constructor. Consider
mysum-and-len , which returns the pair of the list's sum and length,
computed in two accumulator arguments. Its thunks cause a ton of
GC---around ~35MB, twice as much as before, despite the tail
recursion---because the additional accumulator argument creates its
own thunks.

This is fixed by using strict patterns as in mysum-and-len-bang ---but
let's try something else. Let's rewrite the function to take its two
accumulators as a pair, and use the seq trick above to try to force
strict evaluation. Unlike in the previous examples of mysum-rest-seq.hs
and mysum-seq.hs, this trick ( mysum-and-len-seq ) does not work! The
forced evaluation caused by seq stops at the tuple constructor---with
the familiar result of a ton of GC.

For just such cases, the deepseq function imported from Control.DeepSeq
breaches this obstacle. Note that deepseq is O(N) in the size N of the
thunk expression, as it fully traverses that expression (cf. the deep
copying of LISP lists). The mysum-and-len-deepseq example shows how.

Note that deepseq in my version of GHC has trouble inferring the
types, and gives "ambiguous type" errors; for this reason, in this one
example we need an explicit type signature for mysum.

This example, too, can be fixed by explicit -XBangPatterns strictness
( mysum-and-len-bang ).

----------------------[ Strict data structures ]----------------------

Finally, another way of enforcing strictness is to declare it in type
definitions. Compare mysum-and-len-spair.hs , which defines its own
pair type for the sum and length of the list, with
mysum-and-len-spair-strict.hs , which stipulates that strict
evaluation must be applied whenever an SPair constructor is called.
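Neither spair file is reproduced here, but the essence of the
difference is a single pair of strictness annotations ( ! ) on the
constructor's fields. A sketch in the spirit of
mysum-and-len-spair-strict.hs (my reconstruction; the actual files may
differ in details):

-- deleting the two !s below gives the lazy variant,
-- and the megabytes of GC come back
data SPair = SPair !Integer !Integer deriving Show

mysum (SPair s l) []     = SPair s l
mysum (SPair s l) (x:xs) = mysum (SPair (s+x) (l+1)) xs

main = putStrLn $ show $ mysum (SPair 0 0) $ take 100000 $ [1..]

Because every recursive call pattern-matches on SPair, the constructor
is forced at each step, and the ! annotations then force both fields,
so neither accumulator can build up a chain of thunks. Strictness
annotations in data declarations are standard Haskell and need no
language extension.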