8c6794b6.github.io

Cause of 3 nano seconds difference

Unboxed Vector, in vector package is efficient data structure. Iterating over unboxed value is pretty fast.

Is there any faster way to iterate than unboxed vector?

Interest of comparision here is efficiency of iteration. Not the body contents of work in each iteration, but overhead we get from the loop.

> {-# LANGUAGE BangPatterns #-}
> module Main where
>
> import Criterion.Main (bench, defaultMain, nf)
> import qualified Data.Vector.Unboxed as U

Couple type synonyms, and body function to apply inside the loop.

> type Updater a = a -> a
> type Work a = Updater a -> a -> Int -> a
>
> add1 :: Updater Int
> add1 = (+1)

Using unboxed vector to loop for given time. Using foldl' and replicate.

> work01 :: Work Int
> work01 f n m = U.foldl' (\x _ -> f x) n (U.replicate m ())

Invoking work01 with add1. Below will apply add1 to 0 for 100 times.

ghci> work01 add1 0 100
100

Manual recursive loop function with wrapper strict data type.

> data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Int
>
> work02 :: Work Int
> work02 f n0 m = go (P 0 n0) where
>   {-# INLINE go #-}
>   go !(P i n) | i == m    = n
>               | otherwise = go (P (i+1) (f n))

Again, applying add1 to 0 for 100 times.

ghci> work02 add1 0 100
100

Main for taking benchmark looks like below.

> main_one :: IO ()
> main_one = do
>   let k = 2 ^ 10
>   defaultMain
>     [ bench "unboxed vector" (nf (work01 add1 0) k)
>     , bench "hand written recursion" (nf (work02 add1 0) k)
>     ]

Compiling with couple optimization options:

$ ghc -O3 -fllvm -optl-O3 -main-is main_one -o a.out loop.lhs

The benchmark result:

benchmarking unboxed vector
mean: 52.56010 ns, lb 52.49446 ns, ub 52.64269 ns, ci 0.950
std dev: 376.4229 ps, lb 310.1393 ps, ub 458.0804 ps, ci 0.950

benchmarking hand written recursion
mean: 55.53802 ns, lb 55.51631 ns, ub 55.56074 ns, ci 0.950
std dev: 113.1906 ps, lb 103.0162 ps, ub 125.4034 ps, ci 0.950

In this micro benchmark, looping with unboxed vector was about 3 nano seconds faster than manual strict recursion.

What is the cause of this 3 nano seconds difference? Seems like it's a goot time to have a look at GHC core. Dumped core of work01 looks like below:

Main.work01 :: Main.Work GHC.Types.Int
[GblId,
 Arity=3,
 Caf=NoCafRefs,
 Str=DmdType LS(A)U(L)m,
 Unf=Unf{Src=InlineStable, TopLvl=True, Arity=0, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=True)
         Tmpl= Main.work1
               `cast` (<Main.Updater GHC.Types.Int>
                       -> <GHC.Types.Int>
                       -> <GHC.Types.Int>
                       -> Data.Vector.Fusion.Util.NTCo:Id <GHC.Types.Int>
                       :: (Main.Updater GHC.Types.Int
                           -> GHC.Types.Int
                           -> GHC.Types.Int
                           -> Data.Vector.Fusion.Util.Id GHC.Types.Int)
                            ~#
                          (Main.Updater GHC.Types.Int
                           -> GHC.Types.Int -> GHC.Types.Int -> GHC.Types.Int))}]
Main.work01 =
  Main.work1
  `cast` (<Main.Updater GHC.Types.Int>
          -> <GHC.Types.Int>
          -> <GHC.Types.Int>
          -> Data.Vector.Fusion.Util.NTCo:Id <GHC.Types.Int>
          :: (Main.Updater GHC.Types.Int
              -> GHC.Types.Int
              -> GHC.Types.Int
              -> Data.Vector.Fusion.Util.Id GHC.Types.Int)
               ~#
             (Main.Updater GHC.Types.Int
              -> GHC.Types.Int -> GHC.Types.Int -> GHC.Types.Int))

We don't find the actual work in Main.work01 from dumped result. It's function body shows that work01 is calling cast with Main.work1, which should doing the actual work.

The dumped core of work02 looks like this

Main.work02 [InlPrag=INLINE[0]] :: Main.Work GHC.Types.Int
[GblId,
 Arity=3,
 Caf=NoCafRefs,
 Str=DmdType LU(L)U(L)m,
 Unf=Unf{Src=Worker=Main.$wwork02, TopLvl=True, Arity=3, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=False)
         Tmpl= \ (w_s2Uw [Occ=Once] :: Main.Updater GHC.Types.Int)
                 (w1_s2Ux [Occ=Once!] :: GHC.Types.Int)
                 (w2_s2UB [Occ=Once!] :: GHC.Types.Int) ->
                 case w1_s2Ux of _ { GHC.Types.I# ww_s2Uz [Occ=Once] ->
                 case w2_s2UB of _ { GHC.Types.I# ww1_s2UD [Occ=Once] ->
                 case Main.$wwork02 w_s2Uw ww_s2Uz ww1_s2UD
                 of ww2_s2UH { __DEFAULT ->
                 GHC.Types.I# ww2_s2UH
                 }
                 }
                 }}]
Main.work02 =
  \ (w_s2Uw :: Main.Updater GHC.Types.Int)
    (w1_s2Ux :: GHC.Types.Int)
    (w2_s2UB :: GHC.Types.Int) ->
    case w1_s2Ux of _ { GHC.Types.I# ww_s2Uz ->
    case w2_s2UB of _ { GHC.Types.I# ww1_s2UD ->
    case Main.$wwork02 w_s2Uw ww_s2Uz ww1_s2UD
    of ww2_s2UH { __DEFAULT ->
    GHC.Types.I# ww2_s2UH
    }
    }
    }

Main.$wwork02
  :: Main.Updater GHC.Types.Int
     -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId,
 Arity=3,
 Caf=NoCafRefs,
 Str=DmdType LLL,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Arity=3, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [60 0 0] 152 0}]
Main.$wwork02 =
  \ (w_s2VX :: Main.Updater GHC.Types.Int)
    (ww_s2W0 :: GHC.Prim.Int#)
    (ww1_s2W4 :: GHC.Prim.Int#) ->
    letrec {
      $wgo1_s2Xg [Occ=LoopBreaker]
        :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
      [LclId, Arity=2, Str=DmdType LL]
      $wgo1_s2Xg =
        \ (ww2_s2VN :: GHC.Prim.Int#) (ww3_s2VO :: GHC.Prim.Int#) ->
          case GHC.Prim.==# ww2_s2VN ww1_s2W4 of _ {
            GHC.Types.False ->
              case w_s2VX (GHC.Types.I# ww3_s2VO) of _ { GHC.Types.I# tpl1_B6 ->
              $wgo1_s2Xg (GHC.Prim.+# ww2_s2VN 1) tpl1_B6
              };
            GHC.Types.True -> ww3_s2VO
          }; } in
    $wgo1_s2Xg 0 ww_s2W0

After taking a closer look, I realised that i == m in guard of work02 is carrying m, which is passed through the inner functions found in the core. This lead $wwork02 to compare ww2_s2VN and ww1_s2w4, which both of them passed from argument in work02.

Rewriting the guard to compare with constant 0 instead of m:

> work03 :: Work Int
> work03 f n0 m = go (P m n0) where
>   {-# INLINE go #-}
>   go !(P i n) | i == 0    = n
>               | otherwise = go (P (i-1) (f n))

New main for taking benchmark:

> main_two :: IO ()
> main_two = do
>   let k = 2 ^ 10
>   defaultMain
>     [ bench "unboxed vector" (nf (work01 add1 0) k)
>     , bench "hand written recursion, take 2" (nf (work03 add1 0) k)
>     ]

Compiling:

$ ghc -O3 -fllvm -optl-O3 -main-is main_two -o b.out loop.lhs

And running the benchmark:

benchmarking unboxed vector
mean: 49.86483 ns, lb 49.84522 ns, ub 49.88565 ns, ci 0.950
std dev: 103.5967 ps, lb 91.78539 ps, ub 117.8307 ps, ci 0.950

benchmarking hand written recursion, take 2
mean: 48.45895 ns, lb 48.37886 ns, ub 48.61787 ns, ci 0.950
std dev: 554.6886 ps, lb 337.5092 ps, ub 946.0307 ps, ci 0.950

Now the hand written recursion performs about 1.4 nano second faster than unboxed vector. Whenever possible, compare with constant in guards. And, if few nano seconds does not matter, just use unboxed vector.

TAGGED: haskell, performance