Browse Source

Vector package

master
Michael Snoyman 6 years ago
parent
commit
50b69ec8e5
  1. 615
      content/vector.md
  2. 2
      outline/intermediate-haskell.md

615
content/vector.md

@ -0,0 +1,615 @@
---
title: The vector package
author: Michael Snoyman <michael@snoyman.com>
description: Overview and typical usage of the vector package
first-written: 2015-10-25
last-updated: 2015-10-25
last-reviewed: 2015-10-25
---
The de facto standard package in the Haskell ecosystem for integer-indexed
array data is the [vector package](http://www.stackage.org/package/vector).
This corresponds at a high level to arrays in C, or the vector class in C++'s
STL. However, the vector package offers quite a bit of functionality not
familiar to those used to options in imperative and mutable languages.
While the interface for vector is relatively straightforward, the abundance of
different modules can be daunting. This article will start off with an overview
of terminology to guide you, and then step through a number of concrete
examples of using the package.
## Example
Since we're about to jump into a few section of descriptive text, let's kick
this off with a concrete example of whet your appetite. We're going to count
the frequency of different bytes that appear on standard output, and then
display this content.
Note that this example is purposely written in a very generic form. We'll build
up to handling this form throughout this article.
```haskell
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector.Generic.Mutable as M
import qualified Data.Vector.Unboxed as U
import Data.Word (Word8)
main :: IO ()
main = do
-- Get all of the contents from stdin
lbs <- L.getContents
-- Create a new 256-size mutable vector
-- Fill the vector with zeros
mutable <- M.replicate 256 0
-- Add all of the bytes from stdin
addBytes mutable lbs
-- Freeze to get an immutable version
vector <- U.unsafeFreeze mutable
-- Print the frequency of each byte
-- In newer vectors: we can use imapM_
U.zipWithM_ printFreq (U.enumFromTo 0 255) vector
addBytes :: (PrimMonad m, M.MVector v Int)
=> v (PrimState m) Int
-> L.ByteString
-> m ()
addBytes v lbs = mapM_ (addByte v) (L.unpack lbs)
addByte :: (PrimMonad m, M.MVector v Int)
=> v (PrimState m) Int
-> Word8
-> m ()
addByte v w = do
-- Read out the old count value
oldCount <- M.read v index
-- Write back the updated count value
M.write v index (oldCount + 1)
where
-- Indices in vectors are always Ints. Our bytes come in as Word8, so we
-- need to convert them.
index :: Int
index = fromIntegral w
printFreq :: Int -> Int -> IO ()
printFreq index count = putStrLn $ concat
[ "Frequency of byte "
, show index
, ": "
, show count
]
```
## Terminology
There are two different varieties of vectors: immutable and mutable. Immutable
vectors (such as provided by the `Data.Vector` module) are essentially
swappable with normal lists in Haskell, though with drastically different
performance characteristics (discussed below). The high-level API is similar to
lists, it implements common typeclasses like `Functor` and `Foldable`, and
plays quite nicely with parallel code.
By contrast, mutable vectors are much closer to C-style arrays. Operations
working on these values must live in the `IO` or `ST` monads (see `PrimMonad`
below for more details). Concurrent access from multiple threads has all of the
normal concerns of shared mutable state. And perhaps most importantly for
usage: mutable vectors can be *much* more efficient for certain use cases.
However, that's not the only dimension of choice you get in the vector package.
vector itself defines three flavors: unboxed
(`Data.Vector`/`Data.Vector.Mutable`), storable (`Data.Vector.Storable` and
`Data.Vector.Storable.Mutable`), and unboxed (`Data.Vector.Unboxed` and
`Data.Vector.Unboxed.Mutable`). (There's also technically primitive vectors,
but in practice you should always prefer unboxed vectors; see the module
documentation for more information on the distinction here.)
And our final point: in addition to having these three flavors, the vector
package provides a typeclass-based interface which allows you to write code
that works in any of these three (plus other vector types that may be defined
in other packages, like
[hybrid-vectors](http://www.stackage.org/package/hybrid-vectors)). These
interfaces are in `Data.Vector.Generic` and `Data.Vector.Generic.Mutable`. When
using these interfaces, you must still eventually choose a concrete
representation, but your helper code can be agnostic to what it is.
What's nice is that - with small differences - all four mutable modules have
the same interface, and all four immutable modules have the same interface.
This means you can focus on learning one type of vector, and almost for free
have that knowledge apply to other types as well. It then just becomes a
question of choosing the representation that best fits your use case, which
we'll get to shortly.
## Efficiency
Standard lists in Haskell are immutable, singly-linked lists. Every time you
add another value to the front of the list, it has to allocate another heap
object for that cell, create a pointer to the head of the original list, and
create a pointer to the value in the current cell. This takes up a lot of
memory for holding pointers, and makes it inefficient to index or traverse the
list (indexing to position N requires N pointer dereferences).
By contract, vectors are stored in a packed format in memory, meaning indexing
is an O(1) operation, and the memory overhead per additional item in the vector
is much smaller (depending on the type of vector, which we'll cover in a
moment). However, compared to lists, appending an item to a vector is
relatively expensive: it requires creating a new buffer in memory, copying the
old values, and then adding the new value.
There are other data structures that can be considered for list-like data, such
as `Seq` from containers, or in some cases a `Set`, `IntMap`, or `Map`.
Figuring out the best choice for each use case can only be reliably determined
via profiling. But as a general rule: densely populated lists with integral
access to the values will be best served by vector.
Now let's talk about some of the other things that make vector so efficient.
### Boxed, storable and unboxed
Boxed vectors hold normal Haskell values. These can be _any_ values at all, and
are stored on the heap with pointers kept in the vector. The advantage is that
this works for all datatypes, but the extra memory overhead for the pointers
and the indirection of needing to dereference those pointers makes them
(relative to the next two types) inefficient.
Storable and unboxed vectors both store their data in a byte array, avoiding
pointer indirection. This is more memory efficient and allows better usage of
caches. The distinction between storable and unboxed vectors is subtle:
* Storable vectors require data which is an instance of the [`Storable` type
class](http://haddock.stackage.org/lts-3.11/base-4.8.1.0/Foreign-Storable.html#t:Storable).
This data is stored in `malloc`ed memory, which is *pinned* (the garbage
collector can't move it around). This can lead to memory fragmentation, but
allows the data to be shared over the C FFI.
* Unboxed vectors require data which is an instance of the [`Prim` type
class](http://haddock.stackage.org/lts-3.11/primitive-0.6.1.0/Data-Primitive-Types.html).
This data is stored in GC-managed *unpinned* memory, which helps avoid memory
fragmentation. However, this data cannot be shared over the C FFI.
Both the `Storable` and `Prim` typeclasses provide a way to store a value as
bytes, and to load bytes into a value. The distinction is what type of
bytearray is used.
As usual, the only true measure of performance will be benchmarking. However,
as a general guideline:
* If you don't need to pass values to a C FFI, and you have a `Prim` instance,
use unboxed vectors.
* If you have a `Storable` instance, use a storable vector.
* Otherwise, use a boxed vector.
There are also other issues to consider, such as the fact that boxed vectors
are instances of `Functor` while storable and unboxed vectors are not.
### Stream fusion
Take a guess how much memory the following program will take to run:
```haskell
import qualified Data.Vector.Unboxed as V
main :: IO ()
main = print $ V.sum $ V.enumFromTo 1 (10^9 :: Int)
```
A valid guess may be `10^9 * sizeof int` bytes. However, when compiled with
optimizations (`-O2`) on my system, it allocates a total of only 52kb! How it
is possible to create a one billion integer array without using up 4-8GB of
memory?
The vector package has a powerful technique: stream fusion. Using GHC rewrite
rules, it's able to find many cases where creating a vector is unnecessary, and
instead create a tight inner loop. In our case, GHC will end up generating code
that can avoid touching system memory, and instead work on just the registers,
yielding not only a tiny memory footprint, but performance close to a for-loop
in C. This is one of the beauties of this library: you get to write high-level
code, and optimizations can churn out something much more CPU-friendly.
### Slicing
Above we discussed the problem of appending values to the front of a vector.
However, one place where vector shines is with *slicing*, or taking a subset of
the vector. When dealing with immutable vectors, slicing is a safe operation,
with slices being sharable with multiple threads. Slicing also works with
mutable vectors, but as usual you need to be a bit more careful.
## Replacing lists
Enough talk! Let's start using vector. Assuming you're familiar with the list
API, this should looke rather boring.
```haskell
import qualified Data.Vector as V
main :: IO ()
main = do
let list = [1..10] :: [Int]
vector = V.fromList list :: V.Vector Int
vector2 = V.enumFromTo 1 10 :: V.Vector Int
print $ vector == vector2 -- True
print $ list == V.toList vector -- also True
print $ V.filter odd vector -- 1,3,5,7,9
print $ V.map (* 2) vector -- 2,4,6,...,20
print $ V.zip vector vector -- (1,1),(2,2),...(10,10)
print $ V.zipWith (*) vector vector -- (1,4,9,16,...,100)
print $ V.reverse vector -- 10,9,...,1
print $ V.takeWhile (< 6) vector -- 1,2,3,4,5
print $ V.takeWhile odd vector -- 1
print $ V.takeWhile even vector -- []
print $ V.dropWhile (< 6) vector -- 6,7,8,9,10
print $ V.head vector -- 1
print $ V.tail vector -- 2,3,4,...,10
print $ V.head $ V.takeWhile even vector -- exception!
```
Hopefully there's nothing too surprising about this. Most `Prelude` functions
that apply to lists have a corresponding vector function. If you know what a
function does in `Prelude`, you probably know what it does in `Data.Vector`.
This is the simplest usage of the vector package: import `Data.Vector`
qualified, convert to/from lists with `V.fromList` and `V.toList`, and then
prefix your function calls with `V.`.
* Exercise 1: Try out some other functions available in the [`Data.Vector`
module](http://haddock.stackage.org/lts-3.11/vector-0.10.12.3/Data-Vector.html).
In particular, try some of the fold functions, which we haven't covered here.
* Exercise 2: Try using the `Functor`, `Foldable`, and `Traversable` versions of
functions with a vector
* Exercise 3: Use an unboxed (or storable) vector instead of the boxed vectors
we were using above. What code did you have to change from the original
example? Do your examples from exercise 2 all work still?
There are also a number of functions in the `Data.Vector` module with no
corresponding function in `Prelude`. Many of these are related to mutable
vectors (which we'll cover shortly). Others are present to provide more
efficient means of manipulating a vector, based on their special in-memory
representation.
## Mutable vectors
I want to test how fair the `System.Random` number generator is at generating
numbers between 0 and 9, inclusive. I want to generate 1,000,000 random values,
count the frequency of each result, and then print how often each value
appeared. Let's first implement this using immutable vectors:
```haskell
import Data.Vector.Unboxed ((!), (//))
import qualified Data.Vector.Unboxed as V
import System.Random (randomRIO)
main :: IO ()
main = do
let v0 = V.replicate 10 (0 :: Int)
loop v 0 = return v
loop v rest = do
i <- randomRIO (0, 9)
let oldCount = v ! i
v' = v // [(i, oldCount + 1)]
loop v' (rest - 1)
vector <- loop v0 (10^6)
print vector
```
We've introduced the `!` operator for indexing, and the `//` operator for
updating. Other than that, this is fairly straightforward code. When I ran this
on my system, it had 48MB maximum memory residency, and took 1.968s to
complete. Surely we can do better.
This problem is inherently better as a mutable state one: instead of generating
a new immutable `Vector` for each random number generated, we'd like to simply
increment a piece of memory. Let's rewrite this to use a mutable, unboxed
vector:
```haskell
import Control.Monad (replicateM_)
import Data.Vector.Unboxed (freeze)
import qualified Data.Vector.Unboxed.Mutable as V
import System.Random (randomRIO)
main :: IO ()
main = do
vector <- V.replicate 10 (0 :: Int)
replicateM_ (10^6) $ do
i <- randomRIO (0, 9)
oldCount <- V.read vector i
V.write vector i (oldCount + 1)
ivector <- freeze vector
print ivector
```
Once again, we use `replicate` to create a size-10 vector filled with 0. But
now we've created a mutable vector (note the change in import). We then use
`replicateM_` to perform the inner action 1,000,000 times, namely: generate a
random index, read the old value at that index, increment it, and write it
back.
After we're finished, we _freeze_ the vector (more on that in the next section)
and print it. The results are the same (or close - we are dealing with random
numbers here) to the previous immutable one. But instead of 48MB and 1.968s,
this program has a maximum residency of 44KB and runs in 0.247s! That's a
significant improvement!
If we feel like being even more adventurous, we can replace our `read` and
`write` calls with `unsafeRead` and `unsafeWrite`. That will disable some
bounds checks before reading and writing. This can be a nice performance boost
in very tight loops, but has the potential to segfault your program, so caveat
emptor! For example, try replacing `replicate 10` with `replicate 9`, change
the `read` for an `unsafeRead`, and run your program. You'll see something
like:
```
internal error: evacuate: strange closure type -1944718914
(GHC version 7.10.2 for x86_64_unknown_linux)
Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
Aborted (core dumped)
```
The same logic applies to the other `unsafe` functions in vector. The
nomenclature means: `unsafe` may segfault your whole process, while
not-marked-unsafe may just throw an impure exception (also not great, but
certainly better than a segfault).
And if you were curious: on my system using `unsafeRead` and `unsafeWrite`
speeds the program up marginally, from 0.247s to 0.233s. In our example, most
of our time is spent on generating the random numbers, so taking off the safety
checks does not have a significant impact.
## Freezing and thawing
We used the `freeze` function above. The behavior of this may not be
immediately obvious. When you freeze a mutable vector, what happens is:
1. A new mutable vector of the same size is created
2. Each value in the original mutable vector is copied to the new mutable vector
3. A new immutable vector is created out of the memory space used by the new mutable vector
Why not just freeze it in place? Two reasons, actually:
1. It has the potential to break referential transparency. Consider this code:
```haskell
import Data.Vector.Unboxed (freeze)
import qualified Data.Vector.Unboxed.Mutable as V
main :: IO ()
main = do
vector <- V.replicate 1 (0 :: Int)
V.write vector 0 1
ivector <- freeze vector
print ivector
V.write vector 0 2
print ivector
```
If we froze the vector in-place in the call to `freeze`, then the second
`write` call would modify our `ivector` value, meaning that the first and
second call to `print ivector` would have different results!
2. When you freeze a mutable vector, its memory is marked different for
garbage collection purposes. Later trying to write to that same memory can
lead to a segfault.
However, if you really want to avoid that extra buffer copy, and are certain
it's safe, you can use `unsafeFreeze`. And in fact, our random number example
above is a case where `freeze` can be safely replaced by `unsafeFreeze`, since
after the freeze, the original mutable vector is never used again.
* Exercise 1: Go ahead and make that swap and confirm that your program works
as expected.
* Exercise 2: In the program just above (with `V.replicate 1 (0 :: Int)`),
replace `freeze` with `unsafeFreeze`. What result do you see?
The opposite of `freeze` is `thaw`. Similar to `freeze`, `thaw` will copy to a
new mutable vector instead of exposing the current memory buffer. And also,
like `freeze`, there's an `unsafeThaw` that turns off the safety measures. Like
everything `unsafe`: caveat emptor!
(We'll cover some functions like `create` that provide safe wrappers around
`unsafeFreeze` and `unsafeThaw` later.)
## PrimMonad
If you look at the mutation functions we used above like `read` and `write`,
you can tell that they were looking in the `IO` monad. However, vector is more
generic than that, and will allow your mutations to live in any *primitive
monad*, meaning: `IO`, strict `ST s`, and transformers sitting on top of those
two. The type class controlling this is `PrimMonad`.
You can get more information on `PrimMonad` in the [Primitive
Haskell](primitive-haskell.md) article. Without diving into details: every
primitive monad also has an associated primitive state token type, which is
captured with `PrimState`. As a result, the type signatures for `read` and
`write` (for boxed vectors) look like:
```haskell
read :: PrimMonad m => MVector (PrimState m) a -> Int -> m a
write :: PrimMonad m => MVector (PrimState m) a -> Int -> a -> m ()
```
Every mutable vector takes two type parameters: the state token of the monad it
lives in, and the type of value it holds. These gymnastics may seem overkill
now, but are necessary for making mutable vectors both versatile in multiple
monads, and type safe.
## modify and the ST monad
Let's check out a particularly complicated type signature (for unboxed vectors):
```haskell
modify :: Unbox a => (forall s. MVector s a -> ST s ()) -> Vector a -> Vector a
```
What this function does is:
1. Creates a new mutable buffer the same length as the original vector
2. Copies the values from the original vector into the new mutable vector
3. Runs the provided `ST` action on the provided mutable vector
4. Unsafely freezes the mutable vector and returns it.
What's great about this function is that it does the minimal amount of buffer
copying to be safe, and that it can be used from pure code (since all
side-effects are captured inside the `ST` action you provide).
* Exercise 1: Steps 1 and 2 should look pretty similar to a function we
discussed above. Can you figure out which one it is?
* Exercise 2: Implement `modify` yourself using functions we've discussed and
`runST` from `Control.Monad.ST`.
Let's use our new function to implement a Fisher-Yates shuffle. If we start
with a vector of size 20, we'll generate a random number between 0 and 19. Then
we'll swap position 19 with that generated random number. Then we'll loop, but
this time with a random number between 0 and 18 and swapping with position 18.
We continue until we get down to 0.
```haskell
import Control.Monad.Primitive (PrimMonad, PrimState)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M
import System.Random (StdGen, getStdGen, randomR)
shuffleM :: (PrimMonad m, V.Unbox a)
=> StdGen
-> Int -- ^ count to shuffle
-> M.MVector (PrimState m) a
-> m ()
shuffleM _ i _ | i <= 1 = return ()
shuffleM gen i v = do
M.swap v i' index
shuffleM gen' i' v
where
(index, gen') = randomR (0, i') gen
i' = i - 1
shuffle :: V.Unbox a
=> StdGen
-> V.Vector a
-> V.Vector a
shuffle gen vector = V.modify (shuffleM gen (V.length vector)) vector
main :: IO ()
main = do
gen <- getStdGen
print $ shuffle gen $ V.enumFromTo 1 (20 :: Int)
```
Notice how `shuffleM` is a mutable, side-effecting function. However, `shuffle`
itself is pure.
## Generic
After everything else we've dealt with, `Generic` is a relatively easy
addition. We introduce two new typeclasses:
```haskell
class MVector v a
class MVector (Mutable v) a => Vector v a
```
Said in English: an instance `MVector v a` is a mutable vector of type `v` that
can hold values of type `a`. The `Vector v a` is the immutable counterpart to
some mutable vector. You can find the mutable version with `Mutable v`.
One important thing to keep in mind is *kinds*. The kind of the `v` is `MVector
v a` is `* -> * -> *`, since it takes parameters for both the state token and
the value it holds. With the immutable `Vector v a`, the `v` is of kind `* ->
*`. Was that a little abstract? No problem, some type signatures should help:
```haskell
length :: MVector v a => v s a -> Int
length :: Vector v a => v a -> Int
read :: (PrimMonad m, MVector v a) => v (PrimState m) a -> Int -> m a
```
It takes a bit of time to get used to these generic classes, but once you do
it's fairly easy to use them. The best advice is to practice! And as such:
* Exercise: modify the `shuffle` program above to work on a generic vector
instead of specifically on an unboxed vector.
The final trick when working with generic vectors is that, ultimately, you will
need to provide a concrete type. If you forget to do so, you'll end up with
error messages that look like the following:
```haskell
stream.hs:28:13:
No instance for (V.Vector v0 Int) arising from a use of ‘shuffle’
In the expression: shuffle gen
In the second argument of ‘($)’, namely
‘shuffle gen $ V.enumFromTo 1 (20 :: Int)’
In a stmt of a 'do' block:
print $ shuffle gen $ V.enumFromTo 1 (20 :: Int)
```
## vector-algorithms
A package of note is
[vector-algorithms](http://www.stackage.org/package/vector-algorithms), which
provides some algorithms (mostly sort) on mutable vectors. For example, let's
generate 100 random numbers and then sort them.
```haskell
import Data.Vector.Algorithms.Merge (sort)
import qualified Data.Vector.Generic.Mutable as M
import qualified Data.Vector.Unboxed as V
import System.Random (randomRIO)
main :: IO ()
main = do
vector <- M.replicateM 100 $ randomRIO (0, 999 :: Int)
sort vector
V.unsafeFreeze vector >>= print
```
* Exercise 1: write a helper function `sortImmutable` that uses `modify` and `sort` from vector-algorithms to sort an immutable vector safely
* Exercise 2: rewrite the main function above to use `sortImmutable` and only the immutable vector API
* Exercise 3: is your new version more efficient, less efficient, or the same? Explain.
## mwc-random
One final library to mention now is mwc-random, a random number generation
library built on top of vector and primitive. Its API can be a bit daunting
initially, but given your newfound understanding of the vector package, the API
might make a lot more sense now. It provides a `Gen s` type, where `s` is some
state token. You can then use `uniform` and `uniformR` to get random numbers
out of that generator.
As a final example, here's how we can shuffle the numbers 1-20 using
mwc-random.
```haskell
import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as M
import System.Random.MWC (Gen, uniformR, withSystemRandom)
shuffleM :: V.Unbox a
=> Gen s
-> Int -- ^ count to shuffle
-> M.MVector s a
-> ST s ()
shuffleM _ i _ | i <= 1 = return ()
shuffleM gen i v = do
index <- uniformR (0, i') gen
M.swap v i' index
shuffleM gen i' v
where
i' = i - 1
main :: IO ()
main = do
vector <- withSystemRandom $ \gen -> do
vector <- V.unsafeThaw $ V.enumFromTo 1 (20 :: Int)
shuffleM gen (M.length vector) vector
V.unsafeFreeze vector
print vector
```

2
outline/intermediate-haskell.md

@ -37,7 +37,7 @@ follow the rest of this outline in particular.
Covers some of the most commonly used data structures in Haskell, and the
libraries providing them.
* vector (cover vector-algorithms)
* [vector](../content/vector.md)
* containers
* unordered-containers
* text (cover text-icu)

Loading…
Cancel
Save