ByteString and Unicode characters from U+0080 to U+00FF

While viewing HTML files, I got an error:

| A web handler threw an exception. Details:
| Cannot decode byte '\xa0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

The HTML file was served by Snap. Somewhere in its data transfer, UTF-8 decoding with a function from Data.Text goes wrong. The byte that failed to decode, '\xa0', was a no-break space.

No-break space (U+00A0) in UTF-8 is 0xc2 0xa0, two bytes. When decodeUtf8 converts a ByteString, it expects the 0xc2 prefix to arrive together with 0xa0, but when the error above occurred, only the trailing 0xa0 was passed. Later I realised that every character from U+0080 to U+00FF raises a similar exception when it arrives as a single byte, since these non-ASCII characters need an explicit 0xc2 or 0xc3 prefix in UTF-8.
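To make the prefix rule concrete, here is a minimal sketch of how the two UTF-8 bytes are derived from a code point. The helper name twoByteUtf8 is made up for illustration and is not part of any library; it only covers the two-byte range U+0080 to U+07FF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: the UTF-8 lead and continuation bytes for a
-- code point in the two-byte range U+0080..U+07FF, Nothing otherwise.
twoByteUtf8 :: Char -> Maybe (Word8, Word8)
twoByteUtf8 c
  | n >= 0x80 && n <= 0x7FF = Just (lead, cont)
  | otherwise = Nothing
  where
    n = ord c
    -- Lead byte: bits 110xxxxx carrying the upper bits of the code point.
    lead = fromIntegral (0xC0 .|. (n `shiftR` 6))
    -- Continuation byte: bits 10xxxxxx carrying the lower six bits.
    cont = fromIntegral (0x80 .|. (n .&. 0x3F))
```

For U+00A0 this gives (0xc2, 0xa0), and for U+00E7 (ç) it gives (0xc3, 0xa7), which is where the 0xc2 and 0xc3 prefixes come from.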

How can we get rid of this exception?

> module Main where
> import Data.Word (Word8)
> import qualified Data.ByteString as B
> import qualified Data.ByteString.Char8 as C8
> import qualified Data.ByteString.UTF8 as U8
> import qualified Data.Text as T
> import qualified Data.Text.IO as T
> import qualified Data.Text.Encoding as E

Suppose that we have a Latin-1 string:

> latin1s :: String
> latin1s = "français"

Applying decodeUtf8 to Latin-1 characters packed as single bytes raises an exception like the one shown at the top of this write-up.

> printNG :: String -> IO ()
> printNG ss = T.putStrLn $ E.decodeUtf8 $ C8.pack ss

Results in:

| *Main> printNG latin1s
| "*** Exception: Cannot decode byte '\x61': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream

The problem in this case is that the ByteString representation of ç (U+00E7) is the single byte 0xe7, instead of the two bytes 0xc3 0xa7 that UTF-8 requires.
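If the goal is only to avoid the exception, Data.Text.Encoding also exports decodeUtf8', which returns an Either instead of throwing. The sketch below wraps it in a helper; the name safeDecode is an assumption for illustration. It detects the bad input but does not repair it.

```haskell
import qualified Data.ByteString.Char8 as C8
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical helper: decode UTF-8 without throwing.
-- Returns Nothing when the bytes are not valid UTF-8.
safeDecode :: C8.ByteString -> Maybe T.Text
safeDecode bs = either (const Nothing) Just (E.decodeUtf8' bs)
```

With this, safeDecode (C8.pack "fran\231ais") yields Nothing, because the lone byte 0xe7 is not a valid UTF-8 sequence, while pure ASCII input decodes normally.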

When the only characters we care about are ASCII and Latin-1, a simple solution is to pad each high byte with its prefix.

> padLatin1s :: C8.ByteString -> T.Text
> padLatin1s = E.decodeUtf8 . C8.foldr f C8.empty where
>   f c a | '\128' <= c && c < '\192' = C8.cons '\194' . C8.cons c $ a
>         | '\192' <= c && c < '\256' =
>           C8.cons '\195' . C8.cons (toEnum (fromEnum c - 0x40)) $ a
>         | otherwise = C8.cons c a

Now we can decode ByteString data to Text without an exception.

> printPadded :: String -> IO ()
> printPadded ss = T.putStrLn $ padLatin1s $ C8.pack ss

Results in:

| *Main> printPadded latin1s
| français
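Under the same assumption that the input is ASCII and Latin-1 only, the text package offers decodeLatin1 in Data.Text.Encoding, which maps each byte to the code point of the same value. As a sketch, it should give the same result as padLatin1s without re-encoding by hand. The name fromLatin1 is made up for illustration.

```haskell
import qualified Data.ByteString.Char8 as C8
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical name: treat every byte as the Unicode code point
-- with the same numeric value (the Latin-1 interpretation).
fromLatin1 :: C8.ByteString -> T.Text
fromLatin1 = E.decodeLatin1
```

This never throws, since every byte value is a valid Latin-1 character, but it shares the limitation of padding: the input must really be Latin-1.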

However, this padding approach does not work when a document mixes characters between U+0080 and U+00FF with characters outside ASCII and Latin-1, since every byte at or above 0x80 is padded. It may still be useful when converting a limited character set in which each character is a single value from 0x00 to 0xff. Code points above 0xff do not fit in a single byte of a ByteString.

> zhongwen :: String
> zhongwen = "中文"

Padding does not help here:

| *Main> printPadded zhongwen
| -‡

Which function made the change? When we use pack from Data.ByteString.Char8 to convert this String to ByteString, it results in:

| *Main> C8.pack zhongwen
| "-\135"

Which is not what we want in most cases.

pack from Data.ByteString.Char8 truncates each character's code point to its low 8 bits, so we only get values that fit in a Word8.
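The truncation can be reproduced without ByteString at all. The following sketch uses a made-up helper, lowByte, that keeps only the low 8 bits of a code point, which mirrors what happens to each Char during packing.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper mirroring the truncation: converting the code
-- point to Word8 wraps modulo 256, keeping only the low 8 bits.
lowByte :: Char -> Word8
lowByte = fromIntegral . ord
```

For 中 (U+4E2D) the low byte is 0x2d, the ASCII hyphen, and for 文 (U+6587) it is 0x87, decimal 135, which matches the "-\135" shown above.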

Representing these characters in a ByteString is possible, however. It takes the longer, proper UTF-8 byte sequence, three bytes per character here.

> zhongwen' :: [Word8]
> zhongwen' = [0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]

We use Data.ByteString.pack to convert the list of bytes:

> printZhongwen :: IO ()
> printZhongwen = T.putStrLn $ E.decodeUtf8 $ B.pack zhongwen'


| *Main> printZhongwen
| 中文

So what we want here is a UTF-8 aware function that converts a list of Char to a ByteString. The utf8-string package provides one: fromString in Data.ByteString.UTF8.

> printUTF8 :: String -> IO ()
> printUTF8 ss = T.putStrLn $ E.decodeUtf8 $ U8.fromString ss


| *Main> printUTF8 latin1s
| français
| *Main> printUTF8 zhongwen
| 中文
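If adding utf8-string is not desirable, a similar conversion can be sketched with the text package alone by going through Text and calling encodeUtf8. The helper name toUtf8 is an assumption for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical helper: String to UTF-8 encoded bytes using only the
-- text package (String -> Text -> ByteString).
toUtf8 :: String -> B.ByteString
toUtf8 = E.encodeUtf8 . T.pack
```

Unpacking the result for "中文" gives the same six bytes as zhongwen', and decodeUtf8 applied to the output round-trips back to the original text.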

There will be situations with more complicated character encoding issues. For broader Unicode support, text-icu is probably the better choice, as suggested in the Haddock comments of text.

