While viewing HTML files, I got an error:
| A web handler threw an exception. Details:
| Cannot decode byte '\xa0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
The HTML file was served by Snap.
Somewhere inside its data transfer, something goes wrong during UTF-8
decoding in the text package; the error message names
Data.Text.Encoding.Fusion.streamUtf8.
The byte that failed to decode, '\xa0', is a no-break space.
A no-break space in UTF-8 is 0xc2 0xa0, expressed with 2 bytes. When
decoding a ByteString, the decoder expects the 0xc2 prefix to be
passed together with 0xa0, but when the above error occurred, only the
latter was passed. Later I realised that every character with a Unicode
code point from U+0080 to U+00FF raises a similar exception, since those
non-ASCII characters need an explicit prefix byte (0xc2 or 0xc3) in UTF-8.
How can we get rid of this exception?

> module Main where
>
> import Data.Word (Word8)
> import qualified Data.ByteString as B
> import qualified Data.ByteString.Char8 as C8
> import qualified Data.ByteString.UTF8 as U8
> import qualified Data.Text as T
> import qualified Data.Text.IO as T
> import qualified Data.Text.Encoding as E

Suppose that we have a Latin-1 string:

> latin1s :: String
> latin1s = "français"

Applying decodeUtf8 to Latin-1 characters raises an exception, as shown
at the top of this post.

> printNG :: String -> IO ()
> printNG ss = T.putStrLn $ E.decodeUtf8 $ C8.pack ss

| *Main> printNG latin1s
| "*** Exception: Cannot decode byte '\x61': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream
The problem in this case was that the ByteString representation of ç
was expressed with a single byte instead of 2 bytes.
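To make the difference visible, here is a small sketch that lists the
bytes produced by each conversion. The names char8Bytes and utf8Bytes are
made up for this post; the sketch relies on the imports at the top of this
file and assumes that E.encodeUtf8 is available in the installed text
package.

> -- Hypothetical helpers, not part of any library.
> -- Bytes produced by Char8.pack: one byte per character.
> char8Bytes :: String -> [Word8]
> char8Bytes = B.unpack . C8.pack
>
> -- Bytes produced by encoding the same characters as UTF-8.
> utf8Bytes :: String -> [Word8]
> utf8Bytes = B.unpack . E.encodeUtf8 . T.pack

| *Main> char8Bytes latin1s
| [102,114,97,110,231,97,105,115]
| *Main> utf8Bytes latin1s
| [102,114,97,110,195,167,97,105,115]

The single byte 231 (0xe7) from Char8 becomes the pair 195 167
(0xc3 0xa7) in UTF-8.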
When the characters we care about are ASCII and Latin-1 only, a simple solution is to pad the prefix.

> padLatin1s :: C8.ByteString -> T.Text
> padLatin1s = E.decodeUtf8 . C8.foldr f C8.empty where
>   f c a | '\128' <= c && c < '\192' = C8.cons '\194' . C8.cons c $ a
>         | '\192' <= c && c < '\256' =
>           C8.cons '\195' . C8.cons (toEnum (fromEnum c - 0x40)) $ a
>         | otherwise = C8.cons c a

Now we can decode ByteString data to Text without an exception.

> printPadded :: String -> IO ()
> printPadded ss = T.putStrLn $ padLatin1s $ C8.pack ss

| *Main> printPadded latin1s
| français
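If the installed text package provides E.decodeLatin1 (an assumption
about the version in use), the same result should be reachable without
hand-written padding. The name printLatin1 is made up for this sketch.

> -- Hypothetical variant of printPadded: treat every byte as one
> -- Latin-1 character instead of padding UTF-8 prefixes by hand.
> printLatin1 :: String -> IO ()
> printLatin1 ss = T.putStrLn $ E.decodeLatin1 $ C8.pack ss

Like padLatin1s, this only makes sense when every byte stands for a
single ASCII or Latin-1 character.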
However, the padding approach does not work when a document mixes
characters between U+0080 and U+00FF with non-ASCII, non-Latin-1
characters, since the prefix is always padded. It may still be useful
when converting a limited set of characters, each of which can be
expressed with a number up to 0xff. Inputs that use code points above
0xff are not representable with a single byte each:

> zhongwen :: String
> zhongwen = "中文"

Padding does not make sense:
| *Main> printPadded zhongwen
| -
Which function made the change? When we use
Data.ByteString.Char8 to convert this String to a ByteString, the result is:
| *Main> C8.pack zhongwen
| "-\135"
Which is not what we want in most cases.
pack from Data.ByteString.Char8 truncates the code point of each
character, so we get a value that fits in a single byte.
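The truncation can be sketched by hand. truncateChar is a made-up name,
and the sketch assumes that keeping only the low 8 bits of the code point
is what Char8.pack does to each character.

> -- Hypothetical helper: keep only the low 8 bits of a code point.
> -- fromIntegral from Int to Word8 wraps around modulo 256.
> truncateChar :: Char -> Word8
> truncateChar = fromIntegral . fromEnum

| *Main> map truncateChar zhongwen
| [45,135]

U+4E2D is cut down to 0x2d (45, the '-' character) and U+6587 to 0x87
(135), which matches the "-\135" above.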
Representing these characters with a ByteString is possible, however. It needs a longer, proper sequence of bytes.

> zhongwen' :: [Word8]
> zhongwen' = [0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87]

We use Data.ByteString.pack to convert the list of bytes:

> printZhongwen :: IO ()
> printZhongwen = T.putStrLn $ E.decodeUtf8 $ B.pack zhongwen'

| *Main> printZhongwen
| 中文
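Where do those six bytes come from? As a rough sketch, the UTF-8 byte
sequence of a single character can be computed from its code point with
plain arithmetic. encodeCharUtf8 is a made-up name; the sketch ignores
invalid code points such as surrogates and is meant for illustration only.

> -- Hypothetical helper: UTF-8 bytes of one character.
> -- The first byte carries a length marker plus the high bits; each
> -- following byte carries 6 more bits.
> encodeCharUtf8 :: Char -> [Word8]
> encodeCharUtf8 c
>   | n < 0x80    = [w n]
>   | n < 0x800   = [w (0xc0 + n `div` 0x40), cont n]
>   | n < 0x10000 = [w (0xe0 + n `div` 0x1000), cont (n `div` 0x40), cont n]
>   | otherwise   = [ w (0xf0 + n `div` 0x40000), cont (n `div` 0x1000)
>                   , cont (n `div` 0x40), cont n ]
>   where
>     n = fromEnum c
>     w = fromIntegral :: Int -> Word8
>     -- Continuation byte: 0x80 plus the low 6 bits.
>     cont x = w (0x80 + x `mod` 0x40)

| *Main> concatMap encodeCharUtf8 zhongwen == zhongwen'
| True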
So what we want here is a UTF-8-aware function that converts a String
to a ByteString. The utf8-string package has such a conversion
function, fromString in Data.ByteString.UTF8:

> printUTF8 :: String -> IO ()
> printUTF8 ss = T.putStrLn $ E.decodeUtf8 $ U8.fromString ss

| *Main> printUTF8 latin1s
| français
| *Main> printUTF8 zhongwen
| 中文
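When the bytes come from somewhere else and may not be valid UTF-8, the
exception can also be avoided at the decoding step. A minimal sketch,
assuming the installed text package provides E.decodeUtf8' (the variant
that returns an Either); safeDecode is a made-up name.

> -- Hypothetical helper: Nothing instead of an exception when the
> -- bytes are not valid UTF-8.
> safeDecode :: B.ByteString -> Maybe T.Text
> safeDecode bs = either (const Nothing) Just (E.decodeUtf8' bs)

| *Main> safeDecode (C8.pack latin1s)
| Nothing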