Unicode and byte strings
The pypfilt package simultaneously supports Python 2.7 and Python 3.x, and is intended to behave identically regardless of the Python version.
It is assumed that the following Python 3 features are enabled in Python 2.7:
from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals
Importantly, among the differences between Python 2.7 and Python 3.x, the native str type is a byte string in Python 2 and a Unicode string in Python 3.
This means that, e.g., the str() built-in function returns byte strings in Python 2 and Unicode strings in Python 3.
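The difference can be observed directly. The following is a minimal sketch rather than part of pypfilt; the names PY2, text_type, and converted are illustrative assumptions, and the u'' prefix is used so that the snippet does not depend on unicode_literals being enabled.

```python
import sys

# True when running under Python 2, False under Python 3.
PY2 = sys.version_info[0] == 2

# The Unicode string type: unicode in Python 2, str in Python 3.
text_type = type(u'')

# str() returns the *native* string type of the running interpreter.
converted = str(42)

# Exactly one of these is True, depending on the Python version.
converted_is_unicode = isinstance(converted, text_type)
converted_is_bytes = isinstance(converted, bytes)

print(converted_is_unicode, converted_is_bytes)
```

Because the result of str() changes type between versions, code that must behave identically in both should convert explicitly rather than rely on str().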
Guidelines for working with text
As per the Unicode HOWTO for Python 2 and Python 3:
Tip
Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end (the “Unicode sandwich”).
To that end, adhere to the following guidelines:
Use Unicode strings and Unicode literals everywhere. In Python 2, this means placing the following at the top of every file:
from __future__ import unicode_literals
If you have non-ASCII characters in a Python source file (e.g., in Unicode literals such as 'α'), you need to declare the file encoding at the top of the file:
# -*- coding: utf-8 -*-
Encode Unicode text into UTF-8 when writing to disk:
import codecs

# Note: in Python 3, the open() built-in accepts an encoding argument
with codecs.open(filename, 'wb', encoding='utf-8') as f:
    f.write(unicode_string)
Decode UTF-8 bytes into Unicode text when reading from disk:
import codecs

# Note: in Python 3, the open() built-in accepts an encoding argument
with codecs.open(filename, 'rb', encoding='utf-8') as f:
    unicode_lines = f.read().splitlines()
Note that NumPy functions such as loadtxt and genfromtxt cannot reliably handle non-ASCII text (e.g., see NumPy issues #3184, #4543, #4600, #4939), and should only be used with ASCII files:
import codecs
import numpy as np

# Fragment: the with-block is assumed to sit inside a function body.
with codecs.open(filename, encoding='ascii') as f:
    return np.loadtxt(f, ...)
Use the 'S' (byte string) data type when storing text in NumPy arrays. Encode Unicode text into UTF-8 when storing text, and decode UTF-8 bytes when reading text:
>>> from __future__ import unicode_literals
>>> import numpy as np
>>> xs = np.empty(3, dtype='S20')
>>> xs[0] = 'abc'.encode('utf-8')
>>> xs[1] = '« äëïöü »'.encode('utf-8')
>>> xs[2] = 'ç'.encode('utf-8')
>>> print(max(len(x) for x in xs))
16
>>> for x in xs:
...     print(x.decode('utf-8'))
abc
« äëïöü »
ç
NumPy has a Unicode data type ('U'), but it is not supported by h5py (and is platform-specific).
Note that h5py object names (i.e., groups and datasets) are exclusively Unicode and are stored as bytes, so byte strings will be used as-is and Unicode strings will be encoded using UTF-8.
Use Unicode strings and literals when encoding to and decoding from JSON:
import codecs
import json

# Write UTF-8 bytes rather than '\uXXXX' escape sequences.
with codecs.open(filename, 'wb', encoding='utf-8') as f:
    json.dump(json_data, f, ensure_ascii=False)
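The disk and JSON guidelines above can be combined into a single round trip. This is a minimal sketch that assumes Python 3, where the open() built-in accepts an encoding argument; the helper names write_json, read_json, and read_raw_bytes are hypothetical and are not part of pypfilt.

```python
import json
import os
import tempfile


def write_json(filename, data):
    """Encode Unicode text into UTF-8 only at the point of writing."""
    with open(filename, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII characters as-is, rather
        # than replacing them with '\uXXXX' escape sequences.
        json.dump(data, f, ensure_ascii=False)


def read_json(filename):
    """Decode UTF-8 bytes into Unicode text as soon as they are read."""
    with open(filename, 'r', encoding='utf-8') as f:
        return json.load(f)


def read_raw_bytes(filename):
    """Return the file contents as undecoded bytes, for inspection."""
    with open(filename, 'rb') as f:
        return f.read()


original = {'name': '« äëïöü »'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.json')
    write_json(path, original)
    raw = read_raw_bytes(path)
    restored = read_json(path)

print(restored == original)
```

Everything between read_json and write_json handles Unicode strings only, which is the "Unicode sandwich" described in the tip.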
Functions for working with text
The pypfilt.text module provides functions for converting between Unicode strings and byte strings, which behave identically in Python 2 and Python 3.
pypfilt.text.to_unicode(value, encoding=u'utf-8')
Convert a value into a Unicode string.
- If the value is a Unicode string, no conversion is performed.
- If the value is a byte string, it is decoded according to the provided encoding.
- If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then decoded if necessary.
pypfilt.text.to_bytes(value, encoding=u'utf-8')
Convert a value into a byte string.
- If the value is a Unicode string, it is encoded according to the provided encoding.
- If the value is a byte string, no conversion is performed.
- If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then encoded if necessary.
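The conversion rules listed above can be sketched as follows. This is an illustration of the documented behaviour, not the actual pypfilt.text source; the names to_unicode_sketch, to_bytes_sketch, and text_type are hypothetical.

```python
# The Unicode string type: unicode in Python 2, str in Python 3.
text_type = type(u'')


def to_unicode_sketch(value, encoding='utf-8'):
    """Hypothetical stand-in that follows the rules given for to_unicode."""
    if isinstance(value, text_type):
        # Already Unicode: no conversion is performed.
        return value
    if isinstance(value, bytes):
        # Byte string: decode according to the provided encoding.
        return value.decode(encoding)
    # Anything else: convert with str(), then decode only if str()
    # returned a byte string (which happens in Python 2).
    converted = str(value)
    if isinstance(converted, bytes):
        return converted.decode(encoding)
    return converted


def to_bytes_sketch(value, encoding='utf-8'):
    """Hypothetical stand-in that follows the rules given for to_bytes."""
    if isinstance(value, bytes):
        # Already a byte string: no conversion is performed.
        return value
    if isinstance(value, text_type):
        # Unicode string: encode according to the provided encoding.
        return value.encode(encoding)
    # Anything else: convert with str(), then encode only if str()
    # returned a Unicode string (which happens in Python 3).
    converted = str(value)
    if isinstance(converted, text_type):
        return converted.encode(encoding)
    return converted


print(to_unicode_sketch(b'\xc3\xa7') == u'\xe7')
print(to_bytes_sketch(u'\xe7') == b'\xc3\xa7')
```

The "if necessary" step in each rule exists because the type returned by str() differs between Python 2 and Python 3.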
It also provides functions for determining whether a value is a Unicode string or a byte string, although this should generally be known in advance.
pypfilt.text.is_unicode(value)
Return True if the value is a Unicode string.
pypfilt.text.is_bytes(value)
Return True if the value is a byte string.
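A minimal sketch of how such checks might be written so that they give the same answers in Python 2 and Python 3. It is not the pypfilt.text source, and is_unicode_sketch, is_bytes_sketch, and text_type are hypothetical names.

```python
# The Unicode string type: unicode in Python 2, str in Python 3.
text_type = type(u'')


def is_unicode_sketch(value):
    """Return True if the value is a Unicode string."""
    return isinstance(value, text_type)


def is_bytes_sketch(value):
    """Return True if the value is a byte string."""
    # bytes is an alias for str in Python 2 and a distinct type in Python 3.
    return isinstance(value, bytes)


print(is_unicode_sketch(u'abc'), is_bytes_sketch(u'abc'))
print(is_unicode_sketch(b'abc'), is_bytes_sketch(b'abc'))
```

A value that is neither kind of string, such as a number, fails both checks.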