Unicode and byte strings

The pypfilt package supports both Python 2.7 and Python 3.x, and is intended to behave identically under either version. It assumes that the following Python 3 features are enabled in Python 2.7:

from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals

An important difference between Python 2.7 and Python 3.x is that the native str type is a byte string in Python 2 and a Unicode string in Python 3. For example, the str() built-in function returns byte strings in Python 2 and Unicode strings in Python 3.
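
The difference can be observed directly. The following is a minimal sketch, assuming the __future__ imports listed above; the variable names are illustrative only and are not part of pypfilt:

```python
from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals

# With unicode_literals, a plain literal is a Unicode string on both versions.
literal = 'abc'

# The str() built-in still returns the *native* string type:
# a byte string in Python 2, a Unicode string in Python 3.
converted = str(42)

# The type of an empty literal is the Unicode type for the running version
# (unicode in Python 2, str in Python 3).
text_type = type('')

literal_is_text = isinstance(literal, text_type)
str_returns_text = isinstance(converted, text_type)

print(literal_is_text, str_returns_text)
```

Under Python 3 both checks are true; under Python 2.7 the second is false, which is why the result of str() cannot be assumed to be Unicode text.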

Guidelines for working with text

As per the Unicode HOWTO for Python 2 and Python 3:

Tip

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end (the “Unicode sandwich”).

To that end, adhere to the following guidelines:

  • Use Unicode strings and Unicode literals everywhere. In Python 2, this means placing the following at the top of every file:

    from __future__ import unicode_literals
    
  • If you have non-ASCII characters in a Python source file (e.g., in Unicode literals such as 'α'), you need to declare the file encoding at the top of the file:

    # -*- coding: utf-8 -*-
    
  • Encode Unicode text into UTF-8 when writing to disk:

    import codecs

    # Note: in Python 3, the open() built-in accepts an encoding argument
    with codecs.open(filename, 'wb', encoding='utf-8') as f:
        f.write(unicode_string)
    
  • Decode UTF-8 bytes into Unicode text when reading from disk:

    import codecs

    # Note: in Python 3, the open() built-in accepts an encoding argument
    with codecs.open(filename, 'rb', encoding='utf-8') as f:
        unicode_lines = f.read().splitlines()
    
  • Note that NumPy functions such as loadtxt and genfromtxt cannot reliably handle non-ASCII text (e.g., see NumPy issues #3184, #4543, #4600, #4939), and should only be used with ASCII files:

    import codecs
    import numpy as np

    # Decoding as ASCII raises an error if the file contains non-ASCII bytes.
    with codecs.open(filename, encoding='ascii') as f:
        data = np.loadtxt(f, ...)
    
  • Use the 'S' (byte string) data type when storing text in NumPy arrays. Encode Unicode text into UTF-8 when storing text, and decode UTF-8 bytes when reading text:

    >>> from __future__ import unicode_literals
    >>> import numpy as np
    >>> xs = np.empty(3, dtype='S20')
    >>> xs[0] = 'abc'.encode('utf-8')
    >>> xs[1] = '« äëïöü »'.encode('utf-8')
    >>> xs[2] = 'ç'.encode('utf-8')
    >>> print(max(len(x) for x in xs))
    16
    >>> for x in xs:
    ...     print(x.decode('utf-8'))
    abc
    « äëïöü »
    ç
    
  • NumPy has a Unicode data type ('U'), but it is not supported by h5py (and is platform-specific).

  • Note that h5py object names (i.e., the names of groups and datasets) are exclusively Unicode but are stored as bytes: byte strings are used as-is, and Unicode strings are encoded using UTF-8.

  • Use Unicode strings and literals when encoding to and decoding from JSON:

    import codecs
    import json

    # Write UTF-8 bytes rather than '\uXXXX' escape sequences.
    with codecs.open(filename, 'wb', encoding='utf-8') as f:
        json.dump(json_data, f, ensure_ascii=False)
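
The guidelines above can be combined into a single round trip. The following is a minimal sketch rather than code taken from pypfilt: it uses io.open, which exists in both Python 2.7 and Python 3 (where it is the open() built-in), in place of codecs.open, and the file names and sample data are assumptions made for illustration:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import json
import os
import shutil
import tempfile

# Illustrative data only; everything is Unicode text inside the program.
text = '« äëïöü »\nç\n'
data = {'name': 'α', 'values': [1, 2, 3]}

tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, 'example.txt')
json_path = os.path.join(tmp_dir, 'example.json')

# Encode to UTF-8 only at the boundary, when writing to disk.
with io.open(text_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Decode from UTF-8 as soon as the data is read back.
with io.open(text_path, 'r', encoding='utf-8') as f:
    lines = f.read().splitlines()

# Keep non-ASCII characters as UTF-8 rather than '\uXXXX' escapes.
with io.open(json_path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(data, ensure_ascii=False))

with io.open(json_path, 'r', encoding='utf-8') as f:
    loaded = json.loads(f.read())

# Remove the temporary files created for this example.
shutil.rmtree(tmp_dir)
```

After running this, lines holds the two Unicode lines that were written, and loaded is equal to data.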
    

Functions for working with text

The pypfilt.text module provides functions for converting between Unicode strings and byte strings, which behave identically in Python 2 and Python 3.

pypfilt.text.to_unicode(value, encoding=u'utf-8')

Convert a value into a Unicode string.

  • If the value is a Unicode string, no conversion is performed.
  • If the value is a byte string, it is decoded according to the provided encoding.
  • If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then decoded if necessary.

pypfilt.text.to_bytes(value, encoding=u'utf-8')

Convert a value into a byte string.

  • If the value is a Unicode string, it is encoded according to the provided encoding.
  • If the value is a byte string, no conversion is performed.
  • If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then encoded if necessary.
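
The conversion rules listed for to_unicode and to_bytes can be sketched as follows. This is not pypfilt's source code: the names to_unicode_sketch, to_bytes_sketch, text_type, and bytes_type are hypothetical, and the code only illustrates the documented behaviour:

```python
from __future__ import unicode_literals

# Hypothetical helper names; these are not part of pypfilt.
text_type = type('')     # unicode in Python 2, str in Python 3
bytes_type = type(b'')   # str in Python 2, bytes in Python 3


def to_unicode_sketch(value, encoding='utf-8'):
    """Convert a value into a Unicode string."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes_type):
        return value.decode(encoding)
    # str() returns bytes in Python 2 and Unicode in Python 3,
    # so decode the result only if it is a byte string.
    converted = str(value)
    if isinstance(converted, bytes_type):
        return converted.decode(encoding)
    return converted


def to_bytes_sketch(value, encoding='utf-8'):
    """Convert a value into a byte string."""
    if isinstance(value, bytes_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    # Encode the result of str() only if it is a Unicode string.
    converted = str(value)
    if isinstance(converted, text_type):
        return converted.encode(encoding)
    return converted
```

With these definitions, to_unicode_sketch(b'abc') and to_unicode_sketch('abc') return the same Unicode string, and to_bytes_sketch(42) returns the byte string b'42'.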

The pypfilt.text module also provides functions for determining whether a value is a Unicode string or a byte string, although the type of a value should generally be known in advance.

pypfilt.text.is_unicode(value)

Return True if the value is a Unicode string.

pypfilt.text.is_bytes(value)

Return True if the value is a byte string.
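
As an illustration of what these checks distinguish, the following sketch classifies a few values. The names is_unicode_sketch and is_bytes_sketch are hypothetical stand-ins, written here so that the example does not depend on pypfilt being installed:

```python
from __future__ import unicode_literals

# Hypothetical stand-ins for the documented checks; not pypfilt's own code.
text_type = type('')     # unicode in Python 2, str in Python 3
bytes_type = type(b'')   # str in Python 2, bytes in Python 3


def is_unicode_sketch(value):
    """Return True if the value is a Unicode string."""
    return isinstance(value, text_type)


def is_bytes_sketch(value):
    """Return True if the value is a byte string."""
    return isinstance(value, bytes_type)


# A Unicode literal, its UTF-8 encoding, and a value that is neither.
samples = ['abc', 'abc'.encode('utf-8'), 42]
results = [(is_unicode_sketch(x), is_bytes_sketch(x)) for x in samples]
print(results)
```

A value such as an integer is neither kind of string, so both checks return False for it.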