Unicode and byte strings

The native str type is a Unicode string in Python 3.

Tip

Software should work only with Unicode strings internally, decoding input data as early as possible and encoding output only at the end (the “Unicode sandwich”).
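As a minimal sketch of the sandwich, the hypothetical helper below (reverse_text is an illustrative name, not part of any library) decodes at the input boundary, works on str only, and encodes again at the output boundary. Reversing was chosen because doing it on the raw bytes would split multi-byte UTF-8 sequences.

```python
def reverse_text(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: reverse a text payload that arrives as bytes."""
    text = raw.decode(encoding)         # bottom slice: decode as early as possible
    reversed_text = text[::-1]          # filling: operate on Unicode (str) only
    return reversed_text.encode(encoding)  # top slice: encode only at the end


payload = "abç".encode("utf-8")
print(reverse_text(payload).decode("utf-8"))  # prints: çba
```

Reversing the bytes directly (payload[::-1]) would instead yield a sequence that is no longer valid UTF-8.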

To that end, adhere to the following guidelines:

  • Use Unicode strings and Unicode literals everywhere.

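  • Python 3 assumes that source files are encoded in UTF-8, so non-ASCII characters (e.g., in Unicode literals such as 'α') require an encoding declaration only if the file is saved in a different encoding. An explicit UTF-8 declaration at the top of the file is optional: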

    # -*- coding: utf-8 -*-
    
  • Encode Unicode text into UTF-8 when writing to disk:

    with open(filename, encoding='utf-8', mode='w') as f:
        f.write(unicode_string)
    
  • Decode UTF-8 bytes into Unicode text when reading from disk:

    with open(filename, encoding='utf-8') as f:
        unicode_lines = f.read().splitlines()
    
  • From NumPy 1.14 onward, functions such as loadtxt and genfromtxt accept an encoding argument and can read files in any text encoding supported by Python:

    import numpy as np
    
    data = np.loadtxt(filename, encoding='utf-8')  # plus any other keyword arguments
    
  • Use the 'S' (byte string) data type when storing text in NumPy arrays. Encode Unicode text into UTF-8 before storing it, and decode the UTF-8 bytes after reading them back. Note that the lengths below are counted in bytes, not characters:

    >>> import numpy as np
    >>> xs = np.empty(3, dtype='S20')
    >>> xs[0] = 'abc'.encode('utf-8')
    >>> xs[1] = '« äëïöü »'.encode('utf-8')
    >>> xs[2] = 'ç'.encode('utf-8')
    >>> print([len(x) for x in xs])
    [3, 16, 2]
    >>> for x in xs:
    ...     print(x.decode('utf-8'))
    abc
    « äëïöü »
    ç
    

    Note

    Alternatively, h5py’s variable-length string type can be used instead of 'S'.

  • NumPy has a Unicode data type ('U'), but it is not supported by h5py (and is platform-specific).

  • Note that h5py object names (i.e., the names of groups and datasets) are exclusively Unicode but are stored as bytes: byte strings are used as-is, and Unicode strings are encoded as UTF-8.
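Regarding the 'S' data type guideline above: the width of an 'S' field counts bytes, not characters, and NumPy truncates values that are longer than the field. The following sketch shows one way to choose a sufficient width before allocating the array. The helper name s_dtype_for is hypothetical and uses only the standard library.

```python
def s_dtype_for(strings, encoding="utf-8"):
    """Hypothetical helper: build an 'S<n>' dtype string wide enough
    to hold every item of `strings` once encoded."""
    # Measure the encoded (byte) length, not the character count.
    width = max((len(s.encode(encoding)) for s in strings), default=0)
    # Use a width of at least 1 so the dtype string is always explicit.
    return f"S{max(width, 1)}"


texts = ['abc', '« äëïöü »', 'ç']
print([len(t) for t in texts])  # character counts: [3, 9, 1]
print(s_dtype_for(texts))       # prints: S16
```

The resulting string can be passed as the dtype argument, e.g. np.empty(len(texts), dtype=s_dtype_for(texts)), instead of a hard-coded width such as 'S20'.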