Friday 18 December 2009

Separation of concerns in Java and Python

I've blogged previously about the idea that separation of concerns is important in making a programming language easy to learn. Dealing with very different concerns using the same language feature is a bad idea because the meaning of symbols then becomes highly context-dependent, making code harder to read. As an example, let's consider the ways in which text and binary data are handled by Java and Python.

Java got this right from Day One, more or less. In Java, the byte and char types are distinctly different, the latter being a 16-bit type supporting Unicode. A slight wrinkle in JDK 1.0 was the fact that I/O was based solely around the byte-oriented InputStream and OutputStream class hierarchies, but this was fixed in JDK 1.1 via the introduction of parallel, text-oriented class hierarchies based on Reader and Writer.

With Python, things have been rather different. Python predates both Java and the finalised Unicode 1.0 standard, so for a long time it represented text solely as strings of 8-bit ASCII characters. It must have seemed a convenient economy at that time to also use strings as a means of representing sequences of bytes read from a binary file, but this in effect tied the str type to an 8-bit character representation. When it eventually became necessary (in 2000) for Python to support Unicode, the only way that this could be achieved without breaking existing code was to introduce a new unicode type for Unicode-based strings.

The upshot of all this was that Python now had two overlapping types: str, to represent ASCII text and binary data; and unicode, to represent Unicode text.

Aside from the problem of not keeping the different concerns of text and binary data separate, this solution also demonstrates undesirable redundancy; ideally, there should be just one way of representing text.

Perhaps the most significant improvement made in Python 3 is the reimplementation of str as a Unicode-based type and the inclusion of an entirely separate bytes type for representing byte sequences. Nothing is lost in this more orthogonal scheme, thanks to the provision of an encode method in str that converts a string to the equivalent bytes object and a decode method in bytes that converts a byte sequence to the equivalent string - given an appropriate codec in each case, of course.

1 comment:

  1. "Perhaps the most significant improvement made in Python 3 is the reimplementation of str as a Unicode-based type and the inclusion of an entirely separate bytes type for representing byte sequences."

    Just FYI, the situation is much more similar to a rename of the builin classes: (bytes, str) = (str, unicode) # in Python style #, than any kind of reimplementation! Some implementation details of unicode(old name) aka str (new name) might have change, but not anything the user should notice.

    ReplyDelete