
Jeremy Maitin-Shepard wrote:
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or an inability to handle types smaller than 32 bits, can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the Boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O.
I'm not sure what you mean by this exactly.
Platforms using 9-bit bytes need binary I/O, too. They might need to do it in their native 9-bit units. It would be a shame to deprive them of this possibility just because the text streams require octets, especially if we already have a layer in place whose purpose is to convert between low-level data representations.
One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
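Just to illustrate what I mean (io::byte_type and the read/write signatures are made up for this sketch, nothing that already exists):

#include <cstddef>

namespace io
{
    // The platform's basic unit of binary I/O: an octet on most machines,
    // a nonet on a 9-bit machine. Purely hypothetical name.
    typedef unsigned char byte_type;

    // A binary transport deals only in platform bytes; it knows nothing
    // about octets or text. Interface sketch only, no implementation.
    class file_device
    {
    public:
        std::size_t read(byte_type *buffer, std::size_t n);
        std::size_t write(const byte_type *buffer, std::size_t n);
    };
}

The conversion to something the character coding can work with would then live entirely in the formatting layer above this.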
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post.
Which other post is this?
I suppose if you can find a clean way to support these unusual architectures, then all the better.
It seems that it would be very hard to support e.g. utf-8 on a platform with 9-bit bytes or which cannot handle types smaller than 32-bits.
I think the binary conversion can do it. The system would work approximately like this:

1) Every platform defines its basic I/O byte. This would be 8 bits for most computers (including those where char is 32 bits large), 9 or some other number of bits for others. The I/O byte is the smallest unit that can be read from a stream.

2) Most platforms will additionally designate an octet type. Probably I will just use uint8_t for this. They will supply a Representation for the formatting layer that can convert a stream of I/O bytes to a stream of octets (e.g. by truncating each byte). If an octet stream is then needed (e.g. for creating a UTF-8 stream), this representation will be inserted.

3) Platforms that do not support octets at all (or simply do not have a primitive type to spare for unambiguous overloads - they could use another 9-bit type and just ignore the additional bit; character streams, at least, do not perform arithmetic on their units, so overflow is not an issue) do not have support for this. They're out of luck. I think this case is rare enough to be ignored.
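A rough sketch of what the Representation in point 2 could look like (octet_representation and its interface are invented for illustration only):

#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

// Hypothetical platform I/O byte; imagine it 9 bits wide on the platforms
// discussed above (here it is just a plain 8-bit char).
typedef unsigned char io_byte;

// Illustrative Representation for the formatting layer: converts a run of
// platform I/O bytes into octets by truncating each byte to its low 8 bits.
struct octet_representation
{
    std::vector<boost::uint8_t>
    to_octets(const io_byte *data, std::size_t n) const
    {
        std::vector<boost::uint8_t> result;
        result.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
            result.push_back(static_cast<boost::uint8_t>(data[i] & 0xFF));
        return result;
    }
};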
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer.
Well, that depends. Let's assume, for example, that the system will be implemented as C++09 templates with heavy use of concepts. The concepts for multimark/reset and tell/seek could look like this:

typedef implementation-defined streamsize;

enum start_position { begin, end, current };

template <typename T>
concept Seekable {
    streamsize tell(const T &);
    void seek(T &, start_position, streamsize);
}

template <typename T>
concept MultiMarkReset {
    typename mark_type;
    mark_type mark(const T &);
    void reset(T &, mark_type);
}

Now it's trivial to make every Seekable stream also support mark/reset by means of this simple concept map:

template <Seekable T>
concept_map MultiMarkReset<T> {
    typedef streamsize mark_type;
    mark_type mark(const T &t) { return tell(t); }
    void reset(T &t, mark_type m) { seek(t, begin, m); }
}
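For illustration, continuing in the same hypothetical concept syntax (parse_term, parse_factor and file_stream are placeholder names), a parser written against MultiMarkReset can then take a Seekable file stream directly:

// Hypothetical parser fragment that only needs multiple mark/reset.
template <MultiMarkReset Stream>
bool parse_expression(Stream &s)
{
    MultiMarkReset<Stream>::mark_type m = mark(s);
    if (parse_term(s))          // made-up production
        return true;
    reset(s, m);                // rewind and try the other alternative
    return parse_factor(s);     // made-up production
}

// A file stream models Seekable, so the concept map above makes it
// usable here without any intermediate layer:
file_stream f = open_file("data.txt", read);
parse_expression(f);

So the different names need not get in the way, as long as the mapping is provided once.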
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
Hmm ... makes sense. I'm not really happy, but it makes sense.
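For instance, a minimal sketch of that kind of protocol handling (the function names and the flag convention are invented, not part of any proposed interface):

#include <boost/cstdint.hpp>
#include <cstring>

// Returns true if this machine stores the least significant byte first.
bool native_is_little_endian()
{
    const boost::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Reads a 32-bit value whose byte order was announced in the message itself.
boost::uint32_t read_uint32(const unsigned char *p, bool data_is_little_endian)
{
    if (data_is_little_endian)
        return p[0]
             | (boost::uint32_t(p[1]) << 8)
             | (boost::uint32_t(p[2]) << 16)
             | (boost::uint32_t(p[3]) << 24);
    return (boost::uint32_t(p[0]) << 24)
         | (boost::uint32_t(p[1]) << 16)
         | (boost::uint32_t(p[2]) << 8)
         | p[3];
}

The writer would always pick the branch matching native_is_little_endian() and record that choice in the message.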
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Yes, but that's the ideal case. In practice, this means that the application would have to do its own buffering even if it really wants the data unit by unit.

The programmer will not want to construct the complicated full type for this:

newline_filter<encoding_device<utf_8, native_converter<gzip_filter<buffering_filter<file_device> > > > > chain =
    open_file(filename, read).attach(buffer()).attach(gunzip())
        .attach(decode(utf_8())).attach(newlines());

The programmer will want to simply write

text_stream<utf_8> chain =
    open_file(filename, read).attach(buffer()).attach(gunzip())
        .attach(decode(utf_8())).attach(newlines());

But text_stream does type erasure and thus has a virtual call for everything. If the user now proceeds to read single characters from the stream, that's one virtual call per character. And I don't think this can really be changed. It's still better than a fully object-oriented design, where every read here would actually mean 3 or more virtual calls down the chain. (That's the case in Java's I/O system, for example.)
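Roughly what that type erasure looks like, ignoring the encoding parameter and error handling (a sketch with invented member names, not the actual classes):

typedef unsigned int code_point;   // stand-in for a character/code point type

// Abstract interface the erased filter chain hides behind.
struct text_stream_impl
{
    virtual ~text_stream_impl() {}
    virtual bool read_char(code_point &c) = 0;   // one virtual call per character
};

// Wraps an arbitrary concrete chain behind that interface.
template <typename Chain>
struct text_stream_wrapper : text_stream_impl
{
    explicit text_stream_wrapper(const Chain &c) : chain(c) {}
    virtual bool read_char(code_point &c) { return chain.read_char(c); }
    Chain chain;
};

class text_stream
{
public:
    template <typename Chain>
    text_stream(const Chain &c) : impl(new text_stream_wrapper<Chain>(c)) {}
    ~text_stream() { delete impl; }

    // Every single-character read pays for one virtual dispatch.
    bool read_char(code_point &c) { return impl->read_char(c); }

private:
    text_stream_impl *impl;
    text_stream(const text_stream &);              // non-copyable, for brevity
    text_stream &operator=(const text_stream &);
};

A variant that fetches a whole block of characters per virtual call could amortize the cost, but that is exactly the extra buffering burden I mentioned above.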
Is it in fact the case that all character encodings that are useful to support encode only a subset of Unicode? (i.e. there does not exist a useful encoding that can represent a character that cannot be represented by Unicode?)
I think it is. If it isn't, that's either a defect the Unicode consortium will want to correct by adding the characters to Unicode, or the encoding is for really unusual stuff, such as Klingon text or Elven Tengwar runes. They can be seen as mappings to the private regions of Unicode and are by nature not convertible to other encodings. One possible exception is characters that only exist in Unicode as grapheme clusters but may be directly represented in other encodings.
In any case, though, it is not clear exactly why there is a need to think of an arbitrary character encoding in terms of Unicode, except when explicitly converting between that encoding and a Unicode encoding.
It is convenient to have a unified concept of a character, independent of its encoding. The Unicode character set provides such a concept. Unicode is also convenient in that it adds classification rules and similar stuff. This decision is not really visible to user code anyway, only to encoding converters: it should be sufficient to provide a conversion from and to Unicode code points to enable a new encoding to be used in the framework.
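To make that concrete, the shape of such a converter might be something like this (a sketch only; the names and signatures are invented):

#include <cstddef>

typedef unsigned int code_point;   // stand-in for a Unicode code point type

// Hypothetical encoding converter: anything that can map its units to and
// from Unicode code points could be plugged into the framework.
struct shift_jis_converter
{
    // Decodes one character starting at 'in', advances 'in' past it,
    // and returns the corresponding Unicode code point.
    code_point to_unicode(const unsigned char *&in, const unsigned char *end);

    // Encodes 'c' into 'out' and returns the number of bytes written.
    std::size_t from_unicode(code_point c, unsigned char *out);
};

Sebastian Redl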