
Hi, A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model. Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it. The document can be found here: http://windmuehlgasse.getdesigned.at/newio/ I'd especially like input on the unresolved issues, but all comments are welcome, even if you tell me that what I'm doing is completely pointless and misguided. (At least I'd know not to waste my time with refining and implementing the design. :-)) Sebastian Redl

Sebastian Redl wrote:
A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model.
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
Sebastian Redl
Seems interesting. There is a small problem on this line (main page): "Noteworthy readers/writers are StringReader/Writer, *for in-memory I/O*, and InputStreamReader and OutputStreamWriter, which wrap binary streams and do on-the-fly character conversion." The "for in-memory I/O" seems to be a copy-paste error. Other than this, I don't have much to say. Just a word on formatting: we need a better way than iostreams for specifying formatting options. When I need to print numbers, I often go back to printf. The current system is too clumsy. Regards, // Am I supposed to put this at the bottom of a mail? // It seems some put "Best regards" (and others nothing) // I suppose it is the equivalent of 'Cordialement' in French -- Cédric Venet

Sebastian Redl wrote:
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
I'd especially like input on the unresolved issues, but all comments are welcome, even if you tell me that what I'm doing is completely pointless and misguided. (At least I'd know not to waste my time with refining and implementing the design. :-))
I personally think that there is no use in distinguishing between text and binary I/O. For I/O, you need a binary representation. To get it, you can serialize your object any way you like. This shouldn't be the job of a "binary formatting layer", even though having separate utilities to serialize/unserialize basic types to common representations would be useful (double to little-endian IEEE 754 double-precision, for example). Also, I believe the narrow/wide characters and locales stuff is broken beyond all repair, so I wouldn't recommend doing anything related to that. I also think text formatting is a different need from I/O. Indeed, it is often necessary to generate a formatted string which is then given to a GUI toolkit or whatever.

Mathias Gaunard <mathias.gaunard@etu.u-bordeaux1.fr> writes:
Sebastian Redl wrote:
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
I'd especially like input on the unresolved issues, but all comments are welcome, even if you tell me that what I'm doing is completely pointless and misguided. (At least I'd know not to waste my time with refining and implementing the design. :-))
I personally think that there is no use in distinguishing between text and binary I/O.
I think part of the issue may be the name "binary". A better name may be "byte" I/O or "byte" stream. Conceptually it seems important to distinguish between a byte stream and a character stream. At the C++ type level, however, it may indeed not be useful to distinguish between a "character" stream and a "byte stream", and instead merely allow streams of any arbitrary POD type. Then a "byte" stream is defined as a stream of uint8_t, and a character stream may be a stream of uint8_t or uint16_t or uint32_t, depending on the character encoding used.
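As a rough illustration of that unification (the names here are purely hypothetical, not part of the proposed design), a stream could simply be parameterized on the POD unit it transports:

    #include <cstdint>
    #include <cstddef>

    // Hypothetical sketch: a stream is parameterized on the POD unit it transports.
    template <typename Unit>
    class input_stream
    {
    public:
        virtual ~input_stream() {}
        // Read up to n units into buf; return the number of units actually read.
        virtual std::size_t read(Unit* buf, std::size_t n) = 0;
    };

    typedef input_stream<std::uint8_t>  byte_input_stream;   // a "byte"/octet stream
    typedef input_stream<std::uint16_t> utf16_unit_stream;   // code units of a UTF-16 stream
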
For I/O, you need a binary representation. To get it, you can serialize your object any way you like. This shouldn't be the job of a "binary formatting layer", even though having separate utilities to serialize/unserialize basic types to common representations would be useful (double to Little Endian IEEE 754 Double-precision for example). Also, I believe the narrow/wide characters and locales stuff is broken beyond all repair, so I wouldn't recommend to do anything related to that.
I believe that this library will attempt to address and properly handle those issues.
I also think text formatting is a different need than I/O. Indeed, it is often needed to generate a formatted string which is then given to a GUI Toolkit or whatever.
Presumably this would be supported by using the text formatting layer on top of an output text sink backed by an in-memory buffer. -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
I think part of the issue may be the name "binary". A better name may be "byte" I/O or "byte" stream.
Originally, the binary transport layer was called the byte transport layer. I decided against this name for the simple reason that, as far as the C++ standard is concerned, a byte is pretty much the same as an unsigned char. Because the exact unit of transport is still an open question (and the current tendency I see is toward using octets, and leaving native bytes to some other mechanism), I didn't want any such implication in the name. The name binary isn't a very good choice either, I admit. In the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in mind works something like this: binary data is in terms of octets, bytes, primitives, or PODs, whatever. The distinguishing feature of binary data is that each "unit" is meaningful in isolation. It makes sense to fetch a single unit and work on it. It makes sense to jump to an arbitrary position in the stream and interpret the unit there.
Textual data is far more complex. It's a stream of abstract characters, and they don't map cleanly to the underlying representative primitive. A UTF-8 character maps to one, two, three, or four octets, leaving aside the dilemma of combining accents. A UTF-16 character maps to one or two double-octets. It doesn't make sense to fetch a single primitive and work on it, because it may not be a complete character. It doesn't make sense to jump to an arbitrary position, because you might jump into the middle of a character. The internal character encoding is part of the text stream's type in my model.
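To make the "middle of a character" problem concrete, a tiny sketch, assuming UTF-8 specifically:

    #include <cstdint>

    // In UTF-8, bytes of the form 10xxxxxx are continuation bytes; they are never
    // the start of a character. Landing on one after an arbitrary seek means the
    // position is in the middle of a multi-octet sequence.
    inline bool is_utf8_continuation(std::uint8_t octet)
    {
        return (octet & 0xC0u) == 0x80u;
    }

    // A decoder would first have to skip to the next lead byte before it can
    // interpret anything, which is exactly why octet-level seeking does not
    // translate into character-level seeking.
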
Also, I believe the narrow/wide characters and locales stuff is broken beyond all repair, so I wouldn't recommend to do anything related to that.
I believe that this library will attempt to address and properly handle those issues.
I certainly will, especially as this view seems to be generally agreed on by the posters.
I also think text formatting is a different need than I/O. Indeed, it is often needed to generate a formatted string which is then given to a GUI Toolkit or whatever.
Presumably this would be supported by the using the text formatting layer on top of an output text sink backed by an in-memory buffer.
That's the idea. Separating formatting and I/O is necessary to avoid an ugly mess of responsibilities, which is why the text formatting layer is a distinct layer. However, having the formatting build on the I/O interfaces instead of string interfaces allows for greater optimization opportunities.

There is not much you can do when you create a string from format instructions. You have two choices, basically. One is to first find out how much memory is needed, allocate a buffer, and then format the data. This is the approach MFC's CString::Format takes. The obvious problem is that it does all the work twice. The less obvious problem is that it makes the code considerably more complex: either you find a way to turn sub-formatting (that is, evaluating the format instructions for a single parameter, e.g. turning an int into a string) into a dummy operation that just returns the space needed, which is very complex and, depending on the exact way formatting works, may even be impossible (or at least will create many little strings that have their length read, only to be discarded and later re-created); or you require the formatting methods to provide an explicit "measure-only" operation, which hurts extensibility.

The other way is to just go ahead and format, re-allocating whenever space runs out. And that's just what the I/O-based method does anyway. However, if your formatting is bound to string objects, it means that every formatting operation has to create a string containing all formatted parameters. This may be a considerable memory/time overhead when compared to formatting that works on the I/O interfaces, where formatted data is sent directly to the underlying device. (For efficiency, of course, there may be a buffer in the device chain. That's fine. The buffer is re-used, not allocated once per formatting operation.)

So yes, the formatting layer will be very distinct (and developed after the other parts are complete), but I really believe that basing it on the I/O interfaces is the best solution. There can be, of course, a convenience interface that simply creates a string from formatting instructions and parameters.

Sebastian Redl
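To make the contrast described above concrete, a toy sketch; std::ostringstream and std::ostream merely stand in for whatever string and sink interfaces the proposed library would actually define:

    #include <ostream>
    #include <sstream>
    #include <string>

    // String-based: every call allocates and returns a fresh string, which the
    // caller then copies into whatever it was building.
    std::string format_int_to_string(int value)
    {
        std::ostringstream tmp;
        tmp << value;
        return tmp.str();          // a temporary string per formatted item
    }

    // Stream-based: the formatted characters go straight into the sink's
    // (reusable) buffer; no per-item string object is needed.
    void format_int_to_sink(std::ostream& sink, int value)
    {
        sink << value;
    }

The second function never materializes a per-item string; whatever buffering happens lives in the sink and is reused across calls.
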

Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
I think part of the issue may be the name "binary". A better name may be "byte" I/O or "byte" stream. Originally, the binary transport layer was called byte transport layer. I decided against this name for the simple reason that, as far as the C++ standard is concerned, a byte is pretty much the same as an unsigned char.
By byte, I really meant octet. You could then call it an octet stream, but it may be useful for at least the basic stream concept to support streams of arbitrary POD types. It seems like it would introduce significant complications to try to support any architecture with a non-octet byte (i.e. an architecture that does not have an 8-bit type). Perhaps it is best for this library to just completely ignore such architectures, since they are extremely rare anyway.
Because the exact unit of transport is still in the open (and the current tendency I see is toward using octets, and leaving native bytes to some other mechanism), I didn't want any such implication in the name. The name binary isn't a very good choice either, I admit. In the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in my mind works something like this: Binary data is in terms of octets, bytes, primitives, or PODs, whatever.
Perhaps the name "data stream" would be appropriate, or better yet, perhaps just "stream", and use the qualified name "text stream" or "character stream" to refer to streams of characters that are somehow marked (either at compile-time or run-time) with the encoding. This brings up the following issue, what didn't seem to be addressed very much in your responses: Should text streams of arbitrary (non-Unicode encodings) be supported? Also, should text streams support arbitrary base types (i.e. uint8_t or uint16_t, or some other type), or should be restricted to a single type (like uint16_t)? The reason for substantial unification between both "data streams" and "text streams" is that despite differences in how the data they transport is used, the interface should essentially be the same (both basic read/write, as well as things like mark/reset and seeking), and a buffering facility should be exactly the same for both types of streams. Similarly, facilities for providing mark/reset support on top of a stream that does not support it by using buffering would be exactly the same for both "binary" and "text" streams. Even if seek may not be as useful for text streams, it still might be useful to some people, and there is no reason to exclude it. In the document you posted, for instance, you essentially just duplicated much of the description of binary streams for the text streams, which suggests a problem. As a suggest below, a "text stream" might always be a very thin layer on top of a binary stream, that simply specifies an encoding. The issue, though, is how it would work to layer something like a regular buffer or a mark/reset-providing buffer on top of a text stream. There shouldn't have to be two mark/reset providers, one for data streams, and one for text stream, but also it should be possible to layer such a thing on top of a text stream directly, and still maintain the encoding annotation.
The distinguishing feature of binary data is that each "unit" is meaningful in isolation. It makes sense to fetch a single unit and work on it. It makes sense to jump to an arbitrary position in the stream and interpret the unit there.
I suppose the question is at what level the library will provide character encoding/decoding/conversion. It seems like the most basic way to create a text stream would be to simply take a data stream and mark it with a particular encoding to make a text stream. This suggests that encoding/decoding/conversion should exist as a "data stream" operation. One thing I haven't figured out, though, is how the underlying unit type of the stream, i.e. uint8_t or uint16_t, would correspond to the encoding. In particular, the issue is what underlying unit type corresponds to each of the following encodings:
- UTF-8, iso-8859-* (it seems obvious that uint8_t would be the choice here)
- UTF-16 (uint16_t looks promising, but you need to be able to read this from a file, which might be a uint8_t stream)
- UTF-16-LE/UTF-16-BE (uint16_t looks promising, but uint16_le_t/uint16_be_t (special types that might be defined in the endian library) might be better, and furthermore you need to be able to read this from a file, which might be a uint8_t stream)
Perhaps you have some ideas about the conceptual model that resolves these issues. One solution may be to decide that the unit type of the stream need not be too closely tied to the encoding, and in particular the encoding conversion might not care what types the input and output streams are, within some limits (maybe the size of the unit type of the encoding must be a multiple of the size of the unit type of the stream).
Textual data is far more complex. It's a stream of abstract characters, and they don't map cleanly to the underlying representative primitive. A UTF-8 character maps to one, two, three or four octets, leaving aside the dilemma of combining accents. A UTF-16 character maps to one or two double-octets. It doesn't make sense to fetch a single primitive and work on it, because it may not be a complete character. It doesn't make sense to jump to an arbitrary position, because you might jump into the middle of a character. The internal character encoding is part of the text stream's type in my model.
It seems that it may be useful to allow the encoding to be specified at run-time. The size of each encoded unit would still have to be known at compile-time, though, so it is not clear exactly how important this is. [snip] -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
It seems like it would introduce significant complications to try to support any architecture with a non-octet byte (i.e. an architecture that does not have an 8-bit type). Perhaps it is best for this library to just completely ignore such architectures, since they are extremely rare anyway.
Current architectures without 8-bit bytes don't do any kind of file I/O or similar, AFAIK.

Jeremy Maitin-Shepard wrote:
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Because the exact unit of transport is still in the open (and the current tendency I see is toward using octets, and leaving native bytes to some other mechanism), I didn't want any such implication in the name. The name binary isn't a very good choice either, I admit. In the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in my mind works something like this: Binary data is in terms of octets, bytes, primitives, or PODs, whatever.
Perhaps the name "data stream" would be appropriate, or better yet, perhaps just "stream", and use the qualified name "text stream" or "character stream" to refer to streams of characters that are somehow marked (either at compile-time or run-time) with the encoding.
This might lead to ambiguity when addressing streams as a whole.
Should text streams of arbitrary (non-Unicode encodings) be supported? Also, should text streams support arbitrary base types (i.e. uint8_t or uint16_t, or some other type), or should be restricted to a single type (like uint16_t)?
Each encoding requires a specific base type. For example, UTF-8 requires uint8_t, UTF-16 requires uint16_t, UTF-16LE and BE require uint8_t (they do their own combining). The current binary formatting layer would be used by the converter to get units of the desired format.
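A minimal sketch of how that association between encoding and base type might be expressed (all names hypothetical, not taken from the design document):

    #include <cstdint>

    // Each encoding is a tag type carrying its code-unit type.
    struct utf_8    { typedef std::uint8_t  unit_type; };
    struct utf_16   { typedef std::uint16_t unit_type; };
    struct utf_16le { typedef std::uint8_t  unit_type; };  // combines octet pairs itself
    struct utf_16be { typedef std::uint8_t  unit_type; };

    // A text stream could then derive its transport unit from the encoding.
    template <typename Encoding>
    struct base_type { typedef typename Encoding::unit_type type; };
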
The reason for substantial unification between both "data streams" and "text streams" is that despite differences in how the data they transport is used, the interface should essentially be the same (both basic read/write, as well as things like mark/reset and seeking), and a buffering facility should be exactly the same for both types of streams.
Similarly, facilities for providing mark/reset support on top of a stream that does not support it by using buffering would be exactly the same for both "binary" and "text" streams.
Even if seek may not be as useful for text streams, it still might be useful to some people, and there is no reason to exclude it.
In the document you posted, for instance, you essentially just duplicated much of the description of binary streams for the text streams, which suggests a problem.
Yes, a text stream is essentially a binary stream with an encoding instead of a data type. So the interface is the same in description, but the types involved are different. I think this is mostly a documentation issue.
As a suggest below, a "text stream" might always be a very thin layer on top of a binary stream, that simply specifies an encoding. The issue, though, is how it would work to layer something like a regular buffer or a mark/reset-providing buffer on top of a text stream. There shouldn't have to be two mark/reset providers, one for data streams, and one for text stream, but also it should be possible to layer such a thing on top of a text stream directly, and still maintain the encoding annotation.
While this would be nice, I'm not sure the C++ type system supports such a thing directly. The library might provide templates that do.
This suggests that encoding/decoding/conversion should exist as a "data stream" operation.
It does. That's what the character converter device does.
One thing I haven't figured out, though, is how the underlying unit type of the stream, i.e. uint8_t or uint16_t, would correspond to the encoding. In particular, the issue is what underlying unit type corresponds to each of the following encodings:
- UTF-8, iso-8859-* (it seems obvious that uint8_t would be the choice here)
- UTF-16 (uint16_t looks promising, but you need to be able to read this from a file, which might be a uint8_t stream)
- UTF-16-LE/UTF-16-BE (uint16_t looks promising, but also uint16_le_t/uint16_be_t (a special type that might be defined in the endian library) might be better, and furthermore you need to be able to read this from a file, which might be a uint8_t stream)
Perhaps you have some ideas about the conceptual model that resolves these issues.
I have. There is a stream operation that allows conversion of the raw stream (of type octet or iobyte or something) into a stream of another primitive. That's the binary formatting layer.
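For concreteness, the kind of work such an assembling operation has to do for a single uint32_t unit might look like this; a sketch only, since the real layer would of course operate on whole streams and buffers:

    #include <cstdint>

    // Assemble one uint32_t from four octets taken from the raw stream,
    // according to the requested byte order.
    inline std::uint32_t assemble_uint32(const std::uint8_t octets[4], bool big_endian)
    {
        if (big_endian)
            return (std::uint32_t(octets[0]) << 24) | (std::uint32_t(octets[1]) << 16)
                 | (std::uint32_t(octets[2]) << 8)  |  std::uint32_t(octets[3]);
        else
            return (std::uint32_t(octets[3]) << 24) | (std::uint32_t(octets[2]) << 16)
                 | (std::uint32_t(octets[1]) << 8)  |  std::uint32_t(octets[0]);
    }
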
It seems that it may be useful to allow the encoding to be specified at run-time.
Only for the external encoding. The internal encoding should be fixed at compile time. Everything else is just too confusing. That's one important lesson I've learned trying to internationalize PHP web applications.
Sebastian Redl

Sebastian Redl <sebastian.redl@getdesigned.at> writes: [snip]
Perhaps the name "data stream" would be appropriate, or better yet, perhaps just "stream", and use the qualified name "text stream" or "character stream" to refer to streams of characters that are somehow marked (either at compile-time or run-time) with the encoding.
This might lead to ambiguity when addressing streams as a whole.
Okay, that is a good point. "Data stream" would probably be best then. I am quite averse to "binary stream", since it would really be a misuse, albeit a common misuse, of "binary".
Should text streams of arbitrary (non-Unicode encodings) be supported? Also, should text streams support arbitrary base types (i.e. uint8_t or uint16_t, or some other type), or should be restricted to a single type (like uint16_t)?
Each encoding requires a specific base type. For example, UTF-8 requires uint8_t, UTF-16 requires uint16_t, UTF-16LE and BE require uint8_t (they do their own combining). The current binary formatting layer would be used by the converter to get units of the desired format.
I see. You are suggesting, I suppose, that in addition to providing formatting of individual values, the binary formatting layer also provides stream filters for converting a stream of one type into a stream of another type with a particular formatting. I like this idea.

This seems to suggest, then, that if you want to convert UTF-32 (native endian) in a file to, say, a UTF-16 (native endian) text stream, you have to first convert the file's uint8_t stream to a uint32_t stream (native endian formatting), then mark this uint32_t stream as UTF-32, and then use a text encoding conversion filter to convert this UTF-32 stream to a UTF-16 (native endian) uint16_t stream. The trouble is, though, that if you then have a file with UTF-8 encoded text, you have to use different types to obtain the same UTF-16 uint16_t text stream. Furthermore, the encoding of the file might be supplied (by name) by the user at run-time as one of a large number of supported encodings; the base type of this encoding should not be particularly important. [snip]
Yes, a text stream is essentially a binary stream with an encoding instead of a data type. So the interface is the same in description, but the types involved are different. I think this is mostly a documentation issue.
Instead of a data type? But presumably both the data type and the encoding must be specified. Also, it seems like it may be useful to be able to specify the encoding at run-time, rather than just compile-time. [snip]
This suggests that encoding/decoding/conversion should exist as a "data stream" operation.
It does. That's what the character converter device does.
Well, the question is under what interface character encoding conversion should be done. It could be a text stream to text stream interface. [snip]
It seems that it may be useful to allow the encoding to be specified at run-time.
Only for the external encoding. The internal encoding should be fixed at compile time. Everything else is just too confusing. That's one important lesson I've learned trying to internationalize PHP web applications.
What does internal or external really mean though? That is somewhat of an artificial distinction in itself. It may be a reasonable one, but you'll have to define those terms. Which encodings will be supported at compile-time, then? Just UTF-8, UTF-16, and UTF-32? -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
Okay, that is a good point. "Data stream" would probably be best then. I am quite averse to "binary stream", since it would really be a misuse, albeit a common misuse, of "binary".
I'm using "unstructured stream" in the next iteration of the design document. Does that seem appropriate to you?
I see. You are suggesting, I suppose, that in addition to providing formatting of individual values, the binary formatting layer also provides stream filters for converting a stream of one type into a stream of another type with a particular formatting. I like this idea.
Yes, exactly. This can be very useful for reading data. However, it's not quite sufficient for run-time selection of a character encoding. For that, an interface that is not a stream of a single data type, but rather provides extraction of any data type at any time, is required.
This seems to suggest, then, that if you want to convert UTF-32 (native endian) in a file to say, a UTF-16 (native endian) text stream, you have to first convert the file uint8_t stream to a uint32_t stream (native endian formatting), and then mark this uint32_t stream as UTF-32, and then use a text encoding conversion filter to convert this UTF-32 stream to a UTF-16 (native endian) uint16_t stream.
Again, not quite. I suppose I should first define external and internal encoding, as you suggest below, because it is fundamental to this issue.

I am of the opinion that an application does not gain anything by having string-like data for processing that is not of an encoding known at compile time. For any given string the application actually processes, the encoding of the data must be known at compile time. Everything else is a mess. String types should be tagged with the encoding. Text streams should be tagged. Buffers for text should be tagged. This compile-time-known encoding is the internal encoding. (An application can have different internal encodings for different data, but not for the same piece of data.)

The external encodings are what external data is in. Files, network streams, user input, etc. When reading a file as text, the application must specify the external encoding of this file. (Or it can fall back to a default, but this is often unacceptable.) The external encoding must be specifiable at run time, obviously, because different files and different network connections can be in different encodings.

Suppose, then, that my application uses UTF-16 internally for everything. Endianness does not matter - does not exist, even - because UTF-16 uses uint16_t as the underlying type, and to the view of C++, endianness doesn't matter as long as the type isn't viewed as its components. To read in a text file in UTF-32, native endian, I could do this (tentative interface), if I know at compile time that the file is UTF-32:

    text_input_stream<utf_16> stream = open_file_input("filename.txt") // a file_input_stream
        .filter(buffer())                       // a buffered_input_stream<iobyte, file_input_stream>
        .filter(assembler<uint32_t, native>())  // an assembler_input_stream<uint32_t, native, buffered_input_stream<iobyte, file_input_stream>>
        .filter(text_decode<utf_16, utf_32>()); // a text_decode_stream<utf_32, assembler_input_stream<uint32_t, native, buffered_input_stream<iobyte, file_input_stream>>>

More likely, however, I would do this:

    auto assembler = generic_assembler<native_rules>(open_file_input("filename.txt").filter(buffer()));
    text_input_stream<utf_16> stream = text_decoder<utf_16>(assembler, "UTF-32");

assembler would be of type generic_assembler_t<native_rules, buffered_input_stream<iobyte, file_input_stream>> and would provide a single template member, read<T>(buffer<T> &target), that allows extracting any (primitive) type. The assembler follows the rules to provide the type. The text_decoder would then call this function using a type determined by the encoding specified. Yes, the read function would be instantiated for all types regardless of whether it's used, because the encoding is a run-time decision. But at some point, you need to bridge the gap between compile-time and run-time decisions.

Yet another alternative would be a direct_text_decoder<utf_16> that reads from a uint8_t stream and expects either that the encoding specifies the endianness (like "UTF-16BE") or that a byte order mark is present. Such a decoder would not be able to decode encodings where neither is the case.
Yes, a text stream is essentially a binary stream with an encoding instead of a data type. So the interface is the same in description, but the types involved are different. I think this is mostly a documentation issue.
Instead of a data type? But presumably both the data type and the encoding must be specified. Also, it seems like it may be useful to be able to specify the encoding at run-time, rather than just compile-time.
Instead of a data type. The data type of a text_stream<Encoding> is base_type<Encoding>::type. This is for internal use - the external use is different anyway.
Which encodings will be supported at compile-time, then? Just UTF-8, UTF-16, and UTF-32?
Whichever the library supplies. I think these three plus ASCII and Latin-1 would make a reasonable minimum requirement. Sebastian Redl

Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
Okay, that is a good point. "Data stream" would probably be best then. I am quite averse to "binary stream", since it would really be a misuse, albeit a common misuse, of "binary".
I'm using "unstructured stream" in the next iteration of the design document. Does that seem appropriate to you?
I suppose it depends on how text streams will differ from these "unstructured" streams. It seems that it may be the case that a single "unstructured" stream concept/interface will be defined, and text streams will be instances/implementations of this concept, but in addition provide other functionality. In that case, maybe just the name "stream" would indeed be appropriate.
I see. You are suggesting, I suppose, that in addition to providing formatting of individual values, the binary formatting layer also provides stream filters for converting a stream of one type into a stream of another type with a particular formatting. I like this idea.
Yes, exactly. This can be very useful for reading of data. However, it's not quite sufficient for runtime selection of a character encoding. For this, an interface that is not a stream of a single data type, but rather provides extraction of any data type at any time, is required.
Not necessarily, since the data formatting stream could still be created as needed by the encoding conversion code. Nonetheless, for direct use (e.g. for reading a complicated data format like an image format with fields of various sizes), a single filter that supports reading/writing any supported type in any supported format would likely be useful. It is not clear whether it would also be useful to provide additional filters for which certain information about the format (i.e. endianness) is specified in the type of the filter, rather than in the names of/template arguments to individual methods. [snip: example of code for encoding conversion] Perhaps a better default would be to assume the native byte order. It seems a bit unfortunate to have three different interfaces for encoding conversion from a file. I think it would likely be ideal to unify these into a single interface somehow.
Yes, a text stream is essentially a binary stream with an encoding instead of a data type. So the interface is the same in description, but the types involved are different. I think this is mostly a documentation issue.
Instead of a data type? But presumably both the data type and the encoding must be specified. Also, it seems like it may be useful to be able to specify the encoding at run-time, rather than just compile-time.
Instead of a data type. The data type of a text_stream<Encoding> is base_type<Encoding>::type. This is for internal use - the external use is different anyway.
Conceptually, it seemed that there might be advantages to considering streams of characters encoded in a non-Unicode encoding as text streams as well. Then there could be a somewhat uniform interface for all encoding conversions, since they would always convert a text stream to a text stream. There also seem to be advantages to this interface, though.
Which encodings will be supported at compile-time, then? Just UTF-8, UTF-16, and UTF-32?
Whichever the library supplies. I think these three plus ASCII and Latin-1 would make a reasonable minimum requirement.
Would ASCII really be useful since UTF-8 would be supported? Why is Latin-1 included? What is the argument for supporting iso-8859-1 but not the other iso-8859 encodings? Furthermore, what is the argument for not supporting any other arbitrary encoding? I don't think it is reasonable to give Latin-1 special status. -- Jeremy Maitin-Shepard

Mathias Gaunard wrote:
I personally think that there is no use in distinguishing between text and binary I/O. For I/O, you need a binary representation. To get it, you can serialize your object any way you like. This shouldn't be the job of a "binary formatting layer", even though having separate utilities to serialize/unserialize basic types to common representations would be useful (double to Little Endian IEEE 754 Double-precision for example). Also, I believe the narrow/wide characters and locales stuff is broken beyond all repair, so I wouldn't recommend to do anything related to that.
I agree on this. When I heard about iostreams for the very first time I thought: "Wow! C++ even has built-in support for streams, filters, sources and sinks!" I did not want to use std::iostreams for basic I/O. I thought it'd be a nice thing to build - say - an audio synthesizer on top of. Unfortunately I had to recognize that std::iostreams are about *character* streams and not about streams of arbitrary (user-defined) data. Furthermore, the different concepts (transportation, storage, formatting, ...) are coupled tightly and bound to the domain of character processing in such a way that it was impossible to specialize std::iostreams for my purposes. So, when thinking about the design of a universal I/O library, I suggest defining a list of operational aspects first and thereby making a clear distinction between general stream functionality (transportation, buffering, filtering, ...) and functionality that is specific to character streams (formatting, locales, cin/cout, ...). The latter ('character streams') could then be built on top of the low-level stream functionality ('data streams' or 'object streams'). Think of it as a two-layer OSI model for I/O. ;) just my 0.02€ cheers sascha

On 6/18/07, Sascha Seewald <vudu@gmx.net> wrote:
When I heard about iostreams for the very first time I thought: "Wow! C++ even has build-in support for streams, filters, sources and sinks!" I did not want to use std::iostreams for basic io. I thought it'd be a nice thing to build - say - an audio synthesizer on top of.
Hi Sascha, If that's the sort of thing you're interested in, I was wondering if you could take a look at the signal network summer of code project. I'd be very keen to get some feedback on it. The docs are at: http://dancinghacker.com/code/signet/ When you say building an audio synthesizer, that's the sort of thing I had in mind when designing it, so I'd be interested to hear how the library aligns or does not align with what you had in mind. Thanks! Stjepan

Sascha Seewald wrote:
When I heard about iostreams for the very first time I thought: "Wow! C++ even has build-in support for streams, filters, sources and sinks!" I did not want to use std::iostreams for basic io. I thought it'd be a nice thing to build - say - an audio synthesizer on top of.
Unfortunately I had to recognize that std::iostreams are about *character* streams and not about streams of arbitrary (user defined) data.[...]
So, when thinking about the design of an universal IO library I suggest to define a list of operational aspects first and thereby make clear distinction between general stream functionality (transportation, buffering, filtering, ...) and functionality that is specific to character streams (formatting, locals, cin/cout, ...).
Streams of arbitrary, user-defined data. Hmm ... I'm somewhat afraid of the semantic complexity that such freedom introduces. It's very important to me to keep the system simple, despite everything. I'm also afraid of a "one size fits none" system that tries to do everything and accomplishes nothing. What would be your requirements on such a stream system? How much flexibility do you need to accomplish your goals? This information would help me in evaluating how far my fears are justified.
The latter ('character streams') could then be build on-top of the low level stream functionality ('data streams' or 'object streams'). Think of it as a two-layer OSI model for io. ;)
That's exactly what I'm doing. Sebastian Redl

Mathias Gaunard wrote:
I personally think that there is no use in distinguishing between text and binary I/O. For I/O, you need a binary representation. To get it, you can serialize your object any way you like. This shouldn't be the job of a "binary formatting layer", even though having separate utilities to serialize/unserialize basic types to common representations would be useful (double to Little Endian IEEE 754 Double-precision for example).
I think there is definitely a need for binary I/O. It is not just a matter of serializing your objects in memory:
- Is your serialized object model portable between different compilers and operating systems? If not, you need to fix a binary external data representation, and read/write it using a binary formatting layer.
- Is your serialized object model compatible with serialized object models of different languages? (Suppose you need to get serialized data from C++ to Java.) If not, you need to fix a binary external data representation, and read/write it with a binary formatting layer.
- Does your serialized object model map to existing binary formats (e.g. binary files like JPEG, ...)? If not, you need to read/write these formats according to their specifications using a binary formatting layer.
In fact, a serialization library should be built on a binary formatting layer.
Best regards,
-- Ares Lagae
Computer Graphics Research Group, Katholieke Universiteit Leuven
http://www.cs.kuleuven.be/~ares/
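To make the "fixed external representation" point above concrete, a minimal sketch for a single field; std::ostream stands in here for whatever sink the formatting layer would actually write to:

    #include <cstdint>
    #include <ostream>

    // Write a 32-bit unsigned integer in a fixed big-endian external format,
    // regardless of the host's native byte order. A Java DataInputStream on the
    // other end reads exactly this byte layout.
    void write_uint32_be(std::ostream& out, std::uint32_t value)
    {
        char bytes[4];
        bytes[0] = char((value >> 24) & 0xFF);
        bytes[1] = char((value >> 16) & 0xFF);
        bytes[2] = char((value >> 8)  & 0xFF);
        bytes[3] = char( value        & 0xFF);
        out.write(bytes, 4);
    }
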

Hello, I have been thinking about changing IOStreams for a while but never came up with a design that actually got the stuff just right. I am tending toward a system that has a three-layer design. The first layer encompasses reading and writing the device, the second layer takes a first- or second-layer object and allows you to decorate or encapsulate it (decode UTF, ISO 8859, and so on), and the third layer is an object that provides a programmer's interface (in the case of legacy iostreams, support for operator<< and >>). In essence: a bottom layer, a top layer, and a "middle layer" that can be recursively applied. I recall the Java model being similar. Regards, Peter

Thank you to all who have commented so far. The discussion seems to be very interesting. I had a very busy week with some exams, but I will soon be able to go through all your replies. Sebastian Redl

Sebastian Redl <sebastian.redl@getdesigned.at> writes:
A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model.
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
I'd especially like input on the unresolved issues, but all comments are welcome, even if you tell me that what I'm doing is completely pointless and misguided. (At least I'd know not to waste my time with refining and implementing the design. :-))
I am pleased to see you taking an interest in a new I/O library for C++. The existing C++ I/O facilities have always bothered me, but I've never gotten around to trying to write something better. I have a number of comments. They aren't particularly well structured, because I didn't bother to try to reorganize them after initially just writing down thoughts as they occurred to me.

- I think it is important to look at the boost iostreams architecture, and make sure to include or reuse any of the ideas or even actual code if possible. One idea from that library to consider is the direct/indirect device distinction.

- Binary transport layer issue: Make the "binary transport layer" the "byte transport layer" to make it clear that it is for bytes. Platforms with unusual features, like 9-bit bytes or the inability to handle types less than 32 bits in size, can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.

- Asynchronous issue: Asynchronous I/O is extremely useful, but it also requires a very different architecture: something like asio's io_service is needed to manage requests, and a function to call on completion or error must be provided. One issue is that there are very large differences between platforms (Windows and Linux). On Linux, asynchronous I/O via efficient polling for readiness is possible for sockets and pipes using epoll (and somewhat less efficiently using select and poll), but these mechanisms cannot be used for regular files. I think there may be other asynchronous I/O mechanisms on Linux that do support regular files, at least on some filesystems, but which are not very easily compatible with epoll and other methods suitable for sockets. Furthermore, even if read and write are asynchronous, open will always be synchronous on Linux. It may not be feasible, therefore, to implement a proper asynchronous I/O interface on Linux. Even on Windows, I believe it may not be possible to get an asynchronous open. Thus, I think I agree that it would be better to avoid including an asynchronous I/O interface in this library, although probably a bit more thought should go into the decision before it is made.

- Seeking: Maybe make multiple mark/reset use the same interface as seeking, for simplicity. Just define that a seeking device has the additional restriction that the mark type is an offset, and the argument to seek need not be the result of a call to tell. (A rough sketch of such a unified interface appears at the end of this message.) Another issue is whether to standardize the return type from tell, like std::ios_base::streampos in the C++ iostreams library.

- Binary formatting (perhaps the name data format would be better?): I think it is important to provide a way to format {uint,int}{8,16,32,64}_t as either little or big endian two's complement (and possibly also one's complement). It might be useful to look at the not-yet-official boost endian library in the vault. A similar variety of output formats for floating point types should also be supported. It is also important to provide the most efficient output format as an option as well (i.e. writing the in-memory representation of the type directly, via e.g. reinterpret_cast). It should probably also be possible to determine using the library at compile time what the native format is. It is not clear what to do about the issue of some platforms not using any standard format as their native format.
- Header vs Precompiled: I think as much should be separately compiled as possible, but I also think that type erasure should not be used in any case where it will significantly compromise performance.

- The "byte" stream and the character stream, while conceptually different, should probably both be considered just "streams" of particular POD types. The interfaces will in general be exactly the same as far as reading, writing, seeking, and filtering are concerned.

- Text transport: I don't think this layer should be restricted to Unicode encodings. Rather, a text transport should just be a "stream" of type T, where T might be uint8_t, uint16_t, or uint32_t depending on the character encoding. For full generality, the library should provide facilities for converting between any two of a large list of encodings. (For simplicity, some of these conversions might internally be implemented by converting first to one encoding, like UTF-16, and then converting to the other encoding, if a direct conversion is not coded specially.) I think it is important to require that all of a minimal set of encodings are supported, where this minimal set should include at least all of the common Unicode encodings, and perhaps all of the iso-8859-* encodings as well, in addition to ASCII.

- Text formatting: For text formatting, I think it would be very useful to look at the IBM ICU library. It may in fact make sense to leave text formatting as a separate library (for example, as a Unicode library), since it is somewhat encoding specific, and a huge task by itself and not very related to this I/O library. As long as the I/O library provides a suitable character stream interface, an arbitrary formatting facility can be used on top of it.

-- Jeremy Maitin-Shepard
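A rough sketch of the unified mark/seek idea mentioned above; all names are hypothetical, not taken from the design document:

    #include <cstdint>

    // A markable stream hands out opaque marks and can return to them.
    // A seekable stream is simply the special case where the mark type is a
    // plain offset and reset() also accepts offsets it never handed out.
    template <typename Mark>
    struct markable_stream
    {
        virtual ~markable_stream() {}
        virtual Mark mark() = 0;          // remember the current position
        virtual void reset(Mark m) = 0;   // return to a remembered position
    };

    // A seekable stream under this scheme: Mark is just a 64-bit offset.
    typedef markable_stream<std::uint64_t> seekable_stream;
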

Jeremy Maitin-Shepard <jbms@cmu.edu> writes: [snip]
- Text transport:
I don't think this layer should be restricted to Unicode encodings.
It occurs to me that perhaps it is not unreasonable after all to restrict the library to supporting Unicode encodings for in-memory character representation. It would, I think, still be necessary to support all of the Unicode encodings, though, since it would be important that these facilities can be used efficiently with existing libraries that depend on particular encodings, like UTF-8 and UTF-16. [snip] -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
It occurs to me that perhaps it is not unreasonable after all to restrict the library to supporting Unicode encodings for in-memory character representation.
I personally believe Unicode (not only the character set, but also its collations and algorithms) is the only viable way to represent characters, and thus should be what strings work with. (get out, evil locales and other stuff!) Of course, various encodings can still be used for serialization. Unfortunately, C++ is quite far from having good Unicode tools (not that other programming languages are really better -- Unicode is simply quite complicated, because human languages just are). ICU has most of the stuff, but not with the right interfaces.

Mathias Gaunard <mathias.gaunard@etu.u-bordeaux1.fr> writes:
Jeremy Maitin-Shepard wrote:
It occurs to me that perhaps it is not unreasonable after all to restrict the library to supporting Unicode encodings for in-memory character representation.
I personally believe Unicode (not only the character set, but also its collations and algorithms) is the only viable way to represent characters, and thus should be the way strings work with. (get out evil locales and other stuff!) Of course, various encodings can still be used for serialization.
I agree that I personally would always want to use a Unicode encoding for handling text in my software. The question, though, is whether the new I/O library should actually force users to use a Unicode encoding for internal text representation. Even if other internal encodings are supported, Boost might still only provide actual text formatting facilities and other high-level text facilities for all Unicode encodings (UTF-8, UTF-16, and UTF-32) or even only a single Unicode encoding.
Unfortunately, C++ is quite far from having good Unicode tools (not that other programming languages are really better -- Unicode is simply quite complicated, because human languages just are)
ICU has most of the stuff, but not with the right interfaces.
A better I/O system might provide a very solid base on top of which proper higher level text facilities can be provided, quite possibly by incorporating pieces of ICU. -- Jeremy Maitin-Shepard

Mathias Gaunard wrote:
Jeremy Maitin-Shepard wrote:
It occurs to me that perhaps it is not unreasonable after all to restrict the library to supporting Unicode encodings for in-memory character representation.
I personally believe Unicode (not only the character set, but also its collations and algorithms) is the only viable way to represent characters, and thus should be the way strings work with. (get out evil locales and other stuff!) Of course, various encodings can still be used for serialization.
I'd like to note that Unicode consumes more memory than narrow encodings. This may not be desirable in all cases, especially when the application is not intended to support multiple languages in the majority of its strings (which, in fact, is quite a common case).

Andrey Semashev wrote:
I'd like to note that Unicode consumes more memory than narrow encodings.
That's quite dependent on the encoding used. The most popular memory-saving Unicode encoding is UTF-8, though, which doubles the size needed for non-ASCII characters compared to ISO-8859-*, for example. It's not that problematic though. Alternatives which use even less memory exist, but they have other disadvantages.
This may not be desirable in all cases, especially when the application is not intended to support multiple languages in its majority of strings (which, in fact, is a quite common case).
Algorithms to handle text boundaries, tailored grapheme clusters, collations (some of which are context-sensitive), etc. are needed to process any one language correctly. So you need Unicode anyway, and it is better to reuse the Unicode machinery than to work on top of a legacy encoding.

Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd like to note that Unicode consumes more memory than narrow encodings.
That's quite dependent on the encoding used. The most popular Unicode memory-saving encoding is UTF-8 though, which doubles the size needed for non ASCII characters compared to ISO-8859-* for example. It's not that problematic though.
UTF-8 is a variable-length character encoding, which complicates processing considerably. I'd rather stick to UTF-16 if I had to use Unicode. And it's already twice as big as ASCII.
Alternatives which use even less memory exist, but they have other disadvantages.
This may not be desirable in all cases, especially when the application is not intended to support multiple languages in its majority of strings (which, in fact, is a quite common case).
Algorithms to handle text boundaries, tailored grapheme clusters, collations (some of which are context-sensitive) etc. are needed to process correctly any one language. So you need Unicode anyway, and better reuse the Unicode stuff than work on top of a legacy encoding.
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc. There are cases where i18n is not needed at all - mostly server-side apps with minimal UI. Being forced to use Unicode internally in these cases means increased memory footprint and degraded performance due to encoding translation overhead.

Andrey Semashev wrote:
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc. There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
I agree. There are other cases where you need single-byte encodings. In many kinds of scientific work you rarely need anything other than ASCII (or Latin-1). In some kinds of scientific work it is essential to use single-byte encodings. For instance, when you need to index and search an annotated genome. This involves a huge amount of text, and anything other than a single-byte encoding might kill performance. The current C++ iostreams library needs to be replaced. I agree that the locale and narrow/wide stuff is broken beyond repair. But whatever is going to replace it needs to be flexible enough to handle many different encodings. Unicode should of course be the base. But we need at least ASCII, Latin-1, UTF-8, UTF-16, UCS-2 (for Windows compatibility) and UCS-4. And other users will undoubtedly have other needs. It would be very interesting if the Boost community tackles this problem. I will be excited to see the outcome of such an effort. -- Johan Råde

Andrey Semashev wrote:
UTF-8 is a variable character length encoding which complicates processing considerably.
It's trivial compared to the real Unicode work.
I'd rather stick to UTF-16 if I had to use Unicode.
UTF-16 is a variable-length encoding too. But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
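A minimal illustration, assuming UTF-16 code units held in a uint16_t:

    #include <cstdint>

    // UTF-16 is variable-length too: code points outside the Basic Multilingual
    // Plane are encoded as a surrogate pair of two 16-bit units.
    inline bool is_high_surrogate(std::uint16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
    inline bool is_low_surrogate (std::uint16_t unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

    // So even with UTF-16, indexing by code unit does not give you characters.
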
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc.
Identifiers of all kinds aren't text, they're just bytes. As for logging, I'm not too sure whether it should be localized or not. And I don't understand what you mean by system messages. I still don't understand why you want to work with other character sets. That will just require duplicating the tables and algorithms required to process the text correctly. See http://www.unicode.org/reports/tr10/ for an idea of the complexity of collations, which allow comparison of strings. As you can see, it has little to do with encoding, yet the tables etc. require the usage of the Unicode character set, preferably in a canonical form so that it can be quite efficient.
There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
Any application that processes or displays non-trivial text (meaning something other than options) should have internationalization.
Being forced to use Unicode internally in these cases means increased memory footprint and degraded performance due to encoding translation overhead.
What encoding translation are you talking about?

Mathias Gaunard wrote:
Any application that process or display non-trivial text (meaning something else than options) should have internationalization.
Is there any performance penalty when using UTF-8 instead of ASCII, for instance when searching text? If there is not, then I'd be happy with a UTF-8 / UTF-16 / UTF-32 solution. --Johan Råde

On 21/06/07, Johan Råde <rade@maths.lth.se> wrote:
Mathias Gaunard wrote:
Any application that process or display non-trivial text (meaning something else than options) should have internationalization.
Is there any performance penalty when using UTF-8 instead of ASCII, for instance when searching text? If there is not, then I'd be happy with an UTF-8 / UTF-16 / UTF-32 solution.
Within the bounds of the ASCII-compatible characters it's exactly the same (down to the byte content). For the other characters it uses an extended format that /should/ be character-convertible, provided all parties follow the actual Unicode standard. When searching ASCII text, it's equal; when searching non-ASCII text, all characters should have a unique encoding and should therefore match. Regards, Peter
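A trivial illustration of the ASCII case; nothing here is specific to any proposed library, since ASCII text is encoded identically in ASCII and UTF-8:

    #include <cassert>
    #include <string>

    int main()
    {
        // A plain byte-level search behaves the same whether the buffer is
        // labelled ASCII or UTF-8, as long as the content is ASCII-only.
        std::string ascii_text = "find the needle in the haystack";
        assert(ascii_text.find("needle") == 9);
    }
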

On 21/06/07, Johan Råde <rade@maths.lth.se> wrote:
Mathias Gaunard wrote:
Any application that process or display non-trivial text (meaning something else than options) should have internationalization.
Is there any performance penalty when using UTF-8 instead of ASCII, for instance when searching text? If there is not, then I'd be happy with an UTF-8 / UTF-16 / UTF-32 solution.
[Joe] I think the data formatting needs to be cleanly separated from the actual I/O layer and instead performed in some sort of string class. Much of the time, I want to format data but don't want to send it to an I/O layer. Instead I want to display it with a GUI, put it in a packet to be sent later, or some other operation. Performing the data formatting in the string class means the I/O library wouldn't have to care about Unicode issues. Any formatting issues would be taken care of by an appropriate string class.

I would prefer that something like boost format be part of the string class, though I prefer the specifiers from C# to the specifiers actually used by boost format. C# also allows custom formatters to be created and used, and I like that idea. Now, I am obviously not talking about using boost format as is, because it uses streams and I think that is too slow in general.

Having said all that, it would be great if the I/O system could automatically delegate the formatting so that the user doesn't explicitly have to create a string to do I/O. That is, I would want to do some equivalent of Write("X = {0:d}", 53) and have that come out, but I would like it to automatically create an appropriate string, do the formatting, and then dump the results out (or at least use the same mechanisms) rather than have the I/O system do the formatting.

As for everything else, using some sort of interface on a string class would allow various Unicode encodings to be included as they are written. I hesitate to base much on Unicode because there has been a plethora of Unicode classes proposed and none of them have seen the light of day as far as I can tell. This tells me that there are too many issues that crop up for the whole I/O system to be based upon it. Anyway, these are just some random thoughts I have had about the issue. joe
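A toy sketch of that kind of positional formatting, writing straight into a sink instead of building the final string first; it handles only the plain {0} form (no :d specifiers) and uses std::ostream merely as a stand-in sink:

    #include <cstddef>
    #include <iostream>
    #include <ostream>
    #include <string>

    // Minimal positional formatter: copies the format string to the sink,
    // substituting the single argument wherever "{0}" appears.
    template <typename T>
    void write_formatted(std::ostream& sink, const std::string& fmt, const T& arg0)
    {
        for (std::size_t i = 0; i < fmt.size(); ++i)
        {
            if (fmt[i] == '{' && i + 2 < fmt.size() && fmt[i + 1] == '0' && fmt[i + 2] == '}')
            {
                sink << arg0;   // delegate per-type formatting to the sink
                i += 2;
            }
            else
                sink << fmt[i];
        }
    }

    int main()
    {
        write_formatted(std::cout, "X = {0}\n", 53);   // prints "X = 53"
    }
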

Greer, Joe wrote:
Performing the data formatting in the string class means the I/O library wouldn't have to care about Unicode issues.
I'm afraid that's not true. The I/O library has to care about Unicode (and general character encoding) if it wants to handle text without placing the entire burden of making the text suitable for I/O (i.e. writing it to a file) on the programmer. On the other hand, the I/O library doesn't need to care about any Unicode issues beyond simple encoding/decoding for simple text transport. Formatting doesn't have to care too much either (for example, collating is not interesting to generic formatting), and it'll be very separate anyway. I have written at length in another post on why I think formatting should be based on the I/O interfaces.
though I prefer the specifiers from C# to the specifiers actually used by boost format. C# also allows custom formatters to be created and used and I like that idea.
I like the idea, too, but I studied the C# approach a bit and don't like the way they're doing it. I cannot make head nor tail of the three involved interfaces (ITextFormatter, IFormattable, IFormatProvider), but it doesn't sound that flexible to me. However, it gave me some ideas. I hope to be able to present them to the community soon. Sebastian Redl

Peter Bindels wrote:
When searching ASCII text, it's equal;
Not if you handle grapheme clusters. If your text is "abcfoôdef", with ô coded as o + combining accent, then searching for "foo" shouldn't work, since you would only find part of the grapheme cluster and possibly do weird things if for example the substring is removed.

On 23/06/07, Mathias Gaunard <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
Peter Bindels wrote:
When searching ASCII text, it's equal;
Not if you handle grapheme clusters.
If your text is "abcfoôdef", with ô coded as o + combining accent, then searching for "foo" shouldn't work, since you would only find part of the grapheme cluster and possibly do weird things if for example the substring is removed.
Neither combining accents nor, in fact, any accented characters were in ASCII last time I checked.

Peter Bindels wrote:
On 23/06/07, Mathias Gaunard <mathias.gaunard@etu.u-bordeaux1.fr> wrote:
Peter Bindels wrote:
When searching ASCII text, it's equal; Not if you handle grapheme clusters.
If your text is "abcfoôdef", with ô coded as o + combining accent, then searching for "foo" shouldn't work, since you would only find part of the grapheme cluster and possibly do weird things if for example the substring is removed.
Neither combining accents nor, in fact, any accented characters were in ASCII last time I checked.
Exactly what question is being discussed here? I thought the question was, how fast is text search with UTF-8 strings that happen to contain ASCII only, compared with text search with ASCII strings. Even if the UTF-8 strings happen to contain ASCII, the search algorithm may still have to check for combining characters. The wider question is, should people who currently use ASCII and care a lot about performance and don't care about i18n switch to UTF-8? It would certainly simplify life if all strings were UTF-n. I have had to deal with BSTR, CString, QString and others. Just having to deal with a single string type would be a good thing. --Johan Råde

The wider question is, should people who currently use ASCII and care a lot about performance and don't care about i18n switch to UTF-8?
It would certainly simplify life if all strings were UTF-n. I have had to deal with BSTR, CString, QString and others. Just having to deal with a single string type would be a good thing.
You will still have to deal with all of them because they are either different interfaces to the string (CString, QString and others) or impose additional requirements on the storage properties (BSTR). Moving to Unicode won't make them all the same thing.

On 23/06/07, Johan Råde <rade@maths.lth.se> wrote:
Peter Bindels wrote:
Neither combining accents nor, in fact, any accented characters were in ASCII last time I checked.
Exactly what question is being discussed here?
As far as I was concerned, the question was to what extent switching from ASCII to UTF-8 would impact performance, if only 7-bit characters were being used.
I thought the question was, how fast is text search with UTF-8 strings that happen to contain ASCII only, compared with text search with ASCII strings.
Exactly.
Even if the UTF-8 strings happen to contain ASCII, the search algorithm may still have to check for combining characters.
The wider question is, should people who currently use ASCII and care a lot about performance and don't care about i18n switch to UTF-8?
I think it's best to switch to UTF-8, for the simple reason that it's either equally fast, or a bit slower but more correct. At the very least, from a corporate perspective, UTF-8 would strongly reduce development time by automatically (up to a certain limit) coping with behaviour beyond plain ASCII, whilst barely impacting performance when using only ASCII features. Sorting in particular is an odd case. Having read only the introduction of the collation algorithm, I'm not entirely certain about the complexity, but I think it should come out to about O(n), similar to comparing two ASCII strings. It'll have a higher constant factor, but that won't be decisive for performance in anything but the special case of high-performance computing centered around string usage.
From a purely technical perspective, ASCII is essentially a base case for UTF-8. If you properly wrap UTF-8, you can keep the entire technical complexity of it under the covers (putting the collation under operator< and operator==, putting multibyte character handling under operator<< and operator>>). That makes it as attractive as any string class is now over a (const) char *.
Regards, Peter
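To make the "under the covers" point concrete, here is a rough interface sketch. It is purely illustrative and not any proposed design: the class name is hypothetical, and collate() stands for some locale-aware collation facility assumed to exist elsewhere.

    #include <string>

    // Assumed to exist: a locale-aware three-way collation of two UTF-8 strings.
    int collate(const std::string& lhs, const std::string& rhs);

    class utf8_string
    {
        std::string bytes_; // well-formed UTF-8, ideally kept in one normalization form
    public:
        explicit utf8_string(const std::string& utf8_bytes) : bytes_(utf8_bytes) {}

        // Byte-wise equality is only correct if both sides use the same
        // normalization form; otherwise a normalizing compare is needed.
        friend bool operator==(const utf8_string& a, const utf8_string& b)
        { return a.bytes_ == b.bytes_; }

        // Human-friendly ordering delegates to collation rather than byte order.
        friend bool operator<(const utf8_string& a, const utf8_string& b)
        { return collate(a.bytes_, b.bytes_) < 0; }

        const std::string& raw() const { return bytes_; } // byte-level access
    };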

Peter Bindels <dascandy <at> gmail.com> writes:
On 23/06/07, Mathias Gaunard <mathias.gaunard <at> etu.u-bordeaux1.fr> wrote:
If your text is "abcfoôdef", with ô coded as o + combining accent, then searching for "foo" shouldn't work, since you would only find part of the grapheme cluster and possibly do weird things if for example the substring is removed.
Neither combining accents nor, in fact, any accented characters were in ASCII last time I checked.
You were talking of searching for an ASCII string in an UTF-8 one.

Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd rather stick to UTF-16 if I had to use Unicode.
UTF-16 is a variable-length encoding too.
But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
Technically, yes. But most of the widely used character sets fit into UTF-16. That means that I, having said that my app is localized to languages A B and C, may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
I'm not saying that we don't need Unicode support. We do! I'm only saying that in many cases plain ASCII does its job perfectly well: logging, system messages, simple text formatting, texts in restricted character sets, like numbers, phone numbers, identifiers of all kinds, etc.
Identifiers of all kinds aren't text, they're just bytes.
Not always. I may get such an identifier from a text-based protocol primitive, thus I can handle it as a text. This assumption may allow more opportunities to various optimizations.
As for logging, I'm not too sure whether it should be localized or not.
I can think of only a single case where logging needs i18n: when you have to log external data, such as client app queries or DB responses. That need is questionable in the first place, because it may introduce serious security holes. As for regular logging, I feel quite fine with narrow logs and don't see why I would want to make them wide.
And I don't understand what you mean by system messages.
Error and warning descriptions that may come either from your application or from the OS, some third-party API or language runtime. Although, I may agree that these messages could be localized too, but to my mind it's an overkill. Generally, I don't need std::bad_alloc::what() returning Russian or Chinese description.
I still don't understand why you want to work with other character sets.
Because I have the impression that it can be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the guiding principle of C++.
That will just require duplicating the tables and algorithms required to process the text correctly.
What algorithms do you mean and why would they need duplication?
See http://www.unicode.org/reports/tr10/ for an idea of the complexity of collations, which allow comparison of strings. As you can see, it has little to do with encoding, yet the tables etc. require the usage of the Unicode character set, preferably in a canonical form so that it can be quite efficient.
The collation is just an approach to performing string comparison and ordering. I don't see how it is related to the efficiency questions I mentioned. Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
Any application that processes or displays non-trivial text (meaning something other than options) should have internationalization.
I have to disagree. I18n is good when it's needed, i.e. when there are users that will appreciate it or when it's required by the application domain and functionality. Otherwise, IMO, it's a waste of effort at the development stage and of system resources at the execution stage.
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream. If the stream is using Unicode internally, it has to translate between the file encoding and its internal encoding every time I output or input something. I don't think that's the way it should be. I'd rather have the opportunity to choose the encoding I want to work with and have it used through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools. PS: I have a slight feeling that we have a misunderstanding at this point...

Andrey Semashev <andysem@mail.ru> writes:
Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd rather stick to UTF-16 if I had to use Unicode.
UTF-16 is a variable-length encoding too.
But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
Technically, yes. But most of the widely used character sets fit into UTF-16. That means that I, having said that my app is localized to languages A B and C, may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
Note that even if you can represent a single Unicode code point in your underlying type for storing a single unit of encoded text, you still have the issue of combining characters and such. Thus, it is not clear how a fixed-width encoding makes text processing significantly easier; I'd be interested if you have some examples where it does make processing significantly easier. [snip]
As for logging, I'm not too sure whether it should be localized or not.
I can think of only a single case where logging needs i18n: when you have to log external data, such as client app queries or DB responses. That need is questionable in the first place, because it may introduce serious security holes. As for regular logging, I feel quite fine with narrow logs and don't see why I would want to make them wide.
I don't think the narrow/wide terminology is very helpful for discussing the issues here. I think of internationalization as supporting multiple languages/locales at run-time (or conceivably just at compile-time), which would mean supporting multiple languages and formatting conventions for the log messages. Whether this is useful obviously depends on the intended use for the logs. The issue of external text data in log messages is really just a particular instance of outputting some representation of program data to the log message, and need not really be considered here. (It may, for instance, involve some sort of encoding conversion, or using some sort of escape syntax.) Note that even if log messages are only in a single language, there is still the issue of how the text of the messages is to be represented (i.e. what encoding to use). [snip]
I still don't understand why you want to work with other character sets.
Because I have the impression that it can be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the guiding principle of C++.
It is important to support this principle. It is useful to consider exactly what the costs and benefits are of standardizing on a single encoding (likely UTF-16) for high-level text processing. In some cases trying to use templates to avoid some people paying for what they don't need results in everyone paying, in compile-time, possibly in developer time, and sometimes in run-time due to compilers not being perfect.
That will just require duplicating the tables and algorithms required to process the text correctly.
What algorithms do you mean and why would they need duplication?
Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
See http://www.unicode.org/reports/tr10/ for an idea of the complexity of collations, which allow comparison of strings. As you can see, it has little to do with encoding, yet the tables etc. require the usage of the Unicode character set, preferably in a canonical form so that it can be quite efficient.
The collation is just an approach to performing string comparison and ordering. I don't see how it is related to the efficiency questions I mentioned. Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
The complexity remains the same if operator[] indexes over encoded units, or you are iterating over the encoded units. Clearly, if you want an iterator that converts from the existing encoding, which might be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity. As stated previously, however, it is not clear why this is likely to be a frequently useful operation.
There are cases where i18n is not needed at all - mostly server-side apps with minimal UI.
Any application that processes or displays non-trivial text (meaning something other than options) should have internationalization.
I have to disagree. I18n is good when it's needed, i.e. when there are users that will appreciate it or when it's required by the application domain and functionality. Otherwise, IMO, it's a waste of effort at the development stage and of system resources at the execution stage.
It is true that forcing certain text operations to be done in UTF-16 (as opposed to a fixed-width 1-byte encoding) would slow down certain text processing. In some cases, using UTF-8 would not hurt performance. It is not clear
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream.
For simplicity, we can avoid using the "narrow"/"wide" terminology and say you have a text file encoded using a 1-byte fixed width encoding, like ASCII or iso-8859-1.
If the stream is using Unicode internally, it has to translate between the file encoding and its internal encoding every time I output or input something. I don't think that's the way it should be. I'd rather have the opportunity to choose the encoding I want to work with and have it used through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools.
I can see the use for this. I think it may well be important for the new I/O framework to support streams of arbitrary types and encodings.
There is an issue, though, in attempting to have the text formatting system support arbitrary encodings: I think we all agree that it needs to support a large number of locales, and by locale I mean a set of formatting conventions for formatting e.g. numbers and dates, and also some other things that you may not care as much about, like how to order strings. If multiple encodings are also supported, then either encoding conversions would have to be done, which is what you want to avoid, or the formatting and other information needs to be duplicated for each supported encoding for each locale, which would mean the amount of data that needs to be stored would be doubled or tripled. Since all but the needed data can likely remain on disk, however, this may be reasonable.
One strategy would be to necessarily store the data for all locales in at least one of the Unicode encodings (or maybe UTF-16 would be required), and then implementations can provide the data in other encodings for locales as well (this data can likely be generated automatically from the UTF-16 data); some data, like collation data, would likely be provided only for Unicode encodings, so collation might only be provided for Unicode encodings. Then at run time, to format text given a particular locale and encoding, if there is already data for that locale in the desired encoding, it can be used without conversion; otherwise, the Unicode data is converted to the desired encoding. -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Mathias Gaunard wrote:
Andrey Semashev wrote:
I'd rather stick to UTF-16 if I had to use Unicode. UTF-16 is a variable-length encoding too.
But anyway, Unicode itself is a variable-length format, even with the UTF-32 encoding, simply because of grapheme clusters.
Technically, yes. But most of the widely used character sets fit into UTF-16. That means that I, having said that my app is localized to languages A B and C, may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
Note that even if you can represent a single Unicode code point in your underlying type for storing a single unit of encoded text, you still have the issue of combining characters and such. Thus, it is not clear how a fixed-width encoding makes text processing significantly easier; I'd be interested if you have some examples where it does make processing significantly easier.
I may not support character combining from several code points if it is not used or uncommon in languages A, B and C. Moreover, many precombined characters exist in Unicode as a single code point. [snip]
That will just require duplicating the tables and algorithms required to process the text correctly.
What algorithms do you mean and why would they need duplication?
Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
Why would these algorithms need duplication? If we have all locale-specific traits and tools, such as collation tables, character checking functions like isspace, isalnum, etc., along with new ones that might be needed for Unicode, encapsulated into locale classes, the essence of the algorithms should be independent from the text encoding.
Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
The complexity remains the same if operator[] indexes over encoded units, or you are iterating over the encoded units. Clearly, if you want an iterator that converts from the existing encoding, which might be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity. As stated previously, however, it is not clear why this is likely to be a frequently useful operation.
What do you mean by encoded units? What I was saying, if we have a UTF-8 encoded string that contains both latin and national characters that encode to several octets, it becomes a non-trivial task to extract i-th character (not octet) from the string. Same problem with iteration - iterator has to analyze the character it points to to adjust its internal pointer to the beginning of the next character. The same thing will happen with true UTF-16 and UTF-32 support. As an example of the need in such functionality, it is widely used in various text parsers. [snip]
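To illustrate the cost described here, a minimal sketch (added for illustration; the names are made up): advancing by one code point in UTF-8 means inspecting the lead byte, so finding the i-th character requires walking from the start, making character-indexed access O(i) rather than O(1).

    #include <cstddef>
    #include <string>

    // Number of bytes in the UTF-8 sequence that starts with this lead byte.
    std::size_t utf8_sequence_length(unsigned char lead)
    {
        if (lead < 0x80) return 1;          // plain ASCII
        if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
        if ((lead >> 4) == 0xE) return 3;   // 1110xxxx
        return 4;                           // 11110xxx (assumes a valid lead byte)
    }

    // Byte offset of the i-th code point, found by walking from the beginning.
    std::size_t offset_of_code_point(const std::string& utf8, std::size_t i)
    {
        std::size_t pos = 0;
        while (i-- > 0 && pos < utf8.size())
            pos += utf8_sequence_length(static_cast<unsigned char>(utf8[pos]));
        return pos;
    }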
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream.
For simplicity, we can avoid using the "narrow"/"wide" terminology and say you have a text file encoded using a 1-byte fixed width encoding, like ASCII or iso-8859-1.
That was exactly what I meant by term "narrow". :) But I'm happy to say "1-byte fixed width encoding" instead of it to reduce misunderstanding.

Andrey Semashev <andysem@mail.ru> writes: [snip]
I may not support character combining from several code points if it is not used or uncommon in languages A, B and C. Moreover, many precombined characters exist in Unicode as a single code point.
They do; it may be largely for compatibility reasons, I think. I don't think it is a very good idea to attempt to provide partial support for Unicode by supporting only single-code-point grapheme clusters. Furthermore, I don't see that it would be a huge gain.
[snip]
That will just require duplicating the tables and algorithms required to process the text correctly.
What algorithms do you mean and why would they need duplication?
Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
Why would these algorithms need duplication? If we have all locale-specific traits and tools, such as collation tables, character checking functions like isspace, isalnum, etc., along with new ones that might be needed for Unicode, encapsulated into locale classes, the essence of the algorithms should be independent from the text encoding.
Using standard data tables, and a single algorithm that merely accesses the locale-specific data tables, you can provide these algorithms for UTF-16 (and other Unicode encodings) for essentially all locales. This is done by libraries like IBM ICU. Providing them in addition for other encodings, however, would require separate data tables and separate implementations.
Besides, comparison is not the only operation on strings. I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
The complexity remains the same if operator[] indexes over encoded units, or you are iterating over the encoded units. Clearly, if you want an iterator that converts from the existing encoding, which might be UTF-8 or UTF-16, to UTF-32, then there will be greater complexity. As stated previously, however, it is not clear why this is likely to be a frequently useful operation.
What do you mean by encoded units?
By encoded units I mean e.g. a single byte with utf-8, or a 16-bit quantity with utf-16, as opposed to a code point.
What I was saying, if we have a UTF-8 encoded string that contains both latin and national characters that encode to several octets, it becomes a non-trivial task to extract i-th character (not octet) from the string. Same problem with iteration - iterator has to analyze the character it points to to adjust its internal pointer to the beginning of the next character. The same thing will happen with true UTF-16 and UTF-32 support. As an example of the need in such functionality, it is widely used in various text parsers.
I'm still not sure I quite see it. I would think that the most common case in parsing text is to read it in order from the beginning. I suppose in some cases you might be parsing something where you know a field is aligned to e.g. the 20th character, but such formats tend to assume very simple encodings anyway, because they don't make much sense if you are to support complicated accents and such. -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
That will just require duplicating the tables and algorithms required to process the text correctly. What algorithms do you mean and why would they need duplication? Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
Why would these algorithms need duplication? If we have all locale-specific traits and tools, such as collation tables, character checking functions like isspace, isalnum, etc., along with new ones that might be needed for Unicode, encapsulated into locale classes, the essence of the algorithms should be independent from the text encoding.
Using standard data tables, and a single algorithm that merely accesses the locale-specific data tables, you can provide these algorithms for UTF-16 (and other Unicode encodings) for essentially all locales. This is done by libraries like IBM ICU. Providing them in addition for other encodings, however, would require separate data tables and separate implementations.
I still can't see why one would need to reimplement algorithms. Their logic is the same regardless of the encoding.
What I was saying, if we have a UTF-8 encoded string that contains both latin and national characters that encode to several octets, it becomes a non-trivial task to extract i-th character (not octet) from the string. Same problem with iteration - iterator has to analyze the character it points to to adjust its internal pointer to the beginning of the next character. The same thing will happen with true UTF-16 and UTF-32 support. As an example of the need in such functionality, it is widely used in various text parsers.
I'm still not sure I quite see it. I would think that the most common case in parsing text is to read it in order from the beginning. I suppose in some cases you might be parsing something where you know a field is aligned to e.g. the 20th character,
There may be different parsing techniques, depending on the text format. Sometimes only character iteration is sufficient, in case of forward sequential parsing. There is no restriction, though, to perform non-sequential parsing (in case if there is some table of contents with offsets or each field to be parsed is prepended with its length).
but such formats tend to assume very simple encodings anyway, because they don't make much sense if you are to support complicated accents and such.
If all standard algorithms and classes assume that the text being parsed is in Unicode, they cannot be optimized for simpler encodings. The std::string or regex or stream classes will always have to treat the text as Unicode.

Andrey Semashev <andysem@mail.ru> writes:
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
That will just require duplicating the tables and algorithms required to process the text correctly. What algorithms do you mean and why would they need duplication? Examples of such algorithms are string collation, comparison, line breaking, word wrapping, and hyphenation.
Why would these algorithms need duplication? If we have all locale-specific traits and tools, such as collation tables, character checking functions like isspace, isalnum, etc., along with new ones that might be needed for Unicode, encapsulated into locale classes, the essence of the algorithms should be independent from the text encoding.
Using standard data tables, and a single algorithm that merely accesses the locale-specific data tables, you can provide these algorithms for UTF-16 (and other Unicode encodings) for essentially all locales. This is done by libraries like IBM ICU. Providing them in addition for other encodings, however, would require separate data tables and separate implementations.
I still can't see why one would need to reimplement algorithms. Their logic is the same regardless of the encoding.
I'll admit I haven't looked closely at the collation algorithm given by the Unicode specifications recently, so it is hard for me to give details. String collation is in general lexicographical on the grapheme clusters, but for some languages there may be certain exceptions (someone please correct me if I am mistaken). Perhaps someone with more knowledge can elaborate, but I believe the Unicode collation algorithms are indeed highly specific to Unicode.
What I was saying, if we have a UTF-8 encoded string that contains both latin and national characters that encode to several octets, it becomes a non-trivial task to extract i-th character (not octet) from the string. Same problem with iteration - iterator has to analyze the character it points to to adjust its internal pointer to the beginning of the next character. The same thing will happen with true UTF-16 and UTF-32 support. As an example of the need in such functionality, it is widely used in various text parsers.
I'm still not sure I quite see it. I would think that the most common case in parsing text is to read it in order from the beginning. I suppose in some cases you might be parsing something where you know a field is aligned to e.g. the 20th character,
There may be different parsing techniques, depending on the text format. Sometimes only character iteration is sufficient, in case of forward sequential parsing. There is no restriction, though, to perform non-sequential parsing (in case if there is some table of contents with offsets or each field to be parsed is prepended with its length).
Such a format would likely then not really be text, since it would contain embedded offsets (which might likely not be text). But in any case, the offsets could simply be provided as byte offsets (or encoded unit offsets), rather than character or grapheme cluster offsets, and then there is no problem. Note: I'm using the term "encoded unit" because I can't recall the proper term.
but such formats tend to assume very simple encodings anyway, because they don't make much sense if you are to support complicated accents and such.
If all standard algorithms and classes assume that the text being parsed is in Unicode, they cannot be optimized for simpler encodings. The std::string or regex or stream classes will always have to treat the text as Unicode.
Well, since std::string and boost::regex already exist and do not assume Unicode (or even necessarily support it very well; I've seen some references to boost::regex providing Unicode support, but I haven't looked into it), that is not likely to occur. I think it is certainly important to provide some support for non-Unicode encodings. In particular, converting between arbitrary encodings should certainly be supported. Depending on to what extent your parsing/processing relies on library text processing facilities above basic encoding conversion, it may or may not be feasible to directly process non-Unicode text if only this very basic level of support is provided. It would be useful to explore how much trouble it is to support arbitrary non-Unicode encodings, and also to explore how useful it is to be able to format/parse numbers, dates (and perhaps currencies), for instance, in non-Unicode encodings. -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
There may be different parsing techniques, depending on the text format. Sometimes only character iteration is sufficient, in case of forward sequential parsing. There is no restriction, though, to perform non-sequential parsing (in case if there is some table of contents with offsets or each field to be parsed is prepended with its length).
Such a format would likely then not really be text, since it would contain embedded offsets (which might likely not be text).
Why not? See GCC symbols mangling for example.
If all standard algorithms and classes assume that the text being parsed is in Unicode, it cannot perform optimizations in a more efficient manner. The std::string or regex or stream classes will always have to treat the text as Unicode.
Well, since std::string and boost::regex already exist and do not assume Unicode (or even necessarily support it very well; I've seen some references to boost::regex providing Unicode support, but I haven't looked into it), that is not likely to occur.
Actually, std::string (or basic_string) does not support Unicode since it operates on a per-value_type basis. IOW, it won't recognize code sequences. Same thing with streams. As for Boost.Regex, it has such support, but it is optional (i.e. it allows 1-octet fixed width strings for processing). And I believe that is the way to do it in the other components we're discussing.
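A small illustration of the per-value_type point (this snippet is added, not from the thread): the byte count and the character count diverge as soon as a multi-byte sequence appears.

    #include <string>
    #include <cassert>

    int main()
    {
        std::string s("\xC3\xA9");   // "é" (U+00E9) encoded as two UTF-8 bytes
        assert(s.size() == 2);       // two value_type units, but only one character
        return 0;
    }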

Andrey Semashev wrote:
Technically, yes. But most of the widely used character sets fit into UTF-16. That means that I, having said that my app is localized to languages A B and C, may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
That can be used as an optimization. The container should still only support bidirectional traversal, for characters, that is. A Unicode string should also support a more efficient byte-like traversal.
And anyway, you forgot the part about grapheme clusters. Since you don't seem to know what they are, even though I mentioned them several times, I will shortly explain them to you.
In Unicode, it is possible to use combining characters to create new characters, and those may be equivalent to other existing ready-made characters (which may or may not exist). For example, "dé" might be represented by the two code points 'd' (100) and 'é' (233) -- that's e with an acute accent, in case you can't see it. (In UTF-8, 'é' would be the two bytes [195, 169].) It might also be represented by the three code points 'd' (100), 'e' (101) and combining acute accent (769), the combining acute accent being, in UTF-8, the two bytes [204, 129].
The character 'é', described as a combining sequence of the 'e' code point and the combining acute accent, is equivalent to the ready-made 'é'. That's one unique character and it shouldn't be split in the middle, obviously, since that would alter the meanings of other characters or potentially invalidate the string. Some characters may actually use even more combining code points; it's not limited to one. Of course, there is a canonical ordering.
There are of course other uses than accents for such things. In Hangul (Korean) the characters can be written by combining different parts of their ideograph (from what I understood). As you can see, characters may lie on top of a variable number of code points. Of course, processing of such text can be simplified by maintaining the strings in a canonical state, like Normalization Form C.
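To make the equivalence tangible, a tiny snippet (added for illustration, not part of the original post) comparing the two byte sequences: byte-wise they differ even though they denote the same character, so equality needs normalization (e.g. to NFC) first.

    #include <string>
    #include <cassert>

    int main()
    {
        std::string precomposed("\xC3\xA9");     // U+00E9: "é" as a single code point
        std::string decomposed("\x65\xCC\x81");  // 'e' + U+0301 combining acute accent

        assert(precomposed != decomposed);       // same character, different bytes
        return 0;
    }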
Not always. I may get such an identifier from a text-based protocol primitive, thus I can handle it as a text. This assumption may allow more opportunities to various optimizations.
Text in a given human language is not exactly the same as textual data which may not use any word.
Error and warning descriptions that may come either from your application or from the OS, some third-party API or language runtime. Although, I may agree that these messages could be localized too, but to my mind it's an overkill. Generally, I don't need std::bad_alloc::what() returning Russian or Chinese description.
Not localizing your error messages is probably the worst thing you can do. I'm pretty sure the user would be frustrated if he gets an error in a language he doesn't understand well.
Because I have the impression that it can be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the guiding principle of C++.
Well then you can't have a unified text processing facility for all languages, which is the point of Unicode.
What algorithms do you mean and why would they need duplication?
The algorithms defined by the Unicode Standard, like the collation one, along with the many tables it requires to do its job. Those algorithms and tables are defined for Unicode, and it can be more or less difficult to adapt them to another character set.
The collation is just an approach to performing string comparison and ordering. I don't see how it is related to the efficiency questions I mentioned.
It's not "an approach". It is "the approach". This is what you need if you want to order strings (in a human way) or match loosely (case-insensitive search or the like).
Besides, comparison is not the only operation on strings.
String searching and comparison are probably the most used things in a string.
I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
Iterating over the "true" characters would be a ridiculously inefficient operation -- especially if wanting to keep the guarantee that modifying the value pointed by an iterator doesn't invalidate the others --, and should be clearly avoided. I don't think there is much code in high-level programming languages that iterate over the strings.
What encoding translation are you talking about?
Let's assume my app works with a narrow text file stream. If the stream is using Unicode internally, it has to translate between the file encoding and its internal encoding every time I output or input something. I don't think that's the way it should be. I'd rather have an opportunity to chose the encoding I want to work with and have it through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools.
The stream shouldn't be using any text representation or facility, but only be a convenience to write stuff in an agnostic way. Of course, the text processing layer which IMO should be quite separate will probably work with Unicode, but you don't have to use it. You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces.

I've already answered some of Jeremy Maitin-Shepard's arguments, which are similar to yours, so I might be repeating myself. Mathias Gaunard wrote:
Andrey Semashev wrote:
Technically, yes. But most of the widely used character sets fit into UTF-16. That means that I, having said that my app is localized to languages A B and C, may treat UTF-16 as a fixed-length encoding if these languages fit in it. If they don't, I'd consider moving to UTF-32.
That can be used as an optimization. The container still should only support bidirectional traversal, for characters, that is. An Unicode string should also support a more efficient byte-like traversal.
And anyway, you forgot the part about grapheme clusters. Since you don't seem to know what they are, even though I mentioned them several times, I will shortly explain that to you.
[snip] Thank you for explaining this to me. I've heard and read of such character combining but I never had to support it in practice. But as you noted yourself, there are many precombined single code point characters in Unicode. I'm not aware of the number of such characters, but I tend to think they cover the majority of the commonly used combining sequences. This, in conjunction with the fact that I support a limited set of languages in my example application, allows me to perform the aforementioned optimizations. The CJK languages, of course, are a whole different story, and in order to support them we need true Unicode processing. There's no argument on my side on this.
Not always. I may get such an identifier from a text-based protocol primitive, thus I can handle it as a text. This assumption may allow more opportunities to various optimizations.
Text in a given human language is not exactly the same as textual data which may not use any word.
Yes, but it doesn't prevent me from processing it as a text, does it?
Error and warning descriptions that may come either from your application or from the OS, some third-party API or language runtime. Although, I may agree that these messages could be localized too, but to my mind it's an overkill. Generally, I don't need std::bad_alloc::what() returning Russian or Chinese description.
Not localizing your error messages is probably the worst thing you can do. I'm pretty sure the user would be frustrated if he gets an error in a language he doesn't understand well.
Maybe. And maybe not, if the only one who sees these messages is a mature system administrator, and the messages are in English. Once again, I was speaking of server-side applications. I understand, though, that such cases may not be the common ones.
Because I have the impression that it can be done more efficiently and at less expense. I don't want to pay for what I don't need - IMHO, the guiding principle of C++.
Well then you can't have a unified text processing facility for all languages, which is the point of Unicode.
What algorithms do you mean and why would they need duplication?
The algorithms defined by the Unicode Standard, like the collation one, along with the many tables it requires to do its job.
Those algorithms and tables are defined for Unicode, and it can be more or less difficult to adapt them to another character set.
As I noted to Jeremy, I think all locale-specific stuff should be encapsulated in locales. Therefore the processing algorithms are left independent from encoding specifics.
I expect the complexity of iterating over a string or of operator[] to rise significantly once we assume that the underlying string has variable-length characters.
Iterating over the "true" characters would be a ridiculously inefficient operation -- especially if wanting to keep the guarantee that modifying the value pointed by an iterator doesn't invalidate the others --, and should be clearly avoided. I don't think there is much code in high-level programming languages that iterate over the strings.
Text parsing is one of such examples. And it may be extremely performance critical.
What encoding translation are you talking about? Let's assume my app works with a narrow text file stream. If the stream is using Unicode internally, it has to translate between the file encoding and its internal encoding every time I output or input something. I don't think that's the way it should be. I'd rather have the opportunity to choose the encoding I want to work with and have it used through the whole formatting/streaming/IO tool chain with no extra overhead. That doesn't mean, though, that I wouldn't want some day to perform encoding translations with the same tools.
The stream shouldn't be using any text representation or facility, but only be a convenience to write stuff in an agnostic way. Of course, the text processing layer which IMO should be quite separate will probably work with Unicode, but you don't have to use it.
You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces.
I'm not sure about the "most" word in context of "require". I'd rather say "most allow Unicode". But that does not mean that all strings in C++ should be in Unicode and I should always work in it. I just want to have a choice, after all. Additionally, there is plenty of already written code that does not use Unicode. We can't just throw it away.

Andrey Semashev wrote:
Text parsing is one of such examples. And it may be extremely performance critical.
Text parsing being quite low-level, it should probably use lower-level access (iterating over code points or code units, for example). Extensive parsing should probably access lower-level views of the string, like code points or code units, and be careful depending on what it does. Various Unicode-related tools (text boundary searching, etc.) would be needed to assist the parser in this task. Building a fully Unicode-aware regex engine is probably difficult. See the guidelines here: http://unicode.org/unicode/reports/tr18/ Boost.Regex -- which makes use of ICU for Unicode support -- for example, does not even fully comply with level 1.
You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces.
I'm not sure about the "most" word in context of "require". I'd rather say "most allow Unicode".
I know of several libraries or APIs that only work with Unicode. It's simply easier for them if there is only one format that represents all text. GTK+ is one example.
But that does not mean that all strings in C++ should be in Unicode and I should always work in it. I just want to have a choice, after all.
Additionally, there is plenty of already written code that does not use Unicode. We can't just throw it away.
Compatibility with legacy code will always be an issue. Isn't a runtime conversion simply acceptable?

Mathias Gaunard wrote:
Andrey Semashev wrote:
Text parsing is one of such examples. And it may be extremely performance critical.
Text parsing being quite low-level, they should probably use lower-level accesses (iterating over code points or code units for example).
Extensive parsing should probably access lower-level views of the string, like code points or code units, and eventually be careful depending on what they do.
I agree that parsing is rather a low-level task. But I see no benefit from being forced to parse Unicode code points instead of fixed-length chars in a given encoding.
Various Unicode related tools (text boundaries searching etc.) would be needed to assist the parser in this task.
That would be nice.
Building a fully Unicode-aware regex engine is probably difficult. See the guidelines here: http://unicode.org/unicode/reports/tr18/ Boost.Regex -- which makes use of ICU for Unicode support -- for example, does not even fully comply to level 1.
Interesting. I wonder what level of support will be proposed to the Standardization Committee.
You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces. I'm not sure about the "most" word in context of "require". I'd rather say "most allow Unicode".
I know of several libraries or APIs that only work with Unicode. It's simply easier for them if there is only one format that represents all text. GTK+ is one example.
Well, that doesn't mean I was wrong in my statement. :)
But that does not mean that all strings in C++ should be in Unicode and I should always work in it. I just want to have a choice, after all.
Additionally, there is plenty of already written code that does not use Unicode. We can't just throw it away.
Compatibility with legacy code will always be an issue. Isn't a runtime conversion simply acceptable?
I don't think so - we're back to the performance issue. I just don't understand why there's such a strong push to drop fixed-width char encodings in favor of exclusive Unicode support. Why can't we have both? Is it because the text processing algorithms would have to be duplicated? I think not, if the implementation is well designed. Is it because of CRT size growth due to encoding-specific data? Possibly, but not necessarily. In fact, if application size is the primary concern, the whole of Unicode support is a good candidate to cut away.

While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary: a binary format may itself be encoded as bytes (of varying endianness), or in Base64 for email attachments (RFC 2045), or Base32 for URLs or form post data (RFC 3548). When encoding in a plain-text format (after encoding into a narrow character set), there might still be escaping depending on the container: C, JS, XML attributes, elements and CDATA sections, and SQL (per database) all have different escaping rules. This doesn't even mention sillier issues like newline representation.
Buffering is also an interesting problem, because in some formats buffering events (like flush, overflow or EOF) have to produce output to indicate an explicit end of stream, a minimum remaining distance, or differences in distance (like how many bytes remain to the next chunk in a stream).
None of these transformations are hard to write, but they are written over and over because standard streaming operators (be they Java, C++, Perl or printf) provide no straightforward way to inject the transformations. The cost tends to be that serializing an object is written several times over or, worse, gets tied up in a grander object persistence framework.
From my limited research, the most complete description of a stream encoding is hidden in the description of HTTP 1.1 entities - this defines a 3-layer model for streaming:
Buffering events: how to determine how large the stream is (TE, Content-Length, Trailer headers).
Transformations: preprocessing required before the stream can be interpreted (Content-Encoding: gzip, deflate; could include byte encodings).
Type: what class should further interpret the content and, for text entities, the character set encoding (Content-Type).
This is not a complete model, largely because it ignores the issue of interpreting the content, but it seems like a good place to start since it's an intro to the problems of portably streaming data.
John
On 6/17/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
- Binary transport layer issue:
Make the "binary transport layer" the "byte transport layer" to make it clear that it is for bytes.
Platforms with unusual features, like 9-bit bytes or inability to handle types less than 32-bits in size can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.

"John Hayes" <john.martin.hayes@gmail.com> writes:
While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:
It seems fairly logical to me to have the following organization:
- Streams of arbitrary POD types. For instance, you might have uint8_t streams, uint16_t streams, etc.
- A byte stream would be a uint8_t stream.
- A text stream holding utf-16 encoded text would be a uint16_t stream, while a text stream holding utf-8 encoded text would be a uint8_t stream. A text stream holding iso-8859-1 encoded text would also be a uint8_t stream.
There is the issue of whether it is useful to have a special text stream type that is tagged (either at compile-time or at run-time) with the encoding in which the data either going in or out of it are supposed to be. How exactly this tagging should be done, and to what extent it would be useful, remains to be explored.
It seems that your various examples of filters/encoding, like BASE-64, URL encoding, CDATA escaping, and C++ string escaping, might well fit into the framework I described in the previous paragraphs. Many of these filters can be viewed as encoding a byte stream as text.
Let me know your thoughts, though. -- Jeremy Maitin-Shepard
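As an illustration only -- this is an added sketch, not Jeremy's actual interface -- the organization above might look roughly like this, using boost's integer typedefs for a pre-C++11 flavour; all names here are hypothetical:

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // A stream parameterized on the unit type it transports.
    template <typename Unit>
    struct input_stream
    {
        virtual ~input_stream() {}
        // Reads up to n units into buffer; returns the number of units read.
        virtual std::size_t read(Unit* buffer, std::size_t n) = 0;
    };

    typedef input_stream<boost::uint8_t>  byte_stream;    // UTF-8, Latin-1, raw bytes
    typedef input_stream<boost::uint16_t> uint16_stream;  // e.g. UTF-16 code units

    // One possible run-time tagging of a byte-based text stream with its encoding.
    enum text_encoding { enc_utf8, enc_iso_8859_1 };

    struct text_stream
    {
        byte_stream*  units;     // simplification: byte-based encodings only
        text_encoding encoding;
    };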

I don't think any of the transformations are accurately represented as encoding a byte stream as text. I'll quickly address base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpreted as character data in most scenarios (base-32 also tolerates case conversion - so it's suitable for HTTP headers).
For the other escaping, these represent a text representation of text data. Which is dumb sounding, but look at it from the perspective of encoding a 32-bit int - when that gets streamed, there are two choices:
1. Submit 32 bits for encoding - the stream requires enough parameters to figure out the endianness, and then how the bits are represented.
2. Convert it into text and submit the text - the string requires the parameters for this conversion (base, leading 0s - printf parameters), then downstream the text encoding has its own parameters.
Escaping is just another way of saying "what is the text representation of this text". There's more that could be included; however, some really basic operators like escape operators and stream termination operators would go a long way towards making it easy to interpret a lot of file formats. These operators are not on binary data, but on text data (something that can interpret the grapheme clusters directly). Otherwise, escaping will be buggy the first time it's applied to characters outside of the bottom 48.
At some point, the line between streaming, serialization and text parsing gets blurred (the smarter the text parsing, the deeper we go into Unicode issues) - but the interesting question to ask is what support would make these operations implementable without rebuffering (to perform translations that aren't immediately supported by the stream library).
The complete stack needed to support all of these requirements has a bunch of layers, but depending on the application most of them are optional:
1. Text encoding - how are numbers formatted (are numbers going direct to primitive encoding), how are strings escaped and delimited in a text stream. If writing to a string buffer, then the stream may terminate here. Text encoding may alter the character set - for example, punycode changes Unicode into ASCII (which simplifies the string encode process).
2. String encoding - how do strings get reduced to a stream of primitives (if the text format matches the encoding format then there's nothing to do - true for SBCS, MBCS). How is a variable-length string delimited in binary (length prefixes, null termination, maximum size, padding)?
3. Primitive encoding - endianness, did we really mean IEEE 754 floats, are we sending whole bytes or only a subset of bits (an int is expedient for a memory image, but there are only 8 significant bits), are there alignment issues (a file format that was originally a memory image may word-align records or fields)?
4. Bitstream encoding - if the output is octets then this layer is optional, otherwise chop up bits into Base64 or less.
Tagging the format can most likely be ignored at the stream level. Most file formats will either externally or internally specify their encoding formats. The most helpful thing to do is provide factory functions that convert from existing character set descriptors (http://www.iana.org/assignments/character-sets) into an actual operator and allow changing the operators at a specific stream position. This will help most situations where the character encoding is specified in a header.
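As a small, added illustration of layer 3 (primitive encoding) in the list above -- not part of the original post -- writing a 32-bit integer as little-endian bytes independently of host endianness could look like this:

    #include <vector>
    #include <boost/cstdint.hpp>

    // Append a 32-bit value to a byte buffer in little-endian order.
    void write_uint32_le(std::vector<boost::uint8_t>& out, boost::uint32_t value)
    {
        out.push_back(static_cast<boost::uint8_t>( value        & 0xFF));
        out.push_back(static_cast<boost::uint8_t>((value >> 8)  & 0xFF));
        out.push_back(static_cast<boost::uint8_t>((value >> 16) & 0xFF));
        out.push_back(static_cast<boost::uint8_t>((value >> 24) & 0xFF));
    }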
John On 6/22/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
"John Hayes" <john.martin.hayes@gmail.com> writes:
While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:
It seems fairly logical to me to have the following organization:
- Streams of arbitrary POD types
For instance, you might have uint8_t streams, uint16_t streams, etc.
- A byte stream would be a uint8_t stream.
- A text stream holding utf-16 encoded text would be a uint16_t stream, while a text stream holding utf-8 encoded text would be a uint8_t stream. A text stream holding iso-8859-1 encoded text would also be a uint8_t stream.
There is the issue of whether it is useful to have a special text stream type that is tagged (either at compile-time or at run-time) with the encoding in which the data either going in or out of it are supposed to be. How exactly this tagging should be done, and to what extent it would be useful, remains to be explored.
It seems that your various examples of filters/encoding, like BASE-64, URL encoding, CDATA escaping, and C++ string escaping, might well fit into the framework I described in the previous paragraphs. Many of these filters can be viewed as encoding a byte stream as text.
Let me know your thoughts, though.

----- Original Message ----- From: "John Hayes" <john.martin.hayes@gmail.com> To: <boost@lists.boost.org> Sent: Tuesday, June 26, 2007 9:38 AM Subject: Re: [boost] [rfc] I/O Library Design
I don't think any of the transformation are accurately represented as encoding a byte stream as text. I'll quickly address base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpretted as character data in most scenarios (base-32 also tolerates case conversion - so it's suitable for HTTP headers).
Sorry, not responding to your mail specifically but to all of the recent postings on this topic. This material has been excellent for me. It has expanded the problem but strangely brought some clarity. Maybe ;-)

There seems to be, as always, a suite of requirements hiding under a single title. This started out as "I/O library" and is now addressing the proper handling of mail attachments. Wonderful stuff, but if these different requirements aren't acknowledged it will be easy to bounce from one to the other.

What about this:

1. There is an interaction with a device. This involves opening and closing with zero or more block I/O operations in between.
2. A byte-stream facility may be associated with one of these device sessions.
3. The byte-streams may be interpreted an infinite number of ways. Some example interpretations (that happen to match common uses) appear below. This set of uses in no way precludes other useful interpretations from emerging.
   a) A series of variable-length, free-format messages (i.e. lines)
   b) A single item of structured data (XML)
   c) A series of variable-length messages with embedded data (i.e. lines carrying instances of formal data notations)
   d) .... (representing the infinity of other uses)
4. Very generally, these use cases derive different structure from the stream. Whether this is as simple as a series of messages separated by newlines or the extensive structure and info in a SOAP message, the activity is abstractly equivalent.
5. At each level of structure the interpreter is free to enforce further rules, e.g. the two-level stream of lines might require that the content of each line is in UTF-8, while the arbitrary-level stream (XML) requires that all document content is in UCS-2 and everything else is ASCII (possibly confusing example). A need for the latter example might arise where the interpretation input is only ever viewed by developers and support staff (all speaking English) while the stored content is being viewed and modified at many sites around Europe.

There is a lot more that I can inject into this thread but before I make a tit of myself, is this making sense to anyone else?

Scott.

----- Original Message ----- From: "Scott Woods" <scottw@qbik.com> To: <boost@lists.boost.org> Sent: Tuesday, June 26, 2007 2:57 PM Subject: Re: [boost] [rfc] I/O Library Design [snip]
I don't think any of the transformation are accurately represented as encoding a byte stream as text. I'll quickly address base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpretted as character data in most scenarios (base-32 also tolerates case conversion - so it's suitable for HTTP headers).
Sorry but not responding to your mail specifically but all of the recent postings on this topic.
[snip]
A need for the latter example might arise where the interpretation input is only ever viewed by developers and support staff (all speaking English) while the stored content is being viewed and modified at many sites around Europe.
There is a lot more that I can inject into this thread but before I make a tit of myself, is this making sense to anyone else?
I'll take that as a no :-) Maybe my concepts could do with a bit more work. The detail on unicoding has been great. Thanks. Scott.

"Scott Woods" <scottw@qbik.com> writes: [snip]
I'll take that as a no :-) Maybe my concepts could do with a bit more work.
The detail on unicoding has been great. Thanks.
Perhaps you can elaborate on how your ideas about a conceptual framework for interpreting a byte stream as more structured data should affect the interface/design of the I/O library. -- Jeremy Maitin-Shepard

----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Thursday, June 28, 2007 8:21 AM Subject: Re: [boost] [rfc] I/O Library Design
"Scott Woods" <scottw@qbik.com> writes:
[snip]
I'll take that as a no :-) Maybe my concepts could do with a bit more work.
The detail on unicoding has been great. Thanks.
Perhaps you can elaborate on how your ideas about a conceptual framework for interpreting a byte stream as more structured data should affect the interface/design of the I/O library.
Yes. Apologies for loss of context :-)

The short version:

1. Drop "Compression Filter and Misc. Filter" from "Binary Transport Layer"
2. Rename "Buffer Filter" as just "Buffering"
3. Bundle "Endianness" and "Representation" and call it "Network/Host Representation"
4. Pull the resulting "Network/Host Representation" out of the presented layering
5. Define other representations such as "ASCII Line", "UTF-8 XML" and "Command Line User"
6. Allow for representations to be composable, e.g. Command Line User<input = keys to basic C++ types, output = basic types to UTF-8>

Usage might look like:

    file_device d;
    d.open( "file name" );
    command_line_user<keys_to_basic,basic_to_UTF_8> cli;
    cli.attach( d );
    while( cli.parse() )
    {
        process( cli.interpreted_item );
    }

and:

    TCP_device d;
    d.open( socket_descriptor );
    network_host nh;
    nh.attach( d );
    while( nh.parse() )
    {
        process( nh.interpreted_item );
    }

7. Issue of sync and async is ancillary, i.e. I don't believe anything in the above implies exclusive use in a sync or async environment. Justification for this claim is based on the "composable representation" objects holding all the parsing/formatting state internally.

Regards,
Scott

"Scott Woods" <scottw@qbik.com> writes:
----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Thursday, June 28, 2007 8:21 AM Subject: Re: [boost] [rfc] I/O Library Design
"Scott Woods" <scottw@qbik.com> writes:
[snip]
I'll take that as a no :-) Maybe my concepts could do with a bit more work.
The detail on unicoding has been great. Thanks.
Perhaps you can elaborate on how your ideas about a conceptual framework for interpreting a byte stream as more structured data should affect the interface/design of the I/O library.
Yes. Apologies for loss of context :-)
The short version; 1. Drop "Compression Filter and Misc. Filter" from "Binary Transport Layer"
I'm not sure what you mean by this. At what level would compression filters operate?
2. Rename "Buffer Filter" as just "Buffering" 3. Bundle "Endianness" and "Representation" and call it "Network/Host Representation"
I realize that some existing interfaces/specifications may refer to endian conversion as network to/from host byte-order conversion, but endian conversion can of course be used for things unrelated to network protocols.
4. Pull the resulting "Network/Host Representation" out of the presented layering
What does it mean to do this? From your comments, it seems like you might be saying that the stream abstraction is not particularly important, and that some other abstractions should be of primary interest. Perhaps you can explain the alternate abstractions, though, if this is the case.
5. Define other representations such as "ASCII Line", "UTF-8 XML" and "Command Line User"
I don't quite understand what is meant by "command line user". As far as XML, it seems that aside from providing basic encoding conversion facilities, the I/O library need not know anything about XML.
6. Allow for representations to be composable, e.g, Command Line User<input = keys to basic C++ types,output = basic types to UTF 8>
Usage might look like;
    file_device d;
    d.open( "file name" );
    command_line_user<keys_to_basic,basic_to_UTF_8> cli;
    cli.attach( d );
    while( cli.parse() )
    {
        process( cli.interpreted_item );
    }
and;
    TCP_device d;
    d.open( socket_descriptor );
    network_host nh;
    nh.attach( d );
    while( nh.parse() )
    {
        process( nh.interpreted_item );
    }
It isn't clear to me exactly what these code-snippets might be intended to do. Perhaps you can explain.
7. Issue of sync and async is ancillary, i.e. I don't believe anything in the above implies exclusive use in a sync or async environment. Justification for this claim is based on the "composable representation" objects holding all the parsing/formatting state internally.
It still seems that to support asynchronous operations, support would be needed at every level of the library, but I am thinking primarily about a stream abstraction. Perhaps you can elaborate on this "composable representation" abstraction idea, and how it might make supporting both synchronous and asynchronous operations simpler. -- Jeremy Maitin-Shepard

Scott Woods wrote:
----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu>
Perhaps you can elaborate on how your ideas about a conceptual framework for interpreting a byte stream as more structured data should affect the interface/design of the I/O library.
Yes. Apologies for loss of context :-)
The short version:
1. Drop "Compression Filter and Misc. Filter" from "Binary Transport Layer"
2. Rename "Buffer Filter" as just "Buffering"
3. Bundle "Endianness" and "Representation" and call it "Network/Host Representation"
4. Pull the resulting "Network/Host Representation" out of the presented layering
5. Define other representations such as "ASCII Line", "UTF-8 XML" and "Command Line User"
6. Allow for representations to be composable, e.g. Command Line User<input = keys to basic C++ types, output = basic types to UTF-8>
Although I think your ideas for an interpretation framework are interesting, I think you're applying them at the wrong level. For all its layering, my library concept is still intended (except for the formatting) as a low-level stream interface. Your framework might build on top of it, perhaps even modifying the chains as it goes along. However, I don't think mutilating the structure and generality of the interface for the sake of such an interpretation scheme is justified. Sebastian Redl

On 7/1/07, Sebastian Redl <sebastian.redl@getdesigned.at> wrote:
Scott Woods wrote:
----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu>
[snip]
The short version:
1. Drop "Compression Filter and Misc. Filter" from "Binary Transport Layer"
2. Rename "Buffer Filter" as just "Buffering"
3. Bundle "Endianness" and "Representation" and call it "Network/Host Representation"
4. Pull the resulting "Network/Host Representation" out of the presented layering
5. Define other representations such as "ASCII Line", "UTF-8 XML" and "Command Line User"
6. Allow for representations to be composable, e.g. Command Line User<input = keys to basic C++ types, output = basic types to UTF-8>
Although I think your ideas for an interpretation framework are interesting, I think you're applying them at the wrong level. For all its layering, my library concept is still intended (except for the formatting) as a low-level stream interface. Your framework might build on top of it, perhaps even modifying the chains as it goes along. However, I don't think mutilating the structure and generality of the interface for the sake of such an interpretation scheme is justified.
[snip] Hi Sebastian,

I mostly understand your comments and can also understand your direction. While there is major overlap between my directions (i.e. clean and efficient transfer of application data to system files and through networks) and yours, I am not sure what I can contribute to this thread. I will continue to listen in. I hope to have something to try out on boosters in the next few weeks. It will be good to get your feedback :-)

On one thing I do disagree. You are committed to a "low-level stream interface" and strongly distinguish this from an "interpretation scheme". Until proven otherwise I will continue to champion the latter. While there must be some equivalent of getc+putc (or cin >> c), this by itself is almost useless, i.e. there is always "interpretation", whether that is reconstruction of unsigned long integers from a network ordering or a wstring from a UTF-8 stream. I'm pretty confident that you would say "of course", but what I do get confused about is the notion that the scope of your work somehow does not include "interpretation".

Cheers,
Scott

Hi John,

I'm responding to both your mails in a single reply (and mixing your quotes), because they are closely interrelated.

John Hayes wrote:

While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:

And not only in web software. This is exactly what the filters and devices are supposed to support. However, with some encodings, the line between filters and devices is a bit blurred. A binary format may itself be encoded as bytes (of varying endianness), or in Base64 for email attachments (RFC 2045) or Base32 for URLs or form post data (RFC 3548).

I don't think any of the transformations are accurately represented as encoding a byte stream as text. I'll quickly address base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpreted as character data in most scenarios (base-32 also tolerates case conversion - so it's suitable for HTTP headers).

From my understanding of Base-64, I'd say I disagree. Base-64 is not a bitstream representation that tolerates being interpreted as characters. This would mean that the bit pattern for the Base-64 version of a given blob is defined. That's not the case, though. The Base-64 transformation is defined in terms of abstract characters: the bit-hextet 000000 corresponds to A, 000001 to B, and so on. The actual representation of these characters does not matter - cannot matter! The encoding was designed to survive re-encoding of the resulting text. Therefore, writing a Base64Device that wraps a character stream and provides a binary stream seems to be a very appropriate way of implementing Base-64 to me.
When encoding in a plain-text format (after encoding into a narrow character set), there might still be escaping depending on the container. C, JS, XML attributes, elements and CDATAs, SQL (by database) all have different escaping rules. This fails to mention sillier issues like newline representation.
For the other escaping, these represent a text-representation of text data.
This is what text filters are for. While non-trivial, it would certainly be possible to implement stateful filters that can escape string literals. Or you can implement simpler filters that do the encoding but are context-insensitive. (Then you're responsible for inserting and removing the filters from your chain as context requires.)
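As a rough illustration of such a context-insensitive filter, an XML-escaping text filter might look like this; the filter and sink interfaces shown are invented for the sketch, not part of the proposed design.

    #include <string>

    // Escapes the XML special characters in everything written through it; it is
    // context-insensitive, i.e. it has no idea whether it currently sits inside
    // an attribute, an element or a CDATA section.
    struct xml_escape_filter
    {
        template <typename Sink>
        void write(Sink& sink, const std::string& text)
        {
            for (std::string::size_type i = 0; i < text.size(); ++i)
            {
                switch (text[i])
                {
                case '&':  sink.write("&amp;");  break;
                case '<':  sink.write("&lt;");   break;
                case '>':  sink.write("&gt;");   break;
                case '"':  sink.write("&quot;"); break;
                default:   sink.write(std::string(1, text[i])); break;
                }
            }
        }
    };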
but the interesting question to ask is what support would make these operations implementable without rebuffering (to perform translations that aren't immediately supported by the stream library).
I think in a system that works by combining components (and I think everything else would be too inflexible) you cannot implement functionality that changes the size of the data without rebuffering, at least with a small buffer. Any quoting means that for an incoming character, two characters might get forwarded. Or one for two, going in the other direction. A Base-64 translator must always buffer some data, because it needs groups of 3 bytes before it can do encoding, groups of 4 characters before it can do decoding. This means some buffering.

Buffering is also an interesting problem because in some formats, buffering events (like flush overflow or EOF) have streaming output to indicate an explicit end of stream, minimum remaining distance or differences in distance (like how many bytes to the next chunk in a stream).

That's an excellent and important observation. But I think my current design supports this.

From my limited research, the most complete description of a stream encoding is hidden in the description of HTTP 1.1 entities - this defines a 3-layer model for streaming: Buffering events: How to determine how large the stream is (TE, Content-Length, Trailer headers)

I don't think this should be part of the stream stack. Determining the size looks like an application concern to me. The stream simply supplies the data the application requests. (Or tries to.)

Transformations: Preprocessing required before the stream can be interpreted (Content-Encoding: gzip, deflate, could include byte encodings)

This would be the domain of filters. However, determining the required transformations is still an application issue. Like above, I don't think the stream stack should build itself based on data it parses. (However, it would be an interesting domain for a support library.)

Type: What class should further interpret the content, and for text entities, the character set encoding (Content-Type).

Same thing. Let the application find out the encoding and what to do with the data.

1. Text encoding - how are numbers formatted (are numbers going direct to primitive encoding), how are strings escaped and delimited in a text stream. If writing to a string buffer, then the stream may terminate here. Text encoding may alter the character set - for example, punycode changes Unicode into ASCII (which simplifies the string encode process).

All except the first could be accomplished using text filters. The first seems to be a very domain-specific question that is better handled by the decision of which interface - the binary or the text - to use in the first place.

2. String encoding - how do strings get reduced to a stream of primitives (if the text format matches the encoding format then there's nothing to do - true for SBCS, MBCS).

This would be the character conversion device.

How is a variable length string delimited in binary (length prefixes, null termination, maximum size, padding)?

This looks like a question of serialization to me, and thus outside the domain of the library.

3. Primitive encoding - Endianness, did we really mean IEEE 754 floats,

That's the binary formatting.

are we sending whole bytes or only a subset of bits (an int is expedient for a memory image, but there may be only 8 significant bits),

Interesting idea here. May be binary formatting, may be serialization, may be a simple matter of casting the data before feeding it to the stream. The main problem I see in integrating this into the stream is that it is highly context-dependent. Which int is there because the range is needed, and which is only there because the hardware processes it faster? There could be both kinds within a single structure, which is why I'm inclined to leave this to the application or the serialization.

are there alignment issues (a file format that was originally a memory image may word-align records or fields).

Another serialization issue.

4. Bitstream encoding - if the output is octets then this layer is optional, otherwise chop up bits into Base64 or less.

Binary filters can do this, although as I argued above, I don't think Base-64 is a good example of such a use.

Tagging the format can most likely be ignored at the stream level. Most file formats will either externally or internally specify their encoding formats.

I don't think it's even possible, with reasonable effort, to support this at the stream level. Tagging is very dependent on the data format.
The most helpful thing to do is provide factory functions that convert from existing character set descriptors ( http://www.iana.org/assignments/character-sets) into an actual operator and allow changing the operators at a specific stream position. This will help most situations where character encoding is specified in a header.
Yes, I agree. The semantics of changing the stack in the middle of the stream must be defined. Sebastian Redl

Jeremy Maitin-Shepard wrote:
- One idea from [Boost.IOStreams] to consider is the direct/indirect device distinction.
I never noticed this distinction before. It seems useful, but there are issues not unlike the AsyncIO issues. Direct devices provide a different interface. A programmer can take advantage of this interface for some purposes, but for most, I fear, the advantages would be lost. Consider: - A direct device cannot be wrapped by filters that do dynamic data rewriting (such as (de)compression). The random access aspect would be lost. - A direct device cannot participate in the larger stack without propagating the direct access model throughout the stack. (And this stops at the text level anyway, because the character recoder does dynamic data rewriting.) Propagating another interface means a lot of additional implementation effort and complexity.
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or inability to handle types less than 32-bits in size can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O. One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
- Seeking:
Maybe make multiple mark/reset use the same interface as seeking, for simplicity. Just define that a seeking device has the additional restriction that the mark type is an offset, and the argument to seek need not be the result of a call to tell.
Another issue is whether to standardize the return type from tell, like std::ios_base::streampos in the C++ iostreams library.
These are incompatible requirements, and the reason I want to keep the interfaces separate.

Standardizing the tell return type is a good idea and necessary for efficient work of type erasure and simple use of arbitrary values in seek(). The type must be transparent. The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done.

For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support. You want to parse data coming from the socket using this parser. Typically, this will mean that you have to receive all data (which may be tricky, if parsing is required to tell you how much data to expect) and then parse the buffer. With the I/O stack, you could instead write a filter that implements mark/reset support on top of an arbitrary device.

Let's consider the simplest case: a filter that implements single mark/reset. The filter contains a simple extensible buffer (such as a std::vector, or perhaps a linked list of fixed-size buffers). When mark() is first called, it starts buffering all data that is read through it, in addition to returning it. When reset() is called, it simply starts feeding the buffered data to read requests until it runs out, at which point it goes back to calling the underlying device. When mark() is called again, it can discard all data buffered so far. (E.g. it might drop from the linked list all completely filled buffers and free their memory.)

A more complex case is multiple mark/reset. One variant is a filter that just starts buffering on the first mark() and returns the offset into the buffer where the data starts (i.e. 0). On every subsequent mark(), it returns a higher index. A reset() is passed the offset, so it starts reading from this offset. The obvious problem is that, from the first mark() on, all data has to be buffered. In situations where memory is scarce, this may not be desirable. If the filter knew how many marks still exist, it could discard buffered data that is no longer needed. If the mark type is opaque, it can be a smart-pointer-like object with a reference count. For every buffer chunk, there could be one reference count. If the count for a chunk drops to zero, the filter knows the data in that chunk is no longer needed and can free the memory.

I want to keep this flexibility. I want mark/reset to stay separate.
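A minimal sketch of the single mark/reset filter just described; the Device concept with a read(char*, size_t) member is assumed for the sketch and is not the interface from the design document.

    #include <cstddef>
    #include <vector>

    template <typename Device>
    class single_mark_filter
    {
    public:
        explicit single_mark_filter(Device& dev) : dev_(dev), marked_(false), pos_(0) {}

        void mark()   // discard data that can no longer be reached, start buffering
        {
            buffer_.erase(buffer_.begin(), buffer_.begin() + pos_);
            pos_ = 0;
            marked_ = true;
        }

        void reset() { pos_ = 0; }   // replay the buffer on subsequent reads

        std::size_t read(char* out, std::size_t n)
        {
            std::size_t copied = 0;
            while (copied < n && pos_ < buffer_.size())   // serve buffered data first
                out[copied++] = buffer_[pos_++];
            if (copied < n)                               // then go back to the device
            {
                std::size_t got = dev_.read(out + copied, n - copied);
                if (marked_)                              // remember it for a later reset()
                    buffer_.insert(buffer_.end(), out + copied, out + copied + got);
                pos_ = buffer_.size();
                copied += got;
            }
            return copied;
        }

    private:
        Device& dev_;
        std::vector<char> buffer_;
        bool marked_;
        std::size_t pos_;
    };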
- Binary formatting (perhaps the name data format would be better?):
Sounds like a good name.
I think it is important to provide a way to format {uint,int}{8,16,32,64}_t as either little or big endian two's complement (and possibly also one's complement). Yes, that's pretty much the idea behind this layer. It might be useful to look at the not-yet-official boost endian library in the vault.
I will.
It should probably also be possible to determine using the library at compile time what the native format is. To what end? If the native format is one of the special predefined ones, it will hopefully be optimized in the platform-aware special implementation (well, I can dream) anyway. - Header vs Precompiled:
I think as much should be separately compiled as possible, but I also think that type erasure should not be used in any case where it will significantly compromise performance.
I'm thinking of a system where components are templates on the component they wrap, so as to allow direct calls upwards. I'm thinking of using the common separately compiled template specialization extension of compilers to provide pre-compiled versions of the standard components instantiated with the erasure components. This is very similar to how Spirit works, except that it doesn't have pre-compiled stuff. In Spirit, rule is the erasure type, but the various parsers can be directly linked, too. Then, if the performance is needed, the programmer can hand-craft his chain so that no virtual calls are made, at the cost of compiling his own copy of the components. I'm afraid I don't see a better way of doing this. I'm wide open to suggestions.
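To illustrate the intended combination of statically bound components and a pre-compiled erasure type, here is a very rough sketch; all names are invented for the example.

    #include <cstddef>

    // Erasure interface: what a separately compiled library would expose.
    struct stream_base
    {
        virtual ~stream_base() {}
        virtual std::size_t read(char* out, std::size_t n) = 0;
    };

    // A component templated on whatever it wraps: calls upwards are direct,
    // non-virtual calls and can be inlined.
    template <typename Next>
    class pass_through_filter
    {
    public:
        explicit pass_through_filter(Next& next) : next_(next) {}
        std::size_t read(char* out, std::size_t n) { return next_.read(out, n); }
    private:
        Next& next_;
    };

    // Erasure wrapper: any concrete chain can be hidden behind stream_base, at
    // the cost of one virtual call per operation.
    template <typename Chain>
    class erased_stream : public stream_base
    {
    public:
        explicit erased_stream(Chain chain) : chain_(chain) {}
        std::size_t read(char* out, std::size_t n) { return chain_.read(out, n); }
    private:
        Chain chain_;
    };

A pre-compiled library could then ship explicit instantiations of the standard components wrapped around stream_base, while performance-critical code instantiates its concrete chain directly, much as Spirit lets you either use rule or link parsers statically.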
- The "byte" stream and the character stream, while conceptually different, should probably both be considered just "streams" of particular POD types. I have explained in a different post why I don't think this is a good idea. - Text transport:
I don't think this layer should be restricted to Unicode encodings.
I have no plans of doing so. I just consider all encodings as encodings of the universal character set. An encoding is defined by how it maps the UCS code points onto groups of octets, words, or other primitives.

For full generality, the library should provide facilities for converting between any two of a large list of encodings.

No, I'll leave that to a character support library. (Parts of the library I will have to specify and implement to build the text converter on, but that can wait. The binary stuff comes first.)
- Text formatting:
For text formatting, I think it would be very useful to look at the IBM ICU library.

I have. Some interesting ideas there.

It may in fact make sense to leave text formatting as a separate library.

Each of my layers can be considered a separate library. That's the way I'll implement them, too. I just design and present them at once because I need to consider the requirements of the higher levels on the lower levels.
Sebastian Redl

Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
- One idea from [Boost.IOStreams] to consider is the direct/indirect device distinction.
I never noticed this distinction before. It seems useful, but there are issues not unlike the AsyncIO issues. Direct devices provide a different interface. A programmer can take advantage of this interface for some purposes, but for most, I fear, the advantages would be lost. Consider: - A direct device cannot be wrapped by filters that do dynamic data rewriting (such as (de)compression). The random access aspect would be lost. - A direct device cannot participate in the larger stack without propagating the direct access model throughout the stack. (And this stops at the text level anyway, because the character recoder does dynamic data rewriting.) Propagating another interface means a lot of additional implementation effort and complexity.
Okay. I'm inclined to agree with this.
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or inability to handle types less than 32-bits in size can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O.
I'm not sure what you mean by this exactly.
One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post. I suppose if you can find a clean way to support these unusual architectures, then all the better. It seems that it would be very hard to support e.g. utf-8 on a platform with 9-bit bytes or which cannot handle types smaller than 32-bits.
- Seeking:
Maybe make multiple mark/reset use the same interface as seeking, for simplicity. Just define that a seeking device has the additional restriction that the mark type is an offset, and the argument to seek need not be the result of a call to tell.
Another issue is whether to standardize the return type from tell, like std::ios_base::streampos in the C++ iostreams library.
These are incompatible requirements, and the reason I want to keep the interfaces separate. Standardizing the tell return type is a good idea and necessary for efficient work of type erasure and simple use of arbitrary values in seek(). The type must be transparent.
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer. [snip]
It should probably also be possible to determine using the library at compile time what the native format is. To what end? If the native format is one of the special predefined ones, it will hopefully be optimized in the platform-aware special implementation (well, I can dream) anyway.
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
- Header vs Precompiled:
I think as much should be separately compiled as possible, but I also think that type erasure should not be used in any case where it will significantly compromise performance.
I'm thinking of a system where components are templates on the component they wrap, so as to allow direct calls upwards. I'm thinking of using the common separately compiled template specialization extension of compilers to provide pre-compiled versions of the standard components instantiated with the erasure components. This is very similar to how Spirit works, except that it doesn't have pre-compiled stuff. In Spirit, rule is the erasure type, but the various parsers can be directly linked, too.
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Then, if the performance is needed, the programmer can hand-craft his chain so that no virtual calls are made, at the cost of compiling his own copy of the components.
I'm afraid I don't see a better way of doing this. I'm wide open to suggestions.
- The "byte" stream and the character stream, while conceptually different, should probably both be considered just "streams" of particular POD types. I have explained in a different post why I don't think this is a good idea. - Text transport:
I don't think this layer should be restricted to Unicode encodings.
I have no plans of doing so. I just consider all encodings as encodings of the universal character set. An encoding is defined by how it maps the UCS code points onto groups of octets, words, or other primitives.
Is it in fact the case that all character encodings that are useful to support encode only a subset of Unicode? (i.e. there does not exist a useful encoding that can represent a character that cannot be represented by Unicode?) In any case, though, it is not clear exactly why there is a need to think of an arbitrary character encoding in terms of Unicode, except when explicitly converting between that encoding and a Unicode encoding. [snip] -- Jeremy Maitin-Shepard

Jeremy Maitin-Shepard wrote:
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or inability to handle types less than 32-bits in size can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O.
I'm not sure what you mean by this exactly.
Platforms using 9-bit bytes have need for binary I/O, too. They might have need for doing it in their native 9-bit units. It would be a shame to deprive them of this possibility just because the text streams require octets. Especially if we already have a layer in place whose purpose is to convert between low-level data representations.
One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post.
Which other post is this?
I suppose if you can find a clean way to support these unusual architectures, then all the better.
It seems that it would be very hard to support e.g. utf-8 on a platform with 9-bit bytes or which cannot handle types smaller than 32-bits.
I think the binary conversion can do it. The system would work approximately like this:

1) Every platform defines its basic I/O byte. This would be 8 bits for most computers (including those where char is 32 bits large), 9 or some other number of bits for others. The I/O byte is the smallest unit that can be read from a stream.

2) Most platforms will additionally designate an octet type. Probably I will just use uint8_t for this. They will supply a Representation for the formatting layer that can convert a stream of I/O bytes to a stream of octets. (E.g. by truncating each byte.) If an octet stream is then needed (e.g. for creating a UTF-8 stream) this representation will be inserted.

3) Platforms that do not support octets at all (or simply do not have a primitive to spare for unambiguous overloads - they could use another 9-bit type and just ignore the additional byte; character streams, at least, do not perform arithmetic on their units so overflow is not an issue) do not have support for this. They're off bad. I think this case is rare enough to be ignored.
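Point 2 might look roughly like this; the io_byte typedef and the Representation interface are invented for the sketch (on a 9-bit platform, io_byte would be whatever type holds the native unit).

    #include <cstddef>
    #include <cstdint>

    typedef unsigned short io_byte;   // stands in for the platform's native I/O unit

    // Representation that narrows a run of native I/O bytes to octets by
    // truncating each unit to its low 8 bits, e.g. to feed a UTF-8 stream.
    struct octet_representation
    {
        static void to_octets(const io_byte* in, std::size_t n, std::uint8_t* out)
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = static_cast<std::uint8_t>(in[i] & 0xFF);
        }
    };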
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer.
Well, depends. Let's assume, for example, that the system will be implemented as C++09 templates with heavy use of concepts. The concepts for multi-mark/reset and tell/seek could look like this:

    typedef implementation-defined streamsize;
    enum start_position { begin, end, current };

    template <typename T>
    concept Seekable
    {
        streamsize tell(T);
        void seek(T, start_position, streamsize);
    }

    template <typename T>
    concept MultiMarkReset
    {
        typename mark_type;
        mark_type mark(T);
        void reset(T, mark_type);
    }

Now it's trivial to make every Seekable stream also support mark/reset by means of this simple concept map:

    template <Seekable T>
    concept_map MultiMarkReset<T>
    {
        typedef streamsize mark_type;
        mark_type mark(const T &t) { return tell(t); }
        void reset(T &t, mark_type m) { seek(t, begin, m); }
    }
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
Hmm ... makes sense. I'm not really happy, but it makes sense.
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Yes, but that's the ideal case. In practice, this means that the application would have to do its own buffering even if it really wants the data unit by unit. The programmer will not want to construct the complicated full type for this:

    newline_filter<encoding_device<utf_8,
        native_converter<gzip_filter<buffering_filter<file_device> > > > > & chain =
      open_file(filename, read).attach(buffer()).attach(gunzip())
                               .attach(decode(utf_8())).attach(newlines());

The programmer will want to simply write

    text_stream<utf_8> chain =
      open_file(filename, read).attach(buffer()).attach(gunzip())
                               .attach(decode(utf_8())).attach(newlines());

But text_stream does type erasure and thus has a virtual call for everything. If the user now proceeds to read single characters from the stream, that's one virtual call per character. And I don't think this can really be changed. It's better than a fully object-oriented design, where every read here would actually mean 3 or more virtual calls down the chain. (That's the case in Java's I/O system, for example.)
Is it in fact the case that all character encodings that are useful to support encode only a subset of Unicode? (i.e. there does not exist a useful encoding that can represent a character that cannot be represented by Unicode?)
I think it is. If it isn't, that's either a defect the Unicode consortium will want to correct by adding the characters to Unicode, or the encoding is for really unusual stuff, such as Klingon text or Elven Tengwar runes. They can be seen as mappings to the private regions of Unicode and are by nature not convertible to other encodings. One possible exception is characters that only exist in Unicode as grapheme clusters but may be directly represented in other encodings.
In any case, though, it is not clear exactly why there is a need to think of an arbitrary character encoding in terms of Unicode, except when explicitly converting between that encoding and a Unicode encoding.
It is convenient to have a unified concept of a character, independent of its encoding. The Unicode charset provides such a concept. Unicode is also convenient in that it adds classification rules and similar stuff. This decision is not really visible to user code anyway, only to encoding converters: it should be sufficient to provide a conversion from and to Unicode code points to enable a new encoding to be used in the framework. Sebastian Redl

Sebastian Redl <sebastian.redl@getdesigned.at> writes: [snip]
Platforms using 9-bit bytes have need for binary I/O, too. They might have need for doing it in their native 9-bit units. It would be a shame to deprive them of this possibility just because the text streams require octets. Especially if we already have a layer in place whose purpose is to convert between low-level data representations.
It seems that the primary interface for the data formatting layer should be in terms of fixed-size types like {u,}int{8,16,32,64}_t. It is more the job of a serialization library to support platform-dependent types like short,int,long, etc., which would be of use primarily for producing serialization output that will only be used as input to the exact same program. I suppose an alternative is for the read/write functions in the data formatting layer to always specify an explicit number of bits. For example, write_{u,}int<32> or read_{u,}int<32>. read_int<N> always returns intN_t, and it is a compile-time error if that type does not exist. write_int<N> casts its argument to intN_t, and thus avoids the issue of multiple names for the same type, like int/long on most 32-bit platforms/compilers. This interface supports architectures with a 36-bit word (e.g. write_int<36>), but since everything is made explicit, avoids any confusion that might otherwise result from such support. Floating point types are somewhat more difficult to handle, and I'm not sure what is the best approach. One possibility is to also specify the number of bits explicitly, and to assume the IEEE 754 format will be used as the external format. For example, write_float<32> or write_float<64> or perhaps write_ieee754<32>. It should just be a compile-time error if the compiler/platform doesn't provide a suitable type. [snip]
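A sketch of what that explicit-width interface could look like; exact_int and the little-endian external format are assumptions made for the example, not part of any existing proposal.

    #include <cstdint>
    #include <vector>

    template <int Bits> struct exact_int;   // no definition: unsupported widths fail to compile
    template <> struct exact_int<8>  { typedef std::int8_t  type; };
    template <> struct exact_int<16> { typedef std::int16_t type; };
    template <> struct exact_int<32> { typedef std::int32_t type; };
    template <> struct exact_int<64> { typedef std::int64_t type; };

    template <int Bits>
    void write_int(std::vector<unsigned char>& sink, typename exact_int<Bits>::type value)
    {
        // Little-endian external format, chosen arbitrarily for the sketch.
        for (int i = 0; i < Bits / 8; ++i)
            sink.push_back(static_cast<unsigned char>(
                (static_cast<std::uint64_t>(value) >> (8 * i)) & 0xFF));
    }

    // write_int<32>(out, x) casts x to int32_t, so the int/long aliasing issue goes
    // away; a platform with a native 36-bit type could add an exact_int<36>
    // specialization to enable write_int<36>.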
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post.
Which other post is this?
My comments there probably weren't very important anyway. I think it is worth considering, though, that given the rarity of non 8-bit byte platforms, it is probably not worth spending very much time in supporting them, and more importantly, it is not worth complicating the interface for 8-bit byte platforms in order to support them.
I suppose if you can find a clean way to support these unusual architectures, then all the better.
It seems that it would be very hard to support e.g. utf-8 on a platform with 9-bit bytes or which cannot handle types smaller than 32-bits.
I think the binary conversion can do it. The system would work approximately like this: 1) Every platform defines its basic I/O byte. This would be 8 bits for most computers (including those where char is 32 bits large), 9 or some other number of bits for others. The I/O byte is the smallest unit that can be read from a stream. 2) Most platforms will additionally designate an octet type. Probably I will just use uint8_t for this. They will supply a Representation for the formatting layer that can convert a stream of I/O bytes to a stream of octets. (E.g. by truncating each byte.) If an octet stream is then needed (e.g. for creating a UTF-8 stream) this representation will be inserted.
This padding/truncating would need to be done as an explicit way of encoding an octet stream as a nonet stream, and should probably not be done implicitly, unless this sort of conversion is always assumed on those platforms.
3) Platforms that do not support octets at all (or simply do not have a primitive to spare for unambiguous overloads - they could use another 9-bit type and just ignore the additional byte; character streams, at least, do not perform arithmetic on their units so overflow is not an issue) do not have support for this. They're off bad. I think this case is rare enough to be ignored.
Okay.
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer.
Well, depends. Let's assume, for example, that the system will be implemented as C++09 templates with heavy use of concepts.
I think it may not be a good idea to target this new I/O library to a language that does not yet exist, and which more importantly is not yet supported by any compiler, except perhaps Douglas Gregor's experimental ConceptGCC, which as the release notes state, is extremely slow, although the release notes also claim that performance can be improved. I suppose it may work fine to write the library (using the preprocessor) so that it can be compiled under existing compilers without concept support, and include a small amount of additional functionality/use more convenient syntax if concept support is available. I would be very, very wary of anything that would increase the compile-time for users of the library, though. [snip]
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
Hmm ... makes sense. I'm not really happy, but it makes sense.
What do you mean you're not happy? I think all that would really be needed would be a macro to indicate the endianness. Of course any code that depends on this would likely depend even more on 8-bit bytes, but that is another issue.
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Yes, but that's the ideal case. In practice, this means that the application would have to do its own buffering even if it really wants the data unit by unit.
Possibly this issue can be mitigated by exposing in the types only a buffer around a text stream, although I agree that there is no perfect solution.
The programmer will not want to construct the complicated full type for this.
    newline_filter<encoding_device<utf_8,
        native_converter<gzip_filter<buffering_filter<file_device> > > > > & chain =
      open_file(filename, read).attach(buffer()).attach(gunzip())
                               .attach(decode(utf_8())).attach(newlines());
The programmer will want to simply write
text_stream<utf_8> chain = open_file(filename, read).attach(buffer()).attach(gunzip()).attach(decode(utf_8())).attach(newlines());
I notice that these code examples suggest that all streams will be reference counted (and cheaply copied). Is that the intention? A potential drawback to that approach is that a buffer filter would be forced to allocate its buffer on the heap, when it otherwise might be able to use the stack. [snip]
It is convenient to have a unified concept of a character, independent of its encoding. The Unicode charset provides such a concept. Unicode is also convenient in that it adds classification rules and similar stuff. This decision is not really visible to user code anyway, only to encoding converters: it should be sufficient to provide a conversion from and to Unicode code points to enable a new encoding to be used in the framework.
I am basically content using only Unicode for text handling in my own programs, but I think it would be useful to see what others that care about efficiency for certain operations (and work with languages that are not represented very efficiently using UTF-8) think about this. -- Jeremy Maitin-Shepard

Sebastian Redl wrote:
Hi,
A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model.
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
The document can be found here: http://windmuehlgasse.getdesigned.at/newio/
Nice to hear someone is working on this. I really think that C++ needs a whole new IO system. My main complaints about iostreams are:

-> Bad performance: It's really hard to get good performance for a conforming iostream implementation. Implementors need to start hacking around with watermarking and other tricks. No virtual calls for each character insertion, please. Stringstream is a bad substitute for sprintf. So the new IO needs to be designed from the ground up as a high-performance IO system.

-> Big size: Even if iostreams are not used in the application (say we are using printf), the runtime and cout/cin initialization guarantees have some size overhead. That's why some embedded library vendors offer disabling iostreams from their libraries at compilation time.

-> Hard formatting: printf/scanf are much easier to use and they don't need several hundred function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).

Regards,
Ion
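For what it's worth, the kind of type-safe printf that variadic templates enable can be sketched in a few lines (C++0x syntax; purely illustrative, not a proposed interface):

    #include <iostream>
    #include <stdexcept>

    inline void safe_printf(std::ostream& os, const char* fmt)
    {
        os << fmt;   // no arguments left, print the rest of the format string
    }

    template <typename T, typename... Rest>
    void safe_printf(std::ostream& os, const char* fmt, const T& value, const Rest&... rest)
    {
        for (; *fmt; ++fmt)
        {
            if (*fmt == '%')
            {
                os << value;   // the type is known statically, so %d/%s mismatches are impossible
                return safe_printf(os, fmt + 1, rest...);
            }
            os << *fmt;
        }
        throw std::runtime_error("too many arguments for format string");
    }

    // safe_printf(std::cout, "x = %, y = %\n", 42, 3.14);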

Ion Gaztañaga wrote:
-> Hard formatting: printf/scanf are much easier to use and they don't need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
You can limit the need for locking calls pretty easily, even with operator<<. You just need to use a temporary object which unlocks the mutex when destroyed:

    class stream
    {
    public:
        void lock();
        void unlock();
    private:
        Mutex m_mutex;
    };

    class OStreamObj
    {
    public:
        explicit OStreamObj(stream& s) : m_stream(s) { m_stream.lock(); }
        OStreamObj(const OStreamObj& o); // something smart
        ~OStreamObj() { m_stream.unlock(); }
    private:
        stream& m_stream;
    };

    template< class T >
    inline OStreamObj operator<<(stream& s, const T& t)
    {
        OStreamObj oso(s);
        oso << t;
        return oso;
    }

    // all the overloads of << are on OStreamObj
    inline OStreamObj& operator<<(OStreamObj& oso, int i)
    {
        oso.put_int(i);
        return oso;
    }

Something like this (probably with a counter on the mutex or other C++ tricks to avoid the lock/unlock of the first copy). I use the same kind of trick for automatically putting a std::endl at the end of my logging primitive:

    struct AutoEndLine
    {
        AutoEndLine(std::ostream& os) : m_ostream(os) {}
        ~AutoEndLine() { m_ostream << std::endl; }

        template< class T >
        AutoEndLine& operator<<(const T& e) { m_ostream << e; return *this; }

        operator std::ostream&() { return m_ostream; }
    private:
        std::ostream& m_ostream;
    };

    #define LOG if(IS_LOG_ACTIVATE) ::CT::AutoEndLine(::std::cout) << "LOG: "

There are some advantages to operator<< (or %); for example, I like to do:

    LOG << "foo " << foo;
    ASSERT(f!=NULL) << "Error";
    DEBUG << "bar";

with

    #define DEBUG if(debug_active) LOG << "Debug: "

or something like this. Anyway, I don't think this is a good argument against <<, even if I agree with you on all the points of your previous post.

--
Cédric Venet

Ion Gaztañaga wrote:
Sebastian Redl wrote:
[snip]
-> Hard formatting: printf/scanf are much easier to use and they don't need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
I'd rather disagree with you here. Operator<< gives a great advantage of extensibility which printf cannot offer. I can define my own classes and overload the operator for them, and the IO system will work as expected with my extensions. The same goes for operator>> and scanf.

I have to admit, though, that formatting is indeed not a bright side of the current IO design. I think a better set of manipulators for basic primitives should do good. There might even be a printf-like manipulator:

    int x = 300;
    double y = 1.2;

    // Output: "Hello: 0x0000012c; 1,200"
    cout << "Hello: " << format("0x%08x", x) << "; "
         << fixed(y, point = ',', precision = 3) << endl;

    vector< int > v;

    // Output: "1, 2, 3"
    cout << range(v.begin(), v.end(), separator = ", ") << endl;

And of course, the widely used dump manipulator is an often requested feature. BTW, I think this domain is well suited to be covered by yet another Boost library. :)

Andrey Semashev <andysem@mail.ru> writes:
Ion Gaztañaga wrote:
Sebastian Redl wrote:
[snip]
-> Hard formatting: printf/scanf are much easier to use and they don't
need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
I'd rather disagree with you here. Operator<< gives a great advantage of extensibility which printf cannot offer. I can define my own classes and overload the operator for them, and the IO system will work as expected with my extensions. The same thing with operator>> and scanf.
That is not a real issue. You can easily provide a mechanism for supporting user-defined types on top of a printf/scanf-like interface, e.g. by specializing a template, or overloading some other function that is called by the implementation of the formatting system. So ease of supporting user-defined types is not an argument in favor of operator<<. A key advantage of a printf interface, in addition to being much less verbose, is that it is much more convenient for internationalization; under that interface, internationalization of messages is usually possible simply by changing the format strings, an approach that is well supported by packages like gettext. Using the operator<< interface, however, in general it is necessary to change the source code in order to support internationalization, because in addition to being more difficult to correctly translate a small snippet of a message instead of an entire format string (it may be necessary for the translator to look at the source code context to determine how to translate it), it may not even be possible due to differences in grammar between languages. [snip] -- Jeremy Maitin-Shepard
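As an illustration of the overload-based extension point mentioned above, a printf-like front end could route every argument through a free function that users overload; the names here are hypothetical.

    #include <ostream>

    // Default: fall back to whatever operator<< does for the type.
    template <typename T>
    void format_value(std::ostream& os, const T& value)
    {
        os << value;
    }

    // A user-defined type opts in with one overload, just as it would with operator<<.
    struct point { int x, y; };

    inline void format_value(std::ostream& os, const point& p)
    {
        os << '(' << p.x << ", " << p.y << ')';
    }

    // The printf-like engine scans its format string and calls format_value for
    // each argument, so supporting a new type costs exactly one overload.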

-> Hard formatting: printf/scanf are much easier to use and they don't
need several hundred of function calls to do their job. The operator<<
I'd rather disagree with you here. Operator<< gives a great advantage of extensibility which printf cannot offer. I can define my own classes That is not a real issue. You can easily provide a mechanism for supporting user-defined types on top of a printf/scanf-like interface, e.g. by specializing a template, or overloading some other function that is called by the implementation of the formatting system. So ease of supporting user-defined types is not an argument in favor of operator<<.
Suppose I want some function which takes a format string and an arbitrary number of parameters, passes them to the new printf and then does something (usually adds a prefix or a new line, or throws an exception). How can I do the forwarding in C++2003 (without defining N overloads)? Perhaps Fusion provides some support for this? Motivation:

    template<class... T>
    void PrintAndThrow(const char* c, T... args)
    {
        cout.printf(c, args...);
        throw something;
    }

Another thing is what the type of the format string would be: a const char* or std::string, which would need parsing each time (and no compile-time checking), or a complex type built by expression templates (I am thinking about some proto DSL here)? The latter would be more complex and slower to compile, but it has some advantages. This problem doesn't exist with <<, since we treat one object at a time - which is also the reason reordering is impossible and internationalization difficult.

--
Cédric Venet

Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Ion Gaztañaga wrote:
Sebastian Redl wrote:
[snip]
-> Hard formatting: printf/scanf are much easier to use and they don't
need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
I'd rather disagree with you here. Operator<< gives a great advantage of extensibility which printf cannot offer. I can define my own classes and overload the operator for them, and the IO system will work as expected with my extensions. The same thing with operator>> and scanf.
That is not a real issue. You can easily provide a mechanism for supporting user-defined types on top of a printf/scanf-like interface, e.g. by specializing a template, or overloading some other function that is called by the implementation of the formatting system. So ease of supporting user-defined types is not an argument in favor of operator<<.
I agree with Cedric's reply to this. Specialization is too much a burden and, in fact, is intrusive. As for overloading some external function, we already have this function, it's operator<<.
A key advantage of a printf interface, in addition to being much less verbose, is that it is much more convenient for internationalization; under that interface, internationalization of messages is usually possible simply by changing the format strings, an approach that is well supported by packages like gettext. Using the operator<< interface, however, in general it is necessary to change the source code in order to support internationalization, because in addition to being more difficult to correctly translate a small snippet of a message instead of an entire format string (it may be necessary for the translator to look at the source code context to determine how to translate it), it may not even be possible due to differences in grammar between languages.
Printf is not a solution since there may be radical differences in phrase composition in different languages. Internationalization is, indeed, an issue. But to my mind, if we're speaking of a Standard proposal, we need a more well-thought solution to this problem, and neither current streaming IO nor C-style IO suits it.

Andrey Semashev <andysem@mail.ru> writes: [snip]
That is not a real issue. You can easily provide a mechanism for supporting user-defined types on top of a printf/scanf-like interface, e.g. by specializing a template, or overloading some other function that is called by the implementation of the formatting system. So ease of supporting user-defined types is not an argument in favor of operator<<.
I agree with Cedric's reply to this. Specialization is too much a burden and, in fact, is intrusive. As for overloading some external function, we already have this function, it's operator<<.
The point is simply that the burden is in fact exactly the same for both mechanisms, so that issue cannot be used in favor of one or the other.
A key advantage of a printf interface, in addition to being much less verbose, is that it is much more convenient for internationalization; under that interface, internationalization of messages is usually possible simply by changing the format strings, an approach that is well supported by packages like gettext. Using the operator<< interface, however, in general it is necessary to change the source code in order to support internationalization, because in addition to being more difficult to correctly translate a small snippet of a message instead of an entire format string (it may be necessary for the translator to look at the source code context to determine how to translate it), it may not even be possible due to differences in grammar between languages.
Printf is not a solution since there may be radical differences in phrase composition in different languages.
Indeed, it is true that printf is not perfect, because it requires that the arguments are referenced in the same order that they are specified in the source code, and that they can only be referenced exactly once. Very simple extensions can be used to support that, however, such as allowing arguments to be referenced by number (or perhaps by a name) and thereby provide a suitable solution.
Internationalization is, indeed, an issue. But to my mind, if we're speaking of a Standard proposal, we need a more well-thought solution to this problem, and neither current streaming IO nor C-style IO suits it.
I do agree that a well thought out proposal is needed, and that neither iostreams nor C printf are suitable. I'll be happy with anything that is efficient (both run-time and compile-time), supports internationalization, user-defined types, and isn't excessively verbose. It just seems that starting with a printf-like mechanism (or better yet, existing facilities in libraries like IBM ICU that resemble printf), and then trying to determine how to support all of the desired objectives, rather than starting with iostreams, may be a better idea. -- Jeremy Maitin-Shepard
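As a concrete picture of what numbered placeholders buy for translation, Boost.Format (which comes up just below) already lets a translated format string reorder its arguments without touching the code that supplies them; the strings here are only illustrative:

#include <boost/format.hpp>
#include <iostream>
#include <string>

int main()
{
    const std::string file = "config.xml";
    const int line = 42;

    // English format string: file first, line second.
    const char* en = "error in %1% at line %2%";
    // A translation may reference the same arguments in a different
    // order without any change to the code that feeds them.
    const char* de = "Fehler in Zeile %2% der Datei %1%";

    std::cout << boost::format(en) % file % line << '\n';
    std::cout << boost::format(de) % file % line << '\n';
}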

Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Printf is not a solution since there may be radical differences in phrase composition in different languages.
Indeed, it is true that printf is not perfect, because it requires that the arguments are referenced in the same order that they are specified in the source code, and that they can only be referenced exactly once. Very simple extensions can be used to support that, however, such as allowing arguments to be referenced by number (or perhaps by a name) and thereby provide a suitable solution.
You just described the Boost.Format library. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo

Rene Rivera <grafikrobot@gmail.com> writes:
Jeremy Maitin-Shepard wrote:
Andrey Semashev <andysem@mail.ru> writes:
Printf is not a solution since there may be radical differences in phrase composition in different languages.
Indeed, it is true that printf is not perfect, because it requires that the arguments are referenced in the same order that they are specified in the source code, and that they can only be referenced exactly once. Very simple extensions can be used to support that, however, such as allowing arguments to be referenced by number (or perhaps by a name) and thereby provide a suitable solution.
You just described the Boost.Format library.
The boost format library may be pretty good. It could presumably be adapted to work on top of a different set of backend facilities for formatting various types to character strings/streams, as opposed to being based on the iostreams system. -- Jeremy Maitin-Shepard

Quoting Jeremy Maitin-Shepard <jbms@cmu.edu>:
Rene Rivera <grafikrobot@gmail.com> writes:
You just described the Boost.Format library.
The boost format library may be pretty good. It could presumably be adapted to work on top of a different set of backend facilities for formatting various types to character strings/streams, as opposed to being based on the iostreams system,
boost.format seems to be unknown or not much liked; anyway, it doesn't seem to be the reference. However, it should cover the majority of the need. It's just not efficient (something like 3 or 4 times slower than printf according to the doc, if I recall correctly). For most applications this wouldn't be a problem, but sometimes it can be. We could imagine something like:

cout << format("var ", _s, " = ", _h<8>, ";") % v.n % v.v << endl;
or
cout << format("var ", _s, " = ", _h<8>, ";")(v.n, v.v) << endl;
or
cout << format("var " % _s % " = " % _h<8> % ";", v.n, v.v) << endl;

where _s and _h are respectively a string and a hex integer; the positional and format options could be passed as template or runtime parameters. For user-defined type output, either use << like boost.format, or define another free function which could take additional parameters like formatting options:

template<class T, class OPT> void toString(stream& os, T t, OPT opt) { ... }

This shouldn't be too slow at compile time, but I fear a little code bloat. There is then a speed/size tradeoff in selecting runtime or compile-time parameters. -- Cédric Venet

Jeremy, sorry for the double post (I dislike this webmail, which ignores the reply-to).
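A small sketch of the free-function idea above, with invented names (toString and int_options are not an existing interface), might look like this:

#include <iomanip>
#include <iostream>
#include <sstream>

// Hypothetical per-type formatting options.
struct int_options { int width; bool hex; };

// Fallback: ignore the options and rely on operator<<.
template <typename T, typename Options>
void toString(std::ostream& os, const T& value, const Options&)
{
    os << value;
}

// User- (or library-) provided overload for int with richer options.
inline void toString(std::ostream& os, int value, const int_options& opt)
{
    if (opt.hex)   os << std::hex;
    if (opt.width) os << std::setw(opt.width) << std::setfill('0');
    os << value << std::dec;
}

int main()
{
    std::ostringstream out;
    int_options opt = {8, true};
    toString(out, 0xBEEF, opt);
    std::cout << out.str() << '\n'; // prints "0000beef"
}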

----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Tuesday, June 19, 2007 5:15 PM Subject: Re: [boost] [rfc] I/O Library Design [snip]
Indeed, it is true that printf is not perfect, because it requires that the arguments are referenced in the same order that they are specified
[snip]
Internationalization is, indeed, an issue. But to my mind, if we're speaking of a Standard proposal, we need a more well-thought solution to this problem, and neither current streaming IO nor C-style IO suits it.
I do agree that a well thought out proposal is needed, and that neither iostreams nor C printf are suitable. I'll be happy with anything that
Interesting. Would be real keen to know what the actual goals would be for such a proposal. I can see several excellent proposals arising from the ideas mentioned here, but each would have different goals.

Scott Woods wrote:
----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Tuesday, June 19, 2007 5:15 PM Subject: Re: [boost] [rfc] I/O Library Design
[snip]
Indeed, it is true that printf is not perfect, because it requires that the arguments are referenced in the same order that they are specified
[snip]
Internationalization is, indeed, an issue. But to my mind, if we're speaking of a Standard proposal, we need a more well-thought solution to this problem, and neither current streaming IO nor C-style IO suits it. I do agree that a well thought out proposal is needed, and that neither iostreams nor C printf are suitable. I'll be happy with anything that
Interesting. Would be real keen to know what the actual goals would be for such a proposal. I can see several excellent proposals arising from the ideas mentioned here, but each would have different goals.
From my point of view we have touched two problems in the discussion: the ability to easy and efficiently format things into text or octet strings and internationalization support. These two tasks are attempted to be solved by the current C++ IO implementation, but the attempt is suboptimal in various ways. I think there should be two proposals, each aimed at solving the corresponding problem. The first one should propose an easy and efficient way of formatting data into strings, and the second one should provide support for i18n. The latter means, for example, the ability to select and format messages to the user in a language known at run time. This may involve things like resources (maybe dynamically loaded) and Unicode support, but the key requirement is that there should be no need to modify the application's source code to run it in another language. This proposal separation doesn't mean that they are not related. We may eventually end up with a single solution to both problems. But for now I see them as different issues with different requirements on the implementation (for example, our discussion of the printf and streaming approaches to formatting shows that the single format string compromises performance for general formatting cases, but may help to solve the i18n problem). I hope I've answered your question. Meantime, I can see now that we're heading a bit off-topic from the original post, since the initial discussion began on a new IO architecture proposal, which is a bit aside from the problems I mentioned above. Sorry, Sebastian.

Andrey Semashev <andysem@mail.ru> writes: [snip]
From my point of view we have touched two problems in the discussion: the ability to easy and efficiently format things into text or octet strings and internationalization support. These two tasks are attempted to be solved by the current C++ IO implementation, but the attempt is suboptimal in various ways.
Well, there is also the lowest-level task, which is actually the primary task to be addressed by the new I/O library, of providing a general stream framework that supports bytes, arbitrary character encodings, and (as I think it should) arbitrary POD types even. This stream framework should include common filters, like converting between character encodings and newline conversion. Thus, the task of converting from e.g. a stream of UTF-16 encoded uint16_t to e.g. a stream of iso-8859-1 uint8_t is completely separate from the issue of formatting text. [snip]
Meantime, I can see now that we're heading a bit off-topic from the original post, since the initial discussion began on a new IO architecture proposal, which is a bit aside from the problems I mentioned above. Sorry, Sebastian.
I don't think it is entirely off topic, simply because it may be best for the text formatting facilities to be designed to output to any stream (where by stream I mean stream as defined by the new I/O library, which could, but need not, be constructed on top of a string), and similarly, the text parsing facilities should be designed to read from an arbitrary stream. -- Jeremy Maitin-Shepard

----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Wednesday, June 20, 2007 8:33 AM Subject: Re: [boost] [rfc] I/O Library Design
Andrey Semashev <andysem@mail.ru> writes:
[snip]
From my point of view we have touched two problems in the discussion: the ability to easy and efficiently format things into text or octet strings and internationalization support.
With you so far, but there is so much more! ;-)

What if a developer wants to print an unsigned long as an IP host address (dotted notation), within some debug output? What if that same developer wants to allow for the input of some IP address other than the printed default?

What if a developer wants to print a vector<site_security_profile> as a readable table (headings, tabstops...), within some logging system? What if the same developer wants to allow for runtime input of the security profiles, beyond the content of a file with appropriately formatted defaults?

What if a developer is initializing a UI report control with the contents of map<SKU,shelf_item>? And the final product is to be rolled out through a franchise spanning most of Europe?

Don't mean to be flippant with these examples. They are just supposed to shine the light around; drag some significant parties into the discussion, i.e. would resulting libraries be targeted at developers and their debug requirements or developers and their pan-language UI requirements?

A functional breakdown of the above might look like;

1. Character stream with embedded runtime values * default format control on a per-C++ type basis * format overrides (unsigned long as dotted IP) * user-defined format for user-defined types * user-defined input-text-to-type-instance conversion

2. Character stream with embedded runtime layout control * default layout control for tabulation of STL containers (no headers, 8 space tabstops) * layout overrides (specific headers and tabstops) * user-defined layouts for user-defined types (e.g. labels, line-per-type-member) * navigation, selection and text input of layouts

3. Runtime construction of i18n display information * too big to start listing

This is just a sequential listing; the order is not intended to be significant. The grouping relates to the examples rather than a set of suggested proposals.

I suspect there is a matrix of end-user system requirements and potential software technologies. Developers need to be able to declare at compile-time whether the application debug stream has user-defined types and/or unicoding requirements. Does the UI need to speak anything other than Cantonese (and is that the developers first language?).

With that ability to "cook up" the different streams the potential arguments over whether unicoding should be present within a debug stream evaporates. It becomes a per-application decision.

As further input to this thread; I have recently done a lot of work in the areas of serialization and also rendering that same information in different UIs. There is an intriguing amount of overlap between what serialization does and what debug/logging/UI streams need to do. While I won't go as far as saying that application streams should exist over serialization technology (hmmmm) there might be borrowed techniques?

Cheers.

Scott Woods wrote:
----- Original Message -----
Andrey Semashev <andysem@mail.ru> writes:
[snip]
From my point of view we have touched two problems in the discussion: the ability to easy and efficiently format things into text or octet strings and internationalization support.
With you so far, but there is so much more! ;-)
What if...
[snip]
Don't mean to be flippant with these examples. They are just supposed to shine the light around; drag some significant parties into the discussion, i.e. would resulting libraries be targeted at developers and their debug requirements or developers and their pan-language UI requirements?
A functional breakdown of the above might look like;
1. Character stream with embedded runtime values * default format control on a per-C++ type basis * format overrides (unsigned long as dotted IP) * user-defined format for user-defined types * user-defined input-text-to-type-instance conversion
2. Character stream with embedded runtime layout control * default layout control for tabulation of STL containers (no headers, 8 space tabstops) * layout overrides (specific headers and tabstops) * user-defined layouts for user-defined types (e.g. labels, line-per-type-member) * navigation, selection and text input of layouts
Most of the above is about formatting and parsing data to/from text. IMHO, most of it is implementable via stream manipulators in the same manner. Thus I don't see much difference between 1 and 2 in this way.
3. Runtime construction of i18n display information * too big to start listing
That's the least covered area in the discussion. Unfortunately, I don't have much experience in supporting i18n since I mostly develop server applications, so I don't have much to say here. It would be good if someone more experienced than me could elaborate this field better.
This is just a sequential listing; the order is not intended to be significant. The grouping relates to the examples rather than a set of suggested proposals.
I suspect there is a matrix of end-user system requirements and potential software technologies. Developers need to be able to declare at compile-time whether the application debug stream has user-defined types and/or unicoding requirements. Does the UI need to speak anything other than Cantonese (and is that the developers first language?).
With that ability to "cook up" the different streams the potential arguments over whether unicoding should be present within a debug stream evaporates. It becomes a per-application decision.
Agreed, users should be able to decide what they need and should not pay for what they don't need.
As further input to this thread; I have recently done a lot of work in the areas of serialization and also rendering that same information in different UIs. There is an intriguing amount of overlap between what serialization does and what debug/logging/UI streams need to do. While I won't go as far as saying that application streams should exist over serialization technology (hmmmm) there might be borrowed techniques?
Could be. In fact, I tend to think that streaming IO should implicitly support serialization as a form of formatting.

On Mon, Jun 18, 2007 at 11:48:41PM +0400, Andrey Semashev wrote:
Ion Gaztañaga wrote:
I have to admit though, that indeed formatting is not a bright side of the current IO design. I think a better set of manipulators for basic primitives should do good. There might even be a printf-like manipulator:
Do not forget about the mighty fine Boost.Format library. It's already in boost, and is quite capable of type safe printf-like formatting. -- Lars Viklund ------------------- To make it is hell. To fail is divine.

Lars Viklund wrote:
On Mon, Jun 18, 2007 at 11:48:41PM +0400, Andrey Semashev wrote:
Ion Gaztañaga wrote:
I have to admit though, that indeed formatting is not a bright side of the current IO design. I think a better set of manipulators for basic primitives should do good. There might even be a printf-like manipulator:
Do not forget about the mighty fine Boost.Format library. It's already in boost, and is quite capable of type safe printf-like formatting.
That's the key problem - it's like printf, thus makes extensibility for user-defined types difficult, if possible. What I was talking about is the best of the two worlds - ease of formatting and extensibility.

Andrey Semashev wrote:
Lars Viklund wrote:
On Mon, Jun 18, 2007 at 11:48:41PM +0400, Andrey Semashev wrote:
Ion Gazta?aga wrote:
I have to admit though, that indeed formatting is not a bright side of the current IO design. I think a better set of manipulators for basic primitives should do good. There might even be a printf-like manipulator:
Do not forget about the mighty fine Boost.Format library. It's already in boost, and is quite capable of type safe printf-like formatting.
That's the key problem - it's like printf, thus makes extensibility for user-defined types difficult, if possible.
No, Boost.Format is based on iostreams internally, so it automatically makes use of any operator<< functions that have been defined for user-defined types. It is also unlike printf in that it largely ignores the actual letter in the format specifier; see the recent thread about uint8_t being formatted as a char even when %d is used, for example. All of the current formatted I/O mechanisms have their problems. I don't have any good ideas about how to fix them, but maybe variadic templates will give someone a clever idea... Phil.
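A short example of the behaviour Phil describes (point is a made-up type): because Boost.Format renders each argument through an ostream, an existing inserter is picked up automatically.

#include <boost/format.hpp>
#include <iostream>

struct point { int x, y; };

// Ordinary iostream inserter for the user-defined type.
std::ostream& operator<<(std::ostream& os, const point& p)
{
    return os << '(' << p.x << ", " << p.y << ')';
}

int main()
{
    point p = {3, 4};
    // Boost.Format builds on iostreams, so the %1% placeholder
    // simply reuses the operator<< defined above.
    std::cout << boost::format("the point is %1%\n") % p;
}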

Phil Endecott wrote:
Andrey Semashev wrote:
Lars Viklund wrote: That's the key problem - it's like printf, thus makes extensibility for user-defined types difficult, if possible.
No, Boost.Format is based on iostreams internally, so it automatically makes use of any operator<< functions that have been defined for user-defined types. It is also unlike printf in that it largely ignores the actual letter in the format specifier; see the recent thread about uint8_t being formatted as a char even when %d is used, for example.
Ah, you're right. I must have mixed it up with some other printf-like library I dealt with. This looks like a very nice but heavy solution to me. If only it could perform at least as fast as regular stream output... I'm still finding myself using itoa & co. instead of sprintf, lexical_cast or ostringstream quite often just because it's easy and fast. I think Boost.Format performance could be noticeably higher if the value concept of the formatter was separated from the formatting functionality. This is better shown by example:

// This creates an object that may be used like the
// current boost::format object and may be implemented similarly
boost::formatter fmt("%1%, %2%, %3%");
fmt % a % b % c;
std::cout << fmt.str();

// The "format" function returns an object of unnamed type,
// which is essentially a tuple of pointers to the formatting
// string and references to the arguments passed
std::cout << boost::format("Hello: %1%, %2%, %3%", a, b, c);

The advantage of boost::format above is that there are no additional dynamic allocations and neither the format string (or its part) nor a, b or c are copied. The drawback is that the same argument may be output multiple times if its identifier is mentioned more than once in the format string. But I think this case is quite uncommon.
All of the current formatted I/O mechanisms have their problems. I don't have any good ideas about how to fix them, but maybe variadic templates will give someone a clever idea...
With variadic templates Boost.Format could become a true formatting manipulator, taking all formatted parameters as arguments rather than by feeding operator% (see boost::format in my code snippet above). It seems to me there's too much compromise in this operator usage, and the size of the "Choices made" section tells me I'm right.
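A sketch of the deferred-formatting object described above, written with variadic templates and present-day library helpers (all names are illustrative; this is not an existing Boost interface): format() only captures the format string and references to its arguments, and all the work happens in operator<<.

#include <cstddef>
#include <iostream>
#include <tuple>
#include <utility>

template <typename... Args>
struct format_args
{
    const char* fmt;
    std::tuple<const Args&...> args; // references only, nothing copied
};

template <typename... Args>
format_args<Args...> format(const char* fmt, const Args&... args)
{
    return {fmt, std::tie(args...)};
}

// Replace each '%' placeholder with the next argument, in order.
inline void render(std::ostream& os, const char* fmt) { os << fmt; }

template <typename T, typename... Rest>
void render(std::ostream& os, const char* fmt, const T& head, const Rest&... tail)
{
    for (; *fmt; ++fmt) {
        if (*fmt == '%') { os << head; return render(os, fmt + 1, tail...); }
        os << *fmt;
    }
}

template <typename... Args, std::size_t... I>
void render_tuple(std::ostream& os, const format_args<Args...>& f,
                  std::index_sequence<I...>)
{
    render(os, f.fmt, std::get<I>(f.args)...);
}

template <typename... Args>
std::ostream& operator<<(std::ostream& os, const format_args<Args...>& f)
{
    render_tuple(os, f, std::index_sequence_for<Args...>());
    return os;
}

int main()
{
    int a = 1; double b = 2.5; const char* c = "three";
    std::cout << format("a=%, b=%, c=%\n", a, b, c);
}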

Ion Gaztañaga wrote:
-> Hard formatting: printf/scanf are much easier to use and they don't need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
Something that would use meta-programming to optimize the generation of the formatted string would be way better than printf.

Mathias Gaunard wrote:
Ion Gaztañaga wrote:
-> Hard formatting: printf/scanf are much easier to use and they don't need several hundred of function calls to do their job. The operator<< overloading is also a problem if each call needs some internal locking. A formatting syntax that can minimize the locking needs would be a good choice. I would be really happy with a type-safe printf (variadic templates to the rescue).
Something that would use meta-programming to optimize the generation of the formatted string would be way better than printf.
How does this improve on printf? Not that I favor printf, but I don't see how this is any better. Speed-wise, yes, much better at run time at least; printf is a hog on CPU cycles. Do you plan on generating every possible string at compile time? I call that code bloat.
I don't think there is an acceptable solution to make everyone happy. Printf is easier to use for i18n but I like the iostreams interface better. You can always write manipulators to do what you need for formatting with iostreams if needed. Best Regards, Richard V. Day

Richard Day wrote:
Something that would use meta-programming to optimize the generation of the formatted string would be way better than printf.
How does this improve on printf? Not that I favor printf, but I don't see how this is any better. Speed-wise, yes, much better at run time at least; printf is a hog on CPU cycles. Do you plan on generating every possible string at compile time? I call that code bloat.
I call that lack of understanding. What we remove is simply the parsing of the format description, which can be done at compile-time.

Mathias Gaunard wrote:
Richard Day wrote:
Something that would use meta-programming to optimize the generation of the formatted string would be way better than printf.
How does this improve on printf? Not that I favor printf, but I don't see how this is any better. Speed-wise, yes, much better at run time at least; printf is a hog on CPU cycles. Do you plan on generating every possible string at compile time? I call that code bloat.
I call that lack of understanding. What we remove is simply the parsing of the format description, which can be done at compile-time.
Yes, it will surely be a bit more efficient, but maybe the run-time version will be much tinier (the compile-time parsing will surely create new types for each formatting need). I'm not against compile-time parsing engines, but we need really fast compilation times, and tiny size overhead. I'm maybe too ignorant on this issue (forgive me if I'm shooting my foot) but compile time generation creates a lot of new types that lead also to more typeinfo and other stuff to be linked to the program. My question is if printf is known to be slower than the compile-time parsing solution. If it's marginally slower, I would prefer the run-time approach; it's far more flexible (you can change the format on the fly, and you can't do that at compile time). Again: I might be saying something stupid. If so, let me know ;-) Regards, Ion

Ion Gaztañaga wrote:
I'm maybe too ignorant on this issue (forgive me if I'm shooting my foot) but compile time generation creates a lot of new types that lead also to more typeinfo and other stuff to be linked to the program.
AFAIK, creating new types doesn't add any data to the executable (except possibly debugging info) unless the type has static members (const integral ones don't count) or is polymorphic.

Mathias Gaunard wrote:
Ion Gaztañaga wrote:
I'm maybe too ignorant on this issue (forgive me if I'm shooting my foot) but compile time generation creates a lot of new types that lead also to more typeinfo and other stuff to be linked to the program.
AFAIK, creating new types doesn't add any data to the executable (except possibly debugging info) unless the type has static members (const integral ones don't count) or is polymorphic.
Type info is always emitted for a type unless it is disabled completely or optimized away by the compiler.

Mathias Gaunard wrote:
Richard Day wrote:
Something that would use meta-programming to optimize the generation of the formatted string would be way better than printf.
How does this improve on printf? Not that I favor printf, but I don't see how this is any better. Speed-wise, yes, much better at run time at least; printf is a hog on CPU cycles. Do you plan on generating every possible string at compile time? I call that code bloat.
I call that lack of understanding. What we remove is simply the parsing of the format description, which can be done at compile-time.
Yes, lack of understanding. Unfortunately I could only go on what you had written above; apparently you intended something different than what it appeared to suggest. My apologies.
Best regards, Richard Day

Richard Day wrote:
Yes lack of understanding. Unfortunately I could only go on what you had written above, apparently you intend something different then what it appeared to suggest. My apologies.
Sorry if that was unclear. 90% of the time, the string that specifies the format in (s)printf is a literal. We could probably exploit that fact to have that parsing done at compile-time. The syntax would need to be a little different though. Could be like xpressive static regexes maybe.
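One possible shape of such a "static" format, purely for illustration (no existing library): describe the output with small typed pieces instead of a string, so there is nothing left to parse at run time; the compile-time cost is ordinary inlining.

#include <iomanip>
#include <iostream>

// Illustrative typed "format piece"; the structure of the output is
// fixed at compile time. (Stream-state stickiness is glossed over.)
struct hex8
{
    unsigned value;
};

inline std::ostream& operator<<(std::ostream& os, hex8 h)
{
    return os << std::hex << std::setw(8) << std::setfill('0')
              << h.value << std::dec;
}

int main()
{
    unsigned addr = 0xBEEF;
    int count = 42;
    // Roughly the equivalent of printf("addr=%08x count=%d\n", ...),
    // but the "parsing" of the format happened in the compiler.
    std::cout << "addr=" << hex8{addr} << " count=" << count << '\n';
}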

Mathias Gaunard <mathias.gaunard@etu.u-bordeaux1.fr> writes: [snip]
90% of the time, the string that specifies the format in (s)printf is a literal. We could probably exploit that fact to have that parsing done at compile-time. The syntax would need to be a little different though. Could be like xpressive static regexes maybe.
The issue is that in addition to not supporting internationalization as well, this compile-time approach is likely to increase generated code size and greatly slow down compile time. -- Jeremy Maitin-Shepard

As long as we're talking about a new I/O library for C++ I'll go ahead and give my $0.02.

I think some consideration should be given to designing an I/O library on the basis of STL concepts. After all, files aren't really streams, they're arrays of bytes (or a similar basic element). Thus it seems more natural to model a class representing a file after std::vector rather than some stream concept. In fact I've done exactly this using memory mapped file I/O as part of another project where I needed to store many Bidirectional Iterators to various files and it worked quite well.

Streaming I/O could be done with iterators. Conceptually streams can be thought of as simply being specialized iterators with different syntax. For example, a socket class would present a pair of iterators, one Input Iterator and one Output Iterator (or maybe just a single Forward Iterator?). They would be theoretically never-ending but the Input Iterator would end up equal to some end iterator when the socket has been closed and there is no further data available.

Filtering could be done with wrapper iterators that consume elements from a stored iterator and emit elements of a potentially different type after performing some transformation. Wrapper iterators would be considered equal if the iterators they are holding are equal.

Several of the current standard algorithms could be used usefully in such a scheme, and obviously a great many more could be created for I/O-centric operations. This seems to me like a more natural and flexible interface than IO Streams while making transfers to and from STL containers more convenient. Steven Siloti
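The standard library already gestures in this direction: std::istreambuf_iterator and std::ostreambuf_iterator expose a stream buffer as Input/Output Iterators, so ordinary algorithms and container constructors can consume or fill a stream. For example (the file names are placeholders):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ifstream file("example.bin", std::ios::binary);

    // Drain the whole stream into a container via an InputIterator pair.
    std::vector<char> bytes((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    std::cout << "read " << bytes.size() << " bytes\n";

    // Copying to an OutputIterator works the same way in reverse.
    std::ofstream out("copy.bin", std::ios::binary);
    std::copy(bytes.begin(), bytes.end(),
              std::ostreambuf_iterator<char>(out));
}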

Steven Siloti wrote:
Streaming I/O could be done with iterators. Conceptually streams can be thought of as simply being specialized iterators with different syntax. For example, a socket class would present a pair of iterators, one Input Iterator and one Output Iterator (or maybe just a single Forward Iterator?). They would be theoretically never-ending but the Input Iterator would end up equal to some end iterator when the socket has been closed and there is no further data available.
Short question: Wouldn't you need to handle block transfers with some "regular" function anyway, since you do not want to fetch each byte via an iterator dereference? Or do you assume buffering, etc. within the iterator? Cheers, /Marcus

Steven Siloti <ssiloti@gmail.com> writes:
As long as we're talking about a new I/O library for C++ I'll go ahead and give my $0.02.
I think some consideration should be given to designing an I/O library on the basis of STL concepts. After all, files aren't really streams, they're arrays of bytes (or a similar basic element). Thus it seems more natural to model a class representing a file after std::vector rather than some stream concept. In fact I've done exactly this using memory mapped file I/O as part of another project where I needed to store many Bidirectional Iterators to various files and it worked quite well.
It would certainly be a requirement to provide an iterator/range interface to a stream. The boost iostreams library also has the concept of a direct stream, that is accessed by an iterator range (or maybe with the requirement that the iterators be pointers), which may be a useful concept to integrate into a new I/O library. The issue with relying on an iterator interface exclusively is the lack of efficiency: without a buffer (and in general, there need not be one), each dereference of the input iterator, or assignment to the output iterator, would require a separate call to the underlying source, which might mean a read/write system call. Furthermore, because the iterator interface requires that only a single element is processed at a time, applying type erasure/runtime polymorphism to the iterator types is much less efficient, since a virtual function call would be needed for every element, whereas with a stream interface, an entire array of elements can be processed using only a single virtual function call. [snip] -- Jeremy Maitin-Shepard
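A minimal sketch of the efficiency point (the interface names are invented for the example): with a stream-style interface one virtual call moves a whole block, whereas a type-erased iterator would pay a virtual call per element.

#include <cstddef>
#include <cstring>
#include <iostream>

struct byte_source
{
    virtual ~byte_source() {}
    // One virtual call transfers up to n bytes.
    virtual std::size_t read(unsigned char* buffer, std::size_t n) = 0;
};

struct memory_source : byte_source
{
    const unsigned char* data;
    std::size_t size, pos;
    memory_source(const unsigned char* d, std::size_t s) : data(d), size(s), pos(0) {}

    std::size_t read(unsigned char* buffer, std::size_t n)
    {
        if (n > size - pos) n = size - pos;
        std::memcpy(buffer, data + pos, n);
        pos += n;
        return n;
    }
};

int main()
{
    const unsigned char payload[] = "hello, streams";
    memory_source src(payload, sizeof payload);

    unsigned char buffer[64];
    std::size_t got = src.read(buffer, sizeof buffer); // one call, many bytes
    std::cout << "got " << got << " bytes\n";
}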

----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Wednesday, June 20, 2007 7:47 PM Subject: Re: [boost] [rfc] I/O Library Design
Steven Siloti <ssiloti@gmail.com> writes:
[snip]
interface to a stream. The boost iostreams library also has the concept of a direct stream, that is accessed by an iterator range (or maybe with the requirement that the iterators be pointers), which may be a useful concept to integrate into a new I/O library.
The issue with relying on an iterator interface exclusively is the lack of efficiency: without a buffer (and in general, there need not be one), each dereference of the input iterator, or assignment to the output iterator, would require a separate call to the underlying source, which might mean a read/write system call. Furthermore, because the iterator interface requires that only a single element is processed at a time, applying type erasure/runtime polymorphism to the iterator types is much less efficient, since a virtual function call would be needed for every element, whereas with a stream interface, an entire array of elements can be processed using only a single virtual function call.
These issues sound a lot like issues solved in serialization. I have also created extensions to STL such that an input iterator representing an open file could be passed to standard algorithms. Off-hand I can't see why an iterator could not be crafted that attaches itself to a standard stream (shucks, doesn't this exist already?). Immediately you would have an "STL interface" to streams where a single input operation (e.g. *itr) does not necessarily result in a system call (buffering in the standard stream). Whether an STL interface to traditional application streams is an improvement, is a little murky. What do we currently do with these streams and how would iterator-based access improve the quality of my app? If you were implementing all the *nix utilities it would be cool. If you were printing debug and taking input of test-run parameters, it might not. Maybe another part of the matrix, i.e. not exclusively iterator-based?

Scott Woods wrote:
----- Original Message ----- From: "Jeremy Maitin-Shepard" <jbms@cmu.edu> To: <boost@lists.boost.org> Sent: Wednesday, June 20, 2007 7:47 PM Subject: Re: [boost] [rfc] I/O Library Design
These issues sound a lot like issues solved in serialization. I have also created extensions to STL such that an input iterator representing an open file could be passed to standard algorithms.
Off-hand I can't see why an iterator could not be crafted that attaches itself to a standard stream (shucks, doesn't this exist already?). Immediately you would have an "STL interface" to streams where a single input operation (e.g. *itr) does not necessarily result in a system call (buffering in the standard stream).
Whether an STL interface to traditional application streams is an improvement, is a little murky. What do we currently do with these streams and how would iterator-based access improve the quality of my app?
If you were implementing all the *nix utilities it would be cool. If you were printing debug and taking input of test-run parameters, it might not.
Maybe another part of the matrix, i.e. not exclusively iterator-based?
I did not mean to imply that iterators should be the exclusive means of access; my thinking at the moment just happened to depart down that tangent. What I really meant to say is that I would like to see an I/O library whose lowest level uses generic programming and extends the concepts used in the STL, rather than the OOP style used with IOStreams and the Java I/O library. Higher level operations like formatted output could be implemented on top of this foundation. Going further down the iterators tangent, I could imagine having the ability to create block-level iterators that deal in arrays of elements for when one would want to use an iterator to do bulk processing. Steven Siloti

Sebastian Redl wrote:
A few weeks ago, a discussion that followed the demonstration of the binary_iostream library made me think about the standard C++ I/O and what I would expect from an I/O model.
Now I have a preliminary design document ready and would like to have some feedback from the Boost community on it.
Hi Sebastian,

This is an interesting document and you have obviously put a lot of work into it. My few thoughts follow. I can't claim to have great insight into this problem, but there have been more than a couple of times when the limitations of what is currently available have struck me.

** Formatting of user-defined types often broken in practice.

The ability to write overloaded functions to format user-defined types for text I/O is attractive in theory, but in practice it always lets me down somewhere. My main complaint is that neither of these work:

typedef std::set<thing> things_t;
operator<<(things_t things) { .... } // doesn't work because things_t is a typedef

uint8_t i;
cout << i; // doesn't work because uint8_t is actually a char

When I do have a class, I often find that there is more than one way in which I'd like to format it, but there is only one operator<< to overload. And often I want to put the result of the formatting into a string, not a stream. So for all of these reasons I have more explicit to_str() functions in my code than operator<<s.

** lexical_cast<> uses streams, should be the reverse.

Currently we implement formatters that output to streams. We implement lexical_cast using stringstreams. Surely it would be preferable to implement formatters as specialisations of lexical_cast to a string (or character sequence / output iterator / whatever) and to implement formatted output to streams on top of that. I suppose you could argue that the stream model is better for very large amounts of output since you don't accumulate it all in a temporary string, but I've never encountered a case where that would matter.

** Formatting state has the wrong scope

Spot the mistake here:

cout << "address of buffer = 0x" << hex << p;

yes, I forgot to <<dec<< afterwards, so in some totally different part of the program when I write cout << "DEBUG: x=" << x and it prints '10', I think "10? should be 16!" and spend ages debugging. But reverting to dec might not be the right thing to do depending on what the caller was in the middle of doing, so I really want to save/restore the formatting state. And if I throw or do a premature return I still want the formatting state to be reverted:

void f() {
  scoped_fmt_state(cout,hex);
  cout << ....;
  if (...) throw;
  cout << .....;
}

Hmm, I think that's too much work. I'd be happy with NO formatting state in the stream, and to use explicit formatting when I want it:

cout << hex(x);
OR
cout << format("%08x",x);
OR
printf(stdout,"%08x",x);

(No, I don't really use printf() in C++ code. But it does have its strengths; it's by far the best way to output a uint8_t. And it _is_ type safe if you are using a compiler that treats it as special.)

** Too much disconnect between POSIX file descriptors and std::streams

I have quite a lot of code that uses sockets and serial ports, does ioctls on file descriptors, and things like that. So I have a FileDescriptor class that wraps a file descriptor with methods that implement simple error-trapping wrappers around the POSIX function calls. Currently, there's a strong separation between what I can do to a FileDescriptor (i.e. reads and writes) and what I can do to a stream. There is no reason why this has to be the case. It should be possible to add buffering to a FileDescriptor *and only add buffering*, and it should be possible to do formatted I/O on a non-buffered FileDescriptor.

In other words:

class ReadWriteThing;
class FileDescriptor: ReadWriteThing;
class Stream: ReadWriteThing;

FileDescriptor fd("192.168.1.1:80"); // a socket
int i=1234;
fd << "GET " << i << "\r\n"; // Unbuffered write, text formatting.

Stream s("foo.bin"); // a file, with a buffering layer
for (int i=0; i<1000; ++i) {
  short r = f();
  s.write(r); // Buffered, non-formatted binary write.
}

** Character sets need support

This is a hugely complex area which native English speakers are uniquely unqualified to talk about. I think that a starting point would be for someone to write a Boost interface to iconv (I have an example that makes functors for iconv conversions), and to write a tagged-string class that knows its encoding (either a compile-time type tag or a run-time enumeration tag or both). Ideally we'd spend a couple of years getting used to using that, and then consider how it can best integrate with IO.

Regards, Phil.

Phil Endecott wrote:
** Formatting of user-defined types often broken in practice.
The ability to write overloaded functions to format user-defined types for text I/O is attractive in theory, but in practice it always lets me down somewhere. My main complaint is that neither of these work:
typedef std::set<thing> things_t; operator<<(things_t things) { .... } // doesn't work because things_t is a typedef
I see no specific reason why that would fail, as long as there isn't an operator << for std::set<thing> somewhere already. It's even legal, I think, because std::set<thing> depends on a type not in namespace std. (You can't overload for std::set<int>, for example, by the rules of the standard.)
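Indeed, a small test along these lines compiles and behaves as expected (thing is an invented type for the example); the typedef is transparent to overload resolution:

#include <iostream>
#include <set>

struct thing
{
    int id;
    bool operator<(const thing& other) const { return id < other.id; }
};

typedef std::set<thing> things_t; // the typedef changes nothing: overload
                                  // resolution still sees std::set<thing>

std::ostream& operator<<(std::ostream& os, const things_t& things)
{
    for (things_t::const_iterator i = things.begin(); i != things.end(); ++i)
        os << i->id << ' ';
    return os;
}

int main()
{
    things_t things;
    thing a = {1}, b = {2};
    things.insert(a);
    things.insert(b);
    std::cout << things << '\n'; // prints "1 2 "
}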
uint8_t i; cout << i; // doesn't work because uint8_t is actually a char
Yes, that's annoying. In my opinion, it's a defect in the standard that unsigned and signed char are treated as characters instead of small integers. Characters is what char is for.
When I do have a class, I often find that there is more than one way in which I'd like to format it, but there is only one operator<< to overload. And often I want to put the result of the formatting into a string, not a stream.
I have an idea for a formatting system that should address all these issues. Basically, a format string would be able to specify, in an extensible and type-safe way, how to format an object. The format string would be used to look up a formatter in some sort of registry.
** lexical_cast<> uses streams, should be the reverse.
Currently we implement formatters that output to streams. We implement lexical_cast using stringstreams. Surely it would be preferable to implement formatters as specialisations of lexical_cast to a string (or character sequence / output iterator / whatever) and to implement formatted output to streams on top of that. I suppose you could argue that the stream model is better for very large amounts of output since you don't accumulate it all in a temporary string, but I've never encountered a case where that would matter.
I have written in another post why I think the stream interface is better. Efficiency is one part of the issue. Another is that the code is simpler that way for the library implementer, and the difference is transparent for the library user. Also, it means that it's easier to switch the string type used (something that is not uncommon).
** Formatting state has the wrong scope void f() { scoped_fmt_state(cout,hex); cout << ....; if (...) throw; cout << .....; }
Hmm, I think that's too much work. I'd be happy with NO formatting state in the stream, and to use explicit formatting when I want it:
cout << hex(x); OR cout << format("%08x",x); OR printf(stdout,"%08x",x);
I absolutely agree. Stateful formatting is generally not good. The only state that should be in formatting is the used locale.
And it _is_ type safe if you are using a compiler that treats it as special.)
... _and_ if you use a string literal as the formatting string. Far from guaranteed, especially when localizing.
** Too much disconnect between POSIX file descriptors and std::streams
I cannot make myself think of this specific issue as a defect. It would mean platform coupling.
I have quite a lot of code that uses sockets and serial ports, does ioctls on file descriptors, and things like that. So I have a FileDescriptor class that wraps a file descriptor with methods that implement simple error-trapping wrappers around the POSIX function calls.
Is there any specific reason you cannot implement a streambuffer that acts on a file descriptor? A streambuffer, despite its name, doesn't have to buffer data.
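For reference, a bare-bones sketch of such a streambuf over a POSIX descriptor (no buffering, minimal error handling; the class name is made up for the example):

#include <cstddef>
#include <iostream>
#include <streambuf>
#include <unistd.h>

// A std::streambuf need not buffer: this one hands every write
// straight to a POSIX file descriptor.
class fd_streambuf : public std::streambuf
{
public:
    explicit fd_streambuf(int fd) : fd_(fd) {}

protected:
    // Called per character because no put area is set up.
    int_type overflow(int_type c)
    {
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::not_eof(c);
        char ch = traits_type::to_char_type(c);
        return ::write(fd_, &ch, 1) == 1 ? c : traits_type::eof();
    }

    // Bulk writes bypass the per-character path.
    std::streamsize xsputn(const char* s, std::streamsize n)
    {
        ssize_t written = ::write(fd_, s, static_cast<std::size_t>(n));
        return written < 0 ? 0 : written;
    }

private:
    int fd_;
};

int main()
{
    fd_streambuf buf(STDOUT_FILENO);
    std::ostream out(&buf);
    out << "formatted I/O over a raw descriptor: " << 42 << '\n';
}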
Currently, there's a strong separation between what I can do to a FileDescriptor (i.e. reads and writes) and what I can do to a stream. There is no reason why this has to be the case. It should be possible to add buffering to a FileDescriptor *and only add buffering*, and it should be possible to do formatted I/O on a non-buffered FileDescriptor.
Yes. It is possible now. It should be easier with my system.
** Character sets need support
This is a hugely complex area which native English speakers are uniquely unqualified to talk about.
Luckily, I'm not a native English speaker. I have some experience with the issues involved, although my experience is limited to German umlauts. I have experienced the pains of unexpected encoding use in web applications. This is why I really, really think all C++ types involving text handling really need to be tagged with the encoding used.
I think that a starting point would be for someone to write a Boost interface to iconv (I have an example that makes functors for iconv conversions), and to write a tagged-string class that knows its encoding (either a compile-time type tag or a run-time enumeration tag or both). Ideally we'd spend a couple of years getting used to using that, and then consider how it can best integrate with IO.
I don't want to wait that long ;) I have in fact considered this issue and have drawn the outline of such a character handling and conversion library. In fact, a subset of it is absolutely needed for the text layer of my I/O plans. Sebastian Redl
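Along those lines, the iconv functor Phil mentions can be quite small; a rough sketch, glossing over error handling, output-buffer growth (a fixed 4x bound instead of looping on E2BIG) and the platform-dependent constness of iconv's input pointer:

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// One conversion descriptor wrapped in a callable object.
class iconv_converter
{
public:
    iconv_converter(const char* to, const char* from)
        : cd_(iconv_open(to, from))
    {
        if (cd_ == (iconv_t)-1)
            throw std::runtime_error("unsupported conversion");
    }

    ~iconv_converter() { iconv_close(cd_); }

    std::string operator()(const std::string& input) const
    {
        std::vector<char> out(input.size() * 4 + 4);
        char* in_ptr = const_cast<char*>(input.data());
        char* out_ptr = &out[0];
        size_t in_left = input.size(), out_left = out.size();

        if (iconv(cd_, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
            throw std::runtime_error("conversion failed");

        return std::string(&out[0], out.size() - out_left);
    }

private:
    iconv_t cd_;
    iconv_converter(const iconv_converter&);            // owns the descriptor,
    iconv_converter& operator=(const iconv_converter&); // so non-copyable
};

// Usage: iconv_converter to_utf8("UTF-8", "ISO-8859-1");
//        std::string utf8 = to_utf8(latin1_text);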
participants (21)
- Andrey Semashev
- Ares Lagae
- Cédric Venet
- Greer, Joe
- Ion Gaztañaga
- Jeremy Maitin-Shepard
- Johan Råde
- John Hayes
- Marcus Lindblom
- Mathias Gaunard
- Peter Bindels
- Phil Endecott
- Rene Rivera
- Richard Day
- Sascha Seewald
- Scott Woods
- Scott Woods
- Sebastian Redl
- Steven Siloti
- Stjepan Rajko
- zao@acc.umu.se