
Jeremy Maitin-Shepard wrote:
Okay, that is a good point. "Data stream" would probably be best then. I am quite averse to "binary stream", since it would really be a misuse, albeit a common misuse, of "binary".
I'm using "unstructured stream" in the next iteration of the design document. Does that seem appropriate to you?
I see. You are suggesting, I suppose, that in addition to providing formatting of individual values, the binary formatting layer also provides stream filters for converting a stream of one type into a stream of another type with a particular formatting. I like this idea.
Yes, exactly. This can be very useful for reading data. However, it is not quite sufficient for run-time selection of a character encoding. For that, you need an interface that is not a stream of a single data type, but one that allows extraction of any data type at any time.
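(A minimal sketch of the distinction, with placeholder names rather than the proposed interface: a typed filter stream fixes its element type at compile time, while the interface needed for run-time encoding selection picks the element type per call.)

// Hypothetical illustration only; the names are placeholders, not the proposed API.
#include <cstddef>

// A filter produces a stream of one fixed element type from another stream;
// the element type is part of the stream's compile-time type.
template <typename T>
struct typed_input_stream {
    virtual std::size_t read(T *buffer, std::size_t count) = 0;
    virtual ~typed_input_stream() = default;
};

// For run-time selection of a character encoding, the element type must instead
// be chosen per call: any data type can be extracted at any time.
struct any_typed_extractor {
    template <typename T>
    std::size_t read(T *buffer, std::size_t count);  // instantiated for each T actually used
};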
This seems to suggest, then, that if you want to convert UTF-32 (native endian) in a file to, say, a UTF-16 (native endian) text stream, you first have to convert the file's uint8_t stream to a uint32_t stream (native endian formatting), then mark this uint32_t stream as UTF-32, and then use a text encoding conversion filter to convert this UTF-32 stream to a UTF-16 (native endian) uint16_t stream.
Again, not quite. I suppose I should first define external and internal encoding, as you suggest below, because it is fundamental to this issue.

I am of the opinion that an application gains nothing by processing string-like data whose encoding is not known at compile time. For any given string the application actually processes, the encoding of the data must be known at compile time; everything else is a mess. String types should be tagged with the encoding. Text streams should be tagged. Buffers for text should be tagged. This compile-time-known encoding is the internal encoding. (An application can have different internal encodings for different data, but not for the same piece of data.)

The external encodings are whatever external data arrives in: files, network streams, user input, and so on. When reading a file as text, the application must specify the external encoding of this file. (Or it can fall back to a default, but this is often unacceptable.) The external encoding must be specifiable at run time, obviously, because different files and different network connections can be in different encodings.

Suppose, then, that my application uses UTF-16 internally for everything. Endianness does not matter; it does not even exist, because UTF-16 uses uint16_t as the underlying type, and from C++'s point of view, endianness is irrelevant as long as the type is not broken into its component bytes. To read in a text file in UTF-32, native endian, I could do this (tentative interface), if I know at compile time that the file is UTF-32:

text_input_stream<utf_16> stream = open_file_input("filename.txt") // a file_input_stream
    .filter(buffer())                      // a buffered_input_stream<iobyte, file_input_stream>
    .filter(assembler<uint32_t, native>()) // an assembler_input_stream<uint32_t, native, buffered_input_stream<iobyte, file_input_stream>>
    .filter(text_decode<utf_16, utf_32>()) // a text_decode_stream<utf_32, assembler_input_stream<uint32_t, native, buffered_input_stream<iobyte, file_input_stream>>>
    ;

More likely, however, I would do this:

auto assembler = generic_assembler<native_rules>(open_file_input("filename.txt").filter(buffer()));
text_input_stream<utf_16> stream = text_decoder<utf_16>(assembler, "UTF-32");

assembler would be of type generic_assembler_t<native_rules, buffered_input_stream<iobyte, file_input_stream>> and would provide a single template member, read<T>(buffer<T> &target), that allows extracting any (primitive) type. The assembler follows the rules to provide the type. The text_decoder would then call this function with a type determined by the encoding specified. Yes, the read function would be instantiated for all types regardless of whether they are used, because the encoding is a run-time decision; but at some point you need to bridge the gap between compile-time and run-time decisions.

Yet another alternative would be a direct_text_decoder<utf_16> that reads from a uint8_t stream and expects either that the encoding name specifies the endianness (like "UTF-16BE") or that a byte order mark is present. Such a decoder would not be able to decode encodings where neither is the case.
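(A minimal sketch of the compile-time/run-time bridge described here, using placeholder names and a simple byte-buffer source rather than the actual stream types; the dispatch on the encoding name is purely illustrative.)

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical assembler: assembles primitive values from a byte source
// using native byte order.  Illustrative only, not the proposed interface.
struct byte_assembler {
    std::vector<std::uint8_t> bytes;
    std::size_t pos = 0;

    template <typename T>
    bool read(T &value) {  // instantiated for every T that is actually used
        if (pos + sizeof(T) > bytes.size()) return false;
        std::memcpy(&value, bytes.data() + pos, sizeof(T));  // native endian
        pos += sizeof(T);
        return true;
    }
};

// The run-time decision (the encoding name) selects which compile-time
// instantiation of read<T> does the work.
std::uint32_t read_code_unit(byte_assembler &a, const std::string &encoding) {
    if (encoding == "UTF-32")      { std::uint32_t v; if (a.read(v)) return v; }
    else if (encoding == "UTF-16") { std::uint16_t v; if (a.read(v)) return v; }
    else if (encoding == "UTF-8")  { std::uint8_t  v; if (a.read(v)) return v; }
    else throw std::runtime_error("unknown encoding: " + encoding);
    throw std::runtime_error("unexpected end of data");
}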
Yes, a text stream is essentially a binary stream with an encoding instead of a data type. So the interface is the same in description, but the types involved are different. I think this is mostly a documentation issue.
Instead of a data type? But presumably both the data type and the encoding must be specified. Also, it seems like it may be useful to be able to specify the encoding at run-time, rather than just compile-time.
Instead of a data type. The data type of a text_stream<Encoding> is base_type<Encoding>::type. This is for internal use - the external use is different anyway.
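(A minimal sketch of what this could look like; the tag and trait names are guesses based on the discussion, not the library's actual definitions.)

#include <cstdint>

// Hypothetical encoding tags and the base_type trait mentioned above.
struct utf_16 {};
struct utf_32 {};

template <typename Encoding> struct base_type;
template <> struct base_type<utf_16> { using type = std::uint16_t; };
template <> struct base_type<utf_32> { using type = std::uint32_t; };

// A text stream is then a stream of base_type<Encoding>::type code units,
// tagged with its encoding; the encoding is fixed at compile time.
template <typename Encoding>
struct text_stream {
    using char_type = typename base_type<Encoding>::type;
    // read/write operations in terms of char_type would go here
};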
Which encodings will be supported at compile-time, then? Just UTF-8, UTF-16, and UTF-32?
Whichever the library supplies. I think these three plus ASCII and Latin-1 would make a reasonable minimum requirement.

Sebastian Redl
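(For illustration, the minimum set named here could be expressed as compile-time encoding tags with plausible base types; the names and type choices below are guesses, not the actual library.)

#include <cstdint>

// Hypothetical tags for the minimum encoding set mentioned above.
struct utf_8   {};
struct utf_16  {};
struct utf_32  {};
struct ascii   {};
struct latin_1 {};

template <typename Encoding> struct base_type;
template <> struct base_type<utf_8>   { using type = std::uint8_t;  };
template <> struct base_type<utf_16>  { using type = std::uint16_t; };
template <> struct base_type<utf_32>  { using type = std::uint32_t; };
template <> struct base_type<ascii>   { using type = char;          };
template <> struct base_type<latin_1> { using type = char;          };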