
Sebastian Redl <sebastian.redl@getdesigned.at> writes: [snip]
Platforms using 9-bit bytes need binary I/O, too, and they might need to do it in their native 9-bit units. It would be a shame to deprive them of this possibility just because the text streams require octets, especially since we already have a layer in place whose purpose is to convert between low-level data representations.
It seems that the primary interface for the data formatting layer should be in terms of fixed-size types like {u,}int{8,16,32,64}_t. It is more the job of a serialization library to support platform-dependent types like short, int, long, etc., which would be of use primarily for producing serialization output that will only be used as input to the exact same program.

I suppose an alternative is for the read/write functions in the data formatting layer to always specify an explicit number of bits, for example write_{u,}int<32> or read_{u,}int<32>. read_int<N> always returns intN_t, and it is a compile-time error if that type does not exist. write_int<N> casts its argument to intN_t, and thus avoids the issue of multiple names for the same type, like int/long on most 32-bit platforms/compilers. This interface supports architectures with a 36-bit word (e.g. write_int<36>), but since everything is made explicit, it avoids any confusion that might otherwise result from such support.

Floating-point types are somewhat more difficult to handle, and I'm not sure what the best approach is. One possibility is to also specify the number of bits explicitly and to assume the IEEE 754 format will be used as the external format, for example write_float<32>, write_float<64>, or perhaps write_ieee754<32>. It should simply be a compile-time error if the compiler/platform doesn't provide a suitable type. [snip]
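For concreteness, here is a minimal sketch of what such an interface could look like; the names (exact_int, write_int, read_int) and the use of stdio are purely illustrative, not a proposed API:

    // Sketch only; exact_int/write_int/read_int are hypothetical names.
    #include <cstdint>
    #include <cstdio>

    // Map a bit count to the corresponding exact-width type; the primary
    // template has no `type`, so e.g. read_int<24> fails to compile.
    template<int Bits> struct exact_int;
    template<> struct exact_int<8>  { typedef std::int8_t  type; };
    template<> struct exact_int<16> { typedef std::int16_t type; };
    template<> struct exact_int<32> { typedef std::int32_t type; };
    template<> struct exact_int<64> { typedef std::int64_t type; };

    template<int Bits>
    void write_int(std::FILE* out, typename exact_int<Bits>::type value)
    {
        // The parameter is already the exact-width type, so int/long
        // ambiguity on 32-bit platforms never arises at the call site.
        std::fwrite(&value, sizeof value, 1, out);  // native byte order, for brevity
    }

    template<int Bits>
    typename exact_int<Bits>::type read_int(std::FILE* in)
    {
        typename exact_int<Bits>::type value = 0;
        std::fread(&value, sizeof value, 1, in);
        return value;
    }

    int main()
    {
        std::FILE* f = std::fopen("demo.bin", "wb+");
        if (!f) return 1;
        write_int<32>(f, 42);      // fine: int32_t exists
        // write_int<36>(f, 42);   // compile error unless exact_int<36> is specialized
        std::rewind(f);
        std::int32_t x = read_int<32>(f);
        std::fclose(f);
        return x == 42 ? 0 : 1;
    }

A platform with a 36-bit word would presumably add its own exact_int<36> specialization, and everything else would follow.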
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post.
Which other post is this?
My comments there probably weren't very important anyway. I think it is worth considering, though, that given the rarity of platforms without 8-bit bytes, it is probably not worth spending much time supporting them, and more importantly, it is not worth complicating the interface for 8-bit-byte platforms in order to support them.
I suppose if you can find a clean way to support these unusual architectures, then all the better.
It seems that it would be very hard to support e.g. UTF-8 on a platform with 9-bit bytes, or one that cannot handle types smaller than 32 bits.
I think the binary conversion can do it. The system would work approximately like this:

1) Every platform defines its basic I/O byte. This would be 8 bits for most computers (including those where char is 32 bits large), and 9 or some other number of bits for others. The I/O byte is the smallest unit that can be read from a stream.

2) Most platforms will additionally designate an octet type. Probably I will just use uint8_t for this. They will supply a Representation for the formatting layer that can convert a stream of I/O bytes to a stream of octets (e.g. by truncating each byte). If an octet stream is then needed (e.g. for creating a UTF-8 stream), this representation will be inserted.
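For concreteness, a minimal sketch of such a Representation; the io_byte/octet names and the member functions are invented here, not part of the design:

    // Illustrative sketch only, assuming a platform with 9-bit I/O bytes.
    #include <cstdint>
    #include <cstddef>

    typedef unsigned int io_byte;   // stands in for the platform's 9-bit unit
    typedef std::uint8_t octet;

    // Converts between the native I/O byte stream and an octet stream by
    // keeping only the low 8 bits of each unit (point 2 above).
    struct octet_representation
    {
        std::size_t to_octets(const io_byte* in, std::size_t n, octet* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = static_cast<octet>(in[i] & 0xFF);  // truncate
            return n;
        }

        std::size_t from_octets(const octet* in, std::size_t n, io_byte* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = in[i];                             // zero-pad the high bit(s)
            return n;
        }
    };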
This padding/truncating would need to be done as an explicit way of encoding an octet stream as a nonet stream, and should probably not be done implicitly, unless this sort of conversion is always assumed on those platforms.
3) Platforms that do not support octets at all (or simply do not have a primitive type to spare for unambiguous overloads; they could use another 9-bit type and just ignore the additional bit, since character streams, at least, do not perform arithmetic on their units, so overflow is not an issue) do not have support for this. They're out of luck. I think this case is rare enough to be ignored.
Okay.
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
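As a rough illustration of the idea (all names here are invented), a buffering layer can hand out opaque marks even though the underlying socket cannot seek, which is exactly what a recursive descent parser needs:

    // Sketch only; buffered_input and mark_type are hypothetical names.
    #include <cstddef>
    #include <vector>

    class buffered_input
    {
    public:
        class mark_type {                  // opaque to the caller
            friend class buffered_input;
            std::size_t pos;
            explicit mark_type(std::size_t p) : pos(p) {}
        };

        explicit buffered_input(const std::vector<char>& already_read)
            : buffer_(already_read), pos_(0) {}

        bool get(char& c)
        {
            if (pos_ >= buffer_.size()) return false;  // would refill from the socket here
            c = buffer_[pos_++];
            return true;
        }

        mark_type mark() const            { return mark_type(pos_); }
        void reset(const mark_type& m)    { pos_ = m.pos; }   // any number of live marks

    private:
        std::vector<char> buffer_;        // data already pulled from the socket
        std::size_t pos_;
    };

Because the mark is opaque, a seekable stream could instead return a file offset from mark() without the parser noticing the difference.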
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer.
Well, depends. Let's assume, for example, that the system will be implemented as C++09 templates with heavy use of concepts.
I think it may not be a good idea to target this new I/O library at a language that does not yet exist and, more importantly, is not yet supported by any compiler, except perhaps Douglas Gregor's experimental ConceptGCC, which, as the release notes state, is extremely slow (although they also claim that performance can be improved). I suppose it may work fine to write the library (using the preprocessor) so that it can be compiled under existing compilers without concept support, and to include a small amount of additional functionality, or more convenient syntax, when concept support is available. I would be very, very wary of anything that would increase compile times for users of the library, though. [snip]
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
Hmm ... makes sense. I'm not really happy, but it makes sense.
What do you mean you're not happy? I think all that would really be needed would be a macro to indicate the endianness. Of course any code that depends on this would likely depend even more on 8-bit bytes, but that is another issue.
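For illustration, a rough sketch of the sort of thing that would be needed; the macro and helper names are invented here, and 8-bit bytes are assumed:

    // Sketch only; MYLIB_LITTLE_ENDIAN and the helpers are hypothetical.
    #include <cstdint>
    #include <cstring>

    // Something a config header or the build system would define once per platform.
    #define MYLIB_LITTLE_ENDIAN 1

    // The message header carries a flag naming the payload's byte order;
    // a reader has to handle both orders.
    std::uint32_t read_uint32(const unsigned char* p, bool payload_is_little_endian)
    {
        if (payload_is_little_endian)
            return  std::uint32_t(p[0])
                 | (std::uint32_t(p[1]) << 8)
                 | (std::uint32_t(p[2]) << 16)
                 | (std::uint32_t(p[3]) << 24);
        return  (std::uint32_t(p[0]) << 24)
             |  (std::uint32_t(p[1]) << 16)
             |  (std::uint32_t(p[2]) << 8)
             |   std::uint32_t(p[3]);
    }

    // A writer just records its own order in the header and emits native bytes.
    void write_uint32_native(unsigned char* p, std::uint32_t v, bool& header_flag_little)
    {
        header_flag_little = (MYLIB_LITTLE_ENDIAN != 0);
        std::memcpy(p, &v, sizeof v);
    }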
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Yes, but that's the ideal case. In practice, this means that the application would have to do its own buffering even if it really wants the data unit by unit.
Possibly this issue can be mitigated by exposing only a buffer around a text stream in the types, although I agree that there is no perfect solution.
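As a rough illustration of what the application-side buffering amounts to (the stream interface and names here are hypothetical):

    // Sketch only; one virtual call per refill instead of one per character.
    #include <cstddef>
    #include <vector>

    struct stream
    {
        virtual std::size_t read(char* out, std::size_t n) = 0;  // bulk transfer
        virtual ~stream() {}
    };

    // An application that wants the data character by character keeps its own
    // buffer in front of the stream to amortize the virtual call cost.
    class chunked_reader
    {
    public:
        explicit chunked_reader(stream& s) : s_(s), buf_(4096), pos_(0), end_(0) {}

        bool get(char& c)
        {
            if (pos_ == end_) {                        // refill with one virtual call
                end_ = s_.read(&buf_[0], buf_.size());
                pos_ = 0;
                if (end_ == 0) return false;           // end of stream
            }
            c = buf_[pos_++];
            return true;
        }

    private:
        stream& s_;
        std::vector<char> buf_;
        std::size_t pos_, end_;
    };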
The programmer will not want to construct the complicated full type for this.
newline_filter<encoding_device<utf_8, native_converter<gzip_filter<buffering_filter<file_device> > > > >& chain =
    open_file(filename, read).attach(buffer()).attach(gunzip()).attach(decode(utf_8())).attach(newlines());
The programmer will want to simply write
text_stream<utf_8> chain =
    open_file(filename, read).attach(buffer()).attach(gunzip()).attach(decode(utf_8())).attach(newlines());
I notice that these code examples suggest that all streams will be reference counted (and cheaply copied). Is that the intention? A potential drawback to that approach is that a buffer filter would be forced to allocate its buffer on the heap, when it otherwise might be able to use the stack. [snip]
It is convenient to have a unified concept of a character, independent of its encoding. The Unicode charset provides such a concept. Unicode is also convenient in that it adds classification rules and similar stuff. This decision is not really visible to user code anyway, only to encoding converters: it should be sufficient to provide a conversion from and to Unicode code points to enable a new encoding to be used in the framework.
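As a sketch of what that minimum requirement might look like (the converter interface here is invented for illustration), adding a new encoding would mean supplying a pair of conversions to and from code points:

    // Sketch only; the decode/encode signatures are hypothetical.
    #include <cstddef>
    #include <cstdint>

    typedef std::uint32_t code_point;   // a Unicode code point, encoding-independent

    struct latin1_converter
    {
        // Latin-1 maps 1:1 onto the first 256 code points, so both directions
        // are trivial; a UTF-8 converter would do the variable-length work here.
        std::size_t decode(const unsigned char* in, std::size_t n, code_point* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = in[i];
            return n;
        }

        std::size_t encode(const code_point* in, std::size_t n, unsigned char* out) const
        {
            std::size_t written = 0;
            for (std::size_t i = 0; i < n; ++i)
                if (in[i] < 0x100)          // drop anything Latin-1 cannot express
                    out[written++] = static_cast<unsigned char>(in[i]);
            return written;
        }
    };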
I am basically content using only Unicode for text handling in my own programs, but I think it would be useful to see what others who care about efficiency for certain operations (and work with languages that are not represented very efficiently in UTF-8) think about this.

--
Jeremy Maitin-Shepard