[boost] [SoC][string_cvt] The proposal for string conversions library.

1 May 2006

      Hello, boost

This is a project idea for Google SoC 2006, in which I want to participate.
The library is called 'string_cvt' (short for "string conversions"). It
solves the problem of converting a type to a string and a string to a
type with minimal runtime and syntactic overhead.

This is a simple "call for interest" mail.

The idea for this library was inspired by a recent discussion on the
Boost developers mailing list. The question under discussion was:
Is the lexical_cast<> tool good enough for TR2 or not?

Proponents of lexical_cast<> have a point: the main advantage of
lexical_cast<> is its simplicity and symmetry of use (angle brackets
are used in both cases):
int i = lexical_cast<int>("1");
string s = lexical_cast<string>(1);

Additionally, it looks like the built-in casts, which is considered a
very cool thing.

On the other side, opponents of lexical_cast<> want more functionality
than fits into simple cast-like usage, such as:

The requirements table.
1) controlling conversions via facets (locales)
2) the full power of iostreams in a simple interface.
    All functionality accessible with iostreams
    (through manipulators) should be accessible.
3) functor adapters for use with std algorithms
4) error handling and reporting (what kind of error occurred?)
    * optionally report failure without throwing exceptions
5) best performance, especially for built-in types and for use in loops

The "Lexical Conversion Library Proposal for TR2" by Kevlin Henney and 
Beman Dawes states that:
"The lexical_cast function template offers a convenient and consistent 
form for supporting common conversions to and from arbitrary types when 
they are represented as text. The simplification it offers is in 
expression-level convenience for such conversions. For more involved 
conversions, such as where precision or formatting need tighter control 
than is offered by the default behavior of lexical_cast, the 
conventional stringstream approach is recommended."

It is clear that lexical_cast is not intended to address points (1-4) in
the list above, or even (5): to optimize conversions in loops you need
to resort to stringstreams again.
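
For reference, the stringstream fallback for loops looks roughly like the
sketch below in standard C++. The helper name hex_list is hypothetical and
is not part of any proposal; the point is only that the stream is
constructed once and reused on every iteration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: formats 0..count-1 in hex, separated by spaces,
// reusing a single ostringstream instead of building one per value.
std::string hex_list(int count) {
    std::ostringstream out;                        // constructed once
    out.setf(std::ios::hex, std::ios::basefield);  // format set up once
    std::string result;
    for (int i = 0; i < count; ++i) {
        out.str("");   // drop the text written on the previous iteration
        out.clear();   // reset any error flags
        out << i;
        result += out.str();
        result += ' ';
    }
    return result;
}

void usage_example() {
    assert(hex_list(3) == "0 1 2 ");
}
```

This works, but the reset-and-reuse bookkeeping is exactly the kind of
detail a dedicated conversion interface could hide.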

I believe that stringstreams are not the right tool for everyday string
conversion jobs. We need a dedicated, fully featured solution that
addresses all the issues in the requirements table above. My dream is
that one never needs to fall back to C-style solutions or to
stringstreams: just one consistent interface for all string conversion
needs. This proposal for a Google SoC project is an attempt to develop
such a solution. The final, ambitious goal of this project is to make
boost::lexical_cast<> obsolete and to replace it in TR2 with a new
proposal. Regardless of SoC, I'm going to develop such a library for
Boost, but participation in the Google SoC is important because
otherwise it would be hard to find enough time to finish this library
before the TR2 deadline in October.

As a result of this project we would have not only a fully documented
and tested library for string conversions, but also a full comparative
performance analysis, to ensure that there is no longer any need to
fall back to some other solution.

Here are short examples of the intended usage of this library (for those
who are too busy to read the full text of the proposal):

// simple initialization usage:
string s = string_from(1);
int i = from_string("1");

// embedded in expression usage:
double d = 2 + (double)from_string("1");

// usage with special locale:
string s = string_from(1, std::locale("loc_name"));

// usage with special format:
string s = string_from(1, std::ios::hex);

// usage with special format and locale:
string s = string_from(1, std::ios::hex, std::locale("loc_name"));

// usage with default value provided (exceptions are not thrown):
int i = from_string("1", 1);

// usage with cvtstate& argument (exceptions are not thrown; if the
// conversion fails, the reason is written to the supplied cvtstate):
cvtstate state;
int i = from_string("1", state);

Format flags and locale information can be supplied to the from_string
function too.

To optimize conversions in a loop one can do:
string_cvt cvt(std::ios::hex, std::locale("loc_name"));
string s;
for (int i = 0; i < 100; ++i) {
     string t;
     cvt(i, t);
     s += (t + " ");
}

To convert one sequence to another one can do:
vector<double> vec_doubles(10, 1.2);
vector<string> vec_strings;

string_ocvt_fun<string> ocvtf(cvt); // cvt is defined in a previous example
transform(
     vec_doubles.begin(), vec_doubles.end(), // from
     back_inserter(vec_strings), // to
     ocvtf
);

// and in a reverse direction:
string_icvt_fun<double> icvtf(cvt);
vector<double> vec_doubles1(10);
transform(
     vec_strings.begin(), vec_strings.end(), // from
     vec_doubles1.begin(), // to
     icvtf
);

Details of this proposal are below:

The proposal, part 1. from_string/(w)string_from functions.

From a syntactic point of view, an alternative to the lexical_cast<>
approach was proposed: the to_string/string_to<> pair of functions.
The "Lexical Conversion Library Proposal for TR2" has a good argument
against it:
"... Furthermore, the from/to idea cannot be expressed in a simple and 
consistent form. The illusion is that they are easier than lexical_cast 
because of the name. This is theory. The practice is that the two forms, 
although similarly and symmetrically named, are not at all similar in 
use: one requires explicit provision of a template parameter and the 
other not. This is a simple usability pitfall that is guaranteed to 
catch experienced and inexperienced users alike -- the only difference 
being that the experienced user will know what to do with the error 
message."

There is one more problem with this approach:
the to_string() function comes from other languages like Java, where it
is a member function of all types, so one can write:
String s = object.toString();
It can be read as: "Get a string from the object", or "Convert the
object to a string".
Both phrases are straightforward and reflect the way that we think of it:
1) I want a string (String s = )
2) I have an object (object)
3) I'm performing a conversion of this object to string (.toString())

But in C++ the to_string function would be a free function,
resulting in code like:
string s = to_string(1);
It can be read as: "Get a string by converting the object '1' to a string".
The problem here is that the mental sequence is the same as in the
example above, but the language constructs don't reflect it:
1) I want a string (string s = )
2) I have an object (1)
3) I'm performing a conversion of this object to string (to_string(1))

Note that items (2) and (3) are intermixed. This means that the
programmer needs to do some additional mental work to jump from item (1)
to item (3) and then back to item (2) again. The final mental workflow
would be as follows:
1) I want a string (string s = )
2) I have an object (1, but I don't code it yet; I hold it in memory for
a while)
3) I'm performing a conversion of this object to string (to_string)
4) Yes! I can release my memory and finally code the object. ( (1); )

For a component as widely used as string conversions, this additional
complexity is inappropriate.

Note: exactly the same criticism applies to lexical_cast<> too, which
has the additional complexity of an explicitly specified template
parameter.

For string-to-type conversions things are even worse.

In Java it would be:
try {
     int i = Integer.parseInt(s);
     // use i
} catch (NumberFormatException e) { /* perform some error handling or
ignore it, which is the usual practice */ }

with lexical_cast<> it would be:

int i = lexical_cast<int>(s);
// use i
// exception handling is usually done at a higher level

with string_to<> it would be:
int i = string_to<int>(s);

Only the name has changed here.

The resulting mental sequence for all 3 variants above is far from optimal.
For lexical_cast<> it would be as follows:
1) I want an int (int i = )
2) I have a string (s, but I don't code it yet; I hold it in memory for
a while)
3) I'm performing a conversion of this string to an int (
lexical_cast<int> )
4) Yes! I can release my memory and finally code the string. ( (s); )

The mental complexity is the same here.

The shortest possible mental sequence is as follows:
1) I want an int (int i = )
2) I have a string (s)
3) I'm performing a conversion of this string to an int ( toInt(); )

int i = s.toInt();
This approach scales badly, of course, but it is optimal in the mental sense.

Furthermore, one could argue that the best way would be:
int i = s;

"Construct an int from a string" - as simple as it could be.

Surprisingly, it can be implemented (in terms of a templated type
conversion operator):

class string
{
public:
     template<typename T>
     operator T();
};

But this solution has major drawbacks:
1) it cannot be made symmetrical with type-to-string conversion
2) such conversions are hard to see in code
3) it requires changes to the standard string library

All three can be resolved with a free-function adapter like string_to,
but with a more appropriate name:

int i = from_string(s);

Its counterparts would be:
string s = string_from(1);
wstring s = wstring_from(1);

Note:
1) usage is symmetrical
2) no explicit template parameters
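
One way such a pair could be built is sketched below. This is only an
illustration on top of stringstreams, not the proposed implementation:
the class name from_string_proxy and the choice of std::runtime_error are
assumptions made for the example. from_string returns a small proxy
object whose templated conversion operator lets the compiler deduce the
target type from the context.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical proxy returned by from_string(); the target type T is
// deduced from the type the proxy is being converted to.
class from_string_proxy {
public:
    explicit from_string_proxy(const std::string& s) : src_(s) {}

    template <typename T>
    operator T() const {
        std::istringstream in(src_);
        T value = T();
        // Fail if nothing could be parsed or if anything other than
        // whitespace is left over after the value.
        if (!(in >> value) || !(in >> std::ws).eof())
            throw std::runtime_error("from_string: bad conversion");
        return value;
    }

private:
    std::string src_;
};

inline from_string_proxy from_string(const std::string& s) {
    return from_string_proxy(s);
}

// Type-to-string direction: the argument type is deduced as usual.
template <typename T>
std::string string_from(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

void usage_example() {
    int i = from_string("42");                                 // T = int
    double d = 2.0 + static_cast<double>(from_string("1.5"));  // T = double
    assert(i == 42);
    assert(d == 3.5);
    assert(string_from(7) == "7");
}
```

In this sketch the string-to-string case and wide strings are left out on
purpose; a real library would have to handle both.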

The from_string function has one minor drawback:
it cannot be used in expressions without an explicit cast to the
desired type:

double d = 2.0 + from_string(s); // doesn't work
double d = 2.0 + (double)from_string(s); // works

But this can be seen as an advantage, because:
1) the intention is clear and enforced by the compiler (operator+
ambiguity, or a run-time exception if 2.0 becomes 2 and s looks like "1.1")
2) mentally, the expression "(double)from_string(s)" is close to
optimal; it can be thought of as:
"Get double from string". It is hard to imagine a thinking path that is
shorter and reflects the intention in a more straightforward way.

To conclude: the pair of [w]string_from/from_string functions is
proposed to compete with the lexical_cast<> function template for the
simple needs of converting some type to string or string to some type.

Additionally, these functions are not restricted to pure cast-like
syntax, and could accept parameters like locale, std::ios::fmtflags and
boost::cvtstate (it is a part of this proposal) to address issues (1),
(2), and (4) respectively. (see the Requirements table above)
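
The extra parameters could look roughly like the overloads below. This is
a sketch on top of stringstreams under my own assumptions about the
signatures (in particular, the two-argument from_string here returns the
target type directly, since it can be deduced from the default value).

```cpp
#include <cassert>
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical overload: type-to-string with format flags and a locale.
template <typename T>
std::string string_from(const T& value,
                        std::ios_base::fmtflags flags,
                        const std::locale& loc = std::locale::classic()) {
    std::ostringstream out;
    out.imbue(loc);     // requirement (1): locale control
    out.flags(flags);   // requirement (2): iostreams formatting
    out << value;
    return out.str();
}

// Hypothetical non-throwing overload: the fallback is returned when the
// text cannot be parsed completely as a T.
template <typename T>
T from_string(const std::string& s, const T& fallback) {
    std::istringstream in(s);
    T value = T();
    if (!(in >> value) || !(in >> std::ws).eof())
        return fallback;
    return value;
}

void usage_example() {
    assert(string_from(255, std::ios::hex) == "ff");
    assert(string_from(255, std::ios::hex | std::ios::showbase) == "0xff");
    assert(from_string("12", 0) == 12);
    assert(from_string("oops", -1) == -1);
}
```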

The proposal, part 2. converter objects and functor adapters.

This part is intended to address issues (3) and (5).

It can be achieved by providing templated "converter objects"
along with typedefs for char and wchar_t:

basic_string_icvt<char_type, traits_type, allocator_type>: string_icvt, 
wstring_icvt
basic_string_ocvt<char_type, traits_type, allocator_type>: string_ocvt, 
wstring_ocvt
basic_string_cvt<char_type, traits_type, allocator_type>: string_cvt, 
wstring_cvt

usage can be:

string_cvt scvt(ios_base::hex, locale(""));
string s;
scvt(12, s);

int i;
scvt(s, i);
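
A minimal sketch of such a converter object is shown below, assuming it
simply owns one stringstream that is configured once and reused for every
call. The class name string_cvt_sketch and the bool return value are
assumptions made for this example only.

```cpp
#include <cassert>
#include <ios>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical converter object: format flags and locale are set once,
// then the same stream is reused for every conversion.
class string_cvt_sketch {
public:
    explicit string_cvt_sketch(
        std::ios_base::fmtflags flags =
            std::ios_base::dec | std::ios_base::skipws,
        const std::locale& loc = std::locale::classic()) {
        stream_.imbue(loc);
        stream_.flags(flags);
    }

    // type -> string; returns false if formatting failed
    template <typename T>
    bool operator()(const T& value, std::string& out) {
        reset("");
        stream_ << value;
        if (stream_.fail()) return false;
        out = stream_.str();
        return true;
    }

    // string -> type; returns false (and leaves out untouched) if the
    // text is not completely consumed by the conversion
    template <typename T>
    bool operator()(const std::string& text, T& out) {
        reset(text);
        T value = T();
        if (!(stream_ >> value) || !(stream_ >> std::ws).eof())
            return false;
        out = value;
        return true;
    }

private:
    void reset(const std::string& contents) {
        stream_.clear();        // forget errors from the previous call
        stream_.str(contents);  // replace the buffer contents
    }

    std::stringstream stream_;
};

void usage_example() {
    string_cvt_sketch cvt(std::ios_base::hex);
    std::string s;
    bool ok = cvt(12, s);   // 12 -> "c"
    assert(ok && s == "c");
    int i = 0;
    ok = cvt(s, i);         // "c" -> 12
    assert(ok && i == 12);
}
```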

and functor adapters:
basic_string_ocvt_fun<TCont>
typedef basic_string_ocvt_fun<std::string> string_ocvt_fun;
typedef basic_string_ocvt_fun<std::wstring> wstring_ocvt_fun;

basic_string_icvt_fun<Target, TChar, Traits, TAlloc>;

// template typedef
template <
     typename Target,
     typename Traits = std::char_traits<char>,
     typename TAlloc = std::allocator<char>
...
class string_icvt_fun :
     public basic_string_icvt_fun<Target, char, Traits, TAlloc>

// template typedef
template <
     typename Target,
     typename Traits = std::char_traits<wchar_t>,
     typename TAlloc = std::allocator<wchar_t>
...
class wstring_icvt_fun:
     public basic_string_icvt_fun<Target, wchar_t, Traits, TAlloc>

These classes can be used as follows:

vector<double> vec_doubles(10, 1.2);
vector<string> vec_strings;

string_ocvt_fun<string> ocvtf(scvt);
transform(
     vec_doubles.begin(), vec_doubles.end(), // from
     back_inserter(vec_strings), // to
     ocvtf
);

string_icvt_fun<double> icvtf(scvt);
vector<double> vec_doubles1(10);
transform(
     vec_strings.begin(), vec_strings.end(), // from
     vec_doubles1.begin(), // to
     icvtf
);

int sz = vec_doubles.size();
for (int i = 0; i < sz; ++i) {
     assert(vec_doubles[i] == vec_doubles1[i]);
}
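
For readers who want to try the idea today, the same round trip can be
sketched with two small hand-written functors over stringstreams. The
names to_string_fun, from_string_fun and round_trip are hypothetical and
only stand in for the proposed adapters.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical output adapter: any streamable value -> std::string.
struct to_string_fun {
    template <typename T>
    std::string operator()(const T& value) const {
        std::ostringstream out;
        out << value;
        return out.str();
    }
};

// Hypothetical input adapter: std::string -> Target.
template <typename Target>
struct from_string_fun {
    Target operator()(const std::string& text) const {
        std::istringstream in(text);
        Target value = Target();
        in >> value;
        return value;
    }
};

// Converts doubles to strings and back again with std::transform.
std::vector<double> round_trip(const std::vector<double>& input) {
    std::vector<std::string> strings;
    std::transform(input.begin(), input.end(),
                   std::back_inserter(strings), to_string_fun());

    std::vector<double> output(strings.size());
    std::transform(strings.begin(), strings.end(),
                   output.begin(), from_string_fun<double>());
    return output;
}

void usage_example() {
    std::vector<double> doubles(10, 1.2);
    assert(round_trip(doubles) == doubles);
}
```

Note that the equality check only holds for values that survive the
default stream precision of 6 significant digits; a value with more
digits would need a higher precision set on the converter.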

And, finally, the full power of iostreams is available with these classes:
std::ios_base::fmtflags can be passed to the constructors of all
converter classes to specify special formatting. Additionally, the
whole family of fmtflags-related functions from std::ios_base and
std::basic_ios<> is provided, as are the width() and fill() functions.
(If I forgot to mention some function, it was not intentional; all
meaningful functions from the iostreams base classes would be included.)

To satisfy requirement (1), a std::locale object can be passed to the
constructor or to the imbue() function. A getloc() function is provided
too.

For requirement (4) a cvtstate type is provided. It is very close to
the std::ios_base::iostate type, but cvtstate is not a typedef for int,
which allows functions to be overloaded on it. A 'cvtstate except'
parameter can be passed to the constructors of the converter classes to
specify the cases in which exceptions should be thrown. By default no
exceptions are thrown. The state of a conversion (successful or not)
can be inspected with the rdstate() function and the good/bad/fail
functions. Additionally, the exception handling behavior can be queried
or changed with the exceptions() functions, exactly as in the
std::basic_ios class.
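
To illustrate why a distinct type matters, here is a small sketch. The
cvtstate class below is my own stand-in with just enough members for the
example, and int_from_string is a hypothetical name; the point is that an
overload taking a default int value and an overload taking cvtstate& can
coexist only because cvtstate is not an int.

```cpp
#include <cassert>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical stand-in for cvtstate: wraps iostate bits in a class so
// that it is a distinct type for overload resolution.
class cvtstate {
public:
    cvtstate() : bits_(std::ios_base::goodbit) {}
    explicit cvtstate(std::ios_base::iostate bits) : bits_(bits) {}

    bool good() const { return bits_ == std::ios_base::goodbit; }
    bool fail() const { return !good(); }
    std::ios_base::iostate rdstate() const { return bits_; }

private:
    std::ios_base::iostate bits_;
};

// Overload 1: never throws, reports the outcome through state.
int int_from_string(const std::string& s, cvtstate& state) {
    std::istringstream in(s);
    int value = 0;
    bool parsed = !(in >> value).fail();
    // Accept the value only if nothing but whitespace follows it.
    bool consumed = parsed && (in.eof() || (in >> std::ws).eof());
    state = cvtstate(consumed ? std::ios_base::goodbit
                              : std::ios_base::failbit);
    return consumed ? value : 0;
}

// Overload 2: never throws, returns fallback on failure. If cvtstate
// were a typedef for int, a call with an int variable could not be
// resolved between the two overloads.
int int_from_string(const std::string& s, int fallback) {
    cvtstate state;
    int value = int_from_string(s, state);
    return state.good() ? value : fallback;
}

void usage_example() {
    cvtstate state;
    int v = int_from_string("42", state);
    assert(v == 42 && state.good());

    v = int_from_string("4x2", state);
    assert(state.fail());

    assert(int_from_string("oops", 7) == 7);
}
```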

Performance for built-in types (requirement 5) would be achieved
through specializations of the proposed components. These
specializations would use the technique proposed in document n1803,
"Simple Numeric Access": the strtoXXX() C library functions to convert
strings to numbers and the sprintf() function to convert numbers to
strings.
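
As a rough sketch of what such a fast path could look like for long, the
code below uses std::strtol for parsing and std::snprintf (the bounded
variant of sprintf, used here instead of sprintf itself) for formatting.
The function names are hypothetical, and a real specialization would
live inside the library components.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// string -> long via strtol. Returns false on empty input, trailing
// characters or overflow; out is only written on success.
bool long_from_string(const std::string& text, long& out) {
    const char* begin = text.c_str();
    char* end = 0;
    errno = 0;
    long value = std::strtol(begin, &end, 10);
    if (end == begin) return false;      // no digits were consumed
    if (*end != '\0') return false;      // trailing characters
    if (errno == ERANGE) return false;   // value does not fit in long
    out = value;
    return true;
}

// long -> string via snprintf into a stack buffer (no stream involved).
std::string string_from_long(long value) {
    char buffer[32];  // enough for a 64-bit long including the sign
    int written = std::snprintf(buffer, sizeof buffer, "%ld", value);
    if (written <= 0) return std::string();
    return std::string(buffer, static_cast<std::size_t>(written));
}

void usage_example() {
    long v = 0;
    bool ok = long_from_string("-123", v);
    assert(ok && v == -123);
    assert(string_from_long(-45) == "-45");
}
```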

Support for non-standard strings can be added by specializing
cvt_traits<TCont> for them.

So far I have a minimal working implementation of the basic concepts proposed.

Possible mentors for this project could be the authors of the "Lexical
Conversion Library Proposal for TR2", Kevlin Henney and/or
Beman Dawes.

Best,
PhD student, Oleg Abrosimov.