[inspect] Hall of Shame plus non-ASCII characters

There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with a preferred approach for those with non-ASCII characters in their names.
--Beman

Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
Question:
How is this tested for c++ files?:
*A* invalid bookmarks, invalid urls, broken links, unlinked files
Fusion got lots of *A*s, but I don't understand why. Take:
libs/fusion/test/sequence/tuple_comparison.cpp
for example. How did it become an unlinked file? What am I missing?
Regards,
-- Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

----- Original Message -----
From: "Joel de Guzman" <joel@boost-consulting.com>
To: <boost@lists.boost.org>
Sent: Thursday, June 26, 2008 5:07 AM
Subject: Re: [boost] [inspect] Hall of Shame plus non-ASCII characters
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
Question:
How is this tested for c++ files?:
*A* invalid bookmarks, invalid urls, broken links, unlinked files
Fusion got lots of *A*s, but I don't understand why. Take:
libs/fusion/test/sequence/tuple_comparison.cpp
for example. How did it become an unlinked file? What am I missing?
Hi,
There is a problem with the *A* tests: the *A* tag is used twice in the legend (marked with ==== below).
*L* missing Boost license info, or wrong reference text
*C* missing copyright notice
*R* invalid (cr only) line-ending
====*A* invalid bookmarks, invalid urls, broken links, unlinked files
*N* file/directory names issues
*T* tabs in file
====*A* non-ASCII chars in file
*M* uses of min or max that have not been protected from the min/max macros, or unallowed #undef-s
*U* unnamed namespace in header
Best,
Vicente

Joel de Guzman wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
Question:
How is this tested for c++ files?:
*A* invalid bookmarks, invalid urls, broken links, unlinked files
Fusion got lots of *A*s, but I don't understand why. Take:
libs/fusion/test/sequence/tuple_comparison.cpp
I think the *A* is caused by the ä in Jaakko's surname. It could be the same reason for lots of *A* in other libraries.

gchen wrote:
Joel de Guzman wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
Question:
How is this tested for c++ files?:
*A* invalid bookmarks, invalid urls, broken links, unlinked files
Fusion got lots of *A*s, but I don't understand why. Take:
libs/fusion/test/sequence/tuple_comparison.cpp
I think the *A* is caused by the ä in Jaakko's surname. It could be the same reason for lots of *A* in other libraries.
Ouch. How do you spell Jaakko Järvi in plain ASCII?
Regards,
-- Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net

Sebastian Redl wrote:
Joel de Guzman wrote:
Ouch. How do you spell Jaakko Järvi in plain ASCII?
The German transliteration of the ä is ae. Not sure if that applies to Scandinavian names, but I think so.
Jaakko Järvi is Finnish, and it seems that ä --> ae is not acceptable in that language, the custom being ä --> a instead:
http://tinyurl.com/6m2gkq
So, I think the transliteration should be Jaakko Jarvi, but surely some Finnish colleague can shed some light here.
Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

Joel de Guzman wrote:
gchen wrote:
Joel de Guzman wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
Question:
How is this tested for c++ files?:
*A* invalid bookmarks, invalid urls, broken links, unlinked files
Fusion got lots of *A*s, but I don't understand why. Take:
libs/fusion/test/sequence/tuple_comparison.cpp
I think the *A* is caused by the ä in Jaakko's surname. It could be the same reason for lots of *A* in other libraries.
Ouch. How do you spell Jaakko Järvi in plain ASCII?
I know some languages can use 'ae' to stand for ä, but this may not be valid for all languages, so this non-ASCII character issue looks like a bit of a puzzle to me.

On Thu, Jun 26, 2008 at 04:44:11PM +0800, gchen wrote:
Joel de Guzman wrote:
Ouch. How do you spell Jaakko Järvi in plain ASCII?
I know some languages can use 'ae' to stand for ä, but this may not be valid for all languages, so this non-ASCII character issue looks like a bit of a puzzle to me.
Let's ask iconv for a transliteration:
echo "Jaakko Järvi" | iconv -t ascii//TRANSLIT
Jaakko Jaervi
It seems to prefer "ae" even in fi_FI locale ...
Jens

2008/6/26 Jens Seidel <jensseidel@users.sf.net>:
On Thu, Jun 26, 2008 at 04:44:11PM +0800, gchen wrote:
Joel de Guzman wrote:
Ouch. How do you spell Jaakko Järvi in plain ASCII?
I know some languages can use 'ae' to stand for ä, but this may not be valid for all languages, so this non-ASCII character issue looks like a bit of a puzzle to me.
Let's ask iconv for a transliteration:
echo "Jaakko Järvi" | iconv -t ascii//TRANSLIT
Jaakko Jaervi
He uses Jarvi:
http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?People/Jaakko...
Daniel

Daniel James wrote:
2008/6/26 Jens Seidel <jensseidel@users.sf.net>:
On Thu, Jun 26, 2008 at 04:44:11PM +0800, gchen wrote:
Joel de Guzman wrote:
Ouch. How do you spell Jaakko Järvi in plain ASCII?
I know some languages can use 'ae' to stand for ä, but this may not be valid for all languages, so this non-ASCII character issue looks like a bit of a puzzle to me.
Let's ask iconv for a transliteration:
echo "Jaakko Järvi" | iconv -t ascii//TRANSLIT
Jaakko Jaervi
He uses Jarvi:
http://www.crystalclearsoftware.com/cgi-bin/boost_wiki/wiki.pl?People/Jaakko...
Developers get to choose how they spell their own names :-)
I've pinged Jaakko to be sure it is OK if we go ahead and make the change.
--Beman
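One way to honor author-chosen spellings is a small override table that is consulted before any automatic rule. The following is only a sketch under stated assumptions: `PREFERRED_ASCII` and `ascii_name` are hypothetical names, not part of the inspect tool, and the fallback (dropping combining marks, i.e. ä -> a) is an assumption based on the custom mentioned earlier in the thread.

```python
import unicodedata

# Hypothetical table of ASCII spellings chosen by the authors themselves.
PREFERRED_ASCII = {
    "Jaakko J\u00e4rvi": "Jaakko Jarvi",
}

def ascii_name(name: str) -> str:
    """Return an all-ASCII spelling, preferring the author's own choice."""
    key = unicodedata.normalize("NFC", name)
    chosen = PREFERRED_ASCII.get(key)
    if chosen is None:
        # Assumed fallback: decompose and drop combining marks (ä -> a).
        decomposed = unicodedata.normalize("NFD", key)
        chosen = "".join(c for c in decomposed if not unicodedata.combining(c))
    if not chosen.isascii():
        # Letters such as ø have no decomposition, so they need a table entry.
        raise ValueError(f"no ASCII spelling known for {name!r}")
    return chosen
```

The explicit table wins over the rule, so an author who prefers "ae" could simply be listed that way.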

Jaakko Jarvi is fine. (My passport says Jaervi, but I just tend to drop the dots from ä.)
Jens Seidel <jensseidel@users.sf.net> writes:
On Thu, Jun 26, 2008 at 04:44:11PM +0800, gchen wrote:
Joel de Guzman wrote:
Ouch. How do you spell Jaakko Järvi in plain ASCII?
I know some languages can use 'ae' to stand for ä, but this may not be valid for all languages, so this non-ASCII character issue looks like a bit of a puzzle to me.
Let's ask iconv for a transliteration:
echo "Jaakko Järvi" | iconv -t ascii//TRANSLIT
Jaakko Jaervi
It seems to prefer "ae" even in fi_FI locale ...
Jens
_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with preferred approach for those with non-ASCII characters in their names.
I think the options are:
1. The inspect tool is modified so as to bypass author names (possibly taken from an author names file). In a sense, this defeats the whole purpose of the non-ASCII check, I guess.
2. Suppress all diacritical marks:
Joaquín M López Muñoz --> Joaquin M Lopez Munoz
3. Encode with HTML entities:
Joaquín M López Muñoz --> Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
Whatever approach is agreed upon I'll happily apply asap to Boost.MultiIndex.
Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo
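Options 2 and 3 can be sketched in a few lines. This is an illustration only, not what the inspect tool does: the function names are made up here, option 2 is approximated by Unicode decomposition followed by dropping combining marks, and option 3 is shown with numeric character references rather than named HTML entities.

```python
import unicodedata

def strip_diacritics(name: str) -> str:
    """Option 2: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def to_char_references(name: str) -> str:
    """Option 3 (variant): replace non-ASCII characters with &#NNN; references."""
    return name.encode("ascii", "xmlcharrefreplace").decode("ascii")

name = "Joaqu\u00edn M L\u00f3pez Mu\u00f1oz"
print(strip_diacritics(name))
print(to_char_references(name))
```

Note that decomposition only helps for letters that are a base letter plus a mark; characters without a decomposition pass through unchanged and would still need manual handling.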

joaquin@tid.es wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with preferred approach for those with non-ASCII characters in their names.
I think the options are:
1. The inspect tool is modified so as to bypass author names (possibly taken from an author names file). In a sense, this defeats the whole purpose of the non-ASCII check, I guess.
There just isn't a way in standard C++ to deal with non-ASCII characters that will preserve their correct display on all systems and avoid errors and/or warnings on some Asian language systems.
2. Suppress all diacritical marks:
Joaquín M López Muñoz --> Joaquin M Lopez Munoz
I think that's really the only viable choice. Authors are free to use any transformation they desire, as long as it is entirely ASCII.
3. Encode with HTML entities:
Joaquín M López Muñoz --> Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
That makes the name much less readable except when viewed with a web browser or other HTML aware renderer.
Whatever approach is agreed upon I'll happily apply asap to Boost.MultiIndex.
Unless someone else comes up with an unexpected solution, I think you are going to have to become
Joaquín M López Muñoz --> Joaquin M Lopez Munoz :-)
--Beman

I'm sure that Joaquin, Jaakko, Ion and I (among others) long ago got used to the transliteration of our names. After being called Arve by credit card and phone vendors (a good test for hanging up quickly!), I for one no longer care about it, although I write it correctly in latex, word, html, etc., and I did edit by hand and get notarized the birth certificates of my children, and they are learning to write their names properly :-)
Seriously, given that all documentation is in English, why should we care that a copyright name is transliterated? It still has legal value, which is the only purpose of the notice. For the rest (documentation, etc.) we can use diacritical marks to our hearts' content.
-- Herve' Bro"nnimann (if you can read that :)
Sent via BlackBerry from T-Mobile
-----Original Message-----
From: Beman Dawes <bdawes@acm.org>
Date: Thu, 26 Jun 2008 06:03:36
To: boost@lists.boost.org
Subject: Re: [boost] [inspect] Hall of Shame plus non-ASCII characters
joaquin@tid.es wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with preferred approach for those with non-ASCII characters in their names.
I think the options are:
1. The inspect tool is modified so as to bypass author names (possibly taken from an author names file). In a sense, this defeats the whole purpose of the non-ASCII check, I guess.
There just isn't a way in standard C++ to deal with non-ASCII characters that will preserve their correct display on all systems and avoid errors and/or warnings on some Asian language systems.
2. Suppress all diacritical marks:
Joaquín M López Muñoz --> Joaquin M Lopez Munoz
I think that's really the only viable choice. Authors are free to use any transformation they desire, as long as it is entirely ASCII.
3. Encode with HTML entities:
Joaquín M López Muñoz --> Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
That makes the name much less readable except when viewed with a web browser or other HTML aware renderer.
Whatever approach is agreed upon I'll happily apply asap to Boost.MultiIndex.
Unless someone else comes up with an unexpected solution, I think you are going to have to become Joaquín M López Muñoz --> Joaquin M Lopez Munoz:-) --Beman

hervebronnimann@mac.com wrote:
Seriously, given that all documentation is in English, why should we care that a copyright name is transliterated?
English does borrow some words from other languages, such as French, where accents are needed. Think of "fiancé(e)" for example. Documentation should be able to write these correctly.

Mathias Gaunard wrote:
hervebronnimann@mac.com wrote:
Seriously, given that all documentation is in English, why should we care that a copyright name is transliterated?
English does borrow some words from other languages, such as French, where accents are needed. Think of "fiancé(e)" for example. Documentation should be able to write these correctly.
It isn't documentation that is the issue; it is C++ source files (.cpp, .hpp, .ipp, etc). Any Unicode character should be OK in documentation files. Inspect is only supposed to check C++ source files for non-ASCII characters. --Beman
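A check of this kind can be approximated as follows. This is a rough sketch, not the code Marshall Clow added to inspect: the function names are invented for illustration, and the extension set contains only the three extensions named above (the real tool presumably covers more).

```python
from pathlib import Path

# Only the extensions named in the message above; an assumption, not inspect's list.
CPP_SUFFIXES = {".cpp", ".hpp", ".ipp"}

def non_ascii_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines containing any byte >= 0x80."""
    return [n for n, line in enumerate(data.splitlines(), start=1)
            if any(b >= 0x80 for b in line)]

def scan(root: str) -> dict:
    """Map each offending C++ source file under root to its offending lines."""
    findings = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in CPP_SUFFIXES:
            lines = non_ascii_lines(path.read_bytes())
            if lines:
                findings[str(path)] = lines
    return findings
```

Working on raw bytes avoids having to guess the file's encoding: any byte with the high bit set is reported, whether the file is Latin-1 or UTF-8. Documentation files are skipped simply because their suffixes are not in the set.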

hervebronnimann@mac.com wrote:
I'm sure that Joaquin, Jaakko, Ion and I (among others) have long ago been used to the transliteration of our names. After being called Arve by credit cards and phone vendors (a good test for hanging up quickly!) I for one no longer care about it, although I write it correctly in latex, word, html, etc. and I did edit by hand and get notarized the birth certificates of my children and they are learning to write their names properly :-)
Seriously, given that all documentation is in English, why should we care that a copyright name is transliterated? It still has legal value, which is the only purpose of the notice. For the rest (documentation, etc.) we can use diacritical marks to our hearts' content.
-- Herve' Bro"nnimann (if you can read that :)
Thanks, Hervé --Beman

On Thu, 26 Jun 2008, joaquin@tid.es wrote:
Beman Dawes wrote:
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with preferred approach for those with non-ASCII characters in their names.
I think the options are:
...
2. Suppress all diacritical marks:
Joaquín M López Muñoz --> Joaquin M Lopez Munoz
3. Encode with HTML entities:
Joaquín M López Muñoz --> Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
4. Use standard LaTeX encodings:
Joaquín M López Muñoz --> Joaqu\'{\i}n M L\'opez Mu\~noz
or (as if the babel package were enabled)
Joaqu'in M L'opez Mu~noz
See http://tobi.oetiker.ch/lshort/lshort.pdf section 2.4.8, Accents and Special Characters
This seems both precise and mostly readable.
- Daniel
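For what it is worth, option 4 could be mechanized with a lookup table. The sketch below is purely illustrative: `LATEX_ACCENTS` and `to_latex` are hypothetical names, and the table covers only a handful of characters, including those that appear in this thread.

```python
import unicodedata

# Deliberately tiny, illustrative table of standard LaTeX accent commands.
LATEX_ACCENTS = {
    "\u00e1": r"\'a",      # á
    "\u00e9": r"\'e",      # é
    "\u00ed": r"\'{\i}",   # í (accent over a dotless i)
    "\u00f3": r"\'o",      # ó
    "\u00f1": r"\~n",      # ñ
    "\u00e4": r'\"a',      # ä
    "\u00f6": r'\"o',      # ö
}

def to_latex(name: str) -> str:
    """Rewrite a name using LaTeX accent commands; fail on unknown characters."""
    out = []
    for ch in unicodedata.normalize("NFC", name):
        if ch.isascii():
            out.append(ch)
        elif ch in LATEX_ACCENTS:
            out.append(LATEX_ACCENTS[ch])
        else:
            raise ValueError(f"no LaTeX mapping for {ch!r}")
    return "".join(out)

print(to_latex("Joaqu\u00edn M L\u00f3pez Mu\u00f1oz"))
```

The result is pure ASCII, so it would pass a non-ASCII check, at the cost of readability for anyone unfamiliar with LaTeX.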

dherring@ll.mit.edu wrote:
On Thu, 26 Jun 2008, joaquin@tid.es wrote:
Beman Dawes wrote:
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages; that's why Boost multi-index looks so bad in the report. We need to come up with preferred approach for those with non-ASCII characters in their names.
I think the options are:
...
2. Suppress all diacritical marks:
Joaquín M López Muñoz --> Joaquin M Lopez Munoz
3. Encode with HTML entities:
Joaquín M López Muñoz --> Joaqu&iacute;n M L&oacute;pez Mu&ntilde;oz
4. Use standard LaTeX encodings:
Joaquín M López Muñoz --> Joaqu\'{\i}n M L\'opez Mu\~noz
or (as if the babel package were enabled)
Joaqu'in M L'opez Mu~noz
See http://tobi.oetiker.ch/lshort/lshort.pdf section 2.4.8, Accents and Special Characters
This seems both precise and mostly readable.
It is OK if a developer wants to do that, but we don't have any way to change how C++ compilers process text, so there is nothing that will have the desired effect.
--Beman

At 2:22 PM +0200 6/26/08, Mathias Gaunard wrote:
Beman Dawes wrote:
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages;
Why not simply allow comments to be in utf-8?
See ticket <http://svn.boost.org/trac/boost/ticket/1736>, titled "Headers containing non ASCII characters cause MS VC to issue warning 4819 in some locales"
IIRC, utf-8 sequences can contain characters that are not in the range 32 ... 127. These will trip the warnings referenced above.
[ Note: I have no incoming mail until about 6 PM PDT this evening. ]
--
-- Marshall
Marshall Clow Idio Software <mailto:marshall@idio.com>
It is by caffeine alone I set my mind in motion. It is by the beans of Java that thoughts acquire speed, the hands acquire shaking, the shaking becomes a warning. It is by caffeine alone I set my mind in motion.

Marshall Clow wrote:
See ticket <http://svn.boost.org/trac/boost/ticket/1736>, titled "Headers containing non ASCII characters cause MS VC to issue warning 4819 in some locales"
IIRC, utf-8 sequences can contain characters that are not in the range 32 ... 127. These will trip the warnings referenced above.
I see. The ideal solution would be UTF-7 or quoted-printable, but I doubt these are supported in any text editor.
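To illustrate why UTF-7 and quoted-printable would sidestep the warning, here is a small sketch using Python's standard codecs. The helper `is_seven_bit` is a made-up name; nothing here says anything about editor or compiler support, which remains the practical obstacle.

```python
import quopri

def is_seven_bit(data: bytes) -> bool:
    """True if no byte has the high-order bit set."""
    return all(b < 0x80 for b in data)

name = "Jaakko J\u00e4rvi"

utf8 = name.encode("utf-8")                        # contains high-bit bytes
utf7 = name.encode("utf-7")                        # 7-bit-safe Unicode encoding
qp = quopri.encodestring(name.encode("utf-8"))     # quoted-printable over UTF-8

for label, data in (("utf-8", utf8), ("utf-7", utf7), ("quoted-printable", qp)):
    print(label, data, is_seven_bit(data))
```

Both alternatives round-trip back to the original name, but the encoded text is not readable without decoding it first.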

Mathias Gaunard wrote:
Beman Dawes wrote:
* A check for non-ASCII characters in source files has been added by Marshall Clow. It is picking up non-ASCII characters in people's names in copyright messages;
Why not simply allow comments to be in utf-8?
UTF-8 causes exactly the same problem; the high-order bit is sometimes on, and that causes errors or warnings with some compiler / operating system combinations --Beman
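The point about the high-order bit is easy to see by listing the bytes an encoding produces. This is a minimal illustration with a hypothetical helper name, not part of any tool discussed here.

```python
def high_bit_bytes(text: str, encoding: str = "utf-8") -> list:
    """Return the encoded bytes that have the high-order bit (0x80) set."""
    return [b for b in text.encode(encoding) if b & 0x80]

# In UTF-8, ä becomes two bytes, both with the high bit set.
print([hex(b) for b in high_bit_bytes("\u00e4")])             # ['0xc3', '0xa4']
# In Latin-1 it is a single byte, still with the high bit set.
print([hex(b) for b in high_bit_bytes("\u00e4", "latin-1")])  # ['0xe4']
# Pure ASCII text never sets it.
print(high_bit_bytes("Jarvi"))                                # []
```

So switching the file encoding from Latin-1 to UTF-8 changes which bytes appear, but not the fact that some of them fall outside the 7-bit range.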

Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
Something, anything, that is much less than the whole list, please! Otherwise I promise many will ignore it -- I know I will; not through any evil intention either. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
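Cutting the list off at the worst N is a one-liner once per-library counts are available. The sketch below assumes a simple mapping from library name to issue count; the function name and the numbers in the usage example are invented for illustration and do not come from the actual report.

```python
from collections import Counter

def hall_of_shame(issue_counts: dict, limit: int = 10) -> list:
    """Return the `limit` libraries with the most issues, worst first."""
    return Counter(issue_counts).most_common(limit)

# Made-up counts, for illustration only.
example = {"lib_a": 120, "lib_b": 3, "lib_c": 47, "lib_d": 0}
for library, count in hall_of_shame(example, limit=2):
    print(f"{library}: {count}")
```

Keeping the cutoff as a parameter makes it easy to experiment with 10, 5, or some other value.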

David Abrahams wrote:
Beman Dawes wrote:
There is a fresh run of the trunk inspect report up at http://mysite.verizon.net/beman/inspect.html
A couple of things are different:
* A "Hall of Shame" has been added to highlight what libraries are the worst offenders. I'm open to suggestions as to at what point we should cut off reporting. Maybe limit it to the worst 10 libraries?
Something, anything, that is much less than the whole list, please! Otherwise I promise many will ignore it -- I know I will; not through any evil intention either.
Understood. I'm the same way. Take a look at http://mysite.verizon.net/beman/inspect.html It includes a whole bunch of changes to make the report more readable. Comments appreciated! --Beman

Beman Dawes wrote:
Take a look at http://mysite.verizon.net/beman/inspect.html
It includes a whole bunch of changes to make the report more readable.
Comments appreciated!
The list of worst offenders looks a lot better now that none of my work appears there ;-) -- Dave Abrahams BoostPro Computing http://www.boostpro.com
participants (15)
- Beman Dawes
- Daniel James
- David Abrahams
- dherring@ll.mit.edu
- gchen
- hervebronnimann@mac.com
- jarvij@gmail.com
- Jens Seidel
- joaquin@tid.es
- Joel de Guzman
- Joel de Guzman
- Marshall Clow
- Mathias Gaunard
- Sebastian Redl
- vicente.botet