Unicode characters in filenames
Recently there was a thread that ended up changing the Boost guidelines so that Unicode characters are now allowed in C++ source files: http://lists.boost.org/Archives/boost/2015/06/223822.php
However, in the 1.59 release, there was a filename that had Unicode characters in it: libs\preprocessor\doc\Appendix A An Introduction to Preprocessor Metaprogramming.html, which, URL-encoded, actually looks like: Appendix%20A%20%C2%A0%20An%20Introduction. Note the %C2%A0 sequence (hex C2A0, octal 302240, Windows displays: Â )?
Since this seems like a mistake, I've created a pull request for this in preprocessor. However, it begs the question:
Should we support Unicode code points for filenames in the Boost distribution?
I would like for this answer to be 'no' as there are still lots of tools out there that don't correctly handle Unicode filenames. However, it is worth bringing up the discussion. Is there a reason we would want Unicode file names? I would guess that tests use them (especially the filesystem tests); however, I would also expect that these tests generate the files on the fly, and that they aren't part of what is distributed.
Thoughts?
Tom Kent
On 14/08/15 23:47, Tom Kent wrote:
Recently there was a thread that ended up changing the boost guidelines so that Unicode characters are now allowed in C++ source files. http://lists.boost.org/Archives/boost/2015/06/223822.php
However, in the 1.59 release, there was a filename that had unicode characters in it: libs\preprocessor\doc\Appendix A An Introduction to Preprocessor Metaprogramming.html. Which, URL-encoded, actually looks like: Appendix%20A%20%C2%A0%20An%20Introduction. Note the %C2%A0 sequence (Hex C2A0, Octal: 302240, Windows displays: Â )?
This is UTF-8 for U+00A0 NO-BREAK SPACE. You're wrongly interpreting that data as Windows-1252, hence the gibberish.
Since this seems like a mistake, I've created a pull request for this in pre-processor. However, it begs the question:
Should we support unicode codepoints for filenames in the boost distribution?
Not for code obviously, but for files that are automatically generated based on the content of other files, like documentation, I don't see a problem.
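The encoding point in this reply can be checked with a minimal sketch, assuming Python 3 and a hypothetical shortened filename fragment. It shows that U+00A0 encodes to the bytes C2 A0 in UTF-8, that percent-encoding yields %C2%A0, and that decoding those bytes as Windows-1252 produces a stray "Â" followed by a no-break space.

```python
from urllib.parse import quote

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# UTF-8 encodes U+00A0 as the two bytes 0xC2 0xA0.
utf8_bytes = NBSP.encode("utf-8")

# Percent-encoding a (hypothetical, shortened) filename fragment:
# ASCII spaces become %20, the no-break space becomes %C2%A0.
percent_encoded = quote(f"A {NBSP} An")

# Reading the same two bytes with the wrong codec (Windows-1252)
# yields U+00C2 (Â) followed by U+00A0.
misread = utf8_bytes.decode("cp1252")

# The octal spelling of the two bytes, as quoted in the original message.
octal = "".join(f"{b:o}" for b in utf8_bytes)

print(utf8_bytes.hex().upper())
print(octal)
print(percent_encoded)
print(repr(misread))
```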
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Tom Kent
Sent: 14 August 2015 23:47
To: Boost Developers List
Subject: [boost] Unicode characters in filenames
Recently there was a thread that ended up changing the boost guidelines so that Unicode characters are now allowed in C++ source files. http://lists.boost.org/Archives/boost/2015/06/223822.php
However, in the 1.59 release, there was a filename that had unicode characters in it: libs\preprocessor\doc\Appendix A An Introduction to Preprocessor Metaprogramming.html. Which, URL-encoded, actually looks like: Appendix%20A%20%C2%A0%20An%20Introduction. Note the %C2%A0 sequence (Hex C2A0, Octal: 302240, Windows displays: Â )?
Since this seems like a mistake, I've created a pull request for this in pre-processor. However, it begs the question:
Should we support unicode codepoints for filenames in the boost distribution?
I would like for this answer to be 'no' as there are still lots of tools out there that don't correctly handle unicode filenames. However, it is worth bringing up the discussion. Is there a reason we would want unicode file names? I would guess that tests use them (especially the filesystem tests), however I would also expect that these tests generate the files on the fly, and that they aren't part of what is distributed.
Thoughts?
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
Paul
---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
On 8/17/2015 3:21 PM, Mathias Gaunard wrote:
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
What happened is that I grabbed the file, Appendix A to the Boost MPL book as an HTML page, which used to be hosted at Boostpro, from the Wayback Machine Internet archive. Then I massaged it a bit to remove all the Wayback Machine cruft, but I think it had the Unicode in it once I had grabbed it.
After that I copied the title from that page to links in two other preprocessor pages, thus propagating the Unicode. The GUI HTML editor I was using never flagged the Unicode as anything unusual so I really didn't see it.
I have applied Tom Kent's PR to 'develop' and will no doubt merge it to 'master' fairly shortly.
The preprocessor docs were all written by Paul directly as HTML and it would be too much work at this point to change them to Quickbook, although I love the latter.
-----Original Message-----
From: Boost [mailto:boost-bounces@lists.boost.org] On Behalf Of Edward Diener
Sent: 18 August 2015 23:57
To: boost@lists.boost.org
Subject: Re: [boost] Unicode characters in filenames
On 8/17/2015 3:21 PM, Mathias Gaunard wrote:
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
What happened is that I grabbed the file, Appendix A to the Boost MPL book as an html page, which used to be hosted at Boostpro, from the wayback machine Internet archive. Then I massaged it a bit to remove all the wayback machine cruft, but I think it had the Unicode in it once I had grabbed it.
After that I copied the title from that page to links in two other preprocessor pages, thus propagating the Unicode. The GUI HTML editor I was using never flagged the Unicode as anything unusual so I really didn't see it.
I have applied Tom Kent's PR to 'develop' and will no doubt merge it to 'master' fairly shortly.
The preprocessor docs were all written by Paul directly as HTML and it would be too much work at this point to change it to quickbook, although I love the latter.
I've looked at this and I don't think it would take that much time for me to convert to Quickbook - but I'm not sure of the benefit apart from having a familiar look'n'feel - unless we wanted to change things significantly. It looks good and very comprehensive.
Paul
---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830
On 18 August 2015 at 23:56, Edward Diener wrote:
On 8/17/2015 3:21 PM, Mathias Gaunard wrote:
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
What happened is that I grabbed the file, Appendix A to the Boost MPL book as an html page, which used to be hosted at Boostpro, from the wayback machine Internet archive. Then I massaged it a bit to remove all the wayback machine cruft, but I think it had the Unicode in it once I had grabbed it.
After that I copied the title from that page to links in two other preprocessor pages, thus propagating the Unicode. The GUI HTML editor I was using never flagged the Unicode as anything unusual so I really didn't see it.
I have applied Tom Kent's PR to 'develop' and will no doubt merge it to 'master' fairly shortly.
The file still has lots of ASCII spaces in the name:
doc/Appendix A - An Introduction to Preprocessor Metaprogramming.html
This breaks Fedora packaging because we do:
find $docdir ... | xargs install -p -m 644 -t $docpath
and xargs splits on spaces.
I know I can fix it with -print0 and -0, but for the sake of KISS could it be just appendix.html, or intro.html or something?
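The splitting problem described in this message can be illustrated with a small simulation, written here as a sketch in Python rather than with the real tools. The file list is hypothetical, and str.split() only approximates how xargs tokenizes its input (it ignores xargs' quote and backslash handling), so treat this as a model of the behavior, not a reproduction of it.

```python
# Hypothetical file list, as `find` might report it.
paths = [
    "doc/index.html",
    "doc/Appendix A - An Introduction to Preprocessor Metaprogramming.html",
]

# Default mode: `find` prints newline-terminated names and `xargs` splits
# its input on blanks and newlines. str.split() approximates that.
newline_stream = "".join(p + "\n" for p in paths)
naive_args = newline_stream.split()

# -print0 / -0 mode: names are NUL-terminated and split only on NUL,
# so embedded spaces survive intact.
nul_stream = "".join(p + "\0" for p in paths)
safe_args = [a for a in nul_stream.split("\0") if a]

print(len(naive_args), "arguments after whitespace splitting")
print(len(safe_args), "arguments after NUL splitting")
```

With whitespace splitting, the long filename falls apart into fragments such as "A" and "-", none of which name an existing file, which is why the install step fails. NUL splitting keeps each path as one argument.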
On 8/21/2015 2:03 PM, Jonathan Wakely wrote:
On 18 August 2015 at 23:56, Edward Diener wrote:
On 8/17/2015 3:21 PM, Mathias Gaunard wrote:
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
What happened is that I grabbed the file, Appendix A to the Boost MPL book as an html page, which used to be hosted at Boostpro, from the wayback machine Internet archive. Then I massaged it a bit to remove all the wayback machine cruft, but I think it had the Unicode in it once I had grabbed it.
After that I copied the title from that page to links in two other preprocessor pages, thus propagating the Unicode. The GUI HTML editor I was using never flagged the Unicode as anything unusual so I really didn't see it.
I have applied Tom Kent's PR to 'develop' and will no doubt merge it to 'master' fairly shortly.
The file still has lots of ASCII spaces in the name:
doc/Appendix A - An Introduction to Preprocessor Metaprogramming.html
This breaks Fedora packaging because we do:
find $docdir ... | xargs install -p -m 644 -t $docpath
and xargs splits on spaces.
I know I can fix it with -print0 and -0, but for the sake of KISS could it be just appendix.html, or intro.html or something?
I kept the long name but just removed all the spaces. Does this work for you? I updated the 'develop' branch and will wait a few days, as long as I don't hear anyone complain, before updating the 'master' branch.
On Tue, Aug 18, 2015 at 5:56 PM, Edward Diener wrote:
On 8/17/2015 3:21 PM, Mathias Gaunard wrote:
On 17/08/15 11:40, Paul A. Bristow wrote:
+1
KISS - no Unicode in filenames.
(I know that this is very English-speaking-centric, but there isn't really a strong need?)
I was assuming that given the file name, the file is auto-generated, in which case it is the tool that needs fixing to make sure names are always ASCII.
What happened is...
I'm not too concerned about this particular instance. I was just looking to understand what the community thinks the guidelines should be on this.
From the responses I've seen, it seems that we probably shouldn't have any Unicode filenames checked into source control. Should we have a tool that checks this? How about adding that check to the smoke tests that the release managers do?
Tom
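As one possible shape for the check proposed in this message, here is a minimal sketch in Python. The function name find_suspect_names and the optional whitespace rule are assumptions made for illustration; this is not an existing Boost tool and not part of the release smoke tests. It walks a directory tree and reports every file or directory name that contains non-ASCII characters or, optionally, whitespace.

```python
import os
import tempfile


def find_suspect_names(root, forbid_whitespace=True):
    """Return sorted paths (relative to root) whose final name component
    is non-ASCII or, if forbid_whitespace is set, contains whitespace."""
    suspects = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            non_ascii = not name.isascii()  # str.isascii needs Python 3.7+
            has_space = forbid_whitespace and any(c.isspace() for c in name)
            if non_ascii or has_space:
                full = os.path.join(dirpath, name)
                suspects.append(os.path.relpath(full, root))
    return sorted(suspects)


# Small demonstration on a throwaway directory with hypothetical names.
with tempfile.TemporaryDirectory() as demo_root:
    for name in ["index.html", "Appendix A.html", "Appendix\u00a0B.html"]:
        open(os.path.join(demo_root, name), "w").close()
    for path in find_suspect_names(demo_root):
        print(repr(path))
```

A release script could call the function on the checkout root and fail the run when the returned list is non-empty. Passing forbid_whitespace=False restricts the check to non-ASCII names only.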
participants (5)
- Edward Diener
- Jonathan Wakely
- Mathias Gaunard
- Paul A. Bristow
- Tom Kent