On 17.09.19 08:32, Gavin Lambert via Boost wrote:
* On Unixes, argv contains whatever byte sequence the shell/caller put there. This might be the actual filename on disk (if they used tab completion) or it might be something subtly different (if they typed it themselves using some kind of IME), or even a binary blob. In the first two cases, while it is fairly *likely* to be UTF-8 (especially in modern systems), it is not guaranteed to be -- the user could be running a non-UTF-8 locale, or be accessing a filesystem created by someone who was.
Or the user could be running a non-UTF-8 locale, but accessing a filesystem created by somebody who was using UTF-8 - in which case any filenames should be in UTF-8, even if the user's locale disagrees.

It is because of this last possibility that I recommend treating all command-line arguments as UTF-8 on Unix systems, even if running a non-UTF-8 locale, for all cases where treating them as binary blobs is impractical.

Unix filenames are binary blobs, but the de-facto standard for interpreting these binary blobs as text is to use UTF-8.

How can two users, running two different locales, share a filesystem? By using UTF-8 for all filenames, regardless of locale. How should a program convert command-line arguments into UTF-8 filenames? By assuming that they are already in UTF-8, because performing any kind of conversion will cause more problems than it will fix.

--
Rainer Deyke (rainerd@eldwood.com)
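
A minimal sketch of the approach described above, assuming a C++17 compiler on a Unix-like system. The names `is_valid_utf8` and `open_arg_as_is` are hypothetical and come from neither Boost nor any other library. The argument bytes are handed to the filesystem unchanged, with no locale-based conversion, and the UTF-8 check is only used to decide whether the bytes can safely be shown as text.

```cpp
#include <cstddef>
#include <fstream>
#include <string_view>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects overlong encodings, surrogates, and values above U+10FFFF.
bool is_valid_utf8(std::string_view s) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const unsigned long min_cp[] = {0, 0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // stray continuation or invalid lead byte
        if (s.size() - i < len) return false;   // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical helper: opens the file named by a command-line argument.
// The bytes go to the OS exactly as received; nothing is transcoded.
std::ifstream open_arg_as_is(const char* arg) {
    return std::ifstream(arg, std::ios::binary);
}
```

A caller would pass each `argv[i]` to `open_arg_as_is` directly and use `is_valid_utf8(argv[i])` only when the name has to be displayed or logged as text. Arguments that fail the check remain usable as opaque byte strings.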