[conversion] Isolating the phantom file changes problem

This is the simplest case I could come up with where git clone produced an initially modified state. Any ideas why this is happening? Is this another manifestation of crlf problems? --Beman
D:\modular-boost\libs\intrusive>git clone --recursive git@github.com:boostorg/boost.git modular-boost
... much output, no apparent error messages...
D:\modular-boost>git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#   (commit or discard the untracked or modified content in submodules)
#
#       modified:   libs/interprocess (modified content)
#       modified:   libs/intrusive (modified content)
#       modified:   libs/pool (modified content)
#
no changes added to commit (use "git add" and/or "git commit -a")
D:\modular-boost>cd libs/pool
D:\modular-boost\libs\pool>git status
# Not currently on any branch.
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   doc/images/mb1.svg
#       modified:   doc/images/mb2.svg
#       modified:   doc/images/mb3.svg
#       modified:   doc/images/mb4.svg
#       modified:   doc/images/pc1.svg
#       modified:   doc/images/pc2.svg
#       modified:   doc/images/pc3.svg
#       modified:   doc/images/pc4.svg
#       modified:   doc/images/pc5.svg
#
no changes added to commit (use "git add" and/or "git commit -a")
D:\modular-boost\libs\pool>cd ..\interprocess
D:\modular-boost\libs\interprocess>git status
# Not currently on any branch.
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   proj/vc7ide/managed_shared_memory.vcproj
#       modified:   proj/vc7ide/offset_ptr_test.vcproj
#
no changes added to commit (use "git add" and/or "git commit -a")
D:\modular-boost\libs\interprocess>cd ..\intrusive
D:\modular-boost\libs\intrusive>git status
# Not currently on any branch.
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:   proj/vc7ide/avl_multiset/avl_multiset.vcproj
#       modified:   proj/vc7ide/avl_set/avl_set.vcproj
#       modified:   proj/vc7ide/sg_multiset/sg_multiset.vcproj
#       modified:   proj/vc7ide/sg_set/sg_set.vcproj
#       modified:   proj/vc7ide/splay_multiset/splay_multiset.vcproj
#       modified:   proj/vc7ide/splay_set/splay_set.vcproj
#
no changes added to commit (use "git add" and/or "git commit -a")
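One way to check whether a "modified" file really is a line-ending phantom is to compare the committed bytes with the working-tree bytes after folding CRLF to LF on both sides. The following is only a minimal sketch; the function name is hypothetical and nothing here is part of git itself.

```python
def differs_only_in_line_endings(committed: bytes, working: bytes) -> bool:
    """Return True when the contents differ, but only in CRLF versus LF."""
    if committed == working:
        return False  # identical bytes: not a change at all
    # Fold every CRLF pair to a bare LF on both sides, then compare.
    return committed.replace(b"\r\n", b"\n") == working.replace(b"\r\n", b"\n")
```

The committed bytes could come from something like `git cat-file -p HEAD:doc/images/mb1.svg`, which prints the raw blob, and the working bytes from reading the file in binary mode.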

On 26/11/2013 09:26, Quoth Beman Dawes:
This is the simplest case I could come up with where git clone produced an initially modified state.
Any ideas why this is happening? Is this another manifestation of crlf problems?
It would make sense. .vcproj files would definitely get created with CRLF line endings, and SVG files probably would too if they were created with a Windows-based editor. So it wouldn't be surprising if they were committed to SVN that way. The conversion process needs to somehow ensure that only LF endings end up in the repository for files marked as "text" (or text="auto" that aren't obviously binary to git) in the .gitattributes, regardless of whether they had LF or CRLF endings in SVN. If you're running the conversion from a Linux box, I think that means that you'll need to selectively dos2unix those files before committing them to git. And as Steven Watanabe said, the svn:eol-style property is probably a good place to look to find potentially troublesome files. In particular, what SVN does with svn:eol-style=CRLF is noticeably different from what git does with eol=crlf. Shell scripts might benefit from being marked eol=lf in git, in case someone on Windows wants to run the script via Cygwin or MSYS. There's probably no need to mark batch files or VS projects as eol=crlf, since these files are only likely to actually be used on Windows, and git should expand these to CRLFs just via "text" or "text=auto".
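The "selectively dos2unix" step suggested above could look roughly like the sketch below. It assumes the caller has already decided which files count as text, and it uses a NUL byte as a crude guard against touching binary data; the function name is made up for illustration.

```python
def to_lf(data: bytes) -> bytes:
    """Fold CRLF to LF, leaving anything that contains a NUL byte untouched."""
    if b"\0" in data:
        return data  # probably binary: do not rewrite it
    return data.replace(b"\r\n", b"\n")
```

Applying this to each text file before it is committed would leave only LF endings in the repository blobs, which is what git expects for files carrying the text attribute.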

On 25 November 2013 22:55, Gavin Lambert
Shell scripts might benefit from being marked eol=lf in git, in case someone on Windows wants to run the script via Cygwin or MSYS. There's probably no need to mark batch files or VS projects as eol=crlf, since these files are only likely to actually be used on Windows, and git should expand these to CRLFs just via "text" or "text=auto".
If a file only works with crlf line endings then that is what the file should contain. Sometimes people use mac or linux checkouts on windows. For example, we shouldn't require users to download boost twice if they use it on both windows and linux. The .tar.gz and .tar.bz2 downloads should work everywhere.

Hi Beman, On Monday, 25. November 2013 15:26:56 Beman Dawes wrote:
This is the simplest case I could come up with where git clone produced an initially modified state.
Any ideas why this is happening? Is this another manifestation of crlf problems?
Yes, definitely.
D:\modular-boost\libs\pool>git status # Not currently on any branch. # Changes not staged for commit: # (use "git add <file>..." to update what will be committed) # (use "git checkout -- <file>..." to discard changes in working directory) # # modified: doc/images/mb1.svg
Those are missing svn:eol-style:
boost.svn$ svn proplist libs/pool/doc/images/mb1.svg
Properties on 'libs/pool/doc/images/mb1.svg':
  svn:mime-type
boost.svn$ svn propget svn:mime-type libs/pool/doc/images/mb1.svg
image/svg+xml
And image/* marks this as "binary". We had this discussion weeks ago with no real decision.
D:\modular-boost\libs\interprocess>git status
# modified: proj/vc7ide/managed_shared_memory.vcproj # modified: proj/vc7ide/offset_ptr_test.vcproj
boost.svn$ svn proplist libs/interprocess/proj/vc7ide/managed_shared_memory.vcproj
Properties on 'libs/interprocess/proj/vc7ide/managed_shared_memory.vcproj':
  svn:eol-style
  svn:mime-type
boost.svn$ svn propget svn:eol-style libs/interprocess/proj/vc7ide/managed_shared_memory.vcproj
CRLF
boost.svn$ svn propget svn:mime-type libs/interprocess/proj/vc7ide/managed_shared_memory.vcproj
text/xml
Explicit CRLF.
D:\modular-boost\libs\intrusive>git status
# modified: proj/vc7ide/avl_multiset/avl_multiset.vcproj # modified: proj/vc7ide/avl_set/avl_set.vcproj # modified: proj/vc7ide/sg_multiset/sg_multiset.vcproj # modified: proj/vc7ide/sg_set/sg_set.vcproj # modified: proj/vc7ide/splay_multiset/splay_multiset.vcproj # modified: proj/vc7ide/splay_set/splay_set.vcproj
Ditto. The final question is how to resolve this. I am not sure which solution is best. I'd like to set all text-based files (including .svg and .bat) to svn:eol-style native and svn:mime-type text/* in svn and then re-run the conversion again. The main reason is that this worked really well for the rest of the repository. Just my .02€ Yours, Jürgen -- * Dipl.-Math. Jürgen Hunold ! * voice: ++49 4257 300 ! Fährstraße 1 * fax : ++49 4257 300 ! 31609 Balge/Sebbenhausen * jhunold@gmx.eu ! Germany

On 26/11/2013 19:54, Quoth Jürgen Hunold:
boost.svn$ svn proplist libs/pool/doc/images/mb1.svg Properties on 'libs/pool/doc/images/mb1.svg': svn:mime-type boost.svn$ svn propget svn:mime-type libs/pool/doc/images/mb1.svg image/svg+xml
And image/* marks this as "binary". We had this discussion weeks ago with no real decision.
Just curious: where is that rule coming from? If it's in the conversion code (which seems likely), can it be changed to treat */*+xml as text regardless of other rules?
The final question is how to resolve this. I am not sure which solution is best. I'd like to set all text-based files (including .svg and .bat) to svn:eol-style native and svn:mime-type text/* in svn and then re-run the conversion again. The main reason is that this worked really well for the rest of the repository.
That might be easier though, provided they're all found. :)

Hi Gavin, On Tuesday, 26. November 2013 20:06:00 Gavin Lambert wrote:
On 26/11/2013 19:54, Quoth Jürgen Hunold:
boost.svn$ svn proplist libs/pool/doc/images/mb1.svg
Properties on 'libs/pool/doc/images/mb1.svg': svn:mime-type
boost.svn$ svn propget svn:mime-type libs/pool/doc/images/mb1.svg image/svg+xml
And image/* marks this as "binary". We had this discussion weeks ago with no real decision.
Just curious: where is that rule coming from? If it's in the conversion code (which seems likely), can it be changed to treat */*+xml as text regardless of other rules?
No, subversion will not do CRLF-conversion on binary files. This creates the same issues as explicit CRLF settings when doing the conversion on Unix. And the conversion script will have to use subversion commands in order to get the commits. The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
The final question is how to resolve this. I am not sure which solution is best. I'd like to set all text-based files (including .svg and .bat) to svn:eol-style native and svn:mime-type text/* in svn and then re-run the conversion again. The main reason is that this worked really well for the rest of the repository.
That might be easier though, provided they're all found. :)
We have to handle all files which come out "modified" from a fresh checkout. The list from Beman is quite comprehensive. And I would just use a script to force-set the needed properties on the svn side. Yours, Jürgen -- * Dipl.-Math. Jürgen Hunold ! * voice: ++49 4257 300 ! Fährstraße 1 * fax : ++49 4257 300 ! 31609 Balge/Sebbenhausen * jhunold@gmx.eu ! Germany

On 26/11/2013 20:18, Quoth Jürgen Hunold:
On Tuesday, 26. November 2013 20:06:00 Gavin Lambert wrote:
On 26/11/2013 19:54, Quoth Jürgen Hunold:
boost.svn$ svn proplist libs/pool/doc/images/mb1.svg
Properties on 'libs/pool/doc/images/mb1.svg': svn:mime-type
boost.svn$ svn propget svn:mime-type libs/pool/doc/images/mb1.svg image/svg+xml
And image/* marks this as "binary". We had this discussion weeks ago with no real decision.
Just curious: where is that rule coming from? If it's in the conversion code (which seems likely), can it be changed to treat */*+xml as text regardless of other rules?
No, subversion will not do CRLF-conversion on binary files. This creates the same issues as explicit CRLF settings when doing the conversion on Unix. And the conversion script will have to use subversion commands in order to get the commits.
The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
I think you misunderstood my question. I understand why they were labelled as image/svg+xml, and I understand why there's a "mime type image/* in svn => binary attribute in git" rule. I'm just suggesting that (if it makes sense with how the conversion tool actually works -- I don't know much about it) a higher priority rule be added to the conversion tool that says "mime type */*+xml in svn => text attribute in git", which would then result in SVG files being marked as text and all other files marked as images. Though it'd still require the files to be dos2unix'd if the conversion was run on Linux -- if it was run on Windows then the text attribute by itself should fix the problem. Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs, and then the eol attribute can be removed for actual use.
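The rule ordering proposed here can be pictured as a first-match-wins table in which the `*/*+xml` pattern sits above `image/*`. This is only an illustrative sketch: the rule table, the function name, and the "binary" default are assumptions, not the actual logic of the conversion tool.

```python
from fnmatch import fnmatchcase

# Ordered rules: the first matching pattern wins, so "*/*+xml" has to come
# before "image/*" for image/svg+xml to end up as text.
RULES = [
    ("*/*+xml", "text"),
    ("text/*", "text"),
    ("image/*", "binary"),
]


def git_attribute_for(mime_type: str, default: str = "binary") -> str:
    """Map an svn:mime-type value to a git attribute using RULES."""
    for pattern, attribute in RULES:
        if fnmatchcase(mime_type, pattern):
            return attribute
    return default  # assumed fallback for any other mime type
```

With this ordering, `image/svg+xml` maps to text while `image/png` still maps to binary. As noted above, marking a file as text does not by itself rewrite CRLF bytes that are already stored in the blob.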

On 26/11/2013 20:18, Quoth Jürgen Hunold:
The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
Changing subversion properties doesn't fix historical data, so it won't fix anything that we can't fix in git.
On 26 November 2013 23:34, Gavin Lambert
Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs, and then the eol attribute can be removed for actual use.
The conversion doesn't respect .gitattributes, that's why there's a problem. We can fix these issues in current and future versions by normalizing the files (as described in the gitattributes man page). Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.

Excuse my ignorance, but why do we need to set attributes about eol or even suggest auto.crlf=true? Why can't this be managed at the editor level? My point is that I never understood the logic about autoconversion of line endings. Of course it'll report modified files if you use autoconversion, it's what you are asking the tool to do! I understand the idea is that then the tool ignores line ending differences but it sounds very kludgy to me. A better idea imho would be to enforce line endings in some pre-commit/pre-push hook, and convert all existing files prior to these hooks, but do we really want to enforce such a thing? My 0.02$ Philippe

On Wed, Nov 27, 2013 at 4:43 AM, Philippe Vaucher < philippe.vaucher@gmail.com> wrote:
Excuse my ignorance, but why do we need to set attributes about eol or even suggest auto.crlf=true ? Why can't this be managed at the editor level?
The core.autocrlf true suggestion for Windows users is a bit of a side track, and will probably be removed. The .gitattributes file allows "you to ensure consistent behaviour for all users regardless of their git settings." See https://help.github.com/articles/dealing-with-line-endings#platform-all At least that's my current understanding, so take it with a grain of salt. --Beman

On Wed, Nov 27, 2013 at 2:00 AM, Daniel James
On 26/11/2013 20:18, Quoth Jürgen Hunold:
The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
Changing subversion properties doesn't fix historical data, so it won't fix anything that we can't fix in git.
That's also my understanding.
On 26 November 2013 23:34, Gavin Lambert wrote:
Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs, and then the eol attribute can be removed for actual use.
The conversion doesn't respect .gitattributes, that's why there's a problem. We can fix these issues in current and future versions by normalizing the files (as described in the gitattributes man page).
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing? --Beman

Hi Beman, On Wednesday, 27. November 2013 09:25:19 Beman Dawes wrote:
On Wed, Nov 27, 2013 at 2:00 AM, Daniel James wrote:
Changing subversion properties doesn't fix historical data, so it won't fix anything that we can't fix in git.
That's also my understanding.
Yes, this is right.
The conversion doesn't respect .gitattributes, that's why there's a problem. We can fix these issues in current and future versions by normalizing the files (as described in the gitattributes man page).
Ah, "normalizing" is the keyword I am missing. Thanks.
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
I'm still more familiar with Subversion than git. Fixing it on the svn side would be my personal favourite. I did some similar conversions while converting from CVS years ago in order to end up with a "clean" repository without need for further actions. But cvs2svn/git is a scary thing...
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing?
My goal was to avoid any "conversion" commits. I should have explicitly said so. I'm fine with having a handful of "normalizing" commits and let the repository go live.
From the replies I think there is consensus (or at least a great majority) to renormalize and move forward.
Thanks for managing this whole thing so patiently. Yours, Jürgen -- * Dipl.-Math. Jürgen Hunold ! * voice: ++49 4257 300 ! Fährstraße 1 * fax : ++49 4257 300 ! 31609 Balge/Sebbenhausen * jhunold@gmx.eu ! Germany

On 27 November 2013 14:25, Beman Dawes
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing?
If someone checks out an older version of a module (i.e. before normalization) then they'll still have this problem. If the conversion is rerun with gitattributes set as I described then I don't think they will. It won't fix the files, it'll just tell git not to expect them to have the correct newlines. But this isn't a big deal, and there are possibly issues with this approach, so it might not be worth it.

On Wed, Nov 27, 2013 at 10:47 AM, Daniel James
On 27 November 2013 14:25, Beman Dawes wrote:
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing?
If someone checks out an older version of a module (i.e. before normalization) then they'll still have this problem. If the conversion is rerun with gitattributes set as I described then I don't think they will. It won't fix the files, it'll just tell git not to expect them to have the correct newlines.
Ah! Understood. But like you, I don't see that as a big deal.
But this isn't a big deal, and there are possibly issues with this approach, so it might not be worth it.
I would really like to avoid rerunning the conversion, since just to be sure I would want to then rerun the independent verification checks. That involves a lot of semi-manual steps, and it has taken me the better part of three days so far. I'm having to rerun the trunk/develop verification right now because it ran out of disk space last night. Sigh. I'd also like to hear other opinions, particularly from Dave, on this before we make a final decision. --Beman

Beman Dawes
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
They have.
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing?
It depends on whether you want to "correct" history or not. If you want to go back in time and renormalize, then all the SHAs in submodules change, and the branches in the master repo need to be rewritten accordingly.
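The reason every SHA changes is that a git object ID is a hash over the object's exact bytes, so rewriting CRLF to LF in a historical blob yields a different blob ID, and with it different tree and commit IDs above it. A minimal sketch of the blob ID computation, assuming git's default SHA-1 object format (the helper name is made up):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID git assigns to a blob with this exact content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


crlf_id = git_blob_id(b"line one\r\nline two\r\n")
lf_id = git_blob_id(b"line one\nline two\n")
# The two IDs differ, so any tree and commit referring to the file change too.
ids_differ = crlf_id != lf_id
```

Because submodule commits are recorded by ID in the master repo, changed submodule history means the superproject branches have to be rewritten as well, as described above.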

On Wed, Nov 27, 2013 at 11:25 AM, Dave Abrahams
Beman Dawes writes:
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
They have.
Since we can't fix these files in historical versions we could re-run the conversion with gitattributes that unsets the text attribute for the problematic files (something like "*.bat -text", "*.vcproj -text", "*.svg -text"). Then in git update the gitattributes (to something like "*.bat text eol=crlf", "*.vcproj text eol=crlf", "*.svg text") and normalize the files so that it does what we want. We can script that, so it shouldn't require too much work.
As long as we are sure the .gitattributes are correct, and then renormalize, why do we have to rerun the conversion at all? What am I missing?
It depends on whether you want to "correct" history or not. If you want to go back in time and renormalize, then all the SHAs in submodules change, and the branches in the master repo need to be rewritten accordingly.
My initial reaction is not to "correct" history. I'm quite comfortable with the idea that files from svn were moved into "git" as is, and then as part of bringing Boost into the git world, we renormalized. I think of it this way: a programmer deeply familiar with git who has never seen Boost before should not be surprised necessarily, beyond having to grasp the idea that Boost is a set of individual libraries rather than one giant library. IIUC, such a programmer would expect normalized line endings in master and develop heads, and .gitattributes files to ensure portability. --Beman

On 27 Nov 2013 at 9:25, Beman Dawes wrote:
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
This was discussed and rejected due to loss of data. Here is the patch implementing renormalisation: https://github.com/ryppl/Boost2Git/pull/42. Niall -- Currently unemployed and looking for work. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/

On 29 November 2013 04:04, Niall Douglas
On 27 Nov 2013 at 9:25, Beman Dawes wrote:
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
This was discussed and rejected due to loss of data. Here is the patch implementing renormalisation: https://github.com/ryppl/Boost2Git/pull/42.
When we talk about renormalisation, we're talking about the procedure described in the gitattributes man page.
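For readers who want to try a renormalizing commit in a scratch clone, the sketch below lists one possible command sequence. It assumes a Git new enough to support `git add --renormalize` (2.16 or later), which postdates this thread; the man page of the time describes a longer manual sequence. The helper names and the commit message are hypothetical.

```python
import subprocess


def renormalize_commands(message="Normalize line endings"):
    """Commands for a renormalizing commit, assuming Git 2.16 or later."""
    return [
        ["git", "add", "--renormalize", "."],  # re-apply .gitattributes to tracked files
        ["git", "status", "--short"],          # review which files were rewritten
        ["git", "commit", "-m", message],
    ]


def run_all(commands, repo_dir):
    """Run each command in repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

Note that `git commit` exits with an error when nothing was staged, so `run_all` raises in a repository that is already normalized.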

On Fri, Nov 29, 2013 at 4:26 AM, Daniel James
On 29 November 2013 04:04, Niall Douglas wrote:
On 27 Nov 2013 at 9:25, Beman Dawes wrote:
After doing a lot of reading of git docs and about how others dealt with similar problems, that was also my conclusion. I'm wondering why others haven't suggested renormalization?
This was discussed and rejected due to loss of data. Here is the patch implementing renormalisation: https://github.com/ryppl/Boost2Git/pull/42.
When we talk about renormalisation, we're talking about the procedure described in the gitattributes man page.
+1 There is also a nice discussion at https://help.github.com/articles/dealing-with-line-endings#re-normalizing-a-... I've forked the boost super repo and am testing the procedure now. --Beman

On 29 Nov 2013 at 8:14, Beman Dawes wrote:
When we talk about renormalisation, we're talking about the procedure described in the gitattributes man page.
Unfortunately that procedure would corrupt many files within Boost.
+1
There is also a nice discussion at https://help.github.com/articles/dealing-with-line-endings#re-normalizing-a-...
I've forked the boost super repo and am testing the procedure now.
Off the top of my head, you'll need to watch for the following (this list is incomplete):
* Files with text file extensions not in ASCII or UTF-8. If you use simple EOL renormalisation with UTF-16 text for example, you'll corrupt that text.
* Text files with intentionally mixed EOLs. You'll need to change their extension to not .txt (best), or add special exceptions to .gitattributes (brittle, I wouldn't recommend this option).
* Scan the first 8Kb of every file with an extension not marked as text nor binary in .gitattributes for zeros. If you don't find a zero, git will assume it is text and EOL normalise it. Unfortunately some binary file types such as PDF don't have zeros in their first 8Kb, so that would be very bad.
We never dealt with these issues during conversion, and there are probably more we don't know about yet. This is why I said Boost is not ready to do the transition - plus too few want to do the manual labour involved in achieving a "perfect" conversion and just want "someone else" to do the tedious work for them.
Niall -- Currently unemployed and looking for work. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
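The three checks in the list above can each be approximated with a few lines operating on a file's raw bytes. This is a rough sketch: the function names are invented, the 8000-byte window only approximates git's text detection, and the UTF-16 check catches only files that start with a byte order mark.

```python
def looks_binary_to_git(data: bytes, sniff_bytes: int = 8000) -> bool:
    """Approximate git's heuristic: a NUL byte near the start means binary."""
    return b"\0" in data[:sniff_bytes]


def has_utf16_bom(data: bytes) -> bool:
    """Detect UTF-16 text that starts with a byte order mark."""
    return data.startswith((b"\xff\xfe", b"\xfe\xff"))


def has_mixed_eols(data: bytes) -> bool:
    """Return True when the content contains both CRLF and bare LF endings."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return crlf > 0 and bare_lf > 0
```

A file for which `looks_binary_to_git` is false but which is not really text (the PDF case mentioned above) is exactly the kind that needs an explicit entry in .gitattributes.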

On Fri, Nov 29, 2013 at 12:26 PM, Niall Douglas
On 29 Nov 2013 at 8:14, Beman Dawes wrote:
When we talk about renormalisation, we're talking about the procedure described in the gitattributes man page.
Unfortunately that procedure would corrupt many files within Boost.
I've tested, on Windows and Linux, without apparent problems. The most files modified (152) are on master. Here is the list:
+1
There is also a nice discussion at
https://help.github.com/articles/dealing-with-line-endings#re-normalizing-a-...
I've forked the boost super repo and am testing the procedure now.
Off the top of my head, you'll need to watch for the following (this list is incomplete):
* Files with text file extensions not in ASCII or UTF-8. If you use simple EOL renormalisation with UTF-16 text for example, you'll corrupt that text.
None of the modified files fit in that category.
* Text files with intentionally mixed EOLs. You'll need to change their extension to not .txt (best), or add special exceptions to .gitattributes (brittle, I wouldn't recommend this option).
None of the modified files fit in that category. I checked tools/inspect/wrong_line_ends_test.cpp to see why it wasn't normalized, and the reason was simple. It had already been normalized! That isn't worrisome - inspect is a tool that will need tuning for git anyhow.
* Scan the first 8Kb of every file with an extension not marked as text nor binary in .gitattributes for zeros. If you don't find a zero, git will assume it is text and EOL normalise it. Unfortunately some binary file types such as PDF don't have zeros in their first 8Kb, so that would be very bad.
.pdf is in .gitattributes, so no problem. I checked a couple of .pdf files to be sure, and adobe reader opens them without problems. All of the modified files have extensions that are in .gitattributes, by the way.
We never dealt with these issues during conversion, and there are probably more we don't know about yet. This is why I said Boost is not ready to do the transition - plus too few want to do the manual labour involved in achieving a "perfect" conversion and just want "someone else" to do the tedious work for them.
I'm sure there are problems we don't know about. But we have done enough testing to know that vast numbers of files were converted correctly, that passing tests on both trunk and branches/release still pass, and that the small number of the minor problems we have found are not even close to being showstoppers. To delay further just because of FUD will be harmful. Thanks for your list of possible problem areas. It gave me some additional things to look for. --Beman

On 29 Nov 2013 at 16:26, Beman Dawes wrote:
When we talk about renormalisation, we're talking about the procedure described in the gitattributes man page.
Unfortunately that procedure would corrupt many files within Boost.
I've tested, on Windows and Linux, without apparent problems. The most files modified (152) are on master. Here is the list:
I looked through that list - it doesn't seem to contain anything from Sandbox? My original Boost2Git EOL conversion patch was developed against trunk only i.e. not including sandbox. That conversion, to my knowledge, was completely lossless, even when going right back to the beginning. The problem was Sandbox really. Dave spotted that my patch fatal exited when Sandbox was included, and it was an assertion check for an "impossible" condition. Things got worse the more I worked around problems. I eventually concluded we would have to accept data loss if we were converting all of history, and upcalled the decision to Dave and Daniel. One option I did posit was to convert the last three years of history, and flatten everything before that. We could have done a perfect conversion of the last three years easily enough, and it was my personally preferred option. I think we're beyond that now though.
Off the top of my head, you'll need to watch for the following (this list is incomplete): [snip] None of the modified files fit in that category.
I checked tools/inspect/wrong_line_ends_test.cpp to see why it wasn't normalized, and the reason was simple. It had already been normalized! That isn't worrisome - inspect is a tool that will need tuning for git anyhow.
The problematic files were mostly in Sandbox. I found little issue with trunk. I think because stuff in trunk had to pass peer review, people didn't do unwise commits.
.pdf is in .gitattributes, so no problem. I checked a couple of .pdf files to be sure, and adobe reader opens them without problems.
Yes, I made sure .pdf was in gitattributes specifically to deal with that problem. I still think a test which sweeps the first 8Kb of all files in Boost which don't have extensions in .gitattributes is a very good idea.
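A minimal sketch of such a sweep, in Python. It assumes the set of extensions already covered by .gitattributes has been collected by hand and is passed in as `covered_exts`; that parameter and the function names are hypothetical. The NUL-byte check is only a common heuristic for binary content, not a guarantee.

```python
import os

SNIFF_BYTES = 8 * 1024  # the "first 8Kb" suggested above


def looks_binary(head: bytes) -> bool:
    """Heuristic: a NUL byte in the sniffed prefix suggests binary content."""
    return b"\0" in head


def sweep(root: str, covered_exts: set[str]) -> list[tuple[str, bool, bool]]:
    """Report files whose extension is not in covered_exts.

    Returns (relative_path, looks_binary, has_crlf) tuples, sorted by path.
    """
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into git metadata directories.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name == ".git":  # submodules keep a .git *file*
                continue
            ext = os.path.splitext(name)[1].lower()
            if ext in covered_exts:
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                head = fh.read(SNIFF_BYTES)
            rel = os.path.relpath(path, root)
            findings.append((rel, looks_binary(head), b"\r\n" in head))
    return sorted(findings)
```

A file reported with `looks_binary` true and an uncovered extension is a candidate for an explicit binary entry in .gitattributes; one with `has_crlf` true is a candidate for normalisation.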
We never dealt with these issues during conversion, and there are probably more we don't know about yet. This is why I said Boost is not ready to do the transition - plus too few want to do the manual labour involved in achieving a "perfect" conversion and just want "someone else" to do the tedious work for them.
I'm sure there are problems we don't know about. But we have done enough testing to know that vast numbers of files were converted correctly, that passing tests on both trunk and branches/release still pass, and that the small number of minor problems we have found are not even close to being showstoppers.
To delay further just because of FUD will be harmful.
Caution is not FUD. My biggest single worry has always been lack of testing of the validity of the conversion. I personally hate doing anything irreversible which has not been tested to destruction. The decision is out of my hands of course. And I have always seen your point about getting on with it Beman, it's just I am personally much more cautious in this type of situation (equally I am far less cautious than many on this list in many other situations). Also, it's entirely possible much more testing has been done than I am aware of - after all I am not on the steering committee. If that is so, my caution is unwarranted and I have been talking out of my ass.
Thanks for your list of possible problem areas. It gave me some additional things to look for.
You're welcome. Niall -- Currently unemployed and looking for work. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/

Daniel James
On 26/11/2013 20:18, Quoth Jürgen Hunold:
The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
Changing subversion properties doesn't fix historical data, so it won't fix anything that we can't fix in git.
On 26 November 2013 23:34, Gavin Lambert wrote:
Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs, and then the eol attribute can be removed for actual use.
The conversion doesn't respect .gitattributes, that's why there's a problem. We can fix these issues in current and future versions by normalizing the files (as described in the gitattributes man page).
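To illustrate why such files show up as modified right after a clone, here is a rough Python sketch. It assumes the simplest case: a path that .gitattributes marks as text, whose committed blob nevertheless contains CRLF. The function names are hypothetical, and the real logic inside git is more involved (it also depends on `eol`, `core.autocrlf` and related settings).

```python
def normalize_to_lf(data: bytes) -> bytes:
    """Convert CRLF pairs to LF, as check-in normalisation does for text."""
    return data.replace(b"\r\n", b"\n")


def appears_modified(blob: bytes, is_text: bool) -> bool:
    """True if re-adding the file would store different bytes than the blob."""
    if not is_text:
        return False  # binary paths are compared byte for byte
    return normalize_to_lf(blob) != blob
```

Under this model, normalising the files in a new commit makes the stored blob equal to its own normalised form, which is why the phantom changes go away for current and future versions but not for history.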
Since we can't fix these files in historical versions
We could, but it would be lossy. Also, some files in SVN have the wrong settings there. We decided it was not the conversion's job to try to "fix" problems in SVN. But if someone can sort out what we really, really want, it should be relatively easy to implement. The problem is that nobody is sure what is wanted. Before going any further, IMO, someone should generate consensus around a clear set of *unambiguous* goals for the result.

Gavin Lambert
I'm just suggesting that (if it makes sense with how the conversion tool actually works -- I don't know much about it) a higher priority rule be added to the conversion tool that says "mime type */*+xml in svn => text attribute in git", which would then result in SVG files being marked as text and all other files marked as images.
Such changes can be made.
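The suggested rule could look roughly like the following Python sketch. This is not Boost2Git's actual code; the function name is hypothetical, and treating everything outside `text/*` and `*/*+xml` as binary is an assumption made for illustration.

```python
def svn_mime_is_text(mime_type: str) -> bool:
    """Classify an svn:mime-type value as text (True) or binary (False)."""
    # Drop any parameters such as "; charset=utf-8".
    mime = mime_type.split(";", 1)[0].strip().lower()
    # Higher-priority rule: any XML-based type is text, e.g. image/svg+xml.
    if mime.endswith("+xml"):
        return True
    return mime.startswith("text/")
```

With this ordering, `image/svg+xml` is classified as text while `image/png` stays binary.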
Though it'd still require the files to be dos2unix'd if the conversion was run on Linux
As mentioned elsewhere, that's incorrect. The files are never checked out into a working copy during the conversion process. That would be hella-slow.
-- if it was run on Windows then the text attribute by itself should fix the problem.
Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs,
What goes into .gitattributes has no effect on the content of blobs in the repository.
and then the eol attribute can be removed for actual use.

On 28/11/2013 05:17, Quoth Dave Abrahams:
Another possibility might be to explicitly mark all text files in .gitattributes with eol=lf and then run the conversion; I think that would fix the repository blobs,
What goes into .gitattributes has no effect on the content of blobs in the repository.
Sorry, when I wrote that I was envisaging the conversion process doing the equivalent of an SVN checkout and then git add/commit. (And git commit does take the attributes into account for eol conversion.) Since apparently it doesn't work like that, just ignore that suggestion. I still think that we need to renormalise the line endings though. Whether that's done as a reconversion (fixing history) or just as an extra commit (fixing current heads), that's up to you guys. ;)

Jürgen Hunold
Hi Gavin,
On Tuesday, 26. November 2013 20:06:00 Gavin Lambert wrote:
On 26/11/2013 19:54, Quoth Jürgen Hunold:
boost.svn$ svn proplist libs/pool/doc/images/mb1.svg
Properties on 'libs/pool/doc/images/mb1.svg': svn:mime-type
boost.svn$ svn propget svn:mime-type libs/pool/doc/images/mb1.svg image/svg+xml
And image/* marks this as "binary". We had this discussion weeks ago with no real decision.
Just curious: where is that rule coming from? If it's in the conversion code (which seems likely), can it be changed to treat */*+xml as text regardless of other rules?
No, subversion will not do CRLF-conversion on binary files. This creates the same issues as explicit CRLF settings when doing the conversion on Unix.
Let me be very clear: it does not matter a bit what platform is used to do the conversion. The bytes of every file are streamed literally from SVN to Git. https://github.com/ryppl/Boost2Git/blob/master/src/importer.cpp#L459
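A small Python sketch of what literal streaming implies for the stored objects. Git names a blob by hashing a short header plus the raw bytes (SHA-1 in the default object format), so the CRLF and LF forms of the same text are different blobs. The function name is hypothetical.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Compute the SHA-1 object id git assigns to a blob with these bytes."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


crlf_id = git_blob_id(b"<svg>\r\n</svg>\r\n")
lf_id = git_blob_id(b"<svg>\n</svg>\n")
print(crlf_id != lf_id)  # the two forms are stored as distinct blobs
```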
And the conversion script
And please don't call it a script ;-)
will have to use subversion commands in order to get the commits.
The conversion uses the SVN API directly.
The reason for the "image/*" setting for svg was to have them displayed as images when viewed in a web browser. But those display settings can be better configured server-side. My experience is to have all text file eol-style "native" and mime-type "text/something" to get the best cross-platform integration. This is true even for .vcproj files as you can then script-edit them on Unix without problems.
The final question is how to resolve this. I am not sure which solution is best. I'd like to set all text-based files (including .svg and .bat) to svn:eol-style native and svn:mime-type text/* in svn and then re-run the conversion again. The main reason is that this worked really well for the rest of the repository.
That might be easier though, provided they're all found. :)
We have to handle all files which come out "modified" from a fresh checkout. The list from Beman is quite comprehensive. And I would just use a script to force-set the needed properties on the svn side.
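Such a script might start from something like this Python sketch, which only builds the `svn propset` command lines and prints them for review instead of running them. The function name is hypothetical, and `text/xml` is a placeholder value chosen for illustration, not a decision from this thread.

```python
import shlex


def propset_commands(paths, mime_type):
    """Build svn propset commands for each path (nothing is executed)."""
    cmds = []
    for path in paths:
        cmds.append(["svn", "propset", "svn:eol-style", "native", path])
        cmds.append(["svn", "propset", "svn:mime-type", mime_type, path])
    return cmds


if __name__ == "__main__":
    # Example path taken from the list of files reported as modified.
    for cmd in propset_commands(["libs/pool/doc/images/mb1.svg"], "text/xml"):
        print(shlex.join(cmd))
```

Printing first keeps the step reviewable; the resulting lines can be run in an svn working copy once the list of paths has been agreed.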
Yours,
Jürgen
participants (7)
- Beman Dawes
- Daniel James
- Dave Abrahams
- Gavin Lambert
- Jürgen Hunold
- Niall Douglas
- Philippe Vaucher