Re: [boost] Git Modularization Ready for Review
Niall Douglas writes:
Me personally I'd just chuck away any unmappable historical information. Chalk it up to a cost of transition. If they really need the history, they can go fetch it from the read-only SVN repo.
I see you've not been keeping up with the list lately ;) Daniel et al. suggested doing just that a few months ago and were met with such a chorus of criticism that they didn't really have a choice but to fix it.
Never actually was on this list, not ever, until recently as it's a lot of extra reading. Been involved with Boost for over a decade though.
Personally, I agree with the chorus. After all, the point of a VCS is to have a history of the code's evolution to a point. The VCS, be it SVN, Git, whatever, is just a means to get that history. Jettisoning the ends for the means seems misguided.
No, it's a cost of doing an upgrade. Those of you who have ever done a large CVS to SVN migration know what I mean: stuff gets lost, and it simply isn't important enough to justify a quadrupling of transition effort to preserve it when a read-only copy of the old-technology repo is good enough. Distributed SCM is much more dimorphic again, and you *have* to accept some data loss. Let me put this another way: those who want no history loss ought to be the ones volunteering the additional time to preserve history. What actually happens, sadly, is that the argument then becomes that we must stick with whatever the old technology is, because people will choose a small reduction in productivity every day over fixing tooling once and for all ("programmers program, they don't do tooling"). I *still* find CVS in use in places because "history must never ever be lost", and given the anti-productivity nature of CVS that costs far more in present developer time than accepting a 5% or 10% history loss, most of which will never significantly matter anyway because it occurs on the edges where SCMs dimorph.
What happens in a few years' time when Git is replaced with the next big thing? Do we lose the history again? And then again when that gets replaced too?
That's exactly what happens. Bitrot is always inevitable in the long run. Here call it "non-fatal bitrot" :) My only red line is corruption of past and present source code releases. It must *always* be possible to check out tag X and build release X. Other than that, I'm flexible, including loss of branch integrity, because in the end if that branch is really important its owner will manually fix up the damage.
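Concretely, that red line means something like the following must always keep working (a sketch: the tag name is a placeholder, and the clone URL and bootstrap/b2 steps are just the usual Boost superproject and build procedure):

$ git clone --recursive https://github.com/boostorg/boost.git
$ cd boost
$ git checkout <tag-for-release-X>
$ git submodule update --init --recursive   # move every submodule to the commit recorded for that tag
$ ./bootstrap.sh && ./b2                    # release X must still build, years from now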
I fear that if modularization is taken to its logical extreme, you could see submodules get out of step with other submodules. You may of course have already anticipated this and have implemented something I haven't realized. As I said, I am confused.
Can you explain a bit more about what you mean by out-of-step? The whole point of modularising the code is to *help* modules to get out-of-step and therefore be easier to develop and test independent of what other Boost libraries are doing. But perhaps you mean something else?
Well, the Git way of helping stuff get out of step is that everyone gets their own full copy of the whole repo, and their copy is just as important as anyone else's copy. You clone, you develop and test, and optionally push changes elsewhere which could be your friend, your team, your employer, or of course some central authoritative copy. So I'm afraid I just don't get the present design for a library as small and as tightly integrated as Boost. Something huge and cleanly separated like KDE, sure, but for Boost I suspect it's overkill. Unless Boost plans to grow 10x in the next three years, that is.
[1]: By findable I mean that when Boost library users do #include they get the main Boost repo version, not the submodule version. I absolutely would expect an automated tool to pull headers from submodules, check them for ABI breakage and push them into the main repo. My point is that some sanity check ought to be there.
I'm not following why you would want to do this. Perhaps you can explain what problem you are anticipating and how this would solve it?
Most of Boost is implemented in headers, very much unlike KDE or most other C++ libraries. Moreover, those headers are quite brittle, unlike KDE or most other C++ libraries. If broken into submodules, I can see an apparently innocent change in submodule X appearing to compile and be okay in developer X's set of submodule clones, but silently be a breaking change with a simultaneous change in submodule Y by developer Y. Why this matters is because when in git you go to push, git will force you to merge before push and at that point you "see" the breakage by a conflict appearing (hopefully) and if not then your immediate next compile will fail. With the submodule approach that doesn't happen, so you *don't* see the breakage till much later when the regression tests suddenly start failing. I'm a great believer in refusing to let programmers commit code which breaks other code rather than nagging them later to fix an earlier commit. The point of failure notification ought to be as close to cause as possible.
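For comparison, the monolithic-repo behaviour described above looks roughly like this (a sketch: remote and branch names are placeholders, and ./b2 just stands for whatever the next compile step is):

$ git push origin master
# git refuses a non-fast-forward push here: you must fetch and merge first
$ git pull origin master    # forced to merge the other developer's change right now
$ ./b2                      # and the very next compile exposes any silent breakage immediately
$ git push origin master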
I also don't get what 'findable' means. What would a non-findable header be?
Any internal header only findable by internal implementation. What I'm basically suggesting is an approach where the master repo keeps a gold candidate set of headers automatically extracted regularly from the submodules. Then on push a hook can do appropriate black magic to force the pusher to merge headers before the push. Then stuff which is broken appears broken as soon as possible, rather than suddenly emerging many days later. Does this make sense?
Niall
---
Opinions expressed here are my own and do not necessarily represent those of BlackBerry Inc.
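For concreteness, a hook along these lines could enforce that merge-before-push idea; this is purely a sketch, and check-candidate-headers stands in for whatever header-extraction and ABI-checking tool would actually have to be written:

#!/bin/sh
# Hypothetical server-side pre-receive hook on the master header repo (illustrative only).
# Git feeds "<oldrev> <newrev> <refname>" lines on stdin, one per ref being pushed.
while read oldrev newrev refname; do
    tmp=$(mktemp -d) || exit 1
    # Export the pushed tree and run the (imaginary) sanity check against the current gold set.
    git archive "$newrev" | tar -x -C "$tmp" || exit 1
    if ! check-candidate-headers "$tmp"; then
        echo "Push rejected: headers conflict with the current gold candidate set" >&2
        exit 1
    fi
done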
Niall Douglas writes:
Me personally I'd just chuck away any unmappable historical information. Chalk it up to a cost of transition. If they really need the history, they can go fetch it from the read-only SVN repo.
I see you've not been keeping up with the list lately ;) Daniel et al. suggested doing just that a few months ago and were met with such a chorus of criticism that they didn't really have a choice but to fix it.
[snip]
Personally, I agree with the chorus. After all, the point of a VCS is to have a history of the code's evolution to a point. The VCS, be it SVN, Git, whatever, is just a means to get that history. Jettisoning the ends for the means seems misguided.
No, it's a cost of doing an upgrade. Those of you who have ever done a large CVS to SVN migration know what I mean: stuff gets lost, and it simply isn't important enough to justify a quadrupling of transition effort to preserve it when a read-only copy of the old-technology repo is good enough. Distributed SCM is much more dimorphic again, and you *have* to accept some data loss.
It's a question of degree. Daniel et al.'s recent work shows that you can keep that loss pretty low. "Where there's a will, there's a way?"
[snip]
What happens in a few years' time when Git is replaced with the next big thing? Do we lose the history again? And then again when that gets replaced too?
That's exactly what happens. Bitrot is always inevitable in the long run. Here call it "non-fatal bitrot" :)
My only red line is corruption of past and present source code releases. It must *always* be possible to check out tag X and build release X.
So why have a VCS at all? Named tarballs meet your red line.
Other than that, I'm flexible, including loss of branch integrity, because in the end if that branch is really important its owner will manually fix up the damage.
I fear that if modularization is taken to its logical extreme, you could see submodules get out of step with other submodules. You may of course have already anticipated this and have implemented something I haven't realized. As I said, I am confused.
Can you explain a bit more about what you mean by out-of-step? The whole point of modularising the code is to *help* modules to get out-of-step and therefore be easier to develop and test independent of what other Boost libraries are doing. But perhaps you mean something else?
Well, the Git way of helping stuff get out of step is that everyone gets their own full copy of the whole repo, and their copy is just as important as anyone else's copy. You clone, you develop and test, and optionally push changes elsewhere which could be your friend, your team, your employer, or of course some central authoritative copy.
That works for a monolithic project. Boost is not monolithic, but its single-repo model was forcing it to act like one. As well as making life a little easier for library developers, the modularisation effort should help end users too. For a while now people have been wanting to take only what they need from Boost and ignore the rest. It's especially important in commercial environments where managers (rightly or wrongly) are reluctant to depend on huge quantities of unreviewable code with unknown interactions. Modularisation won't solve the problem overnight, but it helps.
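As a sketch of what that buys end users (the library paths here are assumptions about the boostorg superproject layout, and real use would also need each library's dependencies):

$ git clone https://github.com/boostorg/boost.git     # without --recursive the submodules stay empty
$ cd boost
$ git submodule update --init tools/build libs/filesystem libs/system
# everything else remains an empty directory you never have to review, build or ship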
So I'm afraid I just don't get the present design for a library as small and as tightly integrated as Boost.
You must be talking about some other Boost.
Something huge and cleanly separated like KDE, sure, but for Boost I suspect it's overkill. Unless Boost plans to grow 10x in the next three years, that is.
[1]: By findable I mean that when Boost library users do #include they get the main Boost repo version, not the submodule version. I absolutely would expect an automated tool to pull headers from submodules, check them for ABI breakage and push them into the main repo. My point is that some sanity check ought to be there.
I'm not following why you would want to do this. Perhaps you can explain what problem you are anticipating and how this would solve it?
Most of Boost is implemented in headers, very much unlike KDE or most other C++ libraries. Moreover, those headers are quite brittle, unlike KDE or most other C++ libraries. If broken into submodules, I can see an apparently innocent change in submodule X appearing to compile and be okay in developer X's set of submodule clones, but silently be a breaking change with a simultaneous change in submodule Y by developer Y.
I think the idea is that it allows you to mix your own Boost release by choosing the set of submodules to combine. The main boostorg set is the official mix, but not the only one. If the libraries were just developed as clones on the boostorg repo, you would be forced to choose the exact set of library versions that one particular library was developed against.
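A sketch of what mixing your own release could look like in practice (URL, paths and the revision name are placeholders):

$ git clone --recursive https://github.com/boostorg/boost.git myboost
$ cd myboost/libs/filesystem
$ git fetch origin && git checkout <some-other-revision>   # pick a different version of just this library
$ cd ../..
$ git add libs/filesystem                                  # records the new submodule commit in my superproject
$ git commit -m "My mix: filesystem pinned differently from the official set"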
Why this matters is because when in git you go to push, git will force you to merge before push and at that point you "see" the breakage by a conflict appearing (hopefully) and if not then your immediate next compile will fail. With the submodule approach that doesn't happen, so you *don't* see the breakage till much later when the regression tests suddenly start failing.
I'm a great believer in refusing to let programmers commit code which breaks other code rather than nagging them later to fix an earlier commit. The point of failure notification ought to be as close to cause as possible.
Surely the developer sees the breakage the moment they update the versions of the other libraries they are working against. And then they can fix it at their leisure while early adopters get to use their code with the previous version of the conflicting library. I can't find the details written down anywhere but I vaguely remember there is a way the developers are supposed to signal that a particular version is suitable for use with the master libraries or something like that. It would be good to see this in black and white.
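I can only guess at the shape of that signal, but one plausible convention (an assumption on my part, not the documented process) is the maintainer marking a revision as integration-ready and the superproject only ever advancing its recorded submodule commit to such revisions:

# In the library's own repo (the tag name is made up):
$ git tag -a suitable-for-master -m "Tested against the current master mix"
$ git push origin suitable-for-master
# In the superproject, whoever maintains the official mix advances the pin:
$ cd libs/somelib && git fetch origin && git checkout suitable-for-master && cd ../..
$ git add libs/somelib
$ git commit -m "Advance somelib to its latest integration-ready revision"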
I also don't get what 'findable' means. What would a non-findable header be?
Any internal header only findable by internal implementation. What I'm basically suggesting is an approach where the master repo keeps a gold candidate set of headers automatically extracted regularly from the submodules.
That's pretty much what happens. The boostorg repo will have the 'gold candidate' of submodule hashes which are known to work together. I don't get what 'extracting' the headers would add to that.
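For what it's worth, that gold candidate is just the set of gitlink entries committed in the superproject, and you can inspect it directly (output shape sketched, hashes elided):

$ git submodule status
 <sha1> libs/filesystem (...)
 <sha1> libs/system (...)
 ...
# Changing the official mix is nothing more than committing new <sha1> values in boostorg/boost.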
Then on push a hook can do appropriate black magic to force the pusher to merge headers before the push. Then stuff which is broken appears broken as soon as possible, rather than suddenly emerging many days later.
When you are talking about breakages, are you talking about merge conflicts or compile/test failures? Up till now I've assumed you mean the latter. But now I'm thinking you might be asking what happens if the developer of library X makes a change to header H and developer Y also makes a change to _the same header_? The way the modularisation works, I'm not even sure this is possible. The libraries are meant to be self-contained so you only ever change your own headers. It's a good question and one I hadn't thought about. Perhaps the people working on this can weigh in here.
Alex
--
Swish - Easy SFTP for Windows Explorer (http://www.swish-sftp.org)
on Fri May 10 2013, Niall Douglas writes:
Me personally I'd just chuck away any unmappable historical information. Chalk it up to a cost of transition. If they really need the history, they can go fetch it from the read-only SVN repo.
I see you've not been keeping up with the list lately ;) Daniel et al. suggested doing just that a few months ago and were met with such a chorus of criticism that they didn't really have a choice but to fix it.
Never actually was on this list, not ever, until recently as it's a lot of extra reading. Been involved with Boost for over a decade though.
Personally, I agree with the chorus. After all, the point of a VCS is to have a history of the code's evolution to a point. The VCS, be it SVN, Git, whatever, is just a means to get that history. Jettisoning the ends for the means seems misguided.
No, it's a cost of doing an upgrade. Those of you who have ever done a large CVS to SVN migration know what I mean: stuff gets lost, and it simply isn't important enough to justify a quadrupling of transition effort to preserve it when a read-only copy of the old-technology repo is good enough. Distributed SCM is much more dimorphic again, and you *have* to accept some data loss.
Nothing important need be lost. We're not losing any commits. SVN does represent merge information as an exact set of contributing commits rather than as a chain of history, and trying to preserve all that exactly would be... dumb.
Let me put this another way: those who want no history loss ought to be the ones volunteering the additional time to preserve history. What actually happens, sadly, is that the argument then becomes that we must stick with whatever the old technology is, because people will choose a small reduction in productivity every day over fixing tooling once and for all
I didn't want to do the job of modularizing history, but now that it's close-to-finished, arguing for someone else to do it seems... pointless.
My only red line is corruption of past and present source code releases. It must *always* be possible to check out tag X and build release X.
Which is why we're keeping a read-only copy of SVN. Modularized repositories are never going to reassemble to form an exact picture of history, because several files from the same SVN directory need to be sorted into different repos.
--
Dave Abrahams
participants (3):
- Alexander Lamaison
- Dave Abrahams
- Niall Douglas