On Tue, Feb 1, 2011 at 5:20 AM, Steven Watanabe wrote:
AMDG
On 1/30/2011 4:35 PM, Dean Michael Berris wrote:
With subversion, each commit is a different revision number right? Therefore that means there's only one state of the entire repository including private branches, etc.
Yes, but I don't see how that's relevant. How can the repository be in more than one state? Now if only we had quantum repositories...
That's important because in a DVCS, there's no one repository. Therefore that means each and every repository clone of the original canonical repository out there has its own state. However in Git, every commit has its own identity which knows who its parents are. That means you can then apply many different commits from many places, conglomerate them into your local repository, and reflect that tree onto the canonical repo if you're the maintainer. This then allows everyone else to merge in this tree into their local repo and therefore you get the distributed and scalable aspect for multi-developer projects. It's really an important distinction.
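A minimal sketch of the "every commit has its own identity which knows who its parents are" idea. This is a simplified model, not git's actual object format (a real commit hash also covers the tree, author, committer and timestamps); `commit_id` and the snapshot strings are hypothetical names made up for illustration.

```python
import hashlib


def commit_id(parents, snapshot, message):
    """Derive an id from the parent ids, a content snapshot and the message.

    Simplified stand-in for a content-addressed commit; not git's real format.
    """
    h = hashlib.sha1()
    for parent in parents:
        h.update(b"parent " + parent.encode() + b"\n")
    h.update(b"snapshot " + snapshot.encode() + b"\n")
    h.update(b"message " + message.encode() + b"\n")
    return h.hexdigest()


root = commit_id([], "v0", "initial import")
mine = commit_id([root], "v1-mine", "my change")
yours = commit_id([root], "v1-yours", "your change")

# A merge commit simply names both parents.
merged = commit_id([mine, yours], "v2", "merge yours into mine")

# The same inputs produce the same id in any clone, with no central counter.
same_in_another_clone = commit_id([root], "v1-mine", "my change")
```

Because the id depends only on the commit's own contents and its parents, two clones can agree on what a commit is without asking a server for the next revision number.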
What then happens when you have two people trying to merge from two different branches into one branch. Can you do this incrementally?
What do you mean by that? I can merge any subset of the changes, so I can split it up if I want to, or I can merge everything at once.
I mean let's say we have this tree in subversion:

    trunk r1
      |---- your branch
      |---- my branch

Say I merge in changes from trunk to my branch, then you do the same at a slightly later time, and we both try to merge back into trunk. In subversion we would have to do that in a single commit each. With Git, merging the trees of my branch and your branch is a single command, and is largely automatic -- if we share commits from trunk in our branches, Git knows what to do and a lot of the conflicts are largely really just the conflicts on changes we both have that need resolving. We do the resolving locally so we both can keep trying to run into each other's merge race, but then the canonical repo we're both working on maintains a single tree.
How would you track the single repository's state?
Each commit is guaranteed to be atomic.
And so that means the state I have in my working copy is not a unique state. That means, if I've made a ton of changes locally that I haven't committed yet and 80% of those changes conflict with changes in the central repo, I just say "OH FML"? Note that even if I do have a branch, synchronizing changes back into the source branch will be a PITA.
How do you avoid clobbering each other's incremental merges?
If the merges touch the same files, the second person's commit will fail. This is a good thing because /someone/ has to resolve the conflict. Updating and retrying the commit will work if the tool can handle the merge automatically. (I personally always re-run the tests after updating, to make sure that I've tested what will be the new state of the branch even if there were no merge conflicts.)
We have the same workflow except in Git, I don't have a chance to mess up the canonical repo if I'm not the maintainer. What happens with Git is that if I am a co-maintainer of a library's repo with you, then if I push and the merge that happens upstream is not a "fast-forward" (a fast-forward being the case where upstream has nothing I lack, so the push is just a branch-pointer update and no merge is needed) then I have to pull the changes and merge the commits locally -- this is not the case with Subversion as I have to have exactly the same revision number in my working copy for me to be able to commit anything. Note that git works as a series of patches laid out in a DAG and each commit is unique, meaning each commit can be transplanted from branch to branch and the identity is maintained even if you had it in a ton of branches (or repositories).
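To make the fast-forward condition concrete, here is a small sketch over a toy commit graph. A push is a fast-forward when the upstream tip is already an ancestor of the commit being pushed. The graph, the commit names and the helper functions are all hypothetical; this models the rule, not git's implementation.

```python
def ancestors(commit, parents):
    """Return `commit` plus every commit reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def is_fast_forward(remote_tip, local_tip, parents):
    """True when moving the remote branch to `local_tip` loses no history."""
    return remote_tip in ancestors(local_tip, parents)


# Toy history: A <- B, then C (mine) and D (yours) both build on B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

# Upstream is still at B when I push C: just a pointer update.
push_onto_b = is_fast_forward("B", "C", parents)

# You pushed D first, so upstream is at D: I have to pull and merge locally.
push_onto_d = is_fast_forward("D", "C", parents)
```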
Remember you're assuming that you're the only one trying to do the merge on the same code in a single repository. Consider the case where you have more than just you merging from different branches into the same branch.
In git, merging N remote-tracking branches into a single branch is possible with a single command on a local repo -- if you really wanted to do it that way.
This would require N svn commands. (Of course if I did it a lot I could script it. It really isn't a big deal.)
Not only that, it would also require your working copy to be sync'ed with repo's state for the branch for which you want to do the commit. This synchronization is a killer in multi-developer projects touching the same code base.
Of course you already stated that you don't want automated tools so if you *really* wanted to inspect the merge one commit at a time you can actually do it interactively as well.
I didn't say that I didn't want automated tools. I said that I didn't trust them. With svn that means that, before I commit I always
a) run all the relevant tests
b) review the full diff
And that workflow is very much supported in git as well. You can review full diffs in git as well. And you can commit locally and test everything locally and when you're satisfied you can then ask the maintainer to (or if you're the maintainer, just) push the changes up to the publicly accessible repo. Every other developer then synchronizes their own local repo by merging in changes from upstream (the canonical repo) and then stabilizing their local repos at their own pace. What's important here is the "at their own pace" part.
This is regardless of whether I'm committing new changes or merging from somewhere else.
Agreed. And this is precisely the same workflow that is made a lot easier by Git. Not only is it easy, it's crazy fast as well.
With subversion what happens is everyone absolutely has to be on the same page all the time and that's a problem.
It isn't a problem unless you're editing the same piece of code in parallel. If you find that you're stomping on each other's changes a lot
a) The situation is best avoided to begin with. The version control tool can only help you so much, no matter how cool it is. No tool is ever going to be able to resolve true merge conflicts for you.
Right, but in the cases where lots of people are touching the same code base, a tool that supports this workflow better is best suited for that situation. If we want to keep Boost as a "read, but don't touch" open source project where only a handful of people get the privilege of mucking the single repository up, then I guess there's no point in having a discussion on using Git because subversion is perfect for that. If we're in agreement that we don't want more contributors pitching in to the same codebase then I withdraw the suggestion to use Git and use it in my projects instead.
b) Working in branches will buy you about as much as using a DVCS as far as putting off resolving conflicts is concerned.
Honestly, if you assume the worst case, and don't use the tool intelligently, you're bound to get in trouble. I'm sure that I could invent cases where I get myself in trouble (mis-)using git that work fine with svn.
I think there's a fundamental impedance mismatch when you compare developing an open source project with 100 people to having just two or three people touching the same code. In the simplest scenario, where fewer than a handful of people are touching the code, heck, tarballs and exchanging patches should work just fine -- you're going to resolve conflicts anyway. But if you have a lot more people doing this then you have a choice: either use a tool that supports thousands of concurrent developers, or use one that supports maybe a few tens of developers.

Branches in git are a different beast from branches in subversion. Basically in subversion, you're copying a snapshot of the code and making changes on top of that. When merging you take commits that are made from one branch into another using the repository revision number as the identifier for changes made in the code. That's alright if you only had one repository and just a few developers touching the code and doing the merging -- now scale that to a hundred people touching the same code and having one branch each, then you start seeing how just branches won't cut it. This is true for any project, and for Boost especially if it wants to support a lot more contributors than it already has -- imagine a hundred people working on the containers or algorithms collections at the same time.

In Git, branches themselves are basically sub-trees where each and every sub-branch (a branch from a branch) can be transplanted from one branch to any other branch. And then you have the distributed nature of the beast wherein your branches can be tracking remote branches -- meaning they will be synchronizing with the remote branch's tree. So there's no "one state" of the whole project except the one in the repo that the developers agree is canonical.
Then that means anybody can be working on Boost libraries and porting them to a platform that nobody else in the current Boost pool of developers has, and making it stable until such time that they see it fit to contribute changes back upstream -- this means they didn't need to get anybody's permission to muck around with Boost to get commit access to the repository so that they can work on things that matter to them *and keep a record of the changes locally*. The same goes for people who just want to maintain local Boost repositories for their own organizations and would want, for example, to fix all warnings and not have to submit those changes until they're ready later on.
The maintainer can then do the adjustments on the history of the repo -- things like consolidating commits, etc. -- which largely is really what maintainers do,
Is it? I personally don't want to spend a lot of time dealing with version control--and I don't. The vast majority of my time is spent writing code or reviewing patches or running tests. All of which are largely unaffected by the version control tool.
Of course in Boost, what happens is maintainers are largely the same developers of the project as well. Which is odd for an open source project of the magnitude and importance of Boost. If you don't want to spend a lot of time dealing with version control then git is precisely the tool you want. If you spend a couple of seconds (or maybe a minute) committing things or merging them to the single Boost subversion repository, then with git you would spend a fraction of that time (an order of magnitude less). Benchmarks abound comparing performance of git against subversion in most of these routine operations showing how git is much more efficient and better at staying out of your way than subversion is.
only with git it's just a lot easier.
It isn't just easier with git, it's basically impossible with svn. In svn, the history is strictly append only. (Of course, some including me see this as a good thing...)
In publicly-accessible Git repositories, it is encouraged that history is preserved so that those that clone from it and build upon it see a "truthful" version of the code. But precisely because you can muck around with your local commits before submitting patches upstream, this flexibility allows you to do things like that on your local repository. Just one of those things that changes the workflow and allows developers to improve things locally *incrementally* and synchronize later only when it's necessary.
b) Why would I want to try it several different ways? I always know exactly what I want to merge before I start.
Which is also the point with git -- because you can choose which changesets exactly you want to take from where into your local repository. The fact that you *can* do this is a life saver for multi-developer projects -- and because it's easy it's something you largely don't have to avoid doing.
This doesn't answer the question I asked.
Of course you're looking at the whole thing with centralized VCS in mind. Consider the case that you have multiple remote branches you can pull from. If you're the maintainer and you want to basically consolidate the effort of multiple developers working on different parts of the same system, then you can do this piece-meal.
For example, you, Dave Abrahams, and I are working on some extensions to MPL. Let's just say for the sake of example.
I can have published changes up on my github fork of the MPL library, and Dave would be the maintainer, and you would have your published changes up on your github fork as well. Now let's say I'm not done yet with what I'm working on but the changes are available already from my fork. Let's say you tell Dave "hey Dave, I'm done, here's a pull request". Dave can then basically do a number of things:
1.) Just merge in what you've done because you're already finished and there's a pull request waiting. He does this on his local repo first to run tests locally -- once he's done with that he can push the changes to the canonical repo.
2.) Pull in my (not yet complete) changes first before he tries to merge your stuff in to see if there's something that I've touched that could potentially break what you've done. In this case Dave can notify you to pull the changes I've already made and see if you can work it out to get things fixed again. Or he can notify me and say "hey fix this!".
3.) Ask me to pull your stuff and ask me to finish up what I'm doing so that I can send a pull request that actually already incorporates your changes when I'm done.
... ad infinitum.
4.) Dave isn't paying attention, so nothing happens. A couple years later, after we've both moved on to other things, he notices my changes and decides that they're good and merges them. ...More time passes... He sees your changes and they look reasonable, so he tries to merge them. He gets a merge conflict and then notifies you asking you to update your feature. You are no longer following Boost development, so the changes get dropped on the floor. ...A few more years go by... Another developer finds that he needs your stuff. He resolves the conflicts with the current version and the changes eventually go into the official version.
This is something like how things seem to work in practice now, and I don't see how using a different tool is going to change it.
And this is so easy to fix with git because then if Dave the maintainer isn't paying attention, either one of us can ping a release manager or let everybody know that "hey, we're trying to consolidate changes here but Dave isn't paying attention!" and thus someone can pick either one of our repositories as the "canonical" repo for the library. Of course that promotes either one of us to be the maintainer -- it's a much more fluid process that is explicitly supported and encouraged by the git workflow. This is the insurance mechanism and the "business continuity process" that is built into distributed version control systems like git, mercurial, bazaar, etc.
With subversion, there's no way for something like this to happen with little friction.
Why not? Replace "github fork" with "branch" and subversion supports everything that you've described.
If you made your subversion repository publicly accessible without need for authenticating who the user is to be able to commit changes then that would be true. Otherwise as it stands at the moment you need permission to even touch the Boost repository. And this part turns a lot of people away from wanting to contribute because the other way around it is to submit a patch in Trac -- which is quite honestly painful and time consuming as heck.
First we can't be working on the same code anyway because every time we try to commit we could be stomping on each other's changes and be spending our time just cursing subversion as we wait for the network traffic and spend most of our time just trying to merge changes when all we want to do is commit our changes so that we can record progress. Second we're going to have to use branches and have "rebasing" done manually anyway just so that we can all stay synchronized all the time --
What do you mean by "rebasing." Subversion has no such concept. If you want to stay synchronized constantly, you can. If you want to ignore everyone else's changes, you can. If you want to synchronize periodically, you can. If you want to take specific changes, you can. What's the problem?
The concept of rebasing is really simple:

1. I branch from trunk revision 1, and make changes until revision 30.
2. Between r1 and r30 some things change in trunk.
3. I want to bring my branch up to date with the changes made in trunk from r1 to r30, so I 're-base' by pulling the changes from trunk into my branch up to r30.
4. I have to (or subversion has to) remember that I've already merged in changes up to r30 so the next time I do the same operation, I don't try to pull in the changes that are already there.
5. When I commit r31, then I have effectively rebased my branch to trunk r30.

OTOH with git, we can just be working on our local master tracking the canonical master, and just keep making changes willy-nilly locally. When we want to push to the repository only then would we want to actually merge in changes. That's supported sure, and that's no better than the subversion approach. BUT ... with git you and I can work on separate local branches that fork off from master. We can keep making changes in that branch and then later on once we're ready to integrate back to master, we do that locally (we might even squash commits from the local branch so that we can submit a single big-ass patch to the upstream maintainer). That doesn't seem enticing at first but then imagine 20 or 100 of us doing that to the same source code and you'll quickly see why the subversion approach isn't going to scale and can potentially hold our individual progress up, not just the progress of the whole project.
c) Even if I were merging by trial and error, I still don't understand what makes a distributed system so much better. It doesn't seem like it should matter.
Because in a distributed system, you can have multiple sources to choose from and many different ways of globbing things together.
So, what I'm hearing is the fact that you have more things to merge makes merging easier. But that can't be what you mean, because it's obviously nonsense. Come again?
Yes, that's exactly what I mean.
Apparently not, since your answer flips around what I said.
I meant, with *git* and because merging is almost as painless as possible most of the time, having more things to merge from makes it easier. The logic is really simple: if I can pick from more sources of things to merge in, I can do that all at the same time and if things fail, back out and exclude a source and see if things go fine. I can then isolate which things I would merge without having to resolve manual conflicts, push those to the canonical repo, and as a maintainer just tell the other sources "hey, synchronize with the state now and try again". That means it's easier for me as a maintainer now that I don't have to manually figure everything out, I can have other sources deal with it for me if they really want to have their stuff included. The carrot is that your changes get into Boost, the stick is your pull request has to apply cleanly.
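The "try every source, back out the ones that fail, and push what applied cleanly" routine described above can be sketched roughly as follows. This is a deliberately crude model: a source is reduced to the set of files it touches, and two sources "conflict" if they touch the same file, whereas git merges at a much finer granularity. The contributor names, file names and `pick_clean_sources` are all hypothetical.

```python
def pick_clean_sources(sources):
    """Split sources into those that apply cleanly, in order, and those that do not.

    `sources` maps a contributor name to the set of files their changes touch.
    """
    accepted, rejected = [], []
    touched = set()
    for name, files in sources.items():
        if touched.isdisjoint(files):
            # Nothing already merged touches these files: take the source.
            accepted.append(name)
            touched |= files
        else:
            # Overlaps with something already merged: back it out and
            # ask the contributor to resynchronize and try again.
            rejected.append(name)
    return accepted, rejected


sources = {
    "alice": {"vector.hpp", "list.hpp"},
    "bob": {"map.hpp"},
    "carol": {"list.hpp", "set.hpp"},
}
accepted, rejected = pick_clean_sources(sources)
```

In this toy run the first two sources are taken and the third is sent back, which is the "carrot and stick" arrangement: the maintainer only integrates what applies cleanly.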
Because merging is easy with git and is largely an automated process anyway,
If you will recall, the question I started out with is: "What about a distributed version control system makes merging easier?" That question remains unanswered.
I've said yes all the while here and that was largely because once you've tried it and have been in a project where the distributed thing is actually done, you'd see that having everything be synchronized is just a waste of time.
The best I've gotten is "git's automated merge is smart," but it seems to me that this is orthogonal to the fact that git is a DVCS.
Why? Because it's distributed is precisely why merging is so much easier.
merging changes from multiple sources when integrating for example to do a "feature freeze" and "stabilization" by the release engineering group is actually made *fun* and easier than if you had to merge every time you had to commit in an actively changing codebase.
I've never run into this issue. a) Boost code in general isn't changing that fast.
Which I suppose is due to:

1. Lack of active contributors.
2. The contribution process requires all sorts of permissions and front-loaded work from potential contributors, which means that even before people start contributing they're turned off by the rigidity of the process and the toolset, leading to 1 above.
3. See 1 above.
b) My commits are generally "medium-sized." i.e. Each commit is a single unit that I consider ready to publish to the world. For smaller units, I've found that my memory and my editor's undo are good enough. Now, please don't tell me that I'm thinking like a centralized VCS user. I know I am, and I don't see a problem with it, when I'm using a centralized VCS.
Now, with a DVCS you don't have to rely on your memory too much or your editor's undo limit. This also, if I may say so myself, isn't a scalable way of doing it. There are local branches for that sort of thing and if you want to submit a singular patch (a squashed merge into a single commit) then that's *trivial* to do with git.
c) There's nothing stopping you from using a branch to avoid this problem. If you're unwilling to use the means that the tool provides to solve your issue, then the problem is not with the tool.
But having to branch in a central repo compared to branching a local repo is the difference between night in the jungle and day by the beach, respectively.
Have you ever heard of branches? Subversion does support them, you know.
And have you tried merging in changes from N different branches into your private branch in Subversion to get the latest from other developers working on the same code? Because I have done this with git and it's *trivial*.
I've never wanted to do this, but unless there are conflicts, it should work just fine. If there are conflicts, you're going to have to resolve them one way or another regardless of the version control tool.
Unfortunately, that's not as easy as you make it sound with subversion. Let's take a simple example:

1. I branch off trunk r1
2. Developer B branches off trunk r99
3. Developer C branches off trunk r1000

Now I want to merge changes from Developer B's branch into my branch so that I can try it out. That's fine because I'll be pulling the changes from r1 to r99. Now let's take the reverse, Developer C wants to pull from my branch, what happens? Hell breaks loose because he doesn't have the history in his branch about the state that was trunk r1..999. This is the kind of thing I mean when I say git makes it easy -- because you know the whole history up front, even if two branches were branched off different points in the tree, there's no problem making that merge and trying to replay your changes on top of things. Of course the likelihood that you'll see conflicts is dependent on the parts of the code that are being touched, but the fact that *it's possible* is just powerful. Now scale the above to 10, 20, 50 developers and you'll see why the centralized model breaks down.
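As a rough illustration of why knowing the whole history helps here: when every commit records its parents, the point where two branches diverged can be computed from the graph, wherever each branch forked. The sketch below models the three-developer example with stand-in names (`t1`, `t99`, `t1000` for trunk r1, r99, r1000); the graph and helper names are hypothetical, and this is a model of the idea rather than git's algorithm.

```python
def ancestors(commit, parents):
    """Return `commit` plus every commit reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def merge_bases(a, b, parents):
    """Return the most recent common ancestors of commits `a` and `b`."""
    common = ancestors(a, parents) & ancestors(b, parents)
    # Drop any common ancestor that is itself reachable from another one.
    return {
        c for c in common
        if not any(c != d and c in ancestors(d, parents) for d in common)
    }


# Trunk is a straight line; each developer's branch tip hangs off a trunk commit.
parents = {
    "t1": [], "t99": ["t1"], "t1000": ["t99"],
    "mine": ["t1"], "dev_b": ["t99"], "dev_c": ["t1000"],
}

base_mine_b = merge_bases("mine", "dev_b", parents)
base_mine_c = merge_bases("mine", "dev_c", parents)
base_b_c = merge_bases("dev_b", "dev_c", parents)
```

With the common base known, a three-way merge only has to reconcile what each side changed since that base, regardless of which direction the merge goes.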
Also are you really suggesting that Linux development would work with thousands of developers using subversion to do branches? Do you expect anybody to get anything done in that situation? And no that's not a rhetorical question.
It might overload the server. That's a legitimate concern. But other than that, I don't see why not. (However, since I have nothing to do with Linux development, I may be totally wrong.)
It's not just that. Nobody would want to be merging anything with subversion that way. And imagine if everyone asked Linus to do the merge for them into his branch. That just isn't the scalable way to go. Oh and not to mention the administration nightmare of managing thousands of usernames and passwords, worrying about backups, the insane checkouts and switches required, etc.

HTH

--
Dean Michael Berris
about.me/deanberris