ryppl / boost2git

Conversion to Git for Boost

Home Page: http://jenkins.boost.org/job/Boost2Git

Emacs Lisp 1.64% Shell 0.26% Python 1.26% C++ 96.83%

boost2git's Introduction

Boost2Git

This project converts an SVN repository into multiple Git repositories, optionally registering each repository as a submodule in some other Git repository. It started out as KDE's svn2git tool, but has been almost completely rewritten, to the point where very little of the original code remains.

There were many reasons for our initial deviations from svn2git, but the heart of the original program was still there until we discovered it was producing nonsense results. When we evaluated the core logic, it became clear that the svn2git approach was insufficiently general to correctly handle our branch and directory mapping structure. Our rewrite requires C++11.

In the rewrite, we dropped several features of svn2git that aren't needed for Boost, most notably incremental conversions. The dropped features could be brought back without too much difficulty, but unless someone else takes over maintenance of this project, they are unlikely to get addressed. The issue tracker is our record of what can or should still be done.

For any substantially large SVN-to-Git conversion + modularization job, if you start with today's technology, some amount of coding will be necessary. Because it is quite general and fairly clean, Boost2Git is probably a good starting point.

At the time of this writing, Boost is being continuously converted into these Git repositories.

boost2git's Issues

Performance improvement

The current algorithm for finding a matching rule is quite slow. It loops linearly through all match rules and checks a) the minimum revision, b) the maximum revision, and c) the prefix. The first rule that matches is chosen, so the order of the rules matters.

Searching for the longest matching prefix instead of the first one can improve both performance (e.g. by using a radix tree) and usability: repositories.txt could list repositories in any order, whereas currently graph_parallel must come before graph.
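The longest-prefix lookup proposed above can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` struct, its field names, and the `RuleTrie` class are hypothetical and are not taken from the Boost2Git sources, and the trie is a plain character trie rather than a compressed radix tree. Because matching is character-wise, prefixes should end in "/" so that "graph" does not also match "graphics".

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical rule record; the real ruleset carries more fields.
struct Rule {
    std::string prefix;   // SVN path prefix, e.g. "/trunk/libs/graph/"
    long min_rev;         // first revision the rule applies to
    long max_rev;         // last revision the rule applies to
    std::string repo;     // destination Git repository
};

// Uncompressed character trie; a radix tree would merge single-child chains.
class RuleTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        std::vector<Rule> rules;  // rules whose prefix ends at this node
    };
    Node root_;

public:
    void insert(const Rule& rule) {
        Node* node = &root_;
        for (char c : rule.prefix) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->rules.push_back(rule);
    }

    // Returns the rule with the longest prefix of `path` that is valid at
    // `rev`, or nullptr if none matches. Insertion order is irrelevant.
    const Rule* find(const std::string& path, long rev) const {
        const Rule* best = nullptr;
        const Node* node = &root_;
        std::size_t i = 0;
        for (;;) {
            // A deeper node means a longer prefix, so a valid rule found
            // here overrides any match found closer to the root.
            for (const Rule& r : node->rules)
                if (r.min_rev <= rev && rev <= r.max_rev) { best = &r; break; }
            if (i == path.size()) break;
            auto it = node->children.find(path[i++]);
            if (it == node->children.end()) break;
            node = it->second.get();
        }
        return best;
    }
};
```

Lookup cost depends on the length of the path rather than on the number of rules. If the longest prefix has no rule valid at the requested revision, the search falls back to the next shorter prefix that does.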

Document rule DSL

We can't ask people to submit rule edits if they don't understand how the rules work.

Do some cleanup on the ruleset

For example, there are lots of branch rules in common_branches that really are specific to a given repository. It would greatly simplify thinking about how the rules work if these were localized to specific repository sections. If we wrote some post-processing code to dump information about branch rules that are only matched in a single repository, that would be easy to automate.

Incremental conversion breaks .gitmodules

It seems the information about existing submodules is lost when doing an incremental Boost2Git conversion.

Workaround: Run a full conversion.
Solution: Read the .gitmodules file on startup.

Explain unmatched source directories

I'm seeing these in the log, which I'm at a loss to explain:

++ WARNING: SVN reports a "copy from" @38875 from /trunk@38874 but no matching rules found! Ignoring copy, treating as a modification

++ WARNING: SVN reports a "copy from" @38875 from /trunk@38874 but no matching rules found! Ignoring copy, treating as a modification

++ WARNING: SVN reports a "copy from" @39706 from /tags/Version_1_34_1/boost/boost/algorithm@39705 but no matching rules found! Ignoring copy, treating as a modification

(etc)

Seems like there must be a bug somewhere.

Comment more of the code more thoroughly

Unless this codebase is going to die with the Boost transition to Git, it will need many more very clear comments so that someone else can pick it up and use it later.

Record SVN revision numbers somewhere in Git

The original svn2git had code to write revision numbers into the commit messages and/or into Git notes. We have neither.
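One way to record the revision in the commit message is to append a trailer line, sketched below. The function name and the `svn-revision:` trailer layout are made up for illustration. They are not the format svn2git used, nor an existing convention of this project.

```cpp
#include <string>

// Appends a trailer such as "svn-revision: 38875 (/trunk)" to a commit
// message. The trailer name and layout are illustrative assumptions.
inline std::string with_svn_revision(std::string message, long revision,
                                     const std::string& svn_path) {
    // Drop trailing newlines so exactly one blank line precedes the trailer.
    while (!message.empty() && message.back() == '\n') message.pop_back();
    if (!message.empty()) message += "\n\n";
    message += "svn-revision: " + std::to_string(revision) +
               " (" + svn_path + ")\n";
    return message;
}
```

A trailer in the message is visible in every clone but changes the commit hashes. Git notes would leave the hashes untouched at the cost of being fetched separately.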

Create superproject

Deal with executable files

Right now every file is written as a plain text file; executables need to be written with a different "mode" string in the input to git-fast-import.
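For reference, the "M" (filemodify) command of git fast-import takes the mode as its first argument: 100644 for a regular file, 100755 for an executable, and 120000 for a symlink. The sketch below picks the mode from a file kind. The `FileKind` enum and both function names are hypothetical; in SVN the executable bit would come from the svn:executable property. Paths that need quoting are not handled.

```cpp
#include <string>

// File kinds a converter has to distinguish when emitting "M" commands.
enum class FileKind { Regular, Executable, Symlink };

// Mode strings accepted by git fast-import's filemodify ("M") command.
inline const char* fast_import_mode(FileKind kind) {
    switch (kind) {
        case FileKind::Executable: return "100755";
        case FileKind::Symlink:    return "120000";
        case FileKind::Regular:    break;
    }
    return "100644";
}

// Formats "M <mode> :<mark> <path>\n", referring to a blob by its mark.
inline std::string filemodify(FileKind kind, unsigned long mark,
                              const std::string& path) {
    return std::string("M ") + fast_import_mode(kind) + " :" +
           std::to_string(mark) + " " + path + "\n";
}
```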

Suppress duplicate lines in the output

That would improve readability; see http://jenkins.boost.org/job/Boost2Git/324/console for example.

Remove redundant merge parents

This is not yet working.

If you look at boostorg/smart_ptr@af86d7fd85cc68 in the Network Graph, you can quite clearly see that it merges boostorg/smart_ptr@fb2886a8 and its ancestor boostorg/smart_ptr@92a049a (among many others). The latter is totally redundant as far as Git is concerned.
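The redundancy described above (a merge parent that is already an ancestor of another merge parent) can be detected with a reachability check. The sketch below works on an in-memory graph. `CommitGraph`, `is_ancestor`, and `independent_parents` are hypothetical names, and a real converter would presumably consult its own commit bookkeeping instead.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Commit graph as "commit -> its parents"; IDs are plain strings here.
typedef std::map<std::string, std::vector<std::string>> CommitGraph;

// True if `ancestor` is reachable from `commit` by following parent links.
inline bool is_ancestor(const CommitGraph& graph, const std::string& ancestor,
                        const std::string& commit) {
    std::set<std::string> seen;
    std::vector<std::string> stack(1, commit);
    while (!stack.empty()) {
        std::string current = stack.back();
        stack.pop_back();
        if (current == ancestor) return true;
        if (!seen.insert(current).second) continue;  // already visited
        CommitGraph::const_iterator it = graph.find(current);
        if (it != graph.end())
            stack.insert(stack.end(), it->second.begin(), it->second.end());
    }
    return false;
}

// Keeps only parents that are not ancestors of another listed parent,
// preserving the order of the survivors.
inline std::vector<std::string> independent_parents(
        const CommitGraph& graph, const std::vector<std::string>& parents) {
    std::vector<std::string> kept;
    for (std::size_t i = 0; i < parents.size(); ++i) {
        bool redundant = false;
        for (std::size_t j = 0; j < parents.size() && !redundant; ++j)
            if (i != j && parents[i] != parents[j])
                redundant = is_ancestor(graph, parents[i], parents[j]);
        // Also drop exact duplicates of an earlier parent.
        for (std::size_t j = 0; j < i && !redundant; ++j)
            redundant = parents[i] == parents[j];
        if (!redundant) kept.push_back(parents[i]);
    }
    return kept;
}
```

Note that if the first parent is itself an ancestor of another parent, it is dropped too, which changes the first-parent history. Whether that is acceptable is a policy decision this sketch does not make.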

Lost submodules

bdbb694 was supposed to address a problem that apparently persists. Assigning this to Daniel so that he can provide a complete description and testcase.

Translate symlinks

I don't think we have any in Boost's SVN, but they need a different "mode" string in the git-fast-import input.
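If symlinks do turn up, the content needs translating as well as the mode. SVN stores a symlink as a file carrying the svn:special property whose content is "link " followed by the target, whereas Git expects a blob containing only the target, written with mode 120000. A minimal sketch, with a hypothetical function name:

```cpp
#include <string>

// Converts the stored content of an SVN special file into the blob content
// Git expects for a symlink (the bare target path). Returns false when the
// content does not have the "link <target>" form.
inline bool svn_symlink_target(const std::string& svn_content,
                               std::string& target) {
    const std::string prefix = "link ";
    if (svn_content.compare(0, prefix.size(), prefix) != 0) return false;
    if (svn_content.size() == prefix.size()) return false;  // empty target
    target = svn_content.substr(prefix.size());
    return true;
}
```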

Make it incremental

The original svn2git codebase could restart the conversion process where it left off, for cases when SVN is just receiving new commits. This is not hard to implement, but not necessarily required for Boost.

Put information about line numbers into ruleset

Useful diagnostics from svn2git would be a whole lot more useful if the ruleset contained line number information about the participating rules. I tried to implement this myself but couldn't figure it out. Assigning to @purpleKarrot since he's the one with the Spirit chops.

Optimize ancestry computation

Instead of traversing the SVN filesystem, do it by examining SVN subtree rules (patrie has a method for this). Not sure whether this is actually a bottleneck, so maybe profile first.

Re-implement "recurse" action

The new DSL has no "recurse" action yet.
Either it should get one, or we provide an additional file listing all the directories that should be recursed into. It may also be helpful to limit the revision range in which this rule should be considered.

Do a lot more asynchronously

There's no reason SVN change discovery can't run in parallel with writing to git fast-import, or that more of the git fast-import work can't happen in parallel (we have a separate process for each repo, after all!). ASIO might be helpful here.

Deal with branch deletion

When branches become empty in Git, it's usually because they have been deleted in SVN. We should leave a tag on the ref's last commit before it became empty, and then delete the ref.

Port the submodule logic

We need to start creating submodule references again.

Missing slash in some path translations

See the directory names in this tree for example.

Check for illegal branch hierarchies

I think the problem addressed by 9402466 could be detected at ruleset processing time. It took hours to debug. Could you add error detection for this?
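The exact problem fixed by 9402466 is not described here, so the following is only a guess at one check that could run at ruleset processing time. Git cannot hold two refs where one name is a directory prefix of the other (for example "release" and "release/1.34"). The sketch below, with a hypothetical function name, reports every such pair in a list of destination ref names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns every pair (outer, inner) where `inner` starts with `outer + "/"`.
inline std::vector<std::pair<std::string, std::string>>
find_ref_conflicts(std::vector<std::string> refs) {
    std::sort(refs.begin(), refs.end());
    refs.erase(std::unique(refs.begin(), refs.end()), refs.end());
    std::vector<std::pair<std::string, std::string>> conflicts;
    for (std::size_t i = 0; i < refs.size(); ++i) {
        const std::string dir = refs[i] + "/";
        // After sorting, all names that start with refs[i] form one block
        // directly behind it; names such as "a-x" may precede "a/x" there.
        for (std::size_t j = i + 1; j < refs.size(); ++j) {
            if (refs[j].compare(0, dir.size(), dir) == 0)
                conflicts.push_back(std::make_pair(refs[i], refs[j]));
            else if (refs[j].compare(0, refs[i].size(), refs[i]) != 0)
                break;  // left the block of names that start with refs[i]
        }
    }
    return conflicts;
}
```

Running this over all branch and tag names a ruleset can produce, and failing early with both offending names, would turn hours of debugging into an immediate error message.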

Inject a .gitattributes file

See http://article.gmane.org/gmane.comp.lib.boost.devel/241829

Support for tags

Make sure tags are created correctly.

Recognize folders

Currently the output shows:

Revision 7623
++ WARNING: File '/branches/unlabeled-1.1.1.1.10/boost/boost/detail' not accounted for. Putting to fallback.
++ WARNING: File '/branches/unlabeled-1.1.1.1.10/boost/libs' not accounted for. Putting to fallback.

These two changes are actually folders. If a folder is not accounted for, Boost2Git automatically recurses. For some reason they are not recognized as folders.

Report and/or correct for malformed rules

I lost almost a whole day to a leading slash prohibited by these rules before I finally fixed it.

Address Warnings

Warnings in the svn2git output are usually ruleset errors. With very few exceptions, they look like they should be fixed.

Optimize merge discovery

Right now there's a pass across every file participating in an SVN directory copy. This is terribly inefficient and I'm certain it could be replaced by a use of patrie::svn_subtree_rules, perhaps along with some exploration to make sure that SVN actually contains a file under the path in question.

Review destination branch/tag names

Some examples of issues that should be considered:

We have lots of refs beginning with old-branches/. I believe that name was chosen originally by @jwiegley because either the branch had been deleted (in which case we can let svn2git handle that using its backups/ feature) or because it is not being actively developed. IIRC in that case he was directing the old branch at a tag, which we are not currently doing.
There are lots of duplicate ref name suffixes in SVN, some of which are being collapsed to the same tag in Git. This mostly happens in the Boost.Build repo, where we have, e.g., /tags/jam/ and /tags/tools/jam/<whatever>. It seems likely that these tags were all simply moved/re-rooted... ah, yes, see r39733. In a case like that, the collapsing is fine, but we should try to be sure of it.

Should we ignore branch deletions unconditionally?

See the commit message on a203c9a. I'm not 100% certain, but I think perhaps that logic applies equally well to any branch (not just master). What do you think?

Mimic gitflow

map 'trunk' to 'develop'
make 'develop' the default branch
connect all tags to 'master'

Make sure empty commit elimination works with ancestry

The part of the process that resets a ref to its previous commit if the SHA1 hasn't changed needs to be checked for interactions with ancestry. We probably want to avoid creating a new commit just because a branch was created, but otherwise we want to make sure that ancestry isn't dropped.

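#!/usr/bin/ruby
# Parent filter for git filter-branch: reads "-p <sha> -p <sha> ..." from
# stdin and prints the same list without redundant parents.
# (gets || '') guards against an empty stdin, where gets returns nil.
old_parents = (gets || '').chomp.gsub('-p ', ' ').strip

if old_parents.empty? then
  new_parents = []
else
  # --independent keeps only commits not reachable from any other listed one
  new_parents = `git show-branch --independent #{old_parents}`.split
end

# Re-emit the surviving parents in the "-p <sha>" form filter-branch expects.
puts new_parents.map{|p| '-p ' + p}.join(' ')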

Here is a bash one-liner that might work.

git filter-branch --force --parent-filter 'read commit; test -z "$commit" || git show-branch --independent `echo -n "$commit" | sed -e "s/-p / /g"` | sed -e "s/.*/-p &/" | tr "\n" " "; echo'
