I just tried converting Roxen 2.4:
Stage      Memory use   Commits
---------------------------------------
Import     339 MB       ~17000
Raking     359 MB       ~17000
Verify     361 MB       ~17000
Merging    662 MB       11886
Graphing   803 MB       10897
Generate   803 MB       10897
I believe that the memory use can be reduced by using more custom datatypes (currently there are a lot of mappings generated in the merging and graphing stages). Another way to reduce the memory use is to partition the graphs in the time axis.
I've now rewritten the last stage of the importer so that it is ~2 times faster than before.
FYI: The commit merging code in the above script was broken prior to today, and could sometimes reintroduce old file revisions.
And now I've tried it with the entirety of Pike & ulpc in one go.
Number of revisions:     61660
Number of commits:       34847
Memory before graphing:  ~900 MB
Memory max:              2657 MB
I believe that the memory use can be reduced a bit further by detecting some common cases.
Hm, the number of revisions sounds a bit high; I have 36620 revisions in the SVN repository of Pike and ulpc up to end of August. Do you create one revision for each file when there are commits touching multiple files?
Yes, revision = one revision in one RCS-file. Commit = a set of revisions that were made at the same time (within 5min) belonging to different RCS-files done by the same user and having the same log message.
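The grouping rule above can be sketched roughly as follows. This is not the importer's code (which is not shown in this thread); the Revision record, its field names, and the reading of "within 5min" as the gap between consecutive revisions are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW = 5 * 60  # seconds; the "within 5min" rule, read here as a maximum gap


@dataclass(frozen=True)
class Revision:  # hypothetical record, not the importer's own type
    rcs_file: str  # path of the RCS file
    rev: str       # RCS revision number, e.g. "1.42"
    author: str
    log: str
    time: int      # commit time, seconds since the epoch


def group_into_commits(revisions):
    """Group revisions with the same author and log message into commits."""
    by_key = defaultdict(list)
    for r in revisions:
        by_key[(r.author, r.log)].append(r)

    commits = []
    for revs in by_key.values():
        revs.sort(key=lambda r: r.time)
        current, files = [], set()
        for r in revs:
            # Start a new commit when the gap to the previous revision
            # exceeds the window, or when the same RCS file shows up twice
            # (a commit holds revisions from different RCS files).
            if current and (r.time - current[-1].time > WINDOW
                            or r.rcs_file in files):
                commits.append(current)
                current, files = [], set()
            current.append(r)
            files.add(r.rcs_file)
        if current:
            commits.append(current)
    commits.sort(key=lambda c: c[0].time)
    return commits
```

With this reading, two revisions of the same file by the same user with the same message always end up in separate commits, however close together they are.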
With some simple garbage collection, I've now gotten it to keep memory use steady at under 900 MB. Which has the additional benefit of speeding up the actual committing phase (from ~1.6 commits/s to ~3.9 commits/s on my machine). Committing the entirety of Pike thus goes from ~10.7 hours to ~4.4 hours (much better, but still quite a bit too long though...).
By rewriting the script to use git fast-import instead, it's now chugging along nicely at ~12.5 commits/s (ie ~47 minutes).
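For readers unfamiliar with git fast-import: it reads a plain-text command stream on stdin, which avoids starting a separate git process per commit. Below is a minimal sketch of emitting one commit in that stream format. The function name and parameters are made up for illustration and this is not the converter's code; paths containing special characters would need quoting, which is left out.

```python
import io


def write_commit(out, ref, mark, name, email, when, message, files, parent=None):
    """Append one 'commit' command to a git fast-import stream.

    out is a binary file object; files maps path -> bytes content;
    parent is the mark of the parent commit, or None for a root commit.
    """
    msg = message.encode("utf-8")
    out.write(b"commit %s\n" % ref.encode())
    out.write(b"mark :%d\n" % mark)
    out.write(b"committer %s <%s> %d +0000\n"
              % (name.encode(), email.encode(), when))
    out.write(b"data %d\n" % len(msg))
    out.write(msg + b"\n")
    if parent is not None:
        out.write(b"from :%d\n" % parent)
    for path, content in sorted(files.items()):
        # "inline" means the file content follows directly as a data block.
        out.write(b"M 100644 inline %s\n" % path.encode())
        out.write(b"data %d\n" % len(content))
        out.write(content + b"\n")
    out.write(b"\n")
```

The resulting bytes can be piped into "git fast-import" run inside an empty repository.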
Now we're getting somewhere. Nice work!
An almost useable export of Pike is now available from git://pike-git.lysator.liu.se/pike-new-alpha2
One known issue is commit 4435a84964353881596269079577725c1c9951e3, which is due to src/test/.cvsignore and src/test/create_testsuite not having been killed properly in the transition to Pike 7.2.
Another issue is that all commits are currently credited to the committer (ie not necessarily the author). I'll try to extract the information from git://pike-git.lysator.liu.se/pike.git.
Feed-back appreciated.
/grubba
Two months and lots of changes later...
Currently a full export of ulpc and Pike takes ~1 hour, and contains ~34000 commits.
An almost useable export of Pike is now available from
[...]
A new export is available from git://pike-git.lysator.liu.se/pike-full-20100101
New since the previous export is that I've found and incorporated another historic ulpc repository (ulpc.hubbe), which means that most of the early history of Pike has been recovered.
I believe that I've identified most of the renamed files and when they were renamed, so the commit graph should make more sense now.
Another issue is that all commits are currently credited to the committer (ie not necessarily the author). I'll try to extract the information from git://pike-git.lysator.liu.se/pike.git.
The above is still on the todo.
Feed-back appreciated.
Please, I received next to none for the previous repository.
Happy New Year!
/grubba
Two days, and a new export git://pike-git.lysator.liu.se/pike-full-20100103
I'm reasonably convinced that the commit graph now is almost "correct", there are two known extraneous commits during ulpc times, but otherwise the graph should be fine.
Some of the points left on the todo list are:
* Handling of RCS keyword expansion.
* Identification of authors where not the same as committers.
And a policy question:
Currently commits that are identical (including same history) in more than one major branch are kept as one single commit. Should they be forced to be split apart if after the split point for the branches?
Feed-back appreciated.
As before.
/grubba
- Handling of RCS keyword expansion.
How about EOL tagging? There is at least one file which changes EOL convention during its lifetime...
Currently commits that are identical (including same history) in more than one major branch are kept as one single commit. Should they be forced to be split apart if after the split point for the branches?
What are the practical implications within the git system of the two options?
- Handling of RCS keyword expansion.
How about EOL tagging? There is at least one file which changes EOL convention during its lifetime...
I'm not sure; git handles all files as binary data, so it depends on how RCS handles EOL.
Currently commits that are identical (including same history) in more than one major branch are kept as one single commit. Should they be forced to be split apart if after the split point for the branches?
What are the practical implications within the git system of the two options?
I'll try to explain it with illustrations. More recent commits at the top:
Current:
Branch:   A      A and B         B      Commit

          |                      |
          |                      o      C10
          o                      |      C9
          |                      |
          o                      |      C8
          |________              |
          |        \             o      C7
          |         | __________/|
          |         |/           |
          |         o            |      C6
          |         |            o      C5
          o         | __________/|      C4
           ________ |/           |
                    |            |
                    o            |      C3
                    |            o      C2
                    | __________/
                    |/
Split:              o                   C1
                    |
                    o
                    |
                    o
                    |
                    o
C1 was the last commit when A and B still were the same repository.
C2 was the first commit that was unique to repository B.
C3 was a commit that was performed identically (same content, user, message, prior history and commit time (second precision)) in both A and B.
C4 was the first commit that was unique to repository A. It thus branches off from C3.
C5 was the second commit unique to B. Since it came after C3 it merges C3 and C2.
C6 was the last commit that was identical in both A and B. The virtual branch "A and B" thus ends here.
Alternative:
Split the commits common to both branches after the split. ie create commits C3A and C3B from C3 and C6A and C6B from C6 giving the graph:
Branch:   A      A and B         B      Commit

          |                      |
          |                      o      C10
          o                      |      C9
          |                      |
          o                      |      C8
          |                      |
          |                      o      C7
          |                      |
          |                      |
          o                      o      C6
          |                      o      C5
          o                      |      C4
          |                      |
          |                      |
          o                      o      C3
           ________              o      C2
                   \  __________/
                    |/
Split:              o                   C1
                    |
                    o
                    |
                    o
                    |
                    o
Interesting. I didn't notice this during my own imports, then again I imported the larger part of the older history from the SVN version. What would be sort of interesting is why the branch-creation went in this haphazard way in the past?
In any case, the following observations can be made: a. Since the original branch creation in CVS was rather chaotic, it seems only correct to preserve this chaotic process as much as possible. b. Since git doesn't care if you do it one way or the other, you might as well pick the more correct one. c. For all practical operations on history, it is irrelevant which presentation you pick.
So, all things considered, I'd say preserve history the way it happened (which would be the "current", instead of the "alternative" method).
I'm not sure; git handles all files as binary data, so it depends on how RCS handles EOL.
You mean it has no EOL handling at all? With subversion, you get files where the EOL convention does not matter (such as most source files) converted to the local EOL convention when checking out, and then back again when committing. This is convenient for example for some source files which make sense to edit on W*ndows (because they contain W*ndows-specific code, and thus need to be tested in that environment). In CVS, those files had CRLF convention in the repository, which is inconvenient when checking out on a non-W*ndows system.
Furthermore, a specific EOL convention can be enforced for a file; but I guess that can be emulated with commit hooks. (I assume git has _those_ at least...)
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
I'm not sure; git handles all files as binary data, so it depends on how RCS handles EOL.
You mean it has no EOL handling at all? With subversion, you get
Git supports a crlf attribute which should cater to all your needs (and probably some others too). See: "man gitattributes".
Right. But then grubba's tool needs to import it correctly.
... and now it does. I also added conversion of .cvsignore files to corresponding .gitignore files for good measure.
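One way such a .cvsignore translation could look is sketched below. This is an illustration, not the converter's actual code. It assumes that .cvsignore entries are whitespace-separated patterns applying only to the directory holding the file, and that a lone "!" clears the entries listed before it; the function name is made up.

```python
def cvsignore_to_gitignore(text):
    """Translate the contents of a .cvsignore file into .gitignore text."""
    patterns = []
    for token in text.split():
        if token == "!":
            # Assumed CVS behaviour: a lone "!" clears the list so far.
            patterns = []
            continue
        # A leading "/" anchors the pattern to the directory containing
        # the .gitignore, matching the per-directory scope of .cvsignore.
        patterns.append("/" + token)
    return "".join(p + "\n" for p in patterns)
```

Without the leading "/", git would also apply each pattern in all subdirectories, which is broader than what CVS did.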
From my latest conversion:
The initial commit of rijndael_*.txt modifies .gitattributes:
 *.tex crlf ident
-*.txt crlf ident
-/src/UnicodeData.txt -crlf ident
+*.txt -crlf ident
+/src/modules/Postgres/quickmanual.txt crlf ident
+/lib/modules/Graphics.pmod/Graph.pmod/doc.txt crlf ident
+/src/UnicodeData-ReadMe.txt crlf ident
+/refdoc/tags.txt crlf ident
 *.wmml crlf ident
The commit where autodoc documentation was moved to .txt files:
 *.tex crlf ident
-*.txt -crlf ident
-/src/modules/Postgres/quickmanual.txt crlf ident
-/lib/modules/Graphics.pmod/Graph.pmod/doc.txt crlf ident
-/src/UnicodeData-ReadMe.txt crlf ident
-/src/UnicodeData.txt crlf ident
-/src/post_modules/Unicode/NormalizationTest-3.1.0.txt crlf ident
-/lib/modules/Protocols.pmod/HTTP.pmod/Server.pmod/extensions.txt crlf ident
-/refdoc/tags.txt crlf ident
+*.txt crlf ident
+/src/modules/_Crypto/rijndael_ecb_d_m.txt -crlf ident
+/src/modules/_Crypto/rijndael_ecb_iv.txt -crlf ident
+/src/modules/_Crypto/rijndael_ecb_tbl.txt -crlf ident
+/src/modules/_Crypto/rijndael_cbc_e_m.txt -crlf ident
+/src/modules/_Crypto/rijndael_ecb_vk.txt -crlf ident
+/src/modules/_Crypto/rijndael_cbc_d_m.txt -crlf ident
+/src/modules/_Crypto/rijndael_ecb_vt.txt -crlf ident
+/src/modules/_Crypto/rijndael_ecb_e_m.txt -crlf ident
 *.wmml crlf ident
The commit where Nilsson adjusted the linebreaks for rijndael_*.txt:
 *.txt crlf ident
-/src/modules/_Crypto/rijndael_ecb_d_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_iv.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_tbl.txt -crlf ident
-/src/modules/_Crypto/rijndael_cbc_e_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_vk.txt -crlf ident
-/src/modules/_Crypto/rijndael_cbc_d_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_vt.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_e_m.txt -crlf ident
 *.wmml crlf ident
And for completeness, the .gitattributes at HEAD for the 7.8 branch:
*.1 crlf ident
*.Debian crlf ident
*.S crlf ident
*.ac crlf ident
*.asm crlf ident
*.autodoc crlf ident
*.c crlf ident
*.cfg crlf ident
*.cmod crlf ident
*.css crlf ident
*.cvsignore crlf ident
*.db binary -ident
*.diff crlf ident
*.dirs crlf ident
*.doc-base crlf ident
*.dtd crlf ident
*.el crlf ident
*.files crlf ident
*.gif binary -ident
*.gitignore crlf ident
*.h crlf ident
*.head crlf ident
*.html crlf ident
*.ibd binary -ident
*.ico binary -ident
*.in crlf ident
*.inc crlf ident
*.info crlf ident
*.list crlf ident
*.m crlf ident
*.m4 crlf ident
*.manual crlf ident
*.pbm binary -ident
/src/post_modules/GTK/examples/low_level/psnow/snow06.pbm crlf ident
/src/post_modules/GTK/examples/low_level/psnow/snow00.pbm crlf ident
*.pike crlf ident
/src/post_modules/COM/examples/word.pike -crlf ident
/src/post_modules/COM/examples/word2.pike -crlf ident
/src/post_modules/COM/examples/shelltest.pike -crlf ident
/src/post_modules/COM/examples/ads.pike -crlf ident
*.plist crlf ident
*.pmod crlf ident
*.png binary -ident
*.pnm binary -ident
*.postinst crlf ident
*.postrm crlf ident
*.pre crlf ident
*.prerm crlf ident
*.readme crlf ident
*.refdoc crlf ident
*.s crlf ident
*.sed crlf ident
*.sh crlf ident
*.sha1 -crlf ident
*.supp crlf ident
*.svg crlf ident
/src/post_modules/_Image_SVG/pike+fish.svg -crlf ident
*.symlist crlf ident
*.tab crlf ident
*.tiff binary -ident
*.touch-list crlf ident
*.txt crlf ident
*.vbs crlf ident
*.wxs crlf ident
*.xml crlf ident
*.xpm crlf ident
*.xsl crlf ident
*.yacc crlf ident
.gitattributes crlf ident
.gitignore crlf ident
ANNOUNCE crlf ident
AUTHORS crlf ident
BUGS crlf ident
CHANGES crlf ident
COMMITTERS crlf ident
COPYING crlf ident
COPYRIGHT crlf ident
DISCLAIMER crlf ident
FAQ crlf ident
FILES crlf ident
MANIFEST crlf ident
Makefile crlf ident
README crlf ident
README-CVS crlf ident
TODO crlf ident
africa crlf ident
antarctica crlf ident
asia crlf ident
australasia crlf ident
backward crlf ident
buggy_testsuite crlf ident
changelog crlf ident
compat crlf ident
compose crlf ident
configure crlf ident
control crlf ident
copyright crlf ident
dependencies crlf ident
dirs crlf ident
doc_roxen_template crlf ident
docs crlf ident
etcetera crlf ident
europe crlf ident
export_list crlf ident
extensions crlf ident
factory crlf ident
files_to_compile crlf ident
hilfe crlf ident
install-sh crlf ident
install-welcome binary -ident
install_module crlf ident
keysyms crlf ident
leapseconds crlf ident
menu crlf ident
metatest crlf ident
mktestsuite crlf ident
namedays crlf ident
nobinary_dummy crlf ident
northamerica crlf ident
options crlf ident
pacificnew crlf ident
parse_install_log crlf ident
postinst crlf ident
prerm crlf ident
psetroot crlf ident
pv crlf ident
regional crlf ident
rsif crlf ident
rules crlf ident
run_autoconfig crlf ident
simple_menu_shortcuts crlf ident
smartlink crlf ident
solar87 crlf ident
solar88 crlf ident
solar89 crlf ident
southamerica crlf ident
strip_opcodes crlf ident
systemv crlf ident
testfont binary -ident
unbug crlf ident
xenofarm_gdb_cmd crlf ident
Hm, why do you treat .txt files with crlf conversion as the exception at first? It seems more natural to me to use your second variant, where the rijndael files are the exception, from the start, because those are the files which are somehow special, not the Postgres quickmanual et al...
The commit where Nilsson adjusted the linebreaks for rijndael_*.txt:
 *.txt crlf ident
-/src/modules/_Crypto/rijndael_ecb_d_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_iv.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_tbl.txt -crlf ident
-/src/modules/_Crypto/rijndael_cbc_e_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_vk.txt -crlf ident
-/src/modules/_Crypto/rijndael_cbc_d_m.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_vt.txt -crlf ident
-/src/modules/_Crypto/rijndael_ecb_e_m.txt -crlf ident
 *.wmml crlf ident
No, the files should not be crlf converted after the change either. Before the change the EOL had to be CRLF. After the change the EOL had to be LF. Neither variant was EOL agnostic.
Hm, why do you treat .txt files with crlf conversion as the exception at first? It seems more natural to me to use your second variant, where the rijndael files are the exception, from the start, because those are the files which are somehow special, not the Postgres quickmanual et al...
The code uses a histogram to determine what the default attributes for an extension should be. When the rijndael_*.txt files were added, there were more *.txt files with CRLF than with LF, thus the default to CRLF. When the autodoc documentation files were renamed to *.txt the balance switched.
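The histogram approach described above might look roughly like the sketch below. It is a guess at the idea, not the converter's code. It assumes each file has already been classified as EOL-agnostic text ("crlf") or not ("-crlf"), ignores binary files, and breaks ties in favour of "crlf"; all names are made up.

```python
import os
from collections import Counter, defaultdict


def pattern_for(path):
    """'*.ext' for files with an extension, else the bare file name."""
    base = os.path.basename(path)
    ext = os.path.splitext(base)[1]
    return "*" + ext if ext else base


def attr(convert):
    return "crlf ident" if convert else "-crlf ident"


def gitattributes_lines(eol_by_path):
    """eol_by_path maps a repository path to True (EOL conversion is safe)
    or False (the file must keep its exact line endings)."""
    histogram = defaultdict(Counter)
    for path, convert in eol_by_path.items():
        histogram[pattern_for(path)][convert] += 1

    lines = []
    for pattern in sorted(histogram):
        counts = histogram[pattern]
        default = counts[True] >= counts[False]  # majority wins
        lines.append(f"{pattern} {attr(default)}")
        # Files that disagree with the majority are listed as exceptions.
        for path in sorted(eol_by_path):
            if pattern_for(path) == pattern and eol_by_path[path] != default:
                lines.append(f"/{path} {attr(not default)}")
    return lines
```

This reproduces the flip seen in the diffs: as soon as the majority for an extension changes, the default line and the whole exception list swap roles.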
No, the files should not be crlf converted after the change either. Before the change the EOL had to be CRLF. After the change the EOL had to be LF. Neither variant was EOL agnostic.
As far as I can see, the original test supported either, but Nilsson's commit "No \r in the testdata anymore." removed the support for \r\n.
The code uses a histogram to determine what the default attributes for an extension should be. When the rijndael_*.txt files were added, there were more *.txt files with CRLF than with LF, thus the default to CRLF. When the autodoc documentation files were renamed to *.txt the balance switched.
Ok, that's an explanation but not a reason. ;-) I suggest overriding the decision in this case.
As far as I can see, the original test supported either, but Nilsson's commit "No \r in the testdata anymore." removed the support for \r\n.
Ah, right you are. But the bottom line is still that crlf conversion should not be enabled after the change.
On Mon, Jan 04, 2010 at 03:00:21PM +0000, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Currently commits that are identical (including same history) in more than one major branch are kept as one single commit. Should they be forced to be split apart if after the split point for the branches?
What are the practical implications within the git system of the two options?
I'll try to explain it with illustrations.
i think the second one more correctly represents the workflow that happened, namely that the same commit was made to two branches. this result we would also get today on git if a commit is made to one branch and then cherry-picked to other branches. it makes for easier to read history, and especially, if cherry-picking is used today to make the same commit to multiple branches then old branches from cvs look just like new branches that were made in git first.
greetings, martin.
I also prefer the second variant. Having a "virtual" branch after the split is just confusing, especially if it can be checked out. If we instantiate "A", "B" and "A and B" as e.g. "7.4", "7.5" and "7.3", then no commits should exist on the "7.3" branch after the split.
It also seems to me like in the first graph, if you check out branch A from a date between C6 and C8, it will not contain the changes made in C6, which is inaccurate.
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
It also seems to me like in the first graph, if you check out branch A from a date between C6 and C8, it will not contain the changes made in C6, which is inaccurate.
Well, strictly speaking both C4 and C6 are in branch A, so checking out something between the dates of C6 and C8 is ambiguous and you'd need to specify the left (C4) or the right (C6) branch before you get any meaningful snapshot. This sounds complicated, and it is, but it all depends on what happened back in the CVS days at the actual split. It just seems that what was done then actually *was* this complicated (shortly after the split).
There is nothing complicated in the CVS repository (at least not in this case). Branch (repository) A contains commits C3a, C4, C6a, C8, and C9, in that order, and B contains commits C2, C3b, C5, C6b, C7, and C10. C3a and C3b have exactly the same delta, but are separate commits (of course, in CVS every file is committed separately anyway). Same thing with C6a and C6b. The way these commits were made is typically that the change was made in a working copy of one of the branches and tested there, then ported over to a working copy of the other branch (either by hand, or by using cvs diff + patch) and tested there too. Then both changes were committed using the same cvs commit command, which would commit each file in sequence, but due to the low precision of the timestamps they would appear to have happened simultaneously.
Making a checkout of a branch by date ambiguous is definitely not something we'd want, so I guess that means that variant is out for sure.
Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
I also prefer the second variant. Having a "virtual" branch after the split is just confusing, especially if it can be checked out. If we instantiate "A", "B" and "A and B" as e.g. "7.4", "7.5" and "7.3", then no commits should exist on the "7.3" branch after the split.
The way this would look in git is that you would just have 7.4, 7.5 and 7.3. There would not be a way of checking out the virtual branch, and there would not exist commits on the 7.3 branch after the split (i.e. checking out 7.3 would just give the snapshot in which the split occurred). It's just a matter of where you hang the labels/branches. The virtual branch is never seen, except when you view that spot in a graph visualiser.
Martin Bähr wrote:
On Mon, Jan 04, 2010 at 03:00:21PM +0000, Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
Currently commits that are identical (including same history) in more than one major branch are kept as one single commit. Should they be forced to be split apart if after the split point for the branches?
i think the second one more correctly represents the workflow that happened, namely that the same commit was made to two branches. this
Can someone then explain how it happened that both commits were done at identical points in time (in CVS)? It seems to me like someone cheated and merely *said* he had split the trees, but actually hadn't yet: made a few commits to the original repository, and split afterward?
Another possibility is that the RCS file was copied between the repositories at a later time.
Note though that it is possible to get this effect if a file was patched in multiple branches, and then committed in all the branches simultaneously.
Are the original committers still around? Can they still remember what they did? In any case, it's not critical, of course, so if we presume that the original committers just were very fast and committed the same patch on different branches within one second, then by all means, simplify the graph and linearise it.
Are the original committers still around? Can they still remember what they did?
The committers are usually still around, but I doubt they remember the specific commits (I know that I don't).
In any case, it's not critical, of course, so if we presume that the original committers just were very fast and committed the same patch on different branches within one second, then by all means, simplify the graph and linearise it.
The typical case is probably that someone (like me) has all branches of Pike checked out at the same time:
Pike/0.5
Pike/0.6
Pike/7.0
...
Pike/7.8
Fixes the bug in Pike 7.8, and then backpatches it to the branches where relevant.
And then with cwd == Pike issues a command like:
cvs ci -m 'Fixed something or other.' */path/to/patched/file
If the patched file hasn't diverged earlier, odds are that the new file will still be identical across the branches and have identical history.
BTW: A third (more complicated) alternative would be to introduce merge commits for each of the main branches:
Branch:   A      A and B         B      Commit

          |                      |
          |                      o      C10
          o                      |      C9
          |                      |
          o                      |      C8
          |                      |
          |                      o      C7
          |                      |
          *_________  ___________*      M6
          |        \o/           |      C6
          |         |            o      C5
          o         |            |      C4
          |         |            |
          _________ | ___________*      M3
                   \o/           |      C3
                    |            o      C2
                    | __________/
                    |/
Split:              o                   C1
                    |
                    o
                    |
                    o
                    |
                    o
The typical case is probably that someone (like me) has all branches of Pike checked out at the same time:
cvs ci -m 'Fixed something or other.' */path/to/patched/file
If the patched file hasn't diverged earlier, odds are that the new file will still be identical across the branches and have identical history.
I always wondered how this was done. This would explain it. Then we have no business messing with merges here, simply make a clean split and keep the branches truly separate after that.
Indeed I do. The src/modules/_Crypto/rijndael_*.txt files in 7.3 had CRLF line endings before 2001-12-31, after which they use LF line endings. The rijndaeltest.pike script is dependent on the EOL convention being the right one (it was updated on 2002-01-02 to work with the new convention).
Hmm... Seems cvs's behaviour doesn't match its documentation (surprise), as I interpret the manual, it says that line ending conversion is performed in all modes except -kb...
By the way, the following files contain literal CR:s, and should not be subjected to EOL conversion lest they might break:
nt-tools/init_nt
nt-tools/tools/lib
Apart from these files, and the rijndael files mentioned, I have used native EOL mode for all plaintext files in the SVN repository, even though there were a few more files which used CRLF convention in the CVS repository (see 16685807 for full details).
Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
A new export is available from git://pike-git.lysator.liu.se/pike-full-20100101
I checked pike-full-20100103, which appears to be newer.
New since the previous export is that I've found and incorporated another historic ulpc repository (ulpc.hubbe), which means that most of the early history of Pike has been recovered.
Where does this ulpc.hubbe fit in? Before the ulpc repository? When I look at your repository, I see 1995/08/09 still as the oldest commit.
I believe that I've identified most of the renamed files and when they were renamed, so the commit graph should make more sense now.
Nice. If my import from the SVN version was correct, this should match with my import. An automated check between the two git repositories is not so difficult and actually very fast.
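A sketch of what such an automated check could look like, assuming both imports exist as local clones. It relies on the fact that a git tree ID depends only on file contents, names and modes, so two imports that produce the same snapshot get the same tree ID even when commit metadata differs. The function names are made up, and differences in keyword expansion (such as $Id$ markers) would make otherwise equal snapshots differ.

```python
import subprocess


def tree_ids(repo):
    """Return the set of tree IDs of all commits reachable from any ref."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--all", "--format=%T"],
        check=True, capture_output=True, text=True,
    ).stdout
    return set(out.split())


def compare(trees_a, trees_b):
    """Split two sets of tree IDs into (only in a, only in b, shared)."""
    return trees_a - trees_b, trees_b - trees_a, trees_a & trees_b
```

Usage would be along the lines of compare(tree_ids("pike-svn-import"), tree_ids("pike-full-20100103")), where the first path is a placeholder for a local clone of the SVN-based import.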
Another issue is that all commits are currently credited to the committer (ie not necessarily the author). I'll try to extract the information from git://pike-git.lysator.liu.se/pike.git.
The above is still on the todo.
Could it be that this is fixed in the 103 version I just checked?
With respect to the too-clever branch splitting/merging, I see that around certain commits the same problem occurs as you discussed earlier around the repository splits. Case in point, around label v7.8.350 there is an artificial "branch loop", and looking back through history I see it again at v7.8.336, and probably throughout the entire repository. Obviously these commits were not splitting off to another branch and then remerging in the next commit. I suggest you change your heuristics to avoid splitting at those (and many other) points, and thus leave the history linear (as it originally is in CVS at those points).
With regard to the ID: and rev: fields at the bottom: it is likely the best solution to keep track of what went where. If it can be trimmed down, that would be great, but if not, it's something which is workable.
In order to incorporate the cherry-pick information (which was incorrectly encoded as additional parents in my import), I suggest we include one or more "Original-patch" fields at the very bottom of the commit field. The information which should go in there, can be extracted from my import repository, since most references in there were already handpicked by me/mast/you.
The import repository I made should be around 50MB, yours is around 150MB. It looks like it still contains a lot of extra garbage. After importing, the following tags are forcefully removed to reduce clutter:
hubbes.tag.before.merge1
hubbes.tag.before.rewriting.compiler
msqlmod.1.1pre6
postgres.0.5
start
v0.6a1
I also delete the following branches:
Hubbe
Image.polygon
Infovav
heddas.polypatchar
hubbes.working.branch
kinkie
locked.for.mirar
nisse
nisses-certifikat-hack
tags/hubbes.tag.before.merge1
tags/hubbes.tag.before.rewriting.compiler
tags/msqlmod.1.1pre6
tags/postgres.0.5
tags/start
tags/v0
tags/v0.1
tags/v0.6a1
I'm not sure if your import generates even more stray tags/branches. It would probably be desirable to trim/organise your import so that it ends up with the same tags and branches as mine (in addition to those which are the result of any extra ulpc.hubbe repository, of course).
Henrik Grubbström (Lysator) @ Pike (-) developers forum wrote:
A new export is available from git://pike-git.lysator.liu.se/pike-full-20100101
I checked pike-full-20100103, which appears to be newer.
New since the previous export is that I've found and incorporated another historic ulpc repository (ulpc.hubbe), which means that most of the early history of Pike has been recovered.
Where does this ulpc.hubbe fit in? Before the ulpc repository? When I look at your repository, I see 1995/08/09 still as the oldest commit.
ulpc.hubbe is a few weeks newer than the latest commit on ulpc.old. The ulpc repository has been imported as branch E-13 which branches off from ulpc.hubbe.
The graph looks something like:
                                                 /---------0.5
ulpc.old----//ulpc.hubbe--------...---Pike------<
            ^           \                        ---------...
            |            \ulpc---
            |
There's some history missing here.
The proper graph for the early (missing) history of Pike should probably be something like:
1.0E-100---------1.0E-30-----------1.0E-14--ulpc.hubbe---... \ \ulpc.old---
Some of it is potentially recoverable from the ChangeLog file and old ulpc dists, but I haven't gone that far...
I believe that I've identified most of the renamed files and when they were renamed, so the commit graph should make more sense now.
Nice. If my import from the SVN version was correct, this should match with my import. An automated check between the two git repositories is not so difficult and actually very fast.
Note that the $Id$ markers in pike-full-20100103 have been extracted in -ko mode, so they most likely don't match the SVN repository. This has been fixed in the current converter.
Another issue is that all commits are currently credited to the committer (ie not necessarily the author). I'll try to extract the information from git://pike-git.lysator.liu.se/pike.git.
The above is still on the todo.
Could it be that this is fixed in the 103 version I just checked?
No, it's still on the todo.
With respect to the too-clever branche/splitting merging, I see that around certain commits the same problem occurs as you discussed earlier around the repository splits. Case in point, around label v7.8.350 there is an artificial "branch loop", and looking back through history I see it again at v7.8.336, and probably throughout the entire repository. Obviously these commits were not splitting off to another branch and then remerging in the next commit. I suggest you change your heuristics to avoid splitting at those (and many other) points, and thus leave the history linearly (as it originally is in CVS at those points).
No, those are actually proper splits. They have typically been caused by the build system running in parallel with someone committing stuff. They are (in the general case) also necessary to keep the tags on the correct versions of files.
With regard to the ID: and rev: fields at the bottom: it is likely the best solution to keep track of what went where. If it can be trimmed down, that would be great, but if not, it's something which is workable.
The ID field is there for debug purposes. The Rev field is intended to be kept, since the alternative would be to add an excessive number of tags...
In order to incorporate the cherry-pick information (which was incorrectly encoded as additional parents in my import), I suggest we include one or more "Original-patch" fields at the very bottom of the commit field. The information which should go in there, can be extracted from my import repository, since most references in there were already handpicked by me/mast/you.
This should be possible to add.
The import repository I made should be around 50MB, yours is around 150MB. It looks like it still contains a lot of extra garbage.
The main reason is probably that it's a raw repository as generated by git-fast-import. The git-fast-import manual recommends running git-repack -f --window=50 on the repository afterwards.
I'm not sure if your import generates even more stray tags/branches.
Currently I keep all tags and branches except some of the tags generated by cvs import.
It would probably be desirable to trim/organise your import so that it ends up with the same tags and branches as mine (in addition to those which are the result of any extra ulpc.hubbe repository, of course). -- Sincerely, Stephen R. van den Berg.
Nice. If my import from the SVN version was correct, this should match with my import. An automated check between the two git repositories is not so difficult and actually very fast.
Note that the $Id$ markers in pike-full-10100103 have been extracted in -ko mode, so they most likely don't match the SVN repository. This has been fixed in the current converter.
Er, yes, indeed. The check will only be very fast if the file content is identical (in which case the tree hashes for entire commits can be matched).
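A rough sketch of such a check, in Python, assuming both imports are available as local clones. The function names here are made up for illustration; only `git log --all --format="%H %T"` (commit hash followed by tree hash) is standard git. Commits whose file content differs, for instance because of different $Id$ expansion, will have different tree hashes and will simply show up as unmatched.

```python
import subprocess


def parse_commit_trees(log_output):
    """Parse lines of '<commit> <tree>' into {tree: [commit, ...]}."""
    trees = {}
    for line in log_output.splitlines():
        if not line.strip():
            continue
        commit, tree = line.split()
        trees.setdefault(tree, []).append(commit)
    return trees


def commit_trees(repo_path):
    """Collect the tree hash of every commit reachable from any ref."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--format=%H %T"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_commit_trees(out)


def matching_trees(trees_a, trees_b):
    """Tree hashes that occur in both repositories."""
    return sorted(set(trees_a) & set(trees_b))


def unmatched(trees_a, trees_b):
    """Commits of repository A whose tree has no counterpart in B."""
    return sorted(c for t, cs in trees_a.items() if t not in trees_b for c in cs)
```

Calling `commit_trees()` on each clone and passing the results to `unmatched()` would list the commits that need a closer look; everything else is identical at the whole-tree level.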
Case in point, around label v7.8.350 there is an artificial "branch loop", and looking back through history I see it again at v7.8.336, and probably
No, those are actually proper splits. They have typically been caused by the build system running in parallel with someone committing stuff. They are (in the general case) also necessary to keep the tags on the correct versions of files.
Ok. Excellent. Just make sure then that the build system is using the second-parent-slot, and the regular commits use the first-parent-slot (so that --first-parent on git commands merely yields proper commits).
The import repository I made should be around 50MB, yours is around 150MB. It looks like it still contains a lot of extra garbage.
The main reason is probably that it's a raw repository as generated by git-fast-import. The git-fast-import manual recommends running git-repack -f --window=50 on the repository afterwards.
I think I tried that already. But I'll look into it once more.
Case in point, around label v7.8.350 there is an artificial "branch loop", and looking back through history I see it again at v7.8.336, and probably
No, those are actually proper splits. They have typically been caused by the build system running in parallel with someone committing stuff. They are (in the general case) also necessary to keep the tags on the correct versions of files.
Ok. Excellent. Just make sure then that the build system is using the second-parent-slot, and the regular commits use the first-parent-slot (so that --first-parent on git commands merely yields proper commits).
This will typically be the case, since it does a best-match tag reachability analysis of the parents of the merge node, which will prefer the parent with no tags of its own. If this fails, it will fall back to selecting the most recent parent, which again typically will not be the commit from the build system, since if it was, the split wouldn't have happened.
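To illustrate the idea, here is a much simplified Python sketch of that parent ordering. It is not the converter's actual code: the `Parent` record and `order_parents` function are hypothetical, and the real best-match tag reachability analysis is reduced here to "does this parent carry tags of its own".

```python
from dataclasses import dataclass, field


@dataclass
class Parent:
    name: str
    timestamp: int                               # commit time, seconds since epoch
    own_tags: set = field(default_factory=set)   # tags pointing directly at this commit


def order_parents(parents):
    """Return the parents of a merge node with the preferred first parent in front."""
    untagged = [p for p in parents if not p.own_tags]
    if len(untagged) == 1:
        # Exactly one parent without tags of its own: treat it as the regular commit.
        first = untagged[0]
    else:
        # Ambiguous: fall back to the most recent parent.
        first = max(parents, key=lambda p: p.timestamp)
    return [first] + [p for p in parents if p is not first]
```

With a tagged build-system commit and an untagged regular commit as parents, the regular commit ends up in the first-parent slot, which is what `--first-parent` follows.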
The import repository I made should be around 50MB, yours is around 150MB. It looks like it still contains a lot of extra garbage.
The main reason is probably that it's a raw repository as generated by git-fast-import. The git-fast-import manual recommends running git-repack -f --window=50 on the repository afterwards.
I can't seem to get git-repack to do anything; it just says "Nothing new to pack.".
I usually use the following alias:

[alias]
	packall = !rm -rf .git/ORIG_HEAD .git/FETCH_HEAD .git/index .git/logs .git/info/refs .git/objects/pack/pack-*.keep .git/refs/original .git/refs/patches .git/patches .git/gitk.cache && git prune --expire now && git repack -a -d --window=200 && git gc
But, in your repository, even that doesn't do much. So if I had to guess, there still is a lot of extra garbage in there; the question just is: Where?
Note that the $Id$ markers in pike-full-10100103 have been extracted in -ko mode, so they most likely don't match the SVN repository. This has been fixed in the current converter.
Meaning you now use -kk instead?
Yes, -kkv (thus the fixes to Parser.RCS a few days ago).
Um, -kkv means to include also the value. To get just the key (like in the SVN repository), you need to use -kk.
True. Anyway, the repository is now generated in -kkv mode, which means that any checked-out files will have the same content as if checked out from CVS. A .gitattributes file is also generated, which contains the attribute "ident" for these files, which means that git will strip the revision info from the $Id$ markers on commit (and then expand them with a git identifier on the next checkout).
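For readers unfamiliar with the "ident" attribute, the following Python sketch approximates what git does for a path that has it set (for example via a .gitattributes line such as `src/program.c ident`, where the path is hypothetical). The function names are made up, and git's real implementation handles more corner cases; this only shows the collapse on check-in and the expansion on checkout.

```python
import re

# Any "$Id: ... $" on a single line counts as an expanded keyword.
_ID_EXPANDED = re.compile(rb"\$Id:[^$\n]*\$")


def collapse_id(data: bytes) -> bytes:
    """Check-in direction: reduce every expanded keyword to a bare $Id$."""
    return _ID_EXPANDED.sub(b"$Id$", data)


def expand_id(data: bytes, blob_name: str) -> bytes:
    """Checkout direction: expand a bare $Id$ with the blob's object name."""
    return data.replace(b"$Id$", b"$Id: " + blob_name.encode("ascii") + b" $")
```

Because the stored blob only ever contains the bare `$Id$`, a CVS-style expansion such as `$Id: program.c,v 1.5 ... $` (again a made-up example) collapses to the same content on commit, so the differing values never reach the repository.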
And that is a Good Thing (tm)? Meaning: I generally find differences between committed and checked out objects confusing, and therefore prefer getting rid of the (antiquated?) $Id$. I.e. IMO, leaving them unaltered in the checked out version would be the most unintrusive option for commits imported from CVS.
IMO, $Id$ is handy for exports, but not all that critical in a wc where you can get the version of the file anyway.
In SVN, you can actually see what the file looks like in the repository (i.e. no keyword expansion, no EOL conversion etc) by examining the file ".svn/text-base/FILE.svn-base".
stripping $Id$ is a good thing in the long run. someone may decide to continue developing some part of pike in cvs. such code will have all $Id$ replaced with a new value. merging such code back into our repo is a huge pain.
i have experienced such a case on the linux kernel. some company developed a driver based on a driver already in the kernel. they just took some version and dumped it into cvs for their development. then i got a snapshot of that and had to spend some hours creating a clean diff that was not littered with random $keyword$ differences...
going through the pain now will make things easier for others in the future.
greetings, martin.
Why does difference in keyword values generate a conflict when the value is stripped on commit anyway? That sounds stupid...
By rewriting the script to use git fast-import instead, it's now chugging along nicely at ~12.5 commits/s (ie ~47 minutes).
Two months and lots of changes later...
Another one and a half months, and several more changes...
A new export is available from git://pike-git.lysator.liu.se/pike-full-20100214
Another issue is that all commits are currently credited to the committer (ie not necessarily the author). I'll try to extract the information from git://pike-git.lysator.liu.se/pike.git.
This has now been implemented. I have however kept the original author as before in a few cases, where the credit in the commit message was for the bug report or similar. I have also identified a few more cases. For details, please see config/Pike-real-authors in the pcvs2git repository.
Another change is that by popular demand, the split points are now enforced.
Feedback appreciated.
As before.
Happy Valentine!
/grubba
Hm, if I try to clone git://pike-git.lysator.liu.se/pike-full-20100313, I get "fatal: The remote end hung up unexpectedly"...
pike-devel@lists.lysator.liu.se