Marcus Comstedt (ACROSS) (Hail Ilpalazzo!) @ Pike (-) developers forum wrote:
It does this only when doing diffs or annotations, not when retrieving logs.
So the log will not show the original filename of a copy then, only for a "resurrection"? Can this be fixed by using the resurrection method (whatever it is) instead of regular cp to copy the file? I mean, you should be able to "resurrect" a file regardless of whether it was actually deleted in a later commit or not, and since you can "resurrect" it to a new filename, that would serve as a copy operation.
Maybe we should avoid getting lost in terminology. Your idea of what a log is and git's idea of what a log is may differ. It might help to understand what git does, in essence, when committing. As a matter of fact, every time you commit changes, what git basically does is the following:
a. Take a snapshot of the *entire* source tree (not just the changes).
b. Take the SHA1 hash of the previous commit and store it as the first parent of the current commit.
c. Collect the SHA1 hashes of any other commits that are being merged from (regardless of whether they are part of the current branch or not).
d. Generate a new SHA1 for the current commit, which hashes the source-tree snapshot together with the collected parent SHA1s.
e. Make the newly calculated SHA1 the head of the current branch.
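You can actually see this structure directly: a commit object records exactly a tree, its parents, and some metadata. For example (the hashes, names and dates shown here are made up):

    $ git cat-file -p HEAD
    tree 8f2c3d1...         <- snapshot of the *entire* source tree
    parent a94a8fe...       <- SHA1 of the previous commit (first parent)
    parent 5d41402...       <- extra parent, only present for merges
    author A U Thor <author@example.com> 1112911993 +0200
    committer A U Thor <author@example.com> 1112911993 +0200

    Commit message goes here.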
In diff/annotate or other views of the history, all the copies are inferred from the full tree snapshots and their parent relationships. The snapshots themselves do not carry information on which file went from where to where. Yet when you run git diff, git annotate, or git blame, it will accurately and swiftly give you information on copied, renamed, and moved files or pieces of source.
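For instance, a sketch (the commit and file names are placeholders; -M asks for rename detection, -C for copy detection):

    $ git diff -M -C commitA commitB    # report renames and copies between the two snapshots
    $ git blame -C somefile.c           # attribute lines even if they were copied from another file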
No, it is not a performance problem, because everything internally is hashed; it is not comparing large amounts of text, it is mostly comparing hashes.
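As an illustration (the path and blob hashes are made up), an unchanged file can be skipped by comparing blob IDs alone:

    $ git ls-tree commitA -- src/main.c
    100644 blob 9ae2f0c...    src/main.c
    $ git ls-tree commitB -- src/main.c
    100644 blob 9ae2f0c...    src/main.c    <- same blob ID, so the contents need not be read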
Ok, now I'm confused. You previously implied that I could make local changes (up to 50%) before committing and it would still be recognized as a copy. That would mean that it has to compare more than just a hash.
Say you have a 30MB checked-out source tree containing about 8000 files. Say you committed commit A, then copied one file, altered 40% of it, and committed that as commit B, where commit A is the parent of commit B. When you subsequently run git diff A..B, git looks at both source-tree snapshots. It notices that the first 30MB snapshot has 8000 files and the second 30MB snapshot has 8001 files, quickly compares hashes for all files, figures out that 8000 files are identical, and thus knows for which file it is missing history. Then git begins to search for the new content: it computes partial hashes for chunks of the file and searches for those partial hashes in the other files, possibly helped by a heuristic that prefers files in the same directory or with similar names first. Then it tells you what it finds, and gives you the verdict that the file is a copy of another, or just that chunks of the file appear in other files (and connects history there).
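A rough sketch of that scenario on the command line (all file and commit names here are made up). Note that by default -C only considers files that were themselves modified in the diff as copy sources; doubling it (-C -C) also inspects unmodified files, which is what this case needs since the original file did not change between A and B:

    $ cp src/parser.c src/parser2.c
    $ $EDITOR src/parser2.c             # rewrite roughly 40% of it
    $ git add src/parser2.c
    $ git commit -m "fork the parser"   # this becomes commit B
    $ git diff -C -C --stat A HEAD      # should report src/parser.c => src/parser2.c as a copy

The similarity threshold is tunable, e.g. -C60% demands at least 60% identical content before something is reported as a copy.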
Also, I'm not entirely comfortable with the thought that the history of an already committed file might spontaneously change later.
It doesn't. There are such things as regression tests; development and quality assurance on git are *very* active.
You said previously that the "usual" way to fix an incorrect guess was to change the logic in git, and that this had indeed happened. How do regression tests help if people are deliberately changing the behaviour?
Bad matches are a rare event in practice. Whenever someone encounters one, improvements to the search and match algorithms can be offered for inclusion in mainstream git; the patches are only accepted if they do not cause mismatches in the regression tests that already exercise the corner-case matches found and fixed earlier.
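Roughly, such a regression test (a hypothetical sketch in the style of git's own shell-based test suite; file names and contents are made up) pins a previously misdetected corner case down, so that later changes to the matching heuristics cannot silently reintroduce it:

    test_expect_success 'copy with heavy edits is still detected as a copy' '
        test_write_lines a b c d e f g h i j >original.txt &&
        git add original.txt &&
        git commit -m "original" &&
        cp original.txt copied.txt &&
        test_write_lines a b c d e f g x y z >copied.txt &&
        git add copied.txt &&
        git commit -m "edited copy" &&
        git diff -C -C --name-status HEAD~1 HEAD >actual &&
        grep "^C" actual
    '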