Git difftool ridiculously slow in Cygwin/MinGW

2018-07-01 22:54:47

I noticed that git difftool is very slow. A delay of about 1 to 2 seconds appears between each diff invocation.

To benchmark it I have written a custom difftool command:

#!/bin/sh
echo $0 $1 $2

I then configured Git to use this tool in my ~/.gitconfig:

[diff]
    tool = mydiff
[difftool "mydiff"]
    prompt = false
    cmd = ~/mydiff \"$LOCAL\" \"$REMOTE\"

I tested it on the Git sources:

$ git clone https://github.com/git/git.git
$ cd git
$ git rev-parse HEAD
1bc8feaa7cc752fe3b902ccf83ae9332e40921db
$ git diff HEAD~10 --stat --name-only | wc -l
23

When I time git difftool against 259b5e6d33, the result is ridiculously slow:

$ time git difftool 259b5
mydiff /dev/null Documentation/RelNotes/2.6.3.txt
...
mydiff /tmp/mY2T6l_upload-pack.c upload-pack.c

real    0m10.381s
user    0m1.997s
sys     0m6.667s

A simpler script that does roughly the same job runs much faster:

$ time git diff --name-only --stat 259b5 | xargs -n1 -I{} sh -c 'git show 259b5:{} > {}.tmp && ~/mydiff {} {}.tmp'
mydiff Documentation/RelNotes/2.6.3.txt Documentation/RelNotes/2.6.3.txt.tmp
mydiff upload-pack.c upload-pack.c.tmp

real    0m1.149s
user    0m0.472s
sys     0m0.821s

What did I miss?

Here are the results I got (real time, in seconds):

| Cygwin | Debian | Ubuntu | Method   |
| ------ | ------ | ------ | -------- |
| 10.381 |  2.620 | 0.580  | difftool |
|  1.149 |  0.567 | 0.210  | custom   |

For the Cygwin results, I measured 2.8s spent in git-difftool and 7.5s spent in git-difftool--helper. The latter is only 98 lines long, so I don't understand why it is that slow.

Using some of the techniques found on the msysgit GitHub, I have narrowed this down a bit.

For each file in the diff, git-difftool--helper re-runs the following internal commands:

12:44:46.941239 git.c:351               trace: built-in: git 'config' 'diff.tool'
12:44:47.359239 git.c:351               trace: built-in: git 'config' 'difftool.bc.cmd'
12:44:47.933239 git.c:351               trace: built-in: git 'config' '--bool' 'mergetool.prompt'
12:44:48.797239 git.c:351               trace: built-in: git 'config' '--bool' 'difftool.prompt'
12:44:49.696239 git.c:351               trace: built-in: git 'config' 'difftool.bc.cmd'
12:44:50.135239 git.c:351               trace: built-in: git 'config' 'difftool.bc.path'
12:44:50.422239 git.c:351               trace: built-in: git 'config' 'mergetool.bc.path'
12:44:51.060239 git.c:351               trace: built-in: git 'config' 'difftool.bc.cmd'
12:44:51.452239 git.c:351               trace: built-in: git 'config' 'difftool.bc.cmd'

Notice that, in this particular case, it took roughly 4.5 seconds to execute these. This is a pretty consistent pattern throughout my log.

Note too that some of these are duplicates: git config difftool.bc.cmd is called 4 times!
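The duplicates can also be tallied mechanically. This is a minimal sketch, assuming the trace was captured with GIT_TRACE=1 and looks like the lines quoted above; the function name count_config_calls and the file name trace.log are made up for illustration.

```shell
#!/bin/sh
# count_config_calls (hypothetical helper): tally identical `git config`
# lookups in a GIT_TRACE log read from stdin.
count_config_calls() {
    # Keep only traced built-in `config` calls and strip everything up to
    # and including "git 'config' ", leaving just the argument list.
    sed -n "s/.*trace: built-in: git 'config' //p" |
        sort |      # group identical argument lists together
        uniq -c |   # prefix each distinct list with its count
        sort -rn    # most frequent lookups first
}
```

A run might look like `GIT_TRACE=1 git difftool 259b5 2>trace.log` followed by `count_config_calls < trace.log`. Fed the nine trace lines quoted above, the first line of output is the difftool.bc.cmd lookup with a count of 4.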

Now, possible remedies:

I cut the execution time for these commands in half by moving all of the diff-related sections to the top of my .gitconfig file. Seriously. It's still noticeable, but now on the order of 2 seconds instead of 4.5.

Make sure that your Git folder under Program Files and your user profile (where .gitconfig lives) are both excluded from realtime virus scanning.

Fundamentally, Git needs to be more efficient with parsing and getting configuration values. Ideally, it would cache these instead of re-requesting (and reparsing...) from config every time in a loop. Perhaps even cached for the entire command execution.
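To illustrate the caching idea, here is a minimal sketch of a script that reads the configuration once and answers later lookups from memory. It is not how Git or git-difftool--helper is implemented; CONFIG_CACHE, load_config_cache and config_lookup are hypothetical names, and the sketch assumes values that fit on one line.

```shell
#!/bin/sh
# Sketch only: read the configuration once, then answer lookups from memory.
# CONFIG_CACHE, load_config_cache and config_lookup are hypothetical names.

load_config_cache() {
    # One `git config --list` call replaces many single-key calls.
    CONFIG_CACHE=$(git config --list)
}

config_lookup() {
    # $1: key as printed by `git config --list`, e.g. difftool.bc.cmd
    # The last matching line wins, as later config entries override earlier ones.
    printf '%s\n' "$CONFIG_CACHE" | awk -v key="$1" '
        index($0, key "=") == 1 { value = substr($0, length(key) + 2); found = 1 }
        END { if (!found) exit 1; print value }'
}
```

After a single call to `load_config_cache`, something like `config_lookup difftool.bc.cmd` returns the value without starting another git process, which is the expensive part on Cygwin. A missing key makes config_lookup return a non-zero status.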

git difftool should be slightly faster with Git 2.13 (Q2 2017).
See commit d12a8cf (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 8868ba1, 24 Apr 2017)

`unpack-trees`: avoid duplicate ODB lookups during checkout

(ODB: Object DataBase)

Teach traverse_trees_recursive() to not do redundant ODB lookups when both directories refer to the same OID.

In operations such as read-tree and checkout, there will likely be many peer directories that have the same OID when the differences between the commits are relatively small.
In these cases we can avoid hitting the ODB multiple times for the same OID.

This patch handles n=2 and n=3 cases and simply copies the data rather than repeating the fill_tree_descriptor().

================

On the Windows repo (500K trees, 3.1M files, 450MB index), this reduced the overall time by 0.75 seconds when cycling between 2 commits with a single file difference.

(avg) before: 22.699
(avg) after:  21.955
===============

After some investigation I have evidence that the bad performance had to do with files owned by a user from a different domain. Specifically, I arrived at the following conclusions:

I'm working in a corporate environment with several domains and thousands of users.

Due to organizational changes each user is, probably only during a transition phase, kept in two domains, his or her primary domain as well as a second domain. When changing object ownership through the Windows GUI each user appears twice and one must go to the extended user selection to identify the one assigned to a specific domain.

Cygwin with acl enabled displays the owner of an "other domain" file as "<domain>+<username>". The primary domain self is shown as just "<username>". Cygwin without acl displays just "<username>" in both cases. That can be fairly confusing, because the file permissions and ownership as seen by Cygwin would indicate write permission while the user does not actually have it.

Files belonging to the "other domain" self are writable by my "this domain" self, so the domain assignment is largely transparent.

A large source tree from our version control system (which was also mirrored in a git repo) had thousands of files owned by the "other domain self". That seemed to cause the slow file operations. Changing ownership to the "primary domain self" fixed the speed issue, both for git and for other file access.

I must assume that obtaining file permissions for users in other domains is slow, and for some reason not cached (it was always the same user).
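A quick way to see whether a tree contains such files is to tally file owners. The following is only a sketch, assuming GNU find (for -printf) and a Cygwin mount with acl enabled, so that "other domain" owners show up as "<domain>+<username>"; count_owners is a hypothetical helper name.

```shell
#!/bin/sh
# count_owners (hypothetical helper): tally regular files per owner below a
# directory. Relies on -printf, which is specific to GNU find.
count_owners() {
    # %u prints the owning user name (or the numeric ID if it has no name).
    find "${1:-.}" -type f -printf '%u\n' | sort | uniq -c | sort -rn
}
```

Running `count_owners /path/to/source/tree` prints one line per distinct owner with the number of files it owns, so a large count next to a "<domain>+<username>" entry points at the files whose ownership would need changing.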

The rest of the article below is what I originally posted. I let it stand.

For me (working in a large company with multiple, geographically distributed Windows domains) the culprit is that cygwin uses Windows acl by default. Consider this request for all known users in the domain:

$ time (mkpasswd -D | wc -l)
45183

real    27m55,340s
user    0m1,637s
sys     0m0,123s

The fix (1)(2) was a simple matter of mounting the NTFS file systems with noacl, i.e. my /etc/fstab contains the line

none / cygdrive binary,posix=0,user,noacl 0 0

(at the same time eliminating the annoying cygdrive prefix).

I cannot help but imagine that cygwin/msys (same behavior there, except that the Windows git installation mounts noacl by default, probably for this reason) performs a domain server query for every file it touches and does not cache the results.

The change was introduced some time around 2015 with cygwin 2.4 or 2.5. From the release notes for 2.4:

To accommodate standard Windows ACLs, the POSIX permissions of the owner and all other users in the ACL are computed using the Windows AuthZ API. This may slow down the computation of POSIX permissions noticably in some circumstances [...] (emphasis by me).

The noacl option reduced the time to launch BeyondCompare (or echo a string, for that matter) from 25 seconds to 1. I do not understand why a plain git diff on the same file is very fast even with acl, since I would naively assume that the required information, and thus the required file system operations, are identical.

I'll check out the cygserver now which may improve things by caching.

Update: cygserver does not improve the situation, unfortunately.

(1) This is the fix for git; mkpasswd is not affected.

(2) I have not understood and tested the impact on file permissions and ownership with respect to git (and ClearCase views which we also access through cygwin). My gut feeling is that one wants to stay true to Windows semantics as closely as possible (meaning that noacl may run into problems).

(3) The cygwin documentation discusses scenarios in which the query results are not cached. One consists of a sequence of cygwin processes which are not spawned from a common cygwin ancestor (like a bash) but from a Windows program like cmd. I must assume that Windows provides a caching mechanism for native programs, or a Windows system would be unusable in this corporate environment. For some reason cygwin does not use it.
