> vs. >= causes a significant performance difference

I just stumbled upon something. At first I thought it might be a case of branch misprediction, like in this case, but I could not explain why branch misprediction should cause this behavior. I implemented two versions of Bubble Sort in Java and ran some performance tests:

import java.util.Random;

public class BubbleSortAnnomaly {

    public static void main(String... args) {
        final int ARRAY_SIZE = Integer.parseInt(args[0]);
        final int LIMIT = Integer.parseInt(args[1]);
        final int RUNS = Integer.parseInt(args[2]);

        int[] a = new int[ARRAY_SIZE];
        int[] b = new int[ARRAY_SIZE];
        Random r = new Random();
        for (int run = 0; RUNS > run; ++run) {
            for (int i = 0; i < ARRAY_SIZE; i++) {
                a[i] = r.nextInt(LIMIT);
                b[i] = a[i];
            }

            System.out.print("Sorting with sortA: ");
            long start = System.nanoTime();
            int swaps = bubbleSortA(a);

            System.out.println(  (System.nanoTime() - start) / 1E9 + " seconds. "
                               + "It used " + swaps + " swaps.");

            System.out.print("Sorting with sortB: ");
            start = System.nanoTime();
            swaps = bubbleSortB(b);

            System.out.println(  (System.nanoTime() - start) / 1E9 + " seconds. "
                               + "It used " + swaps + " swaps.");
        }
    }

    public static int bubbleSortA(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] > a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    public static int bubbleSortB(int[] a) {
        int counter = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                if (a[j] >= a[j + 1]) {
                    swap(a, j, j + 1);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    private static void swap(int[] a, int j, int i) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}

As you can see, the only difference between the two sorting methods is > vs. >=. When running the program with java BubbleSortAnnomaly 50000 10 10, one would obviously expect sortB to be slower than sortA, since it has to execute more swaps. But I got the following (or similar) output on three different machines:

Sorting with sortA: 4.214 seconds. It used  564960211 swaps.
Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
Sorting with sortA: 4.199 seconds. It used  563355818 swaps.
Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
Sorting with sortA: 4.189 seconds. It used  560825110 swaps.
Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
Sorting with sortA: 4.17  seconds. It used  561924561 swaps.
Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
Sorting with sortA: 4.198 seconds. It used  562613693 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
Sorting with sortA: 4.19  seconds. It used  561658723 swaps.
Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
Sorting with sortA: 4.193 seconds. It used  564986461 swaps.
Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
Sorting with sortA: 4.203 seconds. It used  562526980 swaps.
Sorting with sortB: 2.27  seconds. It used 1249749609 swaps.
Sorting with sortA: 4.176 seconds. It used  561070571 swaps.
Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
Sorting with sortA: 4.191 seconds. It used  559883210 swaps.
Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.

When you set the parameter for LIMIT to, for example, 50000 (java BubbleSortAnnomaly 50000 50000 10), you get the expected result:

Sorting with sortA: 3.983 seconds. It used  625941897 swaps.
Sorting with sortB: 4.658 seconds. It used  789391382 swaps.

I ported the program to C++ to determine whether the problem is Java-specific. Here is the C++ code.

#include <cstdlib>
#include <iostream>

#include <omp.h>

#ifndef ARRAY_SIZE
#define ARRAY_SIZE 50000
#endif

#ifndef LIMIT
#define LIMIT 10
#endif

#ifndef RUNS
#define RUNS 10
#endif

void swap(int * a, int i, int j)
{
    int h = a[i];
    a[i] = a[j];
    a[j] = h;
}

int bubbleSortA(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] > a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int bubbleSortB(int * a)
{
    const int LAST = ARRAY_SIZE - 1;
    int counter = 0;
    for (int i = LAST; 0 < i; --i)
    {
        for (int j = 0; j < i; ++j)
        {
            int next = j + 1;
            if (a[j] >= a[next])
            {
                swap(a, j, next);
                ++counter;
            }
        }
    }
    return (counter);
}

int main()
{
    int * a = (int *) malloc(ARRAY_SIZE * sizeof(int));
    int * b = (int *) malloc(ARRAY_SIZE * sizeof(int));
    for (int run = 0; RUNS > run; ++run)
    {
        for (int idx = 0; ARRAY_SIZE > idx; ++idx)
        {
            a[idx] = std::rand() % LIMIT;
            b[idx] = a[idx];
        }

        std::cout << "Sorting with sortA: ";
        double start = omp_get_wtime();
        int swaps = bubbleSortA(a);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;

        std::cout << "Sorting with sortB: ";
        start = omp_get_wtime();
        swaps = bubbleSortB(b);

        std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                  << " swaps." << std::endl;
    }

    free(a);
    free(b);

    return (0);
}

This program shows the same behavior. Can someone explain what exactly is going on here?

Executing sortB first and then sortA does not change the results.


I think this may indeed be due to branch prediction. If you compare the number of swaps to the number of inner-loop iterations, you find:

Limit = 10

  • A = 560M swaps / 1250M loops
  • B = 1250M swaps / 1250M loops (0.02% fewer swaps than loops)

Limit = 50000

  • A = 627M swaps / 1250M loops
  • B = 850M swaps / 1250M loops

So in the Limit == 10 case the swap is performed 99.98% of the time in sort B, which is obviously favourable for the branch predictor. In the Limit == 50000 case the swap is only hit randomly about 68% of the time, so the branch predictor is much less beneficial.
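
These ratios can be sanity-checked without hardware counters. Below is a minimal Java sketch (my own addition, not part of the question; the class name BranchTakenRatio, the fixed seed, and the smaller array size are arbitrary choices) that runs a full bubble sort and counts how often the swap branch is taken for each comparison:

import java.util.Random;

public class BranchTakenRatio {

    // Runs a full bubble sort on a fresh random array and reports how often
    // the swap branch was taken. strict == true mimics sortA's "a[j] > a[j+1]",
    // strict == false mimics sortB's "a[j] >= a[j+1]".
    static void measure(int size, int limit, boolean strict) {
        int[] a = new int[size];
        Random r = new Random(42);               // fixed seed for repeatable runs
        for (int i = 0; i < size; i++) {
            a[i] = r.nextInt(limit);
        }
        long taken = 0, total = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                ++total;
                boolean doSwap = strict ? a[j] > a[j + 1] : a[j] >= a[j + 1];
                if (doSwap) {
                    int h = a[j]; a[j] = a[j + 1]; a[j + 1] = h;
                    ++taken;
                }
            }
        }
        System.out.printf("limit=%d, %s: swap branch taken %.2f%% of %d comparisons%n",
                          limit, strict ? ">" : ">=", 100.0 * taken / total, total);
    }

    public static void main(String[] args) {
        final int SIZE = 20_000;       // smaller than 50000 to keep the run short
        measure(SIZE, 10, true);       // sortA-like, LIMIT = 10
        measure(SIZE, 10, false);      // sortB-like, LIMIT = 10
        measure(SIZE, 50_000, true);   // sortA-like, LIMIT = 50000
        measure(SIZE, 50_000, false);  // sortB-like, LIMIT = 50000
    }
}

With LIMIT = 10 the >= variant should report a taken rate well above 99% while the > variant stays near 45%; with LIMIT = 50000 both branches are far less predictable, in line with the swap/loop ratios above.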


    I think this can indeed be explained by branch misprediction.

    Consider, for example, LIMIT = 11 and sortB. On the first iteration of the outer loop it will very quickly stumble upon one of the elements equal to 10. So it will have a[j] = 10, and therefore a[j] will certainly be >= a[next], since there are no elements greater than 10. So it will do a swap, then make just one step in j only to find a[j] = 10 once again (the same swapped value). So once again it will be a[j] >= a[next], and so on. Every comparison except a few at the very beginning will be true. Similarly it will run on the next iterations of the outer loop.

    Not the same for sortA. It will start roughly the same way, stumble upon a[j] = 10, do some swaps in a similar manner, but only up to the point when it finds a[next] = 10 too. Then the condition will be false, and no swap will be done. And so on: every time it stumbles upon a[next] = 10, the condition is false and no swap is done. Therefore this condition is true in 10 of 11 cases (values of a[next] from 0 to 9) and false in 1 of 11 cases, so it is no surprise that branch prediction fails. A short simulation of this first pass follows below.
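
    The following sketch (my own illustration, not part of the answer; the class name FirstPassOutcomes and the fixed seed are arbitrary) runs only the first pass of the outer loop for LIMIT = 11 and counts how often each condition holds:

    import java.util.Random;

    public class FirstPassOutcomes {

        public static void main(String[] args) {
            final int SIZE = 50_000, LIMIT = 11;
            int[] strictArr = new int[SIZE];    // for the ">" variant (sortA)
            int[] looseArr  = new int[SIZE];    // for the ">=" variant (sortB)
            Random r = new Random(1);           // fixed seed for repeatability
            for (int i = 0; i < SIZE; i++) {
                strictArr[i] = r.nextInt(LIMIT);
                looseArr[i]  = strictArr[i];
            }
            System.out.println(">  variant: " + firstPass(strictArr, true));
            System.out.println(">= variant: " + firstPass(looseArr, false));
        }

        // Runs only the first iteration of the outer loop and reports how
        // often the swap condition evaluated to true.
        static String firstPass(int[] a, boolean strict) {
            long trueCount = 0, total = 0;
            int i = a.length - 1;
            for (int j = 0; j < i; ++j) {
                ++total;
                boolean cond = strict ? a[j] > a[j + 1] : a[j] >= a[j + 1];
                if (cond) {
                    int h = a[j]; a[j] = a[j + 1]; a[j + 1] = h;
                    ++trueCount;
                }
            }
            return String.format("condition true %.2f%% of %d comparisons",
                                 100.0 * trueCount / total, total);
        }
    }

    With these parameters the >= condition should come out true for essentially every comparison, while the > condition should come out true for roughly 10 of every 11 comparisons, as described above.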


    Using the provided C++ code (with the time measurement removed) and the perf stat command, I got results that confirm the branch-miss theory.

    With Limit = 10, BubbleSortB benefits greatly from branch prediction (0.01% misses), but with Limit = 50000 branch prediction fails even more often (15.65% misses) than in BubbleSortA (12.69% and 12.76% misses, respectively). A rough estimate of what each avoided miss costs, derived from these numbers, follows after the listings.

    BubbleSortA, Limit = 10:

    Performance counter stats for './bubbleA.out':
    
       46670.947364 task-clock                #    0.998 CPUs utilized          
                 73 context-switches          #    0.000 M/sec                  
                 28 CPU-migrations            #    0.000 M/sec                  
                379 page-faults               #    0.000 M/sec                  
    117,298,787,242 cycles                    #    2.513 GHz                    
    117,471,719,598 instructions              #    1.00  insns per cycle        
     25,104,504,912 branches                  #  537.904 M/sec                  
      3,185,376,029 branch-misses             #   12.69% of all branches        
    
       46.779031563 seconds time elapsed
    

    BubbleSortA, Limit = 50000:

    Performance counter stats for './bubbleA.out':
    
       46023.785539 task-clock                #    0.998 CPUs utilized          
                 59 context-switches          #    0.000 M/sec                  
                  8 CPU-migrations            #    0.000 M/sec                  
                379 page-faults               #    0.000 M/sec                  
    118,261,821,200 cycles                    #    2.570 GHz                    
    119,230,362,230 instructions              #    1.01  insns per cycle        
     25,089,204,844 branches                  #  545.136 M/sec                  
      3,200,514,556 branch-misses             #   12.76% of all branches        
    
       46.126274884 seconds time elapsed
    

    BubbleSortB, Limit = 10:

    Performance counter stats for './bubbleB.out':
    
       26091.323705 task-clock                #    0.998 CPUs utilized          
                 28 context-switches          #    0.000 M/sec                  
                  2 CPU-migrations            #    0.000 M/sec                  
                379 page-faults               #    0.000 M/sec                  
     64,822,368,062 cycles                    #    2.484 GHz                    
    137,780,774,165 instructions              #    2.13  insns per cycle        
     25,052,329,633 branches                  #  960.179 M/sec                  
          3,019,138 branch-misses             #    0.01% of all branches        
    
       26.149447493 seconds time elapsed
    

    BubbleSortB, Limit = 50000:

    Performance counter stats for './bubbleB.out':
    
       51644.210268 task-clock                #    0.983 CPUs utilized          
              2,138 context-switches          #    0.000 M/sec                  
                 69 CPU-migrations            #    0.000 M/sec                  
                378 page-faults               #    0.000 M/sec                  
    144,600,738,759 cycles                    #    2.800 GHz                    
    124,273,104,207 instructions              #    0.86  insns per cycle        
     25,104,320,436 branches                  #  486.101 M/sec                  
      3,929,572,460 branch-misses             #   15.65% of all branches        
    
       52.511233236 seconds time elapsed
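
    As a back-of-the-envelope check (my own arithmetic, not part of the measurements), the Limit = 10 numbers above let one estimate what each avoided branch miss is worth in cycles:

    public class MissPenaltyEstimate {
        public static void main(String[] args) {
            // Cycle and branch-miss counts from the two Limit = 10 runs above.
            long cyclesA = 117_298_787_242L;   // BubbleSortA
            long cyclesB =  64_822_368_062L;   // BubbleSortB
            long missesA =   3_185_376_029L;
            long missesB =       3_019_138L;

            // Roughly 16.5 cycles per avoided miss, which is in the typical
            // range of a branch misprediction penalty on modern x86 cores.
            // (This ignores that the two variants also execute a different
            // number of swaps and instructions.)
            double cyclesPerAvoidedMiss =
                    (double) (cyclesA - cyclesB) / (missesA - missesB);
            System.out.printf("~%.1f cycles per avoided miss%n", cyclesPerAvoidedMiss);
        }
    }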
    