Maximum SIMD integer multiplications on Ivy Bridge using SSE/AVX?

Would somebody be able to advise me how I can work out the maximum number of 32-bit unsigned integer multiplications I would be able to do concurrently on an Ivy Bridge CPU using SIMD via SSE/AVX? I understand AVX did have 256-bit registers for multiplication, but this was for floating point (AVX2 introduced 256-bit integer registers). Therefore I am not entirely sure whether it would be better to use the floating-point registers for the integer multiplications, if that is even possible. Also, I am unsure whether it is simply a question of the number of registers, or whether I need to look at the CPU's ports. It looks like port 0 and port 5 can handle the SSE integer ALU?
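
A minimal sketch of what a packed 32-bit multiply looks like with SSE4.1 intrinsics (Ivy Bridge supports SSE4.1 and AVX; the array values are arbitrary, chosen only for illustration). _mm_mullo_epi32 keeps the low 32 bits of each product, so four multiplications are issued per instruction; compile with something like -msse4.1:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main() {
        uint32_t a[4] = {1, 2, 3, 4};
        uint32_t b[4] = {10, 20, 30, 40};
        uint32_t r[4];

        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        // _mm_mullo_epi32 keeps the low 32 bits of each 32x32 product:
        // four independent multiplications per instruction.
        __m128i vr = _mm_mullo_epi32(va, vb);
        _mm_storeu_si128((__m128i *)r, vr);

        printf("%u %u %u %u\n", r[0], r[1], r[2], r[3]);
        return 0;
    }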

FMA3 in GCC: how to enable

I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code I wrote in GCC 4.8.1 on Linux. Below are the three different ways I compile:

    SSE2:     gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
    AVX:      gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp
    AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math

The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version.
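
One way to confirm FMA is actually being used, beyond inspecting the generated assembly, is to call the intrinsic directly; a minimal sketch, assuming the code is built with -mfma or -march=native on the i5-4250U:

    #include <immintrin.h>
    #include <stdio.h>

    int main() {
        __m256 a = _mm256_set1_ps(2.0f);
        __m256 b = _mm256_set1_ps(3.0f);
        __m256 c = _mm256_set1_ps(1.0f);
        // Fused multiply-add: d = a * b + c in a single vfmadd instruction.
        __m256 d = _mm256_fmadd_ps(a, b, c);

        float out[8];
        _mm256_storeu_ps(out, d);
        printf("%f\n", out[0]);   // expect 7.000000
        return 0;
    }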

Floating point division vs floating point multiplication

Is there any (non-micro-optimization) performance gain from coding

    float f1 = 200f / 2

compared to

    float f2 = 200f * 0.5

A professor of mine told me a few years ago that floating-point division was slower than floating-point multiplication, without elaborating on why. Does this statement hold for modern PC architecture?

Update 1: With respect to a comment, please also consider this case:

    float f1;
    float f2 = 2;
    float f3 = 3;
    for (i = 0; i < 1e8; i++) {
        f1 = (i * f2 + i / f3) * 0.5; // or divide by 2.0f, respectively
    }

Update 2: Quoting from the comments:
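
A rough way to measure the difference yourself; this is only a sketch (invented loop count and variable names), and without care the compiler may fold or hoist the division, so check the generated code before trusting the numbers:

    #include <chrono>
    #include <cstdio>

    int main() {
        const int N = 100000000;
        volatile float sink = 0.0f;   // volatile keeps the loops from being optimized away

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) sink = sink + i / 3.0f;          // division
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) sink = sink + i * (1.0f / 3.0f); // multiply by reciprocal
        auto t2 = std::chrono::steady_clock::now();

        std::printf("div: %lld ms, mul: %lld ms\n",
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());
        return 0;
    }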

How can I compare the performance of log() and fp division in C++?

I'm using a log-based class in C++ to store very small floating-point values (as the values otherwise go beyond the range of double). As I'm performing a large number of multiplications, this has the added benefit of converting the multiplications into sums. However, at a certain point in my algorithm, I need to divide a standard double value by an integer value and then do a *= with a log-based value. I have overloaded the *= operator for my log-based class so that the right-hand value is first converted to a log-based value by running log() on it and is then added to the left-hand value.
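
A minimal sketch of the kind of log-domain class described (the names LogValue and logv are invented for illustration): the stored value is log(x), so operator*= reduces to one log() call plus an addition.

    #include <cmath>
    #include <cstdio>

    // Hypothetical log-domain value: stores log(x) so products become sums.
    struct LogValue {
        double logv;                                  // log of the represented value

        LogValue& operator*=(double rhs) {            // rhs is an ordinary double
            logv += std::log(rhs);                    // one log() plus one addition
            return *this;
        }
    };

    int main() {
        LogValue v{std::log(1e-300)};                 // a very small value
        v *= 1e-50;                                   // the plain double product would underflow to 0
        std::printf("log of result = %f\n", v.logv);  // about -805.9
        return 0;
    }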

Why doesn't clang/llvm optimize this?

When compiling this code with clang 3.9:

    constexpr bool is_small(long long v) { return v < 0x4000000000000000; }
    int foo();
    int f(int a) {
        if (is_small(a))
            return a;
        else
            return foo();
    }

it produces assembly equivalent to

    int f(int a) { return a; }

since it determined that is_small(a) will always be true, because a is an int, which (on my platform) is always smaller than 0x4000000000000000. However, when I change is_small to:
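
A compact way to see why the branch folds (a sketch; INT_MAX is the relevant bound assuming a 32-bit int, as on the questioner's platform):

    #include <climits>

    // Any int value lies in [INT_MIN, INT_MAX], and with a 32-bit int
    // INT_MAX (2^31 - 1) is far below 0x4000000000000000 (2^62), so after the
    // implicit int -> long long conversion the comparison is a tautology and
    // clang's value-range analysis can drop the call to foo() entirely.
    static_assert(static_cast<long long>(INT_MAX) < 0x4000000000000000LL,
                  "every int is 'small' by this definition");

    int main() { return 0; }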

Decimal arithmetics in C or C++?

The IEEE-754 standard defines decimal arithmetic in order to avoid rounding errors when using base-ten floating-point numbers (see for example decimal64 on Wikipedia). Is there a way to use this decimal arithmetic in C or C++?

TR 24733 specifies decimal floating-point math for C++, based on IEEE-754. The TR means that it's a technical report, so it's not part of the C++ standard. GCC says they have a partial implementation. There is currently a proposal in the works to add it to the C++ standard, but that is at best several years away.
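
A minimal sketch using GCC/libstdc++'s partial TR 24733 implementation mentioned above (the <decimal/decimal> header and the std::decimal namespace are specific to that implementation, and availability depends on the GCC build):

    #include <decimal/decimal>   // libstdc++'s partial TR 24733 implementation
    #include <iostream>

    int main() {
        using std::decimal::decimal64;
        decimal64 a = decimal64(1) / decimal64(10);   // exactly 0.1 in decimal64
        decimal64 b = decimal64(2) / decimal64(10);   // exactly 0.2
        decimal64 c = decimal64(3) / decimal64(10);   // exactly 0.3

        // With binary doubles, 0.1 + 0.2 == 0.3 is false; in decimal arithmetic it holds.
        std::cout << std::boolalpha << (a + b == c) << '\n';   // expected: true
        return 0;
    }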

Change floating point rounding mode

What is the most efficient way to change the rounding mode* of IEEE 754 floating-point numbers? A portable C function would be nice, but a solution that uses x86 assembly is OK too.

*I am referring to the standard rounding modes of toward nearest, toward zero, and toward positive/negative infinity.

This is the standard C solution:

    #include <fenv.h>
    #pragma STDC FENV_ACCESS ON

    // store the original rounding mode
    const int originalRounding = fegetround( );
    // establish the desired rounding mode
    fesetround(FE_TOWARDZERO);
    // do whatever you need to do ...
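
Filling out the excerpt above into a small runnable program (a sketch; rint() is used because it honors the current rounding mode, and the values shown assume the modes are supported, as they are on x86):

    #include <cfenv>
    #include <cmath>
    #include <cstdio>

    // Tells a conforming compiler that the FP environment is accessed at runtime
    // (GCC currently ignores this pragma, so beware of constant folding).
    #pragma STDC FENV_ACCESS ON

    int main() {
        const int originalRounding = fegetround();

        fesetround(FE_TOWARDZERO);
        std::printf("toward zero: rint( 2.7) = %.1f, rint(-2.7) = %.1f\n",
                    std::rint(2.7), std::rint(-2.7));   // 2.0 and -2.0

        fesetround(FE_UPWARD);
        std::printf("upward:      rint( 2.1) = %.1f, rint(-2.1) = %.1f\n",
                    std::rint(2.1), std::rint(-2.1));   // 3.0 and -2.0

        fesetround(originalRounding);   // restore the original mode
        return 0;
    }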

Convert float to bigint (aka portable way to get binary exponent & mantissa)

In C++, I have a bigint class that can hold an integer of arbitrary size. I'd like to convert large float or double numbers to bigint. I have a working method, but it's a bit of a hack. I used the IEEE 754 number specification to get the binary sign, mantissa and exponent of the input number. Here is the code (the sign is ignored here, that's not important):

    float input = 77e12;
    bigint result;

    // extract sign, exponent and mantissa,
    // according to IEEE 754 single precision number format
    unsigned int *raw = reinterpret_cast<unsigned int *>(&input);
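
A more portable route to the same information, sketched with frexp()/ldexp(); double and its 53-bit significand are assumed here, and the final bigint step is only indicated in a comment since the class's interface isn't shown:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main() {
        double input = 77e12;

        // frexp splits input into frac * 2^exp with frac in [0.5, 1).
        int exp;
        double frac = std::frexp(input, &exp);

        // Scaling frac by 2^53 yields the integer significand exactly,
        // so input == significand * 2^(exp - 53).
        std::uint64_t significand =
            static_cast<std::uint64_t>(std::ldexp(frac, 53));
        exp -= 53;

        std::printf("significand = %llu, exponent = %d\n",
                    (unsigned long long)significand, exp);
        // A bigint conversion would set result = significand and then shift
        // left by exp (or right by -exp when exp is negative).
        return 0;
    }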

Range of representable values of 32

The C++ standard says of floating literals: "If the scaled value is not in the range of representable values for its type, the program is ill-formed." The scaled value is the significant part multiplied by 10 ^ exponent part. Under x86-64, float is single-precision IEEE-754, double is double-precision IEEE-754, and long double is 80-bit extended-precision IEEE-754.

In this context, what is the range of representable values for each of these three types? Where is this documented, or how is it calculated?

The answer (on a machine that uses IEEE floating point) is in float.h: FLT_MAX, DBL_MAX and LDBL_MAX.
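
A short check of where those limits live (a sketch; the exact values printed depend on the platform's types, but on x86-64 they correspond to the three IEEE-754 formats listed above):

    #include <cfloat>
    #include <limits>
    #include <cstdio>

    int main() {
        // The <cfloat>/<float.h> macros give the extremes directly...
        std::printf("FLT_MAX  = %g\n", FLT_MAX);
        std::printf("DBL_MAX  = %g\n", DBL_MAX);
        std::printf("LDBL_MAX = %Lg\n", LDBL_MAX);

        // ...and std::numeric_limits reports the same bounds plus the smallest normals.
        std::printf("float: %g .. %g\n",
                    std::numeric_limits<float>::min(),
                    std::numeric_limits<float>::max());
        return 0;
    }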

How does `std::less` work?

Pointer relational operators do not define a total order (§5.9 of the C++11 standard):

"If two pointers p and q of the same type point to different objects that are not members of the same object or elements of the same array or to different functions, or if only one of them is null, the results of p<q, p>q, p<=q, and p>=q are unspecified."

The std::less documentation says:

"For any pointer type, the partial specialization of std::less yields a total order, even if the built-in operator< does not."

How does it produce this total order from a partial order? I couldn't work it out by looking at /usr/include/c++/4.9
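
A small illustration of the guarantee (a sketch with two unrelated local objects; how the library actually achieves the total order is implementation-specific, which is what the question is asking about):

    #include <functional>
    #include <iostream>
    #include <set>

    int main() {
        int a = 1, b = 2;            // unrelated objects: p < q on raw pointers is unspecified
        int *p = &a, *q = &b;

        // std::less<int*> is required to impose a strict total order even here,
        // which is why ordered containers keyed on pointers are well-defined:
        std::set<int*, std::less<int*>> s;
        s.insert(p);
        s.insert(q);

        std::cout << "set holds " << s.size() << " distinct pointers\n";
        std::cout << std::boolalpha
                  << "std::less puts p before q: " << std::less<int*>()(p, q) << '\n';
        return 0;
    }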