Fast integer matrix multiplication with bit
I am asking whether bitwise operations can considerably speed up integer matrix multiplication. The matrices are small, and the elements are small nonnegative integers (small means at most 20).
To keep us focused, let's be extremely specific, and say that I have two 3x3 matrices, with integer entries 0<=x<15.
The following naive C++ implementation, executed a million times, takes around 1 s as measured with the Linux time command.
#include <random>

int main() {
    // Random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    // Bounds are inclusive, so this draws values from 0 to 15
    std::uniform_int_distribution<> distr(0, 15);
    int A[3][3];
    int B[3][3];
    int C[3][3];
    // Exactly one million trials
    for (int trials = 0; trials < 1000000; trials++) {
        // Set up A[] and B[], and clear C[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                A[i][j] = distr(eng);
                B[i][j] = distr(eng);
                C[i][j] = 0;
            }
        }
        // Compute C[] = A[] * B[]
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                for (int k = 0; k < 3; ++k) {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return 0;
}
Notes: A[] and B[] can each be encoded as a single 64-bit integer. Think of what would happen for just slightly larger matrices. Related: Binary matrix multiplication bit twiddling hack and What is the optimal algorithm for the game 2048?
The question you linked is about a matrix where every element is a single bit. For one-bit values a and b, a * b is exactly equivalent to a & b.
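As a rough sketch of what that equivalence buys you (the names popcount3 and dot_bits are made up for this example, and packing a row or column into the low 3 bits of an integer is an assumption of mine), the integer dot product of a 3-element bit row and bit column is just the number of positions where both bits are set:

```cpp
// Count the set bits among the low 3 bits of x (portable, no intrinsics).
inline int popcount3(unsigned x) {
    return static_cast<int>((x & 1u) + ((x >> 1) & 1u) + ((x >> 2) & 1u));
}

// rowA holds A[i][0..2] in bits 0..2; colB holds B[0..2][j] in bits 0..2.
// Since a * b == (a & b) for one-bit values, one AND does all three
// multiplies, and counting the set bits does the additions.
inline int dot_bits(unsigned rowA, unsigned colB) {
    return popcount3(rowA & colB);
}
```

For example, dot_bits(0b101, 0b111) is 2, the same as 1*1 + 0*1 + 1*1.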
For adding 2-bit elements, it might be plausible (and faster than unpacking) to add basically from scratch, with XOR (carryless-add), then generate the carry with AND, shift, and mask off carry across element boundaries.
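A minimal sketch of that 2-bit adder, assuming 32 independent 2-bit fields packed into a uint64_t and wrap-around (mod 4) within each field. The function name add_packed2 is hypothetical:

```cpp
#include <cstdint>

// Add 32 packed 2-bit fields element-wise, each field wrapping mod 4.
inline std::uint64_t add_packed2(std::uint64_t x, std::uint64_t y) {
    // Mask selecting the low bit of every 2-bit field.
    const std::uint64_t LOW = 0x5555555555555555ULL;
    const std::uint64_t sum = x ^ y;            // carryless add of every bit
    const std::uint64_t carry = (x & y) & LOW;  // carries out of the low bits only
    // Shift each carry into the high bit of its own field and fold it in
    // with XOR, so nothing can propagate into the neighbouring field.
    return sum ^ (carry << 1);
}
```

Masking the carry with LOW before the shift is what stops a carry out of a field's high bit from leaking across the element boundary.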
A 3rd bit would require detecting when adding the carry produces yet another carry. I don't think it would be a win to emulate even a 3-bit adder or multiplier, compared to using SIMD. Without SIMD (i.e. in pure C with uint64_t) it might make sense. For add, you might try using a normal add and then undoing the carry between element boundaries, instead of building an adder yourself out of XOR/AND/shift operations.
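One common way to get that effect, sketched here for 4-bit fields, does not literally undo the carry afterwards. Instead it masks off the top bit of every field before the normal add, so a carry can never cross a boundary, and then patches the top bits back in with XOR. The name add_packed4 is hypothetical, and results wrap mod 16 per field:

```cpp
#include <cstdint>

// Add 16 packed 4-bit fields element-wise, each field wrapping mod 16.
inline std::uint64_t add_packed4(std::uint64_t x, std::uint64_t y) {
    // Mask selecting the top bit of every 4-bit field.
    const std::uint64_t HIGH = 0x8888888888888888ULL;
    // Add only the low 3 bits of each field: at most 7 + 7 = 14, which
    // still fits in 4 bits, so no carry reaches the next field.
    const std::uint64_t low = (x & ~HIGH) + (y & ~HIGH);
    // Bit 3 of each field in 'low' is now the carry into the top bit;
    // XOR it with the two original top bits to finish the add.
    return low ^ ((x ^ y) & HIGH);
}
```

This costs one ordinary add plus a few bitwise operations regardless of field width, which is why it tends to beat a hand-built ripple-carry adder once the fields are wider than 2 bits.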
packed vs. unpacked-to-bytes storage formats
If you have very many of these tiny matrices, storing them in memory in compressed form (e.g. packed 4-bit elements) can help with cache footprint / memory bandwidth. 4-bit elements are fairly easy to unpack so that each one ends up in a separate byte element of a vector.
Otherwise, store them with one matrix element per byte. From there, you can easily unpack them to 16bit or 32bit per element if needed, depending on what element sizes the target SIMD instruction set provides. You might keep some matrices in local variables in unpacked format to reuse across multiplies, but pack them back into 4bits per element for storage in an array.
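To make the two storage formats concrete, here is a scalar sketch of converting between them. The function names pack3x3 and unpack3x3 and the row-major, lowest-nibble-first layout are my assumptions, not something the question specifies. A SIMD version would do the same job with shifts and masks on whole vectors.

```cpp
#include <cstdint>

// Pack nine 4-bit elements (row-major) into the low 36 bits of a word.
// Element i occupies bits 4*i .. 4*i+3.
inline std::uint64_t pack3x3(const std::uint8_t m[9]) {
    std::uint64_t w = 0;
    for (int i = 0; i < 9; ++i) {
        w |= static_cast<std::uint64_t>(m[i] & 0xF) << (4 * i);
    }
    return w;
}

// Unpack the same layout back to one element per byte.
inline void unpack3x3(std::uint64_t w, std::uint8_t out[9]) {
    for (int i = 0; i < 9; ++i) {
        out[i] = static_cast<std::uint8_t>((w >> (4 * i)) & 0xF);
    }
}
```

With this layout a 3x3 matrix takes 8 bytes in the packed array instead of 9 (one byte per element) or 36 (one int per element).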
Compilers suck at this with uint8_t in scalar C code for x86. See comments on @Richard's answer: gcc and clang both like to use mul r8 for uint8_t, which forces them to move data into eax (the implicit input/output for a one-operand multiply), rather than using imul r32, r32 and ignoring the garbage that leaves outside the low 8 bits of the destination register.
The uint8_t version actually runs slower than the uint16_t version, even though it has half the cache footprint.
You're probably going to get best results from some kind of SIMD.
Intel SSSE3 has a vector byte multiply, but only with adding of adjacent elements. Using it would require unpacking your matrix into a vector with some zeros between rows or something, so you don't get data from one row mixed with data from another row. Fortunately, pshufb can zero elements as well as copy them around.
More likely to be useful is SSE2 PMADDWD, if you unpack so that each matrix element is in a separate 16-bit vector element. So given a row in one vector, and a transposed column in another vector, pmaddwd (_mm_madd_epi16) is one horizontal add away from giving you the dot-product result you need for C[i][j].
Instead of doing each of those adds separately, you can probably pack multiple pmaddwd results into a single vector so you can store C[i][0..2] in one go.
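Here is a sketch of the single-element case only: one row, one transposed column, one pmaddwd, one horizontal add. The function name dot3 is hypothetical, and the scalar fallback is only there so the snippet still builds on targets without SSE2. A real implementation would load whole rows from memory and batch several results per vector, as described above, rather than building vectors with _mm_setr_epi16.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics

// Dot product of a row of A and a (transposed) column of B,
// with one 16-bit vector element per matrix element.
inline int dot3(const std::int16_t row[3], const std::int16_t col[3]) {
    const __m128i r = _mm_setr_epi16(row[0], row[1], row[2], 0, 0, 0, 0, 0);
    const __m128i c = _mm_setr_epi16(col[0], col[1], col[2], 0, 0, 0, 0, 0);
    // pmaddwd: multiply 16-bit pairs and add adjacent products, giving
    // 32-bit lanes [r0*c0 + r1*c1, r2*c2 + 0, 0, 0].
    const __m128i p = _mm_madd_epi16(r, c);
    // The one remaining horizontal add: lane 0 += lane 1.
    const __m128i s = _mm_add_epi32(p, _mm_srli_si128(p, 4));
    return _mm_cvtsi128_si32(s);
}
#else
// Scalar fallback for non-x86 targets.
inline int dot3(const std::int16_t row[3], const std::int16_t col[3]) {
    return row[0] * col[0] + row[1] * col[1] + row[2] * col[2];
}
#endif
```

With elements no larger than 15, each dot product is at most 3 * 15 * 15 = 675, so the 32-bit lanes that pmaddwd produces have plenty of headroom.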
You may find that reducing the data size gives you a considerable performance improvement if you are performing this calculation over a large number of matrices:
#include <cstdint>
#include <cstdlib>

using T = std::uint_fast8_t;

void mpy(T A[3][3], T B[3][3], T C[3][3])
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
}
The Pentium can move and sign-extend an 8-bit value in one instruction. This means you're getting 4 times as many matrices per cache line as with 32-bit elements.
UPDATE: curiosity piqued, I wrote a test:
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <iterator>
#include <memory>
#include <random>
#include <typeinfo>
#include <utility>
#include <vector>

template<class T>
struct matrix
{
    static constexpr std::size_t rows = 3;
    static constexpr std::size_t cols = 3;
    static constexpr std::size_t size() { return rows * cols; }

    // Fill every element with a value drawn from dist
    template<class Engine, class U>
    matrix(Engine& engine, std::uniform_int_distribution<U>& dist)
        : matrix(std::make_index_sequence<size()>(), engine, dist)
    {}

    template<class U>
    matrix(std::initializer_list<U> li)
        : matrix(std::make_index_sequence<size()>(), li)
    {}

    matrix()
        : _data { 0 }
    {}

    const T* operator[](std::size_t i) const {
        return std::addressof(_data[i * cols]);
    }

    T* operator[](std::size_t i) {
        return std::addressof(_data[i * cols]);
    }

private:
    // The cast avoids a narrowing error when the distribution's type is wider than T
    template<std::size_t...Is, class U, class Engine>
    matrix(std::index_sequence<Is...>, Engine& eng, std::uniform_int_distribution<U>& dist)
        : _data { (void(Is), static_cast<T>(dist(eng)))... }
    {}

    template<std::size_t...Is, class U>
    matrix(std::index_sequence<Is...>, std::initializer_list<U> li)
        : _data { static_cast<T>((Is < li.size()) ? *(li.begin() + Is) : 0)... }
    {}

    T _data[rows * cols];
};

template<class T>
matrix<T> operator*(const matrix<T>& A, const matrix<T>& B)
{
    matrix<T> C;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            for (int k = 0; k < 3; ++k) {
                C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

static constexpr std::size_t test_size = 1000000;

template<class T, class Engine, class U>
void fill(std::vector<matrix<T>>& v, Engine& eng, std::uniform_int_distribution<U>& dist)
{
    v.clear();
    v.reserve(test_size);
    std::generate_n(std::back_inserter(v), test_size,
                    [&] { return matrix<T>(eng, dist); });
}

template<class T>
void test(std::random_device& rd)
{
    std::mt19937 eng(rd());
    // uniform_int_distribution is not specified for 8-bit types,
    // so always draw ints and narrow them to T when storing
    std::uniform_int_distribution<int> distr(0, 15);
    std::vector<matrix<T>> As, Bs, Cs;
    fill(As, eng, distr);
    fill(Bs, eng, distr);
    fill(Cs, eng, distr);

    // Time only the one million multiplications
    auto start = std::chrono::high_resolution_clock::now();
    auto ia = As.cbegin();
    auto ib = Bs.cbegin();
    for (auto& m : Cs)
    {
        m = *ia++ * *ib++;
    }
    auto stop = std::chrono::high_resolution_clock::now();

    auto diff = stop - start;
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(diff).count();
    std::cout << "for type " << typeid(T).name() << " time is " << micros << "us" << std::endl;
}

int main() {
    // Random number generator
    std::random_device rd;
    test<std::uint64_t>(rd);
    test<std::uint32_t>(rd);
    test<std::uint16_t>(rd);
    test<std::uint8_t>(rd);
}
example output (recent macbook pro, 64-bit, compiled with -O3)
for type y time is 32787us
for type j time is 15323us
for type t time is 14347us
for type h time is 31550us
summary:
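On this platform, the 32-bit and 16-bit versions were about as fast as each other. The 64-bit and 8-bit versions were about equally slow (the 8-bit result surprised me). In the output above, y, j, t and h are the mangled type names that typeid reports for the 64-, 32-, 16- and 8-bit unsigned types respectively.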
conclusion:
As ever, express intent to the compiler and let the optimiser do its thing. If the program is running too slowly in production, take measurements and optimise the worst-offenders.