TensorFlow: simultaneous prediction on GPU and CPU

2018-06-13 21:32:59

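I'm working with TensorFlow and would like to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.

I tried creating two threads that feed two different TensorFlow sessions (one running on the CPU, the other on the GPU). Each thread feeds a fixed number of batches in a loop (for example, with 100 batches in total, I'd like to assign 20 batches to the CPU and 80 to the GPU, or any other combination of the two), and the results are then combined. It would be even better if the split were done automatically.

However, even in this scenario the batches seem to be fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (with the GPU as the bottleneck), the overall prediction time is always higher than in the test that uses the GPU alone.

I would expect it to be faster, because when only the GPU is working the CPU usage is about 20-30%, so there is some CPU capacity available to speed up the computation.

I have read many discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and a CPU.

Here is a sample of the code I have written. The tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way: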

with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)

The prediction is then done as follows:

def predict_on_device(session, predict_tensor, batches):
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})


def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()

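How can I achieve this kind of CPU/GPU parallelization? I think I'm missing something.

Any kind of help would be much appreciated!

Here is my code, which demonstrates how CPU and GPU execution can run in parallel: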

import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

data_cpu = np.random.uniform(size=[n//16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n    , n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x : data})


with tf.Session(config=tf.ConfigProto(log_device_placement=True, intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    threads = []

    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()

    for t in threads:
        t.start()

    coord.join(threads)

    t1 = time()


print(t1 - t0)

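The timing results are:

CPU thread alone: 4-5 s (this will of course vary by machine).

GPU thread alone: 5 s (it does 16x more work).

Both at the same time: 5 s.

Note that there was no need to have 2 sessions (although that worked for me as well).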
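
The comparison behind these timings can be sketched without TensorFlow. This is only a minimal illustration, not the code that produced the numbers above: `fake_run` is a hypothetical stand-in that sleeps instead of calling `session.run`, on the assumption that the real call releases Python's global interpreter lock while the device is busy. If the two calls overlap, the concurrent wall time should be close to the slower call alone rather than to the sum of both.

```python
import time
from threading import Thread


def fake_run(seconds):
    # Hypothetical stand-in for session.run(...). Sleeping releases the GIL,
    # which is assumed to be what the real call does while the device works.
    time.sleep(seconds)


def timed(fn):
    # Return the wall-clock time taken by fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_serial(durations):
    # Run each workload one after the other.
    for d in durations:
        fake_run(d)


def run_concurrent(durations):
    # Run each workload in its own thread and wait for all of them.
    threads = [Thread(target=fake_run, args=(d,)) for d in durations]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


durations = [0.2, 0.3]  # illustrative "CPU" and "GPU" workloads, in seconds
serial_time = timed(lambda: run_serial(durations))
concurrent_time = timed(lambda: run_concurrent(durations))
print("serial: %.2fs, concurrent: %.2fs" % (serial_time, concurrent_time))
```

If the concurrent time in a real measurement is close to the serial time instead, the two workloads are probably not overlapping and one of the causes listed in this answer may apply.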

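Reasons you might be seeing different results include:

contention for system resources (GPU execution consumes some host system resources, and running the CPU thread alongside it may degrade its performance)

incorrect time measurement

parts of the model that can run only on the GPU or only on the CPU

a bottleneck elsewhere

some other problem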
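
A hand-picked split can itself leave one device waiting for the other. One possible way to get the automatic split mentioned in the question is a shared work queue from which each device thread pulls the next batch when it is free. The sketch below is an assumption-laden illustration, not a tested TensorFlow recipe: `worker`, `predict_dynamic`, and the per-device delays are hypothetical, and `time.sleep` plus `sum` stand in for a real `session.run` call on that device.

```python
import queue
import time
from threading import Thread


def worker(name, delay, tasks, results, counts):
    # Pull batches until the queue is empty, so a faster device takes more.
    while True:
        try:
            index, batch = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay)            # stand-in for the device's compute time
        results[index] = sum(batch)  # stand-in for session.run(...) output
        counts[name] += 1            # each thread only touches its own key


def predict_dynamic(batches, devices):
    # devices maps a (hypothetical) device name to a simulated per-batch delay.
    tasks = queue.Queue()
    for item in enumerate(batches):
        tasks.put(item)
    results = [None] * len(batches)  # filled by index to keep input order
    counts = {name: 0 for name in devices}
    threads = [
        Thread(target=worker, args=(name, delay, tasks, results, counts))
        for name, delay in devices.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, counts


batches = [[i, i + 1] for i in range(20)]
results, counts = predict_dynamic(batches, {"cpu": 0.02, "gpu": 0.005})
print(counts)
```

With this layout no batch count has to be fixed in advance; the device with the shorter delay should simply end up processing more of the batches, and the results stay in input order because each one is stored by its index.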
