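Goal

Thrust is an extremely powerful library for a wide range of CUDA-accelerated algorithms. However, Thrust is designed to work with vectors, not pitched matrices. This tutorial discusses how to wrap a cv::cuda::GpuMat in Thrust iterators that can be used with Thrust algorithms.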

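This tutorial will show you how to:

Wrap a GpuMat into a Thrust iterator
Fill a GpuMat with random numbers
Sort a column of a GpuMat in place
Copy values greater than 0 to a new GPU matrix
Use streams with Thrust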

Wrap a GpuMat into a Thrust iterator

The following code produces an iterator for a GpuMat:

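#include <opencv2/core/cuda.hpp>
#include <thrust/device_ptr.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// step_functor<T> (shown further down) must be declared before these functions in a real source file.

/*
@brief GpuMatBeginItr returns a Thrust-compatible iterator to the beginning of a GPU mat's memory.
@param mat is the input matrix
@param channel is the channel of the matrix that the iterator is accessing. If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatBeginItr(cv::cuda::GpuMat mat, int channel = 0)
{
    if (channel == -1)
    {
        // Treat the matrix as single channel so every element is visited in order
        mat = mat.reshape(1);
        channel = 0;
    }
    CV_Assert(mat.depth() == cv::DataType<T>::depth);
    CV_Assert(channel < mat.channels());
    return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}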

/*
@brief GpuMatEndItr returns a Thrust-compatible iterator to the end of a GPU mat's memory.
@param mat is the input matrix
@param channel is the channel of the matrix that the iterator is accessing. If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatEndItr(cv::cuda::GpuMat mat, int channel = 0)
{
    if (channel == -1)
    {
        mat = mat.reshape(1);
        channel = 0;
    }
    CV_Assert(mat.depth() == cv::DataType<T>::depth);
    CV_Assert(channel < mat.channels());
    // Same base pointer as the begin iterator; only the counting iterator starts at rows * cols
    return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
        thrust::make_transform_iterator(thrust::make_counting_iterator(mat.rows*mat.cols), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}

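Our goal is an iterator that starts at the beginning of the matrix and increments correctly to access successive matrix elements. This is trivial for a continuous row, but what about a column of a pitched matrix? For that, the iterator needs to know the matrix dimensions and step. This information is embedded in the step_functor.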

#include <thrust/functional.h>

template<typename T> struct step_functor : public thrust::unary_function<int, int>
{
    int columns;   // elements per row
    int step;      // row pitch, in units of T
    int channels;  // interleaved channels per element
    __host__ __device__ step_functor(int columns_, int step_, int channels_ = 1) : columns(columns_), step(step_), channels(channels_) { }
    __host__ step_functor(cv::cuda::GpuMat& mat)
    {
        CV_Assert(mat.depth() == cv::DataType<T>::depth);
        columns = mat.cols;
        step = mat.step / sizeof(T);
        channels = mat.channels();
    }
    __host__ __device__
    int operator()(int x) const
    {
        // Row offset uses the pitch; column offset skips over the interleaved channels
        int row = x / columns;
        int idx = (row * step) + (x % columns)*channels;
        return idx;
    }
};

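The step functor takes an index value and returns the appropriate offset from the beginning of the matrix. The counting iterator simply increments over the range of pixel elements. Combined into a transform_iterator, they give an iterator that counts from 0 to M*N and increments correctly to account for the pitched memory of a GpuMat. Unfortunately, this does not include any memory location information; for that we need a thrust::device_ptr. By combining the device pointer with the transform_iterator in a permutation_iterator, we can point Thrust at the first element of our matrix and have it step accordingly.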
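The index arithmetic can be checked on the host without a GPU. The following is a minimal sketch, assuming a plain std::vector stands in for the pitched device buffer; HostStepFunctor and gatherChannel are hypothetical names introduced here for illustration and are not part of OpenCV or Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only stand-in for step_functor: maps a linear element index to an
// offset (in units of the element type) inside a pitched buffer.
struct HostStepFunctor
{
    int columns;   // elements per row (mat.cols)
    int step;      // row pitch in elements (mat.step / sizeof(T))
    int channels;  // interleaved channels per element (mat.channels())

    int operator()(int x) const
    {
        int row = x / columns;
        return row * step + (x % columns) * channels;
    }
};

// Hypothetical helper: read one channel of a pitched buffer in row-major
// order, visiting the same offsets the permutation iterator would visit.
std::vector<int> gatherChannel(const std::vector<int>& pitched, int rows, int columns,
                               int step, int channels, int channel)
{
    HostStepFunctor f{columns, step, channels};
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(rows) * columns);
    for (int x = 0; x < rows * columns; ++x)
        out.push_back(pitched[channel + f(x)]);  // base pointer offset by the channel
    return out;
}
```

For a 2x3 single-channel matrix with a pitch of 4 elements, indices 0 to 5 map to offsets 0, 1, 2, 4, 5, 6, so the padding element at the end of each row is skipped.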

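Fill a GpuMat with random numbers

Now that we have some nice functions for making iterators for Thrust, let's use them to do something OpenCV can't. Unfortunately, at the time of this writing, OpenCV doesn't have any GPU random number generation. Thankfully Thrust does, and interoperating between the two is now straightforward. The example is taken from http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust

First we need to write a functor that will produce our random values.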

#include <thrust/random.h>

struct prg
{
    float a, b;  // bounds of the uniform distribution

    __host__ __device__
    prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {}

    __host__ __device__
    float operator()(const unsigned int n) const
    {
        thrust::default_random_engine rng;
        thrust::uniform_real_distribution<float> dist(a, b);
        // Skip ahead by n so that each element index draws a different value
        rng.discard(n);
        return dist(rng);
    }
};

This takes an integer value and outputs a value between a and b. Now we will populate our matrix with values between -1 and 1 using a Thrust transform.

{
    cv::cuda::GpuMat d_value(1, 100, CV_32F);
    auto valueBegin = GpuMatBeginItr<float>(d_value);
    auto valueEnd = GpuMatEndItr<float>(d_value);
    // Element n receives prg(-1, 1)(n); the counting iterator supplies n
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));

    // Download the result to the host for inspection
    cv::Mat h_value(d_value);
}
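The pattern used by prg, where every call constructs an identically seeded engine and discards n values, can be tried on the host with the standard <random> header. This is a minimal sketch under that assumption: HostPrg and fillRandom are hypothetical names, and the values produced are not claimed to match those generated by Thrust on the device.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Host-only analogue of prg: element n receives the n-th draw of one fixed sequence.
struct HostPrg
{
    float a, b;  // bounds of the uniform distribution

    explicit HostPrg(float a_ = 0.f, float b_ = 1.f) : a(a_), b(b_) {}

    float operator()(unsigned int n) const
    {
        std::minstd_rand rng;                              // default seed, identical on every call
        std::uniform_real_distribution<float> dist(a, b);
        rng.discard(n);                                    // skip ahead so each n sees a different engine state
        return dist(rng);
    }
};

// Hypothetical helper: fill a host vector the way thrust::transform fills the GpuMat,
// feeding the element index to the functor.
std::vector<float> fillRandom(unsigned int count, float a, float b)
{
    HostPrg gen(a, b);
    std::vector<float> out(count);
    for (unsigned int n = 0; n < count; ++n)
        out[n] = gen(n);
    return out;
}
```

Because the engine is reseeded identically on every call, the result depends only on the element index, which is what allows each GPU thread to compute its value independently. It also means that repeated runs produce the same sequence.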

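Sort a column of a GpuMat in place

Let's fill a two-channel matrix with random values in one channel and an index in the other. Afterwards we will sort the random numbers together with their indices.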

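{
    cv::cuda::GpuMat d_data(1, 100, CV_32SC2);
    // Thrust-compatible begin and end iterators to channel 1 of this matrix
    auto keyBegin = GpuMatBeginItr<int>(d_data, 1);
    auto keyEnd = GpuMatEndItr<int>(d_data, 1);
    // Thrust-compatible begin and end iterators to channel 0 of this matrix
    auto idxBegin = GpuMatBeginItr<int>(d_data, 0);
    auto idxEnd = GpuMatEndItr<int>(d_data, 0);
    // Fill the index channel with the sequence 0, 1, ..., 99
    thrust::sequence(idxBegin, idxEnd);
    // Fill the key channel with random numbers between 0 and 10. A counting iterator is used here to give
    // an integer value for each location as an input to prg::operator(). The float results are converted
    // to int when they are stored in the CV_32SC2 matrix.
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_data.cols), keyBegin, prg(0, 10));
    // Sort the key channel and the index channel so that each key stays with its index
    thrust::sort_by_key(keyBegin, keyEnd, idxBegin);

    // Download the sorted keys and indices to the host
    cv::Mat h_idx(d_data);
}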
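The effect of sorting one channel by key while carrying the other channel along can be reproduced on the host. The sketch below assumes an interleaved [index, key, index, key, ...] layout like the CV_32SC2 matrix above; sortInterleavedByKey is a hypothetical helper, and it uses std::stable_sort, which is a choice made for this sketch rather than a statement about how thrust::sort_by_key orders equal keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host helper: sort an interleaved [index, key, index, key, ...]
// buffer by key, keeping each index attached to its key.
void sortInterleavedByKey(std::vector<int>& data)
{
    const std::size_t n = data.size() / 2;
    std::vector<std::pair<int, int>> pairs(n);  // (key, index)
    for (std::size_t i = 0; i < n; ++i)
        pairs[i] = {data[2 * i + 1], data[2 * i]};

    // Compare keys only; stable_sort keeps equal keys in their original order
    std::stable_sort(pairs.begin(), pairs.end(),
                     [](const std::pair<int, int>& l, const std::pair<int, int>& r) { return l.first < r.first; });

    for (std::size_t i = 0; i < n; ++i)
    {
        data[2 * i]     = pairs[i].second;  // channel 0: index
        data[2 * i + 1] = pairs[i].first;   // channel 1: key
    }
}
```

After the call, reading channel 0 gives the original positions of the sorted keys, which is the same information the downloaded h_idx matrix carries in the GPU example.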

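Copy values greater than 0 to a new GPU matrix while using streams

In this example we will see how cv::cuda::Stream objects can be used with Thrust. Unfortunately, this specific example uses functions that must return results to the CPU, so it is not the optimal use of streams.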

{
    cv::cuda::GpuMat d_value(1, 100, CV_32F);
    auto valueBegin = GpuMatBeginItr<float>(d_value);
    auto valueEnd = GpuMatEndItr<float>(d_value);
    cv::cuda::Stream stream;
    // Same as the random generation code from before, except now the transformation is performed on a stream
    thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
    // Count the number of values we are going to copy
    int count = thrust::count_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, pred_greater<float>(0.0));
    // Allocate a destination for the copied values
    cv::cuda::GpuMat d_valueGreater(1, count, CV_32F);
    // Copy the values that satisfy the predicate
    thrust::copy_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, GpuMatBeginItr<float>(d_valueGreater), pred_greater<float>(0.0));
    // Download the copied values to the host
    cv::Mat h_greater(d_valueGreater);
}

First we populate a GPU mat with randomly generated data between -1 and 1 on a stream.

// Same as the random generation code from before, except now the transformation is performed on a stream
thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));

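Notice the use of thrust::system::cuda::par.on(...), which creates an execution policy for executing Thrust code on a stream. There is a bug in the version of Thrust distributed with the CUDA toolkit; as of version 7.5 it has not been fixed. This bug causes code not to execute on streams. It can, however, be fixed by using the newest version of Thrust from the git repository (http://github.com/thrust/thrust.git).

Next we determine how many values are greater than 0 by using thrust::count_if with the following predicate: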

template<typename T> struct pred_greater
{
    T value;
    __host__ __device__ pred_greater(T value_) : value(value_) {}
    __host__ __device__ bool operator()(const T& val) const
    {
        return val > value;
    }
};

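We use that count to create an output buffer for the copied values, and then use copy_if with the same predicate to populate the output buffer. Lastly, we download the values into a CPU mat for viewing.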
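The count-then-copy pattern is the same on the host, where the standard algorithms std::count_if and std::copy_if play the roles of their Thrust counterparts. The following is a minimal sketch under that assumption; PredGreater and copyGreater are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only version of the predicate: true when val is strictly greater than value.
template <typename T>
struct PredGreater
{
    T value;
    explicit PredGreater(T value_) : value(value_) {}
    bool operator()(const T& val) const { return val > value; }
};

// Hypothetical helper: count first so the output can be sized exactly,
// then copy the matching values with the same predicate.
std::vector<float> copyGreater(const std::vector<float>& input, float threshold)
{
    PredGreater<float> pred(threshold);
    const std::size_t count =
        static_cast<std::size_t>(std::count_if(input.begin(), input.end(), pred));
    std::vector<float> output(count);  // plays the role of d_valueGreater
    std::copy_if(input.begin(), input.end(), output.begin(), pred);
    return output;
}
```

The first pass is needed because the destination has to be allocated before the copy. In the GPU version, that count must come back to the CPU before the output GpuMat can be created, which is why the example is not an ideal use of streams.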
