Filtering 1bpp images

2018-06-08 09:12:10

I'm looking to filter a 1 bit per pixel image using a 3x3 filter: for each input pixel, the corresponding output pixel is set to 1 if the weighted sum of the pixels surrounding it (with weights determined by the filter) exceeds some threshold.

I was hoping that this would be more efficient than converting to 8 bpp and then filtering that, but I can't think of a good way to do it. A naive method is to keep track of nine pointers to bytes (three consecutive rows and also pointers to either side of the current byte in each row, for calculating the output for the first and last bits in these bytes) and for each input pixel compute

sum = filter[0] * ((*lastRowPtr & aMask) != 0) + filter[1] * ((*lastRowPtr & bMask) != 0) + ... + filter[8] * ((*nextRowPtr & hMask) != 0) ,

with extra faff for bits at the edge of a byte. However, this is slow and seems really ugly. You're not gaining any parallelism from the fact that you've got eight pixels in each byte and instead are having to do tonnes of extra work masking things.

Are there any good sources on how best to do this sort of thing? A solution to this particular problem would be amazing, but I'd be happy being pointed to any examples of efficient image processing on 1bpp images in C/C++. I'd like to replace some more 8 bpp stuff with 1 bpp algorithms in future to avoid image conversions and copying, so any general resources on this would be appreciated.

I found a number of years ago that unpacking the bits to bytes, doing the filter, then packing the bytes back to bits was faster than working with the bits directly. It seems counter-intuitive because it's 3 loops instead of 1, but the simplicity of each loop more than made up for it.

I can't guarantee that it's still the fastest; compilers and especially processors are prone to change. However, simplifying each loop not only makes it easier to optimize, it also makes it easier to read. That's got to be worth something.

A further advantage to unpacking to a separate buffer is that it gives you flexibility for what you do at the edges. By making the buffer 2 bytes larger than the unpacked row, you unpack starting at byte 1, then set the first and last bytes to whatever you like, and the filtering loop doesn't have to worry about boundary conditions at all.
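A minimal C++ sketch of the unpack / filter / pack idea, not a tuned implementation. The function name filterViaUnpack, the MSB-first bit order, the row stride of (width + 7) / 8 bytes, the zero border, and the "sum > threshold" test are all assumptions of mine rather than anything stated in the question. The border is extended to a full row above and below as well as one byte on each side, so the filter loop needs no edge cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: rows packed MSB-first, (width + 7) / 8 bytes per row.
// Pixels outside the image are treated as 0 (one possible edge policy).
std::vector<std::uint8_t> filterViaUnpack(const std::vector<std::uint8_t>& src,
                                          int width, int height,
                                          const int kernel[9], int threshold) {
    const int stride = (width + 7) / 8;
    assert(src.size() == static_cast<std::size_t>(stride) * height);
    const int padded = width + 2;  // one padding byte on each side of a row

    // Loop 1: unpack bits to bytes inside a zero-filled, padded buffer.
    std::vector<std::uint8_t> bytes(static_cast<std::size_t>(padded) * (height + 2), 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            bytes[(y + 1) * padded + (x + 1)] =
                (src[y * stride + (x >> 3)] >> (7 - (x & 7))) & 1u;

    // Loop 2: plain 3x3 weighted sum; the border removes all boundary checks.
    std::vector<std::uint8_t> hit(static_cast<std::size_t>(width) * height, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const std::uint8_t* p = &bytes[(y + 1) * padded + (x + 1)];
            int sum = 0;
            for (int ky = -1; ky <= 1; ++ky)
                for (int kx = -1; kx <= 1; ++kx)
                    sum += kernel[(ky + 1) * 3 + (kx + 1)] * p[ky * padded + kx];
            hit[y * width + x] = sum > threshold ? 1 : 0;
        }

    // Loop 3: pack the byte results back to bits.
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(stride) * height, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (hit[y * width + x])
                dst[y * stride + (x >> 3)] |= static_cast<std::uint8_t>(0x80u >> (x & 7));
    return dst;
}
```

Changing the edge behaviour (for example replicating the edge pixels instead of using zeros) would only touch how the border bytes are filled, which is the flexibility mentioned above.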

Look into separable filters. Among other things, they allow massive parallelism in the cases where they work.

For example, in your 3x3 sample-weight-and-filter case:

Sample 1x3 (horizontal) pixels into a buffer. This can be done in isolation for each pixel, so a 1024x1024 image can run 1024^2 simultaneous tasks, all of which perform 3 samples.

Sample 3x1 (vertical) pixels from the buffer. Again, this can be done on every pixel simultaneously.

Use the contents of the buffer to cull pixels from the original texture.

The advantage to this approach, mathematically, is that it cuts the number of sample operations per pixel from n^2 to 2n for an n x n filter (9 to 6 for 3x3), although it requires a buffer of equal size to the source (if you're already performing a copy, that can be used as the buffer; you just can't modify the original source for step 2). In order to keep memory use at twice the image size, you can perform steps 2 and 3 together (this is a bit tricky and not entirely pleasant); if memory isn't an issue, you can spend three times the image size on three buffers (source, hblur, vblur).

Because each operation is working in complete isolation from an immutable source, you can perform the filter on every pixel simultaneously if you have enough cores. Or, in a more realistic scenario, you can take advantage of paging and caching to load and process a single column or row. This is convenient when working with odd strides, padding at the end of a row, etc. The second round of samples (vertical) may screw with your cache, but at the very worst, one round will be cache-friendly and you've cut the per-pixel work from quadratic to linear in the filter width.

Now, I've yet to touch on the case of storing data in bits specifically. That does make things slightly more complicated, but not terribly much so. Assuming you can use a rolling window, something like:

d = s[x-1] + s[x] + s[x+1]

works. Interestingly, if you were to rotate the image 90 degrees during the output of step 1 (trivial, sample from (y,x) when reading), you can get away with loading at most two horizontally adjacent bytes for any sample, and only a single byte something like 75% of the time. This plays a little less friendly with cache during the read, but greatly simplifies the algorithm (enough that it may regain the loss).
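To make the "at most two adjacent bytes" point concrete, here is a small C++ sketch of reading the three-pixel window straight from a packed row. The names window3 and sum3 are hypothetical, and it assumes MSB-first packing, zero padding bits at the end of the row, and zeros outside the row. Only pixels at bit position 0 or 7 of a byte (2 out of 8) need a second byte.

```cpp
#include <cassert>
#include <cstdint>

// Returns pixels x-1, x, x+1 of a packed row as a 3-bit value
// (x-1 in bit 2, x in bit 1, x+1 in bit 0). Assumes MSB-first packing.
unsigned window3(const std::uint8_t* row, int stride, int x) {
    assert(x >= 0 && x < stride * 8);
    const int b = x >> 3;  // byte holding pixel x
    const int k = x & 7;   // position inside that byte, 0 = leftmost
    if (k == 0) {          // left neighbour lives in the previous byte
        const unsigned prev = b > 0 ? row[b - 1] : 0u;
        return ((prev & 1u) << 2) | (row[b] >> 6);
    }
    if (k == 7) {          // right neighbour lives in the next byte
        const unsigned next = b + 1 < stride ? row[b + 1] : 0u;
        return ((row[b] & 3u) << 1) | (next >> 7);
    }
    return (row[b] >> (6 - k)) & 7u;  // common case: a single byte
}

// d = s[x-1] + s[x] + s[x+1] for 1-bit samples.
int sum3(const std::uint8_t* row, int stride, int x) {
    const unsigned w = window3(row, stride, x);
    return static_cast<int>((w & 1u) + ((w >> 1) & 1u) + (w >> 2));
}
```

For a weighted rather than uniform filter, the three bits returned by window3 could index a small lookup table of precomputed weighted sums instead of being counted.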

Pseudo-code:

buffer source, dest, vbuf, hbuf;

for_each (y, x)   // Loop over each row, then each column. Generally works better wrt paging
{
    hbuf(x, y) = (source(y, x-1) + source(y, x) + source(y, x+1)) / 3   // swap x and y to spin 90 degrees
}
for_each (y, x)
{
    vbuf(x, y) = (hbuf(y, x-1) + hbuf(y, x) + hbuf(y, x+1)) / 3    // swap x and y again to undo the 90 degree spin
}
for_each (y, x)
{
    dest(x, y) = threshold(vbuf(x, y))
}

Accessing bits within the bytes (source(x, y) indicates an access/sample) is relatively simple to do, but kind of a pain to write out here, so it is left to the reader. The principle, particularly implemented in this fashion (with the 90 degree rotation), only requires 2 passes of n samples each, and always samples from immediately adjacent bits/bytes (never requiring you to calculate the position of the bit in the next row). All in all, it should be considerably faster and simpler than sampling the full 3x3 neighbourhood for every pixel, provided the filter is separable.
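A runnable C++ sketch of the two-pass scheme for the simplest separable case, a 3x3 box filter with all weights equal to 1. It keeps sums rather than averages so everything stays in integers. The names pixelAt and boxThreshold, the MSB-first packing, the zero edge policy and the "sum > threshold" test are my assumptions, and the per-bit accessor is written for clarity rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accessor: pixel (x, y) of an MSB-first packed image, 0 outside it.
static int pixelAt(const std::vector<std::uint8_t>& img, int stride,
                   int width, int height, int x, int y) {
    if (x < 0 || y < 0 || x >= width || y >= height) return 0;
    return (img[static_cast<std::size_t>(y) * stride + (x >> 3)] >> (7 - (x & 7))) & 1;
}

std::vector<std::uint8_t> boxThreshold(const std::vector<std::uint8_t>& src,
                                       int width, int height, int threshold) {
    const int stride = (width + 7) / 8;
    assert(src.size() == static_cast<std::size_t>(stride) * height);
    const std::size_t count = static_cast<std::size_t>(width) * height;

    // Pass 1: horizontal 3-sums, stored transposed (one hbuf row per source column).
    std::vector<std::uint8_t> hbuf(count);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            hbuf[x * height + y] = static_cast<std::uint8_t>(
                pixelAt(src, stride, width, height, x - 1, y) +
                pixelAt(src, stride, width, height, x, y) +
                pixelAt(src, stride, width, height, x + 1, y));

    // Pass 2: neighbours along an hbuf row are vertical neighbours in the source.
    // Writing to (y, x) transposes the result back to the source orientation.
    std::vector<std::uint8_t> vbuf(count);
    for (int x = 0; x < width; ++x) {
        const std::uint8_t* col = &hbuf[x * height];
        for (int y = 0; y < height; ++y) {
            int sum = col[y];
            if (y > 0) sum += col[y - 1];
            if (y + 1 < height) sum += col[y + 1];
            vbuf[y * width + x] = static_cast<std::uint8_t>(sum);  // 0..9
        }
    }

    // Pass 3: threshold the 3x3 counts and pack back to 1bpp.
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(stride) * height, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (vbuf[y * width + x] > threshold)
                dst[y * stride + (x >> 3)] |= static_cast<std::uint8_t>(0x80u >> (x & 7));
    return dst;
}
```

A general 3x3 kernel only splits into two 1D passes like this when it can be written as the product of a column vector and a row vector; otherwise the full nine-tap sum is still needed.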

Rather than expanding the entire image to 1 bit/byte (or 8bpp, essentially, as you noted), you can simply expand the current window - read the first byte of the first row, shift and mask, then read out the three bits you need; do the same for the other two rows. Then, for the next window, you simply discard the left column and fetch one more bit from each row. The logic and code to do this right isn't as easy as simply expanding the entire image, but it'll take a lot less memory.
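A rough C++ sketch of that sliding window for one output row. Each of the three rows keeps a 3-bit window; moving one pixel to the right shifts the window left and pulls in a single new bit. The names bitAt and filterRowSliding, MSB-first packing, zeros outside the image and the "sum > threshold" test are assumptions on my part. A tuned version would read whole bytes instead of calling bitAt per pixel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pixel x of a packed MSB-first row; 0 if the row is
// missing (nullptr at the top/bottom edge) or x is out of range.
static unsigned bitAt(const std::uint8_t* row, int width, int x) {
    if (!row || x < 0 || x >= width) return 0u;
    return (row[x >> 3] >> (7 - (x & 7))) & 1u;
}

// Filters one row. `out` must hold (width + 7) / 8 bytes.
void filterRowSliding(const std::uint8_t* above, const std::uint8_t* cur,
                      const std::uint8_t* below, int width,
                      const int kernel[9], int threshold, std::uint8_t* out) {
    assert(cur && out && width > 0);
    const std::uint8_t* rows[3] = {above, cur, below};
    for (int i = 0; i < (width + 7) / 8; ++i) out[i] = 0;

    // Prime each window with pixel 0 in its lowest bit; the first shift in the
    // loop then produces pixels (-1, 0, 1).
    unsigned win[3];
    for (int r = 0; r < 3; ++r) win[r] = bitAt(rows[r], width, 0);

    for (int x = 0; x < width; ++x) {
        int sum = 0;
        for (int r = 0; r < 3; ++r) {
            // Discard the left column, fetch one more bit on the right.
            win[r] = ((win[r] << 1) | bitAt(rows[r], width, x + 1)) & 7u;
            sum += kernel[r * 3 + 0] * static_cast<int>((win[r] >> 2) & 1u)
                 + kernel[r * 3 + 1] * static_cast<int>((win[r] >> 1) & 1u)
                 + kernel[r * 3 + 2] * static_cast<int>(win[r] & 1u);
        }
        if (sum > threshold)
            out[x >> 3] |= static_cast<std::uint8_t>(0x80u >> (x & 7));
    }
}
```

The only extra state is three small integers, which is where the memory saving over expanding the whole image comes from.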

As a middle ground, you could just expand the three rows you're currently working on. Probably easier to code that way.
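A C++ sketch of that middle ground, keeping only three unpacked rows alive and reusing them in rotation as the filter moves down the image. The function name filterThreeRows, the slot rotation scheme, MSB-first packing, the zero edge policy and the "sum > threshold" test are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> filterThreeRows(const std::vector<std::uint8_t>& src,
                                          int width, int height,
                                          const int kernel[9], int threshold) {
    const int stride = (width + 7) / 8;
    assert(src.size() == static_cast<std::size_t>(stride) * height);
    const int padded = width + 2;  // one zero byte on each side of a row

    // Three unpacked rows; image row r lives in slot (r + 1) % 3.
    std::vector<std::uint8_t> rowBuf[3] = {
        std::vector<std::uint8_t>(padded, 0),
        std::vector<std::uint8_t>(padded, 0),
        std::vector<std::uint8_t>(padded, 0)};

    // Unpack row y into dst; rows outside the image become all zeros.
    auto unpack = [&](int y, std::vector<std::uint8_t>& dst) {
        std::fill(dst.begin(), dst.end(), 0);
        if (y < 0 || y >= height) return;
        for (int x = 0; x < width; ++x)
            dst[x + 1] = (src[y * stride + (x >> 3)] >> (7 - (x & 7))) & 1u;
    };

    std::vector<std::uint8_t> dst(static_cast<std::size_t>(stride) * height, 0);
    unpack(-1, rowBuf[0]);
    unpack(0, rowBuf[1]);
    for (int y = 0; y < height; ++y) {
        // Overwrite the slot that held row y - 2 with row y + 1.
        unpack(y + 1, rowBuf[(y + 2) % 3]);
        const std::uint8_t* rows[3] = {rowBuf[y % 3].data(),         // row y - 1
                                       rowBuf[(y + 1) % 3].data(),   // row y
                                       rowBuf[(y + 2) % 3].data()};  // row y + 1
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c)
                    sum += kernel[r * 3 + c] * rows[r][x + c];  // padded index x + c is pixel x - 1 + c
            if (sum > threshold)
                dst[y * stride + (x >> 3)] |= static_cast<std::uint8_t>(0x80u >> (x & 7));
        }
    }
    return dst;
}
```

Extra memory here is three rows of width + 2 bytes regardless of image height, and each source row is unpacked exactly once.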
