Instance segmentation: Mask R-CNN

Paper overview

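Paper: https://arxiv.org/abs/1703.06870
Mask R-CNN builds on the two-stage Faster R-CNN object detector and adds a mask prediction branch, which turns the detector into an instance segmentation model.
The authors also ran experiments on other tasks, such as human pose estimation, and the results show that Mask R-CNN generalizes well.
Mask R-CNN runs at about 5 fps.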

introduction

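Earlier models in the Faster R-CNN family use RoIPooling to turn each ROI into a fixed-size feature map. RoIPooling rounds the ROI coordinates down to integers, and in instance segmentation this quantization would significantly hurt pixel-level localization accuracy. Mask R-CNN therefore uses RoIAlign, which keeps the fractional coordinates and pools with bilinear interpolation to produce a feature map of the same fixed size.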
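The difference between quantized pooling and RoIAlign is easier to see in code. Below is a minimal NumPy sketch of the RoIAlign idea on a single-channel feature map. The function names `bilinear_sample` and `roi_align`, the 2x2 output size, and the choice of averaging 2x2 sample points per bin are illustrative assumptions, not the reference implementation.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location."""
    h, w = feature.shape
    # Clamp the location to the valid range of pixel centers.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature, box, out_size=2, samples=2):
    """Pool a float box (y1, x1, y2, x2) into an out_size x out_size map.

    No coordinate is rounded: each bin averages samples x samples
    bilinearly interpolated values (an assumed sampling scheme).
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        for j in range(out_size):
            values = []
            for si in range(samples):
                for sj in range(samples):
                    yy = y1 + (i + (si + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sj + 0.5) / samples) * bin_w
                    values.append(bilinear_sample(feature, yy, xx))
            out[i, j] = np.mean(values)
    return out


# Toy feature map whose value is 6*y + x, so interpolation is easy to check.
feature = np.arange(36, dtype=np.float64).reshape(6, 6)
roi = (0.5, 0.5, 4.5, 4.5)  # fractional box in feature-map coordinates
aligned = roi_align(feature, roi)
print(aligned)
```

Because the toy feature map is linear in `y` and `x`, each output value equals the feature value at the center of its bin, which makes the absence of rounding visible. A quantized pooling would first snap the box corners to integers and lose that half-pixel offset.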
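In earlier prediction pipelines, multi-class problems are handled by applying softmax over the classes of the final feature map and backpropagating a softmax loss. With this approach the classes compete against each other at every pixel. Mask R-CNN instead applies an independent sigmoid to each class, which avoids the competition between classes; the experiments show that this improves accuracy.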
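The competition between classes can be illustrated with a tiny numeric example. The sketch below uses made-up logits for a single pixel over three hypothetical classes; it only shows how the two activations behave and is not code from the paper.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def sigmoid(logits):
    """Element-wise sigmoid: each class is scored independently."""
    return 1.0 / (1.0 + np.exp(-logits))


# Made-up logits of one pixel for three classes.
logits = np.array([2.0, 2.0, -1.0])
boosted = logits.copy()
boosted[1] += 3.0  # raise only the score of class 1

p_softmax, p_softmax_boosted = softmax(logits), softmax(boosted)
p_sigmoid, p_sigmoid_boosted = sigmoid(logits), sigmoid(boosted)

# With softmax, class 0 loses probability although its own logit is unchanged.
print("softmax class 0:", p_softmax[0], "->", p_softmax_boosted[0])
# With per-class sigmoid, class 0 is unaffected by the change to class 1.
print("sigmoid class 0:", p_sigmoid[0], "->", p_sigmoid_boosted[0])
```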

mask RCNN

The loss function is

$$L = L_{cls} + L_{box} + L_{mask}$$

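Here $L_{cls}$ and $L_{box}$ are defined as in Fast R-CNN, and $L_{mask}$ is the loss of the mask branch described above: an average binary cross-entropy over the pixels of the per-class sigmoid mask.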
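As a rough illustration of $L_{mask}$, the sketch below computes an average per-pixel binary cross-entropy on the mask channel of the ground-truth class only, so the other class channels do not contribute. The function name `mask_loss`, the `[K, m, m]` logit layout, and the random inputs are assumptions made for this example.

```python
import numpy as np


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy on the ground-truth class channel.

    mask_logits: [K, m, m] raw scores, one m x m mask per class (assumed layout).
    gt_mask:     [m, m] array of 0/1 values.
    gt_class:    index of the ground-truth class of this ROI.
    """
    x = mask_logits[gt_class]
    z = gt_mask.astype(np.float64)
    # Stable form of -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))].
    per_pixel = np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return per_pixel.mean()


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 28, 28))
gt_mask = (rng.random((28, 28)) > 0.5).astype(np.float64)

loss = mask_loss(logits, gt_mask, gt_class=1)

# Changing the logits of another class leaves the loss untouched.
perturbed = logits.copy()
perturbed[0] += 10.0
loss_perturbed = mask_loss(perturbed, gt_mask, gt_class=1)
print(loss, loss_perturbed)
```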

A schematic of RoIAlign is shown below

maskRCNN-roialign

The network architecture is shown below

maskRCNN-network architecture

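Both ResNet and FPN backbones were tested. The mask head also uses a transposed convolution to upsample the feature map, increasing its spatial size before the mask is predicted.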
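To make the upsampling step concrete, here is a minimal NumPy sketch of a single-channel transposed convolution. The helper `conv_transpose2d`, the 14x14 input, and the 2x2 kernel with stride 2 are assumptions chosen so that the output matches the 28x28 mask size mentioned later in this post; a real mask head learns the kernel weights and works on many channels.

```python
import numpy as np


def conv_transpose2d(x, kernel, stride):
    """Single-channel transposed convolution without padding.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output size is (in - 1) * stride + kernel_size per dimension.
    """
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out


roi_feature = np.ones((14, 14))   # assumed ROI feature size
kernel = np.ones((2, 2))          # stand-in for learned weights
upsampled = conv_transpose2d(roi_feature, kernel, stride=2)
print(upsampled.shape)
```

With a 2x2 kernel and stride 2 the scattered patches do not overlap, so each input value fills exactly one 2x2 block of the output.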

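During training:
- Only ROIs labeled positive (IoU with a ground-truth box above a given threshold) contribute to $L_{mask}$.
- Each location generates 5 (scales) x 3 (aspect ratios) = 15 anchors, from which the proposals are derived.
- Many negative samples are discarded so that the ratio of positive to negative samples is about 1:3.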
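The three training points above can be sketched in a few lines of NumPy. Everything specific here is an assumption for illustration: the helper names, the scale and ratio values, the IoU threshold of 0.5, and the way negatives are capped at three times the number of positives.

```python
import numpy as np


def anchor_shapes(scales, ratios):
    """Return (height, width) pairs, one per scale/ratio combination."""
    shapes = []
    for scale in scales:
        for ratio in ratios:  # ratio = width / height, area = scale**2
            shapes.append((scale / np.sqrt(ratio), scale * np.sqrt(ratio)))
    return np.array(shapes)


def box_iou(box, boxes):
    """IoU between one box and an [N, 4] array, all as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(y2 - y1, 0, None) * np.clip(x2 - x1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def sample_rois(ious, num_rois, pos_thresh=0.5, pos_fraction=0.25, seed=0):
    """Pick positive and negative ROI indices at roughly a 1:3 ratio."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(ious >= pos_thresh)
    neg = np.flatnonzero(ious < pos_thresh)
    n_pos = min(len(pos), int(num_rois * pos_fraction))
    n_neg = min(len(neg), 3 * n_pos)  # drop surplus negatives
    return rng.permutation(pos)[:n_pos], rng.permutation(neg)[:n_neg]


# 5 scales x 3 aspect ratios = 15 anchor shapes per location.
shapes = anchor_shapes(scales=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0))

gt_box = np.array([10.0, 10.0, 50.0, 50.0])
rois = np.array([[10, 10, 50, 50], [12, 12, 52, 52],
                 [30, 30, 90, 90], [100, 100, 140, 140]], dtype=np.float64)
ious = box_iou(gt_box, rois)
pos_idx, neg_idx = sample_rois(ious, num_rois=4)
# Only the ROIs in pos_idx would contribute to the mask loss.
print(shapes.shape, ious, pos_idx, neg_idx)
```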
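Without any extra tricks, Mask R-CNN clearly outperforms the other single-model instance segmentation methods of its time. The reported performance is shown below.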

maskRCNN-results

conclusion

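Mask R-CNN reaches state-of-the-art results on instance segmentation, object detection, and human pose estimation (keypoint detection), which also demonstrates its generality.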

Code walkthrough

Reference link

There are many Mask R-CNN implementations. Here is a popular one that involves relatively few source files, which makes it easy to study. Code: https://github.com/matterport/Mask_RCNN

Questions about the paper

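In the ground truth, the size of each bbox does not necessarily match the size of the mask the network outputs. During training, the ground-truth mask is therefore cropped to its bbox, resized with bilinear interpolation to the size of the CNN's mask output, and then rounded to 0 or 1. The minimize_mask function in utils.py implements this.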

import numpy as np
import skimage.transform


def minimize_mask(bbox, mask, mini_shape):
    """Resize masks to a smaller version to reduce memory load.
    Mini-masks can be resized back to image scale using expand_masks()

    See inspect_data.ipynb notebook for more details.
    """
    mini_mask = np.zeros(mini_shape + (mask.shape[-1],), dtype=bool)
    for i in range(mask.shape[-1]):
        # Pick slice and cast to bool in case load_mask() returned wrong dtype
        m = mask[:, :, i].astype(bool)
        y1, x1, y2, x2 = bbox[i][:4]
        m = m[y1:y2, x1:x2]
        if m.size == 0:
            raise Exception("Invalid bounding box with area of zero")
        # Resize with bilinear interpolation
        m = skimage.transform.resize(m, mini_shape, order=1, mode="constant")
        # np.bool was removed in recent NumPy releases; use the builtin bool
        mini_mask[:, :, i] = np.around(m).astype(bool)
    return mini_mask

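At inference time the model outputs a 28x28 or 56x56 mask (depending on the configuration) together with the bbox (its position and size). The mask is resized to the size of its bbox, thresholded, and placed at the bbox location, which gives the final mask for that bbox. The unmold_mask function in utils.py does this: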

def unmold_mask(mask, bbox, image_shape):
    """Converts a mask generated by the neural network to a format similar
    to its original shape.
    mask: [height, width] of type float. A small, typically 28x28 mask.
    bbox: [y1, x1, y2, x2]. The box to fit the mask in.

    Returns a binary mask with the same size as the original image.
    """
    threshold = 0.5
    y1, x1, y2, x2 = bbox
    mask = skimage.transform.resize(mask, (y2 - y1, x2 - x1), order=1, mode="constant")
    # np.bool was removed in recent NumPy releases; use the builtin bool
    mask = np.where(mask >= threshold, 1, 0).astype(bool)

    # Put the mask in the right location.
    full_mask = np.zeros(image_shape[:2], dtype=bool)
    full_mask[y1:y2, x1:x2] = mask
    return full_mask
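
The two functions above are roughly inverse to each other, and a round trip makes that visible. The sketch below is NumPy-only: `resize_bilinear` is a hypothetical stand-in for `skimage.transform.resize(..., order=1)` and will not match its output exactly, and the image size, bbox, and mask shape are made up for the example.

```python
import numpy as np


def resize_bilinear(img, out_shape):
    """Hypothetical NumPy stand-in for a bilinear resize of a 2-D array."""
    in_h, in_w = img.shape
    out_h, out_w = out_shape
    img = img.astype(np.float64)
    # Map output pixel centers back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = (1 - wx) * img[y0][:, x0] + wx * img[y0][:, x1]
    bottom = (1 - wx) * img[y1][:, x0] + wx * img[y1][:, x1]
    return (1 - wy) * top + wy * bottom


image_shape = (64, 64)
bbox = (16, 8, 48, 40)  # y1, x1, y2, x2
y1, x1, y2, x2 = bbox

# Ground-truth mask: the left half of the box is foreground.
gt_mask = np.zeros(image_shape, dtype=bool)
gt_mask[y1:y2, x1:x1 + 16] = True

# Training side: crop to the bbox, shrink to 28x28, round to 0/1.
mini_mask = np.around(resize_bilinear(gt_mask[y1:y2, x1:x2], (28, 28))).astype(bool)

# Inference side: grow back to the bbox size, threshold, paste into the image.
resized = resize_bilinear(mini_mask, (y2 - y1, x2 - x1))
full_mask = np.zeros(image_shape, dtype=bool)
full_mask[y1:y2, x1:x2] = resized >= 0.5

intersection = np.logical_and(full_mask, gt_mask).sum()
union = np.logical_or(full_mask, gt_mask).sum()
print("round-trip IoU:", intersection / union)
```

The round trip loses some detail along the mask boundary in general, which is the price of storing and predicting masks at a small fixed resolution.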