papers-note's Issues

Actor and Action Video Segmentation from a Sentence [2018-08-02]

https://arxiv.org/pdf/1803.07485.pdf

Textual Encoder

Word2Vec:

uses the model pretrained on GoogleNews

each word = a 300-dimensional vector

each sentence is padded to the same size (e.g. 15x300)

CNN:

  • details:

    temporal filter size = 2x2

    channels = 300 (same as the word2vec representation); a minimal encoding sketch follows the ablation numbers below

  • ablation study:

    51.8 for lstm

    52.1 for bi-lstm

    53.6 for cnn
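
A minimal sketch of the textual encoder described above, assuming the GoogleNews-pretrained word2vec vectors are loaded with gensim; the padding length (15 words), the 300 channels, and the width-2 temporal filter follow the note, while the file name, the pooling, and all variable names are illustrative.

```python
# Sketch of the textual encoder: word2vec embedding + padding + temporal CNN.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

MAX_LEN, EMB_DIM = 15, 300   # pad every sentence to 15 words of 300-d vectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def encode_sentence(sentence):
    """Map a sentence to a fixed-size (MAX_LEN x EMB_DIM) matrix, zero-padded."""
    mat = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
    for i, word in enumerate(sentence.lower().split()[:MAX_LEN]):
        if word in w2v:
            mat[i] = w2v[word]
    return torch.from_numpy(mat)

# Temporal convolution over the word dimension (filter spanning 2 adjacent words),
# keeping 300 channels as in the note; max-pooling collapses the words into one vector.
text_cnn = nn.Sequential(
    nn.Conv1d(in_channels=EMB_DIM, out_channels=EMB_DIM, kernel_size=2),
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),
)

sent = encode_sentence("a man is riding a bike")      # (15, 300)
feat = text_cnn(sent.t().unsqueeze(0)).squeeze(-1)    # (1, 300) sentence feature
```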

Video Encoder

I3D two-stream

  • details:

I3D last max-pooling layer --> average pooling over the temporal dimension --> L2 norm for each spatial position in the feature map (sketched after the ablation numbers below)

  • ablation study:

    49.5 for flow_only

    53.6 for RGB_only

    55.1 for two-stream
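
A minimal sketch of the feature post-processing in the details bullet above: the I3D feature map is averaged over the temporal dimension and the channel vector at each spatial position is L2-normalized. The tensor shape used here is illustrative.

```python
# Sketch: average the I3D feature map over time, then L2-normalize each spatial position.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 832, 8, 14, 14)    # (batch, channels, T, H, W); shape is illustrative
feat = feat.mean(dim=2)                   # average pooling over the temporal dimension -> (1, 832, 14, 14)
feat = F.normalize(feat, p=2, dim=1)      # L2-normalize the channel vector at every spatial position
```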

tanh() is better

Decoding with dynamic filters


bottom up top down?


Real-time Hand Gesture Detection and Classification Using Convolutional Neural Networks [2019-07-03]

https://arxiv.org/pdf/1901.10323.pdf

Task:

Real-time hand gesture recognition.

Framework:

First, a sliding window (t, t+8) takes 8 frames of the input video as a segment and feeds it to the network to obtain the Detector output (2-dimensional). If the gesture score is above 0.5, the segment contains a gesture, and a 32-frame segment starting at the same time (t, t+32) is fed to the network to obtain the Classifier output (84-dimensional: 83 gesture classes plus None). The maximum of the Classifier output, taken according to the rules below, gives the gesture class of the segment. Post-processing and Single-time Activation are proposed to handle specific failure cases and make the model more accurate.

Sliding window fashion

Video segments are taken with sliding windows:

  • Detector(scale: 8, stride: 1)

  • Classifier (scale: 32, stride: 1)

A Detector + A Classifier

  • Detector (ResNet-10): binary classification (gesture / no gesture); it also acts as the trigger for the Classifier (if a gesture is detected, the segment is passed to the classifier; otherwise the pipeline moves on to the next segment)

  • Classifier (ResNeXt-101): softmax over the predictions decides which gesture class is present

Detector details (Post-processing)

  • Problem: during real-time recognition, a large motion may take the hand out of the frame while the gesture is still in progress, and the Detector's confidence score then drops very low.

  • Solution: Post-processing keeps the confidence scores predicted for the previous segments (4 values in the paper) together with the current confidence score, forming a 5-element array; its median is the final prediction of whether a gesture is present (for the binary classifier, argmax(pre) = 1 means a gesture is present, otherwise not). A minimal sketch of this median filtering follows.
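
A minimal sketch of this median filtering, assuming per-segment gesture confidences arrive one by one; the queue size (4 previous scores) and the 0.5 threshold follow the note, everything else is illustrative.

```python
# Sketch of the detector post-processing: median-filter the raw gesture confidences
# over the last 4 segments plus the current one.
from collections import deque
from statistics import median

class DetectorPostprocessing:
    def __init__(self, history=4):
        self.scores = deque(maxlen=history + 1)   # previous 4 scores + current one

    def update(self, confidence):
        """Return True if a gesture is considered present for the current segment."""
        self.scores.append(confidence)
        return median(self.scores) > 0.5

# usage: feed the detector's per-segment gesture confidence, segment by segment
post = DetectorPostprocessing()
for conf in [0.9, 0.8, 0.2, 0.7, 0.85]:   # illustrative raw confidences
    active = post.update(conf)
```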

Classifier details (Single-time Activation)

  • Problem: a complete gesture consists of three phases (preparation, nucleus, retraction). The Detector already fires during the preparation phase and passes the segment to the Classifier, but many gestures look very similar during the preparation phase, which produces wrong predictions with high confidence scores.

  • Solution: apply different weights to predictions made at different phases.

Formulas:

First define a constant t:

t = μ / (4 · s)   (rounded down, which gives 9 for EgoGesture with s = 1)

where μ is the average ground-truth gesture duration (38 for the EgoGesture dataset) and s is the stride, which determines how far along the gesture the current position is. The paper takes s = 1; the 4 is not explained in the paper, but I think it divides each gesture duration into four parts, with the first quarter corresponding to the preparation phase. s has to match this choice: if s were 2, the 4 should become 2.

Next, the prediction at each step within the currently active gesture is weighted. The weight w_j increases monotonically with j and equals 0.5 at j = t, where j is the time index within a detected gesture (j = 0 when a gesture is first detected, incremented afterwards, and reset to 0 when the gesture ends) and t is the constant defined above (9).

Once the whole segment has been processed, the weights are multiplied with the predictions and averaged. The two largest values of this weighted average are taken; if their difference exceeds a threshold, the class with the higher score is output as the gesture class of the segment (early activation). If this condition is never met while iterating over j, the maximum value is taken at the end; if it is above 0.15, it is output as the final classification result. A simplified sketch of this logic follows.
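
A simplified sketch of the single-time activation logic described above. The weight function here is a generic sigmoid centered at t, because the note only states that the weight increases with j and equals 0.5 at j = t; the early-activation gap threshold is illustrative (the note does not give its value), while t ≈ 9 and the final 0.15 threshold follow the note.

```python
import math
import numpy as np

T = 9                 # constant t = mu / (4*s), about 9 for EgoGesture with s = 1
EARLY_GAP = 0.15      # illustrative gap threshold between the two best classes (value not in the note)
FINAL_THRESH = 0.15   # threshold on the best class at the end of a gesture (from the note)

def weight(j):
    """Monotonically increasing weight, 0.5 at j == T (the exact shape differs in the paper)."""
    return 1.0 / (1.0 + math.exp(-(j - T)))

def single_time_activation(class_probs_per_step):
    """class_probs_per_step: per-step softmax vectors collected while the detector is active."""
    steps = [np.asarray(p, dtype=float) for p in class_probs_per_step]
    weighted_sum = np.zeros_like(steps[0])
    for j, probs in enumerate(steps):
        weighted_sum += weight(j) * probs
        mean = weighted_sum / (j + 1)
        best, second = np.sort(mean)[-2:][::-1]
        if best - second > EARLY_GAP:        # early activation: the winner is clear enough
            return int(np.argmax(mean))
    mean = weighted_sum / len(steps)
    if mean.max() > FINAL_THRESH:            # otherwise take the best class at gesture end
        return int(np.argmax(mean))
    return None                              # no gesture class activated
```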


Experimental results

EgoGesture dataset

train: 1239 videos, 14416 gestures
val: 411 videos, 4768 gestures
test: 431 videos, 4977 gestures


Analysis of the results:

1) The depth modality performs better. Explanation: depth maps filter out background information, so the network can focus more on the hand.

2) The 8-frame detector works best. Explanation: the Detector design is critical in this model; to avoid missing gestures, its window should be as small as possible.

Reproduction details

To be updated

TALL: Temporal activity localization via language query (2018-03-28)

Abstract

This paper proposes a Cross-modal Temporal Regression Localizer (CTRL) that jointly models the text query and video clips, outputting alignment scores and action-boundary regression results for candidate clips. For evaluation, the paper builds Charades-STA on top of the Charades dataset, and also includes more complex sentence queries in Charades-STA for testing.

Model


Visual Encoder

For one video clip, we consider the clip itself (as the central clip) and its surrounding clips (as context clips). We uniformly sample n frames from each clip and use a feature extractor on the central clip; for the context clips, we use a pooling layer to compute a pre-context feature and a post-context feature.

Sentence Encoder

LSTM.

Off-the-shelf Skip-thought

Multi-modal processing module

The input dimension of the FC layer is 2*d and the output dimension is d.

Temporal Localization Regression Networks

The temporal localization regression network takes the multi-modal representation as input and has two sibling output layers: 1) an alignment score between the sentence and the video clip, and 2) clip location regression offsets, either parameterized or non-parameterized (the latter performs better). A minimal sketch of the fused feature and these two heads follows.
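
A minimal sketch of the multi-modal processing module and the two sibling output layers described above, under the assumption that fusion is a simple concatenation followed by the 2*d -> d FC layer; layer sizes, the activation, and all names are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CTRLHead(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)       # FC layer: input 2*d, output d (as in the note)
        self.alignment = nn.Linear(d, 1)      # sibling head 1: alignment score
        self.regression = nn.Linear(d, 2)     # sibling head 2: start/end location offsets

    def forward(self, visual_feat, sentence_feat):
        fused = torch.relu(self.fuse(torch.cat([visual_feat, sentence_feat], dim=-1)))
        return self.alignment(fused), self.regression(fused)

# usage with an illustrative batch of 4 clip/sentence pairs
head = CTRLHead(d=1024)
score, offsets = head(torch.randn(4, 1024), torch.randn(4, 1024))
```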

Training

Loss function

We design a multi-task loss L that combines the alignment loss and the location regression loss.


Sampling training examples


We use multi-scale temporal sliding windows with 80% overlap (at test time we only use coarsely sampled clips).
Alignment requirements: 1) IoU 2) nIoL 3) one-to-one

Charades-STA

dataset

  1. split each long sentence into sub-sentences using a set of conjunctions

  2. keyword mapping

  3. human check

Text-to-clip Video Retrieval with Early Fusion and Re-Captioning[2018-08-01]

https://arxiv.org/pdf/1804.05113.pdf

Segment Proposals

Input: video V --> encode all frames in V using C3D --> predict relative proposals R (center, length) --> extract C3D features for each R

loss function:


Example 1: R-C3D

[R-C3D: Region Convolutional 3D Network for Temporal Activity Detection.
https://arxiv.org/pdf/1703.07814.pdf]


Example 2:

[Jointly Localizing and Describing Events for Dense Video Captioning.
https://arxiv.org/pdf/1804.08274.pdf]


Early Fusion

word-by-word fusion: the clip feature is fused with each word and fed through an LSTM, which returns a similarity score (a hedged sketch follows)
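
A hedged sketch of one possible word-by-word early fusion: the clip feature is concatenated with every word embedding, the sequence is run through an LSTM, and the final hidden state is mapped to a similarity score. The dimensions and the concatenation-style fusion are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class EarlyFusionSimilarity(nn.Module):
    def __init__(self, word_dim=300, clip_dim=500, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(word_dim + clip_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, word_embs, clip_feat):
        # word_embs: (B, T, word_dim), clip_feat: (B, clip_dim)
        clip_rep = clip_feat.unsqueeze(1).expand(-1, word_embs.size(1), -1)
        fused = torch.cat([word_embs, clip_rep], dim=-1)   # fuse the clip feature with each word
        _, (h, _) = self.lstm(fused)
        return self.score(h[-1]).squeeze(-1)               # one similarity score per pair

sim = EarlyFusionSimilarity()(torch.randn(2, 12, 300), torch.randn(2, 500))
```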


Caption loss


Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks [2019-07-05]

Paper: https://arxiv.org/pdf/1905.09646.pdf
Notes from others: https://zhuanlan.zhihu.com/p/66928045
Code:

Task

The paper proposes a lightweight, group-based spatial attention module that assigns a weight to every spatial position within each group.
My understanding: to improve the representational power of the grouped structure, an attention mechanism is added to each group; since its main operations are pooling and a dot product, and it adds only two scale/shift parameters per group, the module is lightweight.

Attention module


Step 1: split the feature map into groups and then by spatial position.

Ignoring the batch dimension, suppose the current feature map is (H, W, C) and is divided into G groups; each group then has shape (H, W, C/G). The feature map of each group can be unfolded along the spatial dimensions as:

X = {x_1, ..., x_m}, x_i in R^(C/G), m = H x W

i.e. it is split into HxW pieces, each piece being the feature of one spatial position with dimension 1x1xC/G.

Step 2: compute a spatial correlation within each group.

The group-level global feature g is the average over all positions:

g = (1/m) Σ_i x_i

g has dimension 1x1xC/G.

The correlation between the global feature and each position is their dot product:

c_i = g · x_i

which can equivalently be computed as:

c_i = ||g|| ||x_i|| cos(θ_i)

Then normalize over the m positions by subtracting the mean and dividing by the standard deviation:

ĉ_i = (c_i − μ_c) / (σ_c + ε)

Step 3: re-weight the original x.

  • A scale and a shift parameter are applied to each group's spatial correlation, giving the attention mask:

    a_i = γ · ĉ_i + β

This could also be written in matrix form, which might be more intuitive. These are the only parameters of the whole attention module, so the total parameter count is 2x the number of groups.

  • The attention mask is passed through a sigmoid to obtain the attention score, which is multiplied onto the original X, assigning a different importance to each spatial position:

    x̂_i = x_i · σ(a_i)

The H x W re-weighted vectors x̂_i together form the spatially weighted group feature of dimension HxWxC/G. A hedged PyTorch sketch of the whole module follows.
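
A PyTorch sketch of the SGE module assembled from the three steps above; this is my reading of the paper, not the authors' released code. The (0, 1) initialization of the two per-group parameters follows the ablation note below, and the group count and input shape in the usage line are illustrative.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    def __init__(self, groups=64, eps=1e-5):
        super().__init__()
        self.groups, self.eps = groups, eps
        self.gamma = nn.Parameter(torch.zeros(1, groups, 1, 1))  # per-group scale, init 0
        self.beta = nn.Parameter(torch.ones(1, groups, 1, 1))    # per-group shift, init 1

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.size()
        x = x.view(b * self.groups, c // self.groups, h, w)
        g = x.mean(dim=(2, 3), keepdim=True)    # Step 2: group-wise global average pooling
        ci = (x * g).sum(dim=1, keepdim=True)   # dot product g . x_i at every position
        ci = ci.view(b * self.groups, -1)
        ci = (ci - ci.mean(dim=1, keepdim=True)) / (ci.std(dim=1, keepdim=True) + self.eps)
        ci = ci.view(b, self.groups, h, w)
        a = ci * self.gamma + self.beta         # Step 3: per-group scale and shift
        a = a.view(b * self.groups, 1, h, w)
        x = x * torch.sigmoid(a)                # re-weight every spatial position
        return x.view(b, c, h, w)

# usage: insert after a conv block; here applied to a random (2, 256, 14, 14) feature map
out = SpatialGroupEnhance(groups=64)(torch.randn(2, 256, 14, 14))
```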

Experimental settings and results

Ablation

  • G = 64 gives the best result (Figure 5);

  • The normalization is important and cannot be removed (about one point worse without it, see Table 3);

  • Comparison of initial values for the two scale/shift parameters (the (0, 1) initialization works best, Table 2)


Experiments on Object Detection


[S-CNN] Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs (2018-03-29)

Abstract

This paper exploits the effectiveness of deep networks for temporal action localization via three segment-based 3D ConvNets:

(1) a proposal network -- identifies candidate segments in a long video that may contain actions

(2) a classification network (very important for training) -- serves as initialization for the localization network

(3) a localization network -- fine-tunes the learned classification network to localize each action instance

Detailed descriptions of Segment-CNN


Multi-scale segment generation


Each frame is resized to 171 X 128 pixels

For an untrimmed video X, the paper slides temporal windows of varied lengths (16, 32, 64, 128, 256, 512 frames) with 75% overlap, and uniformly samples 16 frames from each window. A small sketch of this window generation follows.
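
A small sketch of the multi-scale window generation just described; the window lengths, the 75% overlap, and the 16 sampled frames follow the note, while the helper name and edge handling are illustrative, not the authors' implementation.

```python
import numpy as np

def generate_segments(num_frames,
                      lengths=(16, 32, 64, 128, 256, 512),
                      overlap=0.75,
                      samples_per_segment=16):
    """Return a list of (start, end, sampled_frame_indices) tuples."""
    segments = []
    for length in lengths:
        if length > num_frames:
            continue                                    # skip scales longer than the video
        stride = max(1, int(length * (1.0 - overlap)))  # 75% overlap -> stride of length / 4
        for start in range(0, num_frames - length + 1, stride):
            end = start + length
            idx = np.linspace(start, end - 1, samples_per_segment).astype(int)
            segments.append((start, end, idx))
    return segments

segs = generate_segments(300)   # e.g. a 300-frame video yields windows at every scale that fits
```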

Network architecture

Their deep networks use C3D as the basic architecture in all stages.

conv1a(64) - pool1(1,1) -

-conv2a(128) - pool2(2,2) -

-conv3a(256) - conv3b(256) - pool3(2,2) -

-conv4a(512) - conv4b(512) - pool4(2,2) -

-conv5a(512) - conv5b(512) - pool5(2,2) -

-fc6(4096) - fc7(4096) - fc8(K+1)

Each input to this deep network is a segment s of dimension 171 X 128 X 16.

Training procedure

Impact of individual networks

Compare S-CNN / S-CNN(w/o proposal) / S-CNN(w/o classification) / S-CNN(w/o localization)

1) The proposal network


label k:{0,1}

For each segment from a trimmed video, set its label as positive.

For candidate segments from an untrimmed video, assign labels by overlap with the ground-truth instances (IoU > 0.7, or the largest-overlap candidate with IoU > 0.5, is positive; IoU < 0.3 is negative); a hedged sketch of this rule follows below.


This reduces the number of operations conducted on background segments.
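
A hedged sketch of the labeling rule above, applied to a single candidate segment. The "largest overlap and IoU > 0.5" clause requires looking at the whole candidate set and is omitted here, and the treatment of segments with intermediate overlap is my assumption.

```python
def temporal_iou(seg, gt):
    """IoU of two temporal intervals given as (start, end)."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def label_segment(seg, gt_instances):
    """Return 1 (positive), 0 (background) or None (not used) for one candidate segment."""
    ious = [temporal_iou(seg, gt) for gt in gt_instances]
    best = max(ious, default=0.0)
    if best > 0.7:
        return 1        # high overlap with some ground-truth instance -> positive
    if best < 0.3:
        return 0        # little overlap with every instance -> background
    return None         # ambiguous overlap: not used for training (assumption)
```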

2) The classification network


labels: background plus action classes k: {1, ..., K}

In order to balance the number of training examples for each class, the paper reduces the number of background instances.


better performance

3) The localization network


This localization network is trained with a new loss function, which takes the IoU with the ground-truth instance into consideration.

The new loss function is formed by combining L(softmax) and L(overlap):



better performance

[SSN & TAG] Temporal Action Detection with Structured Segment Networks [2018-08-06]

https://arxiv.org/pdf/1704.06228.pdf

SSN Structured Segment Network


  • Three-Stage Structure

    starting -- course -- ending

  • Structured Temporal Pyramid Pooling

    course : two-level ( split number )

    starting & ending : one-level

  • Two classifiers

    activity

    completeness

  • Location Regression and Multi-Task Loss

    refine the boundary

TAG Temporal Actionness Grouping


  • Actionness probabilities -- binary classifier

  • Complemented actionness -- watershed algorithm (a simplified grouping sketch follows)
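
A simplified sketch in the spirit of this grouping step: the actionness signal is flooded at several thresholds and each run of above-threshold units becomes a candidate proposal. The real TAG watershed procedure also merges regions and removes duplicates; the thresholds here are illustrative.

```python
def group_actionness(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """scores: per-unit actionness probabilities; returns sorted (start, end) unit spans."""
    proposals = set()
    for tau in thresholds:
        start = None
        for i, s in enumerate(scores):
            if s >= tau and start is None:
                start = i                       # a new above-threshold region begins
            elif s < tau and start is not None:
                proposals.add((start, i))       # close the region at the first low unit
                start = None
        if start is not None:
            proposals.add((start, len(scores)))
    return sorted(proposals)

props = group_actionness([0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.95, 0.4])
```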

Code :

CTAP: Complementary Temporal Action Proposal Generation [2018-08-07]

Two types of proposal generation (three baselines)


  • Sliding window

    advantage: covers all segments

    drawback: imprecise

  • Action score based

    typical -- TAG

    advantage: precise

    drawback: 1) generates wrong proposals; 2) omits some correct proposals

    how to solve: 1) an offset loss to adjust boundaries; 2) this paper

Complementary Temporal Action Proposal Generator


Initial Proposal Generation

  • Video pre-processing

    two-stream

    a long video --> a sequence of unit-level features

  • Actionness score

    binary classifier --> an actionness score for each unit
