中国科技论文在线
Hierarchical Saliency-based Representation for Human Interaction Recognition
GUO Wei, HU Tao, LIU Ruqian
(Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072)
Abstract: Recognizing human interactions is one of the most important problems in computer vision and impacts a wide range of applications. This paper presents a new method for recognizing two-person interactions using a hierarchical saliency-based representation. Hierarchical saliency is defined as the Salient Action at the highest level, the Salient Point at the middle level, and the Salient Joint at the lowest level of an interaction, each determined by the greatest spatial-temporal positional change at that level. Given the saliency of interactions at different levels, several types of features are extracted according to the discriminative characteristics of behaviors, such as spatial displacement and direction relations. Since there are few publicly accessible test datasets, we created a new dataset, named K3HI, with eight types of interactions, using a new depth sensor, the Microsoft Kinect. The method was tested using a multi-class SVM classifier; our experimental results demonstrate that the average recognition accuracy of the hierarchical saliency-based representation is %, outperforming methods using other features.
Key words: pattern recognition; human interaction recognition; Kinect; hierarchical saliency
0 Introduction
Human behavior analysis is an active research area in computer vision and is widely applied in domains such as security surveillance, home care for elderly people and human-computer interaction[1]. Actions involving more than one person make the recognition task more complicated, since the individual tracking of multiple interacting body parts needs to be maintained along the entire image sequence[2]. An effective way to approach the two-person interaction recognition problem is to decompose an interaction into multiple interacting processes, each corresponding to one person. Many recognition models have been developed to decompose and recognize two-person interactions, such as Coupled Hidden Markov Models (CHMMs)[3] and scale decomposition[4]. All of this research, however, was based on the traditional video camera, which only provides color images of a scene. It is difficult to recognize human motion in badly lit and dark scenes, so image contrast enhancement is required to improve the perceptibility of objects in the scene by enhancing the brightness difference between objects and their backgrounds[5]. This is a time-consuming process. Recently, the rapid development of depth sensors (e.g. Microsoft Kinect) has provided adequate accuracy for real-time full-body tracking at a low cost[6]. Compared with the traditional video camera, a depth sensor has the advantage of acquiring color and depth images synchronously. With depth imagery, three-dimensional data collected under diverse conditions is easily computed. Hence, the efficiency of human activity recognition can be improved.
In this paper, using a depth sensor and a new hierarchical saliency representation method, we can accurately recognize human interactions performed by two people. We created a new dataset of two-person interactions using the Microsoft Kinect. Based on this dataset, we present a novel methodology to estimate the salient features of two-person interactions hierarchically. The method effectively recognizes the eight types of two-person interactions in the dataset. An interaction at the highest level is composed of two atomic actions. We extract the atomic action with relatively greater movement to differentiate between these two actions; the action with greater movement is the Salient Action. At the middle level, in an action time sequence, there is a point at which the action shows the greatest displacement relative to the first frame of the sequence. This point discriminates between types of action; therefore, we define it as the Salient Point. At the lowest level, the movement of an action is computed from the changes in joint positions. For any action, there is always a joint with the greatest displacement; we refer to this as the Salient Joint. Depending on the saliency at different levels, we selected several types of spatial relationships, such as spatial displacement and direction relations, as key input features for interaction recognition. These features characterize human interactions at different levels for a better understanding of actions, and thereby improve recognition efficiency.
1 Hierarchical Saliency Definition
This section gives a detailed description of the hierarchical saliency concept that we use to represent interactions: the Salient Action at the highest level, the Salient Point at the middle level and the Salient Joint at the lowest level. These saliencies characterize human interactions at different levels to improve the understanding of actions.
To make the hierarchical saliency definitions easier to follow, this paper divides interactions into two groups. The first group is reciprocal: one person starts to act and the other person reacts correspondingly. The second group is synchronous: both people act simultaneously and in a similar fashion. The first group, which we call Positive Interactions, includes pointing, kicking, punching, and pushing; the second group, which we call Negative Interactions, includes approaching, departing, exchanging an object, and shaking hands. Based on this grouping, the hierarchical saliencies are defined as follows.
Definition 1 (Salient Action): The Salient Action of an interaction is identified in one of two ways. In a Positive Interaction, the action that starts first, resulting in the other person's reaction, is the Salient Action. In a Negative Interaction, since both people's behavior is similar and synchronized, we simply define the action with the greater position changes in the first few frames as the Salient Action.
Figure 1 shows the structure of the multi-level saliencies. We categorize every type of interaction as either a Positive Interaction or a Negative Interaction. Each type of interaction is composed of two atomic actions, performed separately by two persons. Following Definition 1, the atomic action that shows greater movement during the first few frames is the Salient Action.
Definition 2 (Salient Point): The Salient Point of an action is the single time instance at which the presence of the action is clear and can be uniquely identified for all instances of the action.
We propose the notion of a “Salient Point” for the precise temporal anchoring of human actions. A Salient Point is a precise temporal anchor point relative to the performance of the action, taking account of both the temporal structure and the discriminative subparts.
Definition 3 (Salient Joint): The Salient Joint of a person is the joint that has the greatest positional change during the action sequence.
Since our collected data are saved as 3D coordinates for each joint of a person, and the joints express the movement of people, we believe that a Salient Joint exists and represents the characteristics of an action. The arms or legs always have the greatest positional changes during an action sequence, so the candidate Salient Joint comes from the four limbs. Therefore, the salient limb is computed first, and the corresponding Salient Joint can then be extracted easily. The movement and trajectory of a joint that is far from the main body is more easily differentiated among different types of interactions, and this joint is defined as the Salient Joint.
Figure 1 The structure of multi-level salient features
2 Hierarchical Saliencies Identification
Salient Action
Before identifying the Salient Action, a key issue is to align the interaction sequences to compensate for variances in time or frame length. The Salient Action can then be extracted according to Definition 1. The extraction process is divided into the following three procedures.
Aligning the Sequence
Even for the same kind of interaction, there are always variances in time or frame length when capturing data. Before discerning a Salient Action, we first selected interactions of the same class and then used the DTW (Dynamic Time Warping) model to align the sequences of that class, as described in [7].
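The alignment step can be illustrated with a textbook dynamic time warping routine. This is a minimal sketch, not the implementation used in [7]: the function name `dtw_align`, the use of Euclidean distance between per-frame feature vectors, and the three-way step pattern are all assumptions made for illustration.

```python
import math


def dtw_align(seq_a, seq_b):
    """Align two sequences of per-frame feature vectors with classic DTW.

    Returns (total_cost, path), where path is a list of (index_a, index_b)
    pairs that match frames of seq_a to frames of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

With the warping path in hand, two recordings of the same interaction class with different frame counts can be resampled onto a common time axis before the displacement comparisons that follow.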
Computing key joint position changes.
We selected eight joints as key joints, which represent changes in the body's motion: the left and right elbow, left and right hand, left and right knee, and left and right foot. The positional changes of the joints are described by the distances between neighboring frames, defined as:
$$D^{j}_{(i,i+1)} = \left\| P^{(j;x,y,z)}_{i+1} - P^{(j;x,y,z)}_{i} \right\| \qquad (1)$$
where $D^{j}_{(i,i+1)}$ is the Euclidean distance of a key joint $j$ between frame $i$ and frame $i+1$; $P^{(j;x,y,z)}_{i}$ indicates the position of joint $j$ at frame $i$, and $(x, y, z)$ are its 3D coordinates.
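Equation (1) amounts to a frame-to-frame Euclidean distance for each key joint. The short sketch below shows one way to compute it; the function name `joint_displacements` and the representation of a joint track as a list of `(x, y, z)` tuples are assumptions for illustration, not part of the dataset format.

```python
import math


def joint_displacements(track):
    """Distances of Equation (1) for one joint.

    track: list of (x, y, z) positions, one entry per frame.
    Returns a list whose i-th entry is the distance between frame i and i + 1.
    """
    return [math.dist(track[i + 1], track[i]) for i in range(len(track) - 1)]


# Example: a joint that moves 5 units, then 12 units.
example_track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
example_distances = joint_displacements(example_track)
```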
Identifying Salient Action.
For Positive Interactions, it is harder to extract the Salient Action than for Negative Interactions. According to our Salient Action definition, the joint positions already change over the first few adjacent frames, so we can compare the maximum positional changes of both persons' key joints between the initial frame $i$ and frame $i+3$ of a sequence. This is expressed as:
$$\mathrm{SalientAction} = \arg\max\left( \max\left(D^{(p1;j)}_{(i,i+3)}\right),\ \max\left(D^{(p2;j)}_{(i,i+3)}\right) \right) \qquad (2)$$
where $\max(D^{(p1;j)}_{(i,i+3)})$ and $\max(D^{(p2;j)}_{(i,i+3)})$ indicate the maximum position changes of the joints of person one and person two in an interaction; $\max(D_1, D_2)$ indicates that if $D_1 > D_2$, $D_1$ represents the Salient Action. Figure 3 shows the processing results for Salient Actions, ignoring the Non-salient Actions. Each action has its own distinct characteristics.
For Negative Interactions, following the definition, we also use Equation (2): the person with the maximum $D^{(p;j)}_{(i,i+3)}$ performs the Salient Action.
Skeleton visualization of Salient Actions. Compared with figure 7, the red skeletons show Salient Actions; the people represented with blue skeletons perform Non-Salient Actions.
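The comparison in Equation (2) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that $D_{(i,i+3)}$ is the straight-line distance between a joint's positions at frame `start` and frame `start + 3`, that each person is a dictionary mapping hypothetical key-joint names to per-frame `(x, y, z)` tracks, and that ties are resolved in favor of person two.

```python
import math


def max_joint_change(person, start=0, span=3):
    """Largest displacement of any key joint between frame start and start + span."""
    return max(math.dist(track[start + span], track[start]) for track in person.values())


def salient_person(person1, person2, start=0, span=3):
    """Return 1 or 2: the person whose key joints move most in the first frames."""
    d1 = max_joint_change(person1, start, span)
    d2 = max_joint_change(person2, start, span)
    return 1 if d1 > d2 else 2


# Hypothetical example: person one's right hand moves 0.3, person two's 0.05.
p1 = {
    "right_hand": [(0.0, 0.0, 0.0)] * 3 + [(0.3, 0.0, 0.0)],
    "left_hand": [(0.0, 0.0, 0.0)] * 4,
}
p2 = {
    "right_hand": [(1.0, 0.0, 0.0)] * 3 + [(1.05, 0.0, 0.0)],
    "left_hand": [(1.0, 0.0, 0.0)] * 4,
}
```

Here `salient_person(p1, p2)` selects person one, whose action would then be treated as the Salient Action.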
Salient Joint
The joints with the greatest movement in each Salient Action always belong to the arms or legs, so we identify the Salient Joint according to the displacement of joints. Since a leg or an arm in our dataset has two joints and they are closely related, we compute the movement of each joint between adjacent frames and sum the values for each leg and arm separately:
$$\mathrm{SalientJoint} = \arg\max\left( \sum_{t=1}^{N-1} d^{LA}_{t},\ \sum_{t=1}^{N-1} d^{LL}_{t},\ \sum_{t=1}^{N-1} d^{RA}_{t},\ \sum_{t=1}^{N-1} d^{RL}_{t} \right) \qquad (3)$$
where LA, LL, RA and RL indicate the left arm, left leg, right arm and right leg, and $d^{j}_{t}$ represents
$$d^{j}_{t} = \sqrt{(x_{t+1,j} - x_{t,j})^2 + (y_{t+1,j} - y_{t,j})^2 + (z_{t+1,j} - z_{t,j})^2} \qquad (4)$$
which is the Euclidean distance at time $t$. If $j$ equals LA, then $d^{j}_{t}$ is the sum of the Euclidean distances for two joints: the left elbow and the left hand. After the displacements are computed for each limb, the largest value indicates the Salient Limb, and the joint of that limb that is farther from the body is the Salient Joint. Take the interaction “kicking” as an example. Figure 6 shows the changes of fifteen joints from the 1st frame to the 50th frame. It can be seen that the Right Foot and Right Knee have the largest values, so the Salient Joint is the Right Foot.
Position changes for the interaction “kicking”.
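Equations (3) and (4) can be sketched as a per-limb sum of frame-to-frame displacements. The joint names in `LIMBS`, the dictionary layout of `person`, and the function names are hypothetical and chosen only for illustration; the farther joint of each limb (hand or foot) is listed second.

```python
import math

# Each limb is (nearer joint, farther joint); names are illustrative assumptions.
LIMBS = {
    "LA": ("left_elbow", "left_hand"),
    "LL": ("left_knee", "left_foot"),
    "RA": ("right_elbow", "right_hand"),
    "RL": ("right_knee", "right_foot"),
}


def path_length(track):
    """Sum over t of the Equation (4) distance between frame t and t + 1."""
    return sum(math.dist(track[t + 1], track[t]) for t in range(len(track) - 1))


def salient_joint(person):
    """Return (salient limb, salient joint) for one person's joint tracks."""
    totals = {
        limb: sum(path_length(person[name]) for name in joints)
        for limb, joints in LIMBS.items()
    }
    limb = max(totals, key=totals.get)  # arg max of Equation (3)
    return limb, LIMBS[limb][1]         # farther joint of the salient limb
```

For a kicking sequence in which the right knee and right foot travel farthest, this returns `("RL", "right_foot")`, matching the example in the text.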
Salient Point
After we have identified the Salient Action at the highest level and the Salient Joint at the lowest level, the Salient Point at the middle level is determined from both of the other saliencies. For each Salient Action in the whole sequence, if the position of the Salient Joint at the $n$-th frame has the largest angle change relative to the central axis (formed by two joints: Torso Center and Hip), then the time of the $n$-th frame is the Salient Point. The angle is defined as
$$\theta = \arctan\frac{k_2 - k_1}{1 + k_2 k_1} \qquad (5)$$
where $k_2$ and $k_1$ are the gradients of two lines. For an interaction whose Salient Joint is 'hand', such as 'punching', the first line is formed by the joints 'Hand' and 'Shoulder', and the other line is formed by 'Neck' and 'Torso Center'. Similarly, for the interaction 'kicking', whose Salient Limb is 'leg', the angle is estimated from the line formed by 'Foot' and 'Left Hip' and the line formed by 'Neck' and 'Torso Center'. Each joint is represented as a 3D point. Figure 4 shows how the angle is computed.
The Salient Point identification method. The left one is for 'punching' and the right is for 'kicking'.
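The angle of Equation (5) and the choice of the Salient Point can be sketched as below. Several assumptions are made here and are not stated in the paper: only the `(x, y)` components of each joint are used to form the two lines, the gradients are replaced by the algebraically equivalent cross/dot form of the direction vectors (which avoids an infinite gradient for the nearly vertical body axis), and the Salient Point is taken as the frame with the largest absolute angle.

```python
import math


def line_angle(a1, a2, b1, b2):
    """Angle of Equation (5) between line a1-a2 and line b1-b2, using (x, y) only.

    With k1 = uy/ux and k2 = vy/vx, (k2 - k1) / (1 + k1*k2) equals
    cross / dot of the direction vectors u and v. Assumes distinct endpoints.
    """
    ux, uy = a2[0] - a1[0], a2[1] - a1[1]
    vx, vy = b2[0] - b1[0], b2[1] - b1[1]
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    if dot == 0:
        return math.pi / 2  # perpendicular lines: 1 + k1*k2 = 0
    return math.atan(cross / dot)


def salient_point(frames):
    """Index of the frame with the largest |angle| between limb line and body axis.

    frames: list of (shoulder, hand, neck, torso) joint positions, one per frame.
    """
    angles = [
        abs(line_angle(neck, torso, shoulder, hand))
        for shoulder, hand, neck, torso in frames
    ]
    return max(range(len(angles)), key=angles.__getitem__)
```

For a punch, the arm hangs roughly parallel to the body axis at the start, swings out to its widest angle, and then returns; the frame of widest angle is the one returned by `salient_point`.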
3 Recognition Method
Feature Extraction
Based on the hierarchical saliency, five types of feature input combinations for classification are extracted from the collected data. The first type determines whether an interaction is a Positive or a Negative Interaction. The second type gives the Salient Joint, 'hand' or 'foot'. These two types of features can be extracted following the method described in Section 2. The third feature combination consists of the horizontal distance $F_{hd}$ and the vertical distance $F_{vd}$ from the Salient Joint at the Salient Point to the target plane. It is used to distinguish among interactions including pointing, punching, and pushing. In detail, the horizontal distance is defined as:
$$F_{hd}(i, j, k, l) = \mathrm{dist}\left(p^{x}_{i},\ \langle p^{y}_{j}, p^{y}_{k}, p^{y}_{l} \rangle\right) \qquad (6)$$
where $\langle p^{y}_{j}, p^{y}_{k}, p^{y}_{l} \rangle$ indicates the plane spanned by $p^{y}_{j}$, $p^{y}_{k}$ and $p^{y}_{l}$, and $\mathrm{dist}(p^{x}_{i}, \langle\cdot\rangle)$ is the closest distance from the point $p^{x}_{i}$ to that plane; $i$ stands for the Salient Joint at the Salient Point during the sequence of the Salient Action of person $x$; and $j$, $k$ and $l$ specify the joints 'neck', 'left shoulder' and 'right shoulder' of the other person $y$. Similarly, the vertical distance is represented as:
$$F_{vd}(i, j, k, l) = \mathrm{dist}\left(p^{x}_{i},\ \langle p^{y}_{j}, p^{y}_{k}, p^{y}_{l} \rangle_{n}\right) \qquad (7)$$
where $\langle p^{y}_{j}, p^{y}_{k}, p^{y}_{l} \rangle_{n}$ indicates the plane with normal vector $p^{y}_{j} - p^{y}_{k}$ passing through $p^{y}_{l}$; $j$ and $k$ are specified as the joints 'neck' and 'torso center'; and $l$ is the central point between 'left shoulder' and 'right shoulder'.
(a) Horizontal distance (b) Vertical distance
Horizontal and vertical distance feature
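Both distances in Equations (6) and (7) are point-to-plane distances; they differ only in how the plane is specified. The sketch below uses standard vector geometry with hypothetical helper names (`f_hd`, `f_vd`, `plane_distance`). It assumes the three points that span the plane in Equation (6) are not collinear, since otherwise the normal vector has zero length.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plane_distance(p, normal, q):
    """Distance from point p to the plane through q with the given normal."""
    return abs(dot(sub(p, q), normal)) / math.sqrt(dot(normal, normal))


def f_hd(p, pj, pk, pl):
    """Equation (6): distance from p to the plane spanned by pj, pk, pl."""
    normal = cross(sub(pk, pj), sub(pl, pj))
    return plane_distance(p, normal, pj)


def f_vd(p, pj, pk, pl):
    """Equation (7): distance from p to the plane with normal pj - pk through pl."""
    return plane_distance(p, sub(pj, pk), pl)
```

In use, `p` would be the Salient Joint of the acting person at the Salient Point, and `pj`, `pk`, `pl` the joints of the other person named in the text.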
The fourth type of feature, $F_{dis}$, distinguishes the interactions approaching and departing from the other Negative Interactions. It is represented by the displacements of the fifteen joints from the first frame to the final frame, and is defined as:
$$F_{dis}(j; t) = \left\| p^{x}_{j,t_1} - p^{y}_{j,t_N} \right\| \qquad (8)$$
where $j$ is any joint; $t_1$ and $t_N$ indicate the first frame and the last frame; and the feature is measured between the two persons $x$ and $y$.
(a) Fitting line for 'exchanging' (b) Fitting line for 'shaking'
The fitting lines for interactions. The gradient of line in (a) is and in (b), it is .
The final type of feature, $F_{grad}$, captures the spatial direction of the Salient Joint at the time of the Salient Point. In principle, the direction is given by the gradient of a line through two points. However, to limit the effect of outliers in the collected joint positions, we use more points to fit the line: besides the position of the Salient Joint at the Salient Point, eight other points adjacent to the Salient Point are included in the calculation. The feature $F_{grad}$ is used to discriminate between the interactions 'exchanging an object' and 'shaking hands'. Figure 6 shows the results of line fitting for both interactions. These results verify that the movement of 'shaking hands' tends to be up and down, while 'exchanging' is a movement in the horizontal direction.
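The line fitting behind $F_{grad}$ can be illustrated with an ordinary least-squares fit. The sketch rests on assumptions that the text does not spell out: the nine points are taken as the Salient Joint positions in the four frames before and after the Salient Point, only the `(x, y)` components are used (with `y` as the vertical axis), and the fit is a regression of `y` on `x`. The function names are hypothetical.

```python
def fit_gradient(points):
    """Least-squares gradient k of the line y = k*x + c through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        return float("inf")  # all points share one x value: a vertical line
    return sxy / sxx


def f_grad(track, salient_point, half_window=4):
    """Gradient fitted to the Salient Joint positions around the Salient Point.

    track: list of (x, y, z) positions of the Salient Joint, one per frame.
    The window is clipped at the start and end of the sequence.
    """
    start = max(0, salient_point - half_window)
    window = track[start:salient_point + half_window + 1]
    return fit_gradient([(p[0], p[1]) for p in window])
```

Under these assumptions, a mostly vertical motion such as shaking hands yields a gradient of large magnitude, while a mostly horizontal motion such as exchanging an object yields a gradient near zero.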
SVM Multi-class Classifier
In supervised learning, discriminative classifiers often achieve better performance than generative classifiers[8]. The SVM is a successful representative of discriminative classifiers. It has good generalization ability because it is based on the principle of structural risk minimization in statistical learning theory, and activity recognition is exactly the kind of classification problem with limited samples that an SVM classifier handles well.
An SVM classifier deals with two-category classification problems. Given a training sample set $\{(x_i, y_i),\ i = 1, \dots, n,\ x_i \in R^d,\ y_i \in \{+1, -1\}\}$, where $x_i$ is the feature vector and $y_i$ is the label, the SVM finds the optimal classification plane that maximizes the margin between the two categories. The optimal hyper-plane can be constructed by solving the optimization problem:
$$\min \|w\| \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \dots, n \qquad (9)$$
Although the SVM classifier was originally developed for two-category problems, it can be extended to multi-class classification by several derived approaches, such as the one-against-one method, the one-against-all method, and DAG-SVM. In our experiment, we use the one-against-all method to classify the eight types of interactions.
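The one-against-all decision rule can be sketched independently of how each binary classifier is trained. The example below assumes linear decision functions $w \cdot x + b$, one per class, with made-up weights; it shows only the final arg-max step and is not the LIBSVM interface or the trained models from the experiments.

```python
def decision_value(w, b, x):
    """Signed score of a linear binary classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_one_vs_all(models, x):
    """Pick the class whose 'class vs. rest' classifier gives the largest score.

    models: dict mapping a class label to the (w, b) of its binary classifier.
    """
    return max(models, key=lambda label: decision_value(*models[label], x))


# Hypothetical two-feature models for two of the eight interaction classes.
example_models = {
    "kicking": ((1.0, 0.0), -0.5),
    "pushing": ((0.0, 1.0), -0.5),
}
```

With eight interaction types, `models` would hold eight such classifiers, each trained to separate one interaction from the other seven.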
4 Experiments and Results
In this section, we do two things. First, we describe the dataset we created with the Microsoft Kinect. Second, we recognize eight types of interactions using the hierarchical salient feature-based representation and compare the results with another widely used human interaction recognition approach.
Dataset
We collected two-person interactions using a Microsoft Kinect sensor. All videos were recorded in an indoor room while 15 volunteers performed the activities. The dataset has a total of approximately 300 interactions and is publicly available on the Internet.
The most important data in our dataset is the spatial information (3D coordinates) of the two persons' skeletons. To ensure the integrity and continuity of the target data, we ignored the original RGB and depth information when capturing data. An articulated skeleton for each person was extracted by the OpenNI software with the NITE middleware (Natural Interaction Middleware) provided by PrimeSense. However, when two persons overlap, especially in a hugging activity, the full-body tracking of NITE may be inaccurate. Bad and lost tracking would seriously affect the interaction results, so hugging was not included in our dataset. In the end, eight types of two-person interactions were captured: approaching, departing, kicking, punching, pointing, pushing, exchanging an object and shaking hands.
Experiments with Interaction Recognition
Yun et al.[9] extended pose-based features to two-person interaction recognition and showed that geometric relational features based on the distances between joints outperform other features. In our experiment, we therefore extracted the same types of features (Joint Distance and Joint Motion, Plane and Normal Plane, Velocity and Normal Velocity) to recognize interactions in our dataset.
Table 1 Body-pose features for interaction recognition.
Features Average accuracy
Raw Position %
Joint Distance %
Joint Motion %
Plane %
Normal Plane %
Velocity %
Normal Velocity %
Joint features %
Plane features %
Velocity features %
All features %
In the interaction classification process, we used the LIBSVM software. Table 1 shows the recognition results using different features, where Joint features include the joint distance and joint motion features, Plane features include the plane and normal plane features, and Velocity features include the velocity and normal velocity features. The results suggest that Joint features yield higher recognition accuracy than Plane features or Velocity features. Thus, the geometric relational body-pose features based on the distances between all pairs of joints outperform the other feature choices, which agrees with the conclusion of Yun et al. In addition, the results suggest that our dataset of 3D coordinates, created using a Kinect sensor, is broadly comparable to the dataset used in Yun's paper and is effective for this task.
More importantly, to illustrate the effectiveness of our proposed method, we selected the features based on hierarchical saliency described in Section 3 and compared the recognition results with those of the Joint features mentioned above, which combine joint distance and joint motion features to distinguish two-person interactions. Figure 7 shows the confusion matrices for the recognition results based on the two types of features: (a) the Joint features proposed by Yun[9], combining Joint Distance and Joint Motion features; and (b) the features of our proposed method. The average recognition accuracies of the two approaches are % and %, so our proposed method is almost 10% higher than the other method. In particular, for the three interactions 'pushing', 'exchanging' and 'shaking', the accuracy obtained with the features extracted from the hierarchical saliency structure is improved by 30% on average. The reason is that in our proposed method, 'pushing' is grouped with the Positive Interactions that own a Salient Joint, while 'exchanging' and 'shaking' belong to the Negative Interactions. Therefore, the features for those three types of interactions are already discriminated when they are extracted for the training and testing process. In summary, the hierarchical saliency-based representation for human interaction recognition outperforms the methods based on other features.
(a) Yun's features (b) proposed features
Results comparison for different features.
5 Conclusion
The recognition of human activities has attracted increasing interest in computer vision, but remains challenging given the complexity and variety of human actions. The contribution of this paper is twofold: 1) a hierarchical saliency structure, comprising the Salient Action at the highest level, the Salient Point at the middle level, and the Salient Joint at the lowest level, in which the saliencies are extracted according to the spatial-temporal changes of joints at different levels; and 2) following this principle of hierarchical saliency, several types of new features for interaction classification, developed on the dataset we created. This dataset was collected using an RGB-D sensor and can be accessed on the Internet. Experimental results show that our proposed recognition method outperforms the approach using other features.
In a realistic scene, human activities are more complex than the interactions in this paper, so more research is needed on long time sequences that include different types of interactions, to make the best use of this new kind of RGB-D sensor. We plan to recruit more volunteers to capture more data and to extend our interaction dataset with additional interaction categories. More importantly, owing to the limitations of the human tracking software, tracking results are occasionally inaccurate. Therefore, we need to find a better way to track human actions in order to further improve recognition accuracy.
Acknowledgements
We are grateful to the volunteers who helped capture the data. This work was supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20120141120041).
References
[1] Chen L, Wei H, Ferryman J. A survey of human motion analysis using depth imagery[J]. Pattern Recognition Letters, 2013, 34(15): 1995-2006.
[2] Park S, Aggarwal J K. A hierarchical Bayesian network for event recognition of human actions and interactions[J]. Multimedia Systems, 2004, 10(2): 164-179.
[3] Brand. Coupled hidden Markov models for modeling interacting processes[M]. San Juan: IEEE, 1997.
[4] Du Y, Chen F, Xu W. Human Interaction Representation and Recognition Through Motion Decomposition[J]. Signal Processing, 14(12): 952-955.
[5] Fiete R D. Modeling the Imaging Chain of Digital Camera[M]. Bellingham, Washington: SPIE, 2010.
[6] Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A. Real-time human pose recognition in parts from single depth images[J]. Communications of the ACM, 2013, 56(1): 116-124.
[7] Sempena S, Maulidevi N U, Aryan P R. Human action recognition using dynamic time warping[A]. Sempena S. Proceedings of the 2011 International Conference on Electrical Engineering and Informatics[C], Bandung, Indonesia, -5.
[8] Jordan A. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes[J]. Advances in neural information processing systems, 2002, 14: 841.
[9] Yun K, et al. Two-person interaction detection using body-pose features and multiple instance learning[A]. IEEE Computer Society Conference on Computer Vision And Pattern Recognition Workshops CVPRW[C]. Providence, Rhode Island: IEEE, -35.
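Human Behavior Recognition Based on Hierarchical Saliency Representation
GUO Wei, HU Tao, LIU Ruqian
(State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072)
Abstract: Recognizing human behavior is one of the most important problems in computer vision and broadly affects many applications. This paper proposes a new method for recognizing two-person interactions based on a hierarchical saliency representation. Hierarchical saliency is defined as the salient action at the highest level of an interaction, the salient point at the middle level, and the salient joint at the lowest level. It is determined by the greatest spatio-temporal displacement at each level. Given the saliency of interactions at different levels, several types of features can be extracted according to the discriminative characteristics of the behaviors, which include spatial displacement and direction relations. Because few public test datasets are currently available, this paper creates a new dataset, K3HI, containing eight types of interactions. The data were collected with Kinect, a new depth sensor from Microsoft. A multi-class SVM classifier is used for testing. Experimental results show that the average recognition accuracy of the hierarchical saliency-based method is %, which is better than methods using other features.
Key words: pattern recognition; human interaction recognition; Kinect; hierarchical saliency
CLC number: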