Detecting and tracking humans in crowded scenes based on 2D image understanding

Simonnet, Damien Rémi Jules Joseph (2012) Detecting and tracking humans in crowded scenes based on 2D image understanding. (PhD thesis), Kingston University.

Official URL: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos...

Abstract

Tracking pedestrians in surveillance videos is an important task, not only in itself but also as a component of pedestrian counting, activity and event recognition, and scene understanding in general. Robust tracking in crowded environments remains a major challenge, mainly due to the occlusions and interactions between pedestrians. Methods to detect humans in a single frame are becoming increasingly accurate. Therefore, the majority of multi-target tracking algorithms in crowds follow a tracking-by-detection approach, along with models of individual and group behaviour, and various types of features to re-identify any given pedestrian (and discriminate that pedestrian from the others). The aim is, given a Closed-Circuit Television (CCTV) camera view (moving or static) of a crowded scene, to produce tracks that indicate which pedestrians are entering and leaving the scene, for use in further applications (e.g. a multi-camera tracking scenario). This output should therefore be accurate in terms of position and have few false alarms and identity changes (i.e. tracks should neither be fragmented nor switch identity). Consequently, the presented algorithm concentrates on two important characteristics: firstly, production of a real-time or near real-time output, so that it is practically usable for further applications without penalising the final system; secondly, management of occlusions, which is the main challenge in crowds.

The methodology presented, based on a tracking-by-detection approach, proposes an advance on both of these aspects through a hierarchical framework that solves short and long occlusions with two novel methods. First, at a fine temporal scale, kinematic features and appearance features based on non-occluded parts are combined to generate short and reliable 'tracklets'. More specifically, this part uses an occlusion map which attributes a local measurement (obtained by searching over the non-occluded parts) to a target that has no global measurement (i.e. no measurement generated by the global detector), and demonstrates better results in terms of tracklet length without generating more false alarms or identity changes. Second, over a longer temporal scale, these tracklets are associated with each other to build up longer tracks for each pedestrian in the scene. This tracklet data association is based on a novel approach that uses dynamic time warping to locate and measure the possible similarities of appearance between tracklets, by varying the time step and phase of the frame-based visual feature. The method, which does not require any target initialisation or camera calibration, shows significant improvements in terms of false alarms and identity changes, the latter being a critical point for evaluating tracking algorithms.

The evaluation framework, based on different metrics introduced in the literature, consists of a set of new track-based (in contrast to frame-based) metrics, which enable the failing parts of a tracker to be identified and algorithms to be compared using a single value. Finally, the advantages of the dual method proposed to solve long and short occlusions are that it simultaneously reduces the problems of track fragmentation and identity switches, and that it is naturally extensible to a multi-camera scenario. Results are presented as a tag-and-track system over a network of moving and static cameras. The new methodology (i.e. building tracklets based on non-occluded pedestrian parts plus re-identification with dynamic time warping) shows significant improvements on public datasets for multi-target tracking in crowds (e.g. the Oxford Town Centre (OTC) dataset). In addition, two new datasets are introduced to test the robustness of the proposed algorithm in more challenging scenarios. Firstly, a CCTV view of a shopping centre is used to demonstrate the effectiveness of the algorithm in a more crowded scenario. Secondly, a dataset with a network of CCTV Pan Tilt Zoom (PTZ) cameras tracking a single pedestrian demonstrates the capability of the algorithm to handle a very difficult scenario (abrupt motion and non-overlapping camera views) and therefore its applicability as a component of a multi-target tracker in a network of static and PTZ cameras. The thesis concludes with a critical analysis of the work and presents future research opportunities (notably the use of this framework in a non-overlapping network of static and PTZ cameras).
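
The tracklet association step described above rests on dynamic time warping (DTW). As an illustration only, the following is a minimal sketch of textbook DTW applied to two sequences of per-frame appearance feature vectors. It is not the thesis implementation: the abstract does not state the feature type, the local cost or the step pattern, so the Euclidean local cost, the three-way step pattern and the names `dtw_distance`, `tracklet_a`, `tracklet_b` and `tracklet_c` are all assumptions made for this example.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Textbook DTW cost between two sequences of feature vectors.

    Assumptions (not taken from the thesis): each element is a per-frame
    appearance feature given as a tuple of floats, and the local cost is
    the Euclidean distance between two such features.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] holds the best accumulated cost of aligning the first
    # i frames of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: advance in seq_a only, in seq_b only, or in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Hypothetical toy features (e.g. two-bin colour histograms per frame).
tracklet_a = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
# Same appearance sequence, but the first frame is repeated (slower phase).
tracklet_b = [(0.1, 0.9), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]
# A visually different target.
tracklet_c = [(0.9, 0.1), (0.8, 0.2)]

same_target_cost = dtw_distance(tracklet_a, tracklet_b)
other_target_cost = dtw_distance(tracklet_a, tracklet_c)
```

In this sketch the warping absorbs the repeated frame in `tracklet_b`, so its cost against `tracklet_a` stays lower than the cost against `tracklet_c`. A tracker could then use such costs as affinities when deciding which tracklets to join, although how the thesis turns DTW scores into association decisions is not stated in the abstract.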
