SEGMENTATION AND TRANSCRIPTION OF VIDEO DOCUMENTS

Georges QUÉNOT

Object: This research focuses on analyzing the content of video documents for indexing purposes (in the MPEG-7 context, or in connection with the "Action Indexation Multimédia" project within the CHM and ISIS working groups). We study the full or partial automation of the segmentation and transcription of video documents. The main application is content-based access to multimedia documents. Another strongly related application is "intelligent" video compression (MPEG-4). This inter-group (TLP - IMM) project takes place within an incentive action of the laboratory and has received BQR funding from the Computer Science Department of Paris XI University. We consider here the analysis of the image track; the analysis of the audio track is studied by the TLP group [1].

Description: We began by segmenting the document into continuous shots. Sharp transitions are extracted using an original approach: measuring the residual difference after motion compensation (when the residual difference without compensation is large). This is similar to the approach used in speech recognition, where the residual difference between sonograms is compared after temporal alignment. Motion compensation is performed using an optical flow computation technique based on dynamic programming [2]. A few additional filters (for instance, detection of photographic flashes, or dynamic adjustment of thresholds according to the average motion detected before or after the candidate transition) improve the quality of the result. Compared to classical methods (such as discontinuity detection in color histograms or in parameter vectors), this method has the advantage of taking motion-compensated spatial continuity into account, and can therefore discriminate between images that look alike only in their color or texture distributions. As a special case, it yields a correct segmentation for purely monochrome sequences.
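The two-stage test described above can be sketched as follows. This is only an illustrative simplification, not the system itself: motion compensation is replaced by an exhaustive search over small global translations instead of the dynamic-programming optical flow of [2], and the function names, the 8-bit grayscale frame layout (lists of rows) and the threshold values are all assumptions made for the example.

```python
def mean_abs_diff(a, b, dx=0, dy=0):
    """Mean absolute difference between frame a and frame b shifted by
    (dx, dy), measured over the overlapping region only."""
    h, w = len(a), len(a[0])
    total, count = 0, 0
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            total += abs(a[y][x] - b[y + dy][x + dx])
            count += 1
    return total / count if count else float("inf")


def compensated_residual(a, b, max_shift=2):
    """Smallest residual over all global translations up to max_shift pixels.
    Assumption: a crude stand-in for dense optical-flow compensation."""
    return min(
        mean_abs_diff(a, b, dx, dy)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
    )


def is_sharp_transition(a, b, raw_threshold=10.0, comp_threshold=10.0,
                        max_shift=2):
    """Declare a cut only if the frames differ strongly both before and
    after motion compensation (threshold values are illustrative)."""
    if mean_abs_diff(a, b) <= raw_threshold:
        return False  # frames already similar: no need to compensate
    return compensated_residual(a, b, max_shift) > comp_threshold
```

With this sketch, two frames showing the same object displaced by one pixel give a large raw difference but a zero compensated residual, so no cut is reported, whereas two unrelated frames remain different after compensation. Because only pixel intensities are compared, the test does not depend on color information.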
A simple method for detecting dissolve transitions when there is little or no motion has also been developed. The principle is to search, within non-stationary sequences, for images that are linear combinations of their neighbors.
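One possible reading of this principle is sketched below. It is an assumption-laden illustration rather than the method actually implemented: each frame is a flat list of pixel intensities, the frame is fitted by least squares as a mix alpha * prev + (1 - alpha) * next of its two neighbors, and the names and thresholds (min_change, max_rel_residual) are hypothetical.

```python
def dissolve_fit(prev, cur, nxt):
    """Least-squares fit of cur ~ alpha * prev + (1 - alpha) * nxt.
    Returns (alpha, relative_residual, rms_change), or None when prev and
    nxt are identical (no change to explain)."""
    d = [p - n for p, n in zip(prev, nxt)]   # direction prev - nxt
    e = [c - n for c, n in zip(cur, nxt)]    # what must be explained
    dd = sum(v * v for v in d)
    if dd == 0:
        return None
    alpha = sum(ev * dv for ev, dv in zip(e, d)) / dd
    residual = sum((ev - alpha * dv) ** 2 for ev, dv in zip(e, d))
    return alpha, (residual / dd) ** 0.5, (dd / len(d)) ** 0.5


def is_dissolve_frame(prev, cur, nxt, min_change=10.0, max_rel_residual=0.1):
    """True if the sequence is non-stationary around cur and cur is well
    explained as a mix of its neighbors (thresholds are illustrative)."""
    fit = dissolve_fit(prev, cur, nxt)
    if fit is None:
        return False
    alpha, rel_residual, rms_change = fit
    if rms_change < min_change:
        return False  # stationary sequence: nothing to detect
    return 0.0 < alpha < 1.0 and rel_residual <= max_rel_residual
```

In practice the neighbors would probably be taken a few frames away on each side, and a dissolve reported only when several consecutive frames pass the test. As in the text, the fit is only meaningful when there is little or no motion.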
We have also begun acquiring and manually segmenting a TV news corpus (currently 60 minutes in French and 10 minutes in American English) for training and evaluating our systems. We are also using the corpus developed by INA in the context of the "Action Indexation Multimédia" project.

Results and perspectives: On the French and American TV news in our corpus, our system detects 99% of sharp transitions with fewer than 5% false alarms. It also detects about 50% of dissolve transitions, and between 85 and 90% of all transitions regardless of type. Figure 1 shows a visual representation of a video document (the third MPEG file of the corpus prepared by INA for the "Action Indexation Multimédia" group) as a mosaic of images, each corresponding to one automatically extracted continuous shot.
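Rates of this kind can be computed by comparing detected transitions against the manual segmentation. The sketch below is one possible scoring scheme and is not taken from our evaluation protocol: the matching tolerance, the greedy one-to-one matching, and the definition of the false alarm rate as a fraction of the detections are all assumptions.

```python
def evaluate_transitions(reference, detected, tolerance=1):
    """Compare detected transition frame indices with reference ones.
    A detection counts as a hit if an unmatched reference transition lies
    within `tolerance` frames; each reference is matched at most once.
    Returns (detection_rate, false_alarm_rate)."""
    unmatched = sorted(reference)
    hits, false_alarms = 0, 0
    for d in sorted(detected):
        match = next((r for r in unmatched if abs(r - d) <= tolerance), None)
        if match is None:
            false_alarms += 1
        else:
            unmatched.remove(match)
            hits += 1
    detection_rate = hits / len(reference) if reference else 0.0
    # Assumption: false alarms are reported as a fraction of all detections.
    false_alarm_rate = false_alarms / len(detected) if detected else 0.0
    return detection_rate, false_alarm_rate
```

For example, `evaluate_transitions([10, 50, 90, 130], [10, 51, 70, 130])` matches three of the four reference transitions and counts one detection as a false alarm.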
This is preliminary work. Our objective is to go much further in analyzing document content. Further investigations will be carried out in the following directions: improving the robustness of transition detection, searching for other transition types, classifying shots by type, grouping related shots, locating and transcribing text, locating and identifying people, correlating and fusing image, audio and text information and, finally, building a synthetic representation suitable for information retrieval systems.

References:
[1] Transcription of Broadcast News, J.L. Gauvain, L. Lamel, G. Adda and M. Adda-Decker, Eurospeech, Rhodes, September 1997.
[2] Computation of Optical Flow Using Dynamic Programming, Georges M. Quénot, Machine Vision Applications, Tokyo, Japan, 12-14 November 1996.