SEGMENTATION AND TRANSCRIPTION OF VIDEO DOCUMENTS
Object:
This research focuses on analyzing the content of video documents for
indexing purposes (in the MPEG-7 context or in relation with the
``Action Indexation Multimédia'' project within the CHM and ISIS
working groups).
We study the full or partial automation of the segmentation and
transcription of video documents. The main application is content-based
access to multimedia documents. Another strongly related application is
``intelligent'' video compression (MPEG-4). This inter-group (TLP - IMM)
project takes place in the context of an incentive action of the
laboratory and has received BQR sponsorship from the Computer Science
Department of Paris~XI University. We consider here the analysis of the
image track; the analysis of the audio track is studied by the TLP
group [1].
Description:
We began with the segmentation of the document into continuous shots. Sharp
transitions are extracted using an original approach: measuring the
residual difference after motion compensation (when the residual difference
without compensation is large). This is similar to the method used in
speech recognition, where the residual difference between sonograms is
compared after temporal alignment. Motion compensation is performed using
an optical flow computation technique based on dynamic programming [2].
A few additional filters have been added to improve the quality of the
result (for instance, detection of photographic flashes, or dynamic
adjustment of the thresholds according to the average motion detected
before or after the possible transition). Compared to classical methods
(such as discontinuity detection in color histograms or in parameter
vectors), this method has the advantage of taking motion-compensated
spatial continuity into account, and is therefore able to discriminate
between images that look alike only in their color or texture
distributions. As a special case, it yields a correct segmentation for
purely monochrome sequences.
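The two-stage test described above can be sketched as follows. This is only an illustration under stated assumptions, not the system's implementation: exhaustive block matching stands in for the dynamic programming optical flow of [2], the additional filters are omitted, and the function names, block size, search range and thresholds are all hypothetical. Frames are assumed to be 2-D grayscale NumPy arrays.

```python
import numpy as np


def mean_abs_diff(a, b):
    """Mean absolute pixel difference without any compensation."""
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))


def compensated_residual(prev, curr, block=8, search=4):
    """Mean absolute residual after block-matching motion compensation.

    Assumption: block matching replaces the optical flow technique of [2].
    Each block of `curr` is compared with every block of `prev` displaced
    by at most `search` pixels, and the smallest error is kept.
    """
    prev = prev.astype(np.float64)
    curr = curr.astype(np.float64)
    h, w = curr.shape
    total, count = 0.0, 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = curr[y:y + block, x:x + block]
            best = np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue  # candidate block falls outside the frame
                    err = np.abs(target - prev[yy:yy + block, xx:xx + block]).sum()
                    best = min(best, err)
            total += best
            count += block * block
    if count == 0:  # frame smaller than one block: no compensation possible
        return mean_abs_diff(prev, curr)
    return total / count


def is_sharp_transition(prev, curr, raw_threshold=20.0, residual_threshold=15.0):
    """Flag a cut only if the difference stays large after compensation.

    Both thresholds are hypothetical placeholders, not values from the source.
    """
    if mean_abs_diff(prev, curr) < raw_threshold:
        return False  # frames already similar: compensation is not needed
    return compensated_residual(prev, curr) >= residual_threshold
```

With this sketch, a frame pair related by a small translation gives a large uncompensated difference but a small compensated residual, so it is not reported as a cut, whereas two unrelated frames remain different after compensation.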
A simple method has also been developed for detecting dissolve transitions
when there is little or no motion. The principle is to search, inside
non-stationary sequences, for images that are linear combinations of their
neighbors.
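The linear-combination principle can be illustrated with a least-squares fit on a triple of consecutive frames. The sketch below is a minimal example under assumptions: the source does not say how the combination is tested, so the least-squares formulation, the function names and both thresholds are hypothetical. Frames are assumed to be grayscale NumPy arrays of identical shape.

```python
import numpy as np


def linear_combination_fit(prev, curr, nxt):
    """Fit curr ~ a * prev + b * nxt in the least-squares sense.

    Returns the coefficients (a, b) and the mean absolute fitting residual.
    """
    basis = np.stack([prev.ravel(), nxt.ravel()], axis=1).astype(np.float64)
    target = curr.ravel().astype(np.float64)
    coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
    residual = float(np.mean(np.abs(basis @ coeffs - target)))
    return coeffs, residual


def looks_like_dissolve(prev, curr, nxt, change_threshold=10.0, fit_threshold=2.0):
    """True if the sequence is changing and curr is explained by its neighbors.

    Both thresholds are hypothetical placeholders.
    """
    change = float(np.mean(np.abs(prev.astype(np.float64) - nxt.astype(np.float64))))
    if change < change_threshold:
        return False  # stationary sequence: nothing to detect
    _, residual = linear_combination_fit(prev, curr, nxt)
    return residual < fit_threshold
```

The stationarity check matters because, in a static shot, every frame is trivially a combination of its neighbors. As in the text, the sketch is only meaningful when motion is small, since motion breaks the pixel-wise linear relation.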
We have also begun the acquisition and manual segmentation of a TV news
corpus (currently 60 minutes in French and 10 minutes in American English)
for the training and evaluation of our systems. We are also using the corpus
developed by INA in the context of the ``Action Indexation Multimédia''
project.
Results and perspectives:
On the French and American TV news of our corpus, our system detects
99% of the sharp transitions with fewer than 5% false alarms.
The system also detects about 50% of the dissolve transitions and between
85 and 90% of all transitions (of any type).
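Figures of this kind are obtained by comparing automatic detections with the manual segmentation. The sketch below shows one possible way to do so. The source does not define how its percentages were computed, so the matching tolerance, the definition of the false alarm rate (false detections divided by all detections) and the function name are assumptions.

```python
def score_transitions(reference, detected, tolerance=1):
    """Compare detected transition frames with a manual reference.

    A detection counts as a hit if it lies within `tolerance` frames of a
    reference transition that has not been matched yet.
    Returns (detection_rate, false_alarm_rate).
    """
    unmatched = sorted(reference)
    hits = 0
    false_alarms = 0
    for frame in sorted(detected):
        match = next((r for r in unmatched if abs(r - frame) <= tolerance), None)
        if match is None:
            false_alarms += 1
        else:
            unmatched.remove(match)  # each reference transition is used once
            hits += 1
    detection_rate = hits / len(reference) if reference else 0.0
    false_alarm_rate = false_alarms / len(detected) if detected else 0.0
    return detection_rate, false_alarm_rate
```

For example, with reference transitions at frames 10, 50, 90 and 130 and detections at frames 10, 51, 90 and 200, three of the four reference transitions are found and one of the four detections is a false alarm.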
Figure 1 shows a visual representation of a video document (the third MPEG
of the corpus prepared by INA for the ``Action Indexation Multimédia''
group) as a mosaic of images, each one corresponding to an automatically
extracted continuous shot.
This is preliminary work. Our objective is to go much further in the
analysis of document content. Further investigations will be carried out
in the following directions: improving the robustness of transition
detection, searching for other transition types, classifying shots by
type, grouping related shots, finding and transcribing text, finding and
identifying people, correlating and fusing image, audio and text
information and, finally, building a synthetic representation suitable for
information retrieval systems.
References:
[1]
Transcription of Broadcast News,
J.L. Gauvain, L. Lamel, G. Adda and M. Adda-Decker,
Eurospeech, Rhodes, September 1997.
[2]
Computation of Optical Flow Using Dynamic Programming,
Georges M. Quénot,
Machine Vision Applications, Tokyo, Japan, 12-14 November 1996.