The Xtrieve Cross-Modal Multimedia Database

Collecting massive amounts of video, audio, and image data into databases serves little purpose if the appropriate content cannot be located and retrieved. Among the many ways to index and browse multimedia data, one of the newest is cross-modal information retrieval. In many instances, two or more media streams share a temporal relationship; one example is the serial relationship between a textual transcript and a spoken presentation of that transcript. Computed multimedia synchronization discovers this hidden relationship in the form of temporal alignment information. Once such synchronization data has been computed, it can be used to translate search results in one medium into results in another. The search can thus be performed in a well-understood data type such as text, while the results are presented in a complex type such as video. The Xtrieve system demonstrates the cross-modal retrieval concept by providing query-based and browsing-based searching of media content in many modalities. Cross-modal retrieval projects currently underway include text-to-speech, score-to-music, translation-to-translation, and slides-to-lecture.
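
To make the translation step concrete, the following is a minimal sketch of how alignment data might carry a text search result over to a position in a recording. It is not the Xtrieve implementation: the function names, the representation of the alignment as sorted (word index, start time) anchor pairs, and the sample data are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical sample data (not from the Xtrieve system).
# The transcript is the "well-understood" search medium.
TRANSCRIPT = "the quick brown fox jumps over the lazy dog".split()

# Assumed form of the computed synchronization data: anchor pairs of
# (word index in the transcript, start time in seconds in the audio),
# sorted by word index. Not every word needs its own anchor.
ALIGNMENT = [(0, 0.0), (2, 0.8), (4, 1.5), (6, 2.3), (8, 3.1)]


def search_text(words, query):
    """Return the indices of transcript words that match the query."""
    q = query.lower()
    return [i for i, w in enumerate(words) if w.lower().strip(".,;:!?") == q]


def word_to_time(alignment, word_index):
    """Map a word index to the start time of the nearest anchor at or before it."""
    indices = [i for i, _ in alignment]
    pos = bisect_right(indices, word_index) - 1
    if pos < 0:
        pos = 0  # before the first anchor: fall back to the first anchor
    return alignment[pos][1]


def cross_modal_search(words, alignment, query):
    """Search the text, then translate each hit into a time offset in the audio."""
    return [word_to_time(alignment, i) for i in search_text(words, query)]


if __name__ == "__main__":
    # "the" occurs at word indices 0 and 6, which are both anchors.
    print(cross_modal_search(TRANSCRIPT, ALIGNMENT, "the"))   # [0.0, 2.3]
    # "lazy" is word 7, which falls back to the anchor at word 6.
    print(cross_modal_search(TRANSCRIPT, ALIGNMENT, "lazy"))  # [2.3]
```

The query is evaluated entirely in the text domain; only the final lookup touches the alignment. A player could then seek the audio or video stream to each returned offset. Falling back to the preceding anchor is one simple choice here; interpolating between neighboring anchors would be an equally plausible alternative.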