NEC USA, Inc.
110 Rio Robles Ave.
San Jose, CA 95134
Tel: 1-408-943-3002
E-mail: {hirata, hara}@ccrl.sj.nec.com

NEC Corporation
1-1 Miyazaki 4-Chome, Miyamae-ku
Kawasaki, Kanagawa 216, Japan
E-mail: {takano, kawasaki}@mmp.cl.nec.co.jp
We also describe our content-oriented integrated hypermedia system, "Himotoki." It provides a wide variety of navigational tools, such as visual content-based navigation, moving hot-spot navigation, and schema navigation. Each media translation is modularized as a corresponding media augmenter so that it can adapt flexibly to a distributed environment. Applications such as "Electronic Aquatic Life" and "Hypermedia Museum" demonstrate the usefulness of these navigational tools.

KEYWORDS: Content-oriented Integration, Conceptual-based Navigation, Media-based Navigation, Media Augmenter, Recognition Engine, Matching Engine, Moving Hot-spots, Content-based Retrieval
Conventional hypermedia systems, however, lack an adequate mechanism for integrating large amounts of multimedia data. Conventional integration depends essentially on the hypermedia designer's expertise rather than on system capabilities: designers have had to assign specific links among objects or attach various keywords in order to provide flexible, user-customized navigation. From the user's point of view, it has been necessary to combine various types of operations, such as link-following navigation, conditional search, and content-based retrieval[3][5][11][15]. Consequently, conventional hypermedia systems are cumbersome and clumsy when handling huge amounts of data.
Content-oriented integration, as we propose here, aims to provide well-organized media contents and operations on them. It translates a part of each media representation (the media-independent part) into a conceptual one. The translation establishes a linkage between a media representation and a conceptual representation, such as an object name or keywords, thereby giving semantics to the media representation. One can then integrate media representations simply by connecting the corresponding conceptual representations on the hypermedia platform. Because the translation incorporates various kinds of media information, the method enables media to be integrated naturally.
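As a rough illustration of this idea, the following Python sketch shows how a media anchor might be connected to a conceptual instance so that media objects sharing a concept become implicitly linked. All of the names here are our own hypothetical ones; the paper does not define a concrete data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConceptInstance:
    """Conceptual representation, e.g. the concept 'hummingbird'."""
    name: str
    anchors: List["MediaAnchor"] = field(default_factory=list)

@dataclass
class MediaAnchor:
    """Translatable part of a media representation, e.g. an image region."""
    anchor_id: str
    media_type: str                     # "image", "video", "audio", "text"
    concept: Optional[ConceptInstance] = None

def connect(anchor: MediaAnchor, concept: ConceptInstance) -> None:
    """Give semantics to a media anchor by linking it to a concept."""
    anchor.concept = concept
    concept.anchors.append(anchor)

# Two media representations become integrated through the shared concept:
bird = ConceptInstance("hummingbird")
connect(MediaAnchor("img-17/region-3", "image"), bird)
connect(MediaAnchor("vid-02/scene-5/object-1", "video"), bird)
print([a.anchor_id for a in bird.anchors])
```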
The main advantage of content-oriented integration is the rich navigational capability it provides in the browsing phase: conceptual-based navigation, which is enhanced by the conceptual representation, and media-based navigation[9], which is used as a complement. In the authoring phase, the system designer mainly has to focus on designing the schema of the conceptual representation, since it can be designed independently of the various media representations. Based on this schema, multimedia applications are created by connecting media representations to conceptual ones.
To realize the model of content-oriented integration on a hypermedia platform, we introduce the media augmenter, which processes each media representation. Users navigate through the information space by means of various kinds of media clues generated by each media augmenter. The navigational "look & feel" is compatible with that of conventional hypermedia.
In this paper, we also present a content-oriented integrated hypermedia system, "Himotoki," which we have developed [6][8][9]. Himotoki consists of various types of media augmenters, such as image, video, audio, and text augmenters, and, based on these augmenters, provides navigational tools including visual content-based navigation, moving hot-spot navigation, and schema navigation. Experimental applications such as "Electronic Aquatic Life" and "Hypermedia Museum" demonstrate the usefulness of its system design.
This paper commences with an overview of content-oriented integration. Next, we explain the components of the media augmenter and show its data flows. Then, we describe our hypermedia system "Himotoki" and elaborate on the implemented hypermedia applications, primarily their navigational functions. Finally, we discuss future work.
In large-scale hypermedia systems, however, users often get lost in hyperspace, particularly in distributed hypermedia systems such as the World Wide Web. This disorientation problem is well known for hypermedia systems[7]. To avoid it, it is important not only to improve the visual interface[12] but also to improve navigational capabilities[5][9]. In addition to conventional keyword-based navigation, flexible navigation such as content-based retrieval is necessary.
Although a hypermedia system integrates multimedia data in a uniform way, integration quality depends, to a great extent, on the system designer's expertise. Hypermedia designers have been responsible for assigning proper links among objects, or many appropriate keywords, in order to provide flexible, user-customized navigation. As a result, the cost of organizing such systems rises accordingly, and it is difficult to apply this "designer-dependent approach" directly to large-scale systems.
As mentioned above, the requirement for next-generation hypermedia systems is to support users' multiple navigational operations over large-scale data in a natural way. It is essential to provide a wide variety of relationships by introducing media processing technologies[2][3][5][11][15]. Existing works, however, lack techniques for organically integrating media processing analysis with semantic analysis, and they cannot be directly extended to large-scale distributed environments in either the authoring phase or the navigation phase. Therefore, the fundamental requirement is to integrate media processing analysis with semantic analysis appropriately, as a natural extension of hypermedia data models.
To improve performance, an appropriate system configuration is also important, particularly for distributed hypermedia systems. For example, real-time task scheduling is necessary to handle video streams, and transaction capabilities are necessary for node-link information. Modularity is another important factor in a distributed environment.
Each media representation has a part that can be translated into the conceptual representation. This part indicates the semantics of the media representation and is media-independent. We can integrate the various types of media representations by extracting these translatable parts and connecting them to the corresponding conceptual representations, such as object names and keywords. Through this connection, a media representation acquires its own semantics and can be handled via the corresponding conceptual representation.
The other part of the media representation, which is difficult to translate into a conceptual representation, is processed directly, without any translation. Based on media-dependent information, such as shape and color for still images, motion for video, and melody for auditory data, users can navigate through the information space. The system accepts media clues, such as a rough sketch for still images or a hummed melody for audio, as starting points for browsing.
Navigation in the content-oriented integration environment consists of two parts, as illustrated in Figure 1: conceptual-based navigation and media-based navigation. Conceptual-based navigation is based on the connection to the conceptual representation; media-based navigation is based on media-dependent clues. The two work complementarily, allowing users to access multimedia data richly through both the conceptual representation and the media representation.
Media Integration with Semantics: By connecting media representations with conceptual representations, multimedia data are integrated in a natural way. Imagine a picture in which a hummingbird is flying in the sky. The system extracts the region of the bird and analyzes its shape and color. By comparing them with templates in the dictionary, the system identifies the bird and connects the region to the concept "hummingbird." This is an example of connecting a media representation (the region of the bird) to a conceptual representation ("hummingbird" as a conceptual instance). Users can navigate to another scene in which hummingbirds are flying, or listen to the sound of a hummingbird, simply by clicking the bird on the screen. Since this integration is based on multimedia contents, users can traverse smoothly from one medium to another.
Rich Media Description: The conceptual representation is designed to capture high-level semantic structures and is clearly separated from the media representations. It is designed both top-down and bottom-up. Database technologies can be applied to handle conceptual representations, since they can be described as classes and their instances. For example, relationships between objects in the conceptual representation can be described with conceptual database models such as the extended Entity-Relationship model. Since the media representations are handled through this well-organized representation, we can also manage the huge amount of corresponding media representations cleanly.
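For instance, a fragment of such a schema might be written down as entities and relationships in the style below. The entity and relation names are illustrative inventions of ours, not taken from the paper.

```python
# A minimal, hypothetical ER-style schema fragment for an aquatic-life domain.
schema = {
    "entities": {
        "Fish":    ["name", "family", "habitat"],
        "Scene":   ["start_frame", "end_frame"],
        "Habitat": ["name", "region"],
    },
    "relationships": [
        ("Fish", "appears_in", "Scene"),
        ("Fish", "lives_in", "Habitat"),
    ],
}

# Instances are plain records conforming to the schema; media anchors
# attach to these instances rather than directly to one another.
fish = {"type": "Fish", "name": "butterfly fish",
        "family": "Chaetodontidae", "habitat": "coral reef"}
```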
Navigation by Media Processing: Users can use media clues directly to retrieve media objects; they do not have to translate media-based clues into conceptual ones. For example, users can draw a rough sketch or hum a melody to retrieve multimedia objects. Users can navigate with ambiguous clues, uninfluenced by the system designer's subjective perception[9]. By fine-tuning matching parameters, user-customized navigation is enabled, and the system can adaptively modify the matching parameters based on the history of users' interactions.
Integrating Multiple Operations: Since every media representation and conceptual representation is integrated on the same hypermedia platform, it is easy to combine several information clues. For instance, users can retrieve the video scene in which a butterfly moves from the left side of the screen to the right. This is an example of combining a concept (butterfly) with a video-based clue (a moving object), as sketched below.
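A hedged sketch of how such a combined query might be evaluated follows; the predicate and field names are our own, since the paper specifies no query API.

```python
def moves_left_to_right(locus):
    """Media-based clue: x-coordinate increases over the object's locus."""
    xs = [x for x, _ in locus]
    return len(xs) >= 2 and xs[0] < xs[-1]

def combined_query(scenes, concept_name):
    """Conceptual clue (object name) AND video-based clue (motion)."""
    return [s["id"] for s in scenes
            if s["concept"] == concept_name and moves_left_to_right(s["locus"])]

scenes = [
    {"id": "scene-1", "concept": "butterfly", "locus": [(10, 50), (200, 60)]},
    {"id": "scene-2", "concept": "butterfly", "locus": [(300, 40), (20, 45)]},
]
print(combined_query(scenes, "butterfly"))   # -> ['scene-1'] only
```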
Distribution and Extensibility: Each medium is integrated in a loosely coupled way. A common API for media-conceptual links encapsulates the heterogeneity of the different augmenters, which are organized as independent modules. Therefore, flexible system configuration is possible. Because of this modularity, it is easy to replace the functions for one medium as its media processing technology matures, without changing the functions for the other media.
Each media augmenter is connected to the conceptual augmenter, which executes the hypermedia navigation, through a translation unit. A media augmenter consists of three parts: a recognition engine, an understanding engine, and a matching engine. The conceptual augmenter has all of these except the understanding engine, because the understanding engine is what translates a media representation into a conceptual one to connect the augmenters.
With this architecture, each process is modularized and amenable to automatic generation. Some processing in the understanding engine, however, is difficult to execute automatically; an interactive approach is also applicable here. In such a case, since the problem is simplified to having the system designer simply connect anchor objects to the conceptual representation, it is easy to construct the media structures.
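In code, the three-part structure might look like the following hypothetical interface; the method names are ours, not the paper's.

```python
from abc import ABC, abstractmethod

class MediaAugmenter(ABC):
    """One augmenter per medium (image, video, audio, text)."""

    @abstractmethod
    def recognize(self, media_data):
        """Recognition engine: extract attributes and store a media index."""

    @abstractmethod
    def understand(self, media_index):
        """Understanding engine: translate anchors into conceptual instances."""

    @abstractmethod
    def match(self, clue, k=10):
        """Matching engine: rank stored indexes against a media clue."""
```

Following the description above, the conceptual augmenter would implement recognize and match but not understand.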
The media augmenter mainly works as follows:
Media-to-Conceptual Translation: It extracts the translatable parts of the media representation and connects them to the corresponding conceptual representations.
Conceptual-based Navigation: This navigation uses the conceptual representation. In addition to navigation starting from or ending at the conceptual representation, it includes navigation from one media representation to another via the conceptual representation. In this media-to-media navigation, the user's input is first translated into the corresponding conceptual representation; the navigation is executed by following the links defined among the conceptual representations; and finally, the media representation corresponding to the resulting conceptual representation is shown to the user (a sketch of this path appears after this list). Compared with direct links among objects in the media representation, conceptual-based navigation can reduce the number of links while keeping the navigational capabilities, since the links are implicit[4].
Media-based Navigation: Media-based navigation[9] is illustrated in Figure 3. The navigation is executed by comparing media-based indexes on demand. The mechanism is described in detail in a later section.
Schema Navigation: Schema navigation is navigation based on classes or sets of instances. It is defined for both conceptual-based navigation and media-based navigation. Since the system designer can define a conceptual-based schema in the conceptual representation, users can navigate through the information space along this schema. A media-based schema navigation is also defined using media-based attributes. Schema navigation enables users to grasp the overall hypermedia structure, which is particularly useful for large hypermedia systems.
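The media-to-media path through the conceptual layer described above might be sketched as follows. This is a simplification under our own naming; the paper defines no such API.

```python
def media_to_media(anchor, concept_links, relation, media_of):
    """Navigate: media anchor -> concept -> linked concept -> its media."""
    concept = concept_links[anchor]        # 1. translate input into a concept
    target = relation.get(concept)         # 2. follow a conceptual link
    return media_of.get(target, [])        # 3. present that concept's media

concept_links = {"img-1/region-2": "hummingbird"}
relation = {"hummingbird": "hummingbird-song"}     # e.g. a "has_sound" link
media_of = {"hummingbird-song": ["audio-7"]}
print(media_to_media("img-1/region-2", concept_links, relation, media_of))
```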
As mentioned above, the content-oriented integrated hypermedia system provides a wide variety of navigational capabilities: conceptual to conceptual, conceptual to media, media to conceptual, and media to media. In addition, the system accepts various kinds of clues, as illustrated in Figure 3. The operations are integrated in a user-friendly manner on the hypermedia platform; users navigate through the information space simply by clicking the mouse or by inputting clues directly.
This engine projects media contents into a measurable information space specified by attributes: for example, power spectrum values for auditory data, color values for images, and motion vectors for video. This projection is similar to an operation in the field of database modeling.
These attributes are sometimes difficult to describe as meaningful keywords, but they can be powerful clues for specifying the media. They are stored as media indexes. A media index needs to be compact, since a lot of data will be stored and matching has to be time-efficient. Media indexes are used by both the matching engine and the understanding engine.
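As an illustration only (the paper does not specify the attribute set), a recognition step for still images could project each image into a compact color histogram index:

```python
def color_index(pixels, bins=4):
    """Project an image into a measurable attribute space: a coarse RGB
    histogram stored as the media index (numbers, not keywords)."""
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:                       # pixels as 0-255 RGB triples
        cell = ((r * bins) // 256) * bins * bins \
             + ((g * bins) // 256) * bins \
             + ((b * bins) // 256)
        hist[cell] += 1
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]             # normalized, 64 values per image
```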
In Figure 4, the recognition engine receives an image (media input) in the center of which a woman in red clothes is standing. The engine applies image processing techniques such as edge detection and region division, and extracts media attributes related to color, contour, region, and texture. These are stored as the media index, without being translated into a textual representation.
For video data, scene breaks, moving objects, and the visual attributes and locus of each object, for example, are extracted and stored as the media index.
When the recognition engine creates media components such as video scenes and media objects, a new anchor ID is registered to distinguish them from each other and to include these new components in conceptual-based navigation.
For applications on large-scale systems, this recognition step needs to be executed automatically. Many media processing techniques, such as edge detection and region division for still images, and motion or object detection for video data, are applied for this purpose.
The understanding engine extracts the anchor objects that can be translated into the conceptual representation and defines links between those anchor objects and the conceptual instances corresponding to the media representation.
This step is a labeling procedure that uses a media dictionary or knowledge base. Rules for object understanding, and rules describing the relations between objects, are stored in the media dictionary and utilized for understanding. For example, it is possible to connect pictures to sensuous keywords such as "elegant" and "vivid" using color composition information. Media objects in visual data are identifiable by comparing their attributes with those in the dictionary. Text abstraction and keyword extraction for textual representations are other examples in which the understanding step is executed automatically.
When the corresponding conceptual representation does not exist in the conceptual instance list, the understanding engine instantiates a new conceptual object, and the media representation is connected to that instance.
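A minimal sketch of this lookup-or-instantiate behavior, under hypothetical names and a toy template-matching rule:

```python
def understand(anchor_id, attributes, dictionary, instances):
    """Label an anchor by dictionary lookup; instantiate the concept if new."""
    label = None
    for name, template in dictionary.items():    # rule/template comparison
        if all(attributes.get(k) == v for k, v in template.items()):
            label = name
            break
    if label is None:
        return None                              # untranslatable: stays media-only
    concept = instances.setdefault(label, {"name": label, "anchors": []})
    concept["anchors"].append(anchor_id)         # connect media to concept
    return concept

dictionary = {"hummingbird": {"shape": "bird", "color": "green"}}
instances = {}
understand("img-1/region-2", {"shape": "bird", "color": "green"},
           dictionary, instances)
print(instances["hummingbird"]["anchors"])
```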
The understanding engine enables users to execute navigation that combines media features with semantics: for example, "retrieve pictures in which a green tortoise is racing with a white rabbit" or "jump to the scene where a butterfly fish is moving at the center of the screen." This rich anchor representation provides flexible navigation.
In Figure 4, the understanding engine connects the picture with the sensuous word "elegant," connects the region of the woman with "red clothes" and "woman," and connects the region of the flower with "flower." New anchor objects, such as the woman and the flower, can serve as new media anchors for conceptual-based navigation.
Usually, users' inputs are rough and imprecise, and their intentions ambiguous. Powerful matching methods are therefore required that can find relevant objects from low-quality input. By fine-tuning the priorities of the attributes, user-customized links, which are not influenced by the system designer's viewpoint, can be defined, which greatly improves usability. The matching engine outputs results in order of their matching scores, which makes it easy to integrate the results of plural media augmenters.
The system can classify media representations automatically, using the similarity among instances of the media representation. This is called media-based clustering. The system can speed up the matching procedure by skipping images that belong to a cluster far from the query. In addition, it is possible to create a media-based schema from the similarity of the media contents[9]. The overview diagram derived from this clustering, together with media-based schema navigation, is effective for large-scale hypermedia systems.
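A sketch of how such cluster-based pruning could speed up matching, assuming precomputed cluster centroids; the Euclidean distance and the radius threshold are illustrative choices of ours:

```python
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def match_with_pruning(query, clusters, radius=0.5):
    """Skip whole clusters whose centroid is far from the query,
    then score only the members of the remaining clusters."""
    candidates = []
    for centroid, members in clusters:       # members: [(media_id, index), ...]
        if distance(query, centroid) > radius:
            continue                         # far cluster: skip all its images
        for media_id, index in members:
            candidates.append((distance(query, index), media_id))
    return [mid for _, mid in sorted(candidates)]
```

Here an entire cluster is rejected with a single centroid comparison, which is the speed-up described above.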
Since the matching engine can compare the attributes directly, the system does not have to understand the contents of the media objects themselves. Media-based navigation can thus cover the untranslated parts of the media representation.
As shown in Figure 5, the matching engine accepts media clues or a media ID as query input. It reads the media indexes through the media management module, calculates the similarity between the query input and the media indexes, and outputs candidate IDs. Media-based navigation is combined with conceptual-based navigation: retrieval candidates are presented on a browsing table, and the user can use them to specify the query for the next navigation.
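In outline, this flow might look like the sketch below, where the weights model the user-tunable attribute priorities mentioned earlier; all names and the per-attribute score are hypothetical.

```python
def score(q, m):
    """Toy per-attribute score: closer attribute values score higher."""
    return 1.0 / (1.0 + abs(q - m))

def similarity(query_index, media_index, weights):
    """Weighted similarity over per-attribute scores (higher is better)."""
    return sum(weights[attr] * score(query_index[attr], media_index[attr])
               for attr in weights)

def matching_engine(query_index, media_indexes, weights, k=5):
    """Read indexes, rank by similarity, output candidate IDs."""
    ranked = sorted(media_indexes.items(),
                    key=lambda item: similarity(query_index, item[1], weights),
                    reverse=True)
    return [media_id for media_id, _ in ranked[:k]]

weights = {"color": 0.7, "shape": 0.3}           # user-tuned priorities
indexes = {"img-1": {"color": 0.9, "shape": 0.2},
           "img-2": {"color": 0.4, "shape": 0.8}}
print(matching_engine({"color": 0.8, "shape": 0.3}, indexes, weights))
```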
(a) Himotoki Clients: In Himotoki clients, input/output processing for the GUI is handled as in other systems. In addition to conventional hypermedia input, media-clue input, such as a rough sketch, is also available. The client includes a navigation management module corresponding to the hypermedia management capabilities. It handles not only node-link access to the Conceptual Augmenter but also media access to the Media Augmenter, for media-based navigation or for requesting media data to be displayed. Client applications, such as browsing and authoring tools, run on the client.
(b) Himotoki Servers: The server consists of the following two components, as illustrated in Figure 6.
(i) Conceptual Augmenter: It manages hypermedia nodes and links that define relations among nodes. It mainly manages the conceptual representation. The DBMS handles the secondary storage management for node/link data and for attributes.
(ii) Media Augmenter: It manages media data as the contents corresponding to each node. It consists of individual media management modules for each medium, to provide efficient media processing. The augmenter provides a uniform API that encapsulates the heterogeneity of each media representation. Large volumes of media data are stored in the Media Augmenter.
Using this function, users can, for example, access an explanation of a displayed object simply by touching the moving object on the screen, as shown in Figure 8. Most existing hypermedia systems handle stream media only as destination objects, since the streams have no internal structures to which anchor objects could be assigned; at best, a whole stream, such as a scene, acts as a single anchor object. By defining moving hot-spots, we can structure the video data based on its contents.
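Conceptually, resolving such a touch amounts to a time-indexed hit test against each object's locus. A minimal sketch, where the bounding-box representation and all names are assumptions of ours:

```python
def hit_test(x, y, frame, hotspots):
    """Return the anchor whose bounding box contains (x, y) at this frame."""
    for anchor_id, locus in hotspots.items():  # locus: frame -> (x0, y0, x1, y1)
        box = locus.get(frame)
        if box and box[0] <= x <= box[2] and box[1] <= y <= box[3]:
            return anchor_id                   # follow this anchor's concept link
    return None

hotspots = {"fish-3": {120: (40, 60, 90, 100)}}
print(hit_test(70, 80, 120, hotspots))         # -> "fish-3"
```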
Given current media processing techniques, we have adopted an editorial approach, rather than automatic generation, for the recognition and understanding phases. We have implemented a moving hot-spot editor in the Himotoki authoring tool. The designer has only to specify the location of the corresponding object in key frames; the system automatically interpolates its location in the other frames. Each locus is then connected to the conceptual representations interactively, with a drag-and-drop interface.
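The interpolation itself can be as simple as linear blending between the designer's key frames; the following is a sketch under that assumption, not necessarily the editor's actual method.

```python
def interpolate_locus(keyframes):
    """Linearly interpolate a hot-spot box between designer-placed key frames.
    keyframes: sorted list of (frame, (x0, y0, x1, y1))."""
    locus = {}
    for (f0, b0), (f1, b1) in zip(keyframes, keyframes[1:]):
        for f in range(f0, f1 + 1):
            t = (f - f0) / (f1 - f0)
            locus[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    return locus

keys = [(0, (10, 10, 40, 40)), (10, (110, 10, 140, 40))]
print(interpolate_locus(keys)[5])   # box halfway across the screen
```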
Users have only to draw a rough sketch, with the outline and colored regions of an image, to retrieve the original image, as shown in Figure 9. Users can browse through paintings with similar compositions and colors, without manually created links, and they can continue further retrieval from these candidates. In this way, step by step, users can locate the painting they want by focusing on similarities in color and shape, without drawing a sketch for every query. The action of the matching engine is encapsulated, and the operation is executed under the node-link metaphor, in the same manner as ordinary hypermedia navigation. Therefore, it combines smoothly with other types of media clues.
The matching engine classifies the media automatically in terms of image-based features, i.e., shape and color. The clustering results are used for the overview diagram and for the selection of matching targets.
(1) If network bandwidth is insufficient: To secure enough bandwidth, we can locate some critical functions of the Media Augmenter on the Himotoki clients. For example, locating the video storage at the client site improves efficiency, since the video stream can then be transmitted over the large bandwidth of the internal data bus when media data are accessed.
(2) If network communication latency is too long: To minimize latency, we can locate the Media Augmenter at the site nearest to the user, improving response time. It is also possible to place multimedia contents on the Media Augmenter nearest to their most frequent users.
(3) If the traffic on specific paths is too heavy: To ensure load balancing, we can distribute Media Augmenters appropriately. In addition, we can duplicate multimedia contents and locate the same data at multiple sites, so that the server loads are balanced.
Simply by touching a fish moving on the screen, users can receive an explanation of the specified fish, composed of video, text, and pictures. This navigation is executed through moving hot-spot navigation. In this application, the recognition engine extracts the locus corresponding to the movement of a fish, and the understanding engine connects the locus to the conceptual representation, e.g., the fish's name. Users can therefore easily use these video-based clues when navigating, and can select the media representation that suits their interests; for example, they can listen to voice annotations instead of reading textual explanations.
In this application, more than 20 different fish appearing in the same video frame can be identified. Users can also specify a fish by its class or habitat and jump to the scene where the specified fish is swimming.
In this system, a laser disk player installed at each client machine works as the video server, while the other media, such as text, pictures, and sound, are served from a central machine over the network. The system comprises approximately 30 minutes of video, 100 photographs, 600 nodes, and 2000 links.
In addition to this conventional retrieval, users can execute media-based navigation over visual data. By simply drawing a rough sketch of a picture and showing it to the system, the user has the system search for candidates similar to the sketch. Figure 11 is an example of navigation that combines media-based clues with bibliographic clues: the user specifies the painter (in this case, Van Gogh) and draws a rough sketch, and the system finds and presents the paintings by Van Gogh that are similar to the sketch. Appropriate candidates, such as "Sunflower," are presented.
In this application, the recognition engine automatically extracts visual information such as color, shape, and general composition. Since the matching engine registers the color information used in the system during the indexing phase, the system can display overview diagrams organized by the colors used. Users can tune the contribution ratio of color and shape in a user-friendly manner, enabling user-customized navigation. Moreover, since the recognition engine extracts the objects in the images automatically, it is possible to realize content-oriented integration for still images by connecting these objects to conceptual representations.
This system has approximately 5 basic entities, 300 pictures, 600 nodes and 2000 links.
Development of the video augmenter is one of our ongoing projects. Video has its own information clues for retrieval. In addition to the interactive approach, it is necessary to build the video index automatically; by focusing on the continuity between frames, the objects can be extracted. We are also developing a video-matching engine that compares video features such as moving loci, with the system selecting candidates and showing them to the users.
The media-based overview diagram, which is based on similarity in the media representation, provides users with navigational guidance for large-scale hyperspaces. The methods for visualizing large information spaces must be designed in a user-friendly manner; overview diagrams of the node-link structure and of the media representation should both be considered[12].
In computer graphics, there have been trials that translate auditory data directly into video information[14]. In content-oriented integration, such translation from one media representation directly into another should also be supported. By incorporating these techniques into the user's navigation, the navigational capabilities will increase.
We also presented our content-oriented integrated hypermedia system, "Himotoki." We explained the navigational capabilities of Himotoki, in particular moving hot-spot navigation and visual content-based navigation.
Content-oriented integration provides guidance for integrating a huge amount of multimedia data in a uniform way. We believe that content-oriented integration will expand the uses of next-generation multimedia systems.
[2] V. Burrill, T. Kirste, et al., "Time-varying sensitive regions in dynamic multimedia objects: a pragmatic approach to content-based retrieval from video," Information and Software Technology, Special Issue on Multimedia, 36(4), pp. 213-224, Jul. 1994.
[3] P. Constantopoulos, J. Drakopoulos, Y. Yeorgaroudakis, "Retrieval of Multimedia Documents by Pictorial Content: A Prototype System," Proc. International Conference on Multimedia Information Systems '91, McGraw-Hill, pp. 35-48, Jan. 1991.
[4] S.J. DeRose, "Expanding the Notion of Links," ACM Hypertext '89, pp. 249-267, 1989.
[5] A. Hampapur, et al., "Digital video indexing in multimedia systems," Proc. ACM Multimedia '94, Oct. 1994.
[6] Y. Hara, A.M. Keller, G. Wiederhold, "Implementing Hypertext Database Relationships through Aggregations and Exceptions," ACM Hypertext '91, pp. 75-90, Dec. 1991.
[7] F.G. Halasz, "Seven Issues: Revisited," Keynote Talk at ACM Hypertext '91, Dec. 1991.
[8] K. Hirata, T. Kato, "Query by Visual Example," Extending Database Technology '92, pp. 56-72, Mar. 1992.
[9] K. Hirata, Y. Hara, et al., "Media-based Navigation for Hypermedia Systems," ACM Hypertext '93, pp. 159-173, Nov. 1993.
[10] S. Kawasaki, K. Hirata, et al., "Hypermedia on Demand - A Distributed Navigational Hypermedia System," Telecom '95, Oct. 1995.
[11] T. Kageyama, et al., "Melody Retrieval with Humming," Proc. Int. Computer Music Conf., pp. 349-351, Sept. 1993.
[12] E.G. Noik, "Exploring Large Hyperdocuments: Fisheye Views of Nested Networks," ACM Hypertext '93, Nov. 1993.
[13] N. Streitz, et al., "SEPIA: A Cooperative Hypermedia Authoring Environment," ECHT '92, pp. 11-22, 1992.
[14] N. Shibata, K. Morimoto, et al., "A system that generates images interactively from musical performance," ITE Technical Report, Vol. 17, No. 24, pp. 13-18, The Institute of Television Engineers of Japan, 1993.
[15] M. Tabuchi, et al., "Hyperbook: A Multimedia Information System That Permits Incomplete Queries," Proc. International Conference on Multimedia Information Systems '91, McGraw-Hill, pp. 3-16, Jan. 1991.