Retrieving Relevant Metaverses Using Hierarchical Features
Metaverse environments offer immersive, multimedia-rich experiences with growing relevance in education, entertainment, and cultural applications. The ability to grasp the contents of these environments, which consist of many rooms or subspaces, is a key building block for Metaverse retrieval systems. However, current methods remain limited, as they are not designed to separate local semantics (e.g., individual multimedia elements and room-level details) from global, Metaverse-level semantics. Moreover, public datasets on similar topics do not capture the complexity of multi-room environments filled with multimedia content. Our contributions are twofold. First, we introduce HiCALM, a hierarchical Metaverse retrieval framework built around the structural hierarchy of Metaverse environments. HiCALM models Metaverses in a bottom-up fashion, progressively capturing local and then global semantics. To bridge the gap with textual data and achieve high-performance text-to-Metaverse retrieval, a novel cross-modal hierarchical loss supervises training, teaching the model to associate the hierarchical visual features with textual information that is likewise extracted hierarchically. Second, to overcome the absence of suitable datasets for the task, we present Museums3k, a large-scale dataset of 3,000 virtual museums annotated with detailed descriptions, each composed of multiple rooms populated with diverse multimedia elements, and GamingMV, a smaller dataset built from 239 real-world gaming-related Metaverses. Through extensive quantitative and qualitative experiments, we show that HiCALM achieves considerable improvements in text-to-Metaverse retrieval, reaching up to 95.0\% R@1 and 60.0\% $\text{nDCG}_3@10$ on Museums3k (more than +40\% R@1 and +11\% nDCG over existing solutions), and up to 62.0\% R@1 and 62.6\% $\text{nDCG}_3@10$ on GamingMV (more than +23\% R@1 and +5\% nDCG).
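The exact form of the cross-modal hierarchical loss is not given in this abstract; the sketch below is a minimal, plausible reading of it, assuming a symmetric InfoNCE term applied at each level of the hierarchy (room-level and Metaverse-level) and a weighted sum across levels. All tensor names (room_vis, mv_txt, ...) and the alpha weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (visual, text) embeddings."""
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(visual.size(0), device=visual.device)
    # Matching pairs sit on the diagonal; contrast in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(room_vis, room_txt, mv_vis, mv_txt, alpha=0.5):
    """Combine local (room-level) and global (Metaverse-level) alignment.

    room_vis / room_txt: (N_rooms, D) embeddings of rooms and their captions.
    mv_vis / mv_txt:     (B, D) embeddings of whole Metaverses and descriptions.
    alpha weights the local term against the global one (an assumption here,
    not a value taken from the paper).
    """
    local_term = info_nce(room_vis, room_txt)   # align room-level semantics
    global_term = info_nce(mv_vis, mv_txt)      # align Metaverse-level semantics
    return alpha * local_term + (1.0 - alpha) * global_term
```

Under this reading, the bottom-up design falls out naturally: room embeddings are supervised against room-level text before (or jointly with) the pooled Metaverse embeddings being supervised against full descriptions.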