
Improving the semantic representations for cross-lingual document retrieval

May 4th, 2013

I have had the pleasure of presenting some of our recent results at PAKDD 2013 in Gold Coast, Australia. The conference was great and the location couldn’t have been better, so we were able to catch some sun and walk along the beaches while discussing future collaboration, theory and applications.

Hubs are points that occur very frequently in the k-nearest-neighbor lists of other points, which makes them centers of influence, and they are known to arise in textual data. They are also known to cause problems by being frequent neighbors (= very similar) to semantically different types of documents. However, it was previously unknown whether this property is language-dependent and how it affects the cross-lingual information retrieval process.

What we have shown by analyzing aligned text corpora can be summarized as follows: hubs in one language are not necessarily hubs in another language, so different documents become influential. However, surprisingly, the percentage of label mismatches in reverse neighbor sets remains more or less unchanged. In other words, the nature of occurrences is preserved across languages. This comes as a bit of a surprise, since hubness is arguably a geometric property arising from the interplay of metrics and data representations. Yet, it seems that more semantic information than was previously thought remains hidden there, captured and preserved across different languages.
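To make the quantities above concrete, here is a minimal sketch of how occurrence counts and label mismatches in reverse neighbor sets can be computed. It is only an illustration: the function names (`k_nearest`, `occurrence_stats`), the toy points and the labels are made up for this post, and plain Euclidean distance stands in for whatever similarity measure one would actually use on documents.

```python
import math
from collections import Counter


def k_nearest(points, k):
    """For each point, return the indices of its k nearest neighbors.

    Assumption: Euclidean distance on small dense vectors (toy setting).
    """
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def occurrence_stats(points, labels, k):
    """Return (n_k, bn_k): total and label-mismatched k-occurrence counts.

    n_k[j]  = how many points have j among their k nearest neighbors,
              i.e. the size of j's reverse neighbor set.
    bn_k[j] = how many of those points carry a label different from j's.
    """
    total, bad = Counter(), Counter()
    for i, nbrs in enumerate(k_nearest(points, k)):
        for j in nbrs:
            total[j] += 1
            if labels[j] != labels[i]:
                bad[j] += 1
    n = len(points)
    return [total[j] for j in range(n)], [bad[j] for j in range(n)]


# Hypothetical toy data: point 0 sits in the middle and acts as a hub.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (4, 5)]
labels = ["a", "a", "b", "b", "a", "b"]

n_k, bn_k = occurrence_stats(points, labels, k=1)
bad_fraction = sum(bn_k) / sum(n_k)  # share of mismatched occurrences
print(n_k, bn_k, bad_fraction)
```

On aligned corpora, one would run this once per language on the same set of documents and compare both the hub lists (the largest `n_k` values) and the overall mismatch share.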

We have used this observation to show that it is possible to improve the common semantic representation obtained via canonical correlation analysis (CCA) by simply introducing some hubness-aware instance weights. This is certainly not the only way to go about it and probably not the very best one, but it served as a good proof-of-concept.
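As a rough illustration of what instance weighting means here, one generic way to put weights into CCA is to replace the usual covariance estimates with weighted ones. The notation below is an assumption made for this post and not necessarily the exact formulation from the paper: x_i and y_i are the two language views of the i-th aligned document, and w_i is a non-negative weight derived from its hubness (for example, from how often it occurs as a neighbor and how many of those occurrences are label mismatches).

```latex
\mu_x = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad
\mu_y = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}

C_{xy} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \, (x_i - \mu_x)(y_i - \mu_y)^{\top}
\quad \text{(and analogously } C_{xx},\ C_{yy}\text{)}

(a^{*}, b^{*}) = \arg\max_{a,\, b} \;
\frac{a^{\top} C_{xy} \, b}{\sqrt{a^{\top} C_{xx} \, a} \; \sqrt{b^{\top} C_{yy} \, b}}
```

Setting all w_i = 1 recovers standard CCA, so the weights only change how much each aligned document pair contributes to the learned common space.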

The entire paper can be found here: The Role of Hubs in Cross-lingual Supervised Document Retrieval
