HIGHLIGHTS
- What: Building on this, the authors propose momentum encoding, viewed from a multi-task perspective, as a bridge for information; it improves the quality of the mutual-information representation and optimizes the distribution of feature points within the cross-modal shared semantic space. The authors aim not only to retrieve a single image but also to attach essential descriptions to the image that is returned. To address this, they propose a multi-task model trained jointly on cross-modal image-text retrieval and image captioning. The paper's objective is to preserve semantic consistency in the context of fine-grained visual . . .
