Learning Translations via
Matrix Completion
ABSTRACT
I was part of a large team led by Derry Wijaya, in collaboration with Brendan Callahan, John Hewitt, Jie Gao, and Marianna Apidianaki, supervised by Chris Callison-Burch. The paper was published in ACL 2017 (25% acceptance rate). Bilingual lexicon induction is the task of learning word translations without bilingual parallel corpora. In this paper, we model bilingual lexicon induction as a matrix completion problem that leverages a diverse set of data, from text to images, each of which may be incomplete or noisy. Specifically, we construct a matrix with source words in the columns and target words in the rows, and formulate translation as matrix factorization with a Bayesian Personalized Ranking (BPR) objective. This objective is appropriate because the absence of a value in the matrix does not imply that the pair is not a translation, and BPR casts the prediction of missing values as a ranking task. The framework is tested on the VULIC1000 dataset, comprising 1000 nouns in Spanish, Italian, and Dutch and their one-to-one ground-truth translations in English, and outperforms state-of-the-art models by a significant margin.
INTRODUCTION

Machine translation between languages is a crucial step in bringing the world together. However, machine translation requires large, sentence-aligned bilingual texts to learn effective models, and such corpora are rare, especially for resource-constrained languages, which results in inaccurate translations. Because the quantity, and thus the coverage, of these texts is low, “out-of-vocabulary” words may still be encountered at run time. The Bilingual Lexicon Induction (BLI) task addresses this by learning word translations from monolingual or comparable corpora to increase word coverage. Past attempts have often relied on simple count-based methods, but the feature vectors constructed this way typically support only a heuristic notion of distance. In recent years, there has been a resurgence of interest in word embeddings learned by neural networks, which support a better notion of word similarity on certain tests. Recent work in machine translation has leveraged these advances by constructing a shared bilingual embedding space spanned by the same basis; in this shared space, words with the same meaning in different languages lie close together, providing an immediate measure for determining translations (see Fig 1). Typically, the transformation between the semantic spaces of two languages is induced using seed translations from existing dictionaries, or reliably extracted from pseudo-bilingual corpora built from comparable texts. Recent work by Vulic et al. (2016) and others showed that naively combining word embeddings with visual similarities from images can improve translation accuracy. We extend these prior attempts by constructing a unified framework that integrates signals from monolingual, bilingual, and image data to improve translation accuracy.

PROBLEM FORMULATION

Overview
Our approach is based on extensions to the probabilistic model of matrix factorization (MF) used in collaborative filtering. We represent our translation task as a matrix with source words in the columns and target words in the rows (Figure 1). Based on the translations observed in the matrix, which come from a seed dictionary, our model learns low-dimensional feature vectors that encode the latent properties of the words in the rows and the words in the columns. The dot product of these vectors, which indicates how “aligned” the source and target word properties are, captures how likely the two words are to be translations. Missing values in the matrix correspond to candidate translations that the model must predict. The Bayesian Personalized Ranking (BPR) approach to matrix factorization (Rendle et al., 2009) formulates the prediction of missing values as a ranking task: under the assumption that observed true translations should receive higher scores than unobserved ones, BPR learns to maximize the difference between the values assigned to observed translations and the values assigned to unobserved translations. However, bilingual dictionaries do not exist for some languages, so the existing formulation of MF with BPR suffers from a “cold start” problem. We mitigate this problem by incorporating monolingual signals of translation equivalence and visual representations of words extracted from images with AlexNet.
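To make the scoring concrete, here is a minimal sketch (not the paper's implementation) of how low-dimensional latent vectors produce translation scores; the dimension k, the vocabulary sizes, and the random initialization are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    k = 50                                   # latent dimension (illustrative)
    num_target, num_source = 1000, 1000      # |E| target words, |F| source words

    P = rng.normal(scale=0.1, size=(num_target, k))   # latent vectors for target words e
    Q = rng.normal(scale=0.1, size=(num_source, k))   # latent vectors for source words f

    def score(e, f):
        """Dot product of latent vectors: how 'aligned' target e and source f are."""
        return float(P[e] @ Q[f])

    # BPR never tries to reconstruct 0/1 labels for every cell; it only requires that
    # an observed translation (e, f) outrank an unobserved pair (e, g):
    e, f, g = 3, 7, 42
    margin = score(e, f) - score(e, g)       # training pushes this margin to be positive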

Details
Specifically, given a set of source words F and a set of target words E, each pair of words (e, f) with e ∈ E and f ∈ F is a candidate translation with an associated score x^{MF}_{e,f} ∈ [0, 1], and we wish to generate the missing translations. That is, given any arbitrary pair (e, f), we wish to assign a value x^{MF}_{e,f} stating how likely it is that f translates to e. Using matrix factorization, we decompose the matrix X^{MF} of scores x^{MF}_{e,f} into two low-rank matrices:

X^{MF} ≈ P Q^T,   P ∈ R^{|E| × k},   Q ∈ R^{|F| × k},

so that each x^{MF}_{e,f} = p_e · q_f, where p_e and q_f are the k-dimensional latent vectors of e and f. We address the cold-start problem alluded to above by computing an auxiliary score x^{AUX}_{e,f} from other features associated with each word; here θ_e denotes the visual features extracted by AlexNet, and each α_i is an additional auxiliary feature, with all parameters of the expression learned from data. Although it would be possible to combine x^{MF}_{e,f} and x^{AUX}_{e,f} by addition, x^{MF}_{e,f} has high precision but low recall, and vice versa for x^{AUX}_{e,f}; we therefore back off to x^{AUX}_{e,f} only for words that have too few associated translations.
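The back-off can be pictured as a simple gate. A minimal sketch, assuming a per-word count of observed seed translations; the threshold min_obs and the helper names x_mf and x_aux are illustrative, not from the paper.

    def translation_score(e, f, n_observed, x_mf, x_aux, min_obs=5):
        """Back-off scheme: trust the matrix-factorization score when the word has
        enough observed seed translations, otherwise fall back to the auxiliary
        score built from monolingual and visual features."""
        if n_observed[e] >= min_obs:
            return x_mf(e, f)       # high precision, low recall
        return x_aux(e, f)          # covers cold-start words the seed dictionary misses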
Learning with Bayesian Personalized Ranking involves maximizing the difference between the values assigned to observed translations and those assigned to unobserved translations. That is, given a training set of triples (e, f, g) such that e translates to f but not to g, we want to maximize the difference

x_{e,f,g} = x_{e,f} − x_{e,g}.

Specifically, for the logistic sigmoid function σ, the parameter vector Θ, the triplet score differences x_{e,f,g}, and a regularization hyperparameter λ_Θ, the BPR objective is to maximize the sum of log triplet score differences subject to a penalty on Θ:

Σ_{(e,f,g)} ln σ(x_{e,f,g}) − λ_Θ ‖Θ‖².
The model is trained using stochastic gradient descent.
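A minimal sketch of one stochastic gradient step on this objective, assuming the factorization from the Details section; the learning rate and regularization constant are illustrative.

    import numpy as np

    def bpr_sgd_step(P, Q, e, f, g, lr=0.05, lam=0.01):
        """One SGD update for a training triple (e, f, g): e translates to f, not to g."""
        p_e, q_f, q_g = P[e].copy(), Q[f].copy(), Q[g].copy()
        x_efg = p_e @ q_f - p_e @ q_g            # triplet score difference
        coeff = 1.0 / (1.0 + np.exp(x_efg))      # d/dx log(sigmoid(x)) = 1 - sigmoid(x)
        # Gradient ascent on log sigmoid(x_efg) with an L2 penalty on the parameters.
        P[e] += lr * (coeff * (q_f - q_g) - lam * p_e)
        Q[f] += lr * (coeff * p_e - lam * q_f)
        Q[g] += lr * (-coeff * p_e - lam * q_g)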
DATASET AND EXPERIMENTS
We benchmark our method on two test sets:
- VULIC1000: 1000 nouns in Spanish, Italian, and Dutch, along with their one-to-one ground-truth word translations in English.
- CROWDTEST: a larger set of 27 languages from crowdsourced dictionaries.
For each language, we randomly pick up to 1000 words that have only one English translation in the crowdsourced dictionary to serve as the test set for that language. On average, there are 967 test source words, covering a variety of parts of speech, per language. Furthermore, we construct another bilingual translation dataset, WIKI, using interlingual links among Wikipedia pages to identify words in a third language that translate to words in both the source and target languages. Finally, we construct WIKI+CROWD, comprising translations from the source language to non-target languages, pivoted through Wikipedia pages. We test the efficacy of our algorithm in the following four studies, using the datasets we collected:
- x^{MF-W}_{e,f} Monolingual signals for translation: using WIKI.
- x^{MF-W+C}_{e,f} Monolingual signals for translation: using WIKI+CROWD.
- x^{AUX-WE}_{e,f} Bilingually informed word embeddings: the source and target embeddings are learned with the word2vec skip-gram model on tokenized Wikipedia pages, and the mapping between the two embedding spaces is learned using crowdsourced translations as seeds. The linear mapping minimizes the squared error after translation, min_W Σ_{(e,f)} ‖W x_f − x_e‖², where X_E and X_F are the target- and source-language embedding matrices and W is the transformation between the two spaces (a least-squares sketch of this mapping appears after this list).
- x^{AUX-VIS}_{e,f} Visual representation of words: we use a large multilingual corpus of labeled images to obtain visual representations of the words in our source and target languages. The corpus contains 100 images for up to 10k words in each of 100 foreign languages, plus images of each of their translations into English. For each word, we use 10 images from the dataset and extract a feature representation using a four-layer neural network W = (φ^{(1)}, φ^{(2)}, φ^{(3)}, φ^{(4)}), trained with an objective defined over source words x_f and target words x_e.
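As a sketch of the linear mapping in the bilingually informed word-embedding signal above, assuming the seed translations give row-aligned embedding matrices, the transformation can be fit in closed form by least squares; the function name and shapes are illustrative.

    import numpy as np

    def fit_linear_mapping(X_f, X_e):
        """Least-squares W mapping source embeddings X_f (n x d_f) onto the target
        embeddings X_e (n x d_e) of their seed translations, i.e. minimizing
        sum ||x_f W - x_e||^2 over seed pairs (row-vector convention)."""
        W, *_ = np.linalg.lstsq(X_f, X_e, rcond=None)
        return W                                  # shape (d_f, d_e)

    # After fitting, a source word is scored against target words by mapping its
    # embedding with W and comparing it to target embeddings (e.g., by cosine similarity).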
Finally, at test time we combine the two auxiliary signals, the word-embedding score and the image-based score, into a single auxiliary score, and map the resulting score to a translation probability.
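A rough sketch of what such a combination could look like, assuming the two auxiliary scores are interpolated with learned weights and the combined score is squashed with a logistic sigmoid; both the interpolation and the weight names are illustrative assumptions rather than the paper's exact formulation.

    import numpy as np

    def combined_aux(x_we, x_vis, w_we=0.5, w_vis=0.5):
        """Illustrative combination of the two auxiliary signals as a weighted sum
        (the weights here are placeholders, not the paper's learned values)."""
        return w_we * x_we + w_vis * x_vis

    def translation_probability(score):
        """Squash a real-valued translation score into a probability in [0, 1]."""
        return 1.0 / (1.0 + np.exp(-score))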
RESULTS AND DISCUSSION
Using the back-off scheme described above, we run the following six sets of experiments:
- BPR_W uses only x^{MF-W}_{e,f}.
- BPR_W+C uses x^{MF-W}_{e,f} with back-off to x^{MF-W+C}_{e,f}.
- BPR_LN uses only x^{AUX-WE}_{e,f} with the linear mapping described in III.
- BPR_NN uses only x^{AUX-WE}_{e,f} with the nonlinear mapping described in IV.
- BPR_WE uses x^{MF-W}_{e,f} and x^{MF-W+C}_{e,f}, plus x^{AUX-WE}_{e,f} with the nonlinear mapping.
- BPR_VIS uses x^{MF-W}_{e,f} and x^{MF-W+C}_{e,f}, and combines x^{AUX-WE}_{e,f} with x^{AUX-VIS}_{e,f} as described above.
We benchmark each of these methods against the state-of-the-art baseline model on the VULIC1000 dataset.
The baseline, the mutual nearest neighbor (MNN) method, uses mutual nearest neighbor pairs obtained from pseudo-bilingual corpora constructed from unannotated monolingual data in the source and target languages, and ranks candidate target words by their cosine similarity to the source word in the shared space. As the results show, our methods perform better across all translation pairs, in particular when multi-modal information is used for translation. We observe a similar trend on the CROWDTEST data.
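For concreteness, a minimal sketch of the baseline's ranking step, assuming source and target embeddings already live in the shared space; the function and variable names are illustrative.

    import numpy as np

    def rank_candidates(src_vec, tgt_matrix, tgt_words):
        """Rank target words by cosine similarity to a source word in the
        shared bilingual embedding space."""
        src = src_vec / np.linalg.norm(src_vec)
        tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
        sims = tgt @ src                          # cosine similarity to each target word
        order = np.argsort(-sims)                 # best candidates first
        return [(tgt_words[i], float(sims[i])) for i in order]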
Finally, we found that a seed lexicon of 5k entries is sufficient to achieve optimal performance across languages, although it is unclear why this is the case. In conclusion, in this project we proposed a novel framework for combining multi-modal data from diverse sources, each of which may be sparse or noisy. The framework is based on matrix completion and uses BPR to learn translations. Our method is effective across different languages, and its accuracy improves as the volume of data increases. Moreover, our framework can trivially incorporate additional signals.