Paper Review

“Aspect-augmented Adversarial Networks for Domain Adaptation” by Yuan Zhang, Regina Barzilay and Tommi Jaakkola (TACL 2017)

Overview

The goal of this paper is to apply adversarial training to domain transfer. More specifically, the paper addresses a somewhat restricted situation where we have a single, shared task we want to solve on two different domains, with the additional constraint that only one domain has classification labels that can be used for supervision. For instance, we might train a classifier on the brain cancer domain but still want it to work just as well on the asthma domain, or train a classifier on the restaurant review domain but still have it work on the book review domain. The key to solving this is to give our classifier model the ability to avoid using domain-specific clues in its predictions. The authors use an adversarial loss to achieve this goal. However, training models with an adversarial loss is not easy, because the adversarial objective can run counter to the standard supervision objective, producing destructive gradients. To address this issue, the authors introduce several architectural modifications that balance the two objectives.

Model

A) Document Classification

Let’s start with a brief overview of a document classification model. A document classification model takes some document, such as a pathology report, and through one or more mapping functions turns it into an encoded vector (a.k.a. an encoding, embedding, or representation). A classifier then takes this output and tries to decide which class the vector belongs to. Figure 1 shows a snippet of an example input document used in this paper.
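As a minimal illustration of this pipeline (a toy sketch, not the paper’s architecture — the vocabulary, embeddings, and classifier weights below are made-up stand-ins), a document classifier maps tokens to vectors, pools them into a document encoding, and scores classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and parameters; in a real model these are learned.
VOCAB = {"invasive": 0, "ductal": 1, "carcinoma": 2, "benign": 3, "tissue": 4}
EMB_DIM, N_CLASSES = 8, 2
word_emb = rng.normal(size=(len(VOCAB), EMB_DIM))
W_cls = rng.normal(size=(EMB_DIM, N_CLASSES))

def encode(document):
    """Map a document (list of tokens) to one vector by mean-pooling embeddings."""
    ids = [VOCAB[w] for w in document if w in VOCAB]
    return word_emb[ids].mean(axis=0)

def classify(document):
    """Score the document's encoding with a linear layer + softmax."""
    logits = encode(document) @ W_cls
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = classify(["invasive", "ductal", "carcinoma"])
print(probs)  # two class probabilities summing to 1
```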

B) Domain Adaptation

What if we don’t have labels for a domain? We can perform domain adaptation. A standard approach in deep learning is to learn domain-independent representations of the data with a shared auto-encoder and then train a classifier on top of them.

The paper proposes a method using adversarial loss to achieve domain adaptation without resorting to a two-stage pipeline involving an auto-encoder.

C) Details of the Model

Figure 2 illustrates the overall model and its components:

• Transformation Layer for Domain Invariance
• Semi-supervised Sentence Level Attention for Aspect Awareness
• Word Level Reconstruction Loss

• Semi-supervised Sentence Level Attention for Aspect Awareness

In the pathology dataset used in this paper, a single document contains multiple domains; that is, each domain is an ‘aspect’ of the text of the pathology report.

For example, a single report can contain descriptions about both Invasive Ductal Carcinoma (IDC) and Atypical Lobular Hyperplasia (ALH). We need a way to pull apart these aspects when we encode the documents, otherwise the label predictor will try to draw from information about both aspects, when only one aspect is ever relevant to the classification at hand.

The proposed solution is to use a sentence relevance module, shown in green as the middle part of the ‘Document encoder’ in Figure 2.

The authors heuristically generated a relevance score for every sentence in every document w.r.t. each aspect. The score was set to 1 if the sentence included one or more of that aspect’s keywords, and 0 otherwise. The keywords themselves were chosen by human medical experts.
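This heuristic labeling rule is simple enough to sketch directly (the keyword sets below are hypothetical examples, not the experts’ actual lists):

```python
# Hypothetical aspect keywords; the paper's were chosen by medical experts.
ASPECT_KEYWORDS = {
    "IDC": {"invasive", "ductal", "carcinoma"},
    "ALH": {"atypical", "lobular", "hyperplasia"},
}

def relevance_labels(sentence_tokens):
    """Label a sentence 1 for an aspect if it contains any of that
    aspect's keywords, else 0 (the paper's heuristic rule)."""
    words = set(sentence_tokens)
    return {aspect: int(bool(words & kws))
            for aspect, kws in ASPECT_KEYWORDS.items()}

labels = relevance_labels(["atypical", "lobular", "hyperplasia", "noted"])
print(labels)  # {'IDC': 0, 'ALH': 1}
```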

These relevance labels were used as targets for the relevance score regression submodule of the document encoder, with a loss defined as:
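The equation itself is missing from this post; given the description (heuristic 0/1 targets $r_{l,i}$ and predicted scores $\hat{r}_{l,i}$, each a vector over the aspects, for sentence $i$ of document $l$), a squared-error regression loss of the following form fits the surrounding text:

$$\mathcal{L}^{rel} = \sum_{l}\sum_{i}\left\|\hat{r}_{l,i} - r_{l,i}\right\|_{2}^{2}$$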

If there are $n$ aspects, the relevance score predictor outputs $n$ scores per sentence, one for each aspect. To produce a document embedding for the $l^{th}$ document in aspect $a$, whose sentences are indexed by $i$, we sum the sentence embeddings weighted by their relevance scores $\hat{r}^{a}_{l,i}$:
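The weighted sum is straightforward; as a numpy sketch (toy embeddings and scores, assumed shapes noted in the docstring):

```python
import numpy as np

def document_embedding(sentence_embs, relevance_scores):
    """Relevance-weighted sum of sentence embeddings.
    sentence_embs: (num_sentences, dim); relevance_scores: (num_sentences,),
    the predicted scores r-hat for one aspect."""
    sentence_embs = np.asarray(sentence_embs, dtype=float)
    relevance_scores = np.asarray(relevance_scores, dtype=float)
    return (relevance_scores[:, None] * sentence_embs).sum(axis=0)

sents = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([0.9, 0.1, 0.0])  # only the first sentence is very relevant
print(document_embedding(sents, scores))  # → [0.9 0.1]
```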

• Transformation Layer for Domain Invariance

The relevance-weighted document embedding is passed through a transformation layer, whose purpose is to further erase any domain-specific information. The transformation is defined by:

with an additional strong regularization term to discourage significant deviation from the identity. This regularizer helps prevent the adversarial gradient from wiping out the document signal:
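A minimal sketch of this idea, assuming a linear transformation layer and a Frobenius-norm penalty on its deviation from the identity map (the paper’s exact parameterization may differ):

```python
import numpy as np

def transform(x, W, b):
    """Apply the (assumed linear) transformation layer."""
    return W @ x + b

def identity_regularizer(W, b, lam_t):
    """Penalize deviation of the transformation from the identity map:
    lam_t * (||W - I||_F^2 + ||b||^2)."""
    I = np.eye(W.shape[0])
    return lam_t * (np.sum((W - I) ** 2) + np.sum(b ** 2))

W = np.eye(3) + 0.01 * np.ones((3, 3))  # close to identity → small penalty
b = np.zeros(3)
val = identity_regularizer(W, b, lam_t=10.0)
print(val)  # ≈ 0.009
```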

The effectiveness of the regularizer is shown in Table 7:

$\lambda^{t}=\infty$ corresponds to removing the transformation layer (it is pinned to the identity), while $\lambda^{t}=0$ corresponds to a transformation layer without any regularization. As we can see, a suitable intermediate regularization weight yields the best performance, again highlighting the importance of balancing against the adversarial objective.

• Word Level Reconstruction Loss

The ideal training scenario is that the document embeddings stay as informative as possible, and only the aspect specific information that can provide hints to the aspect adversary is erased.

However, the model could easily fall into the trap of accomplishing the aspect adversarial objective by simply making the embeddings less informative overall, for instance by erasing features.

The authors propose to address this problem with a word-level auto-encoder, shown in Figure 3. The reconstruction objective is stated as follows:

where $h_{i,j}$ is the convolution output with $x_{i,j}$ at the center.
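The objective itself is missing from this post; based on the surrounding description, a standard auto-encoder form consistent with it would be (with $g$ a small learned decoder, $N$ the number of reconstructed words, and $\hat{x}_{i,j}$ the reconstruction of word embedding $x_{i,j}$ from its convolution output):

$$\mathcal{L}^{recon} = \frac{1}{N}\sum_{i,j}\left\|\hat{x}_{i,j} - x_{i,j}\right\|_{2}^{2}, \qquad \hat{x}_{i,j} = g\!\left(h_{i,j}\right)$$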

Table 6 shows that this word level reconstruction loss indeed improves performance.

In Figure 4, they also show that their assumption about the destructive effect of adversarial loss was correct. The first matrix shows that without adversarial loss, the document embeddings from the two aspects (top half and bottom half), are relatively easily distinguishable. When we add adversarial loss, we can see, in the second matrix, that the two aspects are now indistinguishable. Unfortunately, the embeddings have also become extremely sparse. Finally, in the third matrix, adding the reconstruction loss makes the embeddings denser again, while still making the two aspects hard to distinguish.

• Label classifier

The label classifier minimizes a standard cross entropy loss over classes $k$ and documents $l$. Since labels exist only in the source aspect, the label classifier is defined only on the source data. $\hat{p}_{l;k}$ indicates the predicted probability of document $l$ belonging to class $k$, and $[y^{s}_{l;1}...y^{s}_{l;m}]$ is a one-hot vector indicating which class document $l$ belongs to in the source aspect:
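The equation is omitted here; using the symbols just defined, the standard cross-entropy form is:

$$\mathcal{L}^{label} = -\sum_{l}\sum_{k=1}^{m} y^{s}_{l;k}\,\log \hat{p}_{l;k}$$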

• Domain classifier

The domain classifier attempts to figure out which aspect/domain a document belongs to. It minimizes the cross entropy loss against the one-hot aspect label $y^{a}_{k}$. $\hat{q}(x^{tr,a}_{l})$ is the aspect probability predicted from the aspect-relevance-weighted encoding of the input.
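The equation is again omitted; using the symbols above, the standard cross-entropy form would be:

$$\mathcal{L}^{dom} = -\sum_{l}\sum_{k} y^{a}_{k}\,\log \hat{q}_{k}\!\left(x^{tr,a}_{l}\right)$$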

• End-to-End training

All objectives are jointly optimized, end-to-end.

It should be pointed out that the domain adversary network itself always minimizes $L^{dom}$ w.r.t. its own parameters. The $-\rho L^{dom}$ term in (8) reflects the fact that gradients originating from the domain adversary are reversed in sign before being back-propagated into the encoder, encouraging the learning of domain-invariant features.
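This sign flip is the classic gradient-reversal trick, which can be sketched as a layer that is the identity in the forward pass and scales the incoming gradient by $-\rho$ in the backward pass (a toy numeric demonstration, not the paper’s implementation):

```python
import numpy as np

RHO = 0.5  # weight of the adversarial term in the joint objective

def grl_forward(x):
    """Gradient reversal layer: identity in the forward pass."""
    return x

def grl_backward(grad_from_domain_classifier, rho=RHO):
    """In the backward pass, the domain classifier's gradient is scaled
    by -rho before flowing into the encoder, so the encoder is pushed to
    *increase* the domain loss, i.e. to become domain-invariant."""
    return -rho * grad_from_domain_classifier

g = np.array([0.2, -0.4])  # gradient of L_dom w.r.t. the encoder output
print(grl_backward(g))     # → [-0.1  0.2]
```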

Results

The results, shown in Table 4, demonstrate that the combination of semi-supervised aspect separation and aspect adversarial loss was effective in producing domain-independent document representations.

• Effect of Relevance Scoring and Adversarial Loss

Ours-NR is identical to the full model, except that the relevance scoring module was omitted. We see that without the relevance scoring module making aspect-differentiation possible, the performance of the model drops dramatically. The results of Ours-NA, which is the full model with the adversarial loss module removed, show that the adversarial loss is indeed effective, once the relevance scoring module makes aspect-differentiation possible. Together, the combination of these two modules makes the full model’s performance quite close to In-Domain, which is the model trained with supervision in both domains.

Conclusion

In this post, we saw how an adversarial objective can be used to learn domain-independent representations of the input. We also saw that it can be beneficial to consider the balance between a supervision objective and an adversarial objective when designing such models.