Here are some quick notes on EMNLP 2021 papers, focusing on the evaluation of text generation tasks and on few-shot learning for generation and classification tasks.



  1. Visually Grounded Reasoning across Languages and Cultures (best long paper)
    • [video] [paper] [code]
    • A 5K evaluation dataset for cross-lingual vision and language transfer learning.
    • Task: predict whether a caption is true or false for a pair of images.
  2. Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
    • [video] [paper] [code]
    • Categorize NLG tasks based on information change from input (\(X\)) to output (\(Y\)): (1) compression (\(X > Y\)), (2) transduction (\(X = Y\)), (3) creation (\(X < Y\)).
    • Evaluate by measuring information alignment (between input \(x\), output \(y\), reference \(r\), and context \(c\)) for the different NLG tasks (see the sketch after this list).
  3. Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer
    • [video] [paper] [code]
    • A structured review of style transfer evaluation: Style, Meaning, Fluency.
    • Suggest a set of automatic metrics for style transfer that empirically align well with human judgments across 4 different languages.
  4. The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
    • [poster] [paper]
    • Reveal the pitfalls of crowdsourced evaluation of open-ended text generation on AMT.
    • Takeaway: be very careful to select qualified AMT workers for your evaluation task.
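
A minimal sketch of the embedding-matching flavor of information alignment from paper 2 above, using a toy per-token encoder in place of a real contextual model (`encode_tokens` and `align` are illustrative names, not the paper's API):

```python
import numpy as np

def encode_tokens(tokens, dim=64):
    """Toy per-token encoder: one pseudo-random unit vector per token type.
    A real system would use a contextual encoder such as BERT here."""
    vecs = []
    for tok in tokens:
        rng = np.random.default_rng(abs(hash(tok)) % (2**32))
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def align(a_tokens, b_tokens):
    """Mean over tokens of `a` of their max cosine similarity to tokens of `b`:
    roughly, how well the information in `a` is supported by `b`."""
    A, B = encode_tokens(a_tokens), encode_tokens(b_tokens)
    return float((A @ B.T).max(axis=1).mean())   # rows are unit vectors

# Compression task (X > Y), e.g. summarization:
x = "the cat sat on the mat and then slept all afternoon".split()
y = "the cat slept".split()
print("consistency (y -> x):", round(align(y, x), 3))  # is y grounded in x?
print("relevance-ish (x -> y):", round(align(x, y), 3))
```

The paper combines such alignment scores differently per task (its relevance metric for summarization also involves the reference \(r\)), and it additionally proposes discriminative and regression-based alignment estimators.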

NLG (Generation in Low-data Settings)


  1. When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute (outstanding paper)
    • [video] [paper] [code]
    • Attention is NOT all we need; combine fast recurrence with attention: attention helps recurrence avoid information and gradient propagation issues, while recurrence lets attention drop multi-head attention and relative position embeddings.
    • Parallelize the computation (linear projection + attention) of the state vector \(c_t\), forget gate \(f_t\), and reset gate \(r_t\), then recurrently compute the hidden state \(h_t\) (see the sketch after this list).
    • Much faster training, comparable test perplexity, and a similar number of parameters compared with Transformer-XL.
  2. Few-Shot Text Generation with Natural Language Instructions
    • [video] [paper] [code]
    • Provide a large pretrained LM with manually designed task descriptions (prompts), and fine-tune the model on few-shot summarization tasks.
  3. Smelting Gold and Silver for Improved Multilingual AMR-to-Text Generation
    • [video] [paper] [code]
    • Data augmentation: build neural models to annotate unlabeled data in both directions: (text -> AMR) and (AMR -> text).
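
A minimal sketch of the simplified SRU-style recurrence behind paper 1 above. This is an illustration, not the paper's implementation: SRU++'s attention sub-layer before the projection, and the cell's per-dimension scaling terms, are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W):
    """X: (T, d) inputs; W: (d, 3d) projection for candidate/forget/reset."""
    T, d = X.shape
    U = X @ W                                   # parallel across all timesteps
    u, f, r = U[:, :d], sigmoid(U[:, d:2*d]), sigmoid(U[:, 2*d:])
    c = np.zeros(d)
    H = np.empty_like(X)
    for t in range(T):                          # cheap elementwise recurrence
        c = f[t] * c + (1.0 - f[t]) * u[t]      # state vector c_t
        H[t] = r[t] * c + (1.0 - r[t]) * X[t]   # highway connection -> h_t
    return H

rng = np.random.default_rng(0)
print(sru_layer(rng.standard_normal((10, 16)), 0.1 * rng.standard_normal((16, 48))).shape)
```

The expensive matrix products run in parallel over all timesteps; only the cheap elementwise update of \(c_t\) is sequential, which is where the speedup over standard recurrent cells comes from.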


  1. MindCraft: Theory of Mind Modeling for Situated Dialogue in Collaborative Tasks (outstanding paper)
    • [video] [paper] [code]
    • A new dataset to study human collaborative behaviors in situated language communication.
  2. GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation
    • [video] [paper] [code]
    • Use data augmentation to improve out-of-scope detection in dialogues.
    • Find candidate out-of-scope utterances in related external datasets, swap them into existing dialogues, and keep the candidates that receive a majority of votes from an ensemble of out-of-scope detectors (see the sketch after this list).
  3. ConvFiT: Conversational Fine-Tuning of Pretrained Language Models
    • [video] [paper]
    • A new pretrained conversational LM, ConvFiT, which shows SOTA performance on the few-shot intent detection task.
  4. Self-training Improves Pre-training for Few-shot Learning in Task-oriented Dialog Systems
    • [video] [paper]
    • Leverage self-training + data augmentation (MLM) for 4 tasks in the few-shot learning setting: intent classification, dialog state tracking, dialog act prediction, and response selection.
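
A minimal sketch of the majority-vote selection step in paper 2 above. The detector interface is a hypothetical stand-in (a toy threshold scorer) for whatever out-of-scope (OOS) detectors the ensemble actually uses:

```python
from dataclasses import dataclass

@dataclass
class ThresholdDetector:
    """Toy OOS detector: flags an utterance whose score falls below a threshold."""
    threshold: float

    def is_out_of_scope(self, score: float) -> bool:
        return score < self.threshold

def select_pseudo_oos(candidates, detectors):
    """Keep candidate utterances (here paired with a precomputed score, for
    brevity) that a majority of the ensemble votes out-of-scope."""
    kept = []
    for utt, score in candidates:
        votes = sum(det.is_out_of_scope(score) for det in detectors)
        if votes > len(detectors) / 2:          # majority vote
            kept.append(utt)
    return kept

detectors = [ThresholdDetector(t) for t in (0.3, 0.4, 0.5)]
candidates = [("book me a flight", 0.9), ("sing me a song", 0.2)]
print(select_pseudo_oos(candidates, detectors))  # ['sing me a song']
```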


  1. SituatedQA: Incorporating Extra-Linguistic Contexts into QA (outstanding paper)
    • [video] [paper] [code]
    • A new open-retrieval QA dataset where systems must produce the correct answer to a question given the temporal or geographical context.
  2. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right
    • [video] [paper] [code]
    • For zero-shot QA, GPT-3 may generate multiple valid answers, but those answers may not be among the multiple-choice options; ranking options by string probability is therefore problematic.
    • Introduce domain conditional pointwise mutual information, which reweights each option according to a term that is proportional to its a priori likelihood within the context of the zero-shot task, to score the multiple-choice options (see the first sketch after this list).
  3. Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
    • [video] [paper]
    • Leverage question generation models to produce synthetic multi-lingual QA pairs.
  4. Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval
    • [video] [paper] [code]
    • In self-training, inputs come from the target domain and outputs are model predictions, which can lead to overfitting to the source domain.
    • In back-training, inputs are model predictions and outputs come from the target domain, which yields training data closer to the target domain (see the second sketch after this list).
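
A minimal sketch of domain-conditional PMI scoring from paper 2 above. `logprob(context, continuation)` is an assumed interface that returns a causal LM's summed token log-probability of the continuation given the context (not shown here):

```python
def pmi_dc_scores(question, options, domain_premise, logprob):
    # PMI_DC(y) = log P(y | question) - log P(y | domain premise):
    # dividing by the domain-conditional prior cancels out each option's
    # generic surface-form likelihood, mitigating surface form competition.
    return {y: logprob(question, y) - logprob(domain_premise, y)
            for y in options}

# Usage with a real LM:
#   scores = pmi_dc_scores(q, choices, "Answer:", lm_logprob)
#   prediction = max(scores, key=scores.get)
```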
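And a second minimal sketch, for the direction flip in paper 4. `question_generator` and `retrieve_passage` are hypothetical stand-ins; in particular, the mechanism for pairing real target-domain questions with inputs is an assumption here:

```python
# Self-training: real target-domain inputs, *predicted* outputs, so errors of
# the source-trained model leak into the supervision signal.
def self_training_pairs(question_generator, target_passages):
    return [(p, question_generator(p)) for p in target_passages]

# Back-training: *predicted* (or retrieved) inputs, real target-domain
# outputs, so the supervision signal itself comes from the target domain.
def back_training_pairs(retrieve_passage, target_questions):
    return [(retrieve_passage(q), q) for q in target_questions]
```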

NLU (Sample Selection and Data Augmentation)


  1. Dynamic Knowledge Distillation for Pre-trained Language Models
    • [video] [paper] [code]
    • Dynamic Teacher Adoption: compute the prediction uncertainty (entropy) of the student model across all training data, divide the training data into a high-uncertainty group and a low-uncertainty group, then assign the large teacher model to the high-uncertainty group and the small teacher model to the low-uncertainty group (see the sketch after this list).
    • Dynamic Data Selection: select informative instances in each training batch according to the prediction uncertainty (entropy) of the student model.
    • Dynamic Objective Adjustment: adjust the weight of the alignment objective (matching student & teacher hidden states) with the prediction uncertainty (entropy) of the student model.
  2. HypMix: Hyperbolic Interpolative Data Augmentation
    • [poster] [paper] [code]
    • A new interpolative mixup data augmentation method: augment unlabeled texts via back-translation, and predict soft labels for unlabeled and augmented data in hyperbolic space.
  3. Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data
    • [video] [paper]
    • Compare different data augmentation methods in text classification and sequence labeling: back-translation, random substitutions, masked language model.
    • Performance: masked language model > random substitutions > back-translation.
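
A minimal sketch of the uncertainty signal driving paper 1's three dynamic components; the model interfaces and threshold below are hypothetical stand-ins:

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Entropy of the student's predictive distribution = its uncertainty."""
    return float(-(probs * np.log(probs + eps)).sum())

def pick_teacher(student_probs, big_teacher, small_teacher, threshold=0.5):
    """Dynamic teacher adoption: route high-uncertainty examples to the
    large teacher and the rest to the small (cheaper) teacher."""
    return big_teacher if entropy(student_probs) > threshold else small_teacher

def select_informative(batch_probs, k):
    """Dynamic data selection: keep the k most uncertain instances in a batch."""
    ents = [entropy(p) for p in batch_probs]
    return sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:k]
```

The same entropy value can also scale the weight of the hidden-state alignment objective (dynamic objective adjustment).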


  1. Active Learning by Acquiring Contrastive Examples
    • [video] [paper] [code]
    • Uncertainty: the predictive uncertainty, e.g. least confident data.
    • Diversity: the heterogeneity in feature space, e.g. clustering.
    • Contrastive examples: datapoints that are close in the model feature space (\(K\)-nearest neighbours), but for which the model produces different predictive likelihoods (see the first sketch after this list).
  2. Certified Robustness to Programmable Transformations in LSTMs
    • [video] [paper] [code]
    • Develop a certified defense against arbitrary programmable string transformations that applies to recurrent architectures such as LSTMs.
    • Certify robustness by proving that a network’s prediction is the same no matter how a given input is perturbed.
  3. Efficient Contrastive Learning via Novel Data Augmentation and Curriculum Learning
    • [video] [paper] [code]
    • Data augmentation: cutoff (on hidden states) + PCA jittering is effective for robustness to sentence-level noise (see the second sketch after this list).
    • Curriculum learning (easy first, difficult later): gradually increasing the noise level of the input embeddings leads to faster convergence.
  4. STraTA: Self-Training with Task Augmentation for Better Few-shot Learning
    • [video] [paper]
    • Task augmentation: train an NLI data generator to produce synthetic in-domain NLI training examples.
    • Self-training: initialize the teacher and student models with a strong auxiliary-task base model, then fine-tune on the labeled target-task data. At each iteration, use the teacher model to generate pseudo-labels for unlabeled in-domain examples and augment the original labeled target-task data with them.
    • Experiments show that using a strong base model and training on a broad distribution of pseudo-labeled data are the key factors for successful self-training in NLP.
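
A minimal sketch of paper 1's acquisition score, assuming features and class probabilities have already been extracted from the current model; the reference pool (e.g., the labeled set) and hyperparameters are stand-ins:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two predictive distributions."""
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum())

def contrastive_scores(feats_unlab, probs_unlab, feats_ref, probs_ref, k=10):
    """Score each unlabeled point by the mean KL divergence between its
    neighbours' predictions and its own; high score = contrastive example."""
    scores = []
    for x, p in zip(feats_unlab, probs_unlab):
        dists = np.linalg.norm(feats_ref - x, axis=1)   # model feature space
        nn = np.argsort(dists)[:k]                      # k nearest neighbours
        scores.append(np.mean([kl(probs_ref[i], p) for i in nn]))
    return np.array(scores)  # acquire the top-scoring points for labeling
```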
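And a second minimal sketch, for paper 3's two augmentations; the PCA-jittering details (noise along the principal components, as in image-style PCA jitter) are an assumption about the exact formulation:

```python
import numpy as np

def cutoff(hidden, frac=0.1, rng=None):
    """Zero out a random contiguous span of the hidden states (shape: T x d)."""
    rng = np.random.default_rng() if rng is None else rng
    h = hidden.copy()
    span = max(1, int(frac * len(h)))
    start = rng.integers(0, len(h) - span + 1)
    h[start:start + span] = 0.0
    return h

def pca_jitter(hidden, scale=0.01, rng=None):
    """Add noise along the principal components of the hidden states."""
    rng = np.random.default_rng() if rng is None else rng
    centered = hidden - hidden.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    alphas = rng.normal(0.0, scale, size=len(s))        # per-component noise
    return hidden + (alphas * s) @ vt                   # same offset per token

h = np.random.default_rng(0).standard_normal((12, 8))
print(cutoff(h).shape, pca_jitter(h).shape)  # (12, 8) (12, 8)
```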