Selected Long Papers / Journal Articles

MARBLE: Music Audio Representation Benchmark for universaL Evaluation

Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels: acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-source pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues for the datasets. Results suggest that recently proposed large-scale pre-trained musical language models perform best on most tasks, with room for further improvement.
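
A minimal sketch of the probing setup such a benchmark implies: freeze a pre-trained encoder and train a shallow classifier on its representations, one task at a time. Random features stand in for real music embeddings below, and all names are illustrative rather than MARBLE's actual API:

```python
# Sketch of representation probing: a frozen encoder's embeddings are
# evaluated by a shallow (linear) classifier per downstream task.
# Random features replace real music-model embeddings here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clips, dim, n_genres = 1000, 768, 10          # e.g. a genre classification task
embeddings = rng.normal(size=(n_clips, dim))    # stand-in for frozen encoder output
labels = rng.integers(0, n_genres, size=n_clips)

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # linear probe only
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")    # ~chance on random data
```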

Mind's Eye: Grounded Language Model Reasoning Through Simulation

Ruibo Liu, Jason Wei, Shane Shixiang Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, and Andrew M. Dai
The Eleventh International Conference on Learning Representations (ICLR 2023)

Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world; their failure to relate language to the physical world leads them to misrepresent knowledge and make obvious mistakes in reasoning. We present Mind's Eye, a paradigm that grounds language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then include the simulation results as part of the input, enabling the language model to reason over them. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.

Review Score: 8/6/6 of 10
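
A rough sketch of the prompting pattern the abstract describes: simulate first, then prepend the outcome to the LM input. The simulator and its output string are stubs here, not the paper's MuJoCo pipeline:

```python
# Sketch of grounded prompting: a physics simulation runs first and its
# outcome is injected as context before the question. `simulate` is a stub;
# the paper renders the scene in DeepMind's MuJoCo and reads off the dynamics.
def simulate(question: str) -> str:
    return "Simulation result: both balls hit the ground at the same time."

def grounded_prompt(question: str) -> str:
    # Simulation evidence precedes the question, so the LM reasons over
    # observed outcomes rather than text priors alone.
    return f"{simulate(question)}\nQuestion: {question}\nAnswer:"

print(grounded_prompt(
    "If a heavy ball and a light ball are dropped together, which lands first?"))
```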

Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X Liu, and Soroush Vosoughi
Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)

We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance on three value-alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and facilitate interactive error correction. Extensive human evaluations further confirm its effectiveness.

Knowledge Infused Decoding

Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, Ahmed Hassan Awadallah
The Tenth International Conference on Learning Representations (ICLR 2022)

Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from their pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. We present Knowledge Infused Decoding (KID), a novel decoding algorithm for generative LMs that dynamically infuses external knowledge into each step of LM decoding. Specifically, we maintain a local knowledge memory based on the current context, interacting with a dynamically created external knowledge trie, and continuously update the local memory as a knowledge-aware constraint that guides decoding via reinforcement learning.

Review Score: 8/6/6/5 of 10
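
A minimal sketch of the external knowledge trie idea: token sequences from retrieved knowledge are stored in a trie, and the children reachable from the current prefix act as knowledge-consistent next-token candidates. Tokenization and retrieval are stubbed; this is an illustration, not the paper's implementation:

```python
# Sketch of a knowledge trie for constrained decoding: retrieved knowledge
# is tokenized into the trie, and at each decoding step the trie proposes
# continuations consistent with that knowledge.
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)   # children created lazily on insert

def insert(trie, tokens):
    for tok in tokens:
        trie = trie[tok]

def continuations(trie, prefix):
    # Walk the trie along the generated prefix; the children reachable
    # from here are the knowledge-consistent next tokens.
    for tok in prefix:
        if tok not in trie:
            return set()
        trie = trie[tok]
    return set(trie.keys())

trie = make_trie()
insert(trie, ["the", "eiffel", "tower", "is", "in", "paris"])
insert(trie, ["the", "eiffel", "tower", "opened", "in", "1889"])
print(continuations(trie, ["the", "eiffel", "tower"]))  # {'is', 'opened'}
```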

Non-Parallel Text Style Transfer with Self-Parallel Supervision

Ruibo Liu, Chongyang Gao, Chenyan Jia, Guangxuan Xu, Soroush Vosoughi
The Tenth International Conference on Learning Representations (ICLR 2022)

The performance of existing text style transfer models is severely limited by the non-parallel datasets on which they are trained. In non-parallel datasets, no direct mapping exists between sentences of the source and target style; style transfer models thus receive only weak supervision from the target sentences during training, which often leads them to discard too much style-independent information or to fail at transferring the style altogether. In this work, we propose LaMer, a novel text style transfer framework based on large-scale language models. LaMer first mines roughly parallel expressions in the non-parallel datasets using scene graphs, and then employs MLE training, followed by imitation learning refinement, to leverage the intrinsic parallelism within the data.

Review Score: 8/8/6/6/3 of 10

Quantifying and Alleviating Political Bias in Language Models

Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, and Soroush Vosoughi
Artificial Intelligence (Invited Journal Article)

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we first describe metrics for measuring political bias in GPT-2 generation and discuss several interesting takeaways: 1) the generation of the vanilla GPT-2 model is mostly liberal-leaning, 2) such political bias depends on the sensitive attributes mentioned in the context, and 3) when the generation is primed with an explicit political identifier, the extent of political bias is imbalanced between liberal and conservative prompts. We then propose a reinforcement learning (RL) framework for mitigating such political biases in generated text.

Impact Factor: 14.05
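
A toy sketch of the kind of bias measurement the abstract describes: generate continuations from attribute-paired prompts and compare mean political-leaning scores. The generator and stance scorer below are stubs standing in for GPT-2 and a trained classifier:

```python
# Toy sketch of bias probing: compare the mean political leaning of
# continuations generated from prompts that differ in one sensitive
# attribute. `generate` and `leaning` are stubs, not the paper's models.
def generate(prompt: str) -> str:
    return prompt + " ... (model continuation)"      # stand-in for GPT-2

def leaning(text: str) -> float:
    # Stub scorer on a [-1, 1] scale (-1 liberal, +1 conservative);
    # a real stance classifier over the generated text goes here.
    return -0.4 if "New York" in text else 0.2

def bias_gap(prompts_a, prompts_b):
    # Imbalance between two attribute groups = difference of mean leanings.
    mean = lambda ps: sum(leaning(generate(p)) for p in ps) / len(ps)
    return mean(prompts_a) - mean(prompts_b)

print(bias_gap(["She grew up in New York."], ["She grew up in rural Texas."]))
```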

Language Model Augmented Relevance Score

Ruibo Liu, Jason Wei, and Soroush Vosoughi
ACL 2021 (Oral Presentation)

We propose Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation. MARS leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references, which are then used as additional references to score generated text. Compared with seven existing metrics in three common NLG tasks, MARS not only achieves higher correlation with human reference judgements, but also differentiates well-formed candidates from adversarial samples to a larger degree.

Review Score: 4/4/4 of 5

Mitigating Political Bias in Language Models through Reinforced Calibration

Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, and Soroush Vosoughi
AAAI 2021 Outstanding Paper Award (3/9034)

Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from word embeddings or a classifier, our RL framework guides debiased generation without having access to the training data or requiring the model to be retrained. In empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.

Review Score: 9/9/9/7 of 10 (Top 0.1%), and 9 from Meta-Review
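
A toy sketch of the RL idea, assuming a REINFORCE-style update on a tiny categorical policy where the reward penalizes classifier-judged polarity; the paper instead fine-tunes GPT-2 with rewards from word embeddings or a classifier:

```python
# Sketch of reinforced calibration on a toy policy: REINFORCE pushes
# probability mass toward low-polarity tokens. Illustrative only; the
# paper applies this kind of reward to guide GPT-2 generation.
import torch

vocab = ["neutral_word", "liberal_word", "conservative_word"]
polarity = torch.tensor([0.0, 1.0, 1.0])      # stub |leaning| per token
logits = torch.zeros(3, requires_grad=True)   # toy unconditional policy
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tok = dist.sample()
    reward = 1.0 - polarity[tok]              # debias: reward neutral choices
    loss = -dist.log_prob(tok) * reward       # REINFORCE gradient estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits.detach(), dim=0))  # mass shifts to neutral_word
```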

Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation

Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng Ma, Lili Wang, and Soroush Vosoughi
EMNLP 2020 (Oral Presentation)

We present a powerful and easy-to-deploy text augmentation framework, Data Boost, which augments data through reinforcement learning guided conditional generation. We evaluate Data Boost on three diverse text classification tasks under five different classifier architectures. Results show that Data Boost can improve classifier performance, especially in low-resource data scenarios. For instance, Data Boost improves F1 for the three tasks by 8.7% on average when given only 10% of the whole data for training. We also compare Data Boost with six prior text augmentation methods. Through human evaluations (N=178), we confirm that Data Boost augmentation has comparable quality to the original data with respect to readability and class consistency.

Review Score: 4/4/4 of 5 (Top 5%)
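
A minimal sketch of generation-based augmentation, under the stated assumption that a class-conditioned generator is available (here a trivial stub, whereas Data Boost guides GPT-2 with RL):

```python
# Toy sketch of generation-based augmentation: sample class-conditioned text
# for the low-resource label and append it to the training set.
import random

def conditional_generate(label: str) -> str:
    # Stub generator; Data Boost uses an RL-guided conditional LM instead.
    templates = {"offense": "that was a rude thing to say",
                 "neutral": "the meeting starts at noon"}
    return f"{templates[label]} (sample {random.randint(0, 999)})"

def augment(dataset, label, n_new):
    return dataset + [(conditional_generate(label), label) for _ in range(n_new)]

data = [("you are awful", "offense"), ("see you tomorrow", "neutral")] * 5
print(len(augment(data, "offense", n_new=20)))  # minority class boosted, 30 total
```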

A Transformer-Based Framework for Neutralizing and Reversing the Political Polarity of News Articles

Ruibo Liu, Chenyan Jia, and Soroush Vosoughi
CSCW 2021

People often prefer to consume news that matches their political predispositions and to access like-minded news articles, which aggravates the polarized clusters known as “echo chambers”. To mitigate this phenomenon, we propose a computer-aided solution to help combat extreme political polarization. Specifically, we present a framework for reversing or neutralizing the political polarity of news headlines and articles. The framework leverages the attention mechanism of a Transformer-based language model to first identify polar sentences, and then flips their polarity to neutral or to the opposite side using a GAN. Tested on the same benchmark dataset, our framework achieves a 3%-10% improvement in the flipping/neutralizing success rate for headlines compared with the current state-of-the-art model. Adding to the prior literature, our framework not only flips the polarity of headlines but also extends the task of polarity flipping to full-length articles. Human evaluation results show that our model successfully neutralizes or reverses the polarity of news without reducing readability. We release a large annotated dataset that includes both news headlines and full-length articles with polarity labels and metadata for future research. Our framework has the potential to be used by social scientists, content creators, and content consumers in the real world.

Political Depolarization of News Articles Using Attribute-Aware Word Embeddings

Ruibo Liu, Lili Wang, Chenyan Jia, Soroush Vosoughi
ICWSM 2021

Political polarization in the US is on the rise. This polarization negatively affects the public sphere by contributing to the creation of ideological echo chambers. In this paper, we focus on addressing one of the factors that contribute to this polarization: polarized media. We introduce a framework for depolarizing news articles. Given an article on a certain topic with a particular ideological slant (e.g., liberal or conservative), the framework first detects polar language in the article and then generates a new article with the polar language replaced by neutral expressions. To detect polar words, we train a multi-attribute-aware word embedding model that is aware of ideology and topics on 360k full-length media articles. Then, for text generation, we propose a new algorithm called the Text Annealing Depolarization Algorithm (TADA). TADA retrieves neutral expressions from the word embedding model that not only decrease ideological polarity but also preserve the original argument of the text, while maintaining grammatical correctness.
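
A toy sketch of the retrieval step: for a detected polar word, pick the embedding-space nearest neighbour whose ideological polarity is low. The vectors and polarity scores below are invented stand-ins for the attribute-aware embeddings trained in the paper:

```python
# Sketch of depolarizing word substitution: keep meaning (high cosine
# similarity) while dropping ideological charge (low polarity score).
import numpy as np

emb = {"radical": np.array([0.9, 0.1]), "extreme": np.array([0.8, 0.2]),
       "notable": np.array([0.7, 0.7]), "significant": np.array([0.6, 0.8])}
polarity = {"radical": 0.9, "extreme": 0.8, "notable": 0.1, "significant": 0.1}

def neutral_substitute(word, max_polarity=0.3):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = [w for w in emb if w != word and polarity[w] <= max_polarity]
    # The closest low-polarity neighbour preserves the original argument.
    return max(candidates, key=lambda w: cos(emb[word], emb[w]))

print(neutral_substitute("radical"))  # a semantically close, low-polarity word
```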

Using impression data to improve models of online social influence

Rui Liu, Kevin T. Greene, Ruibo Liu, Mihovil Mandic, Benjamin A. Valentino, Soroush Vosoughi, V.S. Subrahmanian
Scientific Reports (Nature Portfolio)

Influence, the ability to change the beliefs and behaviors of others, is the main currency on social media. Extant studies of influence on social media, however, are limited by publicly available data that record expressions (active engagement of users with content, such as likes and comments), but neglect impressions (exposure to content, such as views) and lack “ground truth” measures of influence. To overcome these limitations, we implemented a social media simulation using an original, web-based micro-blogging platform. We propose three influence models, leveraging expressions and impressions to create a more complete picture of social influence. We demonstrate that impressions are much more important drivers of influence than expressions, and our models accurately identify the most influential accounts in our simulation. Impressions data also allow us to better understand important social media dynamics, including the emergence of small numbers of influential accounts and the formation of opinion echo chambers.

Impact Factor: 4.379

Multi-resolution Annotations for Emoji Prediction

Weicheng Ma, Ruibo Liu, Lili Wang, and Soroush Vosoughi
EMNLP 2020

Emojis can express various linguistic components, including emotions, sentiments, events, etc. Labels in existing emoji prediction datasets are all passage-based and usually assume a multi-class classification setting. However, in many cases a single emoji cannot fully cover the theme of a piece of text, so it is useful to infer which part of the text relates to each emoji. The lack of multi-label and aspect-level emoji prediction datasets is one of the bottlenecks for this task. This paper annotates an emoji prediction dataset with passage-level multi-class/multi-label annotations and aspect-level multi-class annotations. The annotations are generated heuristically, taking advantage of the self-attention mechanism in Transformer networks. We validate the annotations both automatically and manually to ensure their quality.
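
A minimal sketch of such an attention heuristic: read the self-attention from an emoji position to the passage tokens and label the most-attended span as that emoji's aspect. The attention weights below are synthetic, not from a trained Transformer:

```python
# Sketch of attention-based aspect annotation: the passage tokens that the
# emoji attends to most strongly are taken as its aspect span.
import numpy as np

tokens = ["great", "food", "but", "slow", "service"]
# Attention from the emoji position to each passage token (stub values).
attn_from_emoji = np.array([0.05, 0.05, 0.05, 0.45, 0.40])

aspect = tokens[int(np.argmax(attn_from_emoji))]
span = [t for t, a in zip(tokens, attn_from_emoji) if a > 0.2]
print(aspect, span)  # heuristically attributes the emoji to "slow service"
```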

Embedding Heterogeneous Network into Hyperbolic Space Without Meta-path

Lili Wang, Chongyang Gao, Chenghan Huang, Ruibo Liu, Weicheng Ma, and Soroush Vosoughi
AAAI 2021

Networks found in the real world are numerous and varied. A common type is the heterogeneous network, where the nodes (and edges) can be of different types. Accordingly, there have been efforts to learn representations of these heterogeneous networks in low-dimensional space. However, most existing heterogeneous network embedding methods suffer from two drawbacks: (1) the target space is usually Euclidean, whereas many recent works have shown that complex networks may have hyperbolic latent anatomy, which is non-Euclidean; and (2) these methods usually rely on meta-paths, which require domain-specific prior knowledge for meta-path selection; additionally, different downstream tasks on the same network might require different meta-paths in order to generate task-specific embeddings. In this paper, we propose a novel self-guided random walk method that does not require meta-paths for embedding heterogeneous networks into hyperbolic space. We conduct thorough experiments on the tasks of network reconstruction and link prediction on several public datasets, showing that our model outperforms a variety of well-known baselines across all tasks.
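
For reference, a short sketch of the target geometry: the distance function of the Poincaré ball model of hyperbolic space. This is standard background on the non-Euclidean target space, not the paper's embedding algorithm:

```python
# Distance in the Poincaré ball model of hyperbolic space:
# d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))),
# valid for points strictly inside the unit ball.
import numpy as np

def poincare_distance(u, v):
    sq = lambda x: np.dot(x, x)
    return np.arccosh(1 + 2 * sq(u - v) / ((1 - sq(u)) * (1 - sq(v))))

u, v = np.array([0.1, 0.2]), np.array([0.5, -0.3])
print(poincare_distance(u, v))
```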

Reconstructing Human Joint Motion with Computational Fabrics

Ruibo Liu, Qijia Shao, Siqi Wang, Christina Ru, Devin Balkcom, and Xia Zhou
UbiComp/IMWUT 2019 (Oral Presentation)

This work studies the use of everyday fabrics as a flexible and soft sensing medium to monitor joint angular motion accurately and reliably. Specifically, we focus on the use of conductive stretchable fabrics to sense skin deformation during joint motion and infer the joint rotation angle. We tackle the challenges of fabric sensing that arise from the inherent properties of elastic materials by leveraging two types of sensing fabric and characterizing their properties with models from materials science. We apply models from bio-mechanics to infer joint angles and propose dual strain sensing to enhance sensing robustness against user diversity and fabric position offsets. We fabricate prototypes using off-the-shelf fabrics and a microcontroller. Experiments with ten participants show a median angular error of 9.69° in tracking joint angles, and robust sensing across various users and activities.
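
A toy sketch of the dual strain sensing idea: fit a calibration mapping two fabric strain channels to the joint angle, then fuse both channels at inference. The data are synthetic and the linear model is a simplification of the paper's bio-mechanics-based inference:

```python
# Sketch of dual strain sensing: two strain channels are jointly calibrated
# against known joint angles, making the estimate robust to per-channel noise.
import numpy as np

rng = np.random.default_rng(1)
angles = rng.uniform(0, 120, size=200)                  # degrees, calibration sweep
strain = np.column_stack([0.010 * angles + rng.normal(0, 0.02, 200),
                          0.008 * angles + rng.normal(0, 0.02, 200)])

X = np.column_stack([strain, np.ones(len(angles))])     # two sensors + bias term
w, *_ = np.linalg.lstsq(X, angles, rcond=None)          # least-squares calibration

est = X @ w                                             # fused angle estimate
print(f"median error: {np.median(np.abs(est - angles)):.2f} deg")
```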

Short Papers / Workshops / Findings

Aligning Generative Language Models with Human Values

Ruibo Liu, Ge Zhang, Xinyu Feng, and Soroush Vosoughi
NAACL 2022 - Findings

This paper proposes SENSEI, a new reinforcement learning based method that embeds human value judgements into each step of language generation. SENSEI deploys an Actor-Critic framework, where the Critic is a reward distributor that simulates the reward assignment procedure of humans, while the Actor guides the generation in the direction of maximum reward. Compared with five existing methods on three human-value alignment datasets, SENSEI not only achieves higher alignment performance in both automatic and human evaluations, but also shows improved robustness and transfer learning on unseen human values.
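
A minimal sketch of the reward-distributor idea, under the simplifying assumption that a sequence-level preference score is spread uniformly over generation steps; the paper's Critic learns this assignment instead:

```python
# Toy sketch: turn one sequence-level human preference score into a per-step
# reward for the Actor. Uniform splitting preserves the total reward but is
# a simplification of SENSEI's learned Critic.
def distribute_reward(tokens, sequence_reward):
    per_step = sequence_reward / len(tokens)
    return [(tok, per_step) for tok in tokens]

print(distribute_reward(["I", "can", "help", "with", "that"], sequence_reward=1.0))
```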

Modulating Language Models with Emotions

Ruibo Liu, Jason Wei, Chenyan Jia, and Soroush Vosoughi
ACL 2021 - Findings

Generating context-aware language that embodies diverse emotions is an important step towards building empathetic NLP systems. In this paper, we propose a formulation of conditional layer normalization—a technique inspired by computer vision—that allows us to use large-scale language models for emotional response generation. In empirical and human evaluation, our models outperform prior baseline methods while maintaining diversity, fluency, and coherence, obtaining competitive performance even when using only 10% of the available training data.
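
The mechanism lends itself to a short sketch: a minimal conditional LayerNorm whose gain and bias come from learned per-emotion embeddings. Dimensions and class names are illustrative assumptions, not the paper's architecture:

```python
# Sketch of conditional layer normalization: the LayerNorm gain and bias are
# looked up from an emotion embedding, letting one LM modulate its hidden
# states per target emotion.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden, n_emotions):
        super().__init__()
        self.ln = nn.LayerNorm(hidden, elementwise_affine=False)
        self.gain = nn.Embedding(n_emotions, hidden)  # per-emotion scale
        self.bias = nn.Embedding(n_emotions, hidden)  # per-emotion shift
        nn.init.ones_(self.gain.weight)               # start as plain LayerNorm
        nn.init.zeros_(self.bias.weight)

    def forward(self, x, emotion_id):
        g = self.gain(emotion_id).unsqueeze(1)        # (batch, 1, hidden)
        b = self.bias(emotion_id).unsqueeze(1)
        return g * self.ln(x) + b

cln = ConditionalLayerNorm(hidden=16, n_emotions=6)
x = torch.randn(2, 5, 16)                             # (batch, seq, hidden)
print(cln(x, torch.tensor([0, 3])).shape)             # torch.Size([2, 5, 16])
```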

Enhanced Offensive Language Detection Through Data Augmentation

Ruibo Liu, Guangxuan Xu, and Soroush Vosoughi
ICWSM 2020 - Data Challenge

Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class makes up only 5% of the data). In this work, we present Dager (Data Augmenter), a generation-based data augmentation method that improves the performance of classification on imbalanced and low-resource data such as the offensive language dataset. We test Dager on four different classifiers (BERT, CNN, BiLSTM with attention, and Transformer), observing consistent improvements in detection, indicating that our method is effective and classifier-agnostic.

Emoji Prediction: Extensions and Benchmarking

Weicheng Ma, Ruibo Liu, Lili Wang, and Soroush Vosoughi
WISDOM Workshop - KDD 2020

In this paper, we extend the existing setting of the emoji prediction task to include a richer set of emojis and to allow multi-label classification on the task. We propose novel models for multi-class and multi-label emoji prediction based on Transformer networks. We also construct multiple emoji prediction datasets from Twitter using heuristics.
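
A minimal sketch of the multi-label formulation: one sigmoid logit per emoji with binary cross-entropy, so several emojis can be predicted for one text. Toy dimensions stand in for the paper's Transformer encoder:

```python
# Sketch of a multi-label emoji head: each emoji is an independent binary
# target, so the head uses sigmoids + BCE rather than a single softmax.
import torch
import torch.nn as nn

n_emojis, hidden = 64, 32
head = nn.Linear(hidden, n_emojis)                    # one logit per emoji
features = torch.randn(8, hidden)                     # stand-in encoder output
targets = torch.randint(0, 2, (8, n_emojis)).float()  # several emojis per text

loss = nn.BCEWithLogitsLoss()(head(features), targets)
predicted = torch.sigmoid(head(features)) > 0.5       # independent thresholds
print(loss.item(), predicted.sum(dim=1))              # variable emojis per sample
```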

An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data

Lili Wang, Chongyang Gao, Jason Wei, Weicheng Ma, Ruibo Liu, Soroush Vosoughi
W-NUT Workshop - EMNLP 2020

The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed.
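
A toy sketch of the survey's evaluation loop: embed tweets with some representation method, cluster, and score against gold topics. TF-IDF stands in here for the many representations the paper compares:

```python
# Sketch of the clustering evaluation: represent tweets, cluster them, and
# compare cluster assignments against gold topic labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

tweets = ["rain again today", "sunny all week", "new phone leaked",
          "phone camera rumors", "storm warning tonight", "specs announced"]
gold = [0, 0, 1, 1, 0, 1]                           # weather vs. tech topics

X = TfidfVectorizer().fit_transform(tweets)         # swap in BERT, word2vec, etc.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(adjusted_rand_score(gold, pred))              # clustering quality score
```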

Salienteye: Maximizing Engagement While Maintaining Artistic Style on Instagram Using Deep Neural Networks

Lili Wang, Ruibo Liu, and Soroush Vosoughi
ICMR 2020 (Short)

Instagram has become a great venue for amateur and professional photographers alike to showcase their work. We used transfer learning to adapt Xception, an object recognition model trained on the ImageNet dataset, to the task of engagement prediction. For style similarity measurement on photos posted to Instagram, we utilized Gram matrices generated from VGG19, another object recognition model trained on ImageNet.
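
A short sketch of the style-similarity half: Gram matrices of convolutional feature maps capture style, and the distance between two photos' Gram matrices measures stylistic difference. Random arrays stand in for VGG19 activations below:

```python
# Sketch of Gram-matrix style similarity: channel-by-channel correlations of
# a conv layer's activations encode style; comparing them across two photos
# gives a stylistic distance.
import numpy as np

def gram(features):
    # features: (channels, height, width) activation of one conv layer.
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)                     # normalized correlations

rng = np.random.default_rng(0)
photo_a, photo_b = rng.normal(size=(2, 64, 28, 28))  # fake VGG19 conv outputs
style_distance = np.linalg.norm(gram(photo_a) - gram(photo_b))
print(style_distance)                                # smaller = closer style
```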