For a complete list of publications, please refer to my Google Scholar profile.
2022
Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation
Arora, Kushal,
Asri, Layla El,
Bahuleyan, Hareesh,
and Cheung, Jackie Chi Kit
In Findings of the Association for Computational Linguistics (ACL),
2022
Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by the mismatch between the training and the generation procedures, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality.
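As a back-of-the-envelope illustration of the error-accumulation argument (a toy calculation, not the paper's formal imitation-learning analysis), suppose the model derails with some small probability at every free-running decoding step; the chance of an error-free sequence then decays geometrically with length, even when per-step, teacher-forced likelihood looks good:

    # Toy illustration with a hypothetical per-step error rate (not a quantity
    # measured in the paper): a small chance of derailing at each free-running
    # decoding step compounds geometrically over the generated sequence.
    def prob_error_free(eps: float, length: int) -> float:
        return (1.0 - eps) ** length

    for length in (10, 50, 200):
        print(length, round(prob_error_free(0.02, length), 3))
    # 10 -> 0.817, 50 -> 0.364, 200 -> 0.018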
2021
Knowing When You Don’t Know in Online Fashion: An Uncertainty-Aware Size Recommendation Framework
Bahuleyan, Hareesh,
Lasserre, Julia,
Lefakis, Leonidas,
and Shirvany, Reza
In Recommender Systems in Fashion and Retail, FashionXRecSys Workshop,
2021
In recent years, the availability of large-scale datasets in online fashion has fueled the success of data-driven algorithmic products for supporting customers in their journey on fashion e-commerce platforms. Very often, these datasets are collected in an implicit manner, are subjective, and do not have expert-annotated labels. The use of inconsistent and noisy data to train machine learning models could potentially harm their performance and generalization capabilities. In this paper, we explore uncertainty quantification metrics within the context of online size and fit recommender systems and show how they could be used to deal with noisy instances and subjective labels. We further propose an uncertainty-aware loss function based on the Monte Carlo dropout uncertainty estimation technique. Through experiments on real data at scale within the challenging domain of size and fit recommendation, we benchmark multiple uncertainty metrics and demonstrate the effectiveness of the proposed approach for training in the presence of noise.
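A minimal sketch of the Monte Carlo dropout idea mentioned above (the network, dropout rate, and number of passes are illustrative assumptions, not the production size-and-fit model): keep dropout active at inference time, run several stochastic forward passes, and use the spread of the resulting predictions as an uncertainty signal.

    import torch
    import torch.nn as nn

    class SizeClassifier(nn.Module):
        def __init__(self, n_features: int, n_sizes: int, p: float = 0.3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(p),
                nn.Linear(64, n_sizes),
            )

        def forward(self, x):
            return self.net(x)

    def mc_dropout_predict(model, x, k: int = 20):
        model.train()  # keep dropout stochastic at inference time
        with torch.no_grad():
            probs = torch.stack([model(x).softmax(-1) for _ in range(k)])
        mean = probs.mean(0)  # averaged predictive distribution
        entropy = -(mean * mean.clamp_min(1e-9).log()).sum(-1)  # per-example uncertainty
        return mean, entropy

    model = SizeClassifier(n_features=16, n_sizes=5)
    mean, uncertainty = mc_dropout_predict(model, torch.randn(8, 16))
    print(mean.shape, uncertainty.shape)  # torch.Size([8, 5]) torch.Size([8])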
Polarized-VAE: Proximity Based Disentangled Representation Learning for Text Generation
Balasubramanian, Vikash,
Kobyzev, Ivan,
Bahuleyan, Hareesh,
Shapiro, Ilya,
and Vechtomova, Olga
In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL),
2021
Learning disentangled representations of real-world data is a challenging open problem. Most previous methods have focused on either supervised approaches which use attribute labels or unsupervised approaches that manipulate the factorization in the latent space of models such as the variational autoencoder (VAE) by training with task-specific losses. In this work, we propose polarized-VAE, an approach that disentangles select attributes in the latent space based on proximity measures reflecting the similarity between data points with respect to these attributes. We apply our method to disentangle the semantics and syntax of sentences and carry out transfer experiments. Polarized-VAE outperforms the VAE baseline and is competitive with state-of-the-art approaches, while being a more general framework that is applicable to other attribute disentanglement tasks.
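To make the proximity idea concrete, here is a hedged sketch of one way such an objective can look (a contrastive-style stand-in; the paper's actual similarity measure and loss terms may differ): within a chosen latent subspace, encodings of sentences that share an attribute are pulled together and encodings that differ are pushed apart.

    import torch
    import torch.nn.functional as F

    def proximity_loss(z, labels, margin: float = 1.0):
        dist = torch.cdist(z, z)  # pairwise distances in the chosen sub-space
        eye = torch.eye(len(labels), dtype=torch.bool)
        same = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # same attribute, not self
        diff = labels.unsqueeze(0) != labels.unsqueeze(1)           # different attribute
        pull = dist[same].mean()                   # attract same-attribute pairs
        push = F.relu(margin - dist[diff]).mean()  # repel different-attribute pairs
        return pull + push

    z = torch.randn(6, 8, requires_grad=True)   # e.g. the "syntax" part of the latent code
    labels = torch.tensor([0, 0, 1, 1, 2, 2])   # toy attribute (e.g. syntactic template) ids
    print(proximity_loss(z, labels))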
2020
Diverse Keyphrase Generation with Neural Unlikelihood Training
In Proceedings of the 28th International Conference on Computational Linguistics (COLING),
2020
In this paper, we study sequence-to-sequence (S2S) keyphrase generation models from the perspective of diversity. Recent advances in neural natural language generation have made possible remarkable progress on the task of keyphrase generation, demonstrated through improvements on quality metrics such as F1-score. However, the importance of diversity in keyphrase generation has been largely ignored. We first analyze the extent of information redundancy present in the outputs generated by a baseline model trained using maximum likelihood estimation (MLE). Our findings show that repetition of keyphrases is a major issue with MLE training. To alleviate this issue, we adopt the neural unlikelihood (UL) objective for training the S2S model. Our version of UL training operates at (1) the target token level, to discourage the generation of repeating tokens, and (2) the copy token level, to avoid copying repetitive tokens from the source text. Further, to encourage better model planning during the decoding process, we incorporate a K-step-ahead token prediction objective that computes both MLE and UL losses on future tokens as well. Through extensive experiments on datasets from three different domains, we demonstrate that the proposed approach attains considerable diversity gains while maintaining competitive output quality.
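A compact sketch of the token-level part of unlikelihood training (simplified; the paper also applies UL at the copy-token level and adds the K-step-ahead objective): probability mass placed on negative candidates, such as tokens already generated earlier in the output, is explicitly penalized.

    import torch

    def unlikelihood_loss(logits, negative_candidates):
        # logits: (seq_len, vocab_size); negative_candidates[t]: token ids to avoid at step t
        probs = logits.softmax(-1)
        loss = logits.new_zeros(())
        for t, cands in enumerate(negative_candidates):
            if cands.numel() == 0:
                continue
            p_neg = probs[t, cands].clamp(max=1.0 - 1e-6)
            loss = loss - torch.log(1.0 - p_neg).sum()  # penalize mass on repeats
        return loss

    logits = torch.randn(4, 100)  # four decoding steps over a toy vocabulary
    negatives = [torch.tensor([], dtype=torch.long), torch.tensor([7]),
                 torch.tensor([7, 12]), torch.tensor([7, 12, 3])]
    print(unlikelihood_loss(logits, negatives))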
2019
Disentangled Representation Learning for Non-Parallel Text Style Transfer
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL),
2019
This paper tackles the problem of disentangling the latent representations of style and content in language models. We propose a simple yet effective approach, which incorporates auxiliary multi-task and adversarial objectives, for style prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that the style and content are indeed disentangled in the latent space. This disentangled latent representation learning can be applied to style transfer on non-parallel corpora. We achieve high performance in terms of transfer accuracy, content preservation, and language fluency, in comparison to various previous approaches.
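As a rough illustration of pairing a multi-task objective with an adversarial one (the dimensions, heads, and gradient-reversal trick below are illustrative choices, not necessarily the paper's exact formulation): the style portion of the latent code is trained to predict the style label while being discouraged from carrying bag-of-words content information.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad):
            return -grad  # reverse gradients flowing back into the encoder

    z_style = torch.randn(4, 16, requires_grad=True)  # style portion of the latent code
    style_head = nn.Linear(16, 2)        # multi-task head: predict the style label
    bow_adversary = nn.Linear(16, 500)   # adversarial head: predict bag-of-words content

    style_labels = torch.randint(0, 2, (4,))
    bow_targets = F.softmax(torch.rand(4, 500), dim=-1)  # toy bag-of-words distributions

    multitask_loss = F.cross_entropy(style_head(z_style), style_labels)
    adversarial_loss = F.cross_entropy(bow_adversary(GradReverse.apply(z_style)), bow_targets)
    (multitask_loss + adversarial_loss).backward()  # added to reconstruction + KL in the full model
    print(float(multitask_loss), float(adversarial_loss))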
Stochastic Wasserstein Autoencoder for Probabilistic Sentence Generation
In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT),
2019
The variational autoencoder (VAE) imposes a probabilistic distribution (typically Gaussian) on the latent space and penalizes the Kullback-Leibler (KL) divergence between the posterior and prior. In NLP, VAEs are extremely difficult to train due to the problem of the KL term collapsing to zero. One has to implement various heuristics, such as KL weight annealing and word dropout, in a carefully engineered manner to successfully train a VAE for text. In this paper, we propose to use the Wasserstein autoencoder (WAE) for probabilistic sentence generation, where the encoder could be either stochastic or deterministic. We show theoretically and empirically that, in the original WAE, the stochastically encoded Gaussian distribution tends to become a Dirac-delta function, and we propose a variant of WAE that encourages the stochasticity of the encoder. Experimental results show that the latent space learned by WAE exhibits properties of continuity and smoothness as in VAEs, while simultaneously achieving much higher BLEU scores for sentence reconstruction.
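For context, a WAE-style objective typically replaces the per-sample KL penalty with a divergence between the aggregated posterior and the prior, estimated from samples; the sketch below uses a kernel MMD purely as one common illustration (the paper's actual penalty and its stochasticity-encouraging variant are richer).

    import torch

    def rbf_mmd(x, y, sigma: float = 1.0):
        # kernel MMD between two batches of latent samples
        def k(a, b):
            return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    z_posterior = torch.randn(64, 32) * 0.5 + 0.2  # stand-in for encoder samples
    z_prior = torch.randn(64, 32)                  # samples from the N(0, I) prior
    print(rbf_mmd(z_posterior, z_prior))           # added to the reconstruction loss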
2018
Variational Attention for Sequence-to-Sequence Models
In Proceedings of the 27th International Conference on Computational Linguistics (COLING),
2018
The variational encoder-decoder (VED) encodes source information as a set of random variables using a neural network, which in turn is decoded into target data using another neural network. In natural language processing, sequence-to-sequence (Seq2Seq) models typically serve as encoder-decoder networks. When combined with a traditional (deterministic) attention mechanism, the variational latent space may be bypassed by the attention model and thus becomes ineffective. In this paper, we propose a variational attention mechanism for VED, where the attention vector is also modeled as Gaussian-distributed random variables. Results on two experiments show that, without loss of quality, our proposed method alleviates the bypassing phenomenon while increasing the diversity of generated sentences.
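A hedged sketch of the central idea (the prior, the fixed variance, and the dot-product scoring are simplifying assumptions): the attention context vector is treated as a Gaussian random variable whose mean is the usual deterministic context, sampled with the reparameterization trick and regularized with a KL term.

    import torch

    def variational_context(enc_states, dec_state, log_var_value: float = -2.0):
        # enc_states: (src_len, d); dec_state: (d,)
        scores = enc_states @ dec_state                # dot-product attention scores
        alpha = scores.softmax(-1)                     # attention weights
        mu = alpha @ enc_states                        # deterministic context = Gaussian mean
        log_var = torch.full_like(mu, log_var_value)   # assumed fixed variance here
        ctx = mu + torch.randn_like(mu) * (0.5 * log_var).exp()          # reparameterized sample
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum()     # KL against N(0, I)
        return ctx, kl

    ctx, kl = variational_context(torch.randn(7, 32), torch.randn(32))
    print(ctx.shape, float(kl))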
Natural Language Generation with Neural Variational Models
Bahuleyan, Hareesh
Master’s Thesis,
2018
In this thesis, we explore the use of deep neural networks for the generation of natural language. Specifically, we implement two sequence-to-sequence neural variational models: variational autoencoders (VAE) and variational encoder-decoders (VED). VAEs for text generation are difficult to train due to issues associated with the Kullback-Leibler (KL) divergence term of the loss function vanishing to zero. We successfully train VAEs by implementing optimization heuristics such as KL weight annealing and word dropout. We also demonstrate the effectiveness of this continuous latent space through experiments such as random sampling, linear interpolation, and sampling from the neighborhood of the input. We argue that if VAEs are not designed appropriately, they may develop bypassing connections, which result in the latent space being ignored during training. We show experimentally, with the example of decoder hidden state initialization, that such bypassing connections degrade the VAE into a deterministic model, thereby reducing the diversity of generated sentences. We discover that the traditional attention mechanism used in sequence-to-sequence VED models serves as a bypassing connection, thereby deteriorating the model's latent space. In order to circumvent this issue, we propose the variational attention mechanism, where the attention context vector is modeled as a random variable that can be sampled from a distribution. We show empirically, using automatic evaluation metrics, namely entropy and distinct measures, that our variational attention model generates more diverse output sentences than the deterministic attention model. A qualitative analysis with a human evaluation study shows that our model produces sentences that are of high quality and as fluent as those generated by the deterministic attention counterpart.
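A small sketch of the two training heuristics discussed above (the schedule length and dropout rate are illustrative values): the KL weight is annealed linearly from 0 to 1, and decoder input tokens are randomly replaced with an unknown-word symbol so the decoder must rely on the latent code.

    import random

    def kl_weight(step: int, warmup_steps: int = 10000) -> float:
        # linearly anneal the weight on the KL term from 0 to 1
        return min(1.0, step / warmup_steps)

    def word_dropout(tokens, rate: float = 0.3, unk: str = "<unk>"):
        # randomly replace decoder inputs with <unk>
        return [unk if random.random() < rate else t for t in tokens]

    print(kl_weight(2500))  # 0.25
    print(word_dropout("the cat sat on the mat".split()))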
Generating Lyrics with Variational Autoencoder and Multi-modal Artist Embeddings
arXiv preprint,
2018
We present a system for generating song lyrics lines conditioned on the style of a specified artist. The system uses a variational autoencoder with artist embeddings. We propose the pre-training of artist embeddings with the representations learned by a CNN classifier, which is trained to predict artists based on MEL spectrograms of their song clips. This work is the first step towards combining audio and text modalities of songs for generating lyrics conditioned on the artist’s style. Our preliminary results suggest that there is a benefit in initializing artists’ embeddings with the representations learned by a spectrogram classifier.
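A simplified sketch of the pre-training step described above (the small CNN below is a stand-in for the paper's classifier, and all shapes are illustrative): artists are classified from MEL spectrograms, and each artist's mean penultimate-layer representation is used to initialize that artist's embedding in the lyrics generator.

    import torch
    import torch.nn as nn

    class SpectrogramCNN(nn.Module):
        def __init__(self, n_artists: int, emb_dim: int = 64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
                nn.Flatten(), nn.Linear(8 * 4 * 4, emb_dim), nn.ReLU(),
            )
            self.classifier = nn.Linear(emb_dim, n_artists)

        def forward(self, x):
            h = self.features(x)           # penultimate representation of a clip
            return self.classifier(h), h

    model = SpectrogramCNN(n_artists=10)
    spectrograms = torch.randn(32, 1, 128, 128)   # toy MEL spectrograms
    artists = torch.randint(0, 10, (32,))
    _, h = model(spectrograms)
    # after training the classifier, average each artist's clip representations
    # to initialize that artist's embedding in the lyrics generation model
    artist_embeddings = {int(a): h[artists == a].mean(0) for a in artists.unique()}
    print(len(artist_embeddings), next(iter(artist_embeddings.values())).shape)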
Music Genre Classification using Machine Learning Techniques
Bahuleyan, Hareesh
arXiv preprint,
2018
Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this multi-class classification task are identified. The experiments are conducted on the Audio Set dataset, and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.
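An ensembling sketch for the combination step (the equal weighting is an illustrative choice, not the paper's reported configuration): genre probabilities from the spectrogram CNN are averaged with those from a classical model trained on hand-crafted time- and frequency-domain features.

    import numpy as np

    def ensemble(cnn_probs: np.ndarray, feature_probs: np.ndarray, w: float = 0.5) -> np.ndarray:
        # weighted average of the two models' class probabilities
        return w * cnn_probs + (1.0 - w) * feature_probs

    cnn_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])       # toy CNN outputs
    feature_probs = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])   # toy feature-model outputs
    print(ensemble(cnn_probs, feature_probs).argmax(axis=1))        # predicted genre indices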
2017
UWaterloo at SemEval-2017 Task 8: Detecting Stance towards Rumours with Topic Independent Features
In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017),
2017
This paper describes our system for subtask A (SDQC) of RumourEval, Task 8 of SemEval-2017. Identifying rumours, especially for breaking news events as they unfold, is a challenging task due to the absence of sufficient information about the exact rumour stories circulating on social media. Determining the stance of Twitter users towards rumourous messages could provide an indirect way of identifying potential rumours. The proposed approach makes use of topic-independent features from two categories, namely cue features and message-specific features, to fit a gradient boosting classifier. With an accuracy of 0.78, our system achieved the second-best performance on subtask A of RumourEval.
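A sketch of the feature-based pipeline (the features shown are a tiny illustrative subset standing in for the paper's cue and message-specific feature sets, and the toy tweets are invented for demonstration): simple topic-independent indicators feed a gradient boosting classifier that predicts the SDQC stance.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    def cue_features(tweet: str) -> list:
        return [
            int("?" in tweet),                                                  # contains a question mark
            int(any(w in tweet.lower() for w in ("fake", "false", "rumour"))),  # denial cue words
            int("http" in tweet),                                               # contains a link
            len(tweet.split()),                                                 # message length
        ]

    tweets = ["Is this real?", "This is fake news http://t.co/x", "Confirmed by police.", "source?"]
    stances = ["query", "deny", "support", "query"]
    clf = GradientBoostingClassifier().fit(np.array([cue_features(t) for t in tweets]), stances)
    print(clf.predict([cue_features("Can anyone confirm this?")]))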