Project Contributors

  • Karan Jeswani
  • Rahul John Roy
  • Eshwar SR

Objective

  • Human evaluation of model-generated text is accurate, but too expensive and slow for the purpose of model development. Evaluating the output of such systems automatically saves time, accelerates further research on text generation tasks, and is free of human bias.
  • The objective of our project is to review currently used metrics such as BLEU, Word Mover's Similarity, Sentence Mover's Similarity, and BERTScore, and to evaluate how well they correlate with human judgements (a minimal correlation sketch follows this list).
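  • A minimal sketch of that comparison, assuming SciPy and hypothetical placeholder scores (in our experiments the scores come from the datasets described below):

      # Sketch: correlating automatic metric scores with human judgements.
      # The score lists below are hypothetical placeholders.
      from scipy.stats import pearsonr, spearmanr

      human_scores  = [4.0, 2.5, 3.0, 5.0, 1.5]       # human ratings per text
      metric_scores = [0.71, 0.42, 0.55, 0.88, 0.30]  # metric scores per text

      pearson_r, _  = pearsonr(human_scores, metric_scores)
      spearman_r, _ = spearmanr(human_scores, metric_scores)
      print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")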

Datasets Used

  • We compute the scores produced by the metrics stated above on the Automated Student Assessment Prize (ASAP) AES and SAS datasets, an essay-evaluation task.
  • We have also tested the metrics on the CNN/DailyMail summarization dataset (a loading sketch follows this list).
  • We also use the WMT18 dataset to compare our results with those reported in the BERTScore paper.
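  • For reference, a sketch of loading the CNN/DailyMail corpus, assuming the Hugging Face datasets package (the ASAP and WMT18 data were obtained as files from their respective sites):

      # Sketch: loading CNN/DailyMail via Hugging Face `datasets`.
      from datasets import load_dataset

      cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="validation")
      example = cnn_dm[0]
      print(example["article"][:200])  # source news article
      print(example["highlights"])     # reference summary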

Methodology

  • BLEU and ROUGE do not require embeddings; they are based purely on n-gram matching. BLEU is precision-oriented, while ROUGE is recall-oriented (a short sketch of both follows this list).
  • BERT and RoBERTa are Transformer-based networks pre-trained with self-supervised objectives, Masked Language Modelling (MLM) and, in BERT's case, Next Sentence Prediction (NSP), on large corpora such as Wikipedia, BooksCorpus, and web text. BERTScore uses such networks to produce contextual word embeddings; we used the already pre-trained models to generate embeddings (see the BERTScore sketch after this list).
  • Sentence Mover's Similarity can be built on top of any word embeddings: word vectors are pooled into sentence vectors, which we then compare using various measures such as cosine similarity, earth mover's distance (also called Wasserstein distance), and Euclidean distance (see the last sketch after this list).
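  • A minimal sketch of the n-gram metrics, assuming NLTK and the rouge-score package (the example sentences are hypothetical):

      # Sketch: n-gram overlap metrics. BLEU (precision-oriented) via NLTK,
      # ROUGE (recall-oriented) via the `rouge-score` package.
      from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
      from rouge_score import rouge_scorer

      reference = "the cat sat on the mat"
      candidate = "the cat is on the mat"

      bleu = sentence_bleu([reference.split()], candidate.split(),
                           smoothing_function=SmoothingFunction().method1)

      scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
      rouge = scorer.score(reference, candidate)

      print(f"BLEU = {bleu:.3f}")
      print(f"ROUGE-1 recall = {rouge['rouge1'].recall:.3f}")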
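
  • Likewise, a sketch of BERTScore using the authors' bert_score package, which downloads a pre-trained Transformer and extracts contextual embeddings internally (the sentences are hypothetical):

      # Sketch: BERTScore via the `bert_score` package. Candidate and
      # reference tokens are matched by cosine similarity of their
      # contextual embeddings from a pre-trained Transformer.
      from bert_score import score

      candidates = ["the cat is on the mat"]
      references = ["the cat sat on the mat"]

      P, R, F1 = score(candidates, references, lang="en")
      print(f"BERTScore F1 = {F1.mean().item():.3f}")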
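
  • Finally, a crude sketch of the Sentence Mover's idea under simplifying assumptions (mean pooling, uniform sentence weights, the pyemd package); the original formulation weights sentences by their length:

      # Crude sketch of Sentence Mover's Similarity: sentence vectors are
      # mean-pooled word vectors; each document is a weighted point cloud
      # of sentence vectors; similarity is derived from the earth mover's
      # (Wasserstein) distance between the two clouds.
      import numpy as np
      from pyemd import emd

      def sentence_vectors(doc, word_vecs, dim=300):
          """Mean-pool word vectors (e.g. GloVe) for each sentence."""
          vecs = []
          for sent in doc:  # doc: list of sentences, each a list of tokens
              rows = [word_vecs.get(w, np.zeros(dim)) for w in sent]
              vecs.append(np.mean(rows, axis=0))
          return np.array(vecs)

      def sentence_movers_similarity(doc_a, doc_b, word_vecs, dim=300):
          A = sentence_vectors(doc_a, word_vecs, dim)
          B = sentence_vectors(doc_b, word_vecs, dim)
          n, m = len(A), len(B)
          # Uniform weights over the combined set of sentence vectors.
          wa = np.concatenate([np.ones(n) / n, np.zeros(m)])
          wb = np.concatenate([np.zeros(n), np.ones(m) / m])
          pts = np.vstack([A, B])
          # Pairwise Euclidean distances as the ground metric for EMD.
          dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
          cost = emd(wa, wb, dist)
          return 1.0 / (1.0 + cost)  # convert a distance into a similarity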

Conclusions

  • Cosine similarity is the best-performing similarity measure for word/sentence embeddings, compared to Euclidean distance and earth mover's distance (also called Wasserstein distance); a short comparison sketch follows this list.
  • Sentence-level embeddings correlate better with human judgement when evaluating multi-sentence text.
  • Pre-trained embeddings alone do not yield good word or sentence representations; BERT, RoBERTa, and GloVe perform almost equivalently.
  • Even today, embedding-based metrics are rarely used during language-model training because their evaluation cost is too high; BLEU and ROUGE remain the most commonly used metrics.
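  • For concreteness, a sketch of the basic similarity computations on a pair of embeddings (random placeholder vectors; the transport-based earth mover's variant is sketched in the Methodology section):

      # Sketch: comparing two embeddings with cosine similarity and
      # Euclidean distance. SciPy's `cosine` returns a *distance*.
      import numpy as np
      from scipy.spatial.distance import cosine, euclidean

      rng = np.random.default_rng(0)
      u, v = rng.normal(size=300), rng.normal(size=300)

      print(f"cosine similarity = {1.0 - cosine(u, v):.3f}")
      print(f"euclidean distance = {euclidean(u, v):.3f}")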