How to use Google’s pre-trained Language Model


Background

Having a good pre-trained language model can be a significant time and money saver for NLP projects, giving better results for less compute and time. Empirical studies have shown that unsupervised pre-training greatly improves generalization and training in many machine learning tasks. For example, according to this paper from Montreal and Google:

The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set;

Thankfully, the almighty Google scientists made some of their models open and available for everyone to use. Here, we will use Google’s lm_1b pre-trained TensorFlow language model. The model’s vocabulary size is 793471, and it was trained on 32 GPUs for five days. If you want to learn the details, please refer to this paper. The entire TensorFlow graph is defined in a protobuf file. That model definition also specifies which compute devices are to be used, and it is set to use the primary CPU device. That’s fine; most of us would not be able to fit the large model parameters into a conventional desktop GPU anyway.
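To see how device placement shows up in a graph definition, here is a minimal sketch that scans a GraphDef written in protobuf text format and counts the device string attached to each node. It uses only the Python 3 standard library and a simple regular expression rather than a real protobuf parser, so treat it as an illustration. The helper name count_devices and the sample graph are made up for this example and are not taken from the lm_1b files.

```python
import re
from collections import Counter

# Matches lines such as:  device: "/cpu:0"
# Assumption: the graph is in protobuf text format, one field per line.
DEVICE_FIELD = re.compile(r'^\s*device:\s*"([^"]*)"', re.MULTILINE)


def count_devices(pbtxt_text):
    """Count how many nodes are pinned to each device string."""
    return Counter(DEVICE_FIELD.findall(pbtxt_text))


# Tiny hand-written sample in GraphDef text format (hypothetical, not lm_1b).
SAMPLE = '''
node {
  name: "a"
  op: "Const"
  device: "/cpu:0"
}
node {
  name: "b"
  op: "Identity"
  input: "a"
  device: "/cpu:0"
}
node {
  name: "c"
  op: "NoOp"
}
'''

if __name__ == "__main__":
    # Nodes without a device field (like "c") are simply not counted.
    print(count_devices(SAMPLE))
```

To try it on a real graph file, read the file contents into a string and pass that string to count_devices. A regex scan can be fooled by unusual string values inside the file, so a proper protobuf parser is the safer choice for anything beyond a quick look.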

The original lm_1b repository describes the steps somewhat awkwardly. You need to do some manual work and run Bazel commands with arguments to use the model. If you want to get embeddings, you again need to run Bazel commands with your text passed as parameters, and the results are saved to a file. On top of that, their inference and evaluation code is written for Python 2. The code and instructions provided here let you fetch embeddings at run time from Python 3 code, in a more flexible manner.
