Using NLP to build a semantic search engine plugin for Sympathy

In this task, we set out to enable users to find Sympathy nodes from a description of what they would like to do, i.e. in natural language. This is something a full-text search cannot easily achieve; it is commonly referred to as semantic search, since it extracts the meaning of the query and relates it to the most similar responses.

Introducing Transformers

The idea of training language models on large datasets and then using these pre-trained
models to improve performance on smaller, similar datasets has been a crucial breakthrough
for progress on many NLP challenges. However, task-specific pre-training and the difficulty of
capturing long-range sequential dependencies have been major constraints on training more
generalised language models. Transformer models can be trained on unlabelled, unstructured
text and then applied to a large array of downstream NLP tasks, including question answering
for dialogue systems, named entity recognition (NER) and sequence-level tasks such as text
generation. The typical Transformer architecture is illustrated below in Figure 1:

Figure 1: Transformer Blocks [4]

As shown in Figure 1, the Transformer architecture consists of a block of encoders (left) and a
block of decoders (right). Instead of passing a hidden state between layers (as in recurrent
neural network architectures), the encodings themselves are passed from one encoder to the
next, and the final encoder output is then passed to the first decoder in the decoder block.
Each encoder/decoder in the Transformer contains a self-attention layer, which determines
which parts of the sequence are most important when processing a particular word. For
example, in the sentence “James enjoys the beach because he likes to swim”, the self-attention
layer should learn to link the word “he” to “James” as the most important for its embedding.
Additionally, each decoder contains an “encoder-decoder attention” layer, which captures the
relative importance of each encoder output when the decoder predicts the output.
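To make the self-attention step concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The projection matrices, dimensions and random inputs are purely illustrative stand-ins for the learned weights of a real Transformer:

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        # X: (seq_len, d_model) token embeddings; W_*: learned projections in a real model
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
        weights = softmax(scores, axis=-1)        # e.g. the row for "he" should weight "James" highly
        return weights @ V                        # context-aware representation of each token

    # Toy example: 9 tokens ("James enjoys the beach because he likes to swim"), d_model = 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(9, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(self_attention(X, W_q, W_k, W_v).shape)  # (9, 8)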

BERT

Bidirectional Encoder Representations from Transformers, or BERT for short, is one of the
most influential Transformer-based models. It earned its reputation by beating benchmark
performance on a wide range of NLP tasks thanks to its bidirectional attention mechanism:
BERT considers not only the preceding context but also looks ahead when learning
embeddings. BERT focuses on building a language model and therefore uses only the encoder
block of the Transformer. Figure 2 below shows that BERT input embeddings are composed of
the word token embeddings, a segment embedding (used to distinguish paired sequences) and
a positional embedding that keeps track of the input order:

Figure 2: BERT embeddings illustrated [1]
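As a minimal sketch of how these three embeddings are combined, the snippet below sums a token, a segment and a positional embedding for each input position. The embedding tables are random stand-ins and the token ids are merely illustrative; real BERT also applies layer normalisation and dropout to the sum, which is omitted here:

    import numpy as np

    vocab_size, max_len, n_segments, d_model = 30522, 512, 2, 768  # BERT-base sizes

    rng = np.random.default_rng(0)
    token_emb    = rng.normal(size=(vocab_size, d_model))   # one row per WordPiece token
    segment_emb  = rng.normal(size=(n_segments, d_model))   # sentence A vs. sentence B
    position_emb = rng.normal(size=(max_len, d_model))      # learned positional embedding

    def bert_input_embeddings(token_ids, segment_ids):
        # Sum the three embeddings for each input position, as in Figure 2
        positions = np.arange(len(token_ids))
        return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

    # "[CLS] my dog is cute [SEP]" with illustrative WordPiece ids, all in segment 0
    token_ids = np.array([101, 2026, 3899, 2003, 10140, 102])
    segment_ids = np.zeros_like(token_ids)
    print(bert_input_embeddings(token_ids, segment_ids).shape)  # (6, 768)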

Fine-tuning BERT

To build our search engine, we first acknowledge that 72 data points are insufficient to
fine-tune the BERT model for our specific task. Instead, we make use of a benchmark dataset
for sentence similarity, STS-B, consisting of around 8,000 pairs of semantically similar
sentences from news articles, captions and forums [3]. Since BERT is not specifically designed
for sentence embeddings, we use a modified version of BERT for sentence encoding (proposed
by Reimers and Gurevych [2]), which adds a pooling layer to the standard architecture and is
trained with a regression objective in a siamese set-up, i.e. both sentences are passed through
the same network with shared weights and the two outputs are then compared (see Figure 3).
The regression objective compares the cosine similarity between the two sentence embeddings
with the gold similarity score, and this serves as the loss function for the fine-tuning task.
Reading Figure 3 from bottom to top, each sentence is first encoded using the standard BERT
architecture, the pooling layer is then applied to produce a fixed-size sentence vector, and the
cosine similarity between the two vectors is computed. As described in [2], we compute this
similarity measure between each query and the 72 docstrings that we obtain from the
Sympathy modules and return the top 5 nodes according to this measure.

Figure 3: Siamese BERT network for sentence similarity illustrated [2]
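The fine-tuning step can be sketched with the sentence-transformers library released alongside [2]. The snippet below wires a BERT encoder to a mean-pooling layer and trains it with a cosine-similarity regression loss; the two hard-coded sentence pairs stand in for the full STS-B training file, the output directory name is hypothetical, and the exact API may differ slightly between library versions:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, models

    # BERT encoder followed by a mean-pooling layer, as in Figure 3
    word_model = models.Transformer("bert-base-uncased", max_seq_length=128)
    pooling = models.Pooling(word_model.get_word_embedding_dimension(),
                             pooling_mode_mean_tokens=True)
    model = SentenceTransformer(modules=[word_model, pooling])

    # STS-B scores sentence pairs in [0, 5]; scale the labels to [0, 1]
    train_examples = [
        InputExample(texts=["A plane is taking off.",
                            "An air plane is taking off."], label=5.0 / 5.0),
        InputExample(texts=["A man is playing a flute.",
                            "A man is eating a banana."], label=0.5 / 5.0),
    ]
    train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Regression objective: cosine similarity of the two pooled embeddings vs. the label
    train_loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(train_loader, train_loss)],
              epochs=1,
              warmup_steps=100,
              output_path="output/sympathy-search-model")  # hypothetical output directory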

The Result

We have been able to build a working prototype of the semantic search engine for the 72
nodes currently available on Sympathy for Data, which we hope to integrate as a fully-fledged
plugin in the future. Our search engine performs impressively given that it has only been
trained on around 8,000 pairs of semantically similar sentences (i.e. 16,000 sentences). Below
is an illustrative example of how this works in practice.
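A minimal sketch of the query step is shown below; the node names and docstrings are placeholders (the plugin collects the real 72 docstrings from the installed Sympathy modules) and the model path is the hypothetical output directory from the fine-tuning sketch above:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder docstrings; the plugin gathers these from the Sympathy node modules
    node_docstrings = {
        "Example node A": "Read a CSV file into a table.",
        "Example node B": "Filter the rows of a table by a condition.",
        "Example node C": "Plot the columns of a table as a scatter chart.",
    }

    model = SentenceTransformer("output/sympathy-search-model")  # the fine-tuned model

    names = list(node_docstrings)
    doc_vecs = np.asarray(model.encode(list(node_docstrings.values())))  # computed once up front

    def search(query, top_k=5):
        # Rank all node docstrings by cosine similarity to the query and return the top_k names
        q = np.asarray(model.encode([query]))[0]
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return [names[i] for i in np.argsort(-sims)[:top_k]]

    print(search("load data from a csv file"))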

References:

[1] Devlin, J., Chang, M., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. [online] arXiv.org. Available at:
https://arxiv.org/abs/1810.04805 [Accessed 24 Sep. 2019].
[2] Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese
BERT-Networks. [online] arXiv.org. Available at: https://arxiv.org/abs/1908.10084
[Accessed 24 Sep. 2019].
[3] Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I. and Specia, L. (2017). SemEval-2017 Task 1:
Semantic Textual Similarity Multilingual and Cross-lingual Focused Evaluation. Proceedings of
the 11th International Workshop on Semantic Evaluation (SemEval-2017).
[4] Analytics Vidhya (2019). Understanding Transformers in NLP: State-of-the-Art Models.
[online] Available at:
https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/
[Accessed 24 Sep. 2019].
