Sentiment Embeddings with Sentence Transformers and VADER (Coding)

Oleh Dubetcky
May 21, 2024

Implementing sentiment analysis with embeddings involves several steps: preprocessing the data, creating sentence embeddings, and building and training a machine learning model.

The article Natural Language Processing (NLP) in Python with Code (Part 1. Tonality Analysis) looked at tonality (sentiment) analysis using the VADER tool.


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based tool for assessing the tonality of text; it is not a deep learning model, but it is fast and requires no training. Here we will use VADER to pre-score the tonality of the text and then pass that knowledge to a deep model for more accurate sentiment analysis.
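To see what this pre-analysis produces, here is a minimal sketch (assuming the NLTK VADER lexicon is downloaded) that scores a single English sentence; the exact numbers will vary with your lexicon version:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# VADER returns negative, neutral, positive scores and a compound score in [-1, 1]
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("The movie was surprisingly good!"))
# -> a dict like {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}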

Transformers have revolutionized the field of natural language processing (NLP) and machine learning. They were introduced in the paper “Attention is All You Need” by Vaswani et al. in 2017. Since then, transformer-based models like BERT, GPT, and various other derivatives have set new benchmarks across a wide range of NLP tasks.

Transformers use self-attention mechanisms to process input data. This allows them to consider the entire context of a sentence when making predictions, making them highly effective for tasks like language translation, text summarization, and sentiment analysis.
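For reference, the core operation introduced in that paper is scaled dot-product attention, where Q, K, and V are the query, key, and value matrices computed from the input and d_k is the dimension of the keys:

\text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V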

Until then, recurrent neural networks (RNNs) and their descendants, such as LSTMs, dominated the field. RNNs have a specific limitation: words are processed one by one, which makes it hard to parallelize training and to capture long-range context.

Here’s a brief overview and a step-by-step guide on how to use a transformer model for sentiment analysis, including the use of the SentenceTransformer for embeddings.

Make sure to install the necessary libraries if you haven’t already:

pip install sentence-transformers tensorflow scikit-learn pandas numpy

If the text is not in English, you’ll need to follow additional steps to ensure proper handling of the language. The SentenceTransformer models are generally trained on multilingual datasets, but you need to ensure you're using a model that supports the language in question. Here's how you can adapt the workflow for non-English text, specifically using a multilingual model like distiluse-base-multilingual-cased-v1.
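As a quick sanity check, here is a minimal sketch that loads the multilingual model used later in this article and encodes a short Ukrainian sentence (the 512-dimensional output size is specific to this model):

from sentence_transformers import SentenceTransformer

# Multilingual model; verify it covers your target language before relying on it
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
embeddings = model.encode(["Цей фільм був чудовим!"])
print(embeddings.shape)  # (1, 512) for this model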

Ensure your dataset contains text samples with sentiment labels (positive, negative, neutral) in the target language.
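For illustration, a hypothetical sentiment.csv with the expected layout (a text column and a sentiment column) might look like this:

text,sentiment
"Цей фільм був чудовим!",positive
"Це був жахливий досвід.",negative
"Фільм був нормальним.",neutral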

To integrate VADER with a Transformer model for sentiment analysis, you can use the VADER sentiment scores as additional input features, which also lends some interpretability to the Transformer-based predictions. Here is a step-by-step guide to combining VADER with a model such as SentenceTransformer for sentiment analysis.

First, load your data and preprocess it.

import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse repeated whitespace into a single space
    text = text.lower()               # Convert to lowercase
    return text

# Load your dataset
data = pd.read_csv('/home/master/Downloads/sentiment.csv')
data['cleaned_text'] = data['text'].apply(clean_text)

# Encode labels
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['sentiment']) # Assuming 'sentiment' column has 'positive', 'negative', 'neutral'

# Split data
X_train, X_test, y_train, y_test = train_test_split(data['cleaned_text'], data['label'], test_size=0.2, random_state=42)

Use SentenceTransformer to generate sentence embeddings and VADER to get sentiment scores.

# Download the Ukrainian tone dictionary
import requests
import nltk
import csv

nltk.download('vader_lexicon')

url = 'https://raw.githubusercontent.com/lang-uk/tone-dict-uk/master/tone-dict-uk.tsv'
r = requests.get(url)
with open(nltk.data.path[0] + '/tone-dict-uk.tsv', 'wb') as f:
    f.write(r.content)

# Build a {word: tone score} dictionary from the downloaded TSV
d = {}
with open(nltk.data.path[0] + '/tone-dict-uk.tsv', 'r') as csv_file:
    for row in csv.reader(csv_file, delimiter='\t'):
        d[row[0]] = float(row[1])


from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer


# Load the pre-trained SentenceTransformer model
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')

# Initialize VADER sentiment analyzer and update dict
vader_analyzer = SentimentIntensityAnalyzer()
vader_analyzer.lexicon.update(d)

# Function to get VADER sentiment scores
def get_vader_sentiment(text):
    scores = vader_analyzer.polarity_scores(text)
    return [scores['neg'], scores['neu'], scores['pos'], scores['compound']]

# Generate embeddings and VADER scores for training data
X_train_embeddings = model.encode(X_train.tolist(), show_progress_bar=True)
X_train_vader = np.array([get_vader_sentiment(text) for text in X_train])

# Generate embeddings and VADER scores for testing data
X_test_embeddings = model.encode(X_test.tolist(), show_progress_bar=True)
X_test_vader = np.array([get_vader_sentiment(text) for text in X_test])

# Combine embeddings with VADER scores
X_train_combined = np.hstack((X_train_embeddings, X_train_vader))
X_test_combined = np.hstack((X_test_embeddings, X_test_vader))

Use the combined features (embeddings + VADER scores) to train a neural network.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the model
input_dim = X_train_combined.shape[1]
nn_model = Sequential()
nn_model.add(Dense(256, input_shape=(input_dim,), activation='relu'))
nn_model.add(Dropout(0.3))
nn_model.add(Dense(128, activation='relu'))
nn_model.add(Dropout(0.3))
nn_model.add(Dense(3, activation='softmax')) # Three classes: positive, negative, neutral

# Compile the model
nn_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
nn_model.fit(X_train_combined, y_train, epochs=10, batch_size=32, validation_data=(X_test_combined, y_test))

Evaluate the model’s performance on the test set.

loss, accuracy = nn_model.evaluate(X_test_combined, y_test)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 13ms/step - accuracy: 0.6897 - loss: 0.5995
Test Accuracy: 68.97%

Create a function to predict the sentiment of new text samples using both SentenceTransformer embeddings and VADER sentiment scores.

def predict_sentiment(text):
    cleaned_text = clean_text(text)
    embedding = model.encode([cleaned_text])[0]                # Generate embedding for the input text
    vader_scores = get_vader_sentiment(cleaned_text)           # Get VADER scores
    combined_features = np.hstack((embedding, vader_scores))   # Combine features
    prediction = nn_model.predict(np.array([combined_features]))
    sentiment = label_encoder.inverse_transform([np.argmax(prediction)])
    return sentiment[0]

print(predict_sentiment("Цей фільм був нормальним, не поганим, але й не чудовим."))  # "This movie was okay, not bad, but not great either."

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
neutral

By combining SentenceTransformer embeddings with VADER sentiment scores, you can enhance your sentiment analysis model to leverage both deep learning-based sentence representations and lexicon-based sentiment features. This approach can improve model performance and provide more robust sentiment predictions. Adjust the neural network architecture and hyperparameters as needed to optimize performance for your specific dataset.
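As one example of such an adjustment (a sketch, not part of the pipeline above), you could add early stopping so training halts once the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss has not improved for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

nn_model.fit(X_train_combined, y_train,
             epochs=50, batch_size=32,
             validation_data=(X_test_combined, y_test),
             callbacks=[early_stop])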

If you liked the article, you can support the author by clapping below 👏🏻 Thanks for reading!

Oleh Dubetsky | LinkedIn


Oleh Dubetcky

I am a management consultant with a unique focus on delivering comprehensive solutions in both human resources (HR) and information technology (IT).