Multi-Label Text Classification for Job Skills Prediction

Oleh Dubetcky
3 min read · Jun 23, 2024


Multi-label text classification assigns each input (here, a job title) any number of labels at once (here, skills), unlike multi-class classification, where each input receives exactly one label.

Photo by Nathan Cima on Unsplash

Here’s a step-by-step guide to building a multi-label text classification model for predicting skills based on job titles:

Step 1: Data Collection

Job Titles and Skills Data: Collect a dataset containing job titles and corresponding skills. Each job title should be associated with a set of skills.
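As a rough illustration of the shape of such a dataset, the sketch below builds a small DataFrame by hand. The rows are hypothetical, and the column names `jobtitle` and `skills` are chosen to match the code later in this article; a real dataset would come from job postings or a similar source.

```python
import pandas as pd

# Hypothetical sample rows, only to show the expected layout.
df = pd.DataFrame({
    "jobtitle": [
        "Web Developer",
        "Data Scientist",
        "DevOps Engineer",
    ],
    "skills": [
        ["html", "css", "javascript"],
        ["python", "sql", "machine learning"],
        ["docker", "kubernetes", "linux"],
    ],
})

# If skills arrive as one delimited string per row (an assumption about
# the raw data), split them into lists before any further processing.
raw = pd.Series(["html; css; javascript"])
parsed = raw.str.split(";").apply(lambda items: [s.strip() for s in items])

print(df.head())
print(parsed.tolist())
```

Keeping each row's skills as a list matters for the label binarization step, which treats every element of the list as one label.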

Step 2: Data Preprocessing

  1. Text Cleaning: Clean the job titles by removing special characters, converting to lowercase, and performing other text preprocessing steps.
  2. Tokenization: Tokenize the job titles into words or subwords.
  3. Label Binarization: Convert the list of skills into a binary format where each skill is represented as a binary value (1 if the skill is present, 0 otherwise).

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

# df is assumed to be a DataFrame with a 'jobtitle' text column and a
# 'skills' column that holds a list of skills per row (not a single string,
# which MultiLabelBinarizer would split into characters).

# Label binarization: one 0/1 column per distinct skill
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['skills'])

# Skill names, in the same order as the columns of y
classes = mlb.classes_
print(classes)
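The text cleaning step listed above is not shown in the article's code. A minimal sketch of one way to do it follows; the function name `clean_title` and the choice to keep `+` and `#` are assumptions made for illustration, not part of the original pipeline.

```python
import re

def clean_title(title: str) -> str:
    """Lowercase a job title and replace special characters with spaces."""
    title = title.lower()
    # Keep letters, digits, '+' and '#' so that terms such as "c++"
    # and "c#" survive; every other run of characters becomes one space.
    title = re.sub(r"[^a-z0-9+#]+", " ", title)
    return title.strip()

print(clean_title("Senior C++ Developer (Remote)!"))
```

With a DataFrame, this could be applied as `df['jobtitle'].apply(clean_title)`. Note that the default token pattern of `TfidfVectorizer` keeps only tokens of two or more word characters, so a term like "c++" would additionally need a custom `token_pattern` to reach the model.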

Step 3: Feature Extraction

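  1. Vectorization: Convert the text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based embeddings (BERT, RoBERTa).

# TF-IDF vectorization of the job titles: word unigrams to trigrams,
# keeping the 1000 most frequent terms.
# Note: fitting on all rows before the train-test split lets the test
# vocabulary influence the features; fit on the training titles only
# for a stricter evaluation.
vectorizer = TfidfVectorizer(analyzer='word', max_features=1000, ngram_range=(1, 3), stop_words=None)
X = vectorizer.fit_transform(df['jobtitle'])

# Mapping from each term to its column index in X
print(vectorizer.vocabulary_)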

Step 4: Model Building

Choose a Model: Select a model that supports multi-label classification. Common choices include:

  • Logistic Regression
  • Random Forest
  • Support Vector Machine (SVM)
  • Deep Learning models (e.g., LSTM, CNN, Transformer-based models)

Multi-Label Strategy: Use techniques such as:

  • Binary Relevance: Treat each label as a separate binary classification problem.
  • Classifier Chains: Consider label dependencies by chaining binary classifiers.
  • Label Powerset: Treat each unique combination of labels as a single label in a multi-class classification problem.
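To make the three strategies more concrete, here is a small sketch on synthetic data. The toy features and labels are hypothetical; `OneVsRestClassifier` stands in for binary relevance, scikit-learn's `ClassifierChain` for classifier chains, and the label powerset part is a hand-rolled approximation (scikit-learn has no dedicated class for it, as far as I know).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Toy data (hypothetical): label 1 can only be present when label 0 is,
# so the labels are dependent.
rng = np.random.RandomState(0)
X = rng.rand(40, 2)
Y = np.column_stack([
    (X[:, 0] > 0.5).astype(int),
    ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int),
])

# Binary relevance: one independent classifier per label.
br = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Classifier chain: the classifier for label 1 also sees label 0 as input.
chain = ClassifierChain(LogisticRegression(), order=[0, 1], random_state=0).fit(X, Y)

# Label powerset: map each distinct label combination to one class id,
# train a multi-class model, then map predictions back to label rows.
combos, inverse = np.unique(Y, axis=0, return_inverse=True)
y_combo = np.asarray(inverse).ravel()
lp = LogisticRegression().fit(X, y_combo)
lp_pred = combos[lp.predict(X)]

print(br.predict(X[:3]))
print(chain.predict(X[:3]))
print(lp_pred[:3])
```

Label powerset can only predict combinations seen during training, which may be limiting when the number of skills is large and most combinations are rare.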

Step 5: Training the Model

  1. Split Data: Split the data into training and testing sets.
  2. Model Training: Train the selected model on the training data.
  3. Hyperparameter Tuning: Tune the model’s hyperparameters using cross-validation techniques.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model building: one logistic regression per skill (binary relevance)
model = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
model.fit(X_train, y_train)

import numpy as np

def j_score(y_true, y_pred):
    # Mean per-sample Jaccard similarity (intersection over union), in percent.
    # A row with no true and no predicted labels yields 0/0 (NaN) here.
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    return jaccard.mean() * 100

# Predict on the test set before scoring
y_pred = model.predict(X_test)

j_score(y_test, y_pred)

#47.25109916714178

from sklearn.svm import LinearSVC

# Linear SVM with an L1 penalty, again one classifier per skill
svm = LinearSVC(C=1.5, penalty='l1', dual=False)
clf = OneVsRestClassifier(svm)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

j_score(y_test, y_pred)

#70.20806495547572
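Hyperparameter tuning is listed in this step but not demonstrated. One possible approach is a cross-validated grid search over the SVM's `C` parameter, sketched below. The synthetic data from `make_multilabel_classification` stands in for the TF-IDF matrix and binarized skills, and the candidate values for `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_train / y_train (an assumption for this sketch).
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0
)

base = OneVsRestClassifier(LinearSVC(penalty="l1", dual=False))

# Parameters of the wrapped estimator are addressed with the
# "estimator__" prefix.
param_grid = {"estimator__C": [0.1, 0.5, 1.5, 5.0]}

# 3-fold cross-validation, scored with micro-averaged F1.
search = GridSearchCV(base, param_grid, scoring="f1_micro", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the best parameters, and can be used for prediction like any other classifier.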

Step 6: Model Evaluation

Metrics: Use appropriate evaluation metrics for multi-label classification, such as:

  • Hamming Loss
  • F1 Score (macro, micro, weighted)
  • Precision and Recall

Evaluation: Evaluate the model on the test data using the selected metrics.
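The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on a tiny hand-written example; the arrays are hypothetical and would be replaced by `y_test` and the model's predictions in practice.

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    hamming_loss,
    jaccard_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions: 3 samples, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_hat = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

# Fraction of individual label decisions that are wrong.
h_loss = hamming_loss(y_true, y_hat)

# Micro averaging pools all label decisions; macro averages per-label scores.
f1_micro = f1_score(y_true, y_hat, average="micro")
f1_macro = f1_score(y_true, y_hat, average="macro")
prec_micro = precision_score(y_true, y_hat, average="micro")
rec_micro = recall_score(y_true, y_hat, average="micro")

# Per-sample Jaccard similarity, averaged over samples.
jac_samples = jaccard_score(y_true, y_hat, average="samples")

print(h_loss, f1_micro, f1_macro, prec_micro, rec_micro, jac_samples)
```

The `average="samples"` Jaccard score measures the same quantity as a mean per-sample intersection over union, expressed as a fraction rather than a percentage.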

Step 7: Deployment and Prediction

Model Deployment: Deploy the trained model to a production environment.

Prediction: Use the deployed model to predict skills for new job titles.

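# Predict skills for a new job title, reusing the fitted vectorizer
x = ['Web']
xt = vectorizer.transform(x)

# Map the 0/1 predictions back to skill names
print(mlb.inverse_transform(clf.predict(xt)))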
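For deployment, the fitted vectorizer, label binarizer, and classifier need to be saved together, since prediction depends on all three. A minimal sketch using `joblib` follows; the tiny training set, the file name `skills_model.joblib`, and the helper `predict_skills` are hypothetical and only serve to make the example runnable.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Tiny hypothetical training set, standing in for the real data.
titles = ["web developer", "data scientist", "web designer", "data analyst"]
skills = [["html", "css"], ["python", "sql"], ["css", "design"], ["sql", "excel"]]

vectorizer = TfidfVectorizer()
mlb = MultiLabelBinarizer()
clf = OneVsRestClassifier(LinearSVC())
clf.fit(vectorizer.fit_transform(titles), mlb.fit_transform(skills))

# Save the three fitted objects in one file so they stay in sync.
path = os.path.join(tempfile.mkdtemp(), "skills_model.joblib")
joblib.dump({"vectorizer": vectorizer, "mlb": mlb, "clf": clf}, path)

def predict_skills(job_titles, model_path=path):
    """Load the saved bundle and return a tuple of skills per job title."""
    bundle = joblib.load(model_path)
    features = bundle["vectorizer"].transform(job_titles)
    return bundle["mlb"].inverse_transform(bundle["clf"].predict(features))

print(predict_skills(["web developer"]))
```

A function like `predict_skills` could then be wrapped in whatever serving layer the production environment uses. Files saved with `joblib` should be loaded with a compatible scikit-learn version and only from trusted sources.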




Oleh Dubetcky

I am a management consultant with a unique focus on delivering comprehensive solutions in both human resources (HR) and information technology (IT).