Multi-Label Text Classification for Job Skills Prediction

Jun 23, 2024

Multi-label text classification is a method where each input (in this case, a job title) can be assigned multiple labels (skills in this scenario).

Here’s a step-by-step guide to building a multi-label text classification model for predicting skills based on job titles:

Step 1: Data Collection

Job Titles and Skills Data: Collect a dataset containing job titles and corresponding skills. Each job title should be associated with a set of skills.

Example source: ESCO Occupations (esco.ec.europa.eu). The ESCO occupations pillar is built on ISCO-08, which serves as the hierarchical structure for the occupations pillar…
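To make the expected data shape concrete, here is a minimal sketch of such a dataset. The rows are invented for illustration, and the semicolon-delimited skills string is an assumption about how the raw data might be stored; only the column names jobtitle and skills come from the code later in this article.

```python
import pandas as pd

# Hypothetical toy rows; real data would come from your own source.
df = pd.DataFrame(
    {
        "jobtitle": ["Web Developer", "Data Analyst", "Data Engineer"],
        # Assumed raw format: one semicolon-delimited string of skills per title
        "skills": ["html;css;javascript", "sql;excel", "python;sql"],
    }
)

# Turn each delimited string into a list so every title maps to a set of skills.
df["skills"] = df["skills"].str.split(";")

print(df)
```

Each row now pairs one job title with a list of skills, which is the shape the label binarization step expects.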

Step 2: Data Preprocessing

Text Cleaning: Clean the job titles by removing special characters, converting to lowercase, and performing other text preprocessing steps.
Tokenization: Tokenize the job titles into words or subwords.
Label Binarization: Convert the list of skills into a binary format where each skill is represented as a binary value (1 if the skill is present, 0 otherwise).
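The cleaning and tokenization steps above can be sketched as follows. The helper names clean_title and tokenize are hypothetical, and the choice to keep '+' and '#' (so that titles like "C++" or "C#" survive) is an assumption you may want to change for your own data.

```python
import re
from typing import List


def clean_title(title: str) -> str:
    """Lowercase a job title, replace special characters with spaces, collapse whitespace."""
    title = title.lower()
    # Keep letters, digits, '+' and '#' (e.g. c++, c#); everything else becomes a space
    title = re.sub(r"[^a-z0-9+#\s]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def tokenize(title: str) -> List[str]:
    """Split a cleaned title into word tokens on whitespace."""
    return clean_title(title).split()


print(clean_title("Senior C++ Developer (m/f/d)"))  # senior c++ developer m f d
print(tokenize("Web-Developer / Front End"))        # ['web', 'developer', 'front', 'end']
```

If you rely on TfidfVectorizer, note that it lowercases and tokenizes on its own by default, so a separate tokenizer is mainly useful for embedding-based models.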

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

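# df is assumed to be a pandas DataFrame that is already loaded, with a 'jobtitle'
# column and a 'skills' column holding a list of skills per row
# Label binarization
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['skills'])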

classes = mlb.classes_
print(classes)
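To see what the binarizer produces, here is a small self-contained sketch on made-up skill lists (the toy_ names are illustrative and not part of the original pipeline).

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up skill lists, one per job title
toy_skills = [["python", "sql"], ["sql", "excel"], ["python"]]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_skills)

# Classes are sorted alphabetically; each column of toy_y corresponds to one class
print(toy_mlb.classes_)  # ['excel' 'python' 'sql']
print(toy_y)
# [[0 1 1]
#  [1 0 1]
#  [0 1 0]]

# inverse_transform maps indicator rows back to tuples of labels
print(toy_mlb.inverse_transform(toy_y))
```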

Step 3: Feature Extraction

Vectorization: Convert the text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based embeddings (BERT, RoBERTa).

# Feature extraction: TF-IDF over word 1- to 3-grams, keeping the 1000 most frequent terms
vectorizer = TfidfVectorizer(analyzer='word', max_features=1000, ngram_range=(1, 3), stop_words=None)
X = vectorizer.fit_transform(df['jobtitle'])

print(vectorizer.vocabulary_)
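As a rough illustration of what the n-gram range does, the sketch below fits a vectorizer on three invented titles with unigrams and bigrams only (a smaller range than the one used above, chosen to keep the output short).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented titles for illustration
toy_titles = ["web developer", "senior web developer", "data analyst"]

toy_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
toy_X = toy_vec.fit_transform(toy_titles)

# The vocabulary contains single words and adjacent word pairs
print(sorted(toy_vec.vocabulary_))
# ['analyst', 'data', 'data analyst', 'developer', 'senior', 'senior web', 'web', 'web developer']
print(toy_X.shape)  # (3, 8): one row per title, one column per n-gram
```

Short inputs such as job titles produce few n-grams, so the max_features cap mostly matters on larger datasets.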

Step 4: Model Building

Choose a Model: Select a model that supports multi-label classification. Common choices include:

Logistic Regression
Random Forest
Support Vector Machine (SVM)
Deep Learning models (e.g., LSTM, CNN, Transformer-based models)

Multi-Label Strategy: Use techniques such as:

Binary Relevance: Treat each label as a separate binary classification problem.
Classifier Chains: Consider label dependencies by chaining binary classifiers.
Label Powerset: Treat each unique combination of labels as a single label in a multi-class classification problem.
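The three strategies can be compared in a minimal sketch on made-up toy data. OneVsRestClassifier and ClassifierChain are scikit-learn classes; scikit-learn has no dedicated label powerset class, so that strategy is emulated by hand here by encoding each label combination as a string, which is an illustrative choice rather than a standard API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Toy features and a 3-label indicator matrix (invented for illustration)
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]], dtype=float)
Y_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1]])

# Binary relevance: one independent binary classifier per label
br = OneVsRestClassifier(LogisticRegression()).fit(X_toy, Y_toy)
br_pred = br.predict(X_toy)

# Classifier chain: each classifier also sees the labels earlier in the chain
chain = ClassifierChain(LogisticRegression(), order=[0, 1, 2]).fit(X_toy, Y_toy)
chain_pred = chain.predict(X_toy)

# Label powerset (hand-rolled): each distinct label combination becomes one class
keys = ["".join(map(str, row)) for row in Y_toy]  # e.g. "110"
lp = LogisticRegression().fit(X_toy, keys)
lp_pred = np.array([[int(c) for c in key] for key in lp.predict(X_toy)])

print(br_pred.shape, chain_pred.shape, lp_pred.shape)
```

Label powerset can only predict combinations seen during training, which is a real limitation when the number of skills is large.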

Step 5: Training the Model

Split Data: Split the data into training and testing sets.
Model Training: Train the selected model on the training data.
Hyperparameter Tuning: Tune the model’s hyperparameters using cross-validation techniques.
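The hyperparameter tuning step could look roughly like the sketch below. It uses synthetic data from make_multilabel_classification as a stand-in for the TF-IDF features, and the grid over C, the micro-averaged F1 scoring, and the 3-fold setting are all assumptions picked for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data standing in for the job-title features
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Parameters of the wrapped classifier are addressed with the "estimator__" prefix
search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},
    scoring="f1_micro",
    cv=3,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # micro F1 of the refitted best model on held-out data
```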

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model building
model = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
model.fit(X_train, y_train)

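import numpy as np

def j_score(y_true, y_pred):
    # Mean per-sample Jaccard index (intersection over union of the label sets), in percent
    jaccard = np.minimum(y_true, y_pred).sum(axis = 1)/np.maximum(y_true, y_pred).sum(axis = 1)
    return jaccard.mean()*100

# Predict with the logistic regression model before scoring it
y_pred = model.predict(X_test)

j_score(y_test, y_pred)

#47.25109916714178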

from sklearn.svm import LinearSVC

svm = LinearSVC(C = 1.5, penalty='l1', dual = False)
clf = OneVsRestClassifier(svm)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

j_score(y_test, y_pred)

#70.20806495547572

Step 6: Model Evaluation

Metrics: Use appropriate evaluation metrics for multi-label classification, such as:

Hamming Loss
F1 Score (macro, micro, weighted)
Precision and Recall

Evaluation: Evaluate the model on the test data using the selected metrics.
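The metrics listed above are all available in sklearn.metrics. The sketch below computes them on two small hand-written indicator matrices (invented for illustration) so the numbers can be checked by hand.

```python
import numpy as np
from sklearn.metrics import (
    f1_score,
    hamming_loss,
    jaccard_score,
    precision_score,
    recall_score,
)

# Hand-written indicator matrices: 3 samples, 3 labels
y_true_demo = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])

# Fraction of individual label slots that are wrong: 3 of 9
print(hamming_loss(y_true_demo, y_pred_demo))                      # ~0.333

# Micro averaging pools all label slots: TP=3, FP=1, FN=2
print(precision_score(y_true_demo, y_pred_demo, average="micro"))  # 0.75
print(recall_score(y_true_demo, y_pred_demo, average="micro"))     # 0.6
print(f1_score(y_true_demo, y_pred_demo, average="micro"))         # ~0.667

# Macro averaging computes F1 per label, then takes the unweighted mean
print(f1_score(y_true_demo, y_pred_demo, average="macro"))

# Per-sample Jaccard mean, on a 0 to 1 scale
print(jaccard_score(y_true_demo, y_pred_demo, average="samples"))  # ~0.611
```

The last metric is the same quantity that j_score reports, except that j_score multiplies it by 100.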

Step 7: Deployment and Prediction

Model Deployment: Deploy the trained model to a production environment.

Prediction: Use the deployed model to predict skills for new job titles.

x = ['Web']
xt = vectorizer.transform(x)
mlb.inverse_transform(clf.predict(xt))
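For deployment, the fitted vectorizer, label binarizer, and classifier need to be saved together, because prediction uses all three. One common option is joblib; the sketch below is an assumption about how this could be done, not the author's setup. The tiny training set, the file name skills_model.joblib, and the helper predict_skills are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Tiny stand-in training set so the sketch runs on its own
titles = ["web developer", "frontend web developer", "data analyst", "senior data analyst"]
skills = [["html", "css"], ["html", "css"], ["sql", "excel"], ["sql", "excel"]]

vec = TfidfVectorizer()
mlb_demo = MultiLabelBinarizer()
clf_demo = OneVsRestClassifier(LinearSVC()).fit(
    vec.fit_transform(titles), mlb_demo.fit_transform(skills)
)

# Persist all three fitted objects in one file
path = os.path.join(tempfile.mkdtemp(), "skills_model.joblib")
joblib.dump({"vectorizer": vec, "mlb": mlb_demo, "clf": clf_demo}, path)

# In the serving process: load once, then reuse for every request
bundle = joblib.load(path)


def predict_skills(title):
    """Return the predicted skills for one job title as a tuple of labels."""
    features = bundle["vectorizer"].transform([title])
    return bundle["mlb"].inverse_transform(bundle["clf"].predict(features))[0]


print(predict_skills("web developer"))
```

Files written by joblib should only be loaded from trusted sources and with the same scikit-learn version that created them.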


Oleh Dubetsky | LinkedIn