Multi-Label Text Classification for Job Skills Prediction
Multi-label text classification is a method where each input (in this case, a job title) can be assigned multiple labels (skills in this scenario).
Here’s a step-by-step guide to building a multi-label text classification model for predicting skills based on job titles:
Step 1: Data Collection
Job Titles and Skills Data: Collect a dataset containing job titles and corresponding skills. Each job title should be associated with a set of skills.
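As a rough illustration of the layout the later steps expect, the sketch below builds a tiny DataFrame with a `jobtitle` column and a `skills` column. The rows and the semicolon delimiter are invented for illustration; real data may arrive in a different shape.

```python
import pandas as pd

# Hypothetical rows, invented only to show the expected layout.
raw = pd.DataFrame({
    "jobtitle": ["Web Developer", "Data Analyst", "DevOps Engineer"],
    "skills": ["HTML;CSS;JavaScript", "SQL;Excel;Python", "Docker;Linux;Python"],
})

# Split each delimited string into a list of skills, one list per job title.
# The ';' delimiter is an assumption about how the raw data is stored.
df = raw.assign(skills=raw["skills"].str.split(";"))
print(df)
```

Keeping each `skills` entry as a list (rather than a single string) matters later, because the label binarizer treats every element of the entry as one label.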
Step 2: Data Preprocessing
- Text Cleaning: Clean the job titles by removing special characters, converting to lowercase, and performing other text preprocessing steps.
- Tokenization: Tokenize the job titles into words or subwords.
- Label Binarization: Convert the list of skills into a binary format where each skill is represented as a binary value (1 if the skill is present, 0 otherwise).
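The cleaning and tokenization steps above could look something like the following minimal sketch. The helper names `clean_title` and `tokenize` are hypothetical, and the choice to keep `+`, `#`, and `.` (so that titles mentioning C++, C#, or .NET stay recognizable) is an assumption, not part of the original pipeline.

```python
import re

def clean_title(title: str) -> str:
    """Lowercase a job title and replace unwanted characters with spaces."""
    title = title.lower()
    # Keep letters, digits, whitespace, and the symbols + # . (assumed useful for tech titles).
    title = re.sub(r"[^a-z0-9+#.\s]", " ", title)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", title).strip()

def tokenize(title: str) -> list:
    """Split a cleaned title into whitespace-separated word tokens."""
    return clean_title(title).split()

print(clean_title("Sr. Software Engineer (C++/Python)!!"))
print(tokenize("  Data   Scientist "))
```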
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
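# Label binarization
# df is assumed to be a pandas DataFrame with a 'jobtitle' column (strings)
# and a 'skills' column in which each entry is a list of skill names.
# If an entry is a plain string, MultiLabelBinarizer treats each character as a label.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['skills'])
classes = mlb.classes_
print(classes)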
Step 3: Feature Extraction
- Vectorization: Convert the text data into numerical format using techniques like TF-IDF, word embeddings (Word2Vec, GloVe), or transformer-based embeddings (BERT, RoBERTa).
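# TF-IDF vectorization: word unigrams, bigrams, and trigrams, capped at the 1,000 most frequent terms
# Note: fitting on the full dataset before the split lets the test titles influence the vocabulary and IDF weights;
# for a stricter evaluation, fit the vectorizer on the training titles only.
vectorizer = TfidfVectorizer(analyzer='word', max_features=1000, ngram_range=(1, 3), stop_words=None)
X = vectorizer.fit_transform(df['jobtitle'])
print(vectorizer.vocabulary_)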
Step 4: Model Building
Choose a Model: Select a model that supports multi-label classification, either natively or through one of the multi-label strategies listed further down. Common choices include:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- Deep Learning models (e.g., LSTM, CNN, Transformer-based models)
Multi-Label Strategy: Use techniques such as:
- Binary Relevance: Treat each label as a separate binary classification problem.
- Classifier Chains: Consider label dependencies by chaining binary classifiers.
- Label Powerset: Treat each unique combination of labels as a single label in a multi-class classification problem.
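To make the classifier-chain strategy concrete, here is a minimal sketch using scikit-learn's `ClassifierChain`. The data is synthetic (generated with `make_multilabel_classification`) and only stands in for the TF-IDF features and binarized skills; the `_demo` names, the random chain order, and `max_iter=1000` are illustrative assumptions.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in for the feature matrix and the binary label matrix.
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# Each classifier in the chain sees the original features plus the
# predictions of the classifiers that come before it in the chain.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order="random", random_state=0)
chain.fit(X_demo, y_demo)

chain_pred = chain.predict(X_demo)
print(chain_pred.shape)  # one column per label
```

Binary relevance, by contrast, is what wrapping a binary classifier in `OneVsRestClassifier` amounts to for multi-label targets: one independent classifier per label, with no information shared between labels.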
Step 5: Training the Model
- Split Data: Split the data into training and testing sets.
- Model Training: Train the selected model on the training data.
- Hyperparameter Tuning: Tune the model’s hyperparameters using cross-validation techniques.
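The hyperparameter tuning step might be sketched with `GridSearchCV` as below. This is only an illustration on synthetic data: the grid values, the `f1_micro` scoring choice, and the 3-fold setting are assumptions, not settings used elsewhere in this article.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the TF-IDF features and binarized skills.
X_demo, y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# The 'estimator__' prefix routes the parameter to the wrapped LogisticRegression.
param_grid = {"estimator__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    param_grid,
    scoring="f1_micro",  # micro-averaged F1 across all labels
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```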
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
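# Model building: one logistic regression per skill (binary relevance)
model = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
model.fit(X_train, y_train)
# Predict on the held-out titles before scoring
y_pred = model.predict(X_test)
import numpy as np
def j_score(y_true, y_pred):
    # Mean per-sample Jaccard similarity (|intersection| / |union|), as a percentage.
    # A row where both the true and predicted label sets are empty yields 0/0 (NaN).
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    return jaccard.mean() * 100
j_score(y_test, y_pred)
#47.25109916714178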
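from sklearn.svm import LinearSVC
# Linear SVM with an L1 penalty, again wrapped so that one classifier is trained per skill.
# dual=False is needed because LinearSVC does not support the L1 penalty in its dual formulation.
svm = LinearSVC(C=1.5, penalty='l1', dual=False)
clf = OneVsRestClassifier(svm)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
j_score(y_test, y_pred)
#70.20806495547572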
Step 6: Model Evaluation
Metrics: Use appropriate evaluation metrics for multi-label classification, such as:
- Hamming Loss
- F1 Score (macro, micro, weighted)
- Precision and Recall
Evaluation: Evaluate the model on the test data using the selected metrics.
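As a small worked example of these metrics, the sketch below scores a hand-made prediction matrix with scikit-learn's metric functions. The two toy arrays are invented for illustration and are unrelated to the job-skills results reported above.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

# Toy ground truth and predictions: 4 samples, 3 labels (invented for illustration).
y_true_demo = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred_demo = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# Hamming loss: fraction of label cells that are wrong (3 of 12 here, so 0.25).
hl = hamming_loss(y_true_demo, y_pred_demo)

# Micro averaging pools every label cell: 4 true positives, 1 false positive, 2 false negatives.
micro_p = precision_score(y_true_demo, y_pred_demo, average="micro")  # 4 / 5
micro_r = recall_score(y_true_demo, y_pred_demo, average="micro")     # 4 / 6
micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")        # 8 / 11

# Macro averaging computes F1 per label (1.0, 0.5, 2/3) and takes the unweighted mean.
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(hl, micro_p, micro_r, micro_f1, macro_f1)
```

Micro averaging is dominated by frequent skills, while macro averaging gives rare skills the same weight as common ones, so reporting both gives a fuller picture.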
Step 7: Deployment and Prediction
Model Deployment: Deploy the trained model to a production environment.
Prediction: Use the deployed model to predict skills for new job titles.
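# Predict skills for a new job title, reusing the fitted vectorizer and label binarizer
x = ['Web']
xt = vectorizer.transform(x)
# Map the predicted binary vector back to skill names
mlb.inverse_transform(clf.predict(xt))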
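For deployment, the fitted vectorizer, label binarizer, and classifier all need to travel together. One possible approach, sketched here with the standard-library `pickle` module on a tiny invented dataset, is to serialize the three objects as a single bundle. The `predict_skills` helper, the bundle keys, and the toy titles are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
titles = ["web developer", "frontend web developer", "data analyst", "senior data analyst"]
skills = [["html", "css"], ["html", "css"], ["sql", "excel"], ["sql", "excel"]]

vec = TfidfVectorizer()
binarizer = MultiLabelBinarizer()
clf_demo = OneVsRestClassifier(LinearSVC(dual=False))
clf_demo.fit(vec.fit_transform(titles), binarizer.fit_transform(skills))

def predict_skills(title_list, vec, clf, binarizer):
    """Vectorize new titles, predict label indicators, and map them back to skill names."""
    return binarizer.inverse_transform(clf.predict(vec.transform(title_list)))

# Serialize all three fitted objects together so they cannot drift apart.
blob = pickle.dumps({"vectorizer": vec, "binarizer": binarizer, "classifier": clf_demo})

# In a serving process, restore the bundle and predict with it.
restored = pickle.loads(blob)
before = predict_skills(["web designer"], vec, clf_demo, binarizer)
after = predict_skills(
    ["web designer"], restored["vectorizer"], restored["classifier"], restored["binarizer"]
)
print(before, after)
```

Pickled objects should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.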