Text Representation

Representing Classes / Labels in a Numerical Format

This part of the code converts the categorical labels into a numerical format suitable for training machine learning models. Specifically, it performs one-hot encoding on the labels, which transforms each label into a binary vector where exactly one element is 1 (the position assigned to that label's category) and all other elements are 0. For example, with three categories, a label belonging to the second category becomes [0, 1, 0].

texts = df['Text'].values
labels = df['category'].values

from sklearn.preprocessing import OneHotEncoder

# one hot encode the labels
# (scikit-learn >= 1.2 uses sparse_output; older releases call this parameter sparse)
encoder = OneHotEncoder(sparse_output=False)
labels = encoder.fit_transform(labels.reshape(-1, 1))

One-Hot Encoding:

The OneHotEncoder class from scikit-learn performs one-hot encoding on the categorical labels.
encoder = OneHotEncoder(sparse_output=False): This line initializes the encoder. Setting sparse_output=False makes it return a dense NumPy array instead of a SciPy sparse matrix. The parameter was named sparse before scikit-learn 1.2, and the old name was removed in 1.4, so use sparse=False only on older installations.
labels = encoder.fit_transform(labels.reshape(-1, 1)): The encoder expects a 2D input, so reshape(-1, 1) turns the 1D label array into a single column. fit_transform then learns the set of categories and returns one one-hot row per label, with one column per category.
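To make the encoding concrete, here is a minimal sketch on a toy label array. The names toy_labels and toy_encoder and the category strings are made up for illustration and are not part of the original dataset. To stay independent of the scikit-learn version, the sketch keeps the encoder's default sparse output and converts it with .toarray() rather than passing sparse_output or sparse.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels, standing in for df['category'].values
toy_labels = np.array(["sport", "tech", "sport", "politics"])

# Default encoder returns a sparse matrix; .toarray() makes it dense
toy_encoder = OneHotEncoder()
one_hot = toy_encoder.fit_transform(toy_labels.reshape(-1, 1)).toarray()

# Columns follow the sorted category order learned during fit
categories = toy_encoder.categories_[0]

# inverse_transform maps one-hot rows back to the original labels
decoded = toy_encoder.inverse_transform(one_hot).ravel()

print(categories)
print(one_hot)
print(decoded)
```

Each row of one_hot contains a single 1, and its column index matches the position of the label in categories.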

Save the encoder for future use

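# save the encoder
import os
import joblib

os.makedirs('encoder', exist_ok=True)  # joblib.dump does not create missing directories
joblib.dump(encoder, 'encoder/encoder.joblib')

Save Encoder:

The fitted encoder is saved with joblib.dump() so that the same label-to-column mapping can be reused later, for example to turn model predictions back into category names.
os.makedirs('encoder', exist_ok=True): This line creates the encoder directory if it does not already exist; joblib.dump() fails when the target directory is missing.
joblib.dump(encoder, 'encoder/encoder.joblib'): This line saves the encoder object to a file named encoder.joblib in the encoder directory.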
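The sketch below illustrates one way the saved encoder might be used later: it is reloaded with joblib.load() and its categories_ attribute maps a predicted column index back to a category name. The temporary directory, the toy categories, and the scores array (a stand-in for a model's output) are all assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit a small encoder on hypothetical categories
enc = OneHotEncoder()
enc.fit(np.array(["sport", "tech", "politics"]).reshape(-1, 1))

# Save to a temporary directory so the example leaves no files behind
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "encoder.joblib")
joblib.dump(enc, path)

# Later (for example in an inference script): load the encoder back
restored = joblib.load(path)

# Hypothetical model output: one score per category column
scores = np.array([[0.1, 0.7, 0.2]])
predicted_index = scores.argmax(axis=1)

# Map the winning column index back to its category name
predicted_label = restored.categories_[0][predicted_index]
print(predicted_label)
```

Because the restored encoder carries the same category order as the one used in training, the column index of a prediction always refers to the same category.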

Create Train Test Splits from dataset

This part of the code splits the dataset into two separate sets: one for training the machine learning model and one for testing its performance. Let's break down the relevant code:

from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = \
    train_test_split(texts, labels, test_size=0.2, random_state=42)

train_test_split() Function:
- This function is from the sklearn.model_selection module and is commonly used to split datasets into random train and test subsets.
- texts is the input data (text data) that we want to split into training and testing sets.
- labels are the corresponding labels for the input data.
- test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the test set. Here, it's set to 0.2, meaning 20% of the data will be used for testing, and the remaining 80% will be used for training.
- random_state=42: This parameter sets the random seed for reproducibility. It ensures that every time you run the code, you get the same random split. This is useful for debugging and ensuring consistent results.

Output Variables:
- train_texts: This variable contains the training portion of the input data (texts). These are the text samples that will be used to train the model.
- test_texts: This variable contains the testing portion of the input data. These text samples will be used to evaluate the trained model's performance.
- train_labels: This variable contains the corresponding labels for the training data (train_texts). These labels will be used to train the model to predict the correct category for each text sample.
- test_labels: This variable contains the corresponding labels for the testing data (test_texts). These labels will be used to evaluate the model's predictions and calculate performance metrics.
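The behaviour described above can be checked on a small toy dataset. In this sketch, toy_texts and toy_labels are invented stand-ins for texts and labels; it shows the 80/20 split sizes, that texts and labels stay aligned after shuffling, and that repeating the call with the same random_state yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples; label i belongs to text "doc i"
toy_texts = np.array([f"doc {i}" for i in range(10)])
toy_labels = np.arange(10)

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42
)

# Same seed, same inputs: the split is repeated exactly
tr_x2, te_x2, tr_y2, te_y2 = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42
)

print(len(tr_x), len(te_x))  # 8 training samples, 2 test samples
print(list(zip(te_x, te_y)))  # each text is still paired with its label
```

Changing random_state or omitting it produces a different shuffle, which is why a fixed seed is useful when comparing runs.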
