Text Representation
Representing Classes / Labels in Numerical Format
This part of the code converts the categorical labels into a numerical format suitable for training machine learning models. Specifically, it performs one-hot encoding on the labels, which transforms each label into a binary vector where only one element is 1 (indicating the presence of that label) and all other elements are 0.
One-Hot Encoding:
The OneHotEncoder class from scikit-learn is used to perform one-hot encoding on the categorical labels.
encoder = OneHotEncoder(sparse=False): This line initializes the one-hot encoder object with the parameter sparse=False, which ensures that the output will be a dense array instead of a sparse matrix. In scikit-learn 1.2 and later this parameter is named sparse_output; the old sparse name was removed in version 1.4.
labels = encoder.fit_transform(labels.reshape(-1, 1)): This line fits the one-hot encoder to the categorical labels (labels) after reshaping them into a 2D array with a single column, because the encoder expects one column per feature. It then transforms the labels into one-hot encoded vectors.
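The two lines above can be tried on a small example. This is a minimal sketch, not the project's actual data: the label values are invented for illustration, and the try/except is only there so the snippet runs on scikit-learn versions before and after the sparse parameter was renamed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels; the real dataset is not shown in this section.
labels = np.array(["sports", "politics", "tech", "sports"])

# Newer scikit-learn uses `sparse_output`; older releases use `sparse`.
try:
    encoder = OneHotEncoder(sparse_output=False)
except TypeError:
    encoder = OneHotEncoder(sparse=False)

# reshape(-1, 1) turns the 1D label array into a single-column 2D array.
onehot = encoder.fit_transform(labels.reshape(-1, 1))

print(encoder.categories_[0])  # the categories learned during fit
print(onehot.shape)            # one row per label, one column per category
print(onehot)
```

Each row contains exactly one 1, in the column of that row's category. The column order follows encoder.categories_, which scikit-learn sorts, so it does not depend on the order in which labels first appear.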
Save the encoder for future use
Save Encoder:
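The fitted encoder is saved for future use using the joblib.dump() function.
joblib.dump(encoder, 'encoder/encoder.joblib'): This line saves the encoder object to a file named encoder.joblib in the encoder directory. The directory must already exist, since joblib.dump() does not create it. Reloading the same encoder later guarantees that predictions are mapped back to labels with the same category order used during training.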
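A save-and-reload round trip might look like the sketch below. The labels are invented, and the encoder is written under a temporary directory rather than the project's encoder/ folder so the snippet can run anywhere; both choices are assumptions made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels, fitted with the default (sparse) encoder settings.
labels = np.array(["sports", "politics", "tech", "sports"]).reshape(-1, 1)
encoder = OneHotEncoder().fit(labels)

# Create the target directory first: joblib.dump does not create it.
out_dir = os.path.join(tempfile.mkdtemp(), "encoder")
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "encoder.joblib")

joblib.dump(encoder, path)
restored = joblib.load(path)

# The restored encoder produces the same vectors as the original.
original_vectors = encoder.transform(labels).toarray()
restored_vectors = restored.transform(labels).toarray()
print(np.array_equal(original_vectors, restored_vectors))

# inverse_transform maps one-hot vectors back to label strings.
decoded = restored.inverse_transform(restored_vectors).ravel()
print(decoded)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.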
Create Train Test Splits from dataset
This part of the code splits the dataset into two separate sets: one for training the machine learning model and one for testing its performance. Let's break down the relevant code:
train_test_split() Function:
This function is from the sklearn.model_selection module and is commonly used to split datasets into random train and test subsets.
texts is the input data (text data) that we want to split into training and testing sets.
labels are the corresponding labels for the input data.
test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the test set. Here, it's set to 0.2, meaning 20% of the data will be used for testing, and the remaining 80% will be used for training.
random_state=42: This parameter sets the random seed for reproducibility. It ensures that every time you run the code, you get the same random split. This is useful for debugging and ensuring consistent results.
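The effect of test_size and random_state can be checked directly. The sketch below uses 50 made-up samples (an assumption, since the real dataset is not shown here) and calls train_test_split() twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 placeholder texts with alternating labels.
texts = np.array([f"sample text {i}" for i in range(50)])
labels = np.arange(50) % 2

# Two calls with identical arguments, including the same random_state.
first = train_test_split(texts, labels, test_size=0.2, random_state=42)
second = train_test_split(texts, labels, test_size=0.2, random_state=42)

# test_size=0.2 puts 20% of the 50 samples in the test set.
n_train, n_test = len(first[0]), len(first[1])
print(n_train, n_test)  # 40 10

# The same seed reproduces exactly the same split.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```

Leaving random_state unset would give a different shuffle on each run, which makes results harder to compare between experiments.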
Output Variables:
train_texts: This variable contains the training portion of the input data (texts). These are the text samples that will be used to train the model.
test_texts: This variable contains the testing portion of the input data. These text samples will be used to evaluate the trained model's performance.
train_labels: This variable contains the corresponding labels for the training data (train_texts). These labels will be used to train the model to predict the correct category for each text sample.
test_labels: This variable contains the corresponding labels for the testing data (test_texts). These labels will be used to evaluate the model's predictions and calculate performance metrics.
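The four output variables come back in a fixed order, and each text stays paired with its own label. The exact call is not shown in this section, so the following is a sketch of what it presumably looks like, using ten invented documents and placeholder one-hot labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: ten documents spread over three classes.
texts = [f"document {i}" for i in range(10)]
label_ids = np.arange(10) % 3
labels = np.eye(3)[label_ids]  # placeholder one-hot labels, shape (10, 3)

# Return order: train inputs, test inputs, train labels, test labels.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(train_texts), len(test_texts))     # 8 2
print(train_labels.shape, test_labels.shape)

# Check that every training text is still paired with its original label.
lookup = dict(zip(texts, label_ids))
aligned = all(
    lookup[text] == int(np.argmax(label))
    for text, label in zip(train_texts, train_labels)
)
print(aligned)  # True
```

Because texts and labels are shuffled with the same permutation, row i of train_labels always describes train_texts[i], and the same holds for the test pair.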