
Text Representation

Representing Classes / Labels into numerical format

This step converts the categorical labels into a numerical format suitable for training machine learning models. It applies one-hot encoding, which turns each label into a binary vector with a 1 in the position of that label's class and 0 everywhere else.

from sklearn.preprocessing import OneHotEncoder

texts = df['Text'].values
labels = df['category'].values

# one-hot encode the labels
# (scikit-learn 1.2+ names this parameter sparse_output; older releases use sparse)
encoder = OneHotEncoder(sparse_output=False)
labels = encoder.fit_transform(labels.reshape(-1, 1))

One-Hot Encoding:

  • The OneHotEncoder class from scikit-learn is used to perform one-hot encoding on the categorical labels.

  • encoder = OneHotEncoder(sparse_output=False): This line initializes the one-hot encoder with sparse_output=False, so the output is a dense NumPy array instead of a sparse matrix. In scikit-learn versions before 1.2 the same parameter is called sparse; that name was deprecated in 1.2 and removed in 1.4.

  • labels = encoder.fit_transform(labels.reshape(-1, 1)): The reshape turns the 1D label array into a 2D array with a single column, which is the input shape the encoder expects. fit_transform then learns the set of categories and returns one row per sample with one column per category.
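To make the effect of one-hot encoding concrete, here is a minimal sketch on a handful of made-up labels. The names toy_labels and toy_encoder and the category values are assumptions for illustration only; they are not taken from the tutorial's dataset. The sketch leaves the encoder at its default sparse output and calls .toarray(), which sidesteps the parameter name difference between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels standing in for df['category'].values
toy_labels = np.array(["sport", "tech", "business", "tech"])

# The default output is a sparse matrix; .toarray() converts it to a dense array
toy_encoder = OneHotEncoder()
one_hot = toy_encoder.fit_transform(toy_labels.reshape(-1, 1)).toarray()

# Categories are stored in sorted order; column i of one_hot belongs to category i
learned_categories = toy_encoder.categories_[0]

# inverse_transform maps one-hot rows back to the original label strings
decoded = toy_encoder.inverse_transform(one_hot).ravel()

print(learned_categories)
print(one_hot)
print(decoded)
```

Each row of one_hot contains exactly one 1, and the column order follows categories_, which is worth checking before interpreting model outputs.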

Save the encoder for future use

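# save the encoder
import os
import joblib

os.makedirs('encoder', exist_ok=True)  # joblib.dump does not create missing directories
joblib.dump(encoder, 'encoder/encoder.joblib')

Save Encoder:

  • The fitted encoder is saved with joblib.dump() so that the same mapping between labels and vectors can be reused later.

  • os.makedirs('encoder', exist_ok=True): This line creates the encoder directory if it does not already exist. Without it, joblib.dump() fails when the directory is missing.

  • joblib.dump(encoder, 'encoder/encoder.joblib'): This line saves the encoder object to a file named encoder.joblib in the encoder directory.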
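One plausible reason to keep the encoder is to turn model outputs back into label names later. The sketch below saves a small encoder, reloads it with joblib.load(), and maps a score vector to a label. The categories, the temporary directory, and the scores array are all assumptions made up for this example; the scores are not the output of any model in this tutorial.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical encoder fitted on three made-up categories
fitted = OneHotEncoder().fit(np.array([["business"], ["sport"], ["tech"]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Mirror the encoder/encoder.joblib layout inside a throwaway directory
    encoder_path = os.path.join(tmp_dir, "encoder", "encoder.joblib")
    os.makedirs(os.path.dirname(encoder_path), exist_ok=True)
    joblib.dump(fitted, encoder_path)
    restored = joblib.load(encoder_path)

# Made-up class scores for one sample, one column per category
scores = np.array([[0.1, 0.7, 0.2]])

# The highest-scoring column indexes into the stored category list
predicted_index = scores.argmax(axis=1)
predicted_label = restored.categories_[0][predicted_index]
print(predicted_label)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.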

Create Train Test Splits from dataset

This part of the code splits the dataset into two separate sets: one for training the machine learning model and one for testing its performance. Let's break down the relevant code:

from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = \
    train_test_split(texts, labels, test_size=0.2, random_state=42)

  1. train_test_split() Function:

    • This function is from the sklearn.model_selection module and is commonly used to split datasets into random train and test subsets.

    • texts is the input data (text data) that we want to split into training and testing sets.

    • labels are the corresponding labels for the input data.

    • test_size=0.2: This parameter specifies the proportion of the dataset that should be allocated to the test set. Here, it's set to 0.2, meaning 20% of the data will be used for testing, and the remaining 80% will be used for training.

    • random_state=42: This parameter sets the random seed for reproducibility. With the same input data and the same seed, every run produces the same split, which is useful for debugging and for comparing results across experiments.

  2. Output Variables:

    • train_texts: This variable contains the training portion of the input data (texts). These are the text samples that will be used to train the model.

    • test_texts: This variable contains the testing portion of the input data. These text samples will be used to evaluate the trained model's performance.

    • train_labels: This variable contains the corresponding labels for the training data (train_texts). These labels will be used to train the model to predict the correct category for each text sample.

    • test_labels: This variable contains the corresponding labels for the testing data (test_texts). These labels will be used to evaluate the model's predictions and calculate performance metrics.
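The behaviour described above can be checked on a small synthetic dataset. In this sketch, toy_texts and toy_labels are placeholders invented for illustration (ten short strings and two alternating one-hot classes); only the train_test_split() call mirrors the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder documents with alternating one-hot labels for two classes
toy_texts = np.array([f"document {i}" for i in range(10)])
toy_labels = np.eye(2)[np.arange(10) % 2]

# Run the same split twice with the same seed
split_a = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=42)
split_b = train_test_split(toy_texts, toy_labels, test_size=0.2, random_state=42)

train_texts, test_texts, train_labels, test_labels = split_a

# 80% of 10 samples go to training, 20% to testing
print(len(train_texts), len(test_texts))
print(train_labels.shape, test_labels.shape)

# Identical seeds and inputs give identical splits
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```

Texts and labels are shuffled together, so each row of train_labels still belongs to the text at the same position in train_texts.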

