Dataset Preparation

Select Dataset:

Before diving into building a text classification model, you need data to work with. Look for a dataset that has text documents along with labels that tell you what category each document belongs to. For example, if you're interested in classifying news articles, you'd want a dataset where each article is labeled with its topic or category, like sports, politics, or entertainment.

We will use the Swahili news classification dataset from Kaggle: https://www.kaggle.com/datasets/waalbannyantudre/swahili-news-classification-dataset

# Use pandas to read the dataset
import pandas as pd

data_path = 'data/SwahiliNewsClassificationDataset.csv'
df = pd.read_csv(data_path)
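
Before preprocessing, it can help to check how many rows were loaded, whether any text is missing, and how the labels are distributed. The following is a minimal sketch, not part of the original walkthrough: it uses a small in-memory stand-in frame so it runs without the CSV, the helper name summarize is hypothetical, and the label column name category is an assumption (check df.columns for the real names).

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV; swap in the real df.
# "content" is the text column used later in this guide;
# "category" is an assumed label column name.
sample_df = pd.DataFrame({
    "content": ["Timu ya taifa imeshinda 2-0.", "Bunge limepitisha bajeti.", None],
    "category": ["michezo", "kitaifa", "kitaifa"],
})

def summarize(frame, text_col="content", label_col="category"):
    """Return row count, missing-text count, and per-label counts."""
    return {
        "rows": len(frame),
        "missing_text": int(frame[text_col].isna().sum()),
        "label_counts": frame[label_col].value_counts().to_dict(),
    }

summary = summarize(sample_df)
print(summary)
```

A heavily skewed label distribution or a large number of missing texts is worth knowing about before training.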

Preprocess Dataset:

Once you've got your dataset, it's important to clean it up and get it ready for the model. This means removing characters that add little for classification, like punctuation, digits, and special symbols, from the text.

Here we will use a regular expression to normalize the text:

import re
def normalize_text(text):
    # Remove punctuation, numbers, and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

This part of the code defines a function called normalize_text that is used to preprocess text data. Let's break down what each step does:

  1. import re: This imports the Python module re, which stands for "regular expressions". Regular expressions are a powerful tool for pattern matching and string manipulation.

  2. def normalize_text(text): This line defines a function named normalize_text that takes a single argument text, which represents the input text that needs to be normalized.

  3. text = re.sub(r'[^a-zA-Z\s]', '', text): This line uses the re.sub() function to perform a substitution operation on the input text. The regular expression pattern r'[^a-zA-Z\s]' matches any character that is not an ASCII letter (a-z or A-Z) or whitespace. The sub() function then replaces all such characters with an empty string, effectively removing them from the text. Note that this also removes accented and other non-ASCII letters.

  4. text = text.lower(): This line converts all the remaining characters in the text to lowercase using the lower() method. After this step, words that differ only in capitalization (for example "Habari" and "habari") are treated as the same word, which is usually what you want for text classification.

  5. return text: Finally, the function returns the normalized text after applying the above transformations.
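
To see these steps on concrete input, here is a small worked example. The sample sentence is made up for illustration. Because removed characters are replaced with an empty string, stripping "2-0" leaves two spaces in a row; the last line shows one optional way to collapse them, which is an addition and not part of the function above.

```python
import re

def normalize_text(text):
    # Same logic as the function above, repeated so this example runs on its own
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

sample = "Timu ya Taifa imeshinda 2-0 leo!"  # made-up example sentence
result = normalize_text(sample)
print(repr(result))     # 'timu ya taifa imeshinda  leo' (note the double space)

# Optional extra step: collapse runs of whitespace into single spaces
collapsed = " ".join(result.split())
print(repr(collapsed))  # 'timu ya taifa imeshinda leo'
```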

Apply the function to the dataset:

# Normalize the content column and store the result in a new Text column
df['Text'] = df['content'].apply(normalize_text)
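
One thing to watch for: if any row has a missing value in the content column, pandas stores it as NaN rather than a string, and re.sub() will raise a TypeError. The sketch below shows one possible way to guard against that, assuming it is acceptable to treat missing text as an empty string. The demo frame is a made-up stand-in for the real dataset.

```python
import re
import pandas as pd

def normalize_text(text):
    # Same logic as the function above, repeated so this example runs on its own
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

# Tiny stand-in frame; the real df comes from pd.read_csv above
demo = pd.DataFrame({"content": ["Habari za Leo!", None, "Bei: 500 TSh"]})

# Replace missing values with empty strings and make sure every value is a str
demo["Text"] = demo["content"].fillna("").astype(str).apply(normalize_text)
normalized = demo["Text"].tolist()
print(normalized)
```

Alternatively, rows with missing text could be dropped with dropna(subset=["content"]) before normalizing; which option fits depends on how much data is affected.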
