Dataset Preparation
Select Dataset:
Before diving into building a text classification model, you need data to work with. Look for a dataset that has text documents along with labels that tell you what category each document belongs to. For example, if you're interested in classifying news articles, you'd want a dataset where each article is labeled with its topic or category, like sports, politics, or entertainment.
We will use the Swahili news dataset from https://www.kaggle.com/datasets/waalbannyantudre/swahili-news-classification-dataset
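As a rough sketch of what "text plus label" looks like in practice, the snippet below reads a CSV file into (text, label) pairs and counts how many documents fall under each label. The column names content and category, and the helper names load_labeled_rows and label_counts, are assumptions made for illustration; check the actual file you download and adjust them.

```python
import csv
from collections import Counter


def load_labeled_rows(path, text_col="content", label_col="category"):
    """Read (text, label) pairs from a CSV file.

    text_col and label_col are assumed column names, not confirmed
    against the Kaggle dataset; change them to match your file.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [(row[text_col], row[label_col]) for row in reader]


def label_counts(rows):
    """Count how many documents carry each label."""
    return Counter(label for _, label in rows)
```

Calling label_counts(load_labeled_rows(path)) on your downloaded file gives a quick view of how balanced the categories are before any training.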
Preprocess Dataset:
Once you've got your dataset, it's important to clean it up and get it ready for the model. This means removing characters that carry little signal for classification, such as punctuation, numbers, and other symbols.
Here we will use a regular expression to normalize the dataset.
The normalization is done by a function called normalize_text that preprocesses the text data. Let's break down what each step does:
import re: This imports the Python module re, which stands for "regular expressions". Regular expressions are a powerful tool for pattern matching and string manipulation.
def normalize_text(text): This line defines a function named normalize_text that takes a single argument, text, which represents the input text that needs to be normalized.
text = re.sub(r'[^a-zA-Z\s]', '', text): This line uses the re.sub() function to perform a substitution on the input text. The regular expression pattern r'[^a-zA-Z\s]' matches any character that is not an ASCII letter (a-z or A-Z) or whitespace. re.sub() then replaces every such character with an empty string, removing it from the text.
text = text.lower(): This line converts the remaining characters to lowercase using the lower() method. This way, words that differ only in capitalization are treated as the same word, which is helpful for text processing tasks.
return text: Finally, the function returns the normalized text after applying the above transformations.
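Putting the steps described above together, the function can be sketched as follows. This is reconstructed from the line-by-line description, so treat it as a minimal version, not necessarily the exact original code.

```python
import re


def normalize_text(text):
    # Drop every character that is not an ASCII letter or whitespace
    # (punctuation, digits, and other symbols are removed).
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Lowercase what remains so capitalization does not matter.
    text = text.lower()
    return text
```

For example, normalize_text("Habari Za Leo!!!") returns "habari za leo". Note that whitespace is left untouched, so removing a number or symbol that sits between two spaces leaves both spaces in place.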
Apply the function to the dataset
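One way to do this, assuming the dataset has been loaded as a list of dictionaries with a text field, is sketched below. The field names content and category, the helper normalize_rows, and the two sample rows are invented for illustration and are not taken from the real dataset.

```python
import re


def normalize_text(text):
    # Same normalization as described earlier: keep ASCII letters and
    # whitespace only, then lowercase.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()


def normalize_rows(rows, text_key="content"):
    """Return new rows with the text field normalized; labels are untouched.

    text_key is an assumed field name; change it to match your data.
    """
    return [{**row, text_key: normalize_text(row[text_key])} for row in rows]


# Made-up sample rows, only to show the shape of the data.
sample_rows = [
    {"content": "Simba YASHINDA!", "category": "michezo"},
    {"content": "Bunge Laanza Leo.", "category": "kitaifa"},
]

cleaned = normalize_rows(sample_rows)
for row in cleaned:
    print(row["category"], "|", row["content"])
```

If the data is held in a pandas DataFrame instead, the equivalent step would be df["content"] = df["content"].apply(normalize_text), again assuming the text column is named content.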