Tokenization
Tokenization:
Tokenization is the process of breaking down text into smaller units, usually words or individual characters. In our case, we're breaking down sentences into words.
We set a maximum number of words to consider, called max_words. This caps the vocabulary at the max_words most frequent words and keeps the size of our data manageable. We create a Tokenizer object, which is the tool that tokenizes our text. We then call tokenizer.fit_on_texts(train_texts) to teach the tokenizer about the words in our training data; this step builds a vocabulary from the words it encounters in train_texts.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
max_words = 1000 # Adjust based on your dataset size
max_len = 200 # Adjust based on the average length of your news articles
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)
train_data = pad_sequences(train_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)
Convert Text to Sequences:
Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.
Each word in the text is replaced by a unique integer, its index in the tokenizer's vocabulary (lower indices correspond to more frequent words).
This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.
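To make the mapping concrete, here is a small, self-contained sketch with a tiny made-up corpus (separate from the pipeline above); the exact indices you see depend on word frequencies in your own data.

from keras.preprocessing.text import Tokenizer

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_tokenizer = Tokenizer(num_words=1000)
toy_tokenizer.fit_on_texts(toy_texts)

# Vocabulary: more frequent words get smaller indices, starting at 1
print(toy_tokenizer.word_index)
# e.g. {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'on': 5, 'mat': 6}

# Each word is replaced by its vocabulary index
print(toy_tokenizer.texts_to_sequences(toy_texts))
# e.g. [[1, 3, 2], [1, 4, 2, 5, 1, 6]]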
Padding Sequences:
Neural networks expect inputs to have a consistent length, but our sequences might be different lengths.
So, we use padding to make all sequences the same length. We choose a maximum length, max_len, add zeros to the beginning or end of sequences that are shorter, and trim sequences that are longer. This ensures that all sequences have the same length, which is necessary for training our model.
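As a quick standalone illustration of what pad_sequences does with made-up sequences (the real pipeline above applies it to train_sequences and test_sequences): by default Keras pads and truncates at the beginning ('pre'), and you can pass padding='post' and truncating='post' to work from the end instead.

from keras.preprocessing.sequence import pad_sequences

toy_sequences = [[1, 3, 2], [1, 4, 2, 5, 1, 6]]

# Default: pad with zeros at the start, truncate from the start
print(pad_sequences(toy_sequences, maxlen=4))
# [[0 1 3 2]
#  [2 5 1 6]]

# Pad and truncate at the end instead
print(pad_sequences(toy_sequences, maxlen=4, padding='post', truncating='post'))
# [[1 3 2 0]
#  [1 4 2 5]]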
Save Tokenizer:
Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.
This is helpful because we want to use the same vocabulary and encoding scheme for both training and prediction.
import json
tokenizer_json = tokenizer.to_json()
with open('tokenizer/tf_tokenizer.json', 'w', encoding='utf-8') as f:
json.dump(tokenizer_json, f)
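Note that the tokenizer/ directory must already exist before the save code above runs, since open() will not create it. When you need the tokenizer again at prediction time, you can load it back with tokenizer_from_json. Here is a hedged sketch, assuming the file written above and the same max_len used for training; because json.dump stored the JSON string itself, json.load returns that string, which tokenizer_from_json can consume directly.

import json
from keras.preprocessing.text import tokenizer_from_json
from keras.preprocessing.sequence import pad_sequences

max_len = 200  # must match the value used during training

with open('tokenizer/tf_tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer = tokenizer_from_json(json.load(f))

# Preprocess new text exactly as the training data was preprocessed
new_sequences = tokenizer.texts_to_sequences(["some unseen news article"])
new_data = pad_sequences(new_sequences, maxlen=max_len)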