Tokenization
Tokenization:
Tokenization is the process of breaking down text into smaller units, usually words or individual characters. In our case, we're breaking down sentences into words.
We set a maximum number of words to consider, called max_words. This limits the vocabulary to the most frequent words and helps manage the size of our data.
We create a Tokenizer object, which is the tool that does the tokenizing for us.
We use tokenizer.fit_on_texts(train_texts) to teach the tokenizer about the words in our training data. This step builds a vocabulary based on the words it encounters in train_texts.
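A minimal sketch of this step, assuming the Keras Tokenizer; the max_words value and the sample train_texts are illustrative placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

max_words = 10000  # keep only the 10,000 most frequent words (example value)
train_texts = [
    "the movie was great",
    "the movie was terrible",
]  # placeholder training sentences

# Create the tokenizer and build its vocabulary from the training texts
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)

print(tokenizer.word_index)  # e.g. {'the': 1, 'movie': 2, 'was': 3, ...}
```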
Convert Text to Sequences:
Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.
Each word in the text is replaced by a unique number, which represents its position in the tokenizer's vocabulary.
This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.
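Continuing the sketch above, the conversion is a single call (the printed output is illustrative):

```python
# Replace each word with its index in the tokenizer's vocabulary
train_sequences = tokenizer.texts_to_sequences(train_texts)
print(train_sequences)  # e.g. [[1, 2, 3, 4], [1, 2, 3, 5]]
```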
Padding Sequences:
Neural networks expect inputs to have a consistent length, but our sequences might be different lengths.
So, we use padding to make all sequences the same length. We choose a maximum length, max_len, and add zeros to the beginning or end of sequences that are shorter, or we trim sequences that are longer.
This ensures that all sequences have the same length, which is necessary for training our model.
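A sketch of the padding step using Keras pad_sequences; the max_len value and the choice to pad and truncate at the end ('post') are assumptions, since the section does not specify them:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 100  # example maximum sequence length

# Pad shorter sequences with zeros and truncate longer ones so every
# sequence has exactly max_len entries ('post' acts at the end of each
# sequence; Keras defaults to 'pre').
padded_sequences = pad_sequences(
    train_sequences, maxlen=max_len, padding="post", truncating="post"
)
print(padded_sequences.shape)  # (number_of_texts, max_len)
```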
Save Tokenizer:
Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.
This is helpful because we want to use the same vocabulary and encoding scheme for both training and prediction.
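One common way to do this is with pickle; the file name and the pickling approach here are assumptions, since the section does not say how the tokenizer is saved (tokenizer.to_json() is an alternative):

```python
import pickle

# Persist the fitted tokenizer so the same vocabulary and encoding
# scheme can be reused at prediction time
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Later, load it back before preprocessing new text
with open("tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)
```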