Tokenization

  • Tokenization is the process of breaking text into smaller units, called tokens, usually words or characters. In our case, we break each news article into words.

  • We set a maximum vocabulary size, max_words, so that only the most frequent words are kept when texts are converted to sequences. This keeps the size of our data (and the model's input space) manageable.

  • We create a Tokenizer object, the Keras utility that splits text into tokens and maps each token to an integer index.

  • We call tokenizer.fit_on_texts(train_texts) to teach the tokenizer about the words in our training data. This step builds the vocabulary from the words it encounters in train_texts; fitting only on the training set keeps the test data from influencing the vocabulary. A short sketch after the code below shows how to inspect the result.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 1000  # Maximum vocabulary size; adjust based on your dataset size
max_len = 200     # Maximum sequence length; adjust based on the average length of your news articles

# Build the vocabulary from the training texts only
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)

# Replace each word with its integer index in the vocabulary
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad (or truncate) every sequence to exactly max_len entries
train_data = pad_sequences(train_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)
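
To see what fitting produced, you can inspect the tokenizer directly. A minimal sketch, assuming the tokenizer above has already been fitted on train_texts:

# word_index maps every word seen during fitting to an integer index,
# with index 1 assigned to the most frequent word.
print(len(tokenizer.word_index))                # size of the full vocabulary
print(list(tokenizer.word_index.items())[:5])   # first few entries; most frequent words come first

# Note: word_index keeps every word seen during fitting, but
# texts_to_sequences only emits indices below num_words (max_words);
# rarer words are simply dropped from the sequences.
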
  1. Convert Text to Sequences:

    • Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.

    • Each word in the text is replaced by its integer index in the tokenizer's vocabulary (more frequent words receive smaller indices); words outside the top max_words are simply dropped.

    • This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.

  2. Padding Sequences:

    • Neural networks expect inputs to have a consistent length, but our sequences might be different lengths.

    • So we use padding: we choose a maximum length, max_len, and pad_sequences adds zeros to sequences that are shorter and trims sequences that are longer. By default the zeros are added, and the trimming happens, at the beginning of each sequence; pass padding='post' or truncating='post' to work from the end instead.

    • The result is a rectangular array of shape (num_samples, max_len), which is what the model needs for training. A small worked example follows this list.

  3. Save Tokenizer:

    • Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.

    • This is important because new text must be encoded with exactly the same vocabulary and word-to-index mapping that the model was trained with; a sketch of loading the tokenizer back follows the save code below.
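
Before moving on to saving, here is a small self-contained illustration of steps 1 and 2, using toy sentences (hypothetical examples, not the actual news data) to show how texts_to_sequences and pad_sequences behave:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy_texts = ["the market rallied today", "stocks fell"]  # hypothetical example sentences

toy_tokenizer = Tokenizer(num_words=50)
toy_tokenizer.fit_on_texts(toy_texts)

sequences = toy_tokenizer.texts_to_sequences(toy_texts)
print(sequences)
# Two sequences of different lengths, e.g. [[1, 2, 3, 4], [5, 6]]

print(pad_sequences(sequences, maxlen=5))
# Both rows now have length 5, with zeros added on the left by default:
# [[0 1 2 3 4]
#  [0 0 0 5 6]]
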

import json

# Serialize the tokenizer's configuration and vocabulary to a JSON string
tokenizer_json = tokenizer.to_json()

# Write it to disk (the tokenizer/ directory must already exist)
with open('tokenizer/tf_tokenizer.json', 'w', encoding='utf-8') as f:
    json.dump(tokenizer_json, f)
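
Later, at prediction time, the saved tokenizer can be loaded back and applied to new articles so they are encoded exactly as during training. A minimal sketch, assuming the file written above and a hypothetical list new_texts of raw article strings:

import json
from keras.preprocessing.text import tokenizer_from_json
from keras.preprocessing.sequence import pad_sequences

with open('tokenizer/tf_tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer = tokenizer_from_json(json.load(f))  # json.load undoes the json.dump above

new_sequences = tokenizer.texts_to_sequences(new_texts)  # new_texts: raw article strings (hypothetical)
new_data = pad_sequences(new_sequences, maxlen=200)      # use the same max_len as during training
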
