Tokenization

  • Tokenization is the process of breaking text into smaller units, called tokens, usually words or characters. In our case, we break each news article into words.

  • We set a maximum vocabulary size, max_words, so that only the most frequent words are kept when texts are converted to sequences. This keeps the size of our data (and the model's input space) manageable.

  • We create a Tokenizer object, the Keras utility that splits text into tokens and maps each token to an integer index.

  • We call tokenizer.fit_on_texts(train_texts) to teach the tokenizer about the words in our training data. This step builds the vocabulary from the words it encounters in train_texts; fitting only on the training set keeps the test data from influencing the vocabulary. A short sketch after the code below shows how to inspect the result.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 1000  # Maximum vocabulary size; adjust based on your dataset size
max_len = 200     # Maximum sequence length; adjust based on the average length of your news articles

# Build the vocabulary from the training texts only
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)

# Replace each word with its integer index in the vocabulary
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad (or truncate) every sequence to exactly max_len entries
train_data = pad_sequences(train_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)
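
To see what fitting produced, you can inspect the tokenizer directly. A minimal sketch, assuming the tokenizer above has already been fitted on train_texts:

# word_index maps every word seen during fitting to an integer index,
# with index 1 assigned to the most frequent word.
print(len(tokenizer.word_index))                # size of the full vocabulary
print(list(tokenizer.word_index.items())[:5])   # first few entries; most frequent words come first

# Note: word_index keeps every word seen during fitting, but
# texts_to_sequences only emits indices below num_words (max_words);
# rarer words are simply dropped from the sequences.
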
  1. Convert Text to Sequences:

    • Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.

    • Each word in the text is replaced by its integer index in the tokenizer's vocabulary (more frequent words receive smaller indices); words outside the top max_words are simply dropped.

    • This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.

  2. Padding Sequences:

    • Neural networks expect inputs to have a consistent length, but our sequences might be different lengths.

    • So we use padding: we choose a maximum length, max_len, and pad_sequences adds zeros to sequences that are shorter and trims sequences that are longer. By default the zeros are added, and the trimming happens, at the beginning of each sequence; pass padding='post' or truncating='post' to work from the end instead.

    • The result is a rectangular array of shape (num_samples, max_len), which is what the model needs for training. A small worked example follows this list.

  3. Save Tokenizer:

    • Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.

    • This is important because new text must be encoded with exactly the same vocabulary and word-to-index mapping that the model was trained with; a sketch of loading the tokenizer back follows the save code below.
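
Before moving on to saving, here is a small self-contained illustration of steps 1 and 2, using toy sentences (hypothetical examples, not the actual news data) to show how texts_to_sequences and pad_sequences behave:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy_texts = ["the market rallied today", "stocks fell"]  # hypothetical example sentences

toy_tokenizer = Tokenizer(num_words=50)
toy_tokenizer.fit_on_texts(toy_texts)

sequences = toy_tokenizer.texts_to_sequences(toy_texts)
print(sequences)
# Two sequences of different lengths, e.g. [[1, 2, 3, 4], [5, 6]]

print(pad_sequences(sequences, maxlen=5))
# Both rows now have length 5, with zeros added on the left by default:
# [[0 1 2 3 4]
#  [0 0 0 5 6]]
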

import json

# Serialize the tokenizer's configuration and vocabulary to a JSON string
tokenizer_json = tokenizer.to_json()

# Write it to disk (the tokenizer/ directory must already exist)
with open('tokenizer/tf_tokenizer.json', 'w', encoding='utf-8') as f:
    json.dump(tokenizer_json, f)
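
Later, at prediction time, the saved tokenizer can be loaded back and applied to new articles so they are encoded exactly as during training. A minimal sketch, assuming the file written above and a hypothetical list new_texts of raw article strings:

import json
from keras.preprocessing.text import tokenizer_from_json
from keras.preprocessing.sequence import pad_sequences

with open('tokenizer/tf_tokenizer.json', 'r', encoding='utf-8') as f:
    tokenizer = tokenizer_from_json(json.load(f))  # json.load undoes the json.dump above

new_sequences = tokenizer.texts_to_sequences(new_texts)  # new_texts: raw article strings (hypothetical)
new_data = pad_sequences(new_sequences, maxlen=200)      # use the same max_len as during training
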
