
Tokenization

  • Tokenization is the process of breaking text down into smaller units, usually words or individual characters. In our case, we break sentences down into words.

  • We set a maximum number of words to consider, called max_words. Only the max_words most frequent words are used when text is converted to numbers, which keeps the size of our data manageable.

  • We create a Tokenizer object, the Keras utility that builds the vocabulary and turns text into sequences of integers.

  • We call tokenizer.fit_on_texts(train_texts) to teach the tokenizer the words in our training data. This step builds a vocabulary from the words it encounters in train_texts.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_words = 1000  # Vocabulary size: keep only the 1,000 most frequent words; adjust based on your dataset size
max_len = 200     # Length of every sequence after padding/truncating; adjust based on the average length of your news articles

# Build the vocabulary from the training texts only.
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)

# Replace each word with its integer index in the vocabulary.
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad (or truncate) every sequence to exactly max_len so the model sees fixed-size inputs.
train_data = pad_sequences(train_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)
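
To see what the tokenizer has learned, it can help to inspect its vocabulary and run a single sentence through the same steps. The sketch below assumes the tokenizer, max_len, and imports from the block above; the example sentence is made up for illustration.

# Inspect the vocabulary built by fit_on_texts (word -> integer index, ordered by frequency).
print(list(tokenizer.word_index.items())[:5])

# Convert one hypothetical sentence: words outside the max_words most frequent are simply dropped.
sample_sequence = tokenizer.texts_to_sequences(["breaking news about the economy today"])
print(sample_sequence)

# Pad it to max_len; by default zeros are added at the beginning ('pre' padding).
sample_padded = pad_sequences(sample_sequence, maxlen=max_len)
print(sample_padded.shape)  # (1, max_len)
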
  1. Convert Text to Sequences:

    • Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.

    • Each word in the text is replaced by a unique number, which represents its position in the tokenizer's vocabulary.

    • This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.

  2. Padding Sequences:

    • Neural networks expect inputs of a consistent length, but our sequences may all be different lengths.

    • So, we use padding to make every sequence the same length. We choose a maximum length, max_len, pad shorter sequences with zeros (at the beginning, by default), and truncate longer ones.

    • This ensures that all sequences have the same length, which is necessary for training our model.

  3. Save Tokenizer:

    • Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.

    • This is helpful because we want to use the same vocabulary and encoding scheme for both training and prediction.

import json
import os

os.makedirs('tokenizer', exist_ok=True)  # Make sure the output directory exists.

# Serialize the tokenizer (vocabulary + configuration) to a JSON string and write it to disk.
tokenizer_json = tokenizer.to_json()

with open('tokenizer/tf_tokenizer.json', 'w', encoding='utf-8') as f:
    json.dump(tokenizer_json, f)
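
Later, for example at inference time, the saved file can be reloaded so new text is encoded with exactly the same vocabulary. A minimal sketch, assuming the file written above and a hypothetical new article (tokenizer_from_json is the loader that ships with Keras):

import json
from keras.preprocessing.text import tokenizer_from_json
from keras.preprocessing.sequence import pad_sequences

# json.load undoes the json.dump above and returns the tokenizer's JSON string.
with open('tokenizer/tf_tokenizer.json', 'r', encoding='utf-8') as f:
    loaded_tokenizer = tokenizer_from_json(json.load(f))

# Preprocess a new, hypothetical article exactly as the training data was preprocessed.
new_texts = ["example headline about the economy"]
new_data = pad_sequences(loaded_tokenizer.texts_to_sequences(new_texts), maxlen=200)  # maxlen must match max_len used in training
print(new_data.shape)  # (1, 200)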