# Tokenization

**Tokenization:**

* Tokenization is the process of breaking down text into smaller units, usually words or individual characters. In our case, we're breaking down sentences into words.
* We set a maximum vocabulary size, called `max_words`, so that only the most frequent words are kept (Keras keeps the `max_words - 1` most common ones). This helps manage the size of our data.
* We create a `Tokenizer` object, which is like a tool that helps us tokenize our text.
* We use `tokenizer.fit_on_texts(train_texts)` to teach the tokenizer about the words in our training data. This step builds a vocabulary based on the words it encounters in `train_texts`.
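
To make the vocabulary-building step more concrete, here is a minimal pure-Python sketch of the idea: count the words, then give the most frequent word the smallest index. It is a simplified illustration, not Keras's implementation. The helper `build_word_index` and the sample sentences are hypothetical, and unlike the real `Tokenizer` this sketch does not strip punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter()
    for text in texts:
        # Simplification: lowercase and split on whitespace only.
        counts.update(text.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Start at 1 so that index 0 stays free for padding.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sample_texts = ["the cat sat", "the dog sat", "the bird flew"]  # hypothetical data
word_index = build_word_index(sample_texts)
print(word_index)
```

With a fitted Keras tokenizer, the comparable mapping is available as `tokenizer.word_index`.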

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
```

```python
max_words = 1000  # Vocabulary size; adjust based on your dataset
max_len = 200      # Adjust based on the average length of your news articles

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_texts)

train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

train_data = pad_sequences(train_sequences, maxlen=max_len)
test_data = pad_sequences(test_sequences, maxlen=max_len)
```
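
To illustrate what the two calls above do, the following pure-Python sketch mimics their behaviour under Keras's default settings: words outside the vocabulary are skipped, short sequences are padded with zeros on the left, and long sequences keep only their last `maxlen` items. The helpers `texts_to_ids` and `pad_pre` and the tiny vocabulary are hypothetical stand-ins, not part of Keras.

```python
def texts_to_ids(texts, word_index, num_words):
    """Replace each known word by its index; skip words outside the vocabulary."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        # Mirror the num_words cut-off: only indices below num_words are kept.
        sequences.append([i for i in ids if i < num_words])
    return sequences


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


word_index = {"the": 1, "sat": 2, "cat": 3, "dog": 4}  # hypothetical vocabulary
texts = ["the cat", "the dog sat on the cat sat"]       # hypothetical data
sequences = texts_to_ids(texts, word_index, num_words=4)
padded = pad_pre(sequences, maxlen=3)
print(sequences)
print(padded)
```

Here `dog` (index 4) is dropped by the `num_words=4` cut-off, and `on` is dropped because it never appeared in the vocabulary.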

1. **Convert Text to Sequences:**
   * Once the tokenizer is trained, we use it to convert our text data into sequences of numbers.
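   * Each word in the text is replaced by its integer index in the tokenizer's vocabulary. Indices are assigned by frequency, so the most common word gets index 1. Words that fall outside the `max_words` limit, or that never appeared in the training data, are dropped.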
   * This creates sequences of numbers that represent our text data in a format that our machine learning model can understand.
2. **Padding Sequences:**
   * Neural networks expect inputs to have a consistent length, but our sequences might be different lengths.
   * So, we use padding to make all sequences the same length. We choose a maximum length, `max_len`, pad shorter sequences with zeros, and trim longer ones. By default, `pad_sequences` pads and trims at the beginning of a sequence; pass `padding='post'` or `truncating='post'` to work at the end instead.
   * This ensures that all sequences have the same length, which is necessary for training our model.
3. **Save Tokenizer:**
   * Finally, we save our tokenizer to a file so that we can use it later to preprocess new text data in the same way.
   * This is helpful because we want to use the same vocabulary and encoding scheme for both training and prediction.

```python
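import json
import os

# to_json() already returns a JSON string
tokenizer_json = tokenizer.to_json()

# Make sure the target directory exists before writing
os.makedirs('tokenizer', exist_ok=True)
with open('tokenizer/tf_tokenizer.json', 'w', encoding='utf-8') as f:
    json.dump(tokenizer_json, f)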
```
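
Loading the tokenizer back could look like the sketch below. Because `json.dump` is applied to a value that is already a JSON string, `json.load` returns that string (not a dict), which is the form `tokenizer_from_json` expects. The helper `load_tokenizer` is a hypothetical name, and the demo part uses a made-up stand-in string and a temporary file so that it runs without Keras installed.

```python
import json
import os
import tempfile


def load_tokenizer(path):
    """Rebuild a Tokenizer from a file written with json.dump(tokenizer.to_json(), f)."""
    # Imported lazily so the rest of this sketch runs without Keras installed.
    from keras.preprocessing.text import tokenizer_from_json
    with open(path, encoding="utf-8") as f:
        tokenizer_json = json.load(f)  # yields the original JSON string
    return tokenizer_from_json(tokenizer_json)


# Hypothetical stand-in for the string returned by tokenizer.to_json().
fake_tokenizer_json = json.dumps({"class_name": "Tokenizer", "config": {"num_words": 1000}})

path = os.path.join(tempfile.mkdtemp(), "tf_tokenizer.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(fake_tokenizer_json, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(type(loaded).__name__)  # the file round-trips to a string
```

At prediction time, the idea would be to call `load_tokenizer('tokenizer/tf_tokenizer.json')` and then apply `texts_to_sequences` and `pad_sequences` with the same `max_len` used during training.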

