Evaluation

Model evaluation is the process of assessing how well a trained machine learning model performs on data it has not seen before. It measures how well the model generalizes to new examples and gives insight into its predictive capabilities and limitations.

Here's an overview of the model evaluation process:

  1. Test Data: Unseen data, called the test dataset, is used to evaluate the trained model. This dataset should be distinct from the data used for training and validation to ensure an unbiased assessment of the model's performance.

  2. Prediction: The trained model is used to make predictions on the test dataset. Input examples from the test dataset are fed into the model, and corresponding output predictions are generated.

  3. Performance Metrics: Performance metrics are calculated to quantify how well the model performs on the test dataset. Which metrics are appropriate depends on the task; common choices include accuracy, precision, recall, F1-score, and mean squared error (a short metrics sketch follows this list).

    • Accuracy: The proportion of correctly classified examples out of the total number of examples.

    • Precision: The proportion of true positive predictions out of all positive predictions, measuring the model's ability to avoid false positives.

    • Recall (Sensitivity): The proportion of true positive predictions out of all actual positive examples, measuring the model's ability to find all relevant instances.

    • F1-score: The harmonic mean of precision and recall, providing a balanced measure of the model's performance.

    • Mean Squared Error (MSE): The average squared difference between predicted and actual values for regression tasks.

  4. Visualization: Visualizations such as confusion matrices, ROC curves, precision-recall curves, and calibration plots provide further insight into the model's behavior across different thresholds and classes (a minimal confusion-matrix example also follows this list).

  5. Comparison: The model's performance is compared to baseline models or other existing models to assess its relative effectiveness and identify areas for improvement.

  6. Iterative Improvement: Based on the evaluation results, adjustments may be made to the model's architecture, hyperparameters, or training process to enhance performance and address any deficiencies observed during evaluation.

  7. Generalization: The model's ability to generalize to unseen data from the same distribution is assessed to ensure that it can make accurate predictions in real-world scenarios beyond the training and test datasets.
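
To make the metrics in step 3 concrete, here is a minimal sketch using scikit-learn. The arrays y_true and y_pred are small hypothetical label arrays, not output from our model:

# compute common classification metrics with scikit-learn (hypothetical labels)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error

y_true = [0, 1, 1, 0, 1, 0]   # actual class labels
y_pred = [0, 1, 0, 0, 1, 1]   # predicted class labels

print('Accuracy : %f' % accuracy_score(y_true, y_pred))
print('Precision: %f' % precision_score(y_true, y_pred))
print('Recall   : %f' % recall_score(y_true, y_pred))
print('F1-score : %f' % f1_score(y_true, y_pred))

# mean squared error applies to regression targets rather than class labels
print('MSE      : %f' % mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.1, 2.0]))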

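The confusion matrix mentioned in step 4 can be plotted in a couple of lines. Again a sketch with the same hypothetical labels, assuming scikit-learn 1.0+ and matplotlib are installed:

# plot a confusion matrix for hypothetical labels
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
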
Example from our trained model

# evaluate the model
loss, accuracy = model.evaluate(test_data, test_labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

In this code snippet, we are evaluating the trained model using the test dataset to determine its performance. Here's what each part does:

  1. Model Evaluation:

    • The evaluate() method is called on the trained model (model) to assess its performance on the test dataset.

    • The method takes three arguments:

      • test_data: The input features (text data) from the test dataset.

      • test_labels: The corresponding true labels or targets from the test dataset.

      • verbose=0: Controls how much progress information is displayed during evaluation; setting it to 0 suppresses the progress output.

  2. Output:

    • The evaluate() method returns the loss value followed by any metrics the model was compiled with; here, the loss and the accuracy computed on the test dataset.

    • We unpack these two values into the variables loss and accuracy.

  3. Print Accuracy:

    • The accuracy is then printed to the console using print().

    • The accuracy value is multiplied by 100 to express it as a percentage and is formatted with the %f placeholder.

  4. Interpretation:

    • The printed accuracy value represents the percentage of correctly classified examples in the test dataset.

    • A higher accuracy indicates better performance of the model in making predictions on unseen data.

    • It provides a quantitative measure of the model's effectiveness and can be used to compare different models or assess the impact of changes made during model development.
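
Accuracy alone can hide weaknesses on individual classes. As a rough sketch, assuming test_labels are one-hot encoded (adjust the argmax step if they are plain integer labels), a per-class report can be produced from the model's predictions:

# per-class precision/recall/F1 from the model's predictions
# assumes test_labels are one-hot encoded
import numpy as np
from sklearn.metrics import classification_report

probabilities = model.predict(test_data, verbose=0)
predicted_classes = np.argmax(probabilities, axis=1)
true_classes = np.argmax(test_labels, axis=1)

print(classification_report(true_classes, predicted_classes))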

Saving your model

# save the model
model.save('models/starter_swahili_news_classification_model.h5')
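
If you need the model again later, for example in a separate prediction script, the saved file can be loaded back with Keras's load_model:

# load the saved model back into memory
from tensorflow.keras.models import load_model

loaded_model = load_model('models/starter_swahili_news_classification_model.h5')
loaded_model.summary()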
