Exploratory Data Analysis (EDA)

Understanding Data and its Characteristics

Overview: Before applying machine learning algorithms, it's crucial to understand the dataset's structure, features, and distributions.

Key Concepts:

  • Dataset Exploration: Analyze the dataset's size, dimensions, and data types.

  • Feature Inspection: Examine individual features (columns) to identify their types (numerical, categorical), ranges, and distributions.

  • Target Variable Analysis: Understand the distribution and characteristics of the target variable (if applicable).
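The steps above can be sketched with pandas; the tiny dataset here is hypothetical, invented purely for illustration:

```python
import pandas as pd

# A small hypothetical dataset for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "city": ["Dar", "Dodoma", "Dar", "Mwanza", "Dodoma"],
    "purchased": [0, 1, 1, 1, 0],  # target variable
})

# Dataset exploration: size, dimensions, and data types
print(df.shape)    # (rows, columns)
print(df.dtypes)

# Feature inspection: ranges and distributions of numerical columns
print(df.describe())

# Target variable analysis: class distribution
print(df["purchased"].value_counts())
```

On a real dataset the same calls reveal imbalances, suspicious ranges, and mixed types before any modelling begins.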

Data Visualization with Matplotlib and Seaborn

Matplotlib:

  • Matplotlib is a comprehensive library for creating static, interactive, and animated visualizations in Python.

  • It provides functionalities for creating line plots, scatter plots, histograms, bar charts, and more.
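A minimal Matplotlib sketch of two common EDA plots, using randomly generated data as a stand-in for real features:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical numerical feature, drawn from a normal distribution
values = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single feature
ax1.hist(values, bins=20)
ax1.set_title("Feature distribution")

# Scatter plot: relationship between a feature and a noisy target
noise = np.random.default_rng(1).normal(size=200)
ax2.scatter(values, values * 2 + noise)
ax2.set_title("Feature vs. target")

fig.savefig("eda_plots.png")
```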

Seaborn:

  • Seaborn is built on top of Matplotlib and offers a higher-level interface for creating attractive statistical graphics.

  • It simplifies the process of creating complex visualizations such as heatmaps, violin plots, and pair plots.
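For example, a correlation heatmap takes one Seaborn call where plain Matplotlib would need several; the three features below are synthetic placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "feature_c": rng.normal(size=100),
})

# Heatmap of pairwise feature correlations, annotated with the values
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
ax.set_title("Correlation heatmap")
plt.savefig("heatmap.png")
```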

Data Preprocessing Techniques: Handling Missing Values, Encoding Categorical Variables, and Feature Scaling

Handling Missing Values:

  • Identify missing values in the dataset and decide on an appropriate strategy for handling them (e.g., imputation, deletion).

  • Imputation methods include mean, median, or mode imputation, as well as more advanced techniques such as K-nearest neighbors (KNN) imputation.
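A minimal pandas sketch of identifying and imputing missing values, on a small made-up dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing entries
df = pd.DataFrame({
    "income": [3000.0, np.nan, 4500.0, 5200.0, np.nan],
    "rooms": [2, 3, np.nan, 4, 3],
})

# Identify missing values per column
print(df.isna().sum())

# Mean imputation for a numerical column
df["income"] = df["income"].fillna(df["income"].mean())

# Median imputation, which is more robust to outliers
df["rooms"] = df["rooms"].fillna(df["rooms"].median())

print(df)
```

Deletion (`df.dropna()`) is the simpler alternative when few rows are affected and the data can afford the loss.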

Encoding Categorical Variables:

  • Categorical variables need to be converted into numerical format before feeding them into machine learning algorithms.

  • Common encoding techniques include one-hot encoding, label encoding, and ordinal encoding.
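The three techniques can be sketched in pandas; the `city` and `size` columns here are invented examples:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Dar", "Dodoma", "Mwanza", "Dar"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal encoding: map categories with a natural order to integers
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)

# Label encoding: assign an arbitrary integer per category
df["city_encoded"] = df["city"].astype("category").cat.codes

print(one_hot)
print(df)
```

One-hot encoding avoids implying a false ordering between categories, at the cost of extra columns; ordinal encoding is appropriate only when the categories genuinely rank.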

Feature Scaling:

  • Features often have different scales, which can affect the performance of some machine learning algorithms.

  • Feature scaling methods, such as Min-Max scaling and Standardization (Z-score scaling), are used to bring all features to a similar scale.
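Both methods are available in scikit-learn; a small sketch on two deliberately mismatched feature scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-Max scaling: rescales each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (Z-score): zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```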

Data Splitting: Training, Validation, and Testing Sets

Train-Validation-Test Split:

  • Split the dataset into three subsets: training set, validation set, and test set.

  • The training set is used to train the machine learning model, the validation set is used to tune hyperparameters and evaluate model performance during training, and the test set is used to assess the final model's performance.
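A common way to get all three subsets is to call scikit-learn's `train_test_split` twice; the 60/20/20 proportions below are one conventional choice, not a fixed rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First split off the test set (20%), then carve a validation set
# from the remainder (25% of 80% = 20% of the original data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```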

Stratified Sampling:

  • Ensure that the distribution of classes (if present) is maintained across the training, validation, and test sets, especially in classification tasks.

  • Stratified sampling techniques help preserve class proportions when splitting the data.
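With scikit-learn, passing `stratify=y` to `train_test_split` does this automatically; the 80/20 class imbalance below is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(100, 1)

# stratify=y preserves the 80/20 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(np.bincount(y_train))  # class counts keep the 80/20 ratio
print(np.bincount(y_test))
```

Without stratification, a small test set could by chance contain almost none of the minority class, making the evaluation unreliable.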

Cross-Validation:

  • Cross-validation is a technique used to assess model performance by splitting the data into multiple subsets (folds) and training the model on different combinations of these subsets.

  • It provides a more robust estimate of the model's performance compared to a single train-test split.
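A 5-fold cross-validation sketch using scikit-learn, with a synthetic classification dataset standing in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold CV: train on 4 folds, evaluate on the 5th, then rotate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # a more robust estimate than a single split
```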

Data Leakage Prevention:

  • Fit preprocessing steps (e.g., feature scaling, encoding) on the training set only, then apply the fitted transformations to the validation and test sets. Fitting on the full dataset leaks information from the held-out data into training and biases the evaluation of the model.
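A sketch of the fit-on-train, transform-on-both pattern with a scaler; the random data is a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features
X = np.random.default_rng(0).normal(size=(100, 3))
y = np.random.default_rng(1).integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training set ONLY ...
scaler = StandardScaler().fit(X_train)

# ... then apply the learned statistics to both sets. Fitting on the
# full dataset would leak test-set statistics into training.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 on the training set
```

The same discipline applies to imputation and encoding: compute means, medians, or category mappings from the training set alone.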
