Data and Its Importance in AI — Artificial Intelligence for Managers

Improved AI outcomes come from the right data, not just better algorithms. Data is the fuel that powers AI — without it, even the smartest model is blind.

Talvinder Singh, from a Pragmatic Leaders AI course session

AI is not magic; it is mathematics grounded in data. Your actual job as a PM is to treat data as the foundation of every AI system. Without the right data, your AI product will underperform or fail outright.

The trap is thinking that better algorithms alone will fix your problems. What I tell PMs is this: you cannot out-code bad data. Your model’s quality and your product’s impact depend first and foremost on the data you feed it.

This lesson teaches you how to think about data’s role in AI, what kinds of data matter, and how to use data-driven metrics to improve your AI product.

Why data matters more than you think

AI systems learn patterns from data. The better the data — in quantity, quality, and relevance — the better the AI’s decisions.

Imagine you are building a computer vision system to detect defects in manufactured goods. If your training images are blurry, mislabeled, or unrepresentative of real defects, your model will fail in production. No amount of algorithmic tuning will fix that.

In practice, Indian companies run into a recurring set of data challenges:

Messy data: Many enterprises have inconsistent formats, missing fields, or unstructured text in multiple languages.
Limited labeled data: Labeling data is expensive and time-consuming, yet crucial for supervised learning.
Bias and noise: Data may reflect historical biases or errors, leading to unfair or inaccurate AI outcomes.

Understanding these realities is the first step to building AI that works in India.
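One way to make the "messy data" problem concrete is a quick profile of how often required fields are missing and how many records are exact duplicates. The sketch below is a minimal illustration in plain Python; the function name `profile_records` and the sample fields (`customer_id`, `city`, `amount`) are assumptions made for the example, not part of any specific product.

```python
from collections import Counter

def profile_records(records, required_fields):
    """Report the share of records missing each required field and the duplicate count."""
    total = len(records)
    missing = Counter()
    for rec in records:
        for field in required_fields:
            value = rec.get(field)
            # Treat None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    # Count exact duplicates: every copy beyond the first occurrence.
    seen = Counter(tuple(sorted((k, str(v)) for k, v in rec.items())) for rec in records)
    duplicates = sum(count - 1 for count in seen.values())
    missing_rate = {f: (missing[f] / total if total else 0.0) for f in required_fields}
    return {"total": total, "missing_rate": missing_rate, "duplicates": duplicates}

# Hypothetical sample records, for illustration only.
sample = [
    {"customer_id": "C1", "city": "Pune", "amount": 1200},
    {"customer_id": "C2", "city": "", "amount": 450},
    {"customer_id": "C2", "city": "", "amount": 450},
    {"customer_id": "C3", "city": "Mumbai", "amount": None},
]
report = profile_records(sample, ["customer_id", "city", "amount"])
print(report)
```

A report like this gives the PM and the data team a shared, numeric starting point for deciding which fields to fix first.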

// scene:

Product review meeting at a Series B SaaS startup in Bangalore

You (PM): “Our model accuracy dropped 5% last quarter. What changed?”

Data Scientist: “The new data pipeline introduced some corrupted records. We didn't catch it in time.”

You (PM): “Let's prioritize data validation and cleaning before pushing new features. Model improvements won't matter without clean data.”

Engineering Lead: “That means delaying the new algorithm update?”

You (PM): “Yes. The foundation has to be right before we build higher.”

// tension:

AI quality depends on data hygiene, not just model tweaks.
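The "data validation before new features" decision in the scene above can be sketched as a gate that quarantines bad records and rejects a batch when too many fail. This is a minimal sketch under stated assumptions: the rule in `is_valid_transaction`, the field names, and the 1% `max_bad_rate` default are all hypothetical and would be set by your own team.

```python
def is_valid_transaction(rec):
    """Hypothetical rule: a user_id is present and amount is a positive number."""
    amount = rec.get("amount")
    has_amount = (
        isinstance(amount, (int, float))
        and not isinstance(amount, bool)
        and amount > 0
    )
    return bool(rec.get("user_id")) and has_amount

def validate_batch(records, is_valid, max_bad_rate=0.01):
    """Split a batch into good and quarantined records and decide whether to accept it."""
    good, quarantined = [], []
    for rec in records:
        (good if is_valid(rec) else quarantined).append(rec)
    bad_rate = len(quarantined) / len(records) if records else 0.0
    return {
        "accept": bad_rate <= max_bad_rate,  # reject the batch if too many records fail
        "bad_rate": bad_rate,
        "good": good,
        "quarantined": quarantined,
    }

# Hypothetical batch, for illustration only.
batch = [
    {"user_id": "U1", "amount": 499.0},
    {"user_id": "U2", "amount": -20},
    {"user_id": "", "amount": 150},
    {"user_id": "U4", "amount": 1200},
]
result = validate_batch(batch, is_valid_transaction)
print(result["accept"], result["bad_rate"], len(result["quarantined"]))
```

The design choice worth discussing with engineering is the threshold: a strict gate blocks more retraining runs, while a loose one lets corrupted records reach the model.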

What kinds of data matter in AI?

Data is not one thing. Different AI problems require different data types:

Data Type	Description	Example in Indian context
Structured	Data organized in tables or fields	Customer transaction records at PhonePe
Unstructured	Text, images, audio, video	User reviews in multiple Indian languages
Labeled	Data with human-annotated tags or categories	Annotated medical images for disease detection
Time series	Data points indexed over time	Sensor readings from agricultural drones
Transactional	Records of user actions or events	Clickstream data from Flipkart app

Your AI model’s performance depends on how well your training data matches the real-world data it will see in production.

A classic mistake is training on clean, ideal data but deploying into noisy, unpredictable environments. Swiggy’s fraud detection model, for example, must learn from thousands of real fraud patterns — including new, evolving tactics — or it will miss suspicious orders.

The role of big data in AI

Big data is often touted as the secret sauce for AI success. It is important — but not sufficient.

What matters more than sheer volume is relevance and labeling. A billion data points of irrelevant or low-quality data will not help your model learn meaningful patterns.

Indian startups like Razorpay and Meesho invest heavily in collecting high-quality, labeled datasets because they understand that data quality is the bottleneck.

// thread: #data-team — Prioritizing labeled data for fraud detection

Priya (Data Engineer): We have 10 million transactions, but only 100k are labeled fraud/non-fraud.

Rahul (PM): Can we get more labeled fraud cases? The model precision is low on rare fraud types.

Priya: Labeling takes time, but we can prioritize recent high-risk segments.

Rahul: Let's do that. Better labeled data beats bigger unlabeled sets.
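The prioritization agreed in this thread (label recent, high-risk transactions first) can be sketched as a simple selection under a fixed labeling budget. This is only an illustration: the fields `risk_score`, `days_old`, and `label`, the 30-day recency window, and the function name `pick_for_labeling` are assumptions, and a real team might rank candidates differently.

```python
def pick_for_labeling(transactions, budget, max_age_days=30):
    """Pick the highest-risk, recent, still-unlabeled transactions up to the budget."""
    candidates = [
        t for t in transactions
        if t["label"] is None and t["days_old"] <= max_age_days
    ]
    # Highest risk score first.
    candidates.sort(key=lambda t: t["risk_score"], reverse=True)
    return candidates[:budget]

# Hypothetical transactions, for illustration only.
txns = [
    {"id": "T1", "risk_score": 0.91, "days_old": 3, "label": None},
    {"id": "T2", "risk_score": 0.15, "days_old": 2, "label": None},
    {"id": "T3", "risk_score": 0.88, "days_old": 90, "label": None},     # too old
    {"id": "T4", "risk_score": 0.97, "days_old": 5, "label": "fraud"},   # already labeled
    {"id": "T5", "risk_score": 0.60, "days_old": 10, "label": None},
]
queue = pick_for_labeling(txns, budget=2)
print([t["id"] for t in queue])
```

With a budget of two, the queue contains T1 and T5: T4 is already labeled and T3 falls outside the recency window.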

Metrics and data analysis drive AI product success

Data is not just the input for models. It is also the feedback that guides product decisions.

Defining and measuring the right metrics is critical. These include:

Model metrics: accuracy, precision, recall, F1 score
User impact metrics: task completion rate, time saved, error rate experienced by users
Business metrics: conversion rate, churn, revenue uplift attributable to AI features
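For reference, the four model metrics listed above can all be computed from the counts of true positives, false positives, false negatives, and true negatives. The sketch below uses the standard definitions; the function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of flagged cases, how many were right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real cases, how many were caught
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only: 1,000 predictions, 100 of them true positives in reality.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(metrics)
```

In this example accuracy is 96% while precision and recall are both 80%, which is why accuracy alone can flatter a model when the positive class is rare.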

The trap is optimizing only for model metrics without linking them to user outcomes.

For example, improving recall from 85% to 90% may look great in isolation. But if the gain comes from a looser threshold that also produces many more false positives, precision falls, users get annoyed, and your product suffers.

The PM’s job is to translate data into decisions:

Which data quality issues are hurting model performance?
Are the AI features improving key user metrics?
How do changes in data impact business KPIs?

// scene:

AI feature review at a fintech startup in Mumbai

You (PM): “Our credit scoring model's F1 score improved, but loan approval rates dropped. Why?”

Data Scientist: “More conservative thresholds reduced false positives but also rejected borderline good customers.”

You (PM): “Let's find a balance that maximizes approvals without increasing defaults.”

Product Analyst: “I'll run simulations on different thresholds with historical data.”

// tension:

Balancing model accuracy with real-world business impact.
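The threshold simulation the analyst offers to run can be sketched as a sweep over candidate cut-offs on historical outcomes, reporting the approval rate and the default rate among approved applicants at each one. Everything here is an assumption for illustration: the `(score, defaulted)` data shape, the rule "approve when score is at or above the threshold", and the sample history.

```python
def sweep_thresholds(history, thresholds):
    """For each threshold, report approval rate and default rate among approved loans.

    history: list of (score, defaulted) pairs; an applicant is approved when score >= threshold.
    """
    rows = []
    for t in thresholds:
        approved = [defaulted for score, defaulted in history if score >= t]
        approval_rate = len(approved) / len(history) if history else 0.0
        default_rate = sum(approved) / len(approved) if approved else 0.0
        rows.append({"threshold": t, "approval_rate": approval_rate, "default_rate": default_rate})
    return rows

# Hypothetical historical outcomes, for illustration only.
history = [
    (0.95, False), (0.90, False), (0.80, False), (0.70, True), (0.65, False),
    (0.55, False), (0.50, True), (0.40, True), (0.30, False), (0.20, True),
]
rows = sweep_thresholds(history, [0.5, 0.7, 0.9])
for row in rows:
    print(row)
```

Reading the rows side by side makes the trade-off explicit: a stricter threshold lowers the default rate but also shrinks approvals, and the PM decides where on that curve the business should sit.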

Building a high-quality dataset: The PM’s role

Creating and maintaining datasets is not just a data team job. As a PM, you must understand:

How the data is collected, stored, and labeled
What biases or gaps exist in the data
How data quality affects model outputs and user experience

You should partner closely with data engineers, scientists, and domain experts to ensure:

Data pipelines are reliable and validated
Labeling guidelines are clear and consistent
Data updates reflect current realities (seasonality, market changes)

This is what week one looks like for most AI PMs: getting your hands dirty with data realities, not just model specs.

// exercise: 15 min

Assess your product’s data readiness

Pick an AI feature you work on or know well. Answer these questions:

What data is used to train the model powering this feature?
How is the data collected and labeled? Who owns this process?
What are the known data quality issues or gaps?
How often is the data updated or refreshed?
What metrics do you track to measure data quality and model performance?
How does the data reflect your real users and use cases, especially in Indian contexts?

The AI data lifecycle

Think of data in AI products as a continuous cycle:

Data collection: Gathering raw data from users, sensors, logs
Data labeling: Annotating data for supervised learning
Data cleaning: Removing errors, duplicates, and inconsistencies
Data storage: Organizing data for easy access and governance
Model training: Using data to teach the AI system
Model evaluation: Measuring performance on test data
Model deployment: Shipping the AI to users
Monitoring: Tracking real-world performance and collecting new data
Feedback loop: Feeding user corrections and new data back into training

Understanding this lifecycle helps you identify where data problems originate and how to fix them.

// thread: #ai-product — Responding to model drift with continuous data updates

Meera (PM): Our model drifted last month. What happened?

Data Scientist: User behavior changed after we launched a new feature. Our training data is stale.

Meera: We need a process to continuously collect fresh labeled data and retrain regularly.

Engineering Lead: Let's automate data pipelines and schedule retraining jobs.
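One common way to notice stale training data before accuracy drops is to compare the distribution of a feature in training data against recent production data. The sketch below uses the Population Stability Index (PSI), a technique chosen here for illustration and not named in the lesson. The bin edges, the sample values, and the often-quoted rule of thumb that a PSI above roughly 0.25 signals a significant shift are assumptions to validate with your own data science team.

```python
import math

def population_stability_index(expected, actual, bin_edges):
    """PSI between two samples of one numeric feature, using shared bin edges."""
    def shares(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are [low, high); the last bin also includes its upper edge.
                if bin_edges[i] <= v < bin_edges[i + 1] or (is_last and v == bin_edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # A small floor keeps empty bins from causing log(0).
        return [max(c / total, 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values (for example, session length), for illustration only.
training = [10, 12, 15, 18, 20, 22, 25, 30]
recent = [28, 30, 35, 38, 40, 42, 45, 48]
edges = [0, 20, 40, 60]

psi_same = population_stability_index(training, training, edges)
psi_shifted = population_stability_index(training, recent, edges)
print(psi_same, psi_shifted)
```

A check like this can run on a schedule alongside the retraining jobs, so that a large shift triggers fresh labeling and retraining instead of waiting for users to notice.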

Where data fits in the AI product strategy

Data is not just an input — it is part of your product strategy.

Good AI PMs ask:

What data advantage do we have over competitors?
How do we collect and protect proprietary data?
What data privacy and compliance issues apply in India?
How do we scale data collection as the product grows?

This is what separates AI product leaders from AI feature managers.

// learn the judgment

You are a PM at a Bangalore-based healthtech startup using AI to detect diabetic retinopathy from retinal images. The data science team reports the model accuracy is 88%, but doctors say the false negative rate is too high for clinical use.

The call: What steps do you take to improve AI outcomes, and how do you balance data quality, labeling, and model performance?
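A short calculation shows why an 88% accuracy figure and the doctors' complaint can both be true. The counts below are entirely hypothetical (they are not from the case); they are chosen only so that accuracy comes out at 88% on an imbalanced test set.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0

def false_negative_rate(tp, fn):
    """Share of truly diseased cases the model missed."""
    return fn / (tp + fn) if (tp + fn) else 0.0

# Hypothetical test set: 1,000 retinal images, 100 of which show disease.
tp, fn = 40, 60    # diseased eyes: caught vs missed
fp, tn = 60, 840   # healthy eyes: wrongly flagged vs correctly cleared

acc = accuracy(tp, fp, fn, tn)
fnr = false_negative_rate(tp, fn)
print(acc, fnr)
```

Under these assumed counts the model is 88% accurate yet misses 60% of diseased eyes, because the many healthy images dominate the accuracy figure. For a screening product, the false negative rate (or recall) on the diseased class is the number to track.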


From the field: Why data is the real moat

Test yourself: The data dilemma

// interactive:

The Data Dilemma

You are the PM at a Series A SaaS startup in Pune building an AI-powered customer support bot. The engineering team proposes launching with a small labeled dataset collected internally. The customer success team wants more diverse data from real users before launch.

You need to decide whether to launch the MVP now or delay for more data collection.

Where to go next

Learn how to identify AI opportunities: AI Product Strategy
Master the AI product lifecycle: Building AI Products
Understand ethical AI and bias: Ethical AI Practices
Improve your data literacy: Data-Driven Decision Making
