Preparing Data: Dataset — Artificial Intelligence for Managers

A high-quality training dataset is the foundation of every successful machine learning project.

Talvinder Singh, from a Pragmatic Leaders AI for Managers session

Data is not just a raw input to machine learning. It is the carrier of real-world context, constraints, and signals that a model learns from. The actual job is to create a dataset that fits your specific use case — nothing generic, nothing assumed. Without a properly scoped, high-quality dataset, your model will not learn the right patterns and will fail in production.

In practice, dataset preparation is the step where most AI projects stall or go off the rails. Teams underestimate the effort required to collect, clean, and annotate data. They treat data as an afterthought instead of the foundation. This lesson teaches you how to think about dataset preparation as a rigorous, goal-driven process.

Why dataset quality matters more than model architecture

There is an old saying in computing: garbage in, garbage out. Even the most sophisticated neural network architecture will fail to deliver value if its training data is noisy, biased, or irrelevant.

Here is the uncomfortable reality: in practice, model failures trace back to poor data quality far more often than to algorithmic choices. The data must be representative of the real-world scenarios your product will encounter. For supervised learning, it must be labeled accurately. It must capture the diversity and edge cases that matter for your users.

Consider an Indian fintech startup building a fraud detection system. If their dataset lacks examples of fraud patterns common in India — such as SIM swap fraud or UPI phishing — the model will miss these cases, causing costly false negatives. The model architecture cannot compensate for missing or skewed data.

Defining dataset scope based on your AI use case

Your first step is to clearly define what problem the AI is solving and what data is needed to solve it. This is not a technical exercise — it is product thinking.

What is the user problem or business outcome? For example, detecting fraudulent transactions within 30 seconds of initiation.
What data sources reflect this problem? Transaction logs, user device metadata, customer complaints.
What features or labels are required? Fraudulent vs legitimate transactions, device location, transaction amount.
What is the required data volume and diversity? Enough examples of fraud cases across regions, user demographics, and transaction types.

Without this clarity, you risk collecting irrelevant or insufficient data. The dataset must be scoped to answer the key questions your model needs to learn.
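
One way to make this scoping concrete is to write the four answers down as a structured record the team can review. The sketch below is illustrative only: the `DatasetScope` class and its field names are hypothetical, and the example values are taken from the fraud detection example above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetScope:
    """Hypothetical record of the four scoping answers for one AI use case."""
    business_outcome: str = ""
    data_sources: list = field(default_factory=list)
    features_and_labels: list = field(default_factory=list)
    volume_and_diversity: str = ""

    def open_questions(self):
        """Return the names of scoping questions that still have no answer."""
        return [name for name, value in vars(self).items() if not value]


scope = DatasetScope(
    business_outcome="Detect fraudulent transactions within 30 seconds of initiation",
    data_sources=["transaction logs", "user device metadata", "customer complaints"],
    features_and_labels=["fraudulent vs legitimate", "device location", "transaction amount"],
)
# The volume and diversity question has not been answered yet.
print(scope.open_questions())
```

A record like this is only a checklist. Its value is that an unanswered question becomes visible before data collection starts.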

// scene:

Sprint planning at a Series B fintech startup in Bangalore

Product Manager: “We want to build a fraud detection model. What data do we have?”

Data Engineer: “We have transaction logs for the past year, but very few confirmed fraud labels.”

Product Manager: “How do we get more labeled fraud examples? Can we use customer support tickets?”

Data Scientist: “Yes, but we'll need to clean and align those with transactions. It will take time.”

Product Manager: “Let's prioritize data collection and labeling this sprint. Without it, the model won't learn.”

// tension:

The PM pushes the team to focus on data quality before modeling

The process of dataset creation

Creating a dataset involves several iterative steps:

Data collection: Identify and gather raw data from internal systems, third-party sources, or manual inputs.
Data cleaning: Remove duplicates, handle missing values, correct errors, and normalize formats.
Data labeling/annotation: Assign ground truth labels or tags necessary for supervised learning.
Data splitting: Divide the data into training, validation, and test sets so you can detect overfitting and evaluate model performance on examples the model has not seen.
Data augmentation (optional): Create synthetic examples to balance classes or expand dataset diversity.

Each step requires domain knowledge and collaboration across product, engineering, and data science teams.
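
Two of the steps above, cleaning and splitting, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the field names `txn_id` and `is_fraud`, the 70/15/15 split, and the toy records are all assumptions. The split is stratified, meaning each label keeps roughly the same share in every subset, which matters when fraud cases are rare.

```python
import random
from collections import defaultdict


def deduplicate(records, key="txn_id"):
    """Keep the first record seen for each key value."""
    seen = set()
    unique = []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def stratified_split(records, label_key="is_fraud",
                     fractions=(0.7, 0.15, 0.15), seed=42):
    """Split into train/validation/test, splitting each label group separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data: 100 transactions, every fifth one marked as fraud, plus 10 duplicates.
records = [{"txn_id": i, "is_fraud": i % 5 == 0} for i in range(100)]
records += records[:10]

unique = deduplicate(records)
train, val, test = stratified_split(unique)
print(len(unique), len(train), len(val), len(test))
```

In a real project a team would more likely use a library routine for the split. The point here is only that each subset should contain examples of the rare class.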

// thread: #data-prep — Collaborative data cleaning discussion

Neha (Data Scientist): The raw dataset has 10% missing values in the key feature. We need to impute or remove those.

Rahul (Product Manager): Can we collect more complete data upstream? Or should we exclude those records?

Neha: Let's try both and measure impact. I'll update the data cleaning scripts accordingly.

Anjali (Engineering): I'll add monitoring alerts for missing data in the pipeline.
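
The two options discussed in this thread, excluding incomplete records or imputing the missing values, might look like the sketch below. The `amount` field and the sample values are hypothetical, and median imputation is just one common choice. Measuring impact properly would mean training on each version and comparing results on the same validation set. Here only the resulting row counts are compared.

```python
from statistics import median


def drop_missing(records, feature):
    """Option 1: exclude records where the feature is missing (None)."""
    return [r for r in records if r.get(feature) is not None]


def impute_median(records, feature):
    """Option 2: fill missing values with the median of the observed values."""
    fill = median(r[feature] for r in records if r.get(feature) is not None)
    return [
        {**r, feature: fill} if r.get(feature) is None else dict(r)
        for r in records
    ]


records = [{"amount": a} for a in [100, None, 300, 200, None]]

dropped = drop_missing(records, "amount")
imputed = impute_median(records, "amount")
print(len(dropped), "rows kept after dropping")
print(len(imputed), "rows kept after imputing")
```

Dropping loses records, which hurts most when the missing rows include rare fraud cases. Imputing keeps them but introduces values that were never observed.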

Evaluating dataset fit for your use case

Not all data is equally useful. You must evaluate dataset quality along several dimensions:

Dimension	What to check	India-specific considerations
Relevance	Does data capture the problem context?	Regional languages, vernacular content
Completeness	Are all required features and labels present?	Missing fields in legacy systems
Accuracy	Are labels and values correct and consistent?	Errors from manual data entry
Balance	Are classes or categories represented fairly?	Skewed distribution in fraud vs legit transactions
Timeliness	Is data up to date and reflecting current trends?	Rapidly evolving market conditions
Volume	Is the dataset large enough to train models?	Limited historical data in early-stage startups

You must be explicit about which data quality dimensions matter most for your problem. For example, in medical image classification, the accuracy and completeness of labels are paramount. In recommendation systems, volume and timeliness are critical.
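
Two of these dimensions, completeness and balance, are easy to quantify. The following is a minimal sketch under stated assumptions: records are plain dictionaries, a missing value is represented as `None`, and the field names are invented for illustration.

```python
from collections import Counter


def completeness(records, fields):
    """Share of records with a non-missing value, per field."""
    total = len(records)
    return {
        f: sum(1 for r in records if r.get(f) is not None) / total
        for f in fields
    }


def class_balance(records, label_key):
    """Share of records per label value."""
    counts = Counter(r[label_key] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


records = [
    {"amount": 120.0, "device_location": "Pune", "is_fraud": False},
    {"amount": 999.0, "device_location": None, "is_fraud": True},
    {"amount": 45.5, "device_location": "Delhi", "is_fraud": False},
    {"amount": 300.0, "device_location": "Kochi", "is_fraud": False},
]

print(completeness(records, ["amount", "device_location"]))
print(class_balance(records, "is_fraud"))
```

Relevance and accuracy are harder to automate and usually need a domain expert reviewing samples.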

Post-deployment monitoring of data quality

Building the dataset is not a one-time job. Data distribution changes over time — new user behaviors, new fraud patterns, or product changes can shift the input data.

You must build a post-deployment monitoring plan that tracks data drift, label quality, and model performance in production. This enables timely retraining or data refreshes before model quality degrades.

Typical monitoring metrics include:

Feature distribution shifts (mean, variance)
Unexpected missing values or nulls
Label accuracy checks via manual review
Model confidence and error rates
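
The first two metrics in this list can be computed by comparing a recent production batch of one feature against the values seen at training time. The sketch below is illustrative: the function name, the report keys, and the sample numbers are assumptions, and it assumes the baseline mean and variance are non-zero.

```python
from statistics import mean, pvariance


def drift_report(baseline, current):
    """Compare one numeric feature in production against its training baseline.

    None marks a missing value in either list.
    """
    base = [v for v in baseline if v is not None]
    cur = [v for v in current if v is not None]
    base_mean = mean(base)
    return {
        # Positive means the production mean is higher than the baseline mean.
        "relative_mean_shift": (mean(cur) - base_mean) / abs(base_mean),
        # Values far from 1.0 mean the spread of the feature has changed.
        "variance_ratio": pvariance(cur) / pvariance(base),
        # Share of production values that are missing.
        "null_rate": (len(current) - len(cur)) / len(current),
    }


training_amounts = [100, 200, 300]
production_amounts = [150, 250, 350, None]
print(drift_report(training_amounts, production_amounts))
```

What counts as too much drift depends on the feature and the product, so thresholds are a team decision rather than something the code can supply.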

Hands-on: Build a simple annotated dataset

// exercise: 15 min

Create an annotated dataset for image classification

Select a small set of images relevant to your use case — for example, photos of handwritten digits or product packaging.
Define clear labeling criteria — what classes or tags will you assign?
Use a free annotation tool (e.g., LabelImg, MakeSense.ai) to label each image accurately.
Export your labeled dataset in CSV or JSON format.
Review the dataset for consistency and completeness.
Reflect: How well does this dataset capture the diversity of real-world inputs? What challenges did you face in labeling?

This exercise will help you appreciate the effort behind dataset creation and the importance of labeling guidelines.
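
The review step can be partly automated once the labels are exported. The sketch below assumes a hypothetical export format, a list of entries with `image` and `label` keys. Real annotation tools each have their own export formats, so the loading step would differ. The file names and labels are made up.

```python
from collections import Counter


def review_annotations(annotations, image_names, allowed_labels):
    """Flag images without a label, images labeled twice, and unexpected labels."""
    labeled = Counter(a["image"] for a in annotations)
    used_labels = {a["label"] for a in annotations}
    return {
        "unlabeled": sorted(set(image_names) - set(labeled)),
        "duplicates": sorted(name for name, n in labeled.items() if n > 1),
        "unknown_labels": sorted(used_labels - set(allowed_labels)),
    }


annotations = [
    {"image": "img_001.png", "label": "3"},
    {"image": "img_002.png", "label": "7"},
    {"image": "img_002.png", "label": "1"},      # same image labeled twice
    {"image": "img_003.png", "label": "seven"},  # not in the agreed label set
]
images = ["img_001.png", "img_002.png", "img_003.png", "img_004.png"]
digits = [str(d) for d in range(10)]

report = review_annotations(annotations, images, digits)
print(report)
```

A check like this catches mechanical problems only. Whether a label is actually correct still needs a human looking at the image.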

Judgment exercise: Dataset scope at a Series A Indian healthtech startup

// learn the judgment

You are the PM at a Series A healthtech startup in Hyderabad building an AI model to detect diabetic retinopathy from retina images. Your data team has access to 5,000 labeled images from a US dataset and 500 unlabeled images collected from Indian clinics.

The call: Should you train the model solely on the US dataset, solely on the unlabeled Indian images, or a combination? How do you decide?


Monitoring data quality in production at Razorpay

// thread: #model-monitoring — Cross-functional monitoring discussion at Razorpay

Meera (Data Engineer): We've set up alerts for data drift on key transaction features.

Rahul (PM): Great. What thresholds trigger a retraining request?

Meera: If the distribution mean shifts by more than 10% for two consecutive days.

Rahul: Also, can we monitor label accuracy? We get new fraud labels weekly from the ops team.

Meera: Yes, we'll automate sample checks and flag discrepancies.
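
The retraining rule described in this thread, a mean shift of more than 10% on two consecutive days, can be written as a small check. This is a sketch of that rule only. The function name, the baseline value, and the daily figures are hypothetical, and it assumes a non-zero baseline mean.

```python
def needs_retraining(baseline_mean, daily_means, threshold=0.10, consecutive_days=2):
    """Return True if the daily mean is off by more than `threshold`
    (relative to the baseline) for `consecutive_days` days in a row."""
    streak = 0
    for day_mean in daily_means:
        shift = abs(day_mean - baseline_mean) / abs(baseline_mean)
        streak = streak + 1 if shift > threshold else 0
        if streak >= consecutive_days:
            return True
    return False


baseline = 1000  # mean transaction amount in the training data (made-up value)

# Shifts of 5%, 12%, 8%, 15%, 13%: the last two days exceed 10% back to back.
print(needs_retraining(baseline, [1050, 1120, 1080, 1150, 1130]))

# Shifts of 12%, 5%, 15%, 0%: no two consecutive days exceed 10%.
print(needs_retraining(baseline, [1120, 1050, 1150, 1000]))
```

Requiring consecutive days is a deliberate design choice. It avoids triggering a retraining request on a single unusual day, such as a festival sale.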

The entire profession in one line: Your dataset is your product's foundation

If you cannot answer these questions about your dataset — where it comes from, how complete and accurate it is, and how you will maintain it — you are not ready to build or ship an AI product.

Everything else — model selection, hyperparameter tuning, deployment pipelines — is downstream of that one job.

Where to go next

Learn how to measure model performance and avoid bias: Measuring AI Impact and Bias
Understand AI project workflows end-to-end: AI Project Lifecycle
Develop skills to work effectively with AI teams: Working with AI Teams
Explore ethical considerations in AI data collection: Ethical AI Data Practices