Amazon Product Optimization Case Study: AI Analyst for a Seller Portfolio

Why did this product’s profit drop last month? On Amazon, that simple question has an expensive answer. Sales and traffic live in one report. Ad spend sits in another, across three campaign types. Fees, refunds and reimbursements arrive in settlement data weeks later. Inventory splits between Amazon’s warehouses and the company’s own. Price history, keyword ranks and competitor moves belong to third-party tools.

Vive Health sells medical equipment across the US – wheelchairs, braces, therapy supplies – with hundreds of products on Amazon. Every week, managers answered that question product by product: open tool after tool, export the data, line up the dates, reason toward a cause. Hours per product, quality depending on who did it, and no record of how the conclusion was reached.

They asked us to build a system that does the analysis itself – from raw data to root cause to recommended action.

The Problem

No single dashboard explains a profit change. The daily sales report says units fell. Only the ads data shows a bidding war doubled the cost per click. Only the price tracker shows a competitor undercut the price two weeks earlier. Only the storage report shows a surcharge on aging inventory eating the margin. A correct diagnosis needs all of it at once – and a manager assembling it by hand spends most of the time on collection, not thinking.

The numbers also refuse to sit still. Amazon attributes ad sales for up to two weeks after a click and accepts returns for a month after a sale, so a report pulled Tuesday contradicts the same report pulled Friday. Both are correct for their moment.

A skilled manager works through one product in an afternoon. With hundreds of products, most of the portfolio never gets that afternoon. Problems surface only in the totals – months after the cause.

The Data Foundation: One Lake, Every Metric, Per Product, Per Day

Before any AI, we built the foundation it reasons over. Every night, ingestion jobs pull from every system the business runs on – Amazon’s Selling Partner and Ads APIs, the ERP, profit analytics, price history, keyword intelligence, listing data, the team’s own product tracker – into a single PostgreSQL data lake.

At its core: the complete financial picture of every product at daily grain – sales, advertising, fees, refunds, costs, margin – reconciled once and shared downstream. Around it, daily snapshots of inventory on both sides, storage fees, keyword rank positions, and badge observations: the day a product gained or lost Amazon’s Choice, the day it was flagged as frequently returned.

Amazon

Selling Partner API

sales, fees, inventory, listings

Ads API

spend across three campaign types

Market intelligence

Price & rank history

day-by-day timelines

Keyword intelligence

search volume, ranks, competitors

Listing data

badges, reviews, content

Company systems

ERP

warehouse stock & costs

Profit analytics

daily P&L per product

Product tracker

team notes & screenshots

↓ every night · re-pulled as numbers settle

Product data lake

Financial performance

full daily P&L per product

Inventory & storage

company warehouses and FBA

Market position

ranks, prices, competitors

Listing health

content, badges, availability

↓ stored, reconciled rows – read in seconds

Product analysis

Pre-Launch workflow

Post-Launch CoPilot

The analysis layer reads these stored rows, not live APIs: analyses run in seconds instead of waiting on Amazon’s report queue, every run is reproducible because the data it saw is on disk, and any window – week, month, year – is queryable at daily grain.

Most of the engineering went into making the data trustworthy, not the model: an analyst reasoning over broken inputs produces confident wrong answers, and those kill trust. Four problems, solved source by source:

Amazon doesn’t hand its data over. Reports must be requested, then wait in Amazon’s queue from half a minute to half an hour before collection and parsing. The pipeline tracks every report through that journey, so a nightly run that fails halfway resumes where it stopped.
The data rewrites itself. A sale attributed to an ad today may be re-attributed tomorrow; a return lands weeks after the order. So ingestion jobs re-pull the same dates repeatedly over the following month, treating yesterday’s numbers as a draft, not a fact.
Some numbers Amazon never provides. Brand campaigns report what they spent but not which product the spend belongs to, so the pipeline estimates the split – and carries the “estimated” label into the analysis, so no conclusion leans on it harder than it deserves.
The systems disagree. The profit feed and Amazon’s settlement reports differ on the same transactions; the ERP and Amazon’s live stock count differ by hours. Every number has one source designated as truth; the rest are demoted to cross-checks.

The Analysis: Minutes Instead of an Afternoon

When a manager asks about a product now, the system does what they used to do by hand. One job fans out dozens of parallel queries – external APIs and the data lake at once – and merges everything a diagnosis could need into one structured context: the period and year-long comparison windows, unit economics per variation, keyword ranks and history, the closest competitors with prices and review counts, a listing-content audit, live inventory, the badge timeline.

Live from external APIs

Catalog & A+ content

Price & rank history

Keywords & search volume

Closest competitors

ERP stock & velocity

Badges & listing flags

From the data lake

Current-period financials

Comparison periods through the year

Unit economics per variation

Refund rates with denominators

Storage fees & inventory aging

Campaign performance & rank trends

From the team

Activity notes – what changed, when

Screenshots, read as images

↓ dozens of parallel queries merge into one structured context

Claude · extended thinking · the order an experienced operator would ask

1Listing health→2Unit volume→3Returns→4Ad efficiency→5Margin

↓ a report built to be acted on

Direct reason

what moved, in plain operational terms

Root cause

a dated chain of events, tied to the team's own changes

Recommended action

one, definitive – not a list of maybes

It also reads the team’s notes. Managers log what they change – a price, a main image, an ad campaign – in their product tracker, often with screenshots, and the system passes those screenshots to Claude as images. It doesn’t just see that units fell this week – it sees the team changed the hero image two days before the drop, and connects the two.

Claude runs with extended thinking and works through the questions in an experienced operator’s order: is the listing healthy, then volume, then returns, then advertising efficiency, then margin. The report comes back built to act on: the direct reason in plain terms, the root cause as a dated chain of events, and one definitive recommended action. The cross-signals earn their keep – the unit drop that traces to a competitor’s price cut and a lost badge in the same week, the margin slide that is really aging inventory crossing a surcharge tier, the “advertising problem” that is actually a suppressed variation.

Context engineering: decomposition first, formatting second

The naive approach is one giant prompt: hand the model every number and hope. It fails predictably – attention spread across everything lands on nothing, an early misreading contaminates everything built on it, and a wrong output can’t be localized to the step that broke.

So the work is decomposed before any prompt is written. Each workflow is a chain of focused LLM steps, each carrying its own prompt, its own narrow context, and its own output contract. In the pre-launch workflow, competitor analysis informs keyword strategy, keyword strategy feeds title generation, and titles, review analysis and Fact Pack feed the image plan and A+ copy – each step seeing only what its decision requires. Even within a single analysis the model follows a fixed diagnostic sequence into a fixed output structure, with the boundary between manager-facing summary and full analysis defined by one constant both the prompt and the parser reference, so the two can’t drift apart. And decomposition keeps the system maintainable: every prompt is a versioned, labeled artifact in Langfuse, tuned and rolled back independently, so a fix to the title generator can’t regress the CoPilot.

The other half is what the data looks like when it reaches the model. Rates arrive with their sample size attached – “2 returns / 50 units (last 7 days)” next to the trailing 12-month figure – because a bare 4% from two returns triggers false alarms, and early versions proved it. Sales velocity comes in two labeled windows, one for stock risk and one for inventory runway. ERP and Amazon stock figures stay separately labeled, so the model reasons about the gap instead of averaging it away. Cost drift is flagged only across comparable periods with enough volume to mean something.

Every run is traced end to end – inputs, reasoning, output. When a manager disputes a conclusion, we replay exactly what the model saw.

Workflows for Every Stage of a Product’s Life

“Why did profit drop” is a mature product’s question. A product that hasn’t launched needs different questions answered; a product three weeks post-launch different ones again. So the platform runs a dedicated workflow per stage: launch planning before go-live, a weekly CoPilot through the launch phase, and the deep performance analysis above once a product is established.

Product lifecycle

→

Before go-live

Pre-Launch Analysis

one run

Page-1 competitor analysis & realistic monthly units

Launch price, mature price & launch budget

Keyword strategy: longtail now, high-volume later

Listing content grounded in a verified Fact Pack

Fulfillment call & first inbound quantity

Launch phase

Post-Launch CoPilot

weekly

Stage-aware diagnosis: reviews → rank → profitability

Controlled ad ramp on longtail keywords

Exactly one time-boxed test at a time

Tasks, alerts & an audit log of every decision

Stock protection: true daily rate locked during throttling

Established

Performance Analysis

on demand + batches

Full-context root-cause diagnosis

Cross-signals: ads, price, badges, fees, stock

Dated causal chain & one definitive action

Whole portfolio covered on schedule

Pre-Launch. Turns a single seed keyword into a launch plan. It studies page one of Amazon search, filters out unrealistic comparisons – wrong product type, or entrenched players with a thousand-plus reviews – and estimates what a newcomer can sell per month. From margin targets it derives a launch price and a mature price, adjusted for psychological thresholds, plus a launch budget to gain rank. It reverse-engineers the closest competitors’ keyword lists, keeps the keywords several of them rank for, and splits them into longtail targets for launch and high-volume keywords to defer until the product has traction. Then it drafts the listing – two title variants, a nine-image plan, A+ copy, a video script mapped against the top complaints in competitors’ reviews – every claim grounded in a “Fact Pack” of verified specs from the product’s R&D docs and manual; where a spec is missing, the output says so. It closes with a fulfillment recommendation and inbound quantity, and a listing manager reviews before go-live.

Weekly CoPilot. From go-live to maturity, it reviews every launch-phase product weekly, aware of each one’s stage: a listing with no reviews is pushed toward review acquisition, a ramping product toward rank building, a maturing one toward profitability. Each week it assembles the full picture – traffic, reviews, returns with buyer comments, inventory, fees, per-keyword ad performance, rank movement, budget state, the team’s notes, and its own previous output, so it remembers what it said last week – and applies the team’s playbook in strict priority order. A suppressed or unbuyable listing overrides everything. Negative review themes and return reasons are early warnings. Ad spend ramps in small deliberate steps on a handful of longtail keywords; a keyword that burns budget for two weeks without rank progress triggers a pivot. When progress stalls, it proposes exactly one time-boxed test – a price change, a coupon, a creative experiment – never five at once. Output comes in two layers: a readable weekly diagnosis with at most five actions, and structured tasks, alerts and an audit log – validated against typed schemas – recording which rules fired, which inputs were missing, and how confident the run was.

One rule shows how much operational knowledge lives in the playbook. When stock runs low, Amazon throttles a product’s visibility and observed sales collapse – so a reorder based on the throttled rate under-buys and deepens the shortage. The CoPilot locks in the true daily rate from the last normal days before throttling and reports both numbers, with a standing note to purchasing: reorder on the true rate, not the observed one.

How We Worked Together

That throttling rule didn’t come from us. It came from Vive’s managers, and so did the rest: the thousand-review cutoff for realistic competitors, mandatory Subscribe & Save for consumables, variation naming conventions that cut sizing-related returns, the discipline of one test at a time. We mapped how the team’s best people actually decide before writing a single prompt – the workflows encode their judgment, not a generic playbook. The collaboration runs continuously: the team’s activity notes feed every analysis, and every published analysis lands back in the product tracker the team already works in.

That same loop improves the system. When a conclusion looks off, managers log it in a shared feedback sheet. Each entry follows the same path: reproduce the claim against live data, trace it to the layer that caused it – a database query, a context-formatting decision, a prompt rule – fix that layer, and pin the fix with a permanent regression test so it can’t quietly return.

Most “AI mistakes” were context mistakes. Refund-rate alarms came from missing denominators. A velocity figure that “looked very off” came from a 7-day window where the team expected 30 days – the fix was presenting both, labeled. Each fix made the context more truthful, and the analyses better for every product, not just the disputed one. Prompt adjustments ship through Langfuse versions, checked against the traces that prompted them, then promoted – which is why trust grew instead of eroding after the first wrong answer.

The Result

The platform runs in production. Nightly jobs keep the data lake current. Analyses run on demand when a manager asks about a product and in scheduled batches across the portfolio, each tracing from raw daily metrics to a dated root cause and a definitive recommendation. New products move through the same platform from pre-launch planning through weekly CoPilot reviews until they graduate to the standard cadence.

	Before	After
Analysis time per product	An afternoon	Minutes
Portfolio coverage	Only products in visible trouble	Every product, on schedule
Evidence assembly	Manual export and date-alignment	Auto-assembled; the manager judges the conclusion
Reasoning record	None	Every run traced and replayable

The review team’s job changed shape. The hours once spent hopping between services and assembling spreadsheets now go into decisions, and coverage stopped being selective – every product gets looked at, on schedule, with the same rigor.

The lesson we keep relearning: the model is the smallest part. The data lake underneath it, the decomposition that keeps every model task focused, the context engineering that makes data hard to misread, and the feedback loop that turns every disagreement into a permanent improvement – that is what turns an impressive demo into an analyst the team trusts.

Vive Health: Optimizing Hundreds of Amazon Products With One AI Analyst