Engineering

How we built WhatToBuy

March 2026 · 5 min read

WhatToBuy is a shopping assistant. You describe a situation (a camping trip, a new baby, a home office setup) and it comes back with a curated cart in three tiers: Budget, Balanced, and Premium. In Deep mode it asks a few clarifying questions first to get the details right. This is a post about building it.

We almost built it the wrong way. The obvious version uses an LLM to tell you what to buy. You ask it, it answers from memory, you display the list. Turns out that’s useless. The model has a training cutoff, no pricing, no availability, no idea what’s actually in stock or any good right now. We built a version of this and it felt hollow immediately. Products the model was confidently recommending were discontinued, overpriced, or had 2-star reviews.

The shift that made everything click: use the LLM to write search queries, not to answer the question. Instead of asking “what tent should I buy,” we ask it to figure out the most precise Google Shopping query that would surface the right product for each category. Models are very good at this. They can translate “camping with two young kids” into “family camping tent 4-person easy setup waterproof” far better than they can tell you which specific tent to buy. Then we run those queries against a live data source and work with real results.
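A minimal sketch of that flow, assuming `llm` is any callable that takes a prompt and returns the model's text, and `search` is the live shopping lookup (both names are hypothetical, not the production API):

```python
import json

def generate_queries(llm, scenario: str, categories: list[str]) -> dict[str, str]:
    """Ask the model for one precise shopping query per category.

    The model translates the scenario, not answers it -- it never
    recommends a specific product here.
    """
    prompt = (
        "For each category, write one precise Google Shopping query that "
        "would surface the right product for this scenario.\n"
        f"Scenario: {scenario}\n"
        f"Categories: {', '.join(categories)}\n"
        'Respond as JSON: {"category": "query", ...}'
    )
    return json.loads(llm(prompt))

def build_cart(llm, search, scenario: str, categories: list[str]) -> dict[str, list]:
    """Translate the scenario into queries, then fetch live results."""
    queries = generate_queries(llm, scenario, categories)
    return {category: search(query) for category, query in queries.items()}
```

The point of the split is that the model only ever produces queries; everything the user sees comes from real, current search results.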

The first draft of the prompt produced vague queries. “Tent.” “Sleeping bag.” We spent more time on prompt engineering than we expected. Two things actually moved the needle. First, showing the model a bad query (“tent”) and a good query (“family camping tent 4-person easy setup waterproof”) labeled WRONG and CORRECT. Not just describing what good looks like but showing both. Without the negative example, the model defaults to vague even when you explicitly tell it not to. Second, we inject user profile data before the scenario: age, location, family members. The model generates much more targeted queries when it knows the buyer is a 35-year-old with a 7-year-old versus a solo traveler.
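The prompt assembly can be sketched like this; the profile field names are illustrative, not the production schema:

```python
def build_query_prompt(profile: dict, scenario: str) -> str:
    """Assemble the query-generation prompt: the contrastive
    WRONG/CORRECT pair first, then the buyer profile, then the
    scenario. The negative example is what keeps the model from
    defaulting to vague one-word queries."""
    lines = [
        "You write Google Shopping queries, one per product category.",
        "",
        "WRONG: tent",
        "CORRECT: family camping tent 4-person easy setup waterproof",
        "",
        f"Buyer: age {profile['age']}, location {profile['location']}, "
        f"family: {', '.join(profile['family'])}",
        "",
        f"Scenario: {scenario}",
    ]
    return "\n".join(lines)
```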

Once you have real products, you have to pick one per category. This is harder than it sounds. Sort by price alone and the cheapest result is usually a fake listing, a mislabeled accessory, or something with two reviews. We had to add a price floor: anything below 30% of the median for that category gets dropped. Sort by rating alone and a product with three reviews and a perfect 5.0 beats everything. We require at least 50 reviews before something is eligible for the top tier.
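Those two guards are simple to state in code. A sketch, with illustrative field names (`price`, `reviews`) rather than the production schema:

```python
from statistics import median

MIN_PRICE_RATIO = 0.30   # drop anything under 30% of the category median
MIN_REVIEWS_TOP = 50     # reviews required for top-tier eligibility

def filter_category(products: list[dict]) -> list[dict]:
    """Drop suspiciously cheap listings relative to the category median:
    these are usually fakes, mislabeled accessories, or near-empty
    listings rather than genuine bargains."""
    floor = MIN_PRICE_RATIO * median(p["price"] for p in products)
    return [p for p in products if p["price"] >= floor]

def top_tier_eligible(product: dict) -> bool:
    """A perfect rating on three reviews shouldn't beat everything."""
    return product["reviews"] >= MIN_REVIEWS_TOP
```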

For the middle tier we settled on a value score: rating times log(review count) divided by unit price. The log is important. Without it, review volume dominates regardless of quality, and a $200 item with 50,000 reviews wins everything. With the log, going from 10 to 100 reviews doubles that term, while going from 1,000 to 10,000 only adds a third. We also normalize multi-pack items to a per-unit price. A 12-pack of sponges at $9.99 should score at $0.83 per unit; otherwise bulk items sweep every budget tier.
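The score fits in a few lines; this sketch uses the natural log, which is an assumption, but the relative behavior is the same for any base:

```python
import math

def value_score(rating: float, review_count: int,
                price: float, pack_size: int = 1) -> float:
    """rating x log(review count) / unit price.

    The log damps runaway bestsellers; pack_size normalizes
    multi-packs so a 12-pack competes on its per-unit price.
    """
    unit_price = price / pack_size
    return rating * math.log(review_count) / unit_price
```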

Data quality was the thing we underestimated most. The scoring logic can only do so much when the underlying data is garbage. Shopping search APIs surface a long tail of bad results: links that redirect to YouTube, grocery delivery apps appearing as product stores, Canadian retailers showing up in US results, bulk wholesale suppliers with minimum orders of 24 units. We audited our cart history and found that a meaningful share of product links were dead, wrong-priced, or pointed to completely unrelated pages. We built a domain blocklist stored in the database, fetched and cached at runtime, applied before any product reaches scoring. Adding a bad domain now takes effect in minutes without a deployment. We should have built this on day one.
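The cached-blocklist pattern looks roughly like this; `fetch_blocklist` is a hypothetical stand-in for the database query, and the TTL is illustrative:

```python
import time
from urllib.parse import urlparse

class BlocklistFilter:
    """Blocked domains live in the database; we cache them with a
    short TTL so adding a domain takes effect within minutes and
    never requires a deployment."""

    def __init__(self, fetch_blocklist, ttl_seconds: int = 300):
        self._fetch = fetch_blocklist      # callable returning current domains
        self._ttl = ttl_seconds
        self._cached: set[str] = set()
        self._fetched_at: float | None = None

    def _blocklist(self) -> set[str]:
        # Refresh the cache when it is missing or stale.
        now = time.monotonic()
        if self._fetched_at is None or now - self._fetched_at > self._ttl:
            self._cached = set(self._fetch())
            self._fetched_at = now
        return self._cached

    def allowed(self, product_url: str) -> bool:
        """Block the domain itself and any of its subdomains."""
        host = urlparse(product_url).netloc.lower()
        return not any(host == d or host.endswith("." + d)
                       for d in self._blocklist())
```

Running this before scoring means a bad domain never even competes for a tier slot.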

We also added family member context to profiles. If you’ve told us you have a 7-year-old daughter, she should show up in a beach day cart (with specific gear for a 7-year-old girl) but not in a “laptop for work” cart. Getting models to apply context conditionally is a real prompt engineering problem. They either include it everywhere or apply it erratically. The fix was more few-shot examples showing both when to use family context and when to ignore it. Positive examples alone produce over-application.

One architectural decision we’re still happy with: two modes. Fast runs a single LLM call and returns results immediately, no sign-in required. Deep runs a short conversation first. The model asks a few clarifying questions, you answer, then it generates queries informed by your responses. The quality difference is significant for anything underspecified. “Ski trip” in Fast mode gets generic ski gear. In Deep mode, the model finds out it’s a family of four, the kids are beginners, the budget is around $1,500, and they’re renting skis. The cart it builds after that is completely different. Deep is our default. Fast is always one click away for people who want something quick.
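The two modes share one code path; Deep just enriches the scenario before handing it to Fast. A sketch under the same assumptions as before (`llm` and `search` are hypothetical callables, `ask_user` collects an answer to one clarifying question):

```python
def fast_cart(llm, search, scenario: str) -> list:
    """Fast mode: one model call, scenario straight to a query."""
    query = llm(f"Write one precise shopping query for: {scenario}")
    return search(query)

def deep_cart(llm, search, ask_user, scenario: str) -> list:
    """Deep mode: a short clarifying conversation first, then the
    same query generation with the answers folded in."""
    questions = llm(f"List clarifying questions for: {scenario}").splitlines()
    answers = "; ".join(f"{q} {ask_user(q)}" for q in questions if q.strip())
    return fast_cart(llm, search, f"{scenario}. Details: {answers}")
```

Because Deep bottoms out in `fast_cart`, the scoring and filtering downstream never need to know which mode produced the scenario.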

The patterns that held up (using LLMs for retrieval translation rather than recall, negative examples alongside positive ones, logarithmic scaling for noisy signals, multi-turn only when the input is genuinely underspecified) came out of building something specific and hitting the edges. We didn’t plan most of them upfront.

If you’re building something in this space and want to compare notes, we’re at support@whattobuy.app. The app is at whattobuy.app and Fast mode works without an account.