3 Comments
Shanvit Shetty's avatar

Great breakdown. I really liked your points on token-aware rate limiting, semantic caching, and the full AI-native architecture.

For a 1-2 dev team building an MVP, though, I'm wondering if starting simpler (direct streaming calls + lightweight setup) might be better for velocity, then evolving into this more robust system once usage grows. Curious how you think about that early-stage tradeoff.

Vinit Shahdeo's avatar

Thanks, glad it resonated!

I don't think a 1-2 dev MVP should build all of this upfront. A fetch call in a route handler is the right starting point. Velocity wins early.
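As a rough illustration of that starting point, here is a minimal sketch of a route handler that makes one `fetch` call and passes the model's stream straight through. It assumes a runtime with the web-standard `Request`/`Response`/`fetch` globals (Node 18+, or a framework built on them). The upstream URL, the model name, the `stream: true` flag, and the request shape are hypothetical placeholders, not any specific provider's API.

```typescript
// Hypothetical upstream endpoint and model name; substitute your provider's values.
const UPSTREAM_URL = "https://llm.example.com/v1/chat/completions";
const MODEL = "example-model";

interface UpstreamInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Build the upstream request. `stream: true` is an assumed provider flag
// that asks for a streamed (SSE-style) response instead of one JSON body.
function buildUpstreamInit(prompt: string, apiKey: string): UpstreamInit {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: MODEL,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Returns a handler you could export from a route file (e.g. as POST).
function makeHandler(apiKey: string) {
  return async function handler(req: Request): Promise<Response> {
    const { prompt } = (await req.json()) as { prompt: string };
    const upstream = await fetch(UPSTREAM_URL, buildUpstreamInit(prompt, apiKey));

    if (!upstream.ok || upstream.body === null) {
      return new Response("Upstream model call failed", { status: 502 });
    }

    // Pass the byte stream through untouched so the client sees tokens as they arrive.
    return new Response(upstream.body, {
      headers: {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
      },
    });
  };
}
```

There is no cache, registry, or fallback here on purpose; the only structural decision baked in is that the response is a stream rather than a single JSON payload.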

The way I think about it: separate the patterns that are cheap to add later from the ones that are expensive to retrofit.

Semantic caching, a prompt registry, and multi-model fallback: skip those until usage and cost actually hurt. You'll have real traffic to tune against anyway, and tuning a similarity threshold against zero users is just guessing.

But two things I'd argue belong in even the simplest setup, because retrofitting them is painful: streaming from day one (switching from a JSON response contract to SSE later means touching your client, your proxy timeouts, and your error handling all at once), and logging tokens + cost per call (one line of code now, versus reconstructing "which feature caused the $5k bill" from nothing later).
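The per-call logging half of that can be sketched in a few lines. This is only an illustration: the price table, the model name, and the `feature` tag are hypothetical, and the token counts are assumed to come from whatever usage data your provider returns with each response.

```typescript
// Hypothetical prices in USD per million tokens; replace with your provider's real rates.
const PRICE_PER_MILLION: Record<string, { input: number; output: number }> = {
  "example-model": { input: 1.0, output: 2.0 },
};

interface Usage {
  feature: string; // which product feature made the call (assumed tag)
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Estimate the cost of one call; returns NaN for a model missing from the table.
function estimateCostUsd(u: Usage): number {
  const price = PRICE_PER_MILLION[u.model];
  if (!price) return NaN;
  return (u.inputTokens * price.input + u.outputTokens * price.output) / 1e6;
}

// Emit one structured log line per model call and return it.
// Note: JSON.stringify serializes a NaN cost as null.
function logUsage(u: Usage): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    ...u,
    costUsd: estimateCostUsd(u),
  });
  console.log(line);
  return line;
}
```

Calling `logUsage({ feature: "summarize", model: "example-model", inputTokens: 1200, outputTokens: 300 })` after each request is enough to later group spend by `feature` with whatever log tooling you already have.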

So my honest answer: start simple, but make the two or three hard-to-reverse choices up front. Everything else can wait until the usage tells you it's time.

Shanvit Shetty's avatar

Makes sense: start simple and fast, but build in the components that are painful to retrofit later. That way, the system stays as plug-and-play as possible, and individual pieces can evolve independently without triggering cascading refactors across the stack.

Appreciate the article and the thoughtful reply!