Every team I've worked with on RAG starts in the same place: tweaking the prompt. Variants of system messages, restated instructions, careful tone-shaping. It rarely moves the needle.
The reason is structural. A generation grounded on the wrong context cannot be saved by any prompt. The model is doing exactly what it was asked to do — it's just been asked the wrong question. Fix that, and most of the apparent quality problems disappear without touching the prompt at all.
Where to actually look
Start with retrieval evals before you touch generation evals. Sample two hundred recent queries, retrieve the top-k for each, and ask whether the right segments are in there at all. If they're not, no prompt will save you. If they are but the order is wrong, you have a reranking problem, not a generation problem.
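That diagnostic splits into two numbers per query: did the right segments show up at all (recall@k), and how high did the first one rank (reciprocal rank). A minimal sketch of that eval loop, with hypothetical names like `recall_at_k` and `retrieve` standing in for whatever your stack provides:

```python
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    relevant_ids: set[str]  # ground-truth segment ids for this query

def recall_at_k(results: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth segments present in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(results) & relevant) / len(relevant)

def mrr(results: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant hit; 0.0 if none retrieved."""
    for rank, doc_id in enumerate(results, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(queries: list[EvalQuery], retrieve, k: int = 10):
    """retrieve(query) -> ranked list of segment ids (your retriever here)."""
    recalls, mrrs = [], []
    for q in queries:
        top_k = retrieve(q.query)[:k]
        recalls.append(recall_at_k(top_k, q.relevant_ids))
        mrrs.append(mrr(top_k, q.relevant_ids))
    n = len(queries)
    return sum(recalls) / n, sum(mrrs) / n
```

The two numbers separate the failure modes: low recall means the segments aren't being retrieved at all (an indexing or retrieval problem), while high recall with low MRR means they're retrieved but buried (a reranking problem).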
In any given week, most of my time on Tessact lives upstream of the model: indexing strategies, hybrid retrieval, reranking, structural metadata. The prompt itself is the cheap part.
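One upstream piece worth a sketch: hybrid retrieval needs a way to merge the ranked lists from lexical and dense search, whose raw scores aren't comparable. Reciprocal rank fusion is a common choice because it only uses ranks. This is an illustrative implementation, not anything specific to Tessact:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge ranked id lists (e.g. BM25 + dense)
    without normalizing score scales. Each list contributes 1/(k + rank)
    per document; the constant k dampens the weight of top ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that sits near the top of both lists beats one that tops only a single list, which is usually the behavior you want before a cross-encoder reranker sees the candidates.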