I ran my claw test with several models, tweaking the prompt and caching, and somehow a grep-based script with no LLM at all was a strong competitor against most models. Sonnet 4.6 found more verified features: it knew the keywords and picked up the nuance when a description drifted into non-product stuff, etc.
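For context, a grep-style baseline like that boils down to a fixed pattern per feature. A minimal sketch (the feature names and keywords here are made-up examples, not the real list from my script):

```python
import re

# Hypothetical feature -> pattern table; the real keyword list isn't shown here.
FEATURE_KEYWORDS = {
    "dark mode": re.compile(r"\bdark\s*mode\b", re.IGNORECASE),
    "offline sync": re.compile(r"\boffline\s*sync\b", re.IGNORECASE),
    "export to csv": re.compile(r"\bexport(?:s|ed|ing)?\s+to\s+csv\b", re.IGNORECASE),
}

def extract_features(description: str) -> set[str]:
    """Return the feature names whose pattern matches the description."""
    return {name for name, pat in FEATURE_KEYWORDS.items() if pat.search(description)}

print(extract_features("Now supports Dark Mode and exporting to CSV."))
```

The obvious weakness is exactly what the LLM handled better: this matches the keywords even when the surrounding text isn't actually about the product.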
I bet it can be improved but I also have to finish a feature.
