For my use case, I've found that claude-instant-* and claude-* are roughly on par with each other and with gpt-3.5. claude-* seems to be the least inaccurate of the three, but we also haven't put it into production the way we have gpt-3.5, so it's hard to say for sure.
In any case, the claude models are very good, and I think they'd do fine in a real product. But there are definitely issues that all of them have (or that my prompt engineering has).