I think SOME amount of rewriting by the optimizer is required. We especially found that to be true with exactly the situation described in the post. It seems to me that "x IN (list of scalars)", "x = ANY (list of scalars)", and "x IN (VALUES ...)" are all the same thing until you use query parameters. I agree it doesn't need to get carried away (as it sounds like Oracle did). As long as planning time is capped to some limit, everyone with the same issue benefits from optimizer improvements instead of having to change their applications.
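For concreteness, a minimal sketch of the three spellings I mean (the table and column names are made up):

    SELECT * FROM orders WHERE customer_id IN (1, 2, 3);
    SELECT * FROM orders WHERE customer_id = ANY (ARRAY[1, 2, 3]);
    SELECT * FROM orders WHERE customer_id IN (VALUES (1), (2), (3));

With literal lists all three return the same rows; the differences only start to matter once the list arrives as a parameter.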
When I was on a team working with SQL Server queries in an application, I often advised people to keep these ideas in mind when using any relational database:
1. Write the query as WHAT you want to retrieve, not HOW to execute it. If you need to spell out how to execute it, whether by contorting the query or by adding SQL Server query hints, then something is too complicated.
2. Keep the query as simple as possible (which does require some understanding of how execution would happen). For example, don't join the same table in again if it would join on the same key, and prefer writing a join over a correlated subquery (see the sketch after this list).
3. EXPECT complicated queries to change their execution plans as your data changes.
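As an illustration of the second point, a hedged sketch with invented tables, asking the same question first as a correlated subquery and then as a join:

    -- Correlated subquery form.
    SELECT o.id
    FROM orders o
    WHERE EXISTS (SELECT 1
                  FROM customers c
                  WHERE c.id = o.customer_id
                    AND c.region = 'EU');

    -- Join form; equivalent here assuming customers.id is unique,
    -- so the join cannot duplicate order rows.
    SELECT o.id
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.region = 'EU';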
It makes sense to me. Still, I'm not a fan of losing cycles - keep in mind that, multiplied across servers around the world, that energy is spent for nothing. So my initial idea was that the core could provide metadata on the query tree, giving extensions a way to find out how many transformations can be successfully applied to a specific query.
Until PG14, using IN (VALUES ...) with a long list (and the resulting semi join) was significantly more performant than an = ANY expression with a large list, because that expression was always executed as a linear scan through the list. That patch (to execute such an expression as a hash lookup) came out of my experience with an = ANY containing 17k values being horrifically slow (while rewriting it as an explicit join to VALUES was much faster).
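To make the comparison concrete, a hedged sketch with invented identifiers (the real lists held roughly 17k entries, not three):

    -- Pre-PG14 behaviour: the array is scanned linearly for every row.
    SELECT r.*
    FROM refs r
    WHERE r.target_id = ANY (ARRAY[101, 102, 103]);

    -- The workaround: the VALUES list becomes a relation, and the planner
    -- can execute the IN as a (hashed) semi join.
    SELECT r.*
    FROM refs r
    WHERE r.target_id IN (VALUES (101), (102), (103));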
Thanks for sharing this experience! That's exactly why I write these posts - to get an alternative view. It is helpful that you pointed out that this workaround stopped being necessary once hashed array search was introduced.
I'd like to see this terrific query (with a 17k-element array) just to understand how it could appear in real life.
The query wasn’t particularly interesting: we had 17k ids that we knew shared a characteristic and we needed to modify any references to those records, so we had to find all rows in a table that referenced those ids.
The = ANY optimization only works with const arrays, so you can still construct similarly degenerate queries today that benefit from the hashed semi join conversion when the array is either a param (e.g. passed across a nested loop) or comes from a subquery.
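If I read the const-array restriction right, a hypothetical query like the following would still miss the hashed lookup, because the array is built at run time rather than being a literal (the extra parentheses make the subquery a scalar expression):

    -- Array produced by a subquery: not a Const, so the hashed = ANY path is out.
    SELECT r.*
    FROM refs r
    WHERE r.target_id = ANY ((SELECT array_agg(id) FROM flagged_customers));

    -- Expressed as a join instead, the planner can still hash the inner side.
    SELECT r.*
    FROM refs r
    JOIN flagged_customers f ON f.id = r.target_id;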
You mentioned an interesting idea: we allow the transformation for quite complex VALUES sequences. It makes sense to test how effectively it works with long chains of parameters and with non-immutable functions inside.
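For example, a hypothetical shape of such a VALUES list, as it might appear in a prepared statement, mixing bind parameters with a volatile (non-immutable) function call:

    SELECT r.*
    FROM refs r
    WHERE r.target_id IN (VALUES ($1), ($2), (floor(random() * 1000)::int));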