Half of your questions can be answered by looking at the compiled shader. For DX11 platforms, Unity unfortunately does not provide it in a readable format. What I use instead is the above-mentioned AMD Shader Analyzer, which lets you see the actual instructions behind the shader and even, roughly, how they perform on some (rather old) AMD cards.
The rest is just experience and general knowledge. Functions and macros perform identically because in the compiled shader there are no functions or macros, just straight-up code. Two multiplications are likewise the same no matter what syntax you use.
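To illustrate (a hypothetical example, the names are mine):

```
// Both of these compile to the exact same instructions,
// because the function gets inlined and the macro gets expanded:
#define SQUARE_MACRO(x) ((x) * (x))

float SquareFunc(float x)
{
    return x * x;
}

// In the pixel shader, these two lines produce identical code:
// float a = SQUARE_MACRO(v);
// float b = SquareFunc(v);
```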
In your last example, if you're not doing anything else with the values, I'd say passing them through directly as a single vector of four is a tiny bit faster, because you avoid a mov instruction at the end. But if you really don't do anything else in the pixel shader, then it doesn't matter at all, because 99% of the time will be spent elsewhere in the pipeline… rasterizer, interpolators, ROPs… or even just waiting for memory. Bandwidth is often the most limiting factor on desktop GPUs, after all.
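I don't know your exact code, but here's a hypothetical sketch of the difference I mean:

```
// Returning the interpolated value as one float4:
struct v2f { float4 color : TEXCOORD0; };

float4 frag(v2f i) : SV_Target
{
    return i.color;  // typically one mov straight into the output register
}

// ...versus assembling it from separate components, which tends to cost
// an extra mov or two to pack the output register:
struct v2f_split { float3 rgb : TEXCOORD0; float a : TEXCOORD1; };

float4 frag_split(v2f_split i) : SV_Target
{
    return float4(i.rgb, i.a);
}
```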
It is a lot more difficult to figure out the rest. If you look at the compiled code, you'll see pow(x, y) is compiled into log2, mul and exp2 instructions. pow(c, y), where c is a constant, only needs a mul + exp2, because log2(c) is folded at compile time. pow(x, c) depends on the value of c and often gets turned into multiplications instead, where possible.
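A rough sketch of what the compiler effectively does with pow() (the function and names are mine):

```
float PowExamples(float x, float y)
{
    float a  = pow(x, y);          // general case compiles to...
    float a2 = exp2(y * log2(x));  // ...this: log2 + mul + exp2

    float b  = pow(3.0, y);        // constant base: log2(3.0) is folded
    float b2 = exp2(y * 1.585);    // at compile time, leaving mul + exp2

    float x2 = x * x;              // constant exponent: pow(x, 8.0) can
    float x4 = x2 * x2;            // become repeated squaring instead -
    float c  = x4 * x4;            // x^8 in just 3 multiplications

    return a + a2 + b + b2 + c;    // keep everything live for the compiler
}
```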
Now, if you're asking how much more expensive an exp2 instruction is than a multiplication, there's no satisfactory answer. It differs between AMD, NVIDIA, mobile GPUs, desktop GPUs, new cards, old cards… Even the compiler doesn't know, in the case of HLSL. It might have a vague idea, though: it turns pow(x, 512) into 9 multiplications, but pow(x, 1024) into a log2, mul and exp2… Either way, you can see that it's often safer to rely on the compiler to do low-level optimizations for you.
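The 9 multiplications come from repeated squaring, since 512 = 2^9. A sketch of the expansion the compiler apparently settles on below its threshold:

```
// pow(x, 512) as 9 multiplications via repeated squaring:
float Pow512(float x)
{
    float r = x;
    for (int i = 0; i < 9; i++)
        r *= r;   // x^2, x^4, x^8, ... x^512
    return r;
}
```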
Texture reads are a whole other beast. When a shader core "sends a read request", it usually does not just idly wait for the data to come back; it switches to other work that is ready to process, possibly even in a completely separate pixel shader invocation. Generally you should have a lot more arithmetic instructions than texture fetches for that reason.

The latency until the data comes back further depends on the texture format and the filtering used, and most importantly on whether it has been fetched before and is currently in the cache. Spatially coherent reads are cache-friendly reads: if neighboring pixels fetch texels that are also close to each other, it's much faster than if they sample all over the texture at random. Since cache memory is limited, that also means reading from very small textures very often is relatively cheap - but most likely not cheap enough to make a lookup texture for a single pow function worth it.
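For reference, the lookup-texture alternative I'm talking about would look something like this (a sketch; _PowLUT is a hypothetical 256x1 texture you'd bake with pow(u, k) for u in [0, 1]):

```
sampler2D _PowLUT;

float PowViaLUT(float x)
{
    // One texture fetch replaces the log2 + mul + exp2 sequence...
    return tex2D(_PowLUT, float2(x, 0.5)).r;
    // ...but even this cache-friendly fetch is rarely cheap enough
    // to beat just computing pow() in the ALUs.
}
```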
My point with all this is that measuring every separate piece is not nearly enough. Hell, even benchmarking the entire shader on some random data isn't representative. You need to profile the shader in the actual, real scenario to get a good picture of the overall performance.
As a disclaimer, I specialize in desktop graphics, so whatever I just said might or might not be drastically different on mobile.