The penalty is at most ~1 cycle of latency, and in practice I find it gets completely absorbed by the out-of-order execution engine. I've never measured a significant penalty for mixing float and int SSE operations in any code, on any x86 microarchitecture.
Floating point bitwise operations exist too: xorps, andps, and so on.
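For illustration, here is a minimal sketch of what those float-domain bitwise ops look like through the SSE intrinsics in C. It assumes an x86 target with SSE; the helper names `negate_ps` and `abs_ps` are made up for this example. `_mm_xor_ps` and `_mm_andnot_ps` normally compile to xorps and andnps, though the compiler is free to pick a different encoding.

```c
#include <xmmintrin.h>  /* SSE: _mm_set1_ps, _mm_xor_ps, _mm_andnot_ps */

/* Hypothetical helper: negate four packed floats by flipping each
   lane's sign bit with a float-domain XOR (xorps). */
static inline __m128 negate_ps(__m128 v) {
    /* -0.0f has only the sign bit (bit 31) set in each lane. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(v, sign_mask);
}

/* Hypothetical helper: absolute value of four packed floats by
   clearing each lane's sign bit. _mm_andnot_ps(a, b) computes
   (~a) & b, so this keeps every bit of v except the sign bit. */
static inline __m128 abs_ps(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(sign_mask, v);
}
```

Since both the data and the masks stay in the float domain, nothing here should cross the int/float bypass at all.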
gsg|13 years ago