"A common theme in linear compression and feature extraction is to map a high dimensional vector $x$ to a lower dimensional vector $y=Wx$ such that the information in the vector $x$ is maximally preserved in $y$. Often PCA is applied for this purpose. However, the optimal setting for $W$ is in general not given by the widely used PCA. Actually, PCA is a sub-optimal special case of mutual information maximisation."
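
To make the quoted setting concrete, here is a minimal sketch of the one case I know of where PCA and mutual information maximisation agree. The assumptions are mine, not the quote's: $x$ is zero-mean Gaussian with covariance `cov`, the observation is $y = Wx + \epsilon$ with isotropic Gaussian noise of variance `noise_var`, and the rows of $W$ are constrained to be orthonormal. Under those assumptions $I(x;y) = \frac{1}{2}\log\det(I + W \Sigma W^T/\sigma^2)$, and the top principal directions maximise it. The function name and the toy covariance are hypothetical, chosen only for illustration.

```python
import numpy as np

def gaussian_mutual_information(W, cov, noise_var=1.0):
    """I(x; y) in nats for y = W x + eps, assuming x ~ N(0, cov)
    and eps ~ N(0, noise_var * I). These are illustrative assumptions."""
    k = W.shape[0]
    M = np.eye(k) + W @ cov @ W.T / noise_var
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet

rng = np.random.default_rng(0)
d, k = 5, 2

# Toy symmetric positive semi-definite covariance (hypothetical data).
A = rng.normal(size=(d, d))
cov = A @ A.T

# PCA projection: rows are the k eigenvectors with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
W_pca = eigvecs[:, -k:].T

# A random projection with orthonormal rows, for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
W_rand = Q.T

mi_pca = gaussian_mutual_information(W_pca, cov)
mi_rand = gaussian_mutual_information(W_rand, cov)
print(f"PCA projection:    {mi_pca:.4f} nats")
print(f"random projection: {mi_rand:.4f} nats")
```

In this sketch the PCA projection never scores below the random one, so my confusion is about what changes once the Gaussian, isotropic-noise, or orthonormality assumptions are dropped.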

Can anyone elaborate on why PCA is a sub-optimal special case of mutual information maximisation?
