Intuition of information theory

I am reading the book "Elements of Information Theory" by Cover and Thomas, and I am having trouble understanding the various ideas conceptually.

For example, I know that $H(X)$ can be interpreted as the average encoding length. But what does $H(Y|X)$ mean intuitively?
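To try to ground this, I computed $H(Y|X)$ for a small joint distribution of my own (the numbers are arbitrary, purely for illustration). If $H(X)$ is the average encoding length for $X$, then $H(Y|X)$ seems to be the average encoding length for $Y$ paid by someone who already knows $X$:

```python
from math import log2

# A toy joint pmf p(x, y) over X, Y in {0, 1}; the numbers are arbitrary.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(x).
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x): the average code length for Y
# for an encoder who already knows X.
H_Y_given_X = -sum(v * log2(v / px[x]) for (x, _), v in p.items())
print(round(H_Y_given_X, 4))  # 0.7219
```

That at least matches the "average encoding length given side information" reading, but I still cannot picture it the way I can picture, say, a homotopy.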

And what is mutual information? I read things like "it is the reduction in the uncertainty of one random variable due to knowledge of the other", but this doesn't mean anything to me: it doesn't help me explain in words why $I(X;Y)=H(Y)-H(Y|X)$, or explain the chain rule for mutual information.
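As a sanity check (again with my own toy numbers, not from the book), I verified numerically that the KL-divergence definition of $I(X;Y)$ agrees with $H(Y)-H(Y|X)$; but confirming the identity is not the same as understanding it:

```python
from math import log2

# Same kind of toy joint pmf as before (arbitrary numbers).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

H_Y = -sum(v * log2(v) for v in py.values())
H_Y_given_X = -sum(v * log2(v / px[x]) for (x, _), v in p.items())

# Mutual information from its definition as a KL divergence:
# I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I_XY = sum(v * log2(v / (px[x] * py[y])) for (x, y), v in p.items())

print(abs(I_XY - (H_Y - H_Y_given_X)) < 1e-12)  # True: the identity holds
```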

I also encountered the data processing inequality, explained as showing that no clever manipulation of the data can improve the inferences that can be made from the data: if $X \to Y \to Z$ forms a Markov chain, then $I(X;Y)\ge I(X;Z)$. If I had to explain this result in words, and why it should be intuitively true, I would have absolutely no idea what to say. Even explaining how "data processing" is related to Markov chains and mutual information would baffle me.
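The only way I could connect "data processing" to a Markov chain was with a toy simulation (my own construction, not from the book): take a correlated pair $(X,Y)$ and "process" $Y$ through a noisy channel to get $Z$, so that $Z$ depends on $X$ only through $Y$. The inequality then holds numerically, but I still cannot say *why* in words:

```python
from math import log2

def mi(pxy):
    """Mutual information I(X;Y) in bits from a joint pmf dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), v in pxy.items():
        px[x] = px.get(x, 0) + v
        py[y] = py.get(y, 0) + v
    return sum(v * log2(v / (px[x] * py[y])) for (x, y), v in pxy.items() if v > 0)

# X -> Y: a correlated pair (arbitrary toy numbers).
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Y -> Z: "processing" Y through a binary symmetric channel (flip with prob 0.1).
# Z depends on X only through Y, so X -> Y -> Z is a Markov chain.
flip = 0.1
pxz = {}
for (x, y), v in pxy.items():
    for z in (0, 1):
        pz_given_y = 1 - flip if z == y else flip
        pxz[(x, z)] = pxz.get((x, z), 0) + v * pz_given_y

print(mi(pxy) >= mi(pxz))  # True: processing cannot create information about X
```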

I can imagine explaining a result in algebraic topology to someone, since there is usually an intuitive geometric picture that can be drawn. But with information theory, if I had to explain a result at a comparable level, with the analogue of a picture, I would not be able to.

When I do problems it's just abstract symbolic manipulation and trial and error. I am looking for an explanation of the various terms (not one of these "blah gives information about blah" explanations) that will make the solutions to problems appear meaningful.

Right now I feel like someone trying to do algebraic topology purely symbolically without thinking about geometric pictures.

Is there a book that will cure my curse?
