The following quoted section, as it's currently written, confuses entropy with Kolmogorov complexity. https://en.wikipedia.org/wiki/Kolmogorov_complexity
"For example, the sequence of letters aaaaa has low entropy, conceptually, as it is the same letter appearing multiple times; it could be abbreviated to just 5 a’s. In contrast, the sequence of letters bkawe has high entropy, with no apparent pattern, and no apparent way to abbreviate it without losing its content. Shannon’s view of information was thus as an amount of information, measured by the compressibility of some data.
Another way to think about Shannon’s entropic idea of information is through probability: if we were to observe each character in the sequences above, and make a prediction about the likelihood of the next character, the first sequence would result in increasingly high confidence of seeing another a. In contrast, in the second sequence, the probability of seeing any particular letter is quite low. The implication of these ideas is that the more rare “events” or “observations” in some phenomenon, the more information that is required to represent it."
Specifically, when you say the sequence "could be abbreviated to just 5 a's", that's most definitely Kolmogorov complexity, which perhaps would merit inclusion in the book too. The reason this example must fall back on Kolmogorov complexity is that you don't make any reference to a random variable describing how those sequences were generated. Without a prior distribution you have to appeal to Kolmogorov complexity; with a prior distribution, the entropy is well defined and thus provides a bound on compressibility.
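To make the "with a prior distribution" point concrete, here's a minimal Python sketch (the letter frequencies are made up purely for illustration, not taken from the book):

```python
import math

# Hypothetical prior over letters: a source that emits 'a' most of the time.
prior = {"a": 0.9, "b": 0.025, "k": 0.025, "w": 0.025, "e": 0.025}

# Shannon entropy in bits per symbol: H = -sum(p * log2(p))
H = -sum(p * math.log2(p) for p in prior.values())
print(f"entropy of the source: {H:.3f} bits/symbol")

# By the source coding theorem, no lossless code can use fewer than H bits
# per symbol on average, so H bounds compressibility -- but only once the
# generating distribution has been specified.
```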
Also, I don't believe the second paragraph I quoted above is correct either. To follow the reasoning as it's currently written, you need at least some modeled hyperparameter governing the generation of the next letter in the sequence, with a distribution over the values of that hyperparameter itself. If you don't want to complicate the example, one might think we could adjust the second paragraph to just say that if a letter sequence were generated by drawing letters uniformly at random, the sequence 'aaaaa' would have low entropy and 'bkawe' would have high entropy since it looks random -- but that's not true! Under a uniform i.i.d. model both sequences are equally probable, so they carry exactly the same amount of information.
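A quick sanity check of that claim (a Python sketch, assuming 26 lowercase letters drawn i.i.d. uniformly at random): both sequences have probability (1/26)^5, so their surprisal -log2 P(seq) is identical.

```python
import math

def surprisal_bits(seq, alphabet_size=26):
    # -log2 P(seq) under a uniform i.i.d. model over the alphabet
    return -math.log2((1 / alphabet_size) ** len(seq))

print(surprisal_bits("aaaaa"))  # ~23.5 bits
print(surprisal_bits("bkawe"))  # ~23.5 bits -- exactly the same
```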
Perhaps better would be to use an example where you toss a coin x times and count how many heads and tails you get (like this example: https://courses.lumenlearning.com/physics/chapter/15-7-statistical-interpretation-of-entropy-and-the-second-law-of-thermodynamics-the-underlying-explanation/). Then a set of tosses that is all heads would be very low entropy, and a set of tosses with about equal heads and tails would be high entropy. The key distinction is whether you view the outcome as a sequence or as a set; I think heads/tails are easier to think of as a set.
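Here's a small Python sketch of that coin-toss version, counting microstates per macrostate (the choice of 10 tosses is arbitrary): the all-heads macrostate corresponds to a single arrangement, while the half-and-half macrostate corresponds to many, which is why it's the high-entropy one.

```python
from math import comb, log2

n = 10  # number of tosses; 10 is arbitrary, just for illustration
for heads in (n, n // 2):
    microstates = comb(n, heads)  # number of orderings with this many heads
    print(f"{heads} heads out of {n}: {microstates} arrangement(s), "
          f"log2 = {log2(microstates):.2f} bits")
```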
"Elements of Information Theory" by Cover and Thomas is a book I like on Shannon information
Anyhow, thanks for putting this book up. I've been quite enjoying it so far, just thought I could help on this little detail.