Hidden Markov Models Made Easy

Hidden Markov Models Made Easy
By Anthony Fejes

Markov Models are conceptually not difficult to understand, but because they are heavily based on a statistical approach, it's hard to separate them from the underlying math. This page is an attempt to simplify Markov Models and Hidden Markov Models, without using any mathematical formulas.

Brief overview of a Model

A Markov Model, in the context of Molecular Genetics is nothing more than a series of probabilities which tell you how likely a particular sequence is to have descended from a particular "Ancestral" sequence, or vice versa, what the most probable "Ancestral" sequence is. Tada. Now you know what a Markov Model is. However, that takes most of the elegance of the process and puts it out of it's misery. The beauty of the model is that, among many other things, it can create it's own "Ancestral" sequence and set of rules.

Markov Models:

A Markov Model (MM) can be thought of as a board game, albeit not a particularly fun board game and certainly not one I'd pull out on a lazy sunday evening, but a board game of sorts. Somewhat like a cross between snakes and ladders (since the squares are often connected to non adjacent squares) and a really weird version or trivial pursuit, where each square you land on gives you an answer instead of asking a question. However, very much different from either of those games, in this game, you're allowed to stay on the same square for long periods of time. The analogy to snakes and ladders would be for the ladder to take you to the same square that you started on, which really doesn't make much sense in that game.

The rules of the Markov Model Game:

    1. Each square will give you a letter. (In the case of DNA, you only have a 4 letter alphabet to work with, ACTG. For proteins, with a 20 letter alphabet, you have a slightly more complex model to deal with.) Each square will give out the letters in different proportions. (Some squares will give As and C's most of the time, or some will just give out G's all the time.. but, most importantly, no two squares are the same.)
    1a. Each letter has a number (or score) between 1 and 0 attached to it.
    1b. At the end of the game, you multiply all of those numbers together to obtain your final score. (It's hard to keep ALL the math out of it). Imagine this as keeping score, much like in scrabble, but in some games, the person with the least points wins!

2. The longer you stay on one square, the better your model is. Hence;
2a. Each time you move from one square to another, you are penalized. (Not to say you might not get a penalty for staying in the same place too... can't let the rules get too easy!)

3. There are two squares which you can go to at any time, the delete square and the insert square, with little or no penalty attached, however they do not give you a letter and thus, no number is generated either.

The goal of the Markov Model Game.

The object of the Markov Model Game is to take any given sequence and find how likely it is to have come from your "Ancestral" sequence, that is, to obtain the highest possible score, while raking up the fewest possible penalties. Alternately, you can use a single sequence, and determine how well it fits your model. And that is exactly what we'll try to do with an example.

A really simple version of the game would have two squares, and would start with the sequence:

AAATATTATACTATCGGCGAGCGCG

The two squares in this game would be:

Square 1

In Square 1, you get A's and T's nearly all of the time, but you can get the Rare C or G.

Square 2

In Square 2, you get C's and G's nearly all of the time, but you can get the Rare A or T.

given this example, you can tell that the best strategy for playing this round would be to stay on the first square until you've reached the 15th letter, then move over to the 2nd square. As far as the first C in the sequence (11th letter) is concerned, moving to square 2 and then back would have incurred an additional 2 moving penalties (call that strategy A), whereas staying on Square 1 and accepting a low score for that letter (strategy B) is much more likely to help you win the round.

Hidden Markov Models

Here's where the game becomes REALLY weird. Hidden Markov Models (HMMs) can be hidden to different degrees (1) depending on what you aren't allowed to see. A true Hidden Markov Model arises when you aren't allowed to know what squares a player took to win the game... however, you are allowed to know the sequence of letters that were emitted as they won the game. Yet, the path that the player took to get those letters can't be seen. This is really starting to stretch the model, but that would essentially be like playing the board game in the dark, while the board moves around... ok.. that's really not getting anywhere. Perhaps it's more like having someone else play the game for you, but won't tell you what square they're putting the game piece on... but you get the point. That may sound rather odd, after all why would you want someone else to play a game for you, especially when they won't tell you what's going on?

Well, if we know what the board looks like, but not the path, we can use the theory of a Hidden Markov Model to solve: (2)

Problem #1: Given an observed set of letters, what's the best possible score you can obtain?
Problem #2: Given an observed set of letters, what's the best possible path you can travel?
Problem #3: How can we create a better board to play a specific game?

Unfortunately, like any board game.. how can you improve on a best seller? In fact, there is no known method to solve problem #3 mathematically, although small modifications can be found, there is no way to ever find the best possible game for any particular round. As far as I can tell, this isn't impossible, it's just that no one knows how to do it. There is nearly always room to improve. (Note to mathematicians, someone might want to let me know if this is an NP-complete and I just don't know it.)

Regardless of what we don't know or what we want to find out, there are always some common elements to a Hidden Markov Model. We might not know all or any of them, but they the pieces that are included with the HMM game.

1. In order to do anything productive, you must know the number of squares
2. You also need to know the size of the alphabet your game uses. (DNA uses 4, Proteins use 20 or so..)
3. There is also a probability of moving from one square to the next on any given move. Consider these the dice of the game.
4. As well, there is a probability of any given square emitting a particular letter... a second set of dice.
5. And, last but not least, you need to know how everything starts. This is rather like monopoly, where the rule sheets tell you how much money each player gets. Although, for this game, you just need to know which square says GO. Sometimes this can be decided arbitrarity, but you do need to start somewhere. If everyone agrees that you're going to start monopoly on free parking, that works too.)

Various types of Hidden Markov Models

An interesting consequence of Markov Models comes from the fact that you can design different boards to play on. From any one square, you don't necessarily have to be able to get to any other square and you can make paths between any two squares one directional. Imagine a four square model, where the squares are placed so that one corner of each box meets in the centre and you can go from any square to any other square.

1	2
3	4

First imagine a game where you can move from any square to any other square.. this is the most "complex" case, despite being easiest to explain.

Now, to create a second game, use the same configuration, but this time you can't move diagonally....

A third game would have you start on square 1 and only move to numbers that are greater than the number that you're on... (for example.. in the order 1, 3, 4.) Obviously you'd have to stop on square 4.

A fourth game would have you only move clockwise on the squares.. 1, 2, 3, 4, 1, 2, 3.... etc.

By now, you get the idea. There are an incredible number of games you can create with only 4 squares. And, naturally, you can increase the number of squares, thus increasing the number of games even further.

At any rate, you now have an endless number of games you can play on a lazy sunday afternoon... but as I said before, Markov Models wouldn't be my choice of games. I'd prefer a good game of hearts or Silly Bridge or even Trivial Pursuit, but this should be good food for thought... Trivial pursuit has 73 squares, imagine how many Markov Models you could make out of that. Happy Gaming!s

References

1. http://www.hd.uib.no/corpora/1995-2/0044.html

2. Rabiner, Lawrence R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, No. 2, February 1989. P 257-286

Other references

Krough, et al. Hidden Markov Models in Computational Biology, Applications to Protein Modeling. J. Mol. Biol. (1994) 235, 1501-1531

Eddy, Sean R. Current Opinion in Structural Biology, 1996; 6: 361-365