HMM in R – example applications in finance and bioinformatics

BY IN Notes, R, Tutorials NO COMMENTS YET , ,

A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with hidden states. In this tutorial overview, we will explore example applications of HMMs in finance and bioinformatics using the depmixS4 library in R.

library(depmixS4)

Finance – HMM for bear vs. bull market state inference

In finance, the hidden state of interest may be the market state (bull vs. bear). What is actually observed are stock prices over time. Consider the following observations on Apple stock prices.

# Get data
library(Quandl)
data <- Quandl("GOOG/NASDAQ_AAPL")
data <- data[rev(order(data$Date)), ]

Intuitively, if the market (hidden) is in a bear market, then we should see a falling price trend that is reflected in a negative daily return (observed). Likewise, in a bull market (state 2), then we should see a rising price trend that is reflected in a positive daily return (observed). Thus, we can think of the HMM as a mixture model where the market state, which controls the mixture component to be select for each observation on daily returns, are related through a Markov process.

# Look at old data
start <- 31*3 # last 3 months from 3 months ago
mydata <- data[start:(2*start),]

# Convert to daily returns
# The book formula is 
# Close_price_today - Close_price_yesterday / Close_price_yesterday
ret <- diff(mydata$Close)/mydata$Close[-nrow(mydata)]

# Use HMM
# Create model, 2 states: either going up or going down
mod <- depmix(ret ~ 1, data=data.frame(ret=ret), nstates=2, family=gaussian())
# Use Baum–Welch algorithm to fit model
f <- fit(mod)
summary(f)

# Get the estimated state for each timestep 
esttrans <- posterior(f)

# Plot
par(mfrow=c(3,1))
plot(mydata$Date, mydata$`Close`, type='l', main='Trajectory')
plot(mydata$Date[-nrow(mydata)], esttrans[,1], type='l', main='Estimated state')
plot(mydata$Date[-nrow(mydata)], ret, type='l', main='Returns')
apple1

We can fit our HMM using one part of the data, and apply to another part.

# Take fit and apply to new data
newdata <- data[1:start,] # last 3 months
ret <- diff(newdata$Close)/newdata$Close[-nrow(newdata)]
mod <- depmix(ret ~ 1, data=data.frame(ret=ret), nstates=2, family=gaussian())

# From old model
fitted <- c(unlist(getpars(f)))
mod <- setpars(mod, fitted)
# fix response parameters
fixed <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
fnew <- fit(mod, fixed=fixed)
summary(fnew)
summary(f) # check same

# Get the estimated state for each timestep 
esttrans <- posterior(fnew)

# Plot
par(mfrow=c(3,1))
plot(newdata$Date, newdata$`Close`, type='l', main='Trajectory')
plot(newdata$Date[-nrow(newdata)], esttrans[,1], type='l', main='Estimated state')
plot(newdata$Date[-nrow(newdata)], ret, type='l', main='Returns')
apple2

Bioinformatics - HMM for copy number inference

A normal diploid cell or human has paired chromosomes. For germline heterozygous SNPs, consider the 2 alleles named arbitrarily A and B. B allele frequency (BAF), can be calculated as the proportion of the total allele signal (A + B) explained by a single allele (A). A BAF of 0 represents the genotype (A/A or A/–), whereas 0.5 represents (A/B) and 1 represents (B/B or B/–).

Thus, in a normal region, we expect BAFs to be approximately 0.5 (with some variability given limitations of measurement and technical noise). However, in a clonal deletion region, BAFs will deviate towards 0 and 1. Thus, we can think of the HMM as a mixture model where the deletion state controls the mixture component to be select for each observation on BAFs.

# Simulate BAF
n = 100
baf <- c(rbeta(n, 10, 10, ncp = 0), rbeta(n, 0.1, 0.1, ncp = 0), rbeta(n, 10, 10, ncp = 0))

# Use HMM to infer deletion state
# Create model, 2 states: either deletion or not
mod <- depmix(baf ~ 1, data=data.frame(baf=baf), nstates=2, family=gaussian())
# Use Baum–Welch algorithm to fit model
f <- fit(mod)
summary(f)
# Get the estimated state for each SNP position
esttrans <- posterior(f)
# Plot together
par(mfrow=c(2,1))
plot(1:(3*n), baf, main='BAF', xlab="SNP", ylab="fraction")
plot(1:(3*n), esttrans[,1], type='l', main='Estimated state', xlab="SNP", ylab="state")
baf

So, what do you think ?