## Distance Matrix with Custom Function in R

This R function lets you build a distance matrix from a custom distance function rather than a built-in metric such as Euclidean distance. Given a list (or vector) and a function that takes two elements and returns a single number, it returns the pairwise distances between all elements as a `dist` object.

code:

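    # Build a distance matrix from any pairwise distance function.
    # my.list: list or vector of items; my.function(x, y): returns one number.
    custom.dist <- function(my.list, my.function) {
      n <- length(my.list)
      mat <- matrix(0, ncol = n, nrow = n)
      colnames(mat) <- rownames(mat) <- names(my.list)
      for (i in seq_len(n)) {
        for (j in seq_len(n)) {
          # [[ ]] passes the element itself, so true lists work as well as vectors
          mat[i, j] <- my.function(my.list[[i]], my.list[[j]])
        }
      }
      # as.dist() keeps only the lower triangle of mat
      return(as.dist(mat))
    }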
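
Because `as.dist()` keeps only the lower triangle, a symmetric distance function does not need to be evaluated for every ordered pair. The following is a minimal sketch of that idea, not a replacement for the function above: the name `custom.dist.lower` and the toy distance `len.diff` are made up for illustration, and the sketch assumes the distance function is symmetric and returns 0 for identical elements.

```r
# Hypothetical variant of custom.dist(): evaluates each unordered pair once.
# Assumes my.function(x, y) == my.function(y, x) and a zero self-distance.
custom.dist.lower <- function(my.list, my.function) {
  n <- length(my.list)
  mat <- matrix(0, nrow = n, ncol = n,
                dimnames = list(names(my.list), names(my.list)))
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {  # only j < i, i.e. the lower triangle
      mat[i, j] <- my.function(my.list[[i]], my.list[[j]])
    }
  }
  as.dist(mat)
}

# Toy distance (illustrative only): absolute difference in string length
len.diff <- function(x, y) abs(nchar(x) - nchar(y))
words <- c(a = "cat", b = "horse", c = "ox")
d <- custom.dist.lower(words, len.diff)
print(d)
```

This roughly halves the number of calls to the distance function, which matters mostly when each call is expensive.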


As an example, we can define a custom distance function based on the BLOSUM62 substitution matrix to score how different two protein sequences are, as a rough proxy for evolutionary distance.

example implementation:

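    # Read the BLOSUM62 substitution matrix (rows and columns are amino acids)
    b62 <- as.matrix(read.table("ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62", check.names=FALSE))

    ## Blosum function to quantify the difference between two sequences.
    ## It sums the absolute BLOSUM62 scores over every pairing of a residue
    ## found only in x with a residue found only in y (position is ignored).
    blosum <- function(x, y) {
      a <- strsplit(x, "")[[1]]   # split each sequence into single residues
      b <- strsplit(y, "")[[1]]
      score <- sum(abs(b62[a[!(a %in% b)], b[!(b %in% a)]]))
      return(score)
    }

    aa.seqs <- c(a="ANQGH", b="ANCGH", c="ANQEH", d="ANQES", e="RDCGH", f="RNCGH")
    dis <- custom.dist(aa.seqs, blosum)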


results:

    > dis
       a  b  c  d  e
    b  3
    c  2 11
    d  3 13  1
    e 11  4 21 27
    f  5  1 13 18  1


You can then use this custom distance matrix downstream for clustering, tree construction, etc.
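
As a sketch of the clustering step, the block below runs base R's `hclust()` on these distances. To keep it runnable without the download, the values are copied by hand from the printed `dis` above. The choice of average linkage and of two clusters is an assumption made for illustration, not something the post prescribes.

```r
# Distances copied from the printed `dis` above, entered column by column
labs <- c("a", "b", "c", "d", "e", "f")
m <- matrix(0, 6, 6, dimnames = list(labs, labs))
m[lower.tri(m)] <- c(3, 2, 3, 11, 5,   # column a: rows b, c, d, e, f
                     11, 13, 4, 1,     # column b: rows c, d, e, f
                     1, 21, 13,        # column c: rows d, e, f
                     27, 18,           # column d: rows e, f
                     1)                # column e: row f
dis <- as.dist(m)

# Hierarchical clustering; "average" linkage is an illustrative choice
hc <- hclust(dis, method = "average")

# Cut the tree into two groups (k = 2 is also an illustrative choice)
groups <- cutree(hc, k = 2)
print(groups)

# plot(hc) would draw the dendrogram
```

Any function that accepts a `dist` object can be used the same way.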

1. Colin Gorrie

This saved me a lot of time! I used your custom.dist() to create a cost-weighted geographical distance matrix between Chinese dialects (I wanted to make rivers act like highways), to see how much linguistic distance I could attribute to geographical distance. Thanks!

2. Dave

Hi Jean,

I came here while checking to see if such a custom dist function already existed (I’m a big believer in writing as little code of my own as possible, even when it’s pretty easy code to write)… And since I ended up having to write my own, I figured I might as well share it here:

    custom.dist <- function(x, my.dist) {
      mat <- sapply(x, function(x.1) sapply(x, function(x.2) my.dist(x.1, x.2)))
      as.dist(mat)
    }

In general, it’s a good idea to use vectorized computation rather than ‘for’ loops in R: it is often much faster on very large arrays…