Distance Matrix with Custom Function in R

This R function lets you build a distance matrix from a custom distance function (rather than Euclidean, etc.). Given a list and a custom distance function, it returns a dist object holding the pairwise distances between all elements of the list, as computed by that function.

code:

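custom.dist <- function(my.list, my.function) {
    n <- length(my.list)
    mat <- matrix(0, ncol = n, nrow = n)
    colnames(mat) <- rownames(mat) <- names(my.list)
    # seq_len() avoids the 1:0 trap when my.list is empty
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            # [[ ]] extracts the element itself, so this works for lists as well as vectors
            mat[i, j] <- my.function(my.list[[i]], my.list[[j]])
        }
    }
    # as.dist() keeps only the lower triangle of mat
    return(as.dist(mat))
}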
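
Because as.dist() only reads the lower triangle, half of the calls in the double loop are redundant whenever the distance function is symmetric. The following is a minimal sketch of that idea, not part of the original function: custom.dist.lower and the toy distance len.diff are hypothetical names introduced here for illustration, and the sketch assumes my.function(x, y) equals my.function(y, x).

```r
# Hypothetical variant of custom.dist(): fills only the lower triangle,
# which is all that as.dist() reads. Assumes my.function is symmetric.
custom.dist.lower <- function(my.list, my.function) {
    n <- length(my.list)
    mat <- matrix(0, ncol = n, nrow = n)
    colnames(mat) <- rownames(mat) <- names(my.list)
    for (i in seq_len(n)) {
        for (j in seq_len(i - 1)) {   # j < i only; the diagonal stays 0
            mat[i, j] <- my.function(my.list[[i]], my.list[[j]])
        }
    }
    as.dist(mat)
}

# Toy distance for illustration: absolute difference in string length
len.diff <- function(x, y) abs(nchar(x) - nchar(y))

words <- c(one = "a", two = "abc", three = "abcdef")
d <- custom.dist.lower(words, len.diff)
print(d)
```

For an asymmetric function the full double loop is still needed to decide which direction ends up in the lower triangle, so this shortcut is only appropriate when symmetry holds.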

As an example, we can define a custom distance function based on the BLOSUM62 substitution matrix to quantify the evolutionary distance between two protein sequences.

example implementation:

b62 <- as.matrix(read.table("ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62", check.names=FALSE))
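## Blosum function to quantify the difference between two sequences
blosum <- function(x, y) {
    a <- strsplit(x, "")[[1]]   # split each sequence into single residues
    b <- strsplit(y, "")[[1]]
    # Sum |BLOSUM62 score| over every pairing of residues found only in x
    # with residues found only in y (set-based, not position-by-position)
    score <- sum(abs(b62[a[!(a %in% b)], b[!(b %in% a)]]))
    return(score)
}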
aa.seqs <- c(a="ANQGH",b="ANCGH",c="ANQEH",d="ANQES",e="RDCGH",f="RNCGH")
dis <- custom.dist(aa.seqs, blosum)

results:

> dis
   a  b  c  d  e
b  3            
c  2 11         
d  3 13  1      
e 11  4 21 27   
f  5  1 13 18  1

You can then use this custom distance matrix downstream for clustering, tree construction, etc.
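
As a rough sketch of that downstream step, the distances printed in the results above can be fed to base R's hclust(). To keep the snippet runnable without downloading BLOSUM62, the dist object is rebuilt by hand from those printed values. The choice of average linkage and of cutting the tree into two groups are assumptions made only for illustration.

```r
# Rebuild the distance object from the values printed in the results,
# entered column by column down the lower triangle (labels a..f)
m <- matrix(0, 6, 6, dimnames = list(letters[1:6], letters[1:6]))
m[lower.tri(m)] <- c(3, 2, 3, 11, 5,
                     11, 13, 4, 1,
                     1, 21, 13,
                     27, 18,
                     1)
dis <- as.dist(m)

# Hierarchical clustering; average linkage is an arbitrary choice here
hc <- hclust(dis, method = "average")

# Cut the tree into two groups (k = 2 is also just an illustration)
groups <- cutree(hc, k = 2)
print(groups)

# plot(hc) draws the dendrogram in an interactive session
```

Any function that accepts a dist object can be used in the same way in place of hclust().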

5 Comments

  1. Colin Gorrie

    This saved me a lot of time! I used your custom.dist() to create a cost-weighted geographical distance matrix between Chinese dialects (I wanted to make rivers act like highways), to see how much linguistic distance I could attribute to geographical distance. Thanks!

  2. Dave

    Hi Jean,

    I came here while checking to see if such a custom dist function already existed (I’m a big believer in writing as little code of my own as possible, even when it’s pretty easy code to write)… And since I ended up having to write my own, I figured I may share it here:

    custom.dist <- function(x, my.dist) {
        mat <- sapply(x, function(x.1) sapply(x, function(x.2) my.dist(x.1, x.2)))
        as.dist(mat)
    }

    In general, it’s a good idea to use vectorized computation rather than ‘for’ loops in R: much faster on very large arrays…
