The diffusr tutorial

Introduction

diffusr implements algorithms for network diffusion such as Markov random walks with restarts and weighted neighbor classification. Network diffusion has been studied extensively in bioinformatics, e.g. in the field of cancer gene prioritization. Network diffusion algorithms generally spread information in the form of node weights along the edges of a graph to other nodes. These weights can for example be interpreted as temperature, an initial amount of water, the activation of neurons in the brain, or the location of a random surfer in the internet. The information (node weights) is iteratively propagated to other nodes until a equilibrium state or stop criterion occurs.

Tutorial

First load the package:

library(diffusr)

Markov random walks

A Markov random walk with restarts calculates the stationary distribution: $\begin{equation} \mathbf{p}_{t+1} = (1 - r) \mathbf{W} \mathbf{p}_t + r \mathbf{p}_0, \end{equation}$ where $r \in (0,1)$ is a restart probability of the Markov chain, $\mathbf{W}$ is a column-normalized stochastic matrix (we do the normalization for you) and $\mathbf{p}_0$ is the starting distribution of the Markov chain. We calculate the iterative updates, it is also possible to do the math using the nullspace of the matrix (comes later).

If you want to use Markov random walks just try something like this:

  # count of nodes
  n <- 5
  # starting distribution (has to sum to one)
  p0    <- as.vector(rmultinom(1, 1, prob=rep(.2, n)))
  # adjacency matrix (either normalized or not)
  graph <- matrix(abs(rnorm(n*n)), n, n)
  # computation of stationary distribution
  pt    <- random.walk(p0, graph)

The stationary distribution should have changed quite a bit from the starting distribution:

  print(t(p0))

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    0    0    0

  print(t(pt))

##      p.inf     transition.matrix
## [1,] numeric,5 numeric,25

You can also use a matrix p0:

   p0 <- matrix(c(p0, runif(20)), nrow=n)
   pt <- random.walk(p0, graph)
   pt

## $p.inf
## [1] 0.20155762 0.58401309 0.04504716 0.06790495 0.10147718
## 
## $transition.matrix
##             [,1]       [,2]      [,3]       [,4]       [,5]
## [1,] 0.000000000 0.62768874 0.1240731 0.32918651 0.08452962
## [2,] 0.578537804 0.00000000 0.4073473 0.03317131 0.30382198
## [3,] 0.001322467 0.08413649 0.0000000 0.34241913 0.17182459
## [4,] 0.147539127 0.09740329 0.1010815 0.00000000 0.43982381
## [5,] 0.272600602 0.19077149 0.3674981 0.29522304 0.00000000

In the later case, a random walk is done over all columns of p0 separately.

Analytical vs iterative solution

In the last section we computed the iterative solution of the stationary distribution. You can also choose to do this analytically. In that case we need to take the inverse of the transition matrix which might lead to numerical instability, though. However, it usually runs faster than the iterative version.

   pt <- random.walk(p0, graph, do.analytical=TRUE)
   pt

## $p.inf
## [1] 0.2015512 0.5840193 0.0450461 0.0679042 0.1014792
## 
## $transition.matrix
##             [,1]       [,2]      [,3]       [,4]       [,5]
## [1,] 0.000000000 0.62768874 0.1240731 0.32918651 0.08452962
## [2,] 0.578537804 0.00000000 0.4073473 0.03317131 0.30382198
## [3,] 0.001322467 0.08413649 0.0000000 0.34241913 0.17182459
## [4,] 0.147539127 0.09740329 0.1010815 0.00000000 0.43982381
## [5,] 0.272600602 0.19077149 0.3674981 0.29522304 0.00000000

Nearest neighbors

Diffusion using nearest neighbors is done by traversing through a (weighted) graph and take all the neighbors of a node until a certain depths in the graph is reached. We find shortest paths using priority queues:

  # count of nodes
  n <- 10
  # indexes(integer) of nodes for which neighbors should be searched
  node.idxs <- c(1L, 5L)
  # the adjaceny matrix (does not need to be symmetric)
  graph <- rbind(cbind(0, diag(n-1)), 0)
  # compute the neighbors until depth 3
  neighs <- nearest.neighbors(node.idxs, graph, 3)

Let’s see what which nodes we got:

  print(neighs)

## $`1`
## [1] 2 3 4
## 
## $`5`
## [1] 6 7 8

Heat diffusion

A Laplacian heat diffusion process calculates the heat distribution over a graph after at a specific time $t$ : $\begin{equation} h_i(t) = h_i(0)\exp(-\lambda_i t) \end{equation}$ where $\mathbf{h}_0$ is the initial heat distribution, $\mathbf{h}_t$ is the heat distribution at time $t$ and $\boldsymbol \lambda$ are the eigenvalues of the of your graph.

You can use the Laplacian heat diffusion process like this:

  # count of nodes
  n <- 5
  # starting distribution (has to sum to one)
  h0    <- as.vector(rmultinom(1, 1, prob=rep(.2, n)))
  # adjacency matrix (either normalized or not)
  graph <- matrix(abs(rnorm(n*n)), n, n)
  # computation of stationary distribution
  ht    <- heat.diffusion(h0, graph)

Here are the results:

  print(t(h0))

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    1

  print(t(ht))

##            [,1]       [,2]       [,3]      [,4]      [,5]
## [1,] 0.07311824 0.07082704 0.03669767 0.3714729 0.7113146

As before, p0 can also be a matrix:

   h0 <- matrix(c(h0, runif(20)), nrow=n)
   ht <- heat.diffusion(h0, graph)
   ht

##            [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.07311824 0.5026732 0.8447452 0.7071784 1.0029935
## [2,] 0.07082704 0.6723834 1.0313002 0.3950428 0.7453103
## [3,] 0.03669767 0.2953058 0.8854613 0.1676122 0.3530274
## [4,] 0.37147286 0.3374178 0.8467648 0.8508218 0.9281102
## [5,] 0.71131463 0.2777378 0.7588318 0.4574978 0.4870314