Background

Fourier Analysis

Reference

Graph Laplacian

Reference

Graph Fourier Transform

References [1], [2]

Batch Norm

Graph Networks

Graph Convolutional Network

https://arxiv.org/abs/1609.02907

\[Z=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X^{k-1}\Theta\]

where \(\tilde{A}=A+I_N\) is the adjacency matrix with added self-loops and \(\tilde{D}\) is its degree matrix, \(\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}\).

An activation function is then applied to \(Z\), together with a bias term:

\[X^k=\sigma(Z+b)\]
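
As a concrete illustration, here is a minimal NumPy sketch of one such layer (the function and variable names are illustrative, not from the paper's code; \(\sigma\) is taken to be ReLU):

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, X, Theta, b):
    """One GCN layer: X^k = ReLU(A_hat @ X^{k-1} @ Theta + b)."""
    Z = A_hat @ X @ Theta
    return np.maximum(Z + b, 0.0)  # ReLU as the activation sigma

# Toy usage: 4 nodes, 3 input features, 2 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.random.randn(4, 3)
Theta = np.random.randn(3, 2)
b = np.zeros(2)
X1 = gcn_layer(normalize_adj(A), X, Theta, b)
```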

Highway GCN

https://arxiv.org/abs/1804.08049

\[\tilde{X}^k=\sigma(Z+b)\] \[T=\text{sigmoid}(WX^{k-1}+c)\] \[X^k=\tilde{X}^k\odot T + X^{k-1}\odot(1-T)\]

where \(Z\) is the same as the \(Z\) in GCN.
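
A minimal sketch of this gating, continuing the NumPy notation above (names are illustrative; the gate parameters \(W\), \(c\) are assumed to map the input dimension to itself, since the highway mixing requires matching input/output sizes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_gcn_layer(A_hat, X_prev, Theta, b, W, c):
    """Highway GCN layer: a gate T mixes the GCN output with the layer input."""
    Z = A_hat @ X_prev @ Theta            # same Z as in the GCN layer
    X_tilde = np.maximum(Z + b, 0.0)      # candidate update, sigma = ReLU
    T = sigmoid(X_prev @ W + c)           # per-node, per-feature gate in (0, 1)
    return X_tilde * T + X_prev * (1.0 - T)
```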

Simplifying Graph Convolutional Networks

https://arxiv.org/abs/1902.07153

\[S=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\] \[Z=\sigma(S^KX\Theta)\]

where all non-linearities between layers are removed, so the \(K\) propagation steps collapse into the single linear operator \(S^K\).
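
A minimal sketch, assuming \(S\) has already been normalized as above and that the final \(\sigma\) (e.g. a softmax for classification) is applied outside the function:

```python
import numpy as np

def sgc_logits(S, X, Theta, K=2):
    """Simplified GCN: propagate features K times with the fixed matrix S,
    then apply a single linear map (no non-linearities between layers)."""
    H = X
    for _ in range(K):       # computes S^K X by repeated propagation
        H = S @ H
    return H @ Theta         # sigma (e.g. softmax) is applied on top
```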

FastGCN

https://arxiv.org/abs/1801.10247

A GCN layer can be written as

\[S=\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\] \[Z=\sigma(SX\Theta)\]

where \(X\in\mathbb{R}^{N\times C}\) is the input feature matrix whose \(i\)-th row holds the features of the \(i\)-th node, so that

\[Z_{i,:}=\sum\limits_{j=1}^NS_{ij}X_{j,:}\Theta\]

If, for each layer, we sample \(T\) nodes uniformly from the \(N\) nodes and denote the sampled index set by \(I_T\), then an unbiased estimate is

\[Z_{i,:}=\frac{N}{T}\sum\limits_{j=1}^N1_{j\in I_T}S_{ij}X_{j,:}\Theta\]

Further, we can use importance sampling (sample the nodes following a distribution \(q\))

\[Z_{i,:}=\frac{1}{T}\sum\limits_{j=1}^N1_{j\in I_T}q_j^{-1}S_{ij}X_{j,:}\Theta\]

with \(q_j\propto\|S_{:,j}\|^2\).

The sampling in each GCN layer is independent.
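
A minimal NumPy sketch of the importance-sampled estimator for one layer (illustrative names; sampling is done with replacement, which matches the Monte-Carlo estimator up to minor bookkeeping):

```python
import numpy as np

def fastgcn_layer_estimate(S, X, Theta, T, rng=np.random.default_rng(0)):
    """Monte-Carlo estimate of Z = S @ X @ Theta by sampling T columns of S
    with importance probabilities q_j proportional to ||S[:, j]||^2."""
    N = S.shape[0]
    q = np.sum(S ** 2, axis=0)
    q = q / q.sum()
    idx = rng.choice(N, size=T, replace=True, p=q)    # sampled node indices I_T
    # Z_i = (1/T) * sum_{j in I_T} q_j^{-1} S_{ij} X_{j,:} Theta
    return (S[:, idx] / q[idx]) @ X[idx] @ Theta / T
```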

Relational Graph Convolutional Network

https://arxiv.org/abs/1703.06103

Graph Attention Networks

https://arxiv.org/abs/1710.10903

\[h_i=\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}Wh_j\right)\]

where \(\alpha_{ij}\propto\exp(\text{LeakyReLU}(a^T[Wh_i;Wh_j]))\), normalized over \(j\in\mathcal{N}_i\) via softmax, with a learnable weight vector \(a\).

The attention can be extended to multiple heads as in the Transformer, where the \(h_i\) from the different heads are concatenated. With multi-head attention, the head outputs at the final layer are averaged before the activation:

\[\sigma\left(\frac{1}{K}\sum\limits_k\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}^kW^kh_j\right)\]

Skip connections can also be combined with the attention mechanism.
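
A minimal single-head sketch in NumPy (dense adjacency with self-loops assumed; names are illustrative and \(\sigma\) is taken to be ReLU):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(A, H, W, a):
    """Single-head GAT layer.
    A: (N, N) adjacency with self-loops, H: (N, F) features,
    W: (F, Fp) weight matrix, a: (2*Fp,) attention vector."""
    Wh = H @ W                                        # (N, Fp)
    Fp = Wh.shape[1]
    s_src = Wh @ a[:Fp]                               # a_1^T (W h_i), shape (N,)
    s_dst = Wh @ a[Fp:]                               # a_2^T (W h_j), shape (N,)
    e = leaky_relu(s_src[:, None] + s_dst[None, :])   # e_ij = LeakyReLU(a^T [Wh_i; Wh_j])
    e = np.where(A > 0, e, -np.inf)                   # restrict attention to neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)  # softmax over j in N_i
    return np.maximum(alpha @ Wh, 0.0)                # h_i = sigma(sum_j alpha_ij W h_j)
```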

Line Graph Neural Network

https://arxiv.org/abs/1705.08415

Large-Scale Learnable Graph Convolutional Network

https://arxiv.org/abs/1808.03965

LGCL automatically selects a fixed number of neighboring nodes for each feature, based on value ranking, in order to transform graph data into a grid-like 1-D structure, thereby enabling regular convolutional operations on generic graphs.
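
A rough sketch of the per-feature value ranking for one node's neighborhood (illustrative only; the paper additionally stacks the node's own features and applies a 1-D CNN on the resulting grid):

```python
import numpy as np

def lgcl_select(X, neighbor_idx, k):
    """For one node, rank each feature's values across its neighbors and keep
    the k largest per feature, yielding a fixed (k, C) grid (zero-padded if
    there are fewer than k neighbors)."""
    feats = X[neighbor_idx]                          # (num_neighbors, C)
    C = feats.shape[1]
    if feats.shape[0] < k:
        feats = np.vstack([feats, np.zeros((k - feats.shape[0], C))])
    # sort each column (feature) independently in descending order, keep top k
    return -np.sort(-feats, axis=0)[:k]
```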

GraphSAGE (sample and aggregate)

https://arxiv.org/abs/1706.02216

Diffusion-Convolutional Neural Networks

https://arxiv.org/abs/1511.02136

Gated Graph Sequence Neural Networks

https://arxiv.org/abs/1511.05493

\[h_{n(v)}=\sum\limits_{u\in n(v)}h_u^{k-1}+b\] \[h_v^{k}=\text{GRU}(h_{n(v)},\,h_v^{k-1})\]

where the aggregated neighbor message \(h_{n(v)}\) is used as the GRU input and \(h_v^{k-1}\) as its hidden state.
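
A minimal NumPy sketch of one propagation step, using a plain GRU cell (illustrative names; the per-edge-type weights from the paper are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """Standard GRU cell: x is the input, h the previous hidden state."""
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde

def ggnn_step(A, H, b, gru_params):
    """One GGNN step: sum neighbor states (+ bias) as the message,
    then update each node's state with a GRU."""
    msg = A @ H + b          # h_{n(v)} = sum_{u in n(v)} h_u^{k-1} + b
    return gru_cell(msg, H, *gru_params)
```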

Jumping Knowledge Networks

https://arxiv.org/abs/1806.03536

The idea is that, within the same graph, the same number of propagation steps can have very different effects for different nodes, because the range of neighboring nodes a node's representation draws from depends strongly on the graph structure. For a node \(v\), denote its representations at the different layers as \(h_v^1,\cdots,h_v^k\).

The final representation is then an aggregation of these layer-wise representations, e.g., by concatenation, element-wise max-pooling, or LSTM-attention.
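
A minimal sketch of two such aggregators (concatenation and element-wise max; names are illustrative):

```python
import numpy as np

def jk_aggregate(layer_reps, mode="concat"):
    """Combine per-layer node representations h_v^1, ..., h_v^k.
    layer_reps: list of (N, F) arrays, one per layer."""
    if mode == "concat":
        return np.concatenate(layer_reps, axis=1)       # (N, k*F)
    if mode == "max":
        return np.stack(layer_reps, axis=0).max(axis=0) # element-wise max over layers
    raise ValueError(mode)
```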