Background
Fourier Analysis

Fourier Transform
$\hat{f}(\omega)=\int_{-\infty}^{\infty}f(t)e^{-i\omega t}\,dt$

Inverse Fourier Transform
$f(t)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\hat{f}(\omega)e^{i\omega t}\,d\omega$

Properties

Convolution theorem
$\widehat{f*g}=\hat{f}\cdot\hat{g}$, i.e. convolution in the time domain is multiplication in the frequency domain.
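The discrete analogue of the convolution theorem can be checked numerically. A minimal NumPy sketch (signal length and values are arbitrary; circular convolution is used, since that is what the DFT diagonalizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f = rng.standard_normal(n)
g = rng.standard_normal(n)

# Circular convolution computed directly in the "time" domain.
conv = np.array([sum(f[m] * g[(k - m) % n] for m in range(n)) for k in range(n)])

# Convolution theorem: the DFT of f*g equals the elementwise product of the DFTs.
lhs = np.fft.fft(conv)
rhs = np.fft.fft(f) * np.fft.fft(g)
assert np.allclose(lhs, rhs)
```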

Impossibility of Perfect Band-Limiting
A nonzero signal with a perfectly band-limited spectrum must persist for all time; equivalently, no nonzero signal can be compactly supported in both time and frequency.

A filter $H$ maps a signal into another signal.

Linear
$H(af+bg)=aHf+bHg$

Time invariant
If $g(t)=f(t-\tau)$, then $(Hg)(t)=(Hf)(t-\tau)$.

Theorem
If a filter is linear and time invariant, there exists a function $h$ (the impulse response) such that $Hf=h*f$.


Graph Laplacian

Laplacian
The pointwise formulation for the Laplacian acting on a function $f$ is
$(\Delta f)(x)=\sum_i\frac{\partial^2 f}{\partial x_i^2}(x)$
 Graph Laplacian
 $L=D-A$
 $A$ is the adjacency matrix, $A_{ij}=1$ if nodes $i,j$ are connected and $0$ otherwise
 $D$ is diagonal with $D_{ii}=\sum_j A_{ij}$
 Weighted version: $L=D-W$
 $W_{ij}$ is the weight of the edge between nodes $i,j$
 $D$ is diagonal with $D_{ii}=\sum_j W_{ij}$
 Assuming nonnegative edge weights, $L$ is positive semidefinite, since $f^TLf=\frac{1}{2}\sum_{i,j}W_{ij}(f_i-f_j)^2\ge 0$.
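A minimal NumPy check of the quadratic-form identity above on a hypothetical unweighted 4-node graph, confirming that $f^TLf$ matches the edge-sum form and hence $L\succeq 0$:

```python
import numpy as np

# Small undirected graph: hypothetical 4-node adjacency matrix.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # combinatorial graph Laplacian

# f^T L f = 1/2 * sum_ij A_ij (f_i - f_j)^2 >= 0, so L is PSD.
f = np.array([1.0, -2.0, 0.5, 3.0])
quad = f @ L @ f
by_edges = 0.5 * sum(A[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4))
assert np.isclose(quad, by_edges) and quad >= 0
```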

Normalized Graph Laplacian
$L_{\mathrm{sym}}=D^{-1/2}LD^{-1/2}=I-D^{-1/2}WD^{-1/2}$; its eigenvalues lie in $[0,2]$.
 Properties
 $L\mathbf{1}=0$, so zero is an eigenvalue (for $L_{\mathrm{sym}}$ the corresponding eigenvector is $D^{1/2}\mathbf{1}$).
 The multiplicity of the eigenvalue zero is equal to the number of connected components of the graph.
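A quick numerical check of the multiplicity property, on a hypothetical 5-node graph with two connected components:

```python
import numpy as np

# Two disconnected components: {0, 1} and {2, 3, 4}.
A = np.zeros((5, 5))
for i, j in [(0, 1), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(1)) - A

eigvals = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.abs(eigvals) < 1e-9))
# Multiplicity of eigenvalue 0 equals the number of connected components.
assert num_zero == 2
```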
 Embedding

Supposing the graph is connected (there is a path between every pair of nodes), the eigenvector associated with the smallest nonzero eigenvalue (the Fiedler vector) provides the best one-dimensional embedding:
$\min_f f^TLf$ subject to $\|f\|=1$,
imposing the constraint $f\perp\mathbf{1}$ to prevent the trivial constant solution.

Embedding in $\mathbb{R}^d$
Consider $f:V\rightarrow\mathbb{R}^d$; then $f_i\in\mathbb{R}^d$ is the embedding of the $i$th node. Let $F=[f_1,f_2,\cdots]^T$ be the embedding matrix of size $N\times d$; then
$\min_F \operatorname{tr}(F^TLF)$,
imposing the constraint $F^TF=I$ (with columns orthogonal to $\mathbf{1}$).
The solutions are the eigenvectors of $L$ corresponding to the $d$ smallest nonzero eigenvalues.
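As a sketch, on a hypothetical path graph the Fiedler vector orders the nodes along the path, which is exactly the best 1-D embedding:

```python
import numpy as np

# Path graph 0-1-2-3-4 and its combinatorial Laplacian.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A

vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
fiedler = vecs[:, 1]             # eigenvector of the smallest nonzero eigenvalue
# 1-D embedding: consecutive path nodes get monotonically ordered coordinates
# (the overall sign of an eigenvector is arbitrary).
assert np.all(np.diff(fiedler) > 0) or np.all(np.diff(fiedler) < 0)
```

For an $\mathbb{R}^d$ embedding one would take `vecs[:, 1:d+1]`, the eigenvectors of the $d$ smallest nonzero eigenvalues.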


Graph Laplacian Equation
Let $G=(V,E)$ be a graph and consider a scalar function $f:V\rightarrow\mathbb{R}$. $f$ satisfies the graph Laplacian equation if $Lf=0$, where $f$ is the vector of values over all nodes.
Graph Fourier Transform

Graph Fourier Transform
$\hat{f}(\lambda_l)=\sum_{i=1}^N f(i)\,u_l(i)$ (the function $\hat{f}$ is only defined on the eigenvalues)
$f(i)=\sum_{l=1}^N \hat{f}(\lambda_l)\,u_l(i)$ (the function $f$ is defined on each node, for $i=1$ to $N$)
where $\lambda_l$ and $u_l$ are the $l$th eigenvalue and eigenvector of $L$.
Think of $f$ and $\hat{f}$ as vectors and each $u_l$ as a vector; let $U=[u_1,\cdots,u_l,\cdots,u_N]$ be a matrix of size $N\times N$, then
$\hat{f}=U^Tf,\qquad f=U\hat{f}$
Columns of $U$ are eigenvectors of $L$, with $L=U\Lambda U^T$ and $\Lambda=\operatorname{diag}(\lambda_1,\cdots,\lambda_N)$.

Parseval’s Identity
$\langle f,g\rangle=\langle\hat{f},\hat{g}\rangle$. In particular, $\|f\|^2=\|\hat{f}\|^2$.
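Both the transform pair and Parseval’s identity follow from the orthogonality of $U$; a small NumPy check on a hypothetical 4-node cycle:

```python
import numpy as np

# 4-cycle graph and its Laplacian.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(L)    # columns of U are eigenvectors, L = U diag(lam) U^T

f = np.array([3.0, -1.0, 2.0, 0.0])
f_hat = U.T @ f               # forward graph Fourier transform
f_rec = U @ f_hat             # inverse transform
assert np.allclose(f_rec, f)
# Parseval: the transform is orthogonal, so energy is preserved.
assert np.isclose(f @ f, f_hat @ f_hat)
```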

Graph convolution
$f*g=U\big((U^Tg)\odot(U^Tf)\big)$, so $\widehat{f*g}=\hat{g}\odot\hat{f}$ ($\odot$ means elementwise product).
If $\hat{g}$ is a polynomial, $\hat{g}(\lambda)=\sum_{k=0}^K\theta_k\lambda^k$, then using $L=U\Lambda U^T$,
$f*g=U\hat{g}(\Lambda)U^Tf=\hat{g}(L)f=\sum_{k=0}^K\theta_k L^kf$
Let $d(i,j)$ be the shortest-path distance between nodes $i$ and $j$; then $(L^k)_{ij}=0$ whenever $d(i,j)>k$, so
$(f*g)_i=\sum_{j:\,d(i,j)\le K}\big(\hat{g}(L)\big)_{ij}f_j$
Therefore, when the filter is a polynomial in the spectral domain, the filtered signal $(f*g)$ at each node $i$ is a linear combination of the original signal in the $K$-hop neighborhood of $i$.
If $K=1$ and the filter is written in the shifted form $\hat{g}(\lambda)=\theta_0+\theta_1(\lambda-1)$, then $f*g=\theta_0f+\theta_1(L-I)f$. Taking the normalized Laplacian, $L-I=-D^{-1/2}AD^{-1/2}$, and setting $\theta=\theta_0=-\theta_1$, it’s equivalent to
$f*g=\theta\big(I+D^{-1/2}AD^{-1/2}\big)f$
This equation is very similar to Eq. (6) in GCN.
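A numerical check that the spectral form $U\hat{g}(\Lambda)U^T$ equals the spatial polynomial $\sum_k\theta_kL^k$, and that the filter is $K$-hop local (path graph and coefficients are hypothetical):

```python
import numpy as np

# Path graph 0-1-2-3-4 and its combinatorial Laplacian.
n = 5
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A
lam, U = np.linalg.eigh(L)

theta = [0.5, -1.0, 0.25]   # hypothetical polynomial coefficients, K = 2
# Spectral form: U g(Lambda) U^T.
g_spec = U @ np.diag(sum(t * lam**k for k, t in enumerate(theta))) @ U.T
# Spatial form: g(L) = sum_k theta_k L^k.
g_spat = sum(t * np.linalg.matrix_power(L, k) for k, t in enumerate(theta))
assert np.allclose(g_spec, g_spat)
# Locality: entries vanish beyond K hops, e.g. d(0,3) = 3 > 2 and d(0,4) = 4 > 2.
assert np.isclose(g_spat[0, 3], 0.0) and np.isclose(g_spat[0, 4], 0.0)
```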

Translation
Translation to a node $j$ can be seen as convolution with the delta function at $j$: $T_jf=f*\delta_j$, i.e. $(T_jf)(i)=\sum_l\hat{f}(\lambda_l)u_l(j)u_l(i)$.
The energy of the signal is not preserved: in general $\|T_jf\|\neq\|f\|$.

Modulation
$(M_lf)(i)=\sqrt{N}\,u_l(i)f(i)$; since $u_0=\frac{1}{\sqrt{N}}\mathbf{1}$ for a connected graph, $M_0$ is the identity operator.
Batch Norm
Graph Networks
Graph Convolutional Network
$Z=\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}XW$
where
 $\hat{A}=A+I$ and $A$ is the adjacency matrix
 $\hat{D}$ is a diagonal matrix with $\hat{D}_{ii}=\sum_j\hat{A}_{ij}$
 $X$ is the input features at the $k$th layer
 $Z$ is the output features
 $W$ is the parameter matrix to develop features
An activation function should be applied to $Z$ with a bias: $H^{(k+1)}=\sigma(Z+b)$.
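A minimal sketch of one such layer in NumPy, assuming a ReLU activation; the 3-node graph, shapes, and weights are hypothetical:

```python
import numpy as np

def gcn_layer(A, X, W, b):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 X W + b)."""
    A_hat = A + np.eye(A.shape[0])                            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(1))
    S = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]     # D^-1/2 A_hat D^-1/2
    return np.maximum(S @ X @ W + b, 0.0)                     # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 2))
H = gcn_layer(A, X, W, np.zeros(2))
assert H.shape == (3, 2) and np.all(H >= 0)
```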
Highway GCN
$H^{(k+1)}=Z\odot T\big(H^{(k)}\big)+H^{(k)}\odot\Big(1-T\big(H^{(k)}\big)\Big),\qquad T(h)=\sigma(hW_T+b_T)$
where $Z$ is the same as the $Z$ in GCN and $T$ is the gate controlling how much of the previous layer is carried through.
Simplifying Graph Convolutional Networks
$\hat{Y}=\operatorname{softmax}\big(S^KX\Theta\big),\qquad S=\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$
where the nonlinearities between layers are removed, collapsing $K$ layers into a single linear propagation followed by one classifier.
FastGCN
GCN can be treated as an integral transform
$H^{(k+1)}(v)=\sigma\Big(\int\hat{A}(v,u)H^{(k)}(u)W^{(k)}\,dP(u)\Big)$
where $H^{(0)}=X$ is the input features and its $i$th row is the features of the $i$th node.
If for each layer we sample $t$ nodes from the $N$ nodes, denote the indices as $u_1,\cdots,u_t$:
$H^{(k+1)}(v)=\sigma\Big(\frac{N}{t}\sum_{j=1}^{t}\hat{A}(v,u_j)H^{(k)}(u_j)W^{(k)}\Big)$
Further, we can use importance sampling (sample the nodes following a distribution $q$):
$H^{(k+1)}(v)=\sigma\Big(\frac{1}{t}\sum_{j=1}^{t}\frac{\hat{A}(v,u_j)H^{(k)}(u_j)}{q(u_j)}W^{(k)}\Big)$
with $q(u)\propto\big\|\hat{A}(:,u)\big\|^2$.
The sampling in each GCN layer is independent.
Relational Graph Convolutional Network

Message passing framework
$h_i^{(k+1)}=\sigma\Big(\sum_{m\in\mathcal{M}_i}g_m\big(h_i^{(k)},h_j^{(k)}\big)\Big)$
where $\mathcal{M}_i$ is the set of incoming messages for node $i$ and $g_m(h_i,h_j)$ is the message from node $j$ to node $i$.

RGCN
$h_i^{(k+1)}=\sigma\Big(\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_i^r}\frac{1}{c_{i,r}}W_r^{(k)}h_j^{(k)}+W_0^{(k)}h_i^{(k)}\Big)$
where
 $\mathcal{N}_i^r$ is the set of neighbors of node $i$ under relation $r$
 $c_{i,r}$ is a normalization constant, e.g. the number of elements in $\mathcal{N}_i^r$ or the sum of the numbers of elements over relations, but this might be problematic for nodes having high degree.
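A minimal NumPy sketch of the layer, with hypothetical relation matrices and the normalization choice $c_{i,r}=|\mathcal{N}_i^r|$:

```python
import numpy as np

def rgcn_layer(A_r, H, W_r, W0):
    """h_i' = ReLU( sum_r sum_{j in N_i^r} (1/c_{i,r}) W_r h_j + W0 h_i )."""
    out = H @ W0
    for A, W in zip(A_r, W_r):
        deg = A.sum(1, keepdims=True)
        c = np.where(deg > 0, deg, 1.0)   # c_{i,r} = |N_i^r|, guard empty neighborhoods
        out = out + (A @ H) / c @ W
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
# Two hypothetical relation types on 4 nodes (0/1 adjacency per relation).
A_r = [rng.integers(0, 2, (4, 4)).astype(float) for _ in range(2)]
H = rng.standard_normal((4, 3))
W_r = [rng.standard_normal((3, 5)) for _ in range(2)]
H_out = rgcn_layer(A_r, H, W_r, rng.standard_normal((3, 5)))
assert H_out.shape == (4, 5)
```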

Regularization

Basis decomposition
$W_r^{(k)}=\sum_{b=1}^{B}a_{rb}^{(k)}V_b^{(k)}$
$\{V_b\}$ is a set of basis transformations, and the transformation of each relation is a linear combination of the basis.

Block decomposition
$W_r^{(k)}=\bigoplus_{b=1}^{B}Q_{rb}^{(k)}$
$W_r$ is a direct sum of low-dimensional matrices, therefore it is sparse (block diagonal). Direct sum: $Q_1\oplus Q_2=\begin{pmatrix}Q_1&0\\0&Q_2\end{pmatrix}$.
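A sketch of both regularizations on hypothetical shapes (number of relations, bases, and block sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Basis decomposition: each relation weight is a mixture of B shared bases.
B, d_in, d_out, R = 3, 4, 4, 5
V = rng.standard_normal((B, d_in, d_out))    # shared basis matrices V_b
a = rng.standard_normal((R, B))              # per-relation coefficients a_rb
W_basis = np.einsum('rb,bio->rio', a, V)     # W_r = sum_b a_rb V_b
assert W_basis.shape == (R, d_in, d_out)

# Block decomposition: W_r is a direct sum (block diagonal) of small matrices.
blocks = [rng.standard_normal((2, 2)) for _ in range(2)]
W_block = np.zeros((4, 4))
W_block[:2, :2] = blocks[0]
W_block[2:, 2:] = blocks[1]
# Off-diagonal blocks are zero, so the matrix is sparse by construction.
assert np.all(W_block[:2, 2:] == 0) and np.all(W_block[2:, :2] == 0)
```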

Graph Attention Networks
$h_i'=\sigma\Big(\sum_{j\in\mathcal{N}_i}\alpha_{ij}Wh_j\Big)$
where $\alpha_{ij}=\operatorname{softmax}_j\big(\operatorname{LeakyReLU}(a^T[Wh_i\Vert Wh_j])\big)$ (normalized over $j$) with a weight vector $a$.
The attention can be extended to multi-head like the transformer, where the $h_i'$ from multiple heads are concatenated. If using multi-head attention, the results should be averaged before activation at the final layer:
$h_i'=\sigma\Big(\frac{1}{M}\sum_{m=1}^{M}\sum_{j\in\mathcal{N}_i}\alpha_{ij}^mW^mh_j\Big)$
Skip connections could also be combined with the attention mechanism.
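A minimal single-head sketch in NumPy; the LeakyReLU slope of 0.2, attention over neighbors plus self, and all shapes are assumptions of this sketch:

```python
import numpy as np

def gat_layer(A, H, W, a):
    """Single-head graph attention: alpha_ij = softmax_j(LeakyReLU(a^T [Wh_i || Wh_j]))."""
    Wh = H @ W                                      # transformed features, (N, d')
    N = A.shape[0]
    e = np.full((N, N), -np.inf)                    # -inf masks non-neighbors in softmax
    for i in range(N):
        for j in range(N):
            if A[i, j] > 0 or i == j:               # attend over neighbors incl. self
                s = a @ np.concatenate([Wh[i], Wh[j]])
                e[i, j] = s if s > 0 else 0.2 * s   # LeakyReLU
    alpha = np.exp(e - e.max(1, keepdims=True))
    alpha /= alpha.sum(1, keepdims=True)            # normalize over j
    return alpha @ Wh                               # activation omitted in this sketch

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.standard_normal((3, 4))
out = gat_layer(A, H, rng.standard_normal((4, 2)), rng.standard_normal(4))
assert out.shape == (3, 2)
```

For multi-head attention one would run several `(W, a)` pairs and concatenate (or average, at the final layer) the resulting `out` matrices.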
Line Graph Neural Network

GNN
$x^{(k+1)}=\rho\Big(\sum_{O\in\mathcal{F}}Ox^{(k)}\theta_O^{(k)}\Big),\qquad \mathcal{F}=\{I,D,A,A^2,A^4,\cdots\}$
where $D$ is the degree operator, $A$ is the adjacency matrix, and the $\theta_O^{(k)}$ are learnable parameters.

Line graph
Given $G=(V,E)$, the line graph is $L(G)=(V_L,E_L)$, where the nodes $V_L$ are the ordered edges $(i\to j)$ in $E$, and two nodes $(i\to j)$ and $(i'\to j')$ in $L(G)$ are connected if $j=i'$ and $j'\neq i$ (non-backtracking).
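A small builder for this non-backtracking line graph, as a sketch; on a triangle every directed edge has exactly one non-backtracking continuation:

```python
import numpy as np

def non_backtracking_line_graph(edges):
    """Nodes of L(G) are ordered edges (i -> j); (i -> j) connects to (j -> k) with k != i."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    n = len(directed)
    B = np.zeros((n, n))
    for p, (i, j) in enumerate(directed):
        for q, (jp, k) in enumerate(directed):
            if j == jp and k != i:      # continue through j without backtracking to i
                B[p, q] = 1.0
    return directed, B

# Triangle 0-1-2: 6 directed edges, each with a unique non-backtracking successor.
nodes, B = non_backtracking_line_graph([(0, 1), (1, 2), (2, 0)])
assert B.shape == (6, 6) and np.all(B.sum(1) == 1)
```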

LGNN
LGNN runs the GNN above on $G$ and on the line graph $L(G)$ simultaneously, exchanging features between the two, where
 $x^{(k)}$ are the node features on $G$ and $y^{(k)}$ are the edge features on $L(G)$,
 $B$ is the adjacency matrix of the (non-backtracking) line graph,
 $P_m,P_d$ map edge signals on $L(G)$ to node signals on $G$:
 $P_m$ calculates the sum of connected edges for the node (in edges plus out edges),
 $P_d$ calculates the sum of out edges for the node minus the sum of in edges for the node.
Large-Scale Learnable Graph Convolutional Networks
LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking, in order to transform graph data into grid-like structures in 1-D format, thereby enabling the use of regular convolutional operations on generic graphs.

k-largest node selection in one LGCN layer
Consider node $i$ and assume it has $n>k$ neighbors, each with $F$ features. For each feature $f$, take the $k$ largest values among the neighbors; together with node $i$'s own features, the result is a matrix of fixed size $(k+1)\times F$. This matrix is later reduced to a vector (the node embedding) via 1-D convolution.
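A sketch of the selection step on hypothetical features (node count, neighbor indices, and $k$ are arbitrary):

```python
import numpy as np

def k_largest_selection(X, neighbor_idx, i, k):
    """Build the (k+1) x F grid for node i: own features on top, then the
    per-feature k largest neighbor values (each feature column is ranked independently)."""
    neigh = X[neighbor_idx]                   # (n, F) neighbor features, n >= k
    topk = -np.sort(-neigh, axis=0)[:k]       # k largest values per feature column
    return np.vstack([X[i:i + 1], topk])      # fixed-size (k+1, F) grid

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))               # 6 nodes, F = 3 features
grid = k_largest_selection(X, [1, 2, 4, 5], i=0, k=2)
assert grid.shape == (3, 3)                   # (k+1) x F
assert np.all(grid[1] >= grid[2])             # neighbor rows sorted descending per feature
```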

Subgraph selection algorithm
To deal with large graphs, for each training iteration a subgraph can be extracted and trained on instead of the full graph.
GraphSAGE (sample and aggregate)

Algorithm
Given graph $G=(V,E)$, each node $v$ has feature $x_v$.
 Init embedding by feature: $h_v^0=x_v$
 For $k=1,\cdots,K$:
  Sample a fixed-size neighborhood $\mathcal{N}(v)$ for every node
  For each node $v$:
   Aggregate the embeddings of the neighbors into an embedding $h_{\mathcal{N}(v)}^k=\operatorname{AGG}_k\big(\{h_u^{k-1}:u\in\mathcal{N}(v)\}\big)$
   Concatenate $h_v^{k-1}$ and $h_{\mathcal{N}(v)}^k$, then pass to a fully connected layer: $h_v^k=\sigma\big(W^k[h_v^{k-1}\Vert h_{\mathcal{N}(v)}^k]\big)$
   Normalize to make sure the norm is one: $h_v^k\leftarrow h_v^k/\|h_v^k\|_2$
 $z_v=h_v^K$ is the final representation of node $v$

Aggregator
 LSTM/GRU/Attention
 Pooling (max/mean)

Mean
$h_v^k=\sigma\big(W\cdot\operatorname{mean}(\{h_v^{k-1}\}\cup\{h_u^{k-1}:u\in\mathcal{N}(v)\})\big)$
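The loop body with a mean aggregator can be sketched as follows (the nonlinearity is omitted in this sketch, and the graph, sample size, and shapes are hypothetical):

```python
import numpy as np

def graphsage_mean_step(A, H, W, sample_size, rng):
    """One GraphSAGE step with a mean aggregator (nonlinearity omitted in this sketch)."""
    N = A.shape[0]
    H_new = np.zeros((N, W.shape[1]))
    for v in range(N):
        neigh = np.flatnonzero(A[v])
        sampled = rng.choice(neigh, size=sample_size, replace=True)  # fixed-size sample
        agg = H[sampled].mean(axis=0)             # aggregate neighbor embeddings
        h = np.concatenate([H[v], agg]) @ W       # concatenate, then fully connected layer
        H_new[v] = h / np.linalg.norm(h)          # normalize to unit norm
    return H_new

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H0 = rng.standard_normal((3, 4))
W = rng.standard_normal((8, 5))
H1 = graphsage_mean_step(A, H0, W, sample_size=2, rng=rng)
assert H1.shape == (3, 5)
assert np.allclose(np.linalg.norm(H1, axis=1), 1.0)
```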
Diffusion-Convolutional Neural Networks

Node classification
$Z=f\big(W^c\odot P^*X\big)$
where
 $Z\in\mathbb{R}^{N\times H\times F}$, $P^*\in\mathbb{R}^{N\times H\times N}$, $X\in\mathbb{R}^{N\times F}$, $W^c\in\mathbb{R}^{H\times F}$,
 $N$ is the number of nodes, $F$ is the number of features, $H$ is the number of hops,
 $P^*$ contains the power series $\{P,P^2,\cdots,P^H\}$ of the transition matrix $P=D^{-1}A$, where $P_{ij}$ is the probability of jumping from node $i$ to node $j$ with the adjacency matrix $A$; $Z$ is later flattened into $\mathbb{R}^{N\times HF}$ and followed by an MLP to produce label probabilities.
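A sketch of building $P^*$ and the diffusion features on a hypothetical 3-node path graph (the activation $f$ and the elementwise weights $W^c$ are omitted here):

```python
import numpy as np

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
P = A / A.sum(1, keepdims=True)           # transition matrix D^-1 A
H = 3                                     # number of hops
# P* stacks the power series {P^1, ..., P^H}: shape (N, H, N).
P_star = np.stack([np.linalg.matrix_power(P, h) for h in range(1, H + 1)], axis=1)
assert P_star.shape == (3, 3, 3)
# Each hop slice is row-stochastic: rows are hop-h jump probabilities.
assert np.allclose(P_star.sum(axis=2), 1.0)

# Diffusion features P* X (activation f and weights W^c omitted in this sketch).
X = np.eye(3)                             # trivial one-hot features, F = N = 3
Z = np.einsum('nhm,mf->nhf', P_star, X)
assert Z.shape == (3, 3, 3)               # (N, H, F), flattened per node before the MLP
```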

Graph classification
$Z=f\big(W^c\odot\tfrac{1}{N}\mathbf{1}_N^TP^*X\big)$
where $Z\in\mathbb{R}^{H\times F}$, i.e. the diffusion features are averaged over the nodes; $Z$ is later flattened into a single vector to produce label probabilities.

Edge classification with edge features
An edge $(i,j)$ can be converted into a new node connected to nodes $i$ and $j$, so the edge itself becomes a node.
Gated Graph Sequence Neural Networks
Jumping Knowledge Networks
The idea is that, in the same graph, the same number of propagation steps can have very different effects for different nodes, because the range of neighboring nodes that a node’s representation draws from depends strongly on the graph structure. For each node $v$, denote its representations at different layers as $h_v^{(1)},\cdots,h_v^{(K)}$.
The final representation is then an aggregation of these:
 Concatenation + projection
 LSTM+Attention
 Max pooling