What is a Graph Neural Network?
Machine learning on graphs is a difficult task due to the highly complex, but also informative, graph structure. This post is the first in a series on how to do deep learning on graphs with Graph Convolutional Networks (GCNs), a powerful type of neural network designed to work directly on graphs and leverage their structural information. In this post, I will give an introduction to GCNs and illustrate how information is propagated through the hidden layers of a GCN using coding examples.
We’ll see how the GCN aggregates information from the previous layers and how this mechanism produces useful feature representations of nodes in graphs. GCNs are a very powerful neural network architecture for machine learning on graphs. In fact, they are so powerful that even a randomly initialized 2-layer GCN can produce useful feature representations of nodes in networks. The figure below illustrates a 2-dimensional representation of each node in a network produced by such a GCN. Notice that the relative nearness of nodes in the network is preserved in the 2-dimensional representation even without any training.

More formally, a graph convolutional network (GCN) is a neural network that operates on graphs. Given a graph G = (V, E), a GCN takes as input:

- an N × F⁰ feature matrix X, where N is the number of nodes and F⁰ is the number of input features per node, and
- an N × N matrix representation of the graph structure, such as the adjacency matrix A of G.

A hidden layer in the GCN can thus be written as Hⁱ = f(Hⁱ⁻¹, A), where H⁰ = X and f is a propagation rule [1]. Each layer Hⁱ corresponds to an N × Fⁱ feature matrix where each row is a feature representation of a node. At each layer, these features are aggregated to form the next layer’s features using the propagation rule f. In this way, features become increasingly abstract at each consecutive layer.
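To make the shapes concrete before picking a specific rule, here is a minimal numpy sketch of this recurrence. The graph, the features, and the stand-in rule f are all illustrative assumptions, not part of any library:

```python
import numpy as np

N, F0 = 4, 2                        # 4 nodes, 2 input features per node
X = np.random.randn(N, F0)          # the N x F0 feature matrix
A = np.random.randint(0, 2, (N, N)).astype(float)  # an N x N adjacency matrix

def f(H, A):
    # Stand-in propagation rule: any function mapping the current
    # N x F feature matrix (plus the graph structure) to the next layer's.
    return A @ H

H = X                               # H0 = X
for i in range(2):                  # a 2-layer GCN
    H = f(H, A)                     # Hi = f(Hi-1, A)
print(H.shape)                      # still one row of features per node
```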
In this framework, variants of GCN differ only in the choice of propagation rule f [1]. One of the simplest possible propagation rules is [1]:

f(Hⁱ, A) = σ(AHⁱWⁱ)

where Wⁱ is the weight matrix for layer i and σ is a non-linear activation function such as the ReLU function. The weight matrix has dimensions Fⁱ × Fⁱ⁺¹; in other words, the size of the second dimension of the weight matrix determines the number of features at the next layer. If you are familiar with convolutional neural networks, this operation is similar to a filtering operation, since these weights are shared across nodes in the graph.
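Written out in numpy, this rule is a one-liner wrapped in an activation. The sizes below and the choice of ReLU for σ are illustrative, and the function name is hypothetical:

```python
import numpy as np

def propagate(A, H, W):
    # sigma(A @ H @ W): aggregate each node's neighbor features (A @ H),
    # transform them with the shared weight matrix W, then apply ReLU.
    return np.maximum(A @ H @ W, 0)

N, F_i, F_next = 4, 2, 3                 # illustrative layer sizes
A = np.random.randint(0, 2, (N, N)).astype(float)
H_i = np.random.randn(N, F_i)            # N x F_i features at layer i
W_i = np.random.randn(F_i, F_next)       # F_i x F_(i+1) weight matrix

H_next = propagate(A, H_i, W_i)          # N x F_(i+1) features at layer i+1
```

Note that the same W_i multiplies every node's aggregated features; this weight sharing is what makes the operation filter-like.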
Let’s examine the propagation rule at its simplest level. Let i = 1, so that f is a function of the input feature matrix, let σ be the identity function, and choose the weights such that AH⁰W⁰ = AXW⁰ = AX. In other words, f(X, A) = AX. This propagation rule is perhaps a bit too simple, but we will add in the missing parts later. As a side note, AX is now equivalent to the input layer of a multi-layer perceptron.

As a simple example, we’ll use the following small directed graph, given below as its adjacency matrix (the graph itself was shown as a figure; the concrete edge values here are illustrative):

```python
import numpy as np

A = np.matrix([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]],
    dtype=float
)
```

Next, we need features! We generate 2 integer features for every node based on its index. This makes it easy to confirm the matrix calculations manually later.

```python
# Node i gets the features [i, -i] (an illustrative choice)
X = np.matrix([
    [i, -i]
    for i in range(A.shape[0])
], dtype=float)
```

Applying the Propagation Rule

Alright! We now have a graph, its adjacency matrix A, and a set of input features X. Let’s see what happens when we apply the propagation rule:

```python
A * X
# matrix([[ 1., -1.],
#         [ 5., -5.],
#         [ 1., -1.],
#         [ 2., -2.]])
```

What happened? The representation of each node (each row) is now a sum of its neighbors’ features! In other words, the graph convolutional layer represents each node as an aggregate of its neighborhood. I encourage you to check the calculation for yourself; a small verification sketch follows the list below. Note that in this case a node n is a neighbor of node v if there exists an edge from v to n.

Uh oh! Problems on the Horizon!

You may have already spotted the problems:

- The aggregated representation of a node does not include its own features! The representation is an aggregate of the features of its neighbors, so only nodes that have a self-loop will include their own features in the aggregate.
- Nodes with large degrees will have large values in their feature representation while nodes with small degrees will have small values. This can cause vanishing or exploding gradients and is problematic for optimizers that are sensitive to the scale of the input features.
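As encouraged above, here is a quick sketch that verifies the neighbor-sum claim by comparing A * X against an explicit loop over the edges. It assumes the A and X defined in the example above:

```python
import numpy as np

# Recompute the aggregate for each node v by explicitly summing the
# features of its neighbors n (an edge v -> n makes n a neighbor of v).
agg = np.zeros(X.shape)
for v in range(A.shape[0]):
    for n in range(A.shape[1]):
        if A[v, n] != 0:
            agg[v] += np.array(X)[n]

print(np.allclose(agg, A * X))  # True: each row of A*X sums neighbor features
```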
In the following, I discuss each of these problems separately.

Adding Self-Loops

To address the first problem, one can simply add a self-loop to each node [1, 2]. In practice this is done by adding the identity matrix I to the adjacency matrix A before applying the propagation rule:

```python
I = np.matrix(np.eye(A.shape[0]))
A_hat = A + I
```

Since the node is now a neighbor of itself, the node’s own features are included when summing up the features of its neighbors!

Normalizing the Feature Representations

The feature representations can be normalized by node degree by transforming the adjacency matrix A with the inverse degree matrix D:

f(X, A) = D⁻¹AX

Let’s see what happens. We first compute the degree matrix:

```python
# Degree of each node: its number of neighbors, i.e. the row sums of A
D = np.array(np.sum(A, axis=1)).flatten()
D = np.matrix(np.diag(D))
```

Before applying the rule, let’s see what happens to the adjacency matrix after we transform it:

```python
D**-1 * A
# matrix([[0. , 1. , 0. , 0. ],
#         [0. , 0. , 0.5, 0.5],
#         [0. , 1. , 0. , 0. ],
#         [0.5, 0. , 0.5, 0. ]])
```

Observe that the weights (the values) in each row of the adjacency matrix have been divided by the degree of the node corresponding to the row. We apply the propagation rule with the transformed adjacency matrix:

```python
D**-1 * A * X
# matrix([[ 1. , -1. ],
#         [ 2.5, -2.5],
#         [ 1. , -1. ],
#         [ 1. , -1. ]])
```

and get node representations corresponding to the mean of the features of neighboring nodes. This is because the weights in the transformed adjacency matrix correspond to the weights in a weighted sum of the neighboring nodes’ features. Again, I encourage you to verify this observation for yourself.

Putting it All Together

We now combine the self-loop and normalization tricks. In addition, we’ll reintroduce the weights and activation function that we previously discarded to simplify the discussion.

Adding Back the Weights

First order of business is applying the weights. Note that here D_hat is the degree matrix of A_hat = A + I, i.e., the degree matrix of A with forced self-loops. The weight values below are illustrative:

```python
D_hat = np.array(np.sum(A_hat, axis=1)).flatten()
D_hat = np.matrix(np.diag(D_hat))

W = np.matrix([
    [1, -1],
    [-1, 1]
])
D_hat**-1 * A_hat * X * W
```

And if we want to reduce the dimensionality of the output feature representations, we can reduce the size of the weight matrix W:

```python
W = np.matrix([
    [1],
    [-1]
])
D_hat**-1 * A_hat * X * W
```

Adding an Activation Function

We choose to preserve the dimensionality of the feature representations and apply the ReLU activation function:

```python
def relu(x):
    # Elementwise ReLU; np.maximum works on matrices as well as arrays
    return np.maximum(x, 0)

W = np.matrix([
    [1, -1],
    [-1, 1]
])
relu(D_hat**-1 * A_hat * X * W)
```

Voilà! A complete hidden layer with adjacency matrix, input features, weights, and activation function!
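For reference, here is the whole toy example collected into one self-contained, runnable script. It simply replays the steps above; the matrix values remain the illustrative ones used throughout:

```python
import numpy as np

def relu(x):
    # Elementwise ReLU
    return np.maximum(x, 0)

# Adjacency matrix of the small directed example graph (illustrative values)
A = np.matrix([
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0]],
    dtype=float
)

# Two integer features per node based on its index
X = np.matrix([[i, -i] for i in range(A.shape[0])], dtype=float)

# Self-loops: every node becomes a neighbor of itself
I = np.matrix(np.eye(A.shape[0]))
A_hat = A + I

# Degree matrix of A_hat (row sums = number of neighbors incl. self)
D_hat = np.matrix(np.diag(np.array(np.sum(A_hat, axis=1)).flatten()))

# Weights (illustrative values) and the complete hidden layer
W = np.matrix([[1, -1],
               [-1, 1]])
H = relu(D_hat**-1 * A_hat * X * W)
print(H)
```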
Back to Reality

Now, finally, we can apply a graph convolutional network on a real graph. I will show you how to produce the feature representations we saw earlier in the post.

Zachary’s Karate Club

Zachary’s karate club is a commonly used social network where nodes represent members of a karate club and the edges their mutual relations. While Zachary was studying the karate club, a conflict arose between the administrator and the instructor, which resulted in the club splitting in two. The figure below shows the graph representation of the network, with nodes labeled according to which part of the club they joined after the split. The administrator and instructor are marked with ‘A’ and ‘I’, respectively.

Figure: Zachary’s karate club.

Building the GCN

Now let us build the graph convolutional network. We won’t actually train the network, but simply initialize it at random to produce the feature representations we saw at the start of this post. We will use networkx to load the graph and build the matrices A_hat and D_hat:

```python
from networkx import karate_club_graph, to_numpy_matrix

zkc = karate_club_graph()
order = sorted(list(zkc.nodes()))  # fix a node ordering for the matrices
A = to_numpy_matrix(zkc, nodelist=order)
I = np.eye(zkc.number_of_nodes())

A_hat = A + I
D_hat = np.array(np.sum(A_hat, axis=0))[0]  # A_hat is symmetric here
D_hat = np.matrix(np.diag(D_hat))
```

Next, we’ll initialize the weights randomly (the hidden layer size of 4 is an arbitrary small choice):

```python
W_1 = np.random.normal(
    loc=0, scale=1, size=(zkc.number_of_nodes(), 4))
W_2 = np.random.normal(
    loc=0, size=(W_1.shape[1], 2))
```

Then we stack the GCN layers. We here use just the identity matrix as the feature representation, that is, each node is represented as a one-hot encoded categorical variable.

```python
def gcn_layer(A_hat, D_hat, X, W):
    return relu(D_hat**-1 * A_hat * X * W)

H_1 = gcn_layer(A_hat, D_hat, I, W_1)
H_2 = gcn_layer(A_hat, D_hat, H_1, W_2)

output = H_2
```

We extract the feature representations:

```python
feature_representations = {
    node: np.array(output)[node]
    for node in zkc.nodes()
}
```

And voilà! Feature representations that separate the communities in Zachary’s karate club quite well. And we haven’t even begun training yet!

Figure: Feature representations of the nodes in Zachary’s karate club.

I should note that for this example the randomly initialized weights were very likely to give 0 values on either the x- or the y-axis as a result of the ReLU function, so it took a few random initializations to produce the figure above.

Conclusion

In this post I have given a high-level introduction to graph convolutional networks and illustrated how the feature representation of a node at each layer in the GCN is based on an aggregate of its neighborhood. We saw how we can build these networks using numpy and how powerful they are: even randomly initialized GCNs can separate communities in Zachary’s karate club. In the next post, I will go a bit more into technical detail and show how to implement and train a recently published GCN using semi-supervised learning. You can find the next post in the series here.

Liked what you read? Consider following me on Twitter, where I share papers, videos, and articles related to the practice, theory, and ethics of data science and machine learning that I find interesting, in addition to my own posts. For professional inquiries, please contact me on LinkedIn or by direct message on Twitter.

References

[1] Blog post on graph convolutional networks by Thomas Kipf.

[2] Semi-Supervised Classification with Graph Convolutional Networks, paper by Thomas Kipf and Max Welling.