Tutorial

BeGin is a framework containing the following core components:

ScenarioLoader: This module provides built-in continual learning scenarios to evaluate the performances of graph continual learning methods.
Evaluator: This module provides the evaluator, which computes basic metrics based on the ground-truth and predicted answers.
Trainer: This module manages the overall training procedure of user-defined continual learning algorithms, including preparing the dataloader, training, and validation, so that users only have to implement novel parts of their methods.

In this material, we briefly describe how to perform graph continual learning with those components using some examples.

ScenarioLoader and Evaluation Metric

To evaluate graph continual learning (CL) methods, we need to prepare (1) graph datasets annotated with classes, domains, or timestamps, (2) incremental settings, and (3) evaluation metrics appropriate for those settings. To reduce this effort, BeGin provides built-in evaluation metrics and various benchmark scenarios that combine graph-related problems with incremental settings for continual learning. For example, using BeGin, users can load the class-incremental node classification scenario on the ogbn-arxiv dataset in just one line of code.

>>> from begin.scenarios.nodes import NCScenarioLoader
>>> NCScenarioLoader(dataset_name='ogbn-arxiv', num_tasks=8, metric='accuracy', save_path='/data', incr_type='class')

Currently, BeGin supports 19 Node Classification (NC), Link Classification (LC), Link Prediction (LP), and Graph Classification (GC) scenarios with the following incremental settings for continual learning with graph data.

Task-incremental (Task-IL): In this incremental setting, the set of classes constituting each task varies across tasks, and these sets are often disjoint. In addition, for each query at evaluation, the corresponding task information is provided, and thus its answer is predicted among the classes considered in the task. This setting is applied to NC and GC tasks, where the sets of classes can vary with tasks, and for NC and LC tasks, the input graph is fixed.
Class-incremental (Class-IL): In this incremental setting, the set of classes grows over tasks. In addition, for each query at evaluation, the corresponding task is NOT provided, and thus its answer is predicted among all classes seen so far. This setting is applied to NC and GC tasks, where the sets of classes can vary with tasks, and for NC and LC tasks, the input graph is fixed.
Domain-incremental (Domain-IL): In this incremental setting, we divide entities (i.e., nodes, edges, and graphs) over tasks according to their domains, which are additionally given. For NC, the nodes of the input graph are divided into NC tasks according to their domains. For LC and LP, the links of the input graph and the input queries, respectively, are divided according to their domains. For GC, the graphs are divided into GC tasks according to their domains. For NC, LC, and LP tasks, the input graph is fixed.
Time-incremental (Time-IL): In this incremental setting, except for GC, we consider a dynamic graph evolving over time, and the set of classes may or may not vary across tasks. For NC, LC, and LP, the input graph of the i-th task is the i-th snapshot of the dynamic graph. For GC, the snapshots of the dynamic graph are grouped and assigned to tasks in chronological order.
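
The difference between Task-IL and Class-IL at evaluation time can be illustrated with a small, library-independent sketch. The helper names split_classes and candidate_classes are hypothetical, and the even, contiguous split of the 40 ogbn-arxiv classes into 8 tasks is an assumption made for illustration; BeGin's actual class-to-task assignment may differ.

```python
# Illustration only: hypothetical helpers, not part of BeGin.

def split_classes(num_classes, num_tasks):
    """Partition class ids into num_tasks disjoint, contiguous groups."""
    per_task = num_classes // num_tasks
    return [list(range(t * per_task, (t + 1) * per_task)) for t in range(num_tasks)]

def candidate_classes(task_classes, query_task, seen_until, incr_type):
    """Return the classes a prediction may be chosen from at evaluation time."""
    if incr_type == 'task':
        # Task-IL: the task of the query is given, so only its classes compete.
        return task_classes[query_task]
    if incr_type == 'class':
        # Class-IL: no task information, so all classes seen so far compete.
        return [c for t in range(seen_until + 1) for c in task_classes[t]]
    raise ValueError(f'unsupported incr_type: {incr_type}')

task_classes = split_classes(num_classes=40, num_tasks=8)
task_il = candidate_classes(task_classes, query_task=1, seen_until=2, incr_type='task')
class_il = candidate_classes(task_classes, query_task=1, seen_until=2, incr_type='class')
print(task_il)        # [5, 6, 7, 8, 9]
print(len(class_il))  # 15
```

After three tasks, a Task-IL query from the second task is answered among 5 classes, while the same query under Class-IL is answered among all 15 classes seen so far, which is why Class-IL is typically the harder setting.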

Trainer

For usability, BeGin provides the trainer, which users can extend when implementing and benchmarking new methods. It manages the overall training procedure, including preparing the dataloader, training, and validation, so that users only have to implement the novel parts of their methods. As in Avalanche, the trainer divides the training procedure of continual learning into a series of events. For example, the subprocesses in which the trainer (a) receives the input for the current task, (b) trains the model for one iteration on the current task, and (c) handles the necessary work before and after training are each treated as events. Each event is modularized as a function that users can fill in, and the trainer carries out the training procedure by calling these event functions.

Implementing each event function is optional. If a user-defined event function is not provided, the trainer performs training and evaluation with the corresponding pre-implemented default operation. Thus, users do not need to implement the whole training and evaluation procedure, but only the necessary parts. See here for the detailed arguments and roles of the event functions. Currently, users can override the following built-in event functions:

initTrainingStates(): This function is called only once, when the training procedure begins.
prepareLoader(): This function is called once for each task when generating dataloaders for training, validation, and test. Given the dataset for each task, it should return dataloaders for training, validation, and test.
processBeforeTraining(): This function is called once for each task, right after the prepareLoader() event function terminates.
processTrainIteration(): This function is called for every training iteration. When the current batched inputs, model, and optimizer are given, it should perform a single training iteration and return the information or outcome during the iteration.
processEvalIteration(): This function is called for every evaluation iteration. When the current batched inputs and trained model are given, it should perform a single evaluation iteration and return the information or outcome during the iteration.
inference(): This function is called for every inference step in the training procedure.
beforeInference(): This function is called right before the inference() begins.
afterInference(): This function is called right after the inference() terminates.
_reduceTrainingStats(): This function is called at the end of every training step. Given the returned values of the processTrainIteration() event function, it should return overall and reduced statistics of the current training step.
_reduceEvalStats(): This function is called at the end of every evaluation step. Given the returned values of the processEvalIteration() event function, it should return overall and reduced statistics of the current evaluation step.
processTrainingLogs(): This function is called right after the _reduceTrainingStats() event function terminates. It should generate training logs for the current training iteration.
processAfterEachIteration(): This function is called at the end of each training iteration. Given the outcomes of _reduceTrainingStats() and _reduceEvalStats(), it should determine whether the trainer should stop training for the current task.
processAfterTraining(): This function is called once for each task when the trainer completes training for the current task.
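
The event-based design can be sketched with a toy trainer that records the order in which a subset of the events fires. This is not BeGin's implementation: ToyTrainer, CountingTrainer, the simplified signatures, the fake batches, and the convention that processAfterEachIteration() returns True to continue are all assumptions made for illustration.

```python
class ToyTrainer:
    """Toy event-driven trainer (illustration only, not BeGin's code)."""

    def __init__(self, num_tasks, max_epochs):
        self.num_tasks = num_tasks
        self.max_epochs = max_epochs
        self.calls = []  # records the order in which events fire

    # Event functions with trivial default behavior.
    def initTrainingStates(self):
        self.calls.append('initTrainingStates')
        return {}

    def prepareLoader(self, task_id, states):
        self.calls.append('prepareLoader')
        batches = [[1.0, 2.0], [3.0]]  # fake batches of per-item losses
        return batches, batches, batches  # train, validation, test

    def processBeforeTraining(self, task_id, states):
        self.calls.append('processBeforeTraining')

    def processTrainIteration(self, batch, states):
        self.calls.append('processTrainIteration')
        return {'loss': sum(batch), 'n': len(batch)}

    def _reduceTrainingStats(self, stats):
        self.calls.append('_reduceTrainingStats')
        total = sum(s['n'] for s in stats)
        return {'loss': sum(s['loss'] for s in stats) / total}

    def processAfterEachIteration(self, train_stats, states):
        self.calls.append('processAfterEachIteration')
        return True  # in this toy, True means "keep training"

    def processAfterTraining(self, task_id, states):
        self.calls.append('processAfterTraining')

    # The fixed procedure that calls the events in order.
    def run(self):
        self.states = self.initTrainingStates()
        for task_id in range(self.num_tasks):
            train_loader, _, _ = self.prepareLoader(task_id, self.states)
            self.processBeforeTraining(task_id, self.states)
            for _ in range(self.max_epochs):
                stats = [self.processTrainIteration(b, self.states) for b in train_loader]
                reduced = self._reduceTrainingStats(stats)
                if not self.processAfterEachIteration(reduced, self.states):
                    break
            self.processAfterTraining(task_id, self.states)
        return self.calls


class CountingTrainer(ToyTrainer):
    """Overrides two events and inherits the rest of the procedure."""

    def initTrainingStates(self):
        super().initTrainingStates()
        return {'finished_tasks': 0}

    def processAfterTraining(self, task_id, states):
        super().processAfterTraining(task_id, states)
        states['finished_tasks'] += 1


trainer = CountingTrainer(num_tasks=2, max_epochs=1)
calls = trainer.run()
print(calls)
print(trainer.states)
```

The subclass never touches the loop in run(); it only replaces the events it cares about, which mirrors how a CL algorithm is meant to be added on top of the base trainer.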

Suppose we implement the Elastic Weight Consolidation (EWC) algorithm for Class-IL node classification using BeGin. EWC is a regularization-based CL algorithm for generic data. Specifically, it uses a weighted L2 penalty term, determined by the weights learned on the previous tasks, as in the following equation:

\[\mathcal{L}(\theta) = \mathcal{L}_i(\theta) + \sum_{j=1}^{i-1} \frac{\lambda}{2} F_j (\theta - \theta^*_j)^2,\]

where \(\theta\) denotes the current weights of the model, \(\theta^*_j\) denotes the weights learned up to the \(j\)-th task, \(\lambda > 0\) is a hyperparameter, and \(F_j\) is the diagonal of the Fisher information matrix up to the \(j\)-th task, computed as the square of the first derivatives.
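
As a quick numerical check of the penalty term, the following minimal sketch evaluates the sum over previous tasks with plain Python lists. The function name ewc_penalty and the toy numbers are made up for illustration; parameters are flattened into one list per task.

```python
def ewc_penalty(theta, prev_thetas, prev_fishers, lam):
    """Compute sum_j (lam / 2) * sum_k F_j[k] * (theta[k] - theta_j*[k])^2."""
    penalty = 0.0
    for theta_star, fisher in zip(prev_thetas, prev_fishers):
        for w, w_star, f in zip(theta, theta_star, fisher):
            penalty += 0.5 * lam * f * (w - w_star) ** 2
    return penalty

theta = [1.0, 2.0]                      # current weights
prev_thetas = [[0.0, 2.0], [1.0, 0.0]]  # weights stored after tasks 1 and 2
prev_fishers = [[2.0, 5.0], [3.0, 0.5]] # diagonal Fisher values for tasks 1 and 2
penalty = ewc_penalty(theta, prev_thetas, prev_fishers, lam=4.0)
print(penalty)  # 8.0
```

Only the weights that moved away from a stored value contribute, and each contribution is scaled by the corresponding Fisher value, so weights that were important for earlier tasks are penalized more strongly.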

Step 1. Extending the base

BeGin provides a base trainer class whose training behavior is exactly the same as that of the Bare model. Users can implement their CL algorithm by extending this class and replacing some of the default event functions with user-defined ones.

from begin.trainers.nodes import NCTrainer
class NCClassILEWCTrainer(NCTrainer):
    pass

Step 2. Setting initial states for the algorithm (`initTrainingStates()`)

As stated in the aforementioned equation, EWC requires the learned weights and Fisher information matrices from the previous tasks to compute the regularization term. However, they cannot be obtained from the current task alone. To resolve this issue, the trainer provides a dictionary called training_states. The dictionary can store intermediate results and is shared across events as an argument (i.e., input parameter) of the event functions. To set the initial states, the user can extend the base trainer with a modified initTrainingStates() event function, which initializes the states needed to run EWC.

from begin.trainers.nodes import NCTrainer
class NCClassILEWCTrainer(NCTrainer):
    def initTrainingStates(self, model, optimizer):
        return {'fishers': [], 'params': []}

Step 3. Storing previous weights and Fisher matrix (`processAfterTraining()`)

To compute the penalty term at task \(i\), we need the learned weights \(\theta^*_j\) and the Fisher information matrix \(F_j\) for every task \(j < i\). Hence, we need to store them at the end of each task, which is naturally implemented in the event function processAfterTraining(), called at the end of each task. In the example below, curr_training_states['params'][j-1] and curr_training_states['fishers'][j-1] store the learned weights and the Fisher information matrix of task \(j\), respectively.

import torch
from begin.trainers.nodes import NCTrainer

class NCClassILEWCTrainer(NCTrainer):
    def initTrainingStates(self, model, optimizer):
        return {'fishers': [], 'params': []}

    def processAfterTraining(self, task_id, curr_dataset, curr_model, curr_optimizer, curr_training_states):
        super().processAfterTraining(task_id, curr_dataset, curr_model, curr_optimizer, curr_training_states)
        # buffers for the learned weights and the diagonal Fisher information of this task
        params = {name: torch.zeros_like(p) for name, p in curr_model.named_parameters()}
        fishers = {name: torch.zeros_like(p) for name, p in curr_model.named_parameters()}
        train_loader = self.prepareLoader(curr_dataset, curr_training_states)[0]

        total_num_items = 0
        for i, _curr_batch in enumerate(iter(train_loader)):
            curr_model.zero_grad()
            curr_results = self.inference(curr_model, _curr_batch, curr_training_states)
            curr_results['loss'].backward()
            # number of items in the current batch, used to weight its squared gradients
            curr_num_items = _curr_batch[1].shape[0]
            total_num_items += curr_num_items
            for name, p in curr_model.named_parameters():
                params[name] = p.data.clone().detach()
                fishers[name] += (p.grad.data.clone().detach() ** 2) * curr_num_items

        # average the accumulated squared gradients over all items
        for name, p in curr_model.named_parameters():
            fishers[name] /= total_num_items

        curr_training_states['fishers'].append(fishers)
        curr_training_states['params'].append(params)
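
The Fisher estimate accumulated in the loop above is a batch-size-weighted average of squared gradients. The following minimal sketch reproduces that arithmetic with plain Python lists; the function name estimate_diag_fisher and the toy gradients are hypothetical and stand in for the per-batch gradients that backward() would produce.

```python
def estimate_diag_fisher(batches):
    """batches: list of (gradient, num_items), with gradient a list of floats."""
    num_params = len(batches[0][0])
    fisher = [0.0] * num_params
    total_items = 0
    for grad, num_items in batches:
        total_items += num_items
        for k, g in enumerate(grad):
            # weight the squared gradient of each batch by its size
            fisher[k] += (g ** 2) * num_items
    # normalize by the total number of items seen
    return [f / total_items for f in fisher]

batches = [([1.0, -2.0], 3), ([3.0, 0.0], 1)]
fisher = estimate_diag_fisher(batches)
print(fisher)  # [3.0, 3.0]
```

Weighting by batch size keeps a small final batch from counting as much as a full one when the per-batch values are averaged.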

Step 4. Computing penalty term and Performing regularization (`processTrainIteration()` and `afterInference()`)

The penalty term in the above equation is used for regularization during backpropagation. It must be computed at every training iteration, right after the inference step, and thus it is implemented in afterInference(). In this event function, the argument (i.e., input parameter) training_states contains the Fisher information matrices and the previously learned weights, from which the penalty term loss_reg is computed. The event function also has the argument results, which contains the predictions and the loss of the current model computed in the inference() function. Thus, the overall loss including the penalty term is obtained by summing results['loss'] and loss_reg.

import torch
from begin.trainers.nodes import NCTrainer

class NCClassILEWCTrainer(NCTrainer):
    def initTrainingStates(self, model, optimizer):
        return {'fishers': [], 'params': []}

    def processAfterTraining(self, task_id, curr_dataset, curr_model, curr_optimizer, curr_training_states):
        super().processAfterTraining(task_id, curr_dataset, curr_model, curr_optimizer, curr_training_states)
        # buffers for the learned weights and the diagonal Fisher information of this task
        params = {name: torch.zeros_like(p) for name, p in curr_model.named_parameters()}
        fishers = {name: torch.zeros_like(p) for name, p in curr_model.named_parameters()}
        train_loader = self.prepareLoader(curr_dataset, curr_training_states)[0]

        total_num_items = 0
        for i, _curr_batch in enumerate(iter(train_loader)):
            curr_model.zero_grad()
            curr_results = self.inference(curr_model, _curr_batch, curr_training_states)
            curr_results['loss'].backward()
            # number of items in the current batch, used to weight its squared gradients
            curr_num_items = _curr_batch[1].shape[0]
            total_num_items += curr_num_items
            for name, p in curr_model.named_parameters():
                params[name] = p.data.clone().detach()
                fishers[name] += (p.grad.data.clone().detach() ** 2) * curr_num_items

        # average the accumulated squared gradients over all items
        for name, p in curr_model.named_parameters():
            fishers[name] /= total_num_items

        curr_training_states['fishers'].append(fishers)
        curr_training_states['params'].append(params)

    def afterInference(self, results, model, optimizer, _curr_batch, training_states):
        # self.lamb (the hyperparameter lambda) is assumed to be set elsewhere, e.g., in the constructor
        loss_reg = 0
        # accumulate the penalty over all previously learned tasks
        for _param, _fisher in zip(training_states['params'], training_states['fishers']):
            for name, p in model.named_parameters():
                l = self.lamb * _fisher[name]
                l = l * ((p - _param[name]) ** 2)
                loss_reg = loss_reg + l.sum()
        # add the penalty to the task loss and take one optimization step
        total_loss = results['loss'] + loss_reg
        total_loss.backward()
        optimizer.step()
        return {'loss': total_loss.item(),
                'acc': self.eval_fn(results['preds'].argmax(-1), _curr_batch[0].ndata['label'][_curr_batch[1]].to(self.device))}

The above code presents the complete implementation of EWC for node classification under Class-IL. Similarly, various CL algorithms can be developed by modifying specific event functions, without the need to manage the overall training and evaluation procedures. See here for a detailed explanation of the event functions and their parameters.

Combining ScenarioLoader, Evaluator, and Trainer

So far we have learned how to load each component of BeGin. The last step is to combine the components to perform the experiments under the prepared scenario and trainer, and this process also takes just a few lines of code.

import torch
from begin.scenarios.nodes import NCScenarioLoader

# GCN is assumed to be a user-defined model class that is already defined or imported
lr = 1e-3  # learning rate shared by the optimizer and the scheduler's lower bound

scenario = NCScenarioLoader(dataset_name='ogbn-arxiv', num_tasks=8, metric='accuracy', save_path='./data', incr_type='class')
benchmark = NCClassILEWCTrainer(model = GCN(scenario.num_feats, scenario.num_classes, 256, dropout=0.25),
                                scenario = scenario,
                                optimizer_fn = lambda x: torch.optim.Adam(x, lr=lr),
                                loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1),
                                device = torch.device('cuda:0'),
                                scheduler_fn = lambda x: torch.optim.lr_scheduler.ReduceLROnPlateau(x, mode='min', patience=20, min_lr=lr * 0.001 * 2., verbose=True))
results = benchmark.run(epoch_per_task = 1000)

To run the experiment, the trainer object in BeGin requires a learnable model, a CL scenario, a proper loss function to train the model, functions to generate the optimizer and the scheduler, and other auxiliary arguments to customize the trainer. After creating the object, users can start the experiment by calling the member function run() of the trainer object.

In BeGin, at the end of each task, the trainer measures the performance of all tasks. When the procedure is completed, the trainer returns the evaluation results, which is in the form of a matrix. In the matrix, the (i,j)-th entry contains the performance evaluated using the test data of task j when the training of task i has just ended. In addition, BeGin supports the following final evaluation metrics designed for continual learning:

Average Performance (AP): Average performance on all tasks after learning all tasks.
Average Forgetting (AF): Average forgetting on all tasks. We measure the forgetting on task i by the difference between the performance on task i after learning all tasks and the performance on task i right after learning task i.
Forward Transfer (FWT): Average forward transfer on tasks. We measure the forward transfer on task i by the difference between the performance on task i after learning task (i-1) and the performance of the initialized model on task i.
Intransigence (INT): Average intransigence on all tasks. We measure the intransigence on task i by the difference between the performances of the Joint model and the target model on task i after learning task i. BeGin provides this metric if and only if full_mode = True, which simultaneously runs the Bare model and the Joint model, is enabled.
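
The four metrics can be computed from the performance matrix as in the minimal sketch below. It is not BeGin's implementation: the function names, the sign conventions (forgetting and intransigence are positive when performance is lost), the exclusion of the last task from AF and of the first task from FWT, and the toy numbers are assumptions. Here M[i][j] is the performance on task j right after training on task i (0-indexed), init_perf[j] is the performance of the initialized model on task j, and M_joint is the matrix of the Joint model.

```python
def average_performance(M):
    """AP: mean of the last row, i.e., performance on all tasks after the final task."""
    n = len(M)
    return sum(M[n - 1]) / n

def average_forgetting(M):
    """AF: mean of M[i][i] - M[n-1][i] over all tasks except the last one."""
    n = len(M)
    return sum(M[i][i] - M[n - 1][i] for i in range(n - 1)) / (n - 1)

def forward_transfer(M, init_perf):
    """FWT: mean of M[i-1][i] - init_perf[i] over all tasks except the first one."""
    n = len(M)
    return sum(M[i - 1][i] - init_perf[i] for i in range(1, n)) / (n - 1)

def intransigence(M, M_joint):
    """INT: mean of M_joint[i][i] - M[i][i] over all tasks."""
    n = len(M)
    return sum(M_joint[i][i] - M[i][i] for i in range(n)) / n

M = [[0.75, 0.25, 0.25],
     [0.50, 0.75, 0.50],
     [0.25, 0.50, 0.75]]
init_perf = [0.25, 0.25, 0.25]
M_joint = [[1.0, 0.25, 0.25],
           [1.0, 1.00, 0.50],
           [1.0, 1.00, 1.00]]

print(average_performance(M))          # 0.5
print(average_forgetting(M))           # 0.375
print(forward_transfer(M, init_perf))  # 0.125
print(intransigence(M, M_joint))       # 0.25
```

In this toy matrix, the first task drops from 0.75 to 0.25 by the end of training, which dominates the average forgetting of 0.375.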

Pretraining

From v0.4.0, BeGin supports various pretraining methods, allowing users to integrate them with existing CL methods by adding pretraining arguments.

import torch
from begin.scenarios.nodes import NCScenarioLoader
from begin.utils.pretraining import *

# GCN is assumed to be a user-defined model class that is already defined or imported
lr = 1e-3  # learning rate shared by the optimizer and the scheduler's lower bound

scenario = NCScenarioLoader(dataset_name='ogbn-arxiv', num_tasks=8, metric='accuracy', save_path='./data', incr_type='class')
benchmark = NCClassILEWCTrainer(
    model=GCN(scenario.num_feats, scenario.num_classes, 256, dropout=0.25),
    scenario=scenario,
    optimizer_fn=lambda x: torch.optim.Adam(x, lr=lr),
    loss_fn=torch.nn.CrossEntropyLoss(ignore_index=-1),
    device=torch.device('cuda:0'),
    scheduler_fn=lambda x: torch.optim.lr_scheduler.ReduceLROnPlateau(x, mode='min', patience=20, min_lr=lr * 0.001 * 2., verbose=True),
    pretraining=DGI  # pretraining method to run before the main training
)
results = benchmark.run(epoch_per_task=1000)

Implementing Custom Pretraining Method

Similar to the trainer, BeGin provides a base implementation for pretraining methods. To implement a new pretraining method, you extend the PretrainingMethod class. Currently, BeGin supports the following components and event functions. Note that implementing each of them is optional unless stated otherwise. If user-defined functions are not provided, the default pre-implemented base functions are used.

PretrainIterator: This class is required for training on node-level and link-level tasks. The default implementation assumes full-batch training.
iterator(): This function is invoked for every epoch when the trainer requires an iterator for pretraining. The default implementation returns a PretrainIterator object.
inference(): This function is called during each inference step in the pretraining process. Implementing this function is mandatory to operate the pretraining procedure.
update(): This function is called when the best checkpoint needs to be updated. The default implementation stores the current state_dict of the model in self.best_checkpoint.
processAfterTraining(): This function is called once when the trainer concludes pretraining. The default implementation initializes the model using the saved best checkpoint (specifically, self.best_checkpoint) before the main training begins.

Suppose we implement the Deep Graph Infomax (DGI) method for node-level problems. DGI is a method for learning useful representations from graph data. It generates node embeddings that capture structural information from the graph. Unlike traditional autoencoders, DGI aligns local node embeddings with a global representation to extract richer information from the graph. DGI uses an encoder, typically a Graph Convolutional Network (GCN), to create node embeddings from graph features. A global readout function (e.g., mean or max pooling) summarizes these embeddings. The discriminator is then trained to maximize the mutual information between the node embeddings and the global summary, which leads to better graph representations.

Step 1. Extending the base and Implementing discriminator

First, we create a new class by extending PretrainingMethod and pass the encoder used for pretraining (here, a GCN) to the superclass constructor in the __init__ method. Next, we implement the discriminator in this subclass to distinguish between real and corrupted node embeddings, maximizing the mutual information between the node embeddings and the global summary.

import math
import torch
import torch.nn as nn
# PretrainingMethod is assumed to be exported by begin.utils.pretraining
from begin.utils.pretraining import *

class DGI(PretrainingMethod):
    class Discriminator(nn.Module):
        def __init__(self, n_hidden):
            super().__init__()
            # bilinear weight used to score (node embedding, summary) pairs
            self.weight = nn.Parameter(torch.empty(n_hidden, n_hidden))
            self.reset_parameters()

        def uniform(self, size, tensor):
            # initialize uniformly in [-1/sqrt(size), 1/sqrt(size)]
            bound = 1.0 / math.sqrt(size)
            if tensor is not None:
                tensor.data.uniform_(-bound, bound)

        def reset_parameters(self):
            size = self.weight.size(0)
            self.uniform(size, self.weight)

        def forward(self, features, summary):
            # one logit per node: features @ (W @ summary)
            features = torch.matmul(features, torch.matmul(self.weight, summary))
            return features

    def __init__(self, encoder):
        super().__init__(encoder)
        self.discriminator = self.Discriminator(encoder.n_hidden)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def inference(self, inputs):
        pass

Step 2. Implementing the inference process

We need to implement code that maximizes mutual information using the encoder and discriminator. First, the real embeddings, corrupted embeddings, and the global embedding obtained through the readout function can be computed with the following code.

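def inference(self, inputs):
    graph, features = inputs, inputs.ndata['feat']
    # embeddings of the real graph
    positive = self.encoder.forward_without_classifier(graph, features)
    # corrupted embeddings, obtained by shuffling the node features
    perm = torch.randperm(graph.number_of_nodes()).to(features.device)
    negative = self.encoder.forward_without_classifier(graph, features[perm])
    # global summary (readout): sigmoid of the mean of the real embeddings
    summary = torch.sigmoid(positive.mean(dim=0))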

Lastly, we feed two kinds of pairs into the discriminator: the real embeddings with the global summary, and the corrupted embeddings with the global summary. To make the discriminator separate the two, we use nn.BCEWithLogitsLoss to calculate the loss, with target 1 for the real pairs and target 0 for the corrupted pairs. The final code is as follows. The returned loss is used in the default processPretraining() to perform backpropagation, which drives the overall pretraining process.

import math
import torch
import torch.nn as nn
# PretrainingMethod is assumed to be exported by begin.utils.pretraining
from begin.utils.pretraining import *

class DGI(PretrainingMethod):
    class Discriminator(nn.Module):
        def __init__(self, n_hidden):
            super().__init__()
            # bilinear weight used to score (node embedding, summary) pairs
            self.weight = nn.Parameter(torch.empty(n_hidden, n_hidden))
            self.reset_parameters()

        def uniform(self, size, tensor):
            # initialize uniformly in [-1/sqrt(size), 1/sqrt(size)]
            bound = 1.0 / math.sqrt(size)
            if tensor is not None:
                tensor.data.uniform_(-bound, bound)

        def reset_parameters(self):
            size = self.weight.size(0)
            self.uniform(size, self.weight)

        def forward(self, features, summary):
            # one logit per node: features @ (W @ summary)
            features = torch.matmul(features, torch.matmul(self.weight, summary))
            return features

    def __init__(self, encoder):
        super().__init__(encoder)
        self.discriminator = self.Discriminator(encoder.n_hidden)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def inference(self, inputs):
        graph, features = inputs, inputs.ndata['feat']
        # embeddings of the real graph
        positive = self.encoder.forward_without_classifier(graph, features)
        # corrupted embeddings, obtained by shuffling the node features
        perm = torch.randperm(graph.number_of_nodes()).to(features.device)
        negative = self.encoder.forward_without_classifier(graph, features[perm])
        # global summary (readout): sigmoid of the mean of the real embeddings
        summary = torch.sigmoid(positive.mean(dim=0))

        # discriminator logits for real and corrupted pairs
        positive = self.discriminator(positive, summary)
        negative = self.discriminator(negative, summary)
        # real pairs should be classified as 1, corrupted pairs as 0
        l1 = self.loss_fn(positive, torch.ones_like(positive))
        l2 = self.loss_fn(negative, torch.zeros_like(negative))
        return l1 + l2
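
To make the objective concrete, the following sketch recomputes the same loss with the standard library only, on made-up two-dimensional embeddings. The function names and the toy values are hypothetical; the bilinear score, the sigmoid-of-mean readout, and the mean-reduced binary cross-entropy on logits follow the code above.

```python
import math

def bilinear_score(h, W, s):
    """Discriminator logit h^T W s for one node embedding h and summary s."""
    Ws = [sum(W[r][c] * s[c] for c in range(len(s))) for r in range(len(W))]
    return sum(h[r] * Ws[r] for r in range(len(h)))

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def dgi_loss(positive, negative, W):
    """Mean BCE of real pairs (target 1) plus mean BCE of corrupted pairs (target 0)."""
    dim = len(positive[0])
    mean = [sum(h[k] for h in positive) / len(positive) for k in range(dim)]
    summary = [1.0 / (1.0 + math.exp(-m)) for m in mean]  # sigmoid readout
    pos = [bce_with_logits(bilinear_score(h, W, summary), 1.0) for h in positive]
    neg = [bce_with_logits(bilinear_score(h, W, summary), 0.0) for h in negative]
    return sum(pos) / len(pos) + sum(neg) / len(neg)

positive = [[2.0, 0.0], [2.0, 0.0]]    # embeddings of the real graph
negative = [[-2.0, 0.0], [-2.0, 0.0]]  # embeddings of the corrupted graph
zero_W = [[0.0, 0.0], [0.0, 0.0]]
identity_W = [[1.0, 0.0], [0.0, 1.0]]

loss_uninformative = dgi_loss(positive, negative, zero_W)
loss_separating = dgi_loss(positive, negative, identity_W)
print(loss_uninformative, loss_separating)
```

With a zero weight matrix every logit is 0, so the loss equals 2 ln 2, the value of a discriminator that cannot tell real from corrupted pairs; a weight matrix that scores real embeddings higher than corrupted ones lowers the loss.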