RCNX: Residual Capsule NeXt

Unlike the popular Convolutional Neural Network (CNN), which relies on shift-invariance in the image, Capsule Networks rely on modeling hierarchical relationships. This property keeps Capsule Networks relevant in the machine learning domain despite their enormous size and an accuracy that is only comparable to that of CNNs. Capsules use an intricate routing-by-agreement algorithm, which accounts for both their size and their uniqueness. Recent developments in Capsule Networks have helped to reduce their size and improve their accuracy. We focus on modifying one such Capsule Network, the Residual Capsule Network (RCN), to a size comparable to modern CNNs, thus reaffirming the importance of Capsule Networks. In this paper, Residual Capsule NeXt (RCNX) is proposed as an effective and more advanced version of RCN, with a size of 1.5M parameters and a substantial improvement in accuracy on the CIFAR-10 dataset, to 89.3%. In both accuracy and size, RCNX outperforms MobileNetV3, a well-known CNN model for embedded devices.

take inspiration from the millions of years' worth of the human visual system's evolution [2].
Machine learning is an integral part of modern society, with applications ranging from security to games to complex machine drawings. Although CNNs exceed human performance on some tasks, they do not closely resemble the human brain [3]. To bridge this gap and bring neural networks closer to biological ones, the Capsule Network (CapsNet) was introduced [4]. Capsule Networks rely on connections formed by the routing-by-agreement algorithm [4]. This thesis focuses on improving one such network, the Residual Capsule Network, in which a capsule network is merged with the ResNet architecture to produce an image classification model.

Motivation
The abundance of neural network architectures has helped us understand models at various levels of complexity [5]. Neural networks came into existence partly as an attempt to replicate brain cells. Although many unknown factors remain, we build superficially similar neurons paired with activation functions that resemble what can be expected in a brain cell. As neural network architectures grew, the limitations of CNNs became apparent [6]. CNNs, also known as shift-invariant artificial neural networks, are, as the name suggests, built on a convolution function that learns features invariant to a spatial shift of the window over the input [4]. CNNs are useful for image recognition, object recognition, image classification, and image segmentation.
This thesis focuses on the image classification task and proposes two new models, RCN2: Residual Capsule Network v2 and RCNX: Residual Capsule NeXt, that outperform comparable CNNs. Although CNNs are good at image classification, they have a clear weakness: they are insensitive to pose and image transformations [7].
This insensitivity extends to the point where the spatial correlation between sub-features is discarded. Capsule Networks were developed to address these flaws of CNNs and to follow the human visual system more closely. In human vision, we observe robust spatial correlation and a strong pose effect, in which an object is characterized by the pose of its sub-parts.
Capsule Networks work on the principle of routing by agreement. A Capsule Neural Network is a system of artificial neural networks that is better able to model hierarchical relationships [4], and it is much closer to biological neural organization than a CNN. Owing to these advantages, Capsule Networks can outperform CNNs in understanding an object and its characteristics in an image. Since routing by agreement creates connections only between neurons that agree, the number of parameters is lower in comparison to CNNs.

The Residual Capsule Network is a combination of the ResNet CNN architecture and the Capsule Network architecture [7]. Residual convolutional blocks are used for the initial layers of the Capsule Network. These blocks help the model avoid vanishing gradients and improve the applicability of the Capsule Network.

Challenges
This thesis aims to modify the Residual Capsule Network into a model that is comparable to CNN models, with high accuracy and fewer parameters. The challenges in achieving this include the following.
• Creating a new neural network structure from RCN (the baseline architecture) [7].
• Reducing the model size.
• Testing the model.
• Increasing the accuracy of the model.
• Deploying the model on an embedded system.

Methodology
In this thesis, the following procedure is followed:
• Analyzing the architecture of RCN.
• Finding weaknesses of the structure.
• Finding improvements for the weaknesses of RCN.
• Estimating the increase or decrease in the number of parameters.
• Implementing established ideas for improving accuracy.
• Confirming the architecture's learning capability.
• Checking for over-fitting or under-fitting.
• Optimizing parameters with hyper-parameter tuning using NNI.
• Reducing the size of the model without affecting accuracy.
• Deploying on embedded hardware.

Contributions
The contributions of this thesis are listed as follows.
• Design space exploration of the RCN architecture.
• Verification of the image classification capability of the proposed models.
• Deployment of the efficient, compressed RCNX on the i.MX RT1060.
• Two papers accepted for IEEE conferences.

[9]. It will discern the differences between CNNs and CapsNets and review the deep learning architectures that led to modern computer vision. These sections also discuss the missing elements of RCN and 3-level RCN; these disadvantages are removed through design space exploration later in the thesis.

Neural Networks
Neural networks (NNs) consist of rules and algorithms that try to imitate the human brain in learning relationships within large amounts of data [10]. An NN is a collection of neurons that simulates a network comparable to a human brain [11]. NNs can adapt to changing input, so the model produces the best possible result without the output criteria having to be redesigned [12].
A "neuron" in an NN is a mathematical function that collects and classifies data according to learned relationships. NNs bear a resemblance to statistical methods such as regression analysis and curve fitting.
Applications of NNs are extensive and include natural language processing, computer vision, stock market prediction, and astronomical data analysis. We focus on the computer vision applications of neural networks [5].

Convolutional Neural Networks
Convolutional Neural Networks (CNNs), also called shift-invariant artificial neural networks, are, as the name suggests, built on a convolution function that learns features invariant to a spatial shift of the window over the input [14], [9]. This makes CNNs well suited to image recognition, object recognition, image classification, and image segmentation. The application of CNNs extends to videos and lidar point clouds [5]. A typical CNN, as depicted in Figure 2.2, contains convolution layers, pooling layers, activation layers, and a fully connected layer.
The initial convolutional layers extract features, and the subsequent pooling layers reduce the number of connections, lowering the computation and the number of parameters, as visualised in Figure 2.3. Pooling is mainly done in two forms, maximum value pooling and average value pooling [12]. A pooling layer reduces the information in each kernel window to only its maximum or its average. Pooling therefore causes a loss of information, which the Capsule Network authors regard as the first and foremost disadvantage of CNNs [16].
Pooling layers are used extensively in CNNs to reduce model size, thereby achieving accuracy with a smaller model. Since a main aim of neural networks is to create models comparable to the human brain, pooling layers make CNNs less similar to human vision [18]. The scalar-valued features that CNNs produce are their second limitation. The following section on capsule networks describes the methods used to overcome these flaws.
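The information loss caused by pooling can be made concrete with a minimal Python sketch. The function name `pool2d` and the 4 × 4 feature map are illustrative assumptions; the sketch only shows that each 2 × 2 window is collapsed to a single number, so the positions of the values inside the window are discarded.

```python
def pool2d(image, size, mode="max"):
    """Non-overlapping size x size pooling over a 2D list of numbers."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Collect every value inside the current window.
            window = [image[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled

# Hypothetical 4 x 4 feature map, used only for illustration.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 4],
    [2, 0, 3, 8],
]
max_pooled = pool2d(feature_map, 2, "max")  # [[7, 2], [2, 9]]
avg_pooled = pool2d(feature_map, 2, "avg")  # [[4.0, 1.0], [1.0, 6.0]]
```

Sixteen values are reduced to four in both cases, and neither output records where within its window the maximum or the average came from.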

Capsule Neural Network
This section gives a detailed overview of CapsNet. Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution [4]. Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation. CapsNet assumes that a single fixation yields much more than a single identified object and a list of its properties [7]. CapsNet is built on the idea that the human visual system creates a parse-tree-like structure on each fixation. Thus, the multilayer CapsNet model is organized as a tree-like structure in which each layer contains groups of neurons, and these groups are called capsules. A typical CapsNet is depicted in Figure 2.4.

Figure 2.4. CapsNet [4]
CapsNet, being different from CNNs, is a field whose full potential has yet to unfold.
A typical CapsNet includes a convolution layer, primary capsules, digit capsules, and a dynamic routing algorithm. Its salient feature is that features are extracted and represented as N-dimensional vectors produced by capsules [4]. A capsule is a group of neurons that outputs an N-dimensional vector containing the instantiation parameters of an entity, such as an object or a sub-part of an object. The length of this vector represents the probability that the entity exists, and its direction represents the entity's pose.
Active capsules make predictions, through transformation matrices, for the instantiation parameters of capsules higher in the hierarchy [4]. When multiple predictions agree, the higher-level capsule becomes active. Each low-level capsule's output is routed to the high-level capsule whose output agrees with the low-level capsule's prediction. This agreement-based routing is called the routing-by-agreement algorithm, and it is the dynamic routing algorithm used by CapsNet. To measure the agreement, the scalar product of the low-level capsule's prediction vector and the high-level capsule's output vector is calculated, and the route with the larger product is preferred [19].

Capsule vector input and output computation
There are many ways to implement the general idea of capsules. CapsNet capsules aim to produce a vector whose length represents the likelihood that an entity is present in the input. To accomplish this, CapsNet uses the following squashing function [4]:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|}$$
where $s_j$ is the total input to capsule $j$ and $v_j$ is its vector output. The squashing function supplies the non-linearity used during training and also normalizes the vector outputs so that their length lies between zero and one. From the second capsule layer onward, the input $s_j$ is a weighted sum over the prediction vectors $\hat{u}_{j|i}$, each obtained by multiplying the output $u_i$ of a capsule in the layer below by a weight matrix $W_{ij}$ [4]:
$$s_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i$$
In the above equation, $c_{ij}$ is the coupling coefficient determined by the iterative dynamic routing algorithm. The loss function of CapsNet is given in Figure 2.5.
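The capsule computation can be illustrated with a short Python sketch. It follows the formulation given for CapsNet in [4], in which the total input is $s_j = \sum_i c_{ij} \hat{u}_{j|i}$ and the output is the squashed vector $v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$. The helper names `squash`, `weighted_sum`, and `length`, the small constant `eps`, and the toy vectors are illustrative assumptions, not part of any reference implementation.

```python
import math

def squash(s, eps=1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    # eps (an assumption of this sketch) avoids division by zero for s = 0.
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]

def weighted_sum(coupling, predictions):
    """s_j = sum_i c_ij * u_hat_j|i for a single higher-level capsule j."""
    dim = len(predictions[0])
    return [sum(c * u[d] for c, u in zip(coupling, predictions))
            for d in range(dim)]

def length(v):
    return math.sqrt(sum(x * x for x in v))

# Two lower-level capsules that make the same prediction for capsule j.
s_j = weighted_sum([0.5, 0.5], [[3.0, 4.0], [3.0, 4.0]])  # [3.0, 4.0]
v_j = squash(s_j)  # same direction as s_j, length 25/26
```

A long input vector is squashed to a length just below one, while a short vector such as `[0.1, 0.0]` is squashed to a length close to zero, which is what allows the length to be read as a probability.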

Routing-by-agreement algorithm
In general, routing algorithms are of two types: adaptive and non-adaptive. Capsule networks use an adaptive routing algorithm based on the routing-by-agreement mechanism, which is a dynamic routing algorithm [4], [10]. Agreement is measured by the scalar product of the low-level capsule's prediction vector and the high-level capsule's output vector, and the route with the larger product is preferred. The dynamic routing algorithm used by CapsNet is given in Algorithm 1 [4].

Algorithm 1 Dynamic Routing Algorithm
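procedure Routing($\hat{u}_{j|i}$, $r$, $l$)
  Initialisation: for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow 0$
  for $r$ iterations do
    for all capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$
    for all capsule $j$ in layer $(l + 1)$: $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
    for all capsule $j$ in layer $(l + 1)$: $v_j \leftarrow \mathrm{squash}(s_j)$
    for all capsule $i$ in layer $l$ and capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
  end for
  return $v_j$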
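Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration of routing by agreement as described in [4], not the implementation used in this thesis; the nested-list data layout, the function names, and the toy prediction vectors are assumptions made for the example.

```python
import math

def squash(s, eps=1e-9):
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps)
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits b_ij
    for _ in range(iterations):
        # Coupling coefficients: softmax over the upper capsules for each i.
        c = [softmax(b_i) for b_i in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: scalar product of prediction and output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Three lower capsules agree on upper capsule 0 and disagree on capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = dynamic_routing(u_hat, iterations=3)
```

In this toy case the coupling coefficients towards capsule 0 grow above 0.5 for every lower capsule, and the output of capsule 0 ends up longer than that of capsule 1, which reflects the preference for routes with larger agreement.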

CapsNet Architecture
The architecture of CapsNet, given in Figure 2.4, is shallow: it has only two convolutional layers and one fully connected layer. The first convolutional layer, with 256 channels, a 9 × 9 kernel, a stride of one, and ReLU activation, converts pixel intensities into the activities of local feature detectors. These features are then fed to the primary capsules, which form the lowest level of the capsule hierarchy [7]. CapsNet is concerned with two processes, rendering and inverse rendering, and activating the primary capsules corresponds to inverse rendering. This is a very different type of computation from piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at. Each primary capsule has an 8D vector output, and the primary capsule layer can be seen as a convolutional layer. DigitCaps is the final layer of CapsNet, and it contains one 16-dimensional vector per class. Since CapsNet was originally used for handwritten digit classification, the classes are digits, hence the name DigitCaps [4]. Routing is only necessary between two consecutive capsule layers.
The reconstruction network, given in Figure 2.6, is a crucial part of CapsNet. The vector predicted by DigitCaps is treated as an encoding of the image features, from which the input image is reconstructed. The reconstruction loss encourages DigitCaps to learn a better instantiation of the input image. The reconstruction network used by CapsNet consists of three fully connected layers whose final output has the same size as the input image.
The reconstruction loss is scaled down to a small value so that it does not dominate the overall loss.
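How a scaled reconstruction loss combines with the classification loss can be sketched as follows. The margin loss form and the constants used here (`m_plus = 0.9`, `m_minus = 0.1`, `lam = 0.5`, and a reconstruction weight of 0.0005) are taken as assumptions from the description of CapsNet in [4]; they are not stated in this thesis, and the function names are illustrative.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss over capsule output lengths; target is the true class index."""
    loss = 0.0
    for k, length_k in enumerate(lengths):
        if k == target:
            # The true class should have a long output vector.
            loss += max(0.0, m_plus - length_k) ** 2
        else:
            # Other classes should have short output vectors.
            loss += lam * max(0.0, length_k - m_minus) ** 2
    return loss

def reconstruction_loss(image, reconstruction):
    """Sum of squared differences between input pixels and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction))

def total_loss(lengths, target, image, reconstruction, recon_weight=0.0005):
    # The small weight keeps the reconstruction term from dominating.
    return (margin_loss(lengths, target)
            + recon_weight * reconstruction_loss(image, reconstruction))

example = total_loss([0.5, 0.3], 0, [1.0, 0.0], [0.0, 0.0])
```

With these toy inputs the margin term is about 0.18 and the reconstruction term contributes only 0.0005, which shows why the scaling leaves classification as the main training signal.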

Performance of CapsNet
CapsNet was trained on the MNIST dataset, which contains 28 × 28 images of handwritten digits. CapsNet performed very well, with a test error of only 0.25%. This accuracy was achieved with a model only three layers deep [4].
CapsNet was not designed for complex image datasets such as CIFAR-10. The version applied to CIFAR-10-like datasets is an ensemble of seven CapsNet models. This seven-ensemble model reaches a test accuracy of 89.40% with 101.5 million parameters [7].

Residual Network
Deeper neural networks generally achieve better accuracy. Increasing the number of parameters and the depth of the network is considered the straightforward way to improve a model's accuracy. However, such growth in model size has a limit [20]. This limit arises from vanishing gradients and from problems related to dimensionality [6], which deep networks often face. Vanishing gradients have existed since the early days of deep neural networks. The first methods to overcome this obstacle to convergence were normalized initialization and intermediate normalization layers. These techniques allowed proper back-propagation by reducing the vanishing gradient problem [20]. Building on these basic techniques, which take time and effort, newer solutions started to surface. The deep residual learning framework [20] addressed these issues with a different architecture. This solution is elegant and achieves faster convergence. Rather than hoping that every few stacked layers directly fit a desired underlying mapping, deep residual learning explicitly lets these layers fit a residual mapping. Let the mapping computed by the stacked layers be F(x); residual learning adds the input of that subsection of the model to its output, so that the overall mapping becomes F(x) + x [20]. The connections that carry x past the stacked layers are called skip connections or shortcut connections. Networks built this way were named Residual Networks (Figure 2.7).
Residual Networks are arguably among the most innovative works in the computer vision community in the past decade. The added term is an identity function of the input to the model's subsection, so these connections are also called identity shortcut connections. ResNets were able to train numerous layers without degrading or saturating the model's accuracy [20]. With this technique alone, many image classification and object detection methods have improved tremendously. Experiments established that models without shortcuts tend to learn more slowly and saturate sooner than models with the same architecture plus identity shortcut connections. Being simple in concept, Residual Networks are easy to implement. In a regular CNN, each layer's output is the only input to the following layer. In a ResNet, the input to a layer in a high-level feature extractor is the prior layer's output plus a feed-forward skip connection from two to three layers before it. ResNets contain skip connections to layers where a vanishing gradient problem might occur. Even with hundreds of layers, residual connections were reported to overcome the vanishing gradient problem. Pre-activation residual connections have been found to further help the passage of information. Residual Networks have achieved striking performance on image classification tasks by introducing skip connections that serve as bypass paths [21]. Skipping effectively simplifies the network and increases the learning speed by reducing the effect of vanishing gradients, as there are fewer layers to propagate through.
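The effect of the identity shortcut can be seen in a minimal Python sketch. Two small linear layers stand in for the stacked convolutions F(x); the function names and the all-zero weights are assumptions chosen to show the extreme case in which F(x) contributes nothing.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def linear(xs, weights):
    """Multiply vector xs by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]

def plain_block(x, w1, w2):
    """Two stacked layers without a shortcut: output is F(x) only."""
    return relu(linear(relu(linear(x, w1)), w2))

def residual_block(x, w1, w2):
    """Same two layers with an identity shortcut: output is F(x) + x."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]  # weights for which F(x) = 0
plain_out = plain_block(x, zero, zero)        # [0.0, 0.0]
residual_out = residual_block(x, zero, zero)  # [1.0, 2.0]
```

When the stacked layers output zero, the plain block loses the signal entirely, while the residual block still passes the input through unchanged. The same shortcut path is what gives gradients a direct route back to earlier layers.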

Baseline: Residual Capsule Network (RCN)
A deeper network is more effective than a shallow one but, consequently, more cumbersome to train. ResNets ease training and have been shown to give high accuracy at considerable depth. By combining the Residual Network with the Capsule Network, the RCN model came into existence [7]. RCN Architecture is as shown in Figure 2 [7] vectors by mapping these into a 16-dimensional space with weighted coefficients [7]. As with every neural network for classification, RCN converges to a final layer whose size equals the number of classes, i.e., ten classes for the CIFAR-10 dataset. Thus the DigitCaps output of 16-dimensional vectors is mapped to ten neurons so that the training target can be handled as a one-hot encoding.
We note that RCN has a vast number of parameters, generated solely by the thick, redundant layers of ResNet convolutions. Like any CapsNet-based architecture, RCN has a reconstruction network. A reconstruction network generally takes the network's output as its input and produces the network's input as its output. On detailed study, we find that the reconstruction network of RCN lacks the required complexity. We assume that this is due to the network's size, since a fully fledged reconstruction network would add many more parameters to the existing RCN. With this in mind, we chose not to focus on the reconstruction network, as it is easily removed after training. It is crucial to [7]. This model was a good achievement, but we believe that addressing the flaws mentioned above can bring out the best in RCN.

YOLO-v3
You Only Look Once (YOLO) is an extensive model for object recognition in images [22].
YOLO-v3 became one of the popular models among developers because it is quick to deploy and fast to execute. YOLO-v4 has since been released and is reported to be the fastest algorithm for object recognition. The YOLO model recognizes an object and draws a bounding box within which the object is most likely to be present. YOLO uses a training set that consists of images and the corresponding bounding boxes of the target objects [22]. The exemplary YOLOv3 model inspired the RCN authors to propose another network, the 3-level Residual Capsule Network.
The inspiring idea in YOLOv3 is a three-stage network in which each stage views the input picture at a different scale. YOLO needs these stages because it targets objects at different scales. The structure of YOLO-v3 is shown in Figure 2.9 [23]. YOLO implements a feature-dependent trainable model consisting of 75 convolution layers. YOLO avoids fully connected layers altogether by using convolutions of specific sizes, which produce a fully-connected-like structure. It also avoids pooling layers, which can likewise be replaced with convolutions that act as weight-based pooling [22]. The authors also avoided the SoftMax activation, which in itself limits the network, although it is necessary for convergence.

3-level Residual Capsule Network
The authors of RCN took the salient features of YOLOv3 and combined them with RCN [9]. These features brought a substantial change to RCN. The authors implemented a three-stage network of eight-layer ResNets, with connections from each eight-layer ResNet stage to the primary capsules. A single Residual Network stage did not solve the problem for complex datasets like CIFAR-10 [9]. This lack of performance may arise because simple primary capsules are not sufficient to capture all the image features.
In the 3-level RCN, the primary capsules were modified to receive features from three stages of the ResNet layers. The seven-ensemble CapsNet produced a test accuracy of 89.4%. This accuracy would have been sufficient had the seven-ensemble CapsNet not required about 100 million parameters [9]. This number of parameters is very high for a machine learning model. Although the 3-level residual capsule network aimed at reducing these parameters, and succeeded, we believe there is room for further improvement.
The structure of the 3-level residual capsule network, shown in Figure 2.10, inspired the new architectures in this thesis. A hierarchical, pyramid-like structure is the core of YOLOv3 and is also the structure of the 3-level RCN. Each level of this pyramid contains eight repeated, cumbersome RCN layers. In the 3-level RCN, the primary capsules are fed with 12 squashed capsule inputs, and their outputs are provided to DigitCaps. DigitCaps uses a dynamic routing algorithm like that of CapsNet. Its outputs are then fed into the reconstruction layers, which are fully connected layers with ReLU activation for two layers and a final layer with a 10-unit output and a sigmoid activation function [9]. The 3-level RCN performed better than RCN when tested on the CIFAR-10 dataset.
While the initial RCN produced an accuracy of only 84.16% with 11.86 M parameters, the later 3-level RCN achieved 86.42% accuracy with 10.8 M parameters [9]. However, this impressive reduction in size relative to the seven-ensemble CapsNet came with a drop in accuracy from 89.40% to 86.42%.
We believe the full potential of RCN is yet to be realized.

ResNeXt and Cardinality
ResNeXt is a neural network that improves the structure and performance of ResNet [24]. The changes mainly revolve around the convolutions: the convolutional layers of ResNet are extended with an additional dimension, called cardinality. Squeezing the conventional ResNet convolutions and adding cardinality leads to higher performance with a reduction in ResNet size [24]. Cardinality is an additional dimension, alongside the number of filters, that improves the network to a large extent. Since the ResNeXt authors demonstrated that cardinality adds useful capacity to the model's convolutions, it is reasonable to expect that replacing ResNet convolutions with ResNeXt convolutions will bring improvements. This network is shown in Figure 2.11.
Cardinality denotes the size of the set of transformations. Introducing cardinality into the ResNet design leads to ResNeXt. ResNeXt was a leading classification model for the COCO dataset. Although the difference between datasets has to be accounted for, it can be noted that YOLOv3 was also trained on the COCO dataset. Since the YOLO architecture inspired the existing 3-level residual capsule network, we see no reason for ResNeXt and the three-level structure to be incompatible [22]. The idea of cardinality arises from insights provided by several families of convolutional models, i.e., multi-branch convolutional networks, grouped convolutions, compressed convolutional networks, and ensembling [25]. ResNeXt outperformed many modern architectures, such as ResNet, Inception, and Inception-ResNet. ResNeXt follows the idea of Network-in-Neuron rather than that of a Network-in-Network model. This idea expanded into the creation of a new dimension.
The ResNeXt authors call this new method aggregated transformations [24]. Through experiments, the ResNeXt paper shows in detail that the extra dimension is essential to the model's capabilities and that increasing it is more effective than increasing depth or width.
It is crucial to note that the ResNeXt capability comes with no extra parameters and is purely an architectural strengthening tool. We discuss how we used this capability to improve RCN in later sections.
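The claim that cardinality comes at roughly no extra parameter cost can be checked with a short calculation. The sketch below counts convolution weights for a ResNet bottleneck block and for a ResNeXt block implemented as a grouped convolution. The channel widths (256 input channels, a bottleneck width of 64, and a cardinality of 32 with 4 channels per group) are assumed here for illustration, following the template commonly quoted from the ResNeXt paper [24]; biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias omitted).

    With groups > 1, each output channel only sees c_in / groups inputs.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel

# ResNet bottleneck: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256).
resnet_block = (conv_params(256, 64, 1)
                + conv_params(64, 64, 3)
                + conv_params(64, 256, 1))

# ResNeXt block: 1x1 (256 -> 128), grouped 3x3 with cardinality 32,
# 1x1 (128 -> 256). Each group works on 4 channels.
resnext_block = (conv_params(256, 128, 1)
                 + conv_params(128, 128, 3, groups=32)
                 + conv_params(128, 256, 1))
```

Under these assumptions the two blocks hold 69,632 and 70,144 weights respectively, so the 32 parallel transformations are obtained at essentially the same parameter budget as the original bottleneck.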

DeepCaps
We have discussed various changes to the convolutions that improve CapsNet. The size of CapsNet, however, also depends heavily on the dynamic routing algorithm.
Therefore, we considered various dynamic routing algorithms and, after analysing each, found that DeepCaps offers a better routing algorithm. DeepCaps targets the depth of the Capsule Network. It builds on the intuition that, as in a CNN, performance improves as the network gets deeper. This 'going deeper' was achieved by improving the routing algorithm [26].
DeepCaps was designed with similar classification tasks in mind, such as CIFAR-10. Its highlight is a new routing algorithm, a 3D-convolution-based dynamic routing algorithm [26]. Along with the improved routing algorithm, DeepCaps also addresses other factors such as the decoder network, which was improved into a class-independent decoder.
The 3D-convolution-based dynamic routing algorithm is given in Algorithm 2 [26].
In detail, the new dynamic routing algorithm places great importance on avoiding unnecessary routes. It relies on the observation that neighboring neurons activate similarly and provide similar instantiation parameters to higher-level capsules. DeepCaps removes this redundancy by involving a convolution in the dynamic routing algorithm. Typically, a 3 × 3 filter with one channel is used inside the original dynamic routing, hence the name 3D-convolution-based dynamic routing algorithm. DeepCaps recognizes that the depth of CapsNet matters for complex networks, and removing these redundant routes enables CapsNet to handle greater depth.
Consider an N-channel input to the new dynamic routing. DeepCaps creates a 3D voting-like system in which the winning mode is detected with a weighted summation. With this technique, the number of parameters in CapsNet is reduced by a factor of $c \cdot (w_L \, w_{L+1})^2$ in each capsule, where $c$ represents the number of channels and $w_L$ represents the width of layer $L$ [26].

PROPOSED ARCHITECTURES
This section details the two architectures proposed and implemented in this thesis.

RCN2: Residual Capsule Network V2
As discussed, CNNs can only exploit shift-invariance in an image, which results from the parameter sharing of convolutional layers and, in part, from pooling layers. This is the limitation that we try to overcome with CapsNet. CapsNet, being different from CNNs, is a field whose full potential has yet to unfold. A typical CapsNet includes a convolution layer, primary capsules, digit capsules, and a dynamic routing algorithm [26]. The salient feature of CapsNet is that features are extracted and represented as N-dimensional vectors produced by capsules [4]. CapsNet is concerned with two processes, rendering and inverse rendering, and activating the primary capsules corresponds to inverse rendering. CapsNets are well suited to this distinctive kind of computation.
The proposed RCN2 architecture is inspired by RCN (Figure 2.8), from which it takes the use of Residual Networks together with CapsNet [7]. A deeper network is more effective than a shallow one but, consequently, more cumbersome to train. ResNets ease training and have proven that they can give high accuracy. Combining the Residual Network with the Capsule Network produced the RCN model, and the combination has proven very helpful [7].
RCN2 is a compressed and improved version of RCN. We implemented several modifications that improve RCN, and we named the resulting network RCN2. RCN2 architecture is depicted in Figure 3

Improving Residual Convolution layers with Bottleneck
As we understand from the literature review, the initial layer of CapsNet is a convolutional layer. This convolutional layer is solely responsible for extracting the image features. Multiple convolutional layers give the model the complexity it needs to learn features better. The RCN authors addressed this by including eight repeated ResNet layers. These layers, however, have no bottleneck structure, which limits their ability to learn efficiently. Adding a bottleneck design to each layer reduces the training time and lets training run more smoothly than in the baseline RCN. In each ResNet layer where the RCN authors used a 2-layer structure, we changed to a 4-layer structure containing 1 × 1, 3 × 3, 1 × 1, and 3 × 3 convolutions. This bottleneck strategy, together with identity routes, gives lower time complexity and a smaller model. We configured this network as a 3-level structure to improve performance. Thus, at each stage, instead of eight redundant layers, we use two ResNet layers with bottlenecks.
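The saving from this restructuring can be estimated with a rough calculation. The layer counts below follow the description above (three stages, eight 2-layer ResNet layers per stage in the 3-level RCN, and two 4-layer bottleneck layers per stage in RCN2). The channel width of 64 and the reduction factor of 4 inside the bottleneck are hypothetical values chosen only for illustration, since the widths are not stated here, and the assumed baseline of two 3 × 3 convolutions per ResNet layer is likewise an assumption.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a 2D convolution (bias omitted)."""
    return c_in * c_out * kernel * kernel

def basic_block_params(channels):
    """Assumed baseline: two 3x3 convolutions at full width."""
    return 2 * conv_params(channels, channels, 3)

def four_layer_block_params(channels, reduction=4):
    """1x1, 3x3, 1x1, 3x3 block; the widths are hypothetical."""
    mid = channels // reduction
    layers = [
        (channels, mid, 1),  # 1x1 reduces the width
        (mid, mid, 3),       # 3x3 at the reduced width
        (mid, mid, 1),       # 1x1 at the reduced width
        (mid, channels, 3),  # 3x3 restores the width for the identity route
    ]
    return sum(conv_params(c_in, c_out, k) for c_in, c_out, k in layers)

STAGES = 3
baseline_convs = STAGES * 8 * 2  # convolutions in the 3-level RCN stages
rcn2_convs = STAGES * 2 * 4      # convolutions in the RCN2 stages
baseline_params = STAGES * 8 * basic_block_params(64)
rcn2_params = STAGES * 2 * four_layer_block_params(64)
```

Under these assumptions the number of convolutions falls from 48 to 24, and the convolution weights fall from 1,769,472 to 76,800. The exact figures depend on the real channel widths, but the sketch shows why narrow 1 × 1 and 3 × 3 layers shrink the model.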

3D convolution-based dynamic routing
As examined earlier in the literature review, the initial layers of the CapsNet are convolutional. In CapsNet the output of these convolution-based layers is fed into primary capsules.
The primary capsules transform the input vectors from the local feature detectors to eightdimensional capsules as output. These are then fed into Digit capsules. Going deeper with a model is always what is done to increase performance in many CNN models. CapsNet's dynamic routing is such that they flatten the primary capsules' output vectors, and then they are routed with the dynamic routing algorithm as given in Algorithm 1 [ 4 ], [ 26 ]. This structure gives rise to a structure similar to a multi-layer perceptron model. This structure is a highly time-consuming one [ 26 ]. Stacking these to reach better performance will drastically pull the network's speed and efficiency, may even limit improvements after certain depth.
To accumulate multiple layers of CapsNet, it is needed to establish a dynamic routing with some effect of a convolution-like process. This effect in process is achieved by the 3Dconvolution based dynamic routing algorithm proposed by the DeepCaps. This algorithm is given in Algorithm 2 [ 26 ]. Within the Digit capsule layer, vector feature maps are squashed and steered through an dynamic routing calculation. The new dynamic routing algorithm from DeepCaps massively dropped the redundancy by directing squares of capsule s, from layer L to layer L+1, rather than directing all capsules in layer L separately [ 26 ]. This change was brought with the idea that the neighboring capsules generate similar predictions.
As we know, the primary capsules create an abstract estimate of the parts of an object. Further, they predict the presence of these parts together with their transformation parameters. This inverse rendering effect is crucial in CapsNets. The next most important step is routing by agreement from the primary capsules to the next-level capsules, usually termed DigitCaps. The DigitCaps are similar to the primary capsules, except that they detect more complex objects in the image [ 4 ]. DigitCaps also predict the presence of these objects and the instantiation parameters of each object at each location.

Algorithm 2 3D convolution based Dynamic Routing Algorithm [ 26 ]
procedure Routing($\hat{u}_{j|i}$, $r$, $l$)
  Require: $\Phi^{l} \in \mathbb{R}^{(w^{l}, w^{l}, c^{l}, n^{l})}$, $r$ and $c^{l+1}$, $n^{l+1}$
  Let $p \in w^{l+1}$, $q \in w^{l+1}$, $r \in c^{l+1}$ and $s \in c^{l}$
  for $i$ iterations do
    for all $p, q, r$: $k_{pqrs} \leftarrow \mathrm{softmax\_3D}(b_{pqrs})$
    for all $s$: $S_{pqr} \leftarrow \sum_{s} k_{pqrs} \cdot \tilde{V}_{pqrs}$
    for all $s$: $\hat{S}_{pqr} \leftarrow \mathrm{squash\_3D}(S_{pqr})$
    for all $s$: $b_{pqrs} \leftarrow b_{pqrs} + \hat{S}_{pqr} \cdot \tilde{V}_{pqrs}$
  end for
  return $\Phi^{l+1} = \hat{S}$
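The routing loop of Algorithm 2 can be illustrated at a single spatial location (one fixed p, q). The sketch below is a simplified, pure-Python reading of routing by agreement, not the DeepCaps implementation: it omits the 3D convolution that produces the votes, and it assumes the softmax is taken over the output capsule types r only. The names `route`, `votes`, and the example values are hypothetical.

```python
import math


def squash(vec, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in vec)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(votes, iterations=3):
    """votes[s][r] is the vote of input capsule type s for output capsule type r.

    Returns one squashed output vector per output capsule type.
    """
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits b_{sr}
    outputs = []
    for _ in range(iterations):
        # Coupling coefficients: each input capsule spreads itself over the outputs.
        k = [softmax(row) for row in b]
        outputs = []
        for r in range(n_out):
            weighted = [
                sum(k[s][r] * votes[s][r][d] for s in range(n_in))
                for d in range(dim)
            ]
            outputs.append(squash(weighted))
        # Agreement update: raise the logit where a vote aligns with the output.
        for s in range(n_in):
            for r in range(n_out):
                b[s][r] += sum(o * v for o, v in zip(outputs[r], votes[s][r]))
    return outputs


# Three input capsules agree on output 0 and disagree on output 1.
votes = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
outputs = route(votes)
lengths = [math.sqrt(sum(x * x for x in v)) for v in outputs]
print(lengths)
```

In this toy case the output capsule whose votes agree ends up with the longer vector, which is the behavior routing by agreement is meant to produce.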

Mish activation
The activation function gives the NN the nonlinearity it needs to learn mapped functions. This flexibility is fundamental to the implementation and to gaining accuracy. The baseline model, RCN, uses the ReLU (Rectified Linear Unit) activation function after every ResNet layer. ReLU is defined as follows [ 27 ].
$f(x) = \max(0, x)$ (3.1)
In RCN2, we use the more flexible Mish activation. We trained and tested various activation functions experimentally, and the Mish activation function provided a boost to the network's accuracy. The Mish activation function is defined as follows (Figure 3.3) [ 28 ].
$f(x) = x \tanh(\mathrm{softplus}(x))$ (3.2)
Mish has certain properties that led us to conclude it was the right choice for RCN2. It helps prevent over-fitting and is self-regularizing. Unlike ReLU, Mish is less prone to stalling on near-zero gradients. These properties allow Mish to achieve better generalization [ 28 ].
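The difference between the two activations is easiest to see on a few sample inputs. The sketch below is a minimal scalar implementation using the standard definitions of ReLU and Mish; the overflow-safe form of softplus is an implementation choice made here, not something taken from the thesis.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def softplus(x):
    """log(1 + e^x), arranged so that exp never overflows."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

For negative inputs ReLU returns exactly zero, while Mish returns a small negative value, so some signal still passes through; for large positive inputs both are close to the identity.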

3D reconstruction by decoder network
Reconstruction networks are a crucial part of CapsNet. In RCN, the authors tried to reduce computation cost by simplifying the reconstruction network. We believe that a proper reconstruction network benefits any CapsNet-based machine learning model. As noted in the literature review, the reconstruction network is used only for training and is discarded during testing or inference, which allows a fair comparison between existing CNNs and CapsNets. Here we employ 3-dimensional reconstruction based on a class-independent decoder network for RCN2. Inspired by DeepCaps, we implement the decoder network with deconvolutional layers that reproduce the input from the instantiation parameters extracted by the model [ 26 ], [ 29 ].
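One way to read "class-independent decoder" is that the decoder receives the instantiation vector of a single class capsule rather than a masked concatenation of all class capsules. The sketch below shows only that selection step, under this assumption; the function name `decoder_input` and the toy capsule values are hypothetical, and the deconvolutional decoder itself is not shown.

```python
import math


def capsule_length(vec):
    """Euclidean length of a capsule vector, read as the class probability."""
    return math.sqrt(sum(x * x for x in vec))


def decoder_input(class_capsules, label=None):
    """Pick the single capsule vector that is passed to the decoder.

    During training the ground-truth label selects the capsule; when no label
    is given, the longest capsule (the predicted class) is used instead.
    """
    if label is None:
        label = max(
            range(len(class_capsules)),
            key=lambda i: capsule_length(class_capsules[i]),
        )
    return label, list(class_capsules[label])


# Three toy class capsules with 2-D instantiation vectors.
caps = [[0.1, 0.0], [0.6, 0.3], [0.0, 0.2]]
predicted, vec = decoder_input(caps)
print(predicted, vec)
```

Because the decoder always sees one vector of the same size regardless of the class, the same decoder weights serve every class.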

Summary of the proposed architecture RCN2
In summary, as the image moves through the initial ResNet units, the primary capsules take their output at three different stages, allowing the network to evaluate the input at different levels of feature extraction. These primary capsules produce 8D vectors, each representing a feature of the object. The 8D vectors pass through a DigitCaps layer with three routing iterations to produce 16-dimensional vectors, which are then combined to deliver a classification. In addition, the decoder network decodes the network's output and tries to match it to the input.

ResNeXt Convolution Layers
To achieve high accuracy, the convolutional layers of RCNX must be efficient at abstracting composite features. Although Capsule Networks are built on routed capsules, their first layers rely on convolution [ 26 ]. Repeated convolutional layers in the opening stages of a Capsule Network provide better feature extraction and give the network a strong starting point.
RCN relies on eight repeated ResNet layers with no bottleneck layout and no cardinality to increase capacity and reduce parameters. We substantially compress and extend the RCN architecture. As stated earlier, this thesis revises the elements in which the RCN design is deficient. RCNX is distinguished by adding a new dimension, cardinality, to RCN. Without such structural improvements, RCN lacks the capacity to form a richer representation of the inputs, which may explain why it does not reach above-average accuracy even with eight repeated ResNet layers. Because ResNeXt supplies the required convolutional capacity, we removed the unnecessary layers. The ResNeXt structure in RCNX uses varying cardinality before feeding the capsules. We also propose a three-stage architecture and find that it lets different capsules learn from the input at different viewpoints.

3-Level ResNeXt structures in RCNX
The proposed RCNX can learn sophisticated features, and learn them quickly, by integrating different viewpoints. Because of its elaborate composition, passing the input through each stage in turn gets the best out of RCNX with minimal effort.
In the 3-stage structure, ResNeXt blocks with cardinalities of four and two are used, each with the same filter size of 32. The differing cardinality gives a different total filter count for the individual capsules. The primary capsules, which receive three separate views of the input, are configured to deliver likelihood vectors of different dimensions, giving flexibility in representing features. We configured the network as a 3-level structure to improve performance. We use dimensions of 8, 24, 32, and 8 across the four primary capsules.
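The effect of cardinality on width and weight count can be checked with a few lines of arithmetic. The sketch below assumes that each of the parallel ResNeXt paths has 32 filters, so that the total width is the cardinality times 32, and that the paths are realized as one grouped 3 × 3 convolution; both points are a reading of the text above, not a confirmed detail of RCNX.

```python
def grouped_conv_weights(kernel, c_in, c_out, groups):
    """Weights of a grouped convolution: each group sees only c_in / groups inputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * (c_out // groups) * groups


FILTERS_PER_PATH = 32  # filter size stated in the text

for cardinality in (4, 2):
    width = cardinality * FILTERS_PER_PATH  # assumed total filter count
    dense = grouped_conv_weights(3, width, width, groups=1)
    grouped = grouped_conv_weights(3, width, width, groups=cardinality)
    print(f"cardinality={cardinality} width={width} dense={dense} grouped={grouped}")
```

With these assumptions, a cardinality of four yields a wider layer than a cardinality of two, and in both cases the grouped form needs only a fraction (one over the cardinality) of the weights of a dense convolution of the same width.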

Effective Capsules with 3D convolution-based dynamic routing.
As examined earlier, the initial layers of the CapsNet are convolutional, and their output is fed into the primary capsules. Because of its intricate structure, passing the training input through each level in turn gets the best out of RCNX with minimal effort. With 3D convolution-based dynamic routing, the DeepCaps authors improved the routing-by-agreement computation by including a convolution for each capsule in the capsule network [ 26 ].
The primary capsules transform the input vectors from the local feature detectors into eight-dimensional capsules. These are then fed into the digit capsules. Going deeper is a common way to increase performance in CNN models. In CapsNet, however, dynamic routing flattens the primary capsules' output vectors and then routes them with the dynamic routing algorithm. The result resembles an MLP and is highly time-consuming. Stacking such layers to reach better performance drastically reduces the network's speed and efficiency [ 26 ].
The 3D convolution helps reduce the size of the network. Convolution is included on the premise that neighboring neurons produce similar instantiation outputs, which can therefore be grouped [ 26 ]. To stack multiple CapsNet layers, dynamic routing needs a convolution-like process, which is achieved by the 3D-convolution-based dynamic routing algorithm proposed in DeepCaps. Within the digit capsule layer, vector feature maps are squashed and passed through dynamic routing. The DeepCaps routing algorithm greatly reduces redundancy by routing blocks of capsules from layer L to layer L + 1, rather than routing individual capsules in layer L separately [ 29 ].

3D reconstruction by decoder network
At the end of RCN, the decoder network does not perform full inverse rendering but is reduced to a 2-dimensional restoration. Reconstruction networks are a critical part of CapsNet. In RCN, the authors tried to cut computation cost by simplifying the reconstruction network. We believe that a proper reconstruction network benefits any CapsNet-based machine learning model. As noted in the literature review, the reconstruction network is used only for training and is discarded during testing or inference, which allows a fair comparison between existing CNNs and CapsNets. Here we employ 3-dimensional reconstruction based on a class-independent decoder network for RCNX. Inspired by DeepCaps, we implement the decoder network with deconvolutional layers that reproduce the input from the instantiation parameters extracted by the model [ 29 ].

ELU activation function
Non-linearity is a significant part of a neural network; without it, the NN will fail to achieve anything significant. Activation functions are the main source of this nonlinearity in the network's neurons. Over the past several years, activation functions have become a research topic in their own right. After each convolutional layer we add an activation layer, which is usually followed by batch normalization. The baseline RCN uses the ReLU activation function. In RCN2 we used the Mish activation function, but for the proposed RCNX we use the ELU (Exponential Linear Unit) activation function (Figure 3.6) [ 30 ]. ELU avoids the dead ReLU problem. It also improves the optimization of biases and weights, and its negative outputs help prevent saturation at near-zero gradients. ELU is defined as follows [ 30 ].
$f(x) = \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases}$
where α is also a trainable parameter.
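A scalar sketch of ELU, using its standard definition, shows how the negative branch behaves. Here α is passed as an ordinary argument with a default of 1.0, which is an assumption made for illustration; in a trained network it would be set or learned by the framework.

```python
import math


def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:+.1f}  elu={elu(x):+.4f}  elu(alpha=0.5)={elu(x, alpha=0.5):+.4f}")
```

For very negative inputs the output levels off near -α instead of being clipped to zero, so the unit keeps producing a non-zero output where ReLU would go silent.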

Summary of the proposed architecture RCNX
In conclusion, when a 32 × 32 × 3 input image is fed through the proposed RCNX:

Hardware Setup
For training and inference, we used a Lenovo ThinkSystem compute node provided by the IU compute cluster [ 11 ]. These systems have the following hardware configuration.

Dataset: CIFAR-10
The CIFAR-10 (Canadian Institute for Advanced Research) dataset is used to train and evaluate the model [ 8 ]. CIFAR-10 is a benchmark dataset of 60,000 RGB images in 10 classes, with 6,000 images per class. Ten thousand of these images are for testing and the remaining 50,000 for training the neural network models [ 8 ].

Hyper-parameter Tuning: Neural Network Intelligence
When choosing the design parameters of a neural network, it is best to use a hyperparameter tuner. Previously, the common tools for hyperparameter optimization were the TensorFlow hyperparameter library together with the visualization tool TensorBoard [ 31 ]. We used TensorBoard and the TensorFlow hyperparameter tuner for RCN2 development.
More recently, the Neural Network Intelligence (NNI) toolkit has become available [ 21 ]. Parameters such as the optimizer, the number of filters per layer, activation functions, kernel sizes, and the number of routing iterations were optimized experimentally with the help of NNI.
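NNI describes the parameters to tune as a JSON search space whose entries carry a `_type` and a `_value`. The sketch below builds such a search space for the parameters listed above and draws one configuration from it. All candidate values are hypothetical placeholders, not the ranges used for RCNX, and `sample_trial` is a stand-in for what a random-search tuner does, not part of the NNI API.

```python
import json
import random

# Search space in NNI's JSON layout; every candidate value here is illustrative.
search_space = {
    "optimizer": {"_type": "choice", "_value": ["adam", "sgd", "rmsprop"]},
    "filters": {"_type": "choice", "_value": [16, 32, 64]},
    "activation": {"_type": "choice", "_value": ["relu", "mish", "elu"]},
    "kernel_size": {"_type": "choice", "_value": [3, 5]},
    "routings": {"_type": "choice", "_value": [1, 2, 3]},
}


def sample_trial(space, rng):
    """Draw one configuration, as a random-search tuner would for each trial."""
    return {name: rng.choice(spec["_value"]) for name, spec in space.items()}


trial = sample_trial(search_space, random.Random(0))
print(json.dumps(search_space, indent=2))
print(trial)
```

In an actual NNI experiment the search space would be saved as a JSON file, and the training script would receive each sampled configuration from the tuner instead of sampling it locally.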

RESULTS
Using the training setup described above, RCN2 and RCNX were trained and tested. Hyperparameter tuning was conducted with NNI, and a sample graph output of the tuning for RCNX is given in Figure 5.1.
We evaluated the proposed models RCN2 and RCNX on the CIFAR-10 dataset and compared the observed performance to the baseline RCN. We also include accuracy comparisons with DCNet++, DCNet, the 3-level RCN, and the seven-ensemble CapsNet, since these are leading Capsule Networks [ 9 ]. Furthermore, we compare RCN2 and RCNX to the embedded model MobileNetV3 to show that the aim of creating an embedded machine learning model by compressing RCN without losing performance is achieved. The comparison is listed in Table 5.1.

RCNX Results
With further improvements to RCN2, we arrive at the RCNX architecture, which is efficient and compact. RCNX outperformed the baseline model by 5.15% while reducing parameters by 86.67%. The training and test accuracy over 30 epochs of training are shown in Figure 5.3. The test loss curve is also depicted in Figure 5

Performance Comparison
From the above results and comparisons, a performance comparison chart is generated for visualizing the impact of our proposed RCN2 and RCNX in Figure 5.5 .

Figure 5.5. Performance Comparison Chart
Figure 5.5 shows that RCN2 and RCNX perform better than MobileNetV3, the CNN used for comparison, and outperform all the other CapsNets listed above. Hence, we reaffirm the importance of Capsule Networks in the field of machine learning.

IMPLEMENTATION ON I.MX RT1060
In this section, we discuss the implementation of the RCNX architecture on the NXP i.MX RT1060 for image classification.

Hardware Setup
Additional hardware required for this process includes an MT9M114 camera module (Figure 6.2) and an LCD screen [ 33 ]. Finally, the camera's input is processed frame by frame, and the detection result is sent via UART communication.

Software Setup
MCUXpresso is used as the IDE for deployment. NXP provides the eIQ software library, which is intended for developing neural network-based applications and contains various optimized libraries for compilation, back propagation, and inference generation. This library uses the TensorFlow Lite model of the intended architecture [ 34 ].
The eIQ software is provided as middleware in the MCUXpresso SDK for the NXP i.MX RT1060 board [ 33 ]. It contains the latest eIQ SDK and demos. The tflite model has to be converted into a C array before it can be downloaded to the board.
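Converting a model file into a C array means writing its bytes as an initializer list that can be compiled into the firmware. The sketch below shows one way to do this in Python; the function name `bytes_to_c_array`, the array name `model_data`, and the eight stand-in bytes are all hypothetical. In practice the input would be the contents of the .tflite file, and the command-line tool `xxd -i` is commonly used for the same purpose.

```python
def bytes_to_c_array(data, name="model_data", per_line=12):
    """Render raw bytes as C source defining an unsigned char array and its length."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


# Stand-in bytes for illustration; a real call would read the .tflite file in "rb" mode.
source = bytes_to_c_array(bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33]))
print(source)
```

The generated text can be saved as a header or source file and added to the MCUXpresso project alongside the inference code.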

Model Preparation
After successfully training RCNX on the CIFAR-10 dataset, we saved the model in PB format, the protocol buffer (protobuf) format commonly used for TensorFlow models. Since RCNX is built with Keras, TensorFlow, and custom functions, it was easily convertible to a PB model [ 35 ], [ 36 ]. This PB model is then converted into a tflite (TensorFlow Lite) model, a common format for moving models across platforms [ 34 ]. The steps followed are shown in Figure 6.3.
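The PB-to-tflite step can be sketched with the TensorFlow Lite converter. This is a minimal sketch, assuming the PB export is a TensorFlow SavedModel directory and that all operations in the model are supported by the converter; the function name and both path arguments are hypothetical, and models with custom layers may need additional converter settings that are not shown here.

```python
def convert_saved_model_to_tflite(saved_model_dir, tflite_path):
    """Convert a TensorFlow SavedModel directory into a .tflite file.

    Returns the size of the converted model in bytes.
    """
    # Imported lazily so this module can be loaded without TensorFlow installed.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    tflite_model = converter.convert()  # serialized model as a bytes object
    with open(tflite_path, "wb") as handle:
        handle.write(tflite_model)
    return len(tflite_model)
```

The returned byte count gives a quick check of whether the converted model is likely to fit in the memory available on the target board.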

Implementation Results
RCNX was successfully deployed on the i.MX RT1060, showing that a CapsNet for image classification can run on an embedded system. A demonstration of the model running on the i.MX RT1060 is provided in Figure 6.4. Although the model ran successfully and the prediction accuracy was good, inference takes 3.8 seconds, which is long. This delay could be due to the complex routing algorithm and to current libraries not being optimized for CapsNet-based networks. We believe the prototype presented in this thesis is only an initial step toward CapsNets being considered for embedded applications.

CONCLUSION
Convolutional Neural Networks are one of the reasons Deep Learning is so prevalent. We believe that, in the near future, these models will inspire more effective neural networks that also deliver accuracy superior to CNN models.