<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://t-kalv.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://t-kalv.github.io/" rel="alternate" type="text/html" /><updated>2026-03-27T22:36:49+00:00</updated><id>https://t-kalv.github.io/feed.xml</id><title type="html">T.Kalvin’s Homepage</title><subtitle>personal description</subtitle><author><name>Thiruwaran Kalvin</name></author><entry><title type="html">Space-based AI Data Centres</title><link href="https://t-kalv.github.io/posts/2026/03/2026-03-27-SpaceDataCentres/" rel="alternate" type="text/html" title="Space-based AI Data Centres" /><published>2026-03-27T00:00:00+00:00</published><updated>2026-03-27T00:00:00+00:00</updated><id>https://t-kalv.github.io/posts/2026/03/SpaceDataCentres</id><content type="html" xml:base="https://t-kalv.github.io/posts/2026/03/2026-03-27-SpaceDataCentres/"><![CDATA[<p>After taking a deep dive into the computational infrastructure demands of modern Artificial Intelligence, I realised that while our models are scaling exponentially, the physical infrastructure required to run them is hitting a massive bottleneck. We tend to trust these so-called <strong>“Hyperscale”</strong> data centres to process everything from everyday user LLM prompts to Autonomous Vehicle Networks, assuming that our current power grid infrastructure can endlessly accommodate them out of the box.</p>

<p>However, my research into the hardware side of AI suggests a different truth: running these enormous Compute Clusters here on Earth is becoming physically and economically unsustainable in the long term, especially as we continue to scale. In this post, I’ll share and summarise my findings on the current viability of Space-based AI Data Centres.</p>

<h2 id="background">Background</h2>
<h3 id="why-space-based-ai-data-centres">Why Space-based AI Data Centres?</h3>
<p>To a Human, modern cloud infrastructure seems like an invisible, infinite store of compute, regardless of where we access it. However, Data Centres operate by consuming massive amounts of energy and generating immense amounts of thermal energy (heat) as a result of the high levels of computation. By calculating the physical footprints and energy costs of these models, we can determine mathematically that the Earth’s resources are a finite boundary. Moving these Data Centres to space creates what is classed as an <strong>Orbital Data Centre</strong>. The goal of these Space-based Orbital Data Centres is to place the hardware in an environment that maximises overall system uptime, where the reliance on the Earth’s power grid infrastructure and water cooling is completely eliminated. This ensures that AIs can run on a practically infinite, stable energy source (the Sun).</p>

<h2 id="core-ideas">Core Ideas</h2>
<h3 id="the-energy-equation-solar-in-space">The Energy Equation (Solar in Space):</h3>
<p>The core idea of an Orbital Space-based Data Centre is to power the compute units by keeping the satellites in a <strong>sun-synchronous</strong> orbit. This allows for a constant and stable supply of energy from the Sun. Here on Earth, solar panels are cheap (under <strong>$1 per watt</strong>, excluding the cost of installation and setup).</p>

<p>Based on recent developments, the base price of space-rated solar panels is around <strong>$11.21 per watt</strong>.</p>

<h3 id="power-consumption-nvidia-hgx-h100-example">Power Consumption (NVIDIA HGX H100) Example:</h3>
<p>A single NVIDIA HGX H100 unit contains 8x H100 GPUs in a single topology. NVIDIA’s specification sheet lists the power consumption at <strong>5,600 watts</strong> (5.6 kW); factoring in the CPU, RAM, SSDs and other necessary components and electronics, the realistic power draw comes to approximately <strong>10,000 watts</strong>.</p>

<p>The hardware build price for solar power enery generation is computed as follows:
\(C_{power} = P_{total} \times \text{Cost}_{solar}\)</p>

<p>where:</p>
<ul>
  <li>$P_{total}$: the realistic power consumption (<strong>10,000 watts</strong>)</li>
  <li>$\text{Cost}_{solar}$: the price per watt for space-rated panels (<strong>$11.21</strong>)</li>
  <li>$C_{power}$: The resulting <strong>$112,100</strong> required just to build the solar array for one unit</li>
</ul>

<p>As you can see, the hardware build price is high, but it is nothing in comparison to the launch cost.</p>
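<p>This back-of-the-envelope calculation can be checked in a few lines of Python (the figures are the post’s rough estimates, not vendor quotes):</p>

```python
# Estimated cost of the solar array for one HGX H100 module.
# Figures are this post's rough estimates, not vendor quotes.
P_total = 10_000        # realistic power draw of one module, in watts
cost_solar = 11.21      # space-rated panel price, in $ per watt

C_power = P_total * cost_solar          # C_power = P_total * Cost_solar
print(f"Solar array build cost: ${C_power:,.0f}")   # → $112,100
```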

<h3 id="launch-cost-spacex-falcon-9-rocket">Launch Cost (SpaceX Falcon 9 Rocket):</h3>
<p>Launch costs are an iterative mathematical problem: every component adds weight, and every kilogram adds thousands of dollars. Most space solar panel manufacturers converge at a specific power of approximately <strong>30 watts per kilogram</strong>. We then need to take into account the weight of the NVIDIA HGX H100 unit, which is around <strong>24 kg</strong>, plus the necessary chassis, cold shields, and other space-specific hardware, which add approximately <strong>100 kg</strong>, for a total hardware mass of around 124 kg.</p>

<p>The total payload weight required to sustain the compute module is computed as:</p>

\[W_{total} = \left(\frac{P_{total}}{E_{solar}}\right) + W_{hardware}\]

<p>where:</p>
<ul>
  <li>$E_{solar}$: the energy-to-weight ratio of the panels (<strong>30 W/kg</strong>)</li>
  <li>$W_{hardware}$: the weight of the compute unit and protective chassis (<strong>124 kg</strong>)</li>
  <li>$W_{total}$: the total projected payload weight.</li>
</ul>

<p>Calculating this out, the solar panels alone weigh <strong>333 kg</strong>; adding the <strong>124 kg</strong> of compute hardware results in a massive payload of roughly 457 kg.</p>
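<p>The payload equation above works out as follows in Python (using the post’s estimates for power draw, panel specific power and hardware mass):</p>

```python
# Total payload mass for one module: solar panels plus compute hardware.
P_total = 10_000     # module power draw, in watts
E_solar = 30         # panel specific power, in watts per kilogram
W_hardware = 124     # HGX unit plus chassis and shielding, in kilograms

W_panels = P_total / E_solar        # mass of the solar panels alone
W_total = W_panels + W_hardware     # W_total = (P_total / E_solar) + W_hardware
print(f"Panels: {W_panels:.0f} kg, total payload: {W_total:.0f} kg")
```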

<h3 id="cooling-in-space-vs-on-earth">Cooling in Space vs on Earth:</h3>
<p>Here on Earth, Data Centres rely on billions of litres of water and airflow to transfer heat away from the servers. In space, however, a vacuum has no air or water, meaning Space-based AI Data Centres must defend against overheating by relying on <strong>Thermal Radiation</strong>. For example, dissipating the heat generated by a 10,000 watt HGX unit requires a radiator with roughly the square footage of a medium to large home, which adds even more restrictive mass to the payload.</p>

<h2 id="findings--results">Findings &amp; Results</h2>
<p>If we launch an HGX H100 unit into orbit using SpaceX’s Falcon 9 (which costs <strong>$2,750 per kilogram</strong>), a clear progression in total costs emerges:</p>
<ul>
  <li><strong>Solar Panel Launch Cost</strong> = 333 kg * $2,750 results in a cost of exactly <strong>$915,750</strong></li>
  <li><strong>Compute Module Launch Cost</strong> = 124 kg * $2,750 results in a cost of exactly <strong>$341,000</strong></li>
  <li><strong>Total Orbital Deployment</strong> = Total launch cost of approximately <strong>$1.3 million</strong> per module.</li>
</ul>
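<p>The launch-cost arithmetic above can be reproduced in Python (masses rounded as in the post):</p>

```python
# Launch cost on Falcon 9 at the quoted $2,750 per kilogram.
cost_per_kg = 2_750
W_panels = 333       # solar panel mass in kg (rounded, as in the post)
W_hardware = 124     # compute module plus chassis, in kg

panel_launch = W_panels * cost_per_kg       # $915,750
module_launch = W_hardware * cost_per_kg    # $341,000
total = panel_launch + module_launch
print(f"Total orbital deployment: ${total:,}")   # → $1,256,750 (~$1.3M)
```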

<h2 id="my-takeaways">My Takeaways:</h2>
<p>Researching the viability of Space-based AI Data Centres highlighted fundamental concerns about the future of AI Data Centres:</p>
<ul>
  <li><strong>Earthly Constraints $\neq$ Infinite Scaling</strong> - We cannot keep building larger models without severely impacting our local power grids and water supplies.</li>
  <li><strong>The Power of Launch Economics</strong> - The current hardware and solar technology is viable. Driving down the $2,750/kg launch cost by a factor of 10 is essential to making orbital data centers a reality.</li>
  <li><strong>Infrastructure is an Arms Race</strong> - As AI models get smarter and even larger, the facilities housing them must evolve beyond simple Earth-based warehouses. We cannot trust our current power grids blindly without considering space-based solutions as an alternative.</li>
</ul>

<h2 id="further-reading--references">Further Reading &amp; References</h2>
<ul>
  <li><strong>Earth Solar Costs:</strong> <a href="https://stacker.com/stories/environment/how-cost-solar-panels-has-fallen">How the cost of solar panels has fallen (Stacker)</a></li>
  <li><strong>Space-Rated Solar Breakthroughs:</strong> <a href="https://www.techbuzz.ai/articles/starpath-slashes-space-solar-costs-90-with-mass-production-push">Starpath slashes space solar costs by 90% (TechBuzz)</a></li>
  <li><strong>NVIDIA HGX H100 Specification:</strong> <a href="https://resources.nvidia.com/en-us-hopper-architecture/nvidia-tensor-core-gpu-datasheet">NVIDIA H100 GPU Datasheet</a></li>
  <li><strong>Spacecraft Power Systems:</strong> <a href="https://www.nasa.gov/smallsat-institute/sst-soa/power-subsystems/#3.3">NASA Small Spacecraft Technology - Power Subsystems</a></li>
  <li><strong>Launch Economics:</strong> <a href="https://ourworldindata.org/grapher/cost-space-launches-low-earth-orbit">Cost of Space Launches to Low Earth Orbit (Our World in Data)</a></li>
  <li><strong>Video Breakdown &amp; Math Basis:</strong> <a href="https://www.youtube.com/watch?v=mHKGP5TAxyQ">Space Data Center Explained (YouTube)</a></li>
</ul>]]></content><author><name>Thiruwaran Kalvin</name></author><category term="Space AI Data Centres" /><category term="Space Technology" /><category term="Data Centres" /><category term="GPU" /><category term="Energy" /><category term="Compute" /><category term="Artificial Intelligence" /><summary type="html"><![CDATA[After taking a deep dive into the computational infrastructure demands of modern Artificial Intelligence, I realised that while our models are scaling exponentially, the physical infrastructure required to run them is hitting a massive bottleneck. We tend to trust these so-called “Hyperscale” data centres to process everything from everyday user LLM prompts to Autonomous Vehicle Networks, assuming that our current power grid infrastructure can endlessly accommodate them out of the box.]]></summary></entry><entry><title type="html">Machine Learning Safety</title><link href="https://t-kalv.github.io/posts/2026/01/2026-01-26-MachineLearningSafety/" rel="alternate" type="text/html" title="Machine Learning Safety" /><published>2026-01-26T00:00:00+00:00</published><updated>2026-01-26T00:00:00+00:00</updated><id>https://t-kalv.github.io/posts/2026/01/MachineLearningSafety</id><content type="html" xml:base="https://t-kalv.github.io/posts/2026/01/2026-01-26-MachineLearningSafety/"><![CDATA[<p>After taking an Advanced Artificial Intelligence course, I realised that while modern AI models achieve very high benchmark performance, they have a massive, invisible blind spot. We tend to trust these so-called <strong>“Black Box”</strong> systems with increasingly critical tasks, from Self-Driving Vehicles to Autonomous Robotics and Medical Diagnosis - assuming that if they have high accuracy, they are also “intelligent”.</p>

<p>However, my experience participating in an adversarial attack competition revealed a much different truth: these models do not “see” the world the way we Humans perceive it (assuming that whoever is reading this is in fact classified as a Human Being). In this post, I’ll share and summarise my findings on Machine Learning Safety, in particular Adversarial Machine Learning, and how I went from simple attacks to breaking state-of-the-art robust defences using techniques such as <strong>FGSM, PGD, Momentum, Input Diversity</strong>…</p>

<h2 id="background">Background</h2>
<p>To a human, an image of a Car looks like a car, regardless of filter, lighting or positioning. However, Neural Networks operate by finding complex mathematical boundaries in a high-dimensional space. By calculating the gradients of a model, we can determine mathematically how to modify each pixel in an image to push that image across a boundary.</p>

<p>This modified image is classed as an Adversarial Example. The goal of an Adversarial Attack is to find a perturbation $\delta$ that maximises the model’s loss while the difference between the original and modified images stays smaller than the <strong>Epsilon limit</strong>, ensuring that the change is invisible to the Human eye.</p>

<h2 id="core-ideas">Core Ideas:</h2>
<h3 id="fgsm-fast-gradient-sign-method">FGSM (Fast Gradient Sign Method):</h3>
<p>The core idea of FGSM is to modify the original input image by adding a tiny amount of perturbation in the direction that maximally increases the model’s loss. This method takes a <strong>“one-shot”</strong> approach where it takes one single, large step in the direction of the gradient.</p>

<p>It is computed as:
\(x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))\)</p>

<p>where:</p>
<ul>
  <li>$\theta$: the model parameters</li>
  <li>$\text{x, y}$: the input and the label</li>
  <li>$J(\theta, x, y)$: the loss function</li>
  <li>A one-step modification to all pixel values to increase the loss function under an $\ell_\infty$ perturbation constraint</li>
</ul>

<p>FGSM is easily blocked when the model’s loss landscape is <strong>non-linear</strong>, meaning that one wrong step might miss the peak of the curve entirely. FGSM assumes the loss function is entirely linear, but in reality, deep neural networks are highly non-linear. This means that a single, large step is often insufficient to fool a robust model.</p>
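<p>To make the one-shot step concrete, here is a minimal NumPy sketch of FGSM on a toy logistic-regression “model”. The weights and the 16-pixel “image” are made up for illustration; a real attack would differentiate through a deep network instead:</p>

```python
import numpy as np

# Minimal FGSM sketch on a toy logistic-regression "model".
# The weights w, b and the "image" x are made up for illustration.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.1
x = rng.uniform(0, 1, size=16)   # a tiny 16-pixel "image" in [0, 1]
y = 1.0                          # true label

def loss_grad(x_in):
    """Gradient of the binary cross-entropy loss w.r.t. the input."""
    p = 1.0 / (1.0 + np.exp(-(w @ x_in + b)))   # predicted probability
    return (p - y) * w                           # dL/dx for logistic regression

eps = 0.05                                       # epsilon limit (l_inf budget)
x_adv = np.clip(x + eps * np.sign(loss_grad(x)), 0, 1)   # one signed step
```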

<h3 id="pgd-projected-gradient-descent">PGD (Projected Gradient Descent):</h3>
<p>PGD is an iterative version of FGSM: instead of taking one single, large step, we take many small steps ($\alpha$). After each step, if the individual pixel values go beyond the <strong>epsilon limit</strong> or the valid pixel range $[0, 1]$, we <strong>clip/project</strong> them back. This gives the PGD Adversarial Attack the ability to navigate the <strong>non-linear regions</strong> of the loss landscape in order to find a more accurate adversarial example.</p>

<p>It is computed as:
\(x_{t+1} = \Pi_{x+S} (x_t + \alpha \cdot \text{sign}(\nabla_x J(\theta, x_t, y)))\)</p>

<p>where:</p>
<ul>
  <li>$\alpha$: the step size (learning rate)</li>
  <li>$\Pi$: the projection function (clipping)</li>
  <li>$t$: the current iteration step</li>
</ul>

<p>Standard PGD Adversarial Attacks are considered the universal benchmark for evaluating robustness. However, in the adversarial attack competition I participated in, standard PGD struggled to lower the accuracy below approximately $57 \%$. The adversarially trained models had learned to defend against standard PGD algorithms through a <strong>“rugged” defence</strong>, also known as <strong>Gradient Masking</strong>.</p>
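<p>The iterative step-and-project loop can be sketched in NumPy on a toy logistic model (the weights and input are made up for illustration):</p>

```python
import numpy as np

# Minimal PGD sketch: many small signed-gradient steps, projecting back
# into the epsilon ball and the valid range [0, 1] after each step.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.1          # toy logistic "model"
x, y = rng.uniform(0, 1, size=16), 1.0   # made-up input and label

def loss_grad(x_t):
    p = 1.0 / (1.0 + np.exp(-(w @ x_t + b)))
    return (p - y) * w

eps, alpha, steps = 0.05, 0.01, 20
x_adv = x.copy()
for _ in range(steps):
    x_adv = x_adv + alpha * np.sign(loss_grad(x_adv))   # small step
    x_adv = np.clip(x_adv, x - eps, x + eps)            # project into eps ball
    x_adv = np.clip(x_adv, 0.0, 1.0)                    # keep valid pixels
```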

<h3 id="gradient-masking">Gradient Masking:</h3>
<p>Robust models often defend themselves by creating “rugged” loss landscapes. They hide their true gradients by creating small, false local maxima near the data points, laying traps for the PGD algorithm to fall into. This results in the PGD attack reporting a failure even though the model isn’t actually robust; instead it is just hiding/diverting the path to its failure, which is called <strong>Gradient Masking</strong>.</p>

<h3 id="mi-di-fgsm-momentum-plus-input-diversity-fast-gradient-sign-method">MI-DI-FGSM (Momentum plus Input Diversity Fast Gradient Sign Method):</h3>

<p>MI-DI-FGSM combines two upgrades to the iterative attack. <strong>Momentum</strong> accumulates the gradient direction across iterations, which stabilises the update and helps the attack roll past the small, false local maxima that Gradient Masking creates. <strong>Input Diversity</strong> applies random transformations (such as resizing and padding) to the image at each step, so the attack cannot overfit to one exact gradient pattern and generalises better across models.</p>

<h3 id="multi-targetd-pgd-plus-cw-carlini-wagner-loss">Multi-Targetd PGD plus CW (Carlini-Wagner) Loss:</h3>
<p>Standard Adversarial Attack algorithms are often untargeted: they just try to maximise the overall loss, moving away from the correct answer. Robust models, however, are unsure about everything, making it hard for standard Adversarial Attacks to simply maximise the error.</p>

<p>Multi-Target PGD plus CW attempts to push the image towards every other possible target class, then picks the target class that yields the strongest attack. Instead of using standard Cross-Entropy loss, which relies on probabilities, we use a <strong>Carlini-Wagner (CW) loss</strong>, which ignores the probabilities and looks directly at the logits (the raw scores). It then calculates the margin between the target class and the correct class.</p>

<p>It is computed as:
\(L(x, t) = \max(Z(x)_{\text{real}} - Z(x)_{\text{target}}, -\kappa)\)</p>

<p>where:</p>

<ul>
  <li>$Z(x)_{\text{real}}$: raw logit score of the correct class</li>
  <li>$Z(x)_{\text{target}}$: raw logit score of the specific target class we want to fake</li>
  <li>$\kappa$: confidence parameter (margin)</li>
</ul>

<p>Through the combination of Multi-Targeting with CW Loss, the Adversarial Attack finds specific vulnerabilities in the model’s Adversarial Training that standard FGSM/PGD misses.</p>
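<p>The CW margin itself is a one-liner; here is a small sketch with made-up logits to show how the sign of the loss flags success:</p>

```python
import numpy as np

# Carlini-Wagner margin loss on raw logits. The logits here are made up;
# in practice Z(x) comes from the model's pre-softmax layer.
def cw_loss(logits, real, target, kappa=0.0):
    """max(Z_real - Z_target, -kappa): reaches -kappa once the attack
    succeeds with at least kappa of margin."""
    return max(logits[real] - logits[target], -kappa)

logits = np.array([4.0, 1.0, 3.5, 0.5])    # hypothetical raw class scores
print(cw_loss(logits, real=0, target=2))   # → 0.5: model still correct
```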

<h2 id="findings--results">Findings &amp; Results</h2>
<p>During the evaluation against 9 reference models (ranging from standard to highly robust), a clear progression in attack success was observed:</p>
<ul>
  <li><strong>Standard PGD</strong> = The robust models maintained a high accuracy of around $57 \%$ to $67 \%$</li>
  <li><strong>PGD plus Restarts</strong> = Improved performance slightly, dropping the accuracy down to around $56 \%$</li>
  <li><strong>MI-DI-FGSM and Multi-Targeted PGD plus CW Loss</strong> = Broke the defences, with the robust accuracy dropping significantly to around $30 \%$</li>
</ul>

<h2 id="my-takeaways">My Takeaways:</h2>
<p>Taking an Advanced Artificial Intelligence course and participating in an adversarial attack competition highlighted fundamental concerns in Machine Learning Safety:</p>
<ul>
  <li><strong>Robustness $\neq$ Correctness:</strong> A model can be robust against simple noise but completely fail against structured/diverse attacks.</li>
  <li><strong>The Power of Randomness:</strong> Deterministic attacks are easy to defend against. Adding stochastic elements like <strong>Input Diversity</strong> is essential to finding the true weaknesses in a model.</li>
  <li><strong>Safety is an Arms Race:</strong> As attacks get smarter, defences must evolve beyond simple adversarial training. We cannot trust a “high accuracy” score blindly without rigorous testing.</li>
</ul>

<h2 id="further-reading">Further Reading:</h2>
<ul>
  <li><a href="https://github.com/xiaoweih/AISafetyLectureNotes/blob/main/Machine_Learning_Safety.pdf">Machine Learning Safety: <em>Machine Learning Safety by Xiaowei Huang, Gaojie Jin, Wenjie Ruan</em></a></li>
  <li><a href="https://robustbench.github.io/">RobustBench: <em>Standardised benchmark for adversarial robustness</em></a></li>
  <li><a href="https://www.researchgate.net/publication/338408111_Adversarial_Attacks_and_Defenses_in_Deep_Learning">Adversarial Attacks and Defenses in Deep Learning: <em>Adversarial Attacks and Defenses in Deep Learning by Kui Ren, Tianhang Zheng, Zhan Qin, Xue Liu</em></a></li>
</ul>]]></content><author><name>Thiruwaran Kalvin</name></author><category term="Adversarial Attack" /><category term="Adversarial Training" /><category term="Research" /><category term="Artificial Intelligence" /><category term="Machine Learning" /><category term="Deep Learning" /><category term="Machine Learning Safety" /><summary type="html"><![CDATA[After taking an Advanced Artificial Intelligence course, I realised that while modern AI models achieve very high benchmark performance, they have a massive, invisible blind spot. We tend to trust these so-called “Black Box” systems with increasingly critical tasks, from Self-Driving Vehicles to Autonomous Robotics and Medical Diagnosis - assuming that if they have high accuracy, they are also “intelligent”.]]></summary></entry><entry><title type="html">Is Attention Really All You Need?</title><link href="https://t-kalv.github.io/posts/2025/09/2025-09-13-IsAttentionReallyAllYouNeed/" rel="alternate" type="text/html" title="Is Attention Really All You Need?" /><published>2025-09-13T00:00:00+00:00</published><updated>2025-09-13T00:00:00+00:00</updated><id>https://t-kalv.github.io/posts/2025/09/IsAttentionReallyAllYouNeed</id><content type="html" xml:base="https://t-kalv.github.io/posts/2025/09/2025-09-13-IsAttentionReallyAllYouNeed/"><![CDATA[<p>In 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin published the Research Paper <strong><em>“Attention Is All You Need”</em></strong>, which introduced the <strong><em>Transformer</em></strong> model architecture, but is Attention really all you need? In this post, I’ll summarise the key ideas in this Research Paper and share my takeaways on the question “Is Attention Really All You Need?”</p>

<h2 id="background">Background</h2>
<h3 id="why-transformers">Why Transformers?</h3>
<p>Before the introduction of the Transformer in 2017, the majority of sequence-to-sequence models relied on Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs). These models were quite slow as they processed words sequentially, one by one, and they often struggled with long-range dependencies between tokens in longer sentences. This led the authors of <em>“Attention Is All You Need”</em> to propose a different approach built on an Attention Mechanism.</p>

<p>Attention is a mechanism, used by Transformers, that allows the model to look at all the words at once and decide how much each word should influence the others. It then combines the most relevant words to understand meaning, a concept called Self-Attention. This results in a model that captures long-range relationships in text while being much faster and more parallelisable than previous methods such as RNNs and LSTMs.</p>

<h2 id="core-ideas">Core Ideas</h2>
<h3 id="self-attention">Self-Attention:</h3>
<p>Instead of processing words sequentially, one by one, the model creates three vectors for each token which are:</p>
<ul>
  <li>Query (Q)</li>
  <li>Key (K)</li>
  <li>Value (V)</li>
</ul>

<p>Then it multiplies the input embedding by the weight matrices that have been previously trained/learned. Next it computes the Attention scores between one token’s Query and the other tokens’ Keys to decide how much each word should influence the others. The Attention scores are then passed through a Softmax function to normalise them into probabilities. The model uses these probabilities to take the weighted sum of the Value vectors, which produces a new representation for each token that incorporates information from the other tokens. This Self-Attention mechanism allows the Transformer to work with long-range relationships in the text. Self-Attention is computed as:
\(\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\)
where $K^{T}$ is the transpose of the Key Matrix and $d_k$ is the dimensionality of the Key Vectors.</p>
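<p>The scaled dot-product formula above can be sketched in NumPy for a toy sequence (the Q, K, V matrices are random stand-ins for the learned projections):</p>

```python
import numpy as np

# Scaled dot-product self-attention for a toy sequence of 3 tokens with
# d_k = 4. Q, K, V are random stand-ins for the learned projections.
rng = np.random.default_rng(0)
n_tokens, d_k = 3, 4
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

scores = Q @ K.T / np.sqrt(d_k)                   # similarity of each Q to each K
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
output = weights @ V                              # weighted sum of Value vectors
```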

<h3 id="positional-encoding">Positional Encoding:</h3>
<p>As Transformers process words in parallel instead of sequentially, they lose the natural sequential order of the original text, which is why the authors introduced Positional Encoding: adding information about the position of each word in the sequence. The Positional Encoding is computed as:
\(PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
where:</p>
<ul>
  <li>$pos$ is the token position in the sequence</li>
  <li>$i$ is the index dimension</li>
  <li>$d_{model}$ is the embedding size</li>
</ul>
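<p>The sinusoidal formula can be implemented directly; this small NumPy sketch follows the paper’s definition with a deliberately tiny embedding size:</p>

```python
import numpy as np

# Sinusoidal positional encodings, following the paper's formula.
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # token positions
    i = np.arange(d_model // 2)[None, :]           # dimension index pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=8)    # one row per position
```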

<h3 id="multi-head-attention">Multi-Head Attention:</h3>
<p>Multi-Head Attention, which builds on Self-Attention, allows the model to look at long-range relationships from multiple perspectives simultaneously. Instead of computing the Query (Q), Key (K) and Value (V) vectors as a single set, the Transformer computes multiple heads of the Query, Key and Value matrices. Each head has its own learned projection matrices so that the same input is interpreted in different ways, allowing the model to capture the diverse relationships between the words in the text.</p>

<p>For each head, the attention is computed as:
\(\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)\) 
Then all heads are concatenated and projected:
\(\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O\)</p>
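<p>A compact NumPy sketch of the per-head computation and the final concatenation (all weight matrices are random stand-ins for learned parameters):</p>

```python
import numpy as np

# Multi-head attention sketch: h heads with their own projection matrices,
# concatenated and mixed by W_O. All weights are random stand-ins.
rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h
x = rng.normal(size=(n, d_model))                  # token embeddings

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ V

heads = []
for _ in range(h):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

W_O = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_O         # Concat(head_1..head_h) W^O
```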

<h3 id="feed-forward-network-ffn">Feed-Forward Network (FFN):</h3>
<p>After the Attention layers, each Transformer layer has a Feed-Forward Network (FFN) that helps the model transform its representation in a non-linear way. The Feed-Forward Network (FFN) is computed as:</p>

<p>\(\text{Feed-Forward Network}(x) = \text{max}(0, x W_1 + b_1) W_2 + b_2\)
where the weight matrix $W_1$ expands the input, $\max(0, \cdot)$ is the ReLU non-linearity, and $W_2$ projects the result back to the model dimension.</p>
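<p>As a quick sketch, the position-wise FFN is just two matrix multiplications with a ReLU in between (weights here are random stand-ins):</p>

```python
import numpy as np

# Position-wise feed-forward network: a ReLU between two linear maps,
# applied to each token independently. Weights are random stand-ins.
rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 3
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(n, d_model))                  # token representations
ffn = np.maximum(0, x @ W1 + b1) @ W2 + b2         # max(0, xW1 + b1)W2 + b2
```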

<h3 id="encoder">Encoder:</h3>
<p>The Encoder stack is composed of $N$ identical layers, where each layer contains a multi-head self-attention sub-layer and a position-wise fully connected feed-forward network, with residual connections as well as layer normalisation. The output of each sub-layer is computed as:
\(\text{LayerNorm}(x + \text{Sublayer}(x))\)</p>

<h3 id="decoder">Decoder:</h3>
<p>The Decoder stack is also composed of $N$ identical layers with the same two sub-layers as each encoder layer, except the Decoder has an additional third sub-layer that performs multi-head attention over the encoder’s output. In addition, the Decoder replaces the normal multi-head self-attention sub-layer found in the Encoder stack with a masked multi-head self-attention sub-layer that prevents positions from peeking at future tokens.</p>

<h2 id="transformer-architecture">Transformer Architecture:</h2>
<p><img src="/images/Transformer.svg" alt="Transform Architecture" /></p>

<p><em>Figure 1. Transformer Architecture. Adapted from Vaswani et al. (2017).</em></p>

<h2 id="results">Results:</h2>
<p>The Research Paper <strong><em>“Attention Is All You Need”</em></strong> showed that the <strong><em>Transformer</em></strong> model architecture achieved state-of-the-art results where their model <strong><em>“achieves 28.4 BLEU on the WMT 2014 English-to-German translation task”</em></strong> and on the <strong><em>English-to-French</em></strong> translation task their model achieved a <strong><em>“new state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs”</em></strong>. This demonstrated that the Transformer architecture can outperform recurrent and convolutional models while training much faster.</p>

<h2 id="my-takeaways">My Takeaways:</h2>
<p>After reading <strong><em>“Attention Is All You Need”</em></strong>, we can see why the Transformer plays such a pivotal role in Natural Language Processing (NLP) and Large Language Models (LLMs), with real-world impact in modern applications such as ChatGPT (Decoder-only Transformers), BERT (Encoder-only Transformers) and Google Translate (Encoder and Decoder)…</p>

<p>For me, the key takeaways were:</p>
<ul>
  <li><strong>Multi-head Attention</strong> - allows the model to see complex relationships from multiple perspectives</li>
  <li><strong>Positional Encoding</strong> - adds information about the order of the words in the sentence</li>
  <li><strong>Parallel Processing</strong> - allows faster training as no sequential bottleneck</li>
  <li><strong>Feed-Forward Layers</strong> - gives each token its own small non-linear transformation after attention</li>
  <li><strong>Scalability</strong> - powers powerful models across text, images, audio…</li>
</ul>

<h2 id="further-reading">Further Reading:</h2>
<ul>
  <li><a href="https://arxiv.org/pdf/1706.03762">Original paper: <em>Attention Is All You Need</em> (Vaswani et al., 2017)</a></li>
  <li><a href="https://youtu.be/wjZofJX0v4M?feature=shared">3Blue1Brown Video: <em>Transformers the tech behind llms</em> </a></li>
  <li><a href="https://youtu.be/eMlx5fFNoYc?feature=shared">3Blue1Brown 2nd Video: <em>Attention in transformers step-by-step</em> </a></li>
  <li><a href="https://youtu.be/nZrZOI0oRuw?feature=shared">Caleb Writes Code Video: <em>Transformers Explained</em></a></li>
  <li><a href="https://en.wikipedia.org/wiki/Attention_Is_All_You_Need">Wikipedia Article: <em>Attention Is All You Need</em></a></li>
</ul>]]></content><author><name>Thiruwaran Kalvin</name></author><category term="Transformers" /><category term="Paper Summary" /><category term="Research" /><category term="Artificial Intelligence" /><category term="Machine Learning" /><category term="Deep Learning" /><category term="Neural Networks" /><summary type="html"><![CDATA[In 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin published the Research Paper “Attention Is All You Need”, which introduced the Transformer model architecture, but is Attention really all you need? In this post, I’ll summarise the key ideas in this Research Paper and share my takeaways on the question “Is Attention Really All You Need?”]]></summary></entry><entry><title type="html">My First Neural Network</title><link href="https://t-kalv.github.io/posts/2025/09/2025-09-04-MyFirstNeuralNetwork/" rel="alternate" type="text/html" title="My First Neural Network" /><published>2025-09-04T00:00:00+00:00</published><updated>2025-09-04T00:00:00+00:00</updated><id>https://t-kalv.github.io/posts/2025/09/MyFirstNeuralNetwork</id><content type="html" xml:base="https://t-kalv.github.io/posts/2025/09/2025-09-04-MyFirstNeuralNetwork/"><![CDATA[<p>In this post, I built a <a href="https://github.com/T-Kalv/Simple-Neural-Network/tree/main"><strong>Simple Neural Network for XOR made in Python from Scratch</strong></a> without any external Machine Learning Libraries/Frameworks such as PyTorch, TensorFlow or JAX.</p>

<h3 id="why-xor">Why XOR?</h3>
<p>XOR (exclusive OR) is a binary logic operation that takes two binary inputs (0 or 1) and produces a single output (0 or 1). The output of XOR is 1 if and only if the two inputs are different. If both inputs are the same, then the output is 0.</p>

<p>I chose XOR specifically as it is the simplest problem that a single-layer perceptron cannot solve, because the data is not linearly separable. Neural Networks can solve non-linearly separable problems by adding hidden layers between the input and output, which allows them to learn complex, non-linear decision boundaries such as XOR.</p>

<h3 id="neural-network">Neural Network:</h3>
<p><img width="300" height="261" alt="SimpleNeuralNet" src="https://github.com/user-attachments/assets/12c1908a-7bdf-4368-93bf-fed71f3b583f" /></p>

<p>The Neural Network consists of 2 inputs, 3 neurons in the hidden layer, which apply a non-linear transformation (in this case the sigmoid activation function), and 1 neuron in the output layer, which gives a single number between 0 and 1 representing the predicted probability.</p>

<h3 id="training-neural-network">Training Neural Network:</h3>

<p>The training data consisted of a 4×2 matrix of all possible combinations of two binary inputs for the XOR function. These were:</p>
<ul>
  <li>(0,0) -&gt; Expected Output = 0</li>
  <li>(0,1) -&gt; Expected Output = 1</li>
  <li>(1,0) -&gt; Expected Output = 1</li>
  <li>(1,1) -&gt; Expected Output = 0</li>
</ul>

<p>The goal of training the Neural Network was to adjust the weights/biases so that the output for each of the inputs corresponds to the expected output for XOR.</p>

<p>The main steps for Training the Neural Network are as follows:</p>
<ol>
  <li>Forward Propagation - inputs and biases are passed through the Neural Network to generate an output. This involves calculating the weighted sums at each of the layers from the input layer, through the hidden layer, to the output layer, applying the sigmoid activation function and obtaining the Output Prediction.</li>
  <li>Backpropagation - calculate the error loss between the predicted output and the expected output using the Mean Squared Error (MSE) Loss Function. Then calculate the gradient of the loss with respect to the weights/biases, which shows how much each of those weights/biases contributed to the error. Adjust the weights/biases according to their error contribution.</li>
  <li>Sigmoid Activation Function - introduce non-linearity to the Neural Network: 
\(\sigma(x) = \frac{1}{1 + e^{-x}}\)</li>
  <li>MSE (Mean Squared Error) Loss Function - 
\(MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred}} - y_{\text{true}})^2\)</li>
</ol>
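<p>The four steps above can be compressed into a short NumPy sketch of a 2-3-1 network (this is an illustrative re-implementation, not the exact code from the linked repository):</p>

```python
import numpy as np

# Compressed sketch of the training loop described above: 2-3-1 network,
# sigmoid activations, MSE-style error signal. Illustrative only; the
# linked repository's exact code may differ.
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # 4x2 inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))           # hidden layer
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))           # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for _ in range(10_000):
    h = sigmoid(X @ W1 + b1)                 # forward: hidden activations
    out = sigmoid(h @ W2 + b2)               # forward: prediction
    d_out = (out - y) * out * (1 - out)      # error signal through sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)       # backprop into hidden layer
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(np.round(out.ravel()))                 # predicted classes after rounding
```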

<h3 id="results">Results:</h3>

<p>After training the Neural Network for <strong>10,000 epochs</strong> with a learning rate of <strong>0.1</strong>, the training process took <strong>0.5556 seconds</strong>. During training, the model adjusted the weights and biases to minimise the loss, and the predictions gradually became more accurate. This resulted in Raw Predictions (Before Rounding) and Predicted Classes (After Rounding).</p>

<p><strong>Raw Predictions (Before Rounding):</strong></p>

<table>
  <thead>
    <tr>
      <th>Input (x1, x2)</th>
      <th>Raw Predicted Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>(0, 0)</td>
      <td>0.05917951</td>
    </tr>
    <tr>
      <td>(0, 1)</td>
      <td>0.95001299</td>
    </tr>
    <tr>
      <td>(1, 0)</td>
      <td>0.9500065</td>
    </tr>
    <tr>
      <td>(1, 1)</td>
      <td>0.05039647</td>
    </tr>
  </tbody>
</table>

<p><strong>Predicted Classes (After Rounding):</strong></p>

<table>
  <thead>
    <tr>
      <th>Input (x1, x2)</th>
      <th>Predicted Output (After Rounding)</th>
      <th>Expected Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>(0, 0)</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>(0, 1)</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>(1, 0)</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>(1, 1)</td>
      <td>0</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<h3 id="conclusion">Conclusion:</h3>

<p>In conclusion, this <a href="https://github.com/T-Kalv/Simple-Neural-Network/tree/main"><strong>Simple Neural Network for XOR made in Python from Scratch</strong></a> successfully learnt the XOR function without any external Machine Learning Libraries/Frameworks such as PyTorch, TensorFlow or JAX, using only basic principles such as Forward Propagation, Backpropagation, the Sigmoid Activation Function and the MSE (Mean Squared Error) Loss, adjusting the weights/biases through Backpropagation and reducing the loss over time. This resulted in correctly predicting the XOR outputs for the given inputs.</p>

<p>The <strong>Simple Neural Network for XOR made in Python from Scratch</strong> source code can be accessed via <a href="https://github.com/T-Kalv/Simple-Neural-Network/tree/main">GitHub</a>.</p>]]></content><author><name>Thiruwaran Kalvin</name></author><category term="Neural Networks" /><category term="Artificial Intelligence" /><category term="Machine Learning" /><category term="Deep Learning" /><category term="XOR" /><category term="Python" /><summary type="html"><![CDATA[In this post, I built a Simple Neural Network for XOR made in Python from Scratch without any external Machine Learning Libraries/Frameworks such as PyTorch, TensorFlow or JAX.]]></summary></entry></feed>