Huh, the AlphaStar paper is finally up (linked in this DeepMind post).
I was mostly interested in the model architecture rather than the training setup, although the paper focuses mostly on the latter. In part this is understandable, since their training procedure – with the “league” – is more novel.
What’s strange to me, though, is that they downplay the model-design side of the work to the point of actually not telling you how they chose the components or tuned the hyperparameters, or indeed whether they tuned the hyperparameters at all.
Indeed, the entire training procedure they describe appears to have been done, not just with a single fixed model structure, but with a single fixed set of hyperparameters. And we aren’t told how they arrived at them, or what else they tried. The closest thing to a discussion of model selection is this boilerplate-ish paragraph:
Architecture components were chosen and tuned with respect to their performance in supervised learning, and include many recent advances in deep learning architectures. A high-level overview of the agent architecture is given in Extended Data Fig. 3, with more detailed descriptions in Supplementary Data, Detailed Architecture. AlphaStar has 139 million weights, but only 55 million weights are required during inference. Ablation Fig. 3f compares the impact of scatter connections, transformer, and pointer network.
And the term “hyper-parameter” (Nature apparently prefers the hyphen) appears only once in the paper, in this sentence:
All the neural architecture details and hyper-parameters can be found in the file ‘detailed-architecture.txt’ in the Supplementary Data.
What is “detailed-architecture.txt”? It turns out to be an extensive 3747-word human-readable description of a single, very complicated ML model, with all the hyperparameters explicitly written out, and again no discussion of how they were chosen. A few representative excerpts:
The transformer output is passed through a ReLU, 1D convolution with 256 channels and kernel size 1, and another ReLU to yield `entity_embeddings`. The mean of the transformer output across the units (masked by the missing entries) is fed through a linear layer of size 256 and a ReLU to yield `embedded_entity`.
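For concreteness, here's roughly what I take that to mean, as a numpy sketch. All shapes, weights, and the unit count are stand-ins I made up; the one real trick is that a 1D convolution with kernel size 1 is just the same linear layer applied to each unit independently:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Made-up shapes for illustration: 10 unit slots, 256-dim transformer output.
num_units, d_model = 10, 256
transformer_out = rng.standard_normal((num_units, d_model))
mask = np.arange(num_units) < 7  # pretend the last 3 unit slots are padding

# A 1D conv with kernel size 1 is a per-unit linear layer (random stand-in weights).
W_conv = rng.standard_normal((d_model, 256)) * 0.02
entity_embeddings = relu(relu(transformer_out) @ W_conv)

# Masked mean over the real units, then a linear layer of size 256 and a ReLU.
W_lin = rng.standard_normal((d_model, 256)) * 0.02
embedded_entity = relu(transformer_out[mask].mean(axis=0) @ W_lin)

print(entity_embeddings.shape)  # (10, 256)
print(embedded_entity.shape)    # (256,)
```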
unit_counts_bow: A bag-of-words unit count from `entity_list`. The unit count vector is embedded by square rooting, passing through a linear layer, and passing through a ReLU
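Which, if I'm reading it right, amounts to this (the entity-type count and output size are my guesses, not taken from the file):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

num_entity_types, out_dim = 256, 64  # stand-in sizes, not from the paper
counts = np.zeros(num_entity_types)
for t in [3, 3, 17, 42]:  # e.g. two units of type 3, one each of types 17 and 42
    counts[t] += 1.0

# Square-root the counts, then a linear layer and a ReLU.
W = rng.standard_normal((num_entity_types, out_dim)) * 0.02
unit_counts_bow = relu(np.sqrt(counts) @ W)
print(unit_counts_bow.shape)  # (64,)
```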
mmr: During supervised learning, this is the MMR of the player we are trying to imitate. Elsewhere, this is fixed at 6200. MMR is mapped to a one-hot of min(mmr / 1000, 6) with maximum 6, then passed through a linear of size 64 and a ReLU
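So outside of supervised learning the agent is permanently told it's a 6000+ MMR player. The bucketing, as I read it (seven buckets, floor-divide by 1000, capped at 6 – my interpretation of the one-hot description):

```python
def mmr_one_hot(mmr: float) -> list[int]:
    """Bucket MMR into one of 7 one-hot slots: floor(mmr / 1000), capped at 6."""
    idx = min(int(mmr // 1000), 6)
    vec = [0] * 7
    vec[idx] = 1
    return vec

mmr_one_hot(6200)  # the fixed inference-time value lands in the top bucket
mmr_one_hot(3499)  # a mid-ladder player lands in bucket 3
```

(The real model then passes this through a linear of size 64 and a ReLU, as quoted above.)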
cumulative_statistics: The cumulative statistics (including units, buildings, effects, and upgrades) are preprocessed into a boolean vector of whether or not statistic is present in a human game. That vector is split into 3 sub-vectors of units/buildings, effects, and upgrades, and each subvector is passed through a linear of size 32 and a ReLU, and concatenated together. The embedding is also added to `scalar_context`
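Sketching that out (the category counts are stand-ins; the real ones come from the game's data tables):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

# Stand-in category sizes for units/buildings, effects, and upgrades.
n_units_buildings, n_effects, n_upgrades = 50, 20, 30
present = (rng.random(n_units_buildings + n_effects + n_upgrades) < 0.3).astype(np.float32)

subvecs = np.split(present, [n_units_buildings, n_units_buildings + n_effects])

# Each sub-vector gets its own linear layer of size 32 and a ReLU,
# and the three embeddings are concatenated.
embeds = [relu(sub @ (rng.standard_normal((sub.size, 32)) * 0.02)) for sub in subvecs]
cumulative_statistics = np.concatenate(embeds)
print(cumulative_statistics.shape)  # (96,)
```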
beginning_build_order: The first 20 constructed entities are converted to a 2D tensor of size [20, num_entity_types], concatenated with indices and the binary encodings (as in the Entity Encoder) of where entities were constructed (if applicable). The concatenation is passed through a transformer similar to the one in the entity encoder, but with keys, queries, and values of 8 and with a MLP hidden size of 32. The embedding is also added to `scalar_context`.
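The “keys, queries, and values of 8” bit means a genuinely tiny attention head. Something like this, I think (input feature dim and weights made up; residual connections and layer norm omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in input: 20 build-order slots, made-up feature dim.
seq_len, d_in, d_qkv = 20, 16, 8
x = rng.standard_normal((seq_len, d_in))

# One self-attention head with 8-dim keys/queries/values (random weights).
Wq, Wk, Wv = (rng.standard_normal((d_in, d_qkv)) * 0.1 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_qkv)) @ v  # (20, 8)

# Followed by a small MLP with hidden size 32.
W1 = rng.standard_normal((d_qkv, 32)) * 0.1
W2 = rng.standard_normal((32, d_qkv)) * 0.1
out = np.maximum(attn @ W1, 0.0) @ W2
print(out.shape)  # (20, 8)
```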
At a less granular level, the whole thing apparently looks like this:

And you can go to “detailed-architecture.txt” to learn things like
- the block labeled “Core” is “an LSTM with 3 hidden layers each of size 384” (why 3? why 384? why an LSTM?)
- the one labeled “Entity encoder” is “a transformer with 3 layers of 2-headed self-attention [with 128-dim heads and 1024-dim feedforward]” – this is what DeepMind means when they talk about AlphaStar using transformers, although it’s only 3 blocks, which is way smaller than all the NLP transformers
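A back-of-envelope weight count makes “way smaller” concrete. Counting only the attention projections (4·d²) and the feed-forward weights (2·d·d_ff), and ignoring biases, layer norms, and embeddings:

```python
def transformer_weights(layers: int, d_model: int, d_ff: int) -> int:
    """Rough per-layer weight count: 4*d^2 for the Q/K/V/output projections
    plus 2*d*d_ff for the feed-forward block (biases/norms/embeddings ignored)."""
    return layers * (4 * d_model ** 2 + 2 * d_model * d_ff)

# Entity encoder: 3 layers, 2 heads x 128 dims = d_model 256, feed-forward 1024.
print(transformer_weights(3, 2 * 128, 1024))  # 2359296  (~2.4M)
# GPT-2 small, same formula: 12 layers, d_model 768, feed-forward 3072.
print(transformer_weights(12, 768, 3072))     # 84934656 (~85M)
```

So the “transformer” in AlphaStar is on the order of a few million weights, a tiny slice of the 139M total.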
All of this is pretty mysterious to me. It’s conventional wisdom by now that getting your architecture/hyperparameters right helps no matter how much data you have (indeed, it’s one of the things you use the data to do), and that simple but somehow “domain-correct” architectures like the transformer can beat convoluted ones planned out by humans using domain knowledge.
Like, compare this to other recent mind-blowers:
- AlphaZero was a stack of 19 identical ResNet conv blocks (with some small specialized connectors at the start and end)
- GPT/BERT/GPT-2 were between 12 and 48 identical transformer blocks
Yet AlphaStar is this gigantic complicated circuit diagram with all sorts of specialized blocks, with their various specialized knobs set to different, seemingly arbitrary powers of 2.
Are there information bottlenecks in there? What might be holding it back? Could the training have been less fancy if the least good aspect of the architecture (whatever it is) were improved? What if it were simpler but just bigger (the whole thing is about the size of GPT-2-small)? More fundamentally, why this, out of all the possible things you could do?

