It occurred to me yesterday that a simple way to incorporate the “concept formation” ideas I was talking about here into a neural net framework might be to start with a very simple network, and then let it produce new hidden nodes dynamically.
I have in mind something like this. Fix integers M, N > 0. Every M training cycles, look for sets of neurons whose outputs were highly correlated over the last N cycles. (I want these “sets” to have more than 2 elements, but covariance is only defined between pairs of variables and doesn’t extend naturally to larger sets, so I’m not sure how to do this?)
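One minimal sketch of this detection step (names and the specific generalization are my own choices, not settled parts of the idea): log the last N cycles of activations as an array, threshold the pairwise correlation matrix, and take connected components of the resulting graph as candidate sets. This sidesteps the “more than 2 variables” problem by only ever comparing pairs, which may or may not be the right generalization.

```python
import numpy as np

def find_correlated_sets(activations, threshold=0.9):
    """Find candidate sets of neurons whose outputs were highly correlated.

    activations: array of shape (N, num_neurons), one row per training cycle.
    One way (an assumption, not the only option) to extend pairwise
    correlation to larger sets: put an edge between any two neurons whose
    absolute Pearson correlation exceeds the threshold, and return connected
    components with more than 2 members.
    """
    corr = np.corrcoef(activations, rowvar=False)       # pairwise correlations
    n = corr.shape[0]
    adj = (np.abs(corr) > threshold) & ~np.eye(n, dtype=bool)

    # Connected components via a simple BFS over the thresholded graph.
    unvisited = set(range(n))
    sets = []
    while unvisited:
        start = unvisited.pop()
        component, frontier = {start}, [start]
        while frontier:
            i = frontier.pop()
            for j in np.flatnonzero(adj[i]):
                if j in unvisited:
                    unvisited.remove(j)
                    component.add(int(j))
                    frontier.append(int(j))
        if len(component) > 2:
            sets.append(sorted(component))
    return sets
```

A stricter alternative would be to require cliques rather than connected components, at higher computational cost.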
If the correlation for a set of neurons is above some threshold, add a new neuron which takes input from (only) that set of neurons, and outputs to all of the neurons they output to. Then delete all outgoing connections from neurons in the set, so that they can only communicate with later layers via the new neuron. (We’d probably want to only make one neuron per M cycles, even if there are multiple sets above the threshold, to avoid confounding the effects of adding multiple neurons at once.)
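The rewiring itself is straightforward if the network is represented as a graph. A sketch, assuming a successor-map representation (the representation and names are mine):

```python
def insert_compressor(successors, correlated_set, new_id):
    """Insert a new neuron between a correlated set and its downstream targets.

    successors: dict mapping neuron id -> set of neuron ids it feeds into.
    As described above: the new neuron takes input only from the correlated
    set, outputs to the union of the set's old targets, and the set members'
    old outgoing connections are deleted so they communicate only via the
    new neuron.
    """
    targets = set()
    for i in correlated_set:
        targets |= successors.get(i, set())
    targets -= set(correlated_set)      # don't wire the new neuron back into the set
    successors[new_id] = targets        # new neuron inherits the old targets
    for i in correlated_set:
        successors[i] = {new_id}        # set members now output only to the new neuron
    return successors
```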
This captures the intuition that hidden layers are meant to compress the input data, but rather than fixing the architecture in advance, the system can build new compressive nodes when it notices new opportunities for compression. (It seems related to Hebbian learning, but is not identical to it.)
If we only look for correlations within a single layer at a time, the system will just add more and more hidden layers to a feedforward network. If we have the resources to look for correlations across all neurons at once, then the system can build more general graphs. (To see why this might be useful, consider a concept like “blonde hair.” Since hair can come in many colors, we wouldn’t expect low-level visual concepts like “light” and “yellow” to have much influence on a neuron encoding the presence of hair. But to make a neuron encoding the presence of blonde hair, we’d want to take input from the “hair” neuron and combine it with input from those lower-level concepts.)
We’d want to put a limit on the number of new neurons “per layer” so that the system is not allowed to just reproduce the entire input in a hidden layer. In the feedforward case, this is straightforward. With more arbitrary graphs, it’s less obvious how to do it, but I imagine there is some graph theory concept that would work?
Not sure how to set the incoming and outgoing weights of new neurons.
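One plausible initialization (my guess, not a worked-out answer): with a linear activation, give the new neuron uniform incoming weights 1/k so its output is the mean of the set’s outputs, and give it outgoing weight equal to the sum of the old weights from the set to each target. If the set’s outputs really are nearly identical, each downstream neuron then initially receives approximately the same total input as before the rewiring, so the surgery is roughly function-preserving.

```python
import numpy as np

def init_new_weights(old_out_weights, set_size):
    """Initialize a compressor neuron's weights (an illustrative choice).

    old_out_weights: dict mapping target id -> list of the k old weights
    from the set's members to that target.

    Incoming weights 1/k make the new neuron's (linear) output the mean of
    the set's outputs; outgoing weight sum_i w[i][t] to each target t then
    approximately preserves each target's total input when the set's
    outputs are nearly equal.
    """
    incoming = np.full(set_size, 1.0 / set_size)
    outgoing = {t: float(sum(ws)) for t, ws in old_out_weights.items()}
    return incoming, outgoing
```

Sanity check on the preservation claim: if all k set members output x, a target with old weights summing to w received wx before the surgery, and receives (mean = x) · (outgoing = w) = wx after it. With nonlinear activations this only holds approximately.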
For both efficiency and ease of interpretation, we’ll want to delete newly created neurons if they turn out to “not work.” I wouldn’t want to rely too much on the objective function here, because I expect adding new neurons to initially make the objective function worse, since they add more compression (see below). Maybe we could track the correlation originally used to create the neuron and see if it persists, although this creates issues involving the order in which training examples are presented.
The setup above already creates a dependence on the order of training examples, because the system cannot notice a pattern unless it appears in a set of examples presented together inside a window of size N. So it would be best to use time series data with autocorrelation, e.g. successive frames of a video. Still not sure what the deletion rule should be. (We want to retain a concept if there are repeated stretches in which it is useful, e.g. “hair” is useful whenever hair appears although there may be long stretches with no hair. Could perhaps track the variance or some related statistic for each neuron’s output over long times – a neuron that either [almost] always fires or [almost] never fires is not encoding a helpful concept.)
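The variance-tracking idea could look something like this (the decay rate, variance floor, and the exponential-moving-average formulation are all illustrative choices, not part of the proposal above):

```python
class ActivityTracker:
    """Running mean/variance of one neuron's output, via an exponential
    moving average so that long-ago activity is gradually forgotten.

    A neuron whose long-run output variance stays near zero -- it (almost)
    always fires or (almost) never fires -- is flagged for deletion, since
    it isn't encoding a useful distinction.  The decay rate and variance
    floor here are arbitrary illustrative values.
    """
    def __init__(self, decay=0.99, var_floor=1e-3):
        self.decay, self.var_floor = decay, var_floor
        self.mean, self.var = 0.0, 1.0   # optimistic start, so new neurons get a grace period

    def update(self, output):
        # Standard exponentially weighted mean/variance update.
        d = output - self.mean
        self.mean += (1 - self.decay) * d
        self.var = self.decay * (self.var + (1 - self.decay) * d * d)
        return self

    def should_delete(self):
        return self.var < self.var_floor
```

Because the average decays, a neuron like “hair” that is useful in repeated stretches keeps its variance up each time hair reappears, while the grace period from the optimistic initial variance gives a new neuron time to prove itself.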
It might be helpful for the deletion rule, but also interesting in general, to allow the system to “engage in exploratory behavior” by asking for samples that will strongly activate (or inactivate) a recently created neuron, and checking whether the neuron “works” in these instances. (Not sure how to measure this, since creating a neuron may decrease the objective function at first even if the neuron is “useful.” This seems like a bias/variance tradeoff, since adding the neuron generally reduces the total number of connections, so maybe it will still do better on the new examples, since it is “getting the underlying theme” rather than overfitting noise.)
The “reducing the number of connections” thing is worrisome because we expect the network to eventually have more connections than it started with, once we have many layers. This suggests that “let the set only communicate via the new neuron” is too restrictive?
I imagine someone has done something very similar to this already. (It may even be a standard method that I am just not aware of.) Might be too computationally expensive for “real applications” while still being illuminating as basic science?
