I recently stumbled upon an interesting idea after a long coding session. What if physical filtration principles could inform the design of attention heads in AI models? This concept might seem unusual, but bear with me as we explore it.
In physical filtration, materials are layered by particle size so that each layer traps a specific range of particles. In water filtration, for example, you might stack fine sand, coarse sand, gravel, and crushed stone, with each layer catching particles of a particular size. The process is subtractive: each layer removes something, and only what passes every layer comes through.
Now, let’s consider attention heads in transformers. Each head learns to focus on particular parts of the input, but the receptive field it ends up using is emergent rather than explicitly constrained. What if we constrained attention heads to specific receptive field sizes, much like the layers of a physical filter substrate?
For instance (a rough masking sketch follows this list), we could have:
* Heads 1-4: only attend within 16 tokens (fine)
* Heads 5-8: attend within 64 tokens (medium)
* Heads 9-12: global attention (coarse)
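To make the split concrete, here is a minimal PyTorch sketch of how such per-head windows could be imposed as attention masks. The function name, the 512-token sequence length, and the reading of ‘within N tokens’ as a symmetric window are my own illustrative assumptions, not something taken from an existing model:

```python
import torch

def banded_attention_masks(seq_len, head_windows):
    """One boolean mask per head: True = this query may attend to this key.

    `head_windows` gives a window size (in tokens, each side) per head,
    or None for an unrestricted, global head.
    """
    pos = torch.arange(seq_len)
    # distance[i, j] = |i - j| between query position i and key position j
    distance = (pos[:, None] - pos[None, :]).abs()
    masks = []
    for window in head_windows:
        if window is None:
            masks.append(torch.ones(seq_len, seq_len, dtype=torch.bool))
        else:
            masks.append(distance < window)
    return torch.stack(masks)  # shape: (num_heads, seq_len, seq_len)

# The split from the list above: 4 fine, 4 medium, 4 global heads.
head_windows = [16] * 4 + [64] * 4 + [None] * 4
masks = banded_attention_masks(seq_len=512, head_windows=head_windows)

# In a standard multi-head attention these would be applied to the raw
# scores before the softmax, e.g.
#   scores = scores.masked_fill(~masks[None], float("-inf"))
```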
This approach might not be entirely new. Longformer and BigBird already use a binary local/global split, and WaveNet uses dilated convolutions whose receptive fields grow exponentially with depth. Still, explicitly assigning each attention head a fixed window size could cut the number of query-key pairs a layer has to score, and it would make each head’s role easier to interpret.
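To put a rough number on the compute claim, here is a back-of-the-envelope count of query-key pairs scored per layer under the 12-head split above. The 4,096-token sequence length, the symmetric windows, and the decision to ignore edge effects are all my own assumptions:

```python
# Back-of-the-envelope: query-key pairs scored per layer for the
# 12-head split above vs. 12 fully global heads.
n = 4096  # sequence length (illustrative)

def scored_pairs(window):
    # `window` tokens on each side of the query; None = global attention.
    return n * n if window is None else n * min(2 * window, n)

head_windows = [16] * 4 + [64] * 4 + [None] * 4

constrained = sum(scored_pairs(w) for w in head_windows)
full = 12 * n * n

print(f"{constrained:,}")            # 69,730,304
print(f"{full:,}")                   # 201,326,592
print(f"{constrained / full:.2f}")   # 0.35
```

By this crude count the score computation drops to roughly a third of the full-attention cost, with the four global heads accounting for almost all of what remains; Longformer-style designs go further by sparsifying the global pattern as well.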
But, there are also potential drawbacks to this approach. The flexibility of unconstrained heads might be a key aspect of their effectiveness, and explicitly constraining them could limit their ability to learn complex patterns. Furthermore, this idea might have already been tried and proven not to work.
Another interesting aspect to consider is the concept of subtractive attention, where fine-grained heads ‘handle’ local patterns and subtract their contribution from the residual stream, leaving the coarse heads to focus on whatever remains unexplained. This is still highly speculative, but it could point toward more efficient and effective attention mechanisms.
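I’m not aware of an established mechanism that does this, so the following is a purely speculative toy sketch of one way it might look: each scale’s heads attend over the current residual, and a learned ‘explained’ component is subtracted before the next, coarser scale sees it. All class, module, and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn

class SubtractiveMultiScaleAttention(nn.Module):
    """Toy sketch: fine heads act first and 'explain away' part of the
    residual stream before coarser heads see it. Purely speculative."""

    def __init__(self, d_model, n_heads_per_scale, windows=(16, 64, None)):
        super().__init__()
        self.windows = windows
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads_per_scale, batch_first=True)
            for _ in windows
        )
        # One projection per scale producing the component to subtract.
        self.explain = nn.ModuleList(nn.Linear(d_model, d_model) for _ in windows)

    def forward(self, x):
        residual, output = x, 0
        for attn, explain, window in zip(self.attn, self.explain, self.windows):
            mask = None
            if window is not None:
                pos = torch.arange(x.size(1), device=x.device)
                # Disallow attention beyond `window` tokens in either direction.
                mask = (pos[:, None] - pos[None, :]).abs() >= window
            out, _ = attn(residual, residual, residual, attn_mask=mask)
            output = output + out
            # Subtract what this scale 'explained' so the coarser scales
            # attend over the leftover signal instead.
            residual = residual - explain(out)
        return output

# Usage sketch: 3 scales x 4 heads = 12 heads, matching the split above.
layer = SubtractiveMultiScaleAttention(d_model=256, n_heads_per_scale=4)
y = layer(torch.randn(2, 128, 256))   # (batch, seq_len, d_model)
```

Whether subtracting an ‘explained’ component like this trains stably, or simply fights the usual additive residual connection, is exactly the kind of thing that would need to be tested empirically.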
So, is this idea worth exploring further? Should we be looking into physical filtration principles as a way to improve attention head design in AI models? I’d love to hear your thoughts on this topic.
