Query about the motivation #10
Hi, you can read the analysis of the attention range in the two articles [11, 4] carefully. Although the transformer is designed to capture long-range dependencies, its shallow layers mainly focus on capturing short-range dependencies.
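For concreteness, this is the kind of quantitative analysis those articles perform: measuring the mean attention distance of each layer, where a small value means the layer attends mostly to nearby tokens (short-range) and a large value means it attends globally. The sketch below is my own illustration, not code from the paper or the cited works; the function name and tensor layout are assumptions.

```python
# Sketch (assumption, not from the paper): mean attention distance per layer,
# the kind of analysis referred to in [11, 4]. Shallow layers with a small mean
# distance attend mostly to nearby patches; deep layers with a large mean
# distance attend across the whole image.
import torch

def mean_attention_distance(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """attn: (heads, N, N) attention weights over N = grid_size**2 patch tokens."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float()      # (N, 2) patch grid positions
    dist = torch.cdist(coords, coords)          # (N, N) pairwise patch distances
    # Expected distance under the attention distribution,
    # averaged over queries and heads.
    return (attn * dist).sum(dim=-1).mean()

# Usage: collect the attention maps of each block and compare the averages
# across depth; the cited analyses report them growing with layer depth.
```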
If this theory about short- and long-range dependencies holds, we can use convolution for the shallow layers and transformer blocks for the deep layers. As we all know, convolution can capture local information without needing positional embeddings.
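Below is a minimal sketch of that kind of hybrid layout, assuming a PyTorch-style implementation: depthwise convolution blocks in the shallow stages for local, short-range features, and self-attention blocks in the deep stages for long-range dependencies. The class names, channel widths, and depths are hypothetical and not the repository's actual code.

```python
# Sketch (my own illustration): convolution in shallow stages,
# self-attention in deep stages.
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local spatial mixing
        self.pw = nn.Conv2d(dim, dim, 1)                         # channel mixing
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.norm(self.dw(x))))

class AttnBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, HW, C) token sequence
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).reshape(B, C, H, W)

# Shallow stages use ConvBlock, deep stages use AttnBlock (widths/depths hypothetical).
hybrid = nn.Sequential(ConvBlock(64), ConvBlock(64), AttnBlock(64), AttnBlock(64))
```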
Hi there! Thanks for your reply! I want to know: what is the motivation for the modulation? Why does using modulation improve performance? Could you explain it further? I am looking forward to your reply.
Hi there! This is a nice work, but I have a small query about the motivation of the architecture design. The paper says: "Based on the research conducted in [11, 4], which performed a quantitative analysis of different depths of self-attention blocks and discovered that shallow blocks tend to capture short-range dependencies while deeper ones capture long-range dependencies".
To my knowledge, a transformer can always model globally and can reach a large effective receptive field from the initial stages. Why do you state that shallow blocks capture short-range dependencies while deeper ones capture long-range dependencies? Shouldn't both capture long-range dependencies? Why would the shallow blocks capture short-range while the deep blocks capture long-range?