Query about the motivation #10

Open
VoyageWang opened this issue Jul 25, 2023 · 3 comments

Comments

@VoyageWang

Hi there! This is nice work, but I have a small question about the motivation behind the architecture design. The paper says: "Based on the research conducted in [11, 4], which performed a quantitative analysis of different depths of self-attention blocks and discovered that shallow blocks tend to capture short-range dependencies while deeper ones capture long-range dependencies".

To my knowledge, a transformer can always model global context and can achieve a large effective receptive field from the initial stages. Why do you say that shallow blocks capture short-range dependencies while deeper ones capture long-range dependencies? Why wouldn't both capture long-range dependencies? What makes shallow blocks focus on short-range and deep blocks on long-range?

@AFeng-x
Owner

AFeng-x commented Jul 26, 2023

Hi, please read carefully the analysis of the attention range in the two articles [11, 4]. Although the transformer is designed to capture long-range dependencies, its shallow layers mainly focus on capturing short-range dependencies.

@DavideHe

If this theory about short- versus long-range dependencies holds, we could use convolution blocks for the shallow layers and transformer blocks for the deep layers. As we all know, convolution captures local information and is position-independent, so it needs no positional encoding.
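
To make the idea above concrete, here is a minimal, hypothetical sketch of such a hybrid stacking (not the code in this repository; all module names, dimensions, and depths are illustrative): convolution blocks handle the shallow, short-range stages, while self-attention blocks handle the deeper, long-range stages.

```python
# Hypothetical hybrid backbone sketch: conv blocks in shallow stages,
# self-attention blocks in deep stages. Illustrative only.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Local mixing via a depthwise 3x3 conv + pointwise MLP (shallow stages)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))


class AttnBlock(nn.Module):
    """Global mixing via multi-head self-attention over flattened tokens (deep stages)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]
        return t.transpose(1, 2).reshape(b, c, h, w)


class HybridNet(nn.Module):
    """Conv blocks early, attention blocks late, with downsampling between stages."""
    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        stages, in_dim = [], dims[0]
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            blocks = [nn.Conv2d(in_dim, dim, kernel_size=2, stride=2)] if i > 0 else []
            block_cls = ConvBlock if i < 2 else AttnBlock  # shallow: conv, deep: attention
            blocks += [block_cls(dim) for _ in range(depth)]
            stages.append(nn.Sequential(*blocks))
            in_dim = dim
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return x


if __name__ == "__main__":
    out = HybridNet()(torch.randn(1, 3, 224, 224))
    print(out.shape)   # torch.Size([1, 512, 7, 7])
```

The switch from `ConvBlock` to `AttnBlock` at the deeper stages is exactly the split proposed above: local mixing where dependencies are short-range, global mixing where they are long-range.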

@VoyageWang
Author

Hi there! Thanks for your reply! I would also like to know the motivation behind the modulation design. Why does modulation improve performance? Could you explain it further? I am looking forward to your reply.
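
For readers wondering what "modulation" refers to here, below is a minimal, hypothetical sketch of a generic modulation-style block (in the spirit of convolutional/focal modulation designs), not this repository's actual implementation: a context branch aggregates local features with a depthwise convolution, and the result multiplies a projected copy of the input elementwise, so each position is re-weighted by its surrounding context before the output projection. All names and kernel sizes are illustrative.

```python
# Hypothetical modulation-style block, illustrative only.
import torch
import torch.nn as nn


class ModulationBlock(nn.Module):
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.v = nn.Conv2d(dim, dim, 1)                            # value projection
        self.ctx = nn.Conv2d(dim, dim, kernel_size,
                             padding=kernel_size // 2, groups=dim)  # context aggregation
        self.act = nn.GELU()
        self.proj = nn.Conv2d(dim, dim, 1)                         # output projection

    def forward(self, x):                                          # x: (B, C, H, W)
        context = self.act(self.ctx(x))                            # local context per position
        return self.proj(self.v(x) * context)                      # elementwise modulation


if __name__ == "__main__":
    y = ModulationBlock(64)(torch.randn(1, 64, 56, 56))
    print(y.shape)   # torch.Size([1, 64, 56, 56])
```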
