In fine-grained image classification, extracting distinctive local features is crucial for identifying small differences between images. Models based on the ViT (Vision Transformer) framework have achieved excellent performance in many areas of computer vision. To address the problem that ViT-based fine-grained image classification models pay little attention to local regions of the image, and to further strengthen the contextual connections among patch features, a fine-grained image classification method based on enhancing patch correlation is proposed. First, a method for assigning correlation weights to patches is proposed and applied in a nested manner across encoder layers to enrich the feature information of different layers, which addresses ViT's insufficient attention to local image features. Second, the position information of the patches is incorporated to strengthen the context of local features and reduce the interference caused by noise. Finally, a similarity loss function is proposed to learn the subtle feature differences in fine-grained images and improve the classification performance of the model. In experiments on two public datasets, CUB-200-2011 and Stanford Dogs, the method achieves accuracies of 91.33% and 92.15%, respectively. It improves on the baseline ViT model by 0.63 and 0.45 percentage points, respectively, which verifies its effectiveness for fine-grained image classification.
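The abstract does not give the formula used to assign correlation weights to patches. The following is only a minimal illustrative sketch under stated assumptions: each patch is scored by its mean cosine similarity to all other patches, and the scores are normalized with a softmax. The function names `cosine`, `patch_correlation_weights`, and `reweight` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_correlation_weights(patches):
    """Assumed scoring rule: mean cosine similarity to the other patches,
    followed by a softmax so that the weights sum to 1.
    Expects a non-empty list of equal-length feature vectors."""
    n = len(patches)
    scores = []
    for i in range(n):
        sims = [cosine(patches[i], patches[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    # Numerically stable softmax over the per-patch scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(patches, weights):
    """Scale each patch feature by n * weight, so uniform weights leave it unchanged."""
    n = len(patches)
    return [[n * w * x for x in patch] for patch, w in zip(patches, weights)]


# Toy input: two similar patches and one dissimilar patch.
patches = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
weights = patch_correlation_weights(patches)
weighted_patches = reweight(patches, weights)
```

In this toy input the third patch is nearly orthogonal to the other two, so it receives the smallest weight. In a ViT, such weights would be computed on patch token embeddings inside the encoder layers; how the proposed method nests them across layers is not specified in the abstract.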