To address the problem that single-modal mixed-signal separation methods cannot determine the correspondence between mechanical equipment and its sound source, a sound source separation method for mechanical equipment based on multi-modal feature fusion is proposed. First, a Res2Net18 network with a multi-scale feature extraction structure is constructed from multiple sets of feature extraction layers of different scales, in order to extract fine-grained visual features of the mechanical equipment. The expression of spatial position information for the different audio features in the encoder is also enhanced. Second, the visual features of the mechanical equipment are fused into the mixed audio features to generate a corresponding sound source mask, and the independent sound source spectrum is then obtained by combining the mask with the mixed audio spectrum, thereby separating the sound source that corresponds to the visual features of the mechanical equipment. In this way, the proposed method resolves the inability of single-modal mixed-signal separation methods to determine the correspondence between mechanical equipment and sound source. Finally, on the mechanical equipment dataset, the SDR, SIR and SAR reach 6.14 dB, 8.59 dB and 18.33 dB, respectively. Compared with three existing multimodal sound source separation models, the proposed method achieves the best results in both SDR and SAR, which verifies its effectiveness.
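The mask-combination step described above can be illustrated with a minimal sketch. It assumes the mask is applied by element-wise multiplication to a magnitude spectrogram, which the abstract does not state explicitly. The function name `apply_mask` and the toy data are hypothetical. The ideal ratio mask below merely stands in for the mask that the proposed network would predict from the fused audio-visual features.

```python
import numpy as np


def apply_mask(mixture_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Combine a source mask with a mixture spectrogram (assumed: element-wise product)."""
    if mixture_spec.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    return mixture_spec * mask


# Toy magnitude spectrograms (frequency bins x time frames) for two sources.
rng = np.random.default_rng(0)
source_a = rng.random((4, 5))
source_b = rng.random((4, 5))

# Assumption: magnitudes add in the mixture (a common simplification).
mixture = source_a + source_b

# Ideal ratio mask for source A; a stand-in for the network-predicted mask.
mask_a = source_a / (mixture + 1e-8)

# Independent source spectrum recovered from mask and mixture.
estimate_a = apply_mask(mixture, mask_a)
```

With an ideal mask the estimate matches the true source spectrum up to the small stabilising constant. A predicted mask would only approximate it, and that gap is what metrics such as SDR, SIR and SAR quantify.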