11-21 CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios