This is the demo page for our ongoing audio-visual speech enhancement (AVSE) project, which uses the visual information in lip movements to extract the target speech. Our proposed model ranked first in the ICASSP 2024 Signal Processing Grand Challenge: Multimodal Information based Speech Processing (MISP) 2023 Challenge. The paper can be found via https://mispchallenge.github.io/mispchallenge2023/task1_leaderboard.html.
Latest demos:
mytest_1.mp4
mydemo2.mp4
demo.of.the.MISP.2023.Challenge.mp4
Region of interest of lip movements
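
As a rough illustration of what a lip region of interest is, the sketch below computes a fixed-size crop box centered on a set of lip landmarks and keeps it inside the frame. It is not taken from the project code: the function name `lip_roi_box`, the landmark format (a list of `(x, y)` pixel coordinates from any face landmark detector), and the 96-pixel default size are all assumptions made for this example.

```python
from typing import Iterable, Tuple


def lip_roi_box(
    landmarks: Iterable[Tuple[float, float]],
    frame_width: int,
    frame_height: int,
    size: int = 96,  # assumed ROI side length in pixels, not the project's value
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a size x size box around the lips."""
    points = list(landmarks)
    if not points:
        raise ValueError("at least one lip landmark is required")
    if size > frame_width or size > frame_height:
        raise ValueError("ROI size must fit inside the frame")

    # Center of the mouth: mean of the landmark coordinates.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)

    # Top-left corner, shifted if needed so the box stays inside the frame.
    left = min(max(int(round(cx - size / 2)), 0), frame_width - size)
    top = min(max(int(round(cy - size / 2)), 0), frame_height - size)
    return left, top, left + size, top + size


if __name__ == "__main__":
    # Four hypothetical lip landmarks in a 640x480 frame.
    lips = [(100, 200), (140, 200), (120, 190), (120, 210)]
    print(lip_roi_box(lips, frame_width=640, frame_height=480))
```

The returned box can be used to slice each video frame (for example `frame[top:bottom, left:right]` on an image array) before the crop is passed to a visual front end.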