prashanthatp / wav2mov

Speech to Facial Animation using GANs

Python 98.28% Shell 1.11% Batchfile 0.61%

deeplearning pytorch gans face-animation facial-animation lip-animation generative-adversarial-networks generative-adversarial-network deep-learning gan

wav2mov's Issues

Remove unnecessary condition inside create_file_list

Why check the length of FROM_DIR when the value can be assigned directly in the first place?

wav2mov/wav2mov/datasets/create_file_list.py

Lines 9 to 17 in 3061ed6

    FROM_DIR = ''
    TO_DIR = ''

    VIDEOS_PER_ACTOR = 30

    if len(FROM_DIR)==0:
        FROM_DIR = os.path.dirname(os.path.abspath(__file__))
        FROM_DIR = os.path.join(FROM_DIR,DATASET)
        TO_DIR = FROM_DIR
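
A minimal sketch of what the issue asks for, assuming the empty-string defaults are never overridden elsewhere in the script. The helper name `default_dirs` and the placeholder dataset name are hypothetical and not part of the repository.

```python
import os


def default_dirs(base_dir, dataset):
    """Return (from_dir, to_dir) built directly, with no emptiness check."""
    from_dir = os.path.join(base_dir, dataset)
    return from_dir, from_dir


# In the script itself, base_dir would be the directory of the file, i.e.
# os.path.dirname(os.path.abspath(__file__)).
# "dataset_name" is a placeholder for the DATASET constant.
FROM_DIR, TO_DIR = default_dirs(os.getcwd(), "dataset_name")
```

If the directories ever need to be overridden, the same function can take the override as an argument instead of testing for an empty string.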

Weight update is not being done for generator

After the adversarial backward passes for the generator with the identity discriminator (or with the sync discriminator),
the function that updates the generator weights is never called.

wav2mov/models/wav2mov.py

Lines 193 to 216 in a8651bf

    def __optimize(self,adversarial):
        self.fake_video_frames = None
        self.fake_video_frames_c = None
        FRACTION =5
        losses = {'gen':(0.0,0),
                  'id':(0.0,0),
                  'l1':(0.0,0),
                  'sync':(0.0,0),
                  'seq':(0.0,0)}
        for sub_batch in self.__get_sub_batch(fraction=FRACTION):
            self.model.set_input(sub_batch)
            loss_id = self.model.optimize_id(adversarial=False)
            # self.logger.debug(f'wav2mov 456 {loss_id}')
            for name,(loss,n) in loss_id.items():
                prev_loss,prev_n = losses.get(name,(0.0,0))
                loss_id[name] = (prev_loss+loss,prev_n+n)

            self.model.step_id()
        self.model.set_input(self.get_sub_seq(do_re_gen=adversarial,fraction=FRACTION))
        loss_sync = self.model.optimize_sync(adversarial)
        self.model.step_sync()
        self.model.update_scale()
        losses = {**losses,**loss_id,**loss_sync}
        return losses
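
The missing piece described above is an optimizer step for the generator after its adversarial backward pass. The following is a generic PyTorch sketch of that pattern, not the repository's code: the tiny `gen` and `disc` modules, the optimizer, and the function `generator_step` are all stand-ins chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

gen = nn.Linear(4, 4)    # stand-in generator
disc = nn.Linear(4, 1)   # stand-in discriminator
optim_gen = torch.optim.Adam(gen.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()


def generator_step(noise):
    """One adversarial generator update: backward pass followed by step()."""
    optim_gen.zero_grad()
    logits = disc(gen(noise))
    # The generator wants the discriminator to label its output as real (1).
    loss = criterion(logits, torch.ones_like(logits))
    loss.backward()
    optim_gen.step()  # without this call the generator weights never change
    return loss.item()


before = [p.detach().clone() for p in gen.parameters()]
loss_value = generator_step(torch.randn(8, 4))
changed = any(not torch.equal(b, p) for b, p in zip(before, gen.parameters()))
```

Here `changed` is true only because `optim_gen.step()` is called; dropping that line leaves the weights untouched even though gradients were computed.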

Normalizing MFCCs

Should the mean and variance be calculated over the entire sample or separately for each time frame?
For example, if the MFCCs of an audio clip have shape (1,7,13),
the mean and variance have the following shapes (with keepdims=True):
if calculated over the entire sample: (1,1,1)
if calculated per time frame: (1,7,1)
Currently we compute them over the entire sample.

wav2mov/wav2mov/core/data/utils.py

Line 135 in 3a4b2bd

 mean,std = torch.mean(full_mfccs,axis=(1,2),keepdims=True),torch.std(full_mfccs,axis=(1,2),keepdims=True) 
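
The two options can be compared side by side. This is a small sketch on random data with the (1,7,13) shape from the example, read as (batch, frames, coefficients); the epsilon added to the standard deviation is an assumption for numerical safety and is not in the original line.

```python
import torch

torch.manual_seed(0)
mfccs = torch.randn(1, 7, 13)  # (batch, frames, coefficients)
eps = 1e-8                     # assumed guard against division by zero

# Option 1: one mean/std per sample (what the quoted line does).
sample_mean = mfccs.mean(dim=(1, 2), keepdim=True)   # shape (1, 1, 1)
sample_std = mfccs.std(dim=(1, 2), keepdim=True)
per_sample = (mfccs - sample_mean) / (sample_std + eps)

# Option 2: one mean/std per time frame.
frame_mean = mfccs.mean(dim=2, keepdim=True)         # shape (1, 7, 1)
frame_std = mfccs.std(dim=2, keepdim=True)
per_frame = (mfccs - frame_mean) / (frame_std + eps)
```

Per-sample normalization keeps the relative energy differences between frames, while per-frame normalization removes them, so the choice depends on whether those differences should reach the model.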

Sync Loss: real and fake labels should be 1 and 0

The sync loss initially used swapped labels (0 for real and 1 for fake).
After swapping them back to normal, the fake label still defaults to 1.

wav2mov/losses/sync_loss.py

Lines 6 to 16 in 0fdef6b

    class SyncLoss(nn.Module):
        """Abstracts away the function of applying loss of synchrony between audio and video frames.

        [Reference]:
        https://github.com/Rudrabha/Wav2Lip/blob/master/hq_wav2lip_train.py#L181-L186
        """
        def __init__(self,device,real_label=None,fake_label=1.0):
            super().__init__()
            if real_label is None:
                real_label = round(random.uniform(0.8,1),2)
            self.register_buffer('real_label',torch.tensor(real_label))
            self.register_buffer('fake_label',torch.tensor(fake_label))
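
A sketch of the constructor with the default the issue title asks for, where only `fake_label` changes from 1.0 to 0.0. The class name `SyncLabels` is hypothetical, and the unused `device` argument is omitted to keep the example small.

```python
import random

import torch
from torch import nn


class SyncLabels(nn.Module):
    """Holds the target labels: smoothed real label, fake label of 0."""

    def __init__(self, real_label=None, fake_label=0.0):
        super().__init__()
        if real_label is None:
            # One-sided label smoothing, as in the quoted snippet.
            real_label = round(random.uniform(0.8, 1), 2)
        self.register_buffer('real_label', torch.tensor(real_label))
        self.register_buffer('fake_label', torch.tensor(fake_label))


labels = SyncLabels()
```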

Computing MFCCs on GPU

Currently, to compute MFCCs the audio has to be moved to the CPU, because librosa needs it as a NumPy array, which in turn has to live in CPU memory.

wav2mov/core/data/utils.py

Lines 100 to 106 in 30ccbd4

        self.audio_sf = audio_sf
        self.n_mfcc = 14
        # self.mfcc_transform = MFCC(sample_rate=self.audio_sf,n_mfcc=self.n_mfcc)
        self.mfcc_transform = partial(librosa.feature.mfcc,sr=self.audio_sf,n_mfcc=self.n_mfcc)

    def extract_mfccs(self,audio):
        mfccs = self.mfcc_transform(audio.squeeze().cpu().numpy())[1:].T

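So why not use torchaudio, which can compute MFCCs on the GPU? I think it also supports batch-wise computation.

    import torch
    from torch import nn
    from torchaudio.transforms import MFCC

    n_fft = 2048
    win_length = None
    hop_length = 512
    n_mels = 128
    n_mfcc = 14
    sample_rate = 16000
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # MFCC is an nn.Module, so it can be moved to the GPU like any other layer.
    model = nn.Sequential(MFCC(sample_rate=sample_rate,
                               n_mfcc=n_mfcc,
                               melkwargs={
                                   'n_fft': n_fft,
                                   'n_mels': n_mels,
                                   'hop_length': hop_length,
                                   'mel_scale': 'htk',
                               }))

    model.to(device)

    audio = ...  # waveform tensor of shape (..., time), already on `device`
    mfccs = model(audio)  # shape (..., n_mfcc, frames)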

Sync Loss: BCELoss is more suitable than BCEWithLogitsLoss

The sync loss is computed from a cosine similarity, which already lies between 0 and 1 (this holds when both embeddings are non-negative).
So it is best to use BCELoss.

However, this means avoiding mixed-precision gradient scaling for this loss, since the PyTorch docs recommend BCEWithLogitsLoss when autocast is used.

wav2mov/losses/sync_loss.py

Lines 6 to 17 in 0fdef6b

    class SyncLoss(nn.Module):
        """Abstracts away the function of applying loss of synchrony between audio and video frames.

        [Reference]:
        https://github.com/Rudrabha/Wav2Lip/blob/master/hq_wav2lip_train.py#L181-L186
        """
        def __init__(self,device,real_label=None,fake_label=1.0):
            super().__init__()
            if real_label is None:
                real_label = round(random.uniform(0.8,1),2)
            self.register_buffer('real_label',torch.tensor(real_label))
            self.register_buffer('fake_label',torch.tensor(fake_label))
            self.loss = nn.BCEWithLogitsLoss()
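
A sketch of the proposed alternative: plain binary cross-entropy applied directly to the cosine similarity. It assumes the embeddings are non-negative (simulated here with a ReLU), which keeps the similarity in [0, 1]; the function name `cosine_bce_loss` and the embedding sizes are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def cosine_bce_loss(audio_emb, video_emb, target):
    """BCE on cosine similarity; inputs are assumed non-negative embeddings."""
    sim = F.cosine_similarity(audio_emb, video_emb, dim=1)
    # Guard against tiny floating-point overshoots outside [0, 1].
    sim = sim.clamp(0.0, 1.0)
    return F.binary_cross_entropy(sim, target)


audio_emb = torch.relu(torch.randn(4, 16))
video_emb = torch.relu(torch.randn(4, 16))

loss_in_sync = cosine_bce_loss(audio_emb, audio_emb, torch.ones(4))
loss_mismatch = cosine_bce_loss(audio_emb, video_emb, torch.ones(4))
```

Identical embeddings give a similarity of 1 and a loss near zero, while unrelated embeddings give a larger loss. This function should run in full precision, since binary cross-entropy on probabilities is not safe under autocast.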

Why not combine command-line args (options), hparams, and config?

Since options, hparams, and config parameters are passed around together in many places, why not put them inside a common container?

wav2mov/main/main.py

Line 37 in 2a765a1

create_from_grid_dataset(config,preprocess_logger)

wav2mov/main/main.py

Line 40 in 2a765a1

train_model(args_options,params,config,train_logger)

wav2mov/main/main.py

Line 43 in 2a765a1

test_model(args_options,params,config,test_logger)
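
One possible shape for such a container is a small frozen dataclass. This is only a sketch: the class name `RunContext`, the `device` option, the `log_dir` key, and the sample values are hypothetical, and `train_model` here is a stand-in that just shows the simplified call signature.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Mapping


@dataclass(frozen=True)
class RunContext:
    """Hypothetical container bundling the three parameter sources."""
    options: Any                 # parsed command-line arguments
    hparams: Mapping[str, Any]   # hyperparameters
    config: Mapping[str, Any]    # paths and other configuration


def train_model(ctx, logger=None):
    # Stand-in for the real entry point; only demonstrates the access pattern.
    return ctx.hparams['data']['video_fps'], ctx.options.device


ctx = RunContext(
    options=SimpleNamespace(device='cpu'),   # illustrative values only
    hparams={'data': {'video_fps': 24}},
    config={'log_dir': 'logs'},
)
result = train_model(ctx)
```

With this, the three call sites would each take one context argument plus a logger, instead of a varying subset of three objects.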

No lip variation in the generated frames.

The sync and sequence discriminators don't seem to learn.

Memory overflow with batching

Adding batching of videos results in a memory error in both the local and Colab environments.
This is probably because the reference (still) images are copied along the frame dimension before being fed into the generator and the identity discriminator.

wav2mov/models/wav2mov_v7.py

Lines 112 to 126 in aea7f56

    def set_input(self, still_images, audio,audio_frames, video):
        """
        still_images : (B,F,C,H,W) : (B*F,C,H,W)
        video : (B,F,C,H,W) : (B*F,C,H,W)
        audio_frames : (B,F,feat)
        audio : (B,feat)
        """
        batch_size,num_frames,channels,height,width = video.shape
        self.curr_batch_size = batch_size
        still_images = still_images.to(self.device)
        still_images = still_images.repeat((1,num_frames,1,1,1))
        self.still_images = still_images
        self.curr_real_video_frames = video.to(self.device)
        self.curr_real_audio_frames = audio_frames.to(self.device)
        self.curr_audio = audio
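
If the copy made by `repeat` is indeed the cause, one option worth trying is `expand`, which returns a broadcast view instead of allocating new memory. The sketch below assumes the still images arrive with a singleton frame dimension, shape (B,1,C,H,W), which is what the `repeat((1,num_frames,1,1,1))` call suggests; the sizes are arbitrary.

```python
import torch

batch_size, num_frames = 2, 5
still_images = torch.rand(batch_size, 1, 3, 8, 8)  # assumed (B, 1, C, H, W)

# repeat() allocates a new tensor num_frames times larger.
repeated = still_images.repeat((1, num_frames, 1, 1, 1))

# expand() only changes the strides; no pixel data is copied.
expanded = still_images.expand(-1, num_frames, -1, -1, -1)

same_values = torch.equal(repeated, expanded)
shares_memory = expanded.data_ptr() == still_images.data_ptr()
```

The saving only lasts while the tensor stays a view: a later `reshape` to (B*F,C,H,W) has to materialize the copy again, so processing fewer frames at a time may still be needed.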

[BUG] Output video contains multiple faces in the same frame

The produced video has multiple faces in the same frame.
The duration of the video is also incorrect.

wav2mov/utils/plots.py

Lines 57 to 101 in 0afbf1a

    def save_video(hparams,filepath,audio,video_frames):
        def get_video_frames(idx):
            idx = int(idx)
            # logger.debug(f'{video_frames.shape} ,{video_frames[idx].shape}')
            frame = video_frames[idx].permute(1,2,0).squeeze()
            return frame.cpu().numpy().astype(np.uint8)

        logger.debug('saving video please wait...')
        num_frames = video_frames.shape[0]
        video_fps = hparams['data']['video_fps']
        audio_sf = hparams['data']['audio_sf']
        duration = audio.squeeze().shape[0]/audio_sf
        # duration = 10
        # duration = math.ceil(num_frames/video_fps)
        logger.debug(f'duration {duration} seconds')
        dir_name = os.path.dirname(filepath)
        temp_audio_path = os.path.join(dir_name,'temp','temp_audio.wav')
        os.makedirs(os.path.dirname(temp_audio_path),exist_ok=True)
        # print(audio.cpu().numpy().reshape(-1).shape)
        write_audio(temp_audio_path,audio_sf,audio.cpu().numpy().reshape(-1))

        video_clip = mpy.VideoClip(make_frame=get_video_frames,duration=duration)
        audio_clip = mpy.AudioFileClip(temp_audio_path,fps=audio_sf)
        video_clip = video_clip.set_audio(audio_clip)
        # print(filepath,video_clip.write_videofile.__doc__)
        video_clip.write_videofile( filepath,
                                    fps=video_fps,
                                    codec="png",
                                    bitrate=None,
                                    audio=True,
                                    audio_fps=audio_sf,
                                    preset="medium",
                                    # audio_nbytes=4,
                                    audio_codec=None,
                                    audio_bitrate=None,
                                    # audio_bufsize=2000,
                                    temp_audiofile=None,
                                    # temp_audiofile_path="",
                                    remove_temp=True,
                                    write_logfile=False,
                                    threads=None,
                                    ffmpeg_params=['-s','256x256','-aspect','1:1'],
                                    logger="bar",
                                    # pixel_format='gray
                                    )
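
One likely contributor to the wrong duration: moviepy calls `make_frame(t)` with a time in seconds, not a frame index, so `int(idx)` selects only one frame per second of video. The sketch below shows the time-to-index mapping and a duration derived from the frame count. Both helper names are hypothetical, and this does not claim to explain the multiple-faces symptom.

```python
def frame_index_at(t, video_fps, num_frames):
    """Map a timestamp in seconds to a valid frame index."""
    # The small epsilon absorbs floating-point error in t * video_fps.
    idx = int(t * video_fps + 1e-6)
    return min(idx, num_frames - 1)  # clamp so the last timestamp stays in range


def clip_duration(num_frames, video_fps):
    """Duration in seconds implied by the number of generated frames."""
    return num_frames / video_fps


duration = clip_duration(num_frames=75, video_fps=25)
middle = frame_index_at(1.5, video_fps=25, num_frames=75)
```

Inside `save_video`, the nested callback would then index `video_frames[frame_index_at(t, video_fps, num_frames)]`, and the clip duration would come from the frames instead of the audio length.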

[ENHANCEMENT] writing frames to video file

Add a util function to save generated fake frames into a video file using one of the following

Tool	References
torchvision	docs, github
ffmpeg	article
opencv	docs, stackoverflow
moviepy	docs

Add the function here:
https://github.com/PrashanthaTP/wav2mov/blob/44f507e586da9aff1a59c8d45525c2b8aba69bdb/utils/plots.py
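
Whichever tool is chosen, the frames usually have to be converted to 8-bit, channels-last images first. The helper below is a sketch of that step, assuming the generator outputs float frames of shape (F,C,H,W) in the range [-1, 1]; both the range and the name `to_uint8_frames` are assumptions, not taken from the repository.

```python
import torch


def to_uint8_frames(frames):
    """(F, C, H, W) float in [-1, 1] -> (F, H, W, C) uint8 in [0, 255]."""
    frames = (frames.clamp(-1.0, 1.0) + 1.0) * 127.5
    return frames.round().to(torch.uint8).permute(0, 2, 3, 1).contiguous()


fake_frames = torch.rand(5, 3, 64, 64) * 2 - 1   # random stand-in for generated frames
video_array = to_uint8_frames(fake_frames)
```

The resulting (F,H,W,C) uint8 layout is what torchvision's `write_video` expects, and it can be turned into per-frame NumPy arrays for OpenCV or moviepy.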
