yerfor / syntaspeech
SyntaSpeech: Syntax-aware Generative Adversarial Text-to-Speech; IJCAI 2022; Official code
License: MIT License
Hi, I would like to adapt the model pretrained on LibriTTS to two target speakers, with about 40 minutes of training data for each.
Could you please share how you would approach fine-tuning it?
For example: which modules to freeze, whether to decrease the learning rate, and whether this is feasible at all with that amount of data.
Any information would be useful.
Thanks for your work and have a good day.
Hello, I have been experimenting with this excellent repo, but DDP does not seem to have any effect. Am I launching it the wrong way?
CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset
The code groups MFA inputs for better parallelism. With multi-speaker data, this may go wrong.
For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1
the resulting TextGrid is:
item [1]:
class = "IntervalTier"
name = "words"
xmin = 0.0
xmax = 9.4444
intervals: size = 56
intervals [1]:
xmin = 0
xmax = 0.5700000000000001
text = ""
intervals [2]:
xmin = 0.5700000000000001
xmax = 0.61
text = "eng"
intervals [3]:
xmin = 0.61
xmax = 0.79
text = "s_an1"
intervals [4]:
xmin = 0.79
xmax = 0.89
text = "eng"
intervals [5]:
xmin = 0.89
xmax = 1.06
text = "i1"
intervals [6]:
xmin = 1.06
xmax = 1.24
text = "eng"
intervals [7]:
xmin = 1.24
xmax = 1.3
text = ""
intervals [8]:
xmin = 1.3
xmax = 1.36
text = "s_an1"
intervals [9]:
xmin = 1.36
xmax = 1.42
text = ""
intervals [10]:
xmin = 1.42
xmax = 1.49
text = "eng"
intervals [11]:
xmin = 1.49
xmax = 1.67
text = "s_i4"
intervals [12]:
xmin = 1.67
xmax = 1.78
text = "eng"
intervals [13]:
xmin = 1.78
xmax = 1.91
text = ""
intervals [14]:
xmin = 1.91
xmax = 1.96
text = "er4"
intervals [15]:
xmin = 1.96
xmax = 2.06
text = "eng"
intervals [16]:
xmin = 2.06
xmax = 2.19
text = ""
intervals [17]:
xmin = 2.19
xmax = 2.35
text = "i1"
intervals [18]:
xmin = 2.35
xmax = 2.53
text = "eng"
intervals [19]:
xmin = 2.53
xmax = 3.03
text = "i1"
intervals [20]:
xmin = 3.03
xmax = 3.42
text = "eng"
intervals [21]:
xmin = 3.42
xmax = 3.48
text = "i1"
intervals [22]:
xmin = 3.48
xmax = 3.6
text = ""
intervals [23]:
xmin = 3.6
xmax = 3.64
text = "eng"
intervals [24]:
xmin = 3.64
xmax = 3.86
text = "i1"
intervals [25]:
xmin = 3.86
xmax = 3.99
text = "eng"
intervals [26]:
xmin = 3.99
xmax = 4.59
text = ""
intervals [27]:
xmin = 4.59
xmax = 4.869999999999999
text = "er4"
intervals [28]:
xmin = 4.869999999999999
xmax = 4.9799999999999995
text = "eng"
intervals [29]:
xmin = 4.9799999999999995
xmax = 5.1899999999999995
text = "s_i4"
intervals [30]:
xmin = 5.1899999999999995
xmax = 5.34
text = ""
intervals [31]:
xmin = 5.34
xmax = 5.43
text = "eng"
intervals [32]:
xmin = 5.43
xmax = 5.6
text = ""
intervals [33]:
xmin = 5.6
xmax = 5.76
text = "i1"
intervals [34]:
xmin = 5.76
xmax = 6.279999999999999
text = "eng"
intervals [35]:
xmin = 6.279999999999999
xmax = 6.359999999999999
text = "s_an1"
intervals [36]:
xmin = 6.359999999999999
xmax = 6.47
text = ""
intervals [37]:
xmin = 6.47
xmax = 6.6
text = "eng"
intervals [38]:
xmin = 6.6
xmax = 6.9399999999999995
text = "i1"
intervals [39]:
xmin = 6.9399999999999995
xmax = 7.039999999999999
text = "eng"
intervals [40]:
xmin = 7.039999999999999
xmax = 7.289999999999999
text = "s_an1"
intervals [41]:
xmin = 7.289999999999999
xmax = 7.369999999999999
text = "eng"
intervals [42]:
xmin = 7.369999999999999
xmax = 7.6
text = "s_i4"
intervals [43]:
xmin = 7.6
xmax = 7.699999999999999
text = "eng"
intervals [44]:
xmin = 7.699999999999999
xmax = 7.869999999999999
text = ""
intervals [45]:
xmin = 7.869999999999999
xmax = 8.049999999999999
text = "er4"
intervals [46]:
xmin = 8.049999999999999
xmax = 8.26
text = ""
intervals [47]:
xmin = 8.26
xmax = 8.299999999999999
text = "eng"
intervals [48]:
xmin = 8.299999999999999
xmax = 8.36
text = "s_i4"
intervals [49]:
xmin = 8.36
xmax = 8.389999999999999
text = ""
intervals [50]:
xmin = 8.389999999999999
xmax = 8.42
text = "eng"
intervals [51]:
xmin = 8.42
xmax = 8.45
text = ""
intervals [52]:
xmin = 8.45
xmax = 8.59
text = "s_an1"
intervals [53]:
xmin = 8.59
xmax = 8.83
text = ""
intervals [54]:
xmin = 8.83
xmax = 9.1
text = "eng"
intervals [55]:
xmin = 9.1
xmax = 9.44
text = "i1"
intervals [56]:
xmin = 9.44
xmax = 9.4444
text = ""
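One way to see the mismatch at a glance is to compare the labels in the words tier with the words that were sent to the aligner. The sketch below is only illustrative: `read_interval_labels` and `unexpected_labels` are hypothetical helpers, not part of SyntaSpeech or MFA, and the regular expression simply collects every `text = "..."` line rather than parsing the TextGrid format properly.

```python
import re


def read_interval_labels(textgrid_text):
    """Collect the non-empty interval labels, in order of appearance."""
    labels = re.findall(r'text\s*=\s*"([^"]*)"', textgrid_text)
    return [label for label in labels if label]


def unexpected_labels(textgrid_text, expected_words):
    """Return the aligned labels that never occur in the input word sequence."""
    expected = set(expected_words)
    found = read_interval_labels(textgrid_text)
    return sorted({label for label in found if label not in expected})


if __name__ == "__main__":
    # A shortened excerpt in the same shape as the TextGrid above.
    sample = '''
    intervals [2]:
        text = "eng"
    intervals [3]:
        text = "s_an1"
    intervals [11]:
        text = "s_i4"
    intervals [13]:
        text = ""
    '''
    input_words = "g_uang3 zh_ou1 n_v3 d_a4 s_i4 t_ian1".split()
    print(unexpected_labels(sample, input_words))
```

Any label printed here was aligned to the audio but never appeared in the input, which would be consistent with the utterance having been aligned against another utterance's transcript.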
Could you share your pretrained model? Thanks.
I saw that you have trained on the LibriTTS dataset. Have you tested on AISHELL3 or other multi-speaker Chinese data?
Were the results not so good, or was there some other problem?
005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4?
ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4
txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]
ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']
What is 'a_?_n_ao3'?
Entries such as ch_a1_d_ou1 and a_?_n_ao3 appear in the mfa_dict.
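In the dump above, the phonemes appear to drift away from their characters once the "?" is reached, so a punctuation mark ends up inside a word's phoneme list. A small check along the following lines can locate such entries before they reach the MFA dictionary. This is a sketch under assumptions: `find_misaligned_entries` and `PUNCTUATION` are hypothetical names, and the input is assumed to have the same `[word, [phonemes]]` shape as `txt_struct`.

```python
# Punctuation marks that should never be treated as phonemes (illustrative set).
PUNCTUATION = set("?!,.;:？！，。；：、")


def find_misaligned_entries(txt_struct):
    """Return (index, word, phonemes) for entries that look misaligned."""
    problems = []
    for index, (word, phonemes) in enumerate(txt_struct):
        real_phonemes = [ph for ph in phonemes if ph and ph not in PUNCTUATION]
        is_punct_word = word in PUNCTUATION
        # Case 1: a normal word whose phoneme list contains punctuation.
        punct_inside = not is_punct_word and any(ph in PUNCTUATION for ph in phonemes)
        # Case 2: a punctuation mark that was given real phonemes.
        punct_with_phonemes = is_punct_word and bool(real_phonemes)
        # Case 3: a normal word that received no phonemes at all.
        word_without_phonemes = bool(word) and not is_punct_word and not real_phonemes
        if punct_inside or punct_with_phonemes or word_without_phonemes:
            problems.append((index, word, phonemes))
    return problems


if __name__ == "__main__":
    excerpt = [
        ["", [""]],
        ["你", ["n", "i3"]],
        ["啊", ["a", "?", "n", "ao3"]],
        ["?", ["z", "i"]],
        ["进", []],
    ]
    for index, word, phonemes in find_misaligned_entries(excerpt):
        print(index, word, phonemes)
```

The first flagged index is usually the place where the character sequence and the phoneme sequence went out of step.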
Hello, if we want to train on the AISHELL3 Chinese multi-speaker dataset, which parts should be changed?
In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121). Can someone help explain why it is computed as
loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]
? I think loss_kl should be computed as loss_kl = logqx.exp()*(logqx - logpx).
I would be very grateful for a reply!
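One possible reading, assuming the latent is sampled from the posterior q before `logqx` and `logpx` are evaluated (an assumption about the code, not something confirmed in this thread): averaging `logqx - logpx` over samples drawn from q is already a Monte Carlo estimate of KL(q || p). The weighting by q comes from the sampling itself, so multiplying by `logqx.exp()` again would apply it twice. The sketch below checks this on two 1-D Gaussians using only the standard library; all function names are illustrative and none of them come from the repo.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def analytic_kl(mean_q, std_q, mean_p, std_p):
    """Closed-form KL(q || p) between two 1-D Gaussians."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
        - 0.5
    )


def monte_carlo_kl(mean_q, std_q, mean_p, std_p, num_samples=200_000, seed=0):
    """Estimate KL(q || p) as the mean of log q(z) - log p(z) with z drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mean_q, std_q)  # the sample comes from q, not from a grid
        total += gaussian_log_prob(z, mean_q, std_q) - gaussian_log_prob(z, mean_p, std_p)
    return total / num_samples


if __name__ == "__main__":
    print("analytic   :", analytic_kl(0.5, 0.8, 0.0, 1.0))
    print("monte carlo:", monte_carlo_kl(0.5, 0.8, 0.0, 1.0))
```

The form `q(z) * (log q(z) - log p(z))` is the integrand of the KL divergence and would be the right thing to sum over a fixed grid of z values; it is not needed when z is itself a sample from q.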
The discriminator's output['y_c'] is never used, and it is never calculated in the discriminator's forward function. What does this variable mean?
SyntaSpeech/tasks/tts/synta.py
Line 81 in 5b07439