How to train Syllables instead of phones using HTK?

2018-06-12 00:38:37

I am trying to develop a Speech to Text conversion system using the Hidden Markov Model Toolkit (HTK). There are several possible methods to model HMMs for isolated word recognition:

Using Phones (monophones and Triphones)

Using Syllables

Using Words

The HTK Manual and the VoxForge tutorial on HTK give step-by-step instructions on how to model HMMs for every monophone or triphone. I have implemented this for an Indian language (Kannada) with success. However, neither source explains how to do the same with syllable-based HMM modelling.

According to a paper on ASR using context-dependent syllables:

Neither HTK nor LASER had the direct support for working with syllables we had to implement a transformation algorithm. This only transforms configuring files for monophone ASR into the form for syllable ASR.

They say we need to modify the configuration files to suit syllables instead of monophones. Following the 10 steps in the VoxForge tutorial, the configuration files are as follows:

config, wav_config: Specify the configuration parameters for generating feature vectors from the training wav files.

sample.jconf: Configuration for using Julius for speech recognition.

However, changing these files in any way doesn't help in recognising syllables.

Instead, I followed steps similar to phone-based modelling, i.e. I created a monophone pronunciation dictionary and trained the HMMs for 9 rounds. This is equivalent to the first 8 steps of the VoxForge tutorial. For triphone modelling, however, instead of creating triphones across every whole word, I split every word into syllables and created triphones (and biphones) within each syllable. You may not be able to read Kannada, but I'll show a few samples so you can see what I'm dealing with.

With ordinary triphone modelling, part of my triphone dictionary dict-tri looked like this (and worked perfectly):

ಅತಿ            ಅ+ತ್ ಅ-ತ್+ಇ ತ್-ಇ sp
ಅದರ        ಅ+ದ್ ಅ-ದ್+ಅ ದ್-ಅ+ರ್ ಅ-ರ್+ಅ ರ್-ಅ sp
ಅದು            ಅ+ದ್ ಅ-ದ್+ಉ ದ್-ಉ sp
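
The entries above follow the usual word-internal pattern: each phone is labelled with its left and right neighbours inside the word, and the word's first and last phones get only one-sided context. A minimal Python sketch of that expansion is shown here for illustration; the function name is hypothetical and this is not how HTK itself produces the labels.

```python
def word_internal_triphones(phones):
    """Expand one word's monophone sequence into word-internal context labels.

    A phone with both neighbours becomes "left-phone+right"; the first and
    last phones of the word only get the context that exists inside the word.
    """
    labels = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = phones[i - 1] + "-" + label
        if i < last:
            label = label + "+" + phones[i + 1]
        labels.append(label)
    return labels


# First dictionary entry from the post, with the trailing short pause appended.
print(" ".join(word_internal_triphones(["ಅ", "ತ್", "ಇ"]) + ["sp"]))
```

For the first word this reproduces the three labels shown in the dict-tri sample, followed by sp.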

In the new method of breaking words down into syllables and converting each syllable to mono-, bi- and triphones, the same dictionary entries look like this:

ಅತಿ          ಅ ತ್+ಇ ತ್-ಇ sp
ಅದರ         ಅ ದ್+ಅ ದ್-ಅ ರ್+ಅ ರ್-ಅ sp
ಅದು          ಅ ದ್+ಉ ದ್-ಉ sp
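
The syllable-based entries above can be described as the same context expansion applied inside each syllable rather than across the whole word, so no context crosses a syllable boundary. The sketch below assumes the word has already been split into syllables (the syllabification itself is not shown) and uses hypothetical function names.

```python
def context_labels(phones):
    """Label each phone with the neighbours it has inside this unit."""
    labels = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = phones[i - 1] + "-" + label
        if i < last:
            label = label + "+" + phones[i + 1]
        labels.append(label)
    return labels


def syllable_internal_labels(syllables):
    """Expand each syllable separately and concatenate the results.

    `syllables` is a list of phone lists, one per syllable, so a
    one-phone syllable stays a plain monophone.
    """
    labels = []
    for syllable in syllables:
        labels.extend(context_labels(syllable))
    return labels


# Second dictionary entry from the post, pre-split into three syllables.
syllables = [["ಅ"], ["ದ್", "ಅ"], ["ರ್", "ಅ"]]
print(" ".join(syllable_internal_labels(syllables) + ["sp"]))
```

A three-phone syllable yields a left biphone, one triphone and a right biphone, which matches the three kinds of context-dependent label that appear in the model list further down.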

Even though you may not be able to read the language, it should be clear that I aim to reduce the number of HMM models significantly. Yet when running Julius, I got the following error:

    STAT: include config: sample.jconf
    STAT: jconf successfully finalized
    STAT: *** loading AM00 _default
    Stat: init_phmm: Reading in HMM definition
    Stat: rdhmmdef: ascii format HMM definition
    Stat: rdhmmdef: limit check passed
    Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
    Stat: rdhmmdef: this HMM requires multipath handling at decoding
    Stat: rdhmmdef: no <SID> embedded
    Stat: rdhmmdef: assign SID by the order of appearance
    Stat: init_phmm: defined HMMs:   200
    Stat: init_phmm: loading ascii hmmlist
    Stat: init_phmm: logical names:   562 in HMMList
    Stat: init_phmm: base phones:    48 used in logical
    Stat: init_phmm: finished reading HMM definitions
    STAT: making pseudo bi/mono-phone for IW-triphone
    Stat: hmm_lookup: 5 pseudo phones are added to logical HMM list
    STAT: *** AM00 _default loaded
    STAT: *** loading LM00 _default
    STAT: reading [sample.dfa] and [sample.dict]...
    Error: voca_load_htkdict: line 3: triphone "*-ಅ+ತ್" or biphone "ಅ+ತ್" not found
    Error: voca_load_htkdict: line 3: triphone "ಅ-ತ್+ಇ" not found
    Error: voca_load_htkdict: the line content was: 2   [ಅತಿ]   ಅ ತ್ ಇ 
    Error: voca_load_htkdict: line 4: triphone "*-ಅ+ತ್" or biphone "ಅ+ತ್" not found
    Error: voca_load_htkdict: line 4: triphone "ಅ-ತ್+ಯ್" not found
    Error: voca_load_htkdict: line 4: triphone "ತ್-ಯ್+ಅ" not found
    Error: voca_load_htkdict: line 4: triphone "ಯ್-ಅ+ದ್" not found
    Error: voca_load_htkdict: line 4: triphone "ಅ-ದ್+ಭ್" not found
    Error: voca_load_htkdict: line 4: triphone "ದ್-ಭ್+ಉ" not found
    Error: voca_load_htkdict: line 4: triphone "ಭ್-ಉ+ತ್" not found
    Error: voca_load_htkdict: line 4: triphone "ಉ-ತ್+ಅ" not found
    Error: voca_load_htkdict: the line content was: 2   [ಅತ್ಯದ್ಭುತ] ಅ ತ್ ಯ್ ಅ ದ್ ಭ್ ಉ ತ್ ಅ 
    Error: voca_load_htkdict: line 5: triphone "*-ಅ+ಥ್" or biphone "ಅ+ಥ್" not found
    Error: voca_load_htkdict: line 5: triphone "ಅ-ಥ್+ಅ" not found
    Error: voca_load_htkdict: line 5: triphone "ಥ್-ಅ+ವ್" not found
    Error: voca_load_htkdict: line 5: triphone "ಅ-ವ್+ಆ" not found
    .
    .
    .
    (over 200 more lines of the same)
    .
    .
    Error: voca_load_htkdict: ಹ್-ಒ+ಯ್
    Error: voca_load_htkdict: ಹ್-ಒ+ರ್
    Error: voca_load_htkdict: ಹ್-ಒ+ಳ್
    Error: voca_load_htkdict: ಹ್-ಓ+ಗ್
    Error: voca_load_htkdict: ಹ್-ಓ+ರ್
    Error: voca_load_htkdict: ಹ್-ಯ್+ಅ
    Error: voca_load_htkdict: ಹ್-ಯ್+ಆ
    Error: voca_load_htkdict: end missing phones
    Error: init_voca: error in reading sample.dict: 748 words failed out of 7 words
    ERROR: failed to read dictionary "sample.dict"
    ERROR: m_fusion: some error occured in reading grammars
    ERROR: Error in loading model
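
With several hundred error lines, it is easier to compare the reported names against the trained models mechanically. The sketch below is a hypothetical helper, not part of Julius or HTK: it pulls every quoted model name out of the "not found" lines in the format shown in the log above and splits them by whether they appear in a given list of trained labels. Names beginning with `*` are skipped on the assumption that they are wildcard patterns rather than model names.

```python
import re

QUOTED = re.compile(r'"([^"]+)"')


def missing_labels(log_text):
    """Collect the quoted model names from voca_load_htkdict 'not found' lines."""
    missing = set()
    for line in log_text.splitlines():
        if "voca_load_htkdict" in line and line.rstrip().endswith("not found"):
            for name in QUOTED.findall(line):
                if not name.startswith("*"):  # assumed wildcard, not a model name
                    missing.add(name)
    return missing


def split_by_training(missing, trained):
    """Return (reported missing but trained, reported missing and untrained)."""
    trained = set(trained)
    return sorted(missing & trained), sorted(missing - trained)


# Two error lines copied from the log, and the labels used for that word.
log = '''Error: voca_load_htkdict: line 3: triphone "*-ಅ+ತ್" or biphone "ಅ+ತ್" not found
Error: voca_load_htkdict: line 3: triphone "ಅ-ತ್+ಇ" not found'''
trained = ["ಅ", "ತ್+ಇ", "ತ್-ಇ"]

in_model_list, not_in_model_list = split_by_training(missing_labels(log), trained)
print("reported missing but trained:", in_model_list)
print("reported missing and untrained:", not_in_model_list)
```

In practice the full log and the full model list would be read from files; the inline strings here only keep the example self-contained.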

What exactly have I done wrong? From the looks of it, the error says I have fed it triphones that have not been modelled. But all the errors displayed are for triphones that I have not even used! For reference, here is the list of ALL mono-, bi- and triphones used to create syllables (the triphones1 replacement):

ಲ್-ಅ+ಮ್
ತ್-ಇ+ಮ್
ಘ್-ಅ+ಮ್
ನ್-ಊ+ಮ್
ಛ್-ಈ
ಛ್-ಅ
ಛ್-ಆ
ಛ್-ಇ
ಭ್-ಆ
ಭ್-ಇ
ಭ್-ಅ
ಭ್-ಊ
ಭ್-ಈ
ಭ್-ಉ
ಕ್-ಒ+ಮ್
ಭ್-ಏ
ಬ್+ಓ
ಬ್+ಒ
ಞ್+ಆ
ಬ್-ಅ+ಮ್
ಬ್+ಔ
ಬ್+ಇ
ಬ್+ಆ
ಬ್+ಅ
ಬ್+ಉ
ಬ್+ಈ
ಬ್+ಏ
ಬ್+ಎ
ರ್-ಒ+ಮ್
ತ್+ಊ
ತ್+ಉ
ಲ್-ಆ+ಮ್
ತ್+ಏ
ತ್+ಎ
ತ್+ಇ
ತ್+ಆ
ತ್+ಅ
ತ್+ಐ
ಕ್-ಇ+ಮ್
ಹ್-ಒ
ಹ್-ಓ
ಹ್-ಎ
ಹ್-ಈ
ಹ್-ಉ
ಹ್-ಆ
ಹ್-ಇ
ಖ್-ಅ+ಮ್
ಹ್-ಅ
ವ್-ಒ
ವ್-ಐ
ಖ್+ಊ
ಖ್
ವ್-ಉ
ವ್-ಎ
ವ್-ಏ
ಢ್+ಇ
ಢ್+ಆ
ವ್-ಆ
ವ್-ಇ
ವ್-ಅ
ಕ್-ಆ+ಮ್
ಹ್-ಅ+ಮ್
ಠ್-ಈ
ಠ್-ಏ
ಠ್-ಅ
ಳ್-ಎ+ಮ್
ಸ್-ಎ+ಮ್
ಸ್-ಅ
ಸ್-ಇ
ಸ್-ಆ
ಸ್-ಉ
ಸ್-ಈ
ದ್-ಅ+ಮ್
ಸ್-ಊ
ಶ್-ಇ
ಶ್-ಆ
ಶ್-ಅ
ಸ್-ಎ
ಸ್-ಐ
ಸ್-ಓ
ಸ್-ಒ
ಸ್-ಔ
ಗ್-ಅ+ಮ್
ಡ್-ಉ
ಶ್+ಅ
ಶ್+ಇ
ಶ್+ಆ
ಡ್-ಇ
ದ್
ಡ್-ಅ
ಡ್-ಎ
ಪ್-ಔ
ಪ್-ಓ
ಪ್-ಐ
ಧ್+ಊ
ಪ್-ಎ
ಧ್+ಈ
ಧ್+ಉ
ಪ್-ಊ
ಪ್-ಉ
ಪ್-ಇ
ಪ್-ಆ
ಪ್-ಅ
ಯ್-ಅ
ನ್
ಧ್+ಇ
ಧ್+ಅ
ಜ್
ಧ್+ಆ
ಷ್-ಈ
ತ್-ಅ+ಮ್
ಷ್-ಏ
ಷ್-ಅ
ಷ್-ಆ
ಷ್-ಇ
ಛ್+ಇ
sp
ಹ್
ಞ್-ಆ
ಥ್-ಈ
ಥ್-ಎ
ಥ್-ಏ
ಥ್-ಆ
ಥ್-ಇ
ಥ್-ಅ
ಚ್
ಯ್
ಮ್
ಟ್
ಳ್+ಎ
ಜ್+ಇ
ಜ್+ಆ
ಜ್+ಅ
ಳ್+ಉ
ಳ್+ಆ
ಳ್+ಇ
ಜ್+ಉ
ಳ್+ಅ
ಜ್+ಎ
ಖ್+ಐ
ಜ್+ಓ
ಜ್+ಐ
ಊ+ಮ್
ಳ್
ಶ್
ರ್-ಎ+ಮ್
ರ್-ಓ
ಆ-ಮ್
ರ್-ಉ
ಗ್+ಒ
ರ್-ಊ
ರ್-ಏ
ರ್-ಎ
ರ್-ಅ
ಟ್-ಊ
ರ್-ಇ
ರ್-ಆ
ಚ್+ಉ
ಚ್+ಈ
ಟ್-ಋ
ಬ್-ಎ+ಮ್
ಚ್+ಎ
ಚ್+ಅ
ಚ್+ಇ
ಚ್+ಆ
ದ್-ಒ+ಮ್
ನ್+ಓ
ನ್+ಒ
ನ್+ಐ
ನ್+ಇ
ನ್+ಆ
ನ್+ಅ
ಗ್-ಒ+ಮ್
ನ್+ಏ
ನ್+ಎ
ನ್+ಊ
ನ್+ಉ
ನ್+ಈ
ತ್-ಉ+ಮ್
ಧ್-ಈ
ಧ್-ಉ
ಧ್-ಊ
ಟ್+ಆ
ಟ್+ಇ
ಟ್+ಅ
ಟ್+ಊ
ಟ್+ಋ
ಟ್+ಈ
ಟ್+ಉ
ಧ್-ಅ
ಧ್-ಆ
ಧ್-ಇ
ಗ್+ಊ
ಚ್-ಎ
ಗ್+ಈ
ಗ್+ಉ
ಗ್+ಎ
ಗ್+ಏ
ಚ್-ಉ
ಚ್-ಈ
ಚ್-ಇ
ಚ್-ಆ
ಚ್-ಅ
ಗ್+ಆ
ಗ್+ಇ
ಖ್+ಓ
ಗ್+ಅ
ಖ್+ಏ
ತ್+ಈ
ಖ್+ಉ
ಟ್-ಅ
ಟ್-ಆ
ಟ್-ಇ
ಖ್+ಅ
ಟ್-ಉ
ಖ್+ಇ
ಖ್+ಆ
ನ್-ಅ+ಮ್
ಬ್-ಔ
ವ್+ಐ
ವ್+ಒ
ಒ-ಮ್
ಫ್-ಎ
ಸ್-ಅ+ಮ್
ವ್+ಉ
ವ್+ಎ
ವ್+ಏ
ವ್+ಅ
ವ್+ಆ
ಯ್-ಅ+ಮ್
ಹ್+ಉ
ಕ್-ಅ+ಮ್
ದ್-ಇ+ಮ್
ವ್+ಇ
ಖ್-ಇ
ಡ್-ಅ+ಮ್
ಥ್+ಅ
ಖ್-ಅ
ಸ್+ಇ
ಸ್+ಆ
ಸ್+ಅ
ಫ್+ಅ
ರ್+ಎ
ಫ್+ಎ
ಸ್+ಎ
ಎ+ಮ್
ಸ್+ಉ
ಸ್+ಈ
ಖ್-ಆ
ಸ್+ಔ
ಸ್+ಓ
ಸ್+ಒ
ಸ್+ಐ
ಗ್-ಆ+ಮ್
ಖ್-ಓ
ದ್+ಋ
ಠ್+ಏ
ಠ್+ಈ
ಠ್+ಅ
ಣ್+ಎ
ಣ್+ಉ
ಣ್+ಆ
ಣ್+ಇ
ಣ್+ಅ
ಏ-ಮ್
ಲ್+ಅ
ಲ್+ಇ
ಲ್+ಆ
ಲ್+ಉ
ರ್-ಇ+ಮ್
ಲ್+ಊ
ಲ್+ಏ
ಲ್+ಎ
ನ್-ಇ+ಮ್
ಬ್-ಅ
ಬ್-ಇ
ಬ್-ಆ
ಬ್-ಏ
ಬ್-ಎ
ಬ್-ಉ
ಬ್-ಈ
ಛ್+ಆ
ರ್-ಆ+ಮ್
ಛ್+ಅ
ಧ್
ಬ್-ಓ
ಬ್-ಒ
ರ್-ಅ+ಮ್
ಕ್-ಏ+ಮ್
ಛ್+ಈ
ಷ್+ಏ
ದ್-ಒ
ದ್-ಋ
ದ್-ಊ
ದ್-ಉ
ದ್-ಈ
ದ್-ಏ
ದ್-ಎ
ದ್-ಇ
ದ್-ಆ
ದ್-ಅ
ಗ್
ತ್-ಐ
ತ್-ಏ
ತ್-ಎ
ತ್-ಉ
ತ್-ಈ
ತ್-ಊ
ತ್-ಅ
ತ್-ಇ
ತ್-ಆ
ಝ್+ಓ
ಯ್-ಆ+ಮ್
ಘ್-ಅ
ಘ್-ಆ
ತ್-ಎ+ಮ್
ಘ್-ಓ
ಢ್+ಅ
ಛ್-ಅ+ಮ್
ಮ್+ಔ
ಸ್-ಇ+ಮ್
ಮ್+ಐ
ಮ್+ಒ
ಮ್+ಅ
ಮ್+ಇ
ಮ್+ಆ
ಮ್+ಏ
ಮ್+ಎ
ಮ್+ಉ
ಮ್+ಈ
ಮ್+ಊ
ಪ್+ಐ
ಪ್+ಓ
ಇ-ಮ್
ನ್-ಒ+ಮ್
ಪ್+ಉ
ಪ್+ಊ
ಪ್+ಎ
ಪ್+ಅ
ಪ್+ಇ
ಪ್+ಆ
ಘ್+ಆ
ಘ್+ಅ
ಮ್-ಅ+ಮ್
ಇ+ಮ್
ಘ್+ಓ
ಒ+ಮ್
ಔ-ಮ್
ಪ್-ಉ+ಮ್
ಬ್
ಳ್+ಊ
ಭ್+ಏ
ಡ್
ಬ್-ಇ+ಮ್
ಟ್-ಈ
ಢ್-ಇ
ಢ್-ಆ
ಢ್-ಅ
ಯ್-ಇ+ಮ್
ಜ್+ಈ
ಷ್-ಎ
ಫ್
ತ್-ಊ+ಮ್
ಥ್+ಈ
ಖ್-ಊ
ಖ್-ಉ
ಖ್-ಏ
ಥ್+ಎ
ಥ್+ಏ
ರ್+ಓ
ರ್+ಒ
ಥ್+ಆ
ಥ್+ಇ
ರ್+ಏ
ನ್-ಎ+ಮ್
ಕ್
ರ್+ಊ
ರ್+ಉ
ರ್+ಈ
ರ್+ಇ
ರ್+ಆ
ರ್+ಅ
ಖ್-ಐ
ಳ್-ಅ
ಳ್-ಆ
ಳ್-ಇ
ಲ್-ಇ+ಮ್
ಳ್-ಎ
ಹ್-ಇ+ಮ್
ಳ್-ಉ
ಳ್-ಊ
ಸ್+ಊ
ಭ್+ಈ
ಲ್
ಔ
ಐ
ಓ
ಒ
ಎ-ಮ್
ಅ
ಇ
ಆ
ಉ-ಮ್
ಸ್+ಏ
ಏ
ಎ
ಉ
ಈ
ಋ
ಊ
ಷ್
ನ್-ಓ
ನ್-ಒ
ಭ್+ಅ
ಭ್+ಆ
ಭ್+ಇ
ತ್
ಭ್+ಉ
ಭ್+ಊ
ದ್+ಈ
ನ್-ಐ
ದ್+ಏ
ದ್+ಎ
ದ್+ಉ
ಜ್-ಐ
ಜ್-ಓ
ದ್+ಊ
ದ್+ಅ
ದ್+ಇ
ದ್+ಆ
ಜ್-ಅ
ಜ್-ಇ
ಜ್-ಆ
ನ್-ಅ
ನ್-ಇ
ನ್-ಆ
ನ್-ಉ
ನ್-ಈ
ನ್-ಊ
ಜ್-ಉ
ಜ್-ಈ
ನ್-ಏ
ನ್-ಎ
ಲ್-ಇ
ಲ್-ಆ
ಲ್-ಅ
ಯ್+ಆ
ಯ್+ಇ
ಯ್+ಅ
ಲ್-ಏ
ಲ್-ಎ
ಯ್+ಉ
ಸ್
ಲ್-ಊ
ಲ್-ಉ
ಯ್+ಓ
ವ್
ಆ+ಮ್
ಕ್-ಎ+ಮ್
ಝ್-ಓ
ಊ-ಮ್
ವ್-ಒ+ಮ್
ಗ್-ಈ
ಗ್-ಉ
ಗ್-ಊ
ಜ್-ಎ
ರ್
ಗ್-ಎ
ಗ್-ಏ
ಕ್-ಒ
ಕ್-ಓ
ಗ್-ಅ
ಗ್-ಆ
ಗ್-ಇ
ಕ್-ಊ
ಕ್-ಋ
ಕ್-ಈ
ಕ್-ಉ
ಕ್-ಎ
ಗ್-ಒ
ದ್+ಒ
ಕ್-ಆ
ಕ್-ಇ
ಕ್-ಅ
ಣ್
ಳ್-ಇ+ಮ್
ಪ್+ಔ
ವ್-ಅ+ಮ್
ಲ್-ಏ+ಮ್
ಹ್+ಒ
ಹ್+ಓ
ಪ್-ಆ+ಮ್
ಹ್+ಅ
ಹ್+ಆ
ಹ್+ಇ
ಫ್-ಅ
ಹ್+ಎ
ಹ್+ಈ
ಕ್-ಉ+ಮ್
ರ್-ಒ
ಯ್+ಎ
ಕ್+ಒ
ಕ್+ಓ
ಯ್+ಏ
ಕ್+ಈ
ಕ್+ಉ
ಕ್+ಊ
ಕ್+ಋ
ಕ್+ಎ
ಕ್+ಏ
ಕ್+ಅ
ಕ್+ಆ
ಕ್+ಇ
ಷ್+ಈ
ಷ್+ಎ
ಮ್-ಅ
ರ್-ಈ
ಹ್-ಒ+ಮ್
ಷ್+ಆ
ಷ್+ಇ
ಷ್+ಅ
ಣ್-ಅ
ಣ್-ಆ
ಣ್-ಇ
ಮ್-ಉ+ಮ್
ಮ್-ಊ
ಣ್-ಎ
ಣ್-ಉ
ಡ್+ಎ
ಡ್+ಉ
ಡ್+ಅ
ಡ್+ಇ
ಪ್
ಅ-ಮ್
ಟ್-ಉ+ಮ್
ಪ್-ಅ+ಮ್
ಸ್-ಔ+ಮ್
ಮ್-ಐ
ಸ್-ಏ
ಮ್-ಇ
ಮ್-ಆ
ಯ್-ಆ
ಯ್-ಇ
ಯ್-ಉ
ಮ್-ಉ
ಮ್-ಈ
ಮ್-ಏ
ಮ್-ಎ
ಯ್-ಎ
ಯ್-ಏ
ಮ್-ಒ
ಯ್-ಓ
ಮ್-ಔ

Regardless, the error I received should be common to every language. What exactly is wrong? What should I do to implement syllable modelling in HTK?
