A common way to construct a large vocabulary continuous speech recogniser (LVCSR) is to use three-state hidden Markov models (HMMs) to model phonemic units. This dissertation aims to improve on this standard phone model. To this end, three alternative phone recognition systems are proposed. Central to the first two systems is a set of Acoustic SubWord Units (ASWUs), which are used to train phone models with an extended state topology. This extended topology contains several parallel paths and allows the model to vary the number of states employed for each realisation of a phone.
In the first system this topology is fixed, with four parallel paths that contain one, two, three and four states respectively. A novel training algorithm is developed to train each of the states properly. In the second system the number of paths and the number of states in each of the paths are derived in a data-driven manner using an algorithm for pronunciation variation modelling (PVM). This algorithm is applied to the set of ASWUs to find variations for each phone, and these variations are then used to decide the topologies.
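As an illustration of the fixed topology of the first system, the following minimal sketch builds a transition matrix with four parallel left-to-right paths of one, two, three and four emitting states. The non-emitting entry and exit states, the uniform path entry probabilities, the self-loop probability of 0.5 and the function name `parallel_path_topology` are assumptions made for this example; they are not taken from the dissertation, and in practice the probabilities would be estimated during training.

```python
def parallel_path_topology(path_lengths=(1, 2, 3, 4), self_loop=0.5):
    """Build a transition matrix for a phone model with parallel paths.

    State 0 is a non-emitting entry state and the last state is a
    non-emitting exit state (both are assumptions of this sketch).
    Returns the matrix and the list of emitting state indices per path.
    """
    n_states = sum(path_lengths) + 2
    exit_state = n_states - 1
    trans = [[0.0] * n_states for _ in range(n_states)]
    paths = []
    next_state = 1
    for length in path_lengths:
        states = list(range(next_state, next_state + length))
        next_state += length
        paths.append(states)
        # Placeholder: every path is entered with equal probability.
        trans[0][states[0]] = 1.0 / len(path_lengths)
        for i, s in enumerate(states):
            # Left-to-right within a path: stay, or move one state forward.
            trans[s][s] = self_loop
            target = states[i + 1] if i + 1 < length else exit_state
            trans[s][target] = 1.0 - self_loop
    return trans, paths


trans, paths = parallel_path_topology()
for path in paths:
    print(len(path), "state path uses states", path)
```

Because the paths share only the entry and exit states, a realisation of the phone passes through exactly one path, so the number of emitting states it visits is one, two, three or four.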
The final system is a hybrid system that employs non-negative matrix factorisation (NMF), an algorithm capable of extracting latent units in a data-driven manner, to model the acoustic observations. This hybrid, the non-negative HMM (N-HMM), was previously proposed in the literature for modelling audio mixtures. In this dissertation, modifications to the original N-HMM are suggested so that it can be used for speech recognition. The main contribution is to make the output probability distribution functions dependent on state duration. The modified structure is referred to as the non-negative durational HMM (NdHMM).
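For readers unfamiliar with NMF, the sketch below shows a generic factorisation of a non-negative matrix V into non-negative factors W and H using the standard multiplicative updates for the squared Euclidean error. It illustrates plain NMF only; it is not the N-HMM or the NdHMM, and the toy matrix, the rank, the iteration count and all function names are assumptions chosen for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Squared Euclidean distance between V and its approximation WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (n x m) into W (n x rank) and H (rank x m), all >= 0."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_cols)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(rank)] for i in range(n_rows)]
    return w, h


# Toy non-negative data; in a speech setting the columns of V could be
# magnitude spectra and the columns of W the extracted latent units.
v = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0],
     [2.0, 2.0, 4.0]]
w, h = nmf(v, rank=2)
print("reconstruction error:", squared_error(v, w, h))
```

The updates only ever multiply by non-negative ratios, so W and H stay non-negative throughout, which is what lets the columns of W be read as additive latent units.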