Alexander Kain — Curriculum Vitae

ORCID 0000-0001-5807-9311

Present Positions

Oregon Health & Science University

Associate Professor
Computer Science & Electrical Engineering
Center for Spoken Language Understanding (CSLU)
Department of Pediatrics
School of Medicine (SOM)
Oregon Health & Science University (OHSU)
3181 SW Sam Jackson Park Road
Portland, Oregon 97239-3098
Phone / Fax: (503) 349-3750 / (503) 346-3754

BioSpeech, Inc.

Chief Scientist
9946 SW 61st Ave
Portland, Oregon 97239-3098


Undergraduate and Graduate


Professional Experience




4.1 Areas of Research/Scholarly Interest

4.2 Grants


  1. 2016/01/01–2018/04/01: National Institutes of Health 1R44DC015145-01, “Prosody Assessment Toolbox”, PI: Connors (BioSpeech). Current instruments for assessing prosodic deficits are decades behind those that are used for clinical assessment of other aspects of language. We propose to build a system that addresses these shortcomings. The system performs automated scoring and acoustic analysis of expressive prosody, allows stimuli to be acoustically modified for detailed perceptual assessment of receptive prosody, and can be extended by researchers to include novel tasks. It is evaluated with individuals who have ASD (adults and children), DS (adults and children), or MCI, and with a typically developing control group. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $1.6M.
  2. 2015/09/01–2020/08/31: National Institutes of Health 2R01DC004689-11A1, "Therapeutic Approaches to Dysarthria: Acoustic and Perceptual Correlates", PI: Tjaden (State University of New York at Buffalo). 90% of the one million Americans living with idiopathic Parkinson’s disease (PD) and 50% of the 500,000 Americans living with Multiple Sclerosis (MS) will experience dysarthria at some point during the disease. The perceptual sequelae of dysarthria have devastating consequences for quality of life and participation in society by virtue of their effect on social and psychological variables such as employment, leisure activities and relationships. Knowledge of therapy techniques for maximizing perceived speech adequacy, as indexed by intelligibility, is therefore of paramount importance. As a result of our incomplete knowledge of the comparative merits of dysarthria therapy techniques and their variants, however, the choice of a particular technique is not based on a rigorous research base, but is based on either trial and error or the clinician’s educational and experience biases. The proposed project will address these barriers by comparing the acoustic and perceptual consequences of rate reduction, increased vocal intensity and clear speech variants in MS and PD. Our approach is to employ established acoustic measures and perceptual paradigms as well as a state-of-the-art speech re-synthesis technique that will permit conclusions concerning the underlying speech production characteristics, as inferred from the acoustic signal, causing improved intelligibility. Amount: $3.3M.
  3. 2014/09/01–2017/08/31: National Institutes of Health 1R43MH101978-01A1, “System for automatic classification of rodent vocalizations”, PI: Lahvis (BioSpeech). Development of treatments for neuropsychiatric disorders presents a formidable challenge. To advance drug discovery, assessments of laboratory rodents are widely employed by academia and industry to model neuropsychiatric disorders. Substantial recent advances in digital recordings of rodent ultrasonic vocalizations (USVs) have engendered interest in assessment of USVs to measure behavior change. A practical obstacle to USV assessment is that USVs are classified manually. We propose a software system that allows a user to rapidly interrogate recordings of rodent USVs for prosodic content. My role: I provide signal processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $324K.


  1. 2010/09/27–2016/09/30: National Science Foundation BCS-1027834, "Computational Models for the Automatic Recognition of Non-Human Primate Social Behaviors", PI: Kain (OHSU). To develop methods that will permit researchers to remotely and automatically monitor behavior of primates and other highly social animals.
  2. 2013/12/01–2015/08/31: National Institutes of Health 1R43DA037588-01A1, "Screening for Sleep Disordered Breathing with Minimally Obtrusive Sensors", PI: Snider (BioSpeech). Sleep disordered breathing (SDB) is believed to be a widespread, under-diagnosed condition associated with detrimental health problems, at a high cost to society. The current gold standard for diagnosis of SDB is a time-consuming, expensive, and obtrusive (requiring many attached wires) sleep study, or polysomnography (PSG). The immediate objective of our research is to develop and evaluate a hardware design and a set of algorithms for automatically detecting obstructive, central, or mixed apneas and hypopneas from acoustic, peripheral oxygen saturation (SpO2), and pulse rate data, using an ambient microphone and a wireless pulse oximeter. The long-term goal is to create a low-cost, easy-to-operate, minimally obtrusive, at-home device that can be used for early and frequent screening for SDB in patients' homes, significantly increasing patient comfort while capturing more representative sleep data compared to a clinical sleep study. In collaboration with Chad Hagen, M. D. at the Sleep Disorders Program at OHSU, we aim to (1) develop a screening system by selecting minimally obtrusive sensor hardware and extending state-of-the-art algorithms for automatically detecting SDB from acoustic, SpO2, and pulse rate data; (2) collect patient data in the sleep lab and at home from representative populations using the proposed system; (3) determine the screening accuracy by comparing the performance of the proposed system on the collected data against standard PSG-derived clinical results; and (4) measure the usability of an at-home screening device by the target population, by asking subjects who participated in the at-home data collection to complete a survey on various aspects of the setup and operation of the proposed system. My role: I provide signal-processing and machine-learning expertise, and assist with all scientific aspects of the project. Amount: $205K.
  3. 2012/04/01–2015/03/31: National Institutes of Health 5R44DC009515-03, "SBIR Phase 2: Computer-based auditory skill building program for aural (re)habilitation", PI: Connors (BioSpeech). To extend an adaptive computer-guided software program that focuses on learning phoneme discrimination and identification. See Phase I description. Amount: $400K.
  4. 2011/12/01–2015/08/31: National Institutes of Health R21DC012139, "Computer-Based Pronunciation Analysis for Children with Speech Sound Disorders", PI: Kain (OHSU). In this work we are developing speech-production assessment and pronunciation training tools for children with speech sound disorders. To date, computer-assisted pronunciation training has not been successfully extended to help children with speech sound disorders, primarily because of a lack of accuracy in phoneme-level analysis of the speech signal. My role: I am creating a set of algorithms that will reliably identify and score the intelligibility of a phoneme within an isolated target word, providing immediate, relevant, and understandable feedback about pronunciation errors. The use of human perceptual data during training is an important and new component of the proposed approach. As PI, I am also responsible for overall project supervision and management. Amount: $416K.
  5. 2010/06/09–2015/05/31: National Science Foundation IIS-0964102, "Semi-Supervised Discriminative Training of Language Models", PI: Kain (OHSU). To conduct fundamental research in statistical language modeling to improve human language technologies, including automatic speech recognition (ASR) and machine translation (MT).
  6. 2010/05/15–2015/04/30: National Science Foundation IIS-0964468, "HCC: Medium: Synthesis and Perception of Speaker Identity", PI: Kain (OHSU). Millions of Americans with impaired or absent speech communication ability rely on Augmentative and Alternative Communication devices with voice output (Speech Generating Devices, or SGDs) to communicate. A psychologically important and desirable feature is the ability to speak with one's own voice, i. e. the ability for the SGD to produce speech that mimics the individual's pre-morbid speech or speech that the individual may be able to intermittently produce. However, current text-to-speech (TTS) systems can only create speech with one or very few supplied speaker characteristics, and cannot be trained to take on the user's voice. My role: Together with Ph. D. students and co-investigators, I am creating a TTS synthesis system that generates speech that sounds like that of a specific individual (Speaker Identity Synthesis, or SIS). In the process we are building and evaluating analysis and synthesis models of the relevant acoustic features, including pitch, duration, and spectrum. Since the system includes a trainability component, this project also involves use of advanced mapping technology in the form of a joint-density Gaussian mixture model. I first proposed this approach in a 1998 publication which has since been cited over 370 times. As PI, I am also responsible for overall project supervision, management, and mentorship of graduate student Mohammadi. Amount: $905K.
  7. 2011/04/01–2012/03/31: National Institutes of Health 5R42DC008712, "User Adaptation of AAC Device Voices - Phase 2", PI: Klabbers (BioSpeech). Developing and evaluating voice transformation and prosody modification technologies to customize synthetic voices in AAC devices, mimicking the individual user's pre-morbid speech. See Phase 1 description.
  8. 2011/03/01–2013/03/31: National Institutes of Health 1R43DC011706-01, "SBIR Phase 1: Computerized System for Phonemic Awareness Intervention", PI: Connors (BioSpeech). Phonemic awareness, defined as “the ability to notice, think about, and work with the individual sounds in spoken words”, is considered a necessary skill for literacy. The financial and quality-of-life costs of these impairments are significant, not only because of the link with reading difficulties and hence with future employability, but also because there may exist further links between reading difficulties and a range of psychiatric disorders. This argues for phonemic awareness intervention beyond what can be taught in a regular pre-school or elementary school curriculum. Such intervention is typically provided in the form of one-on-one sessions with a specialized professional (e. g. a Speech Language Pathologist). However, responding to cost concerns and poor access to these services, and also recognizing the importance of frequent intervention sessions, usage of computerized intervention systems is becoming more common. These computerized intervention systems have been steadily improving. However, one significant drawback continues to be their restricted response modalities, typically consisting of the child using a touch screen or a pointing device to select from a set of pictures. By confining the phonemic awareness skills that the system addresses to those that can be tapped into via picture-point-and-click, these systems have a restricted scope of what they can teach. A second drawback of many current systems is that their user interface (e. g. visual layout, tempo) is typically not tunable to the individual characteristics of the child. Given the prevalence of phonemic awareness issues in a broad range of neurodevelopmental disorders, including Autism Spectrum Disorder and Developmental Language Disorder, individual tuning may be critical to address individual neurocognitive weaknesses, such as problems in memory, attention, visual scanning, perceptual-motor coordination, and processing speed. We have addressed these drawbacks by (1) taking advantage of drag-and-drop and other touch response modalities that current low-cost touch screen computers are capable of processing and that children are increasingly familiar with, and (2) incorporating multiple dimensions of individual tunability into the system. My role: Since 2005, I have been the primary developer of the BioSpeech text-to-speech system, a medium-size software project comprising approximately 10,000 lines of code. For this project, I assisted with integration with the graphical user interface, and provided solutions to the problem of synthesizing illegal (i. e. not found in normal use of English) phoneme sequences.
  9. 2009/09/01–2013/08/31: National Science Foundation IIS-0915754, "RI: Small: Modeling Coarticulation for Automatic Speech Recognition", PI: Kain (OHSU). We have developed a data-driven, triphone formant trajectory model and methodology for estimating its parameters. In this model, formant targets are speaker dependent, but independent of speaking style. We have validated this model using perceptual listening tests. An analysis of conversationally and clearly spoken speech confirmed that (1) formant trajectories in clear vowels reach their targets more frequently, (2) formants show considerable asynchronicity, and (3) phoneme formant targets approximate their expected values. We also found preliminary evidence that targets derived from clear speech alone perform better at modeling both styles than targets from conversational speech. Having created and validated this model, we are now in the process of applying the approach to disordered speech, paving the way for an objective diagnosis of the degree of coarticulation of dysarthria. Another application is an objective evaluation of the effectiveness of specific speech interventions for certain kinds of dysarthria, e. g. the Lee Silverman Voice Treatment. Finally, this research may also provide an avenue for automatically transforming conversationally-spoken speech to sound as if it had been spoken clearly, thus increasing its intelligibility. A real-time, transparent version of this algorithm would be a desirable feature in many general telecommunications devices. My role: As PI, I am responsible for all aspects of the project, including overall project supervision and management, as well as mentoring of graduate student Bush.
  10. 2009/07/15–2012/06/30: National Science Foundation IIS-0905095, "HCC: Automatic detection of atypical patterns in cross-modal affect", PI: van Santen (OHSU). The expression of affect in face-to-face situations requires the ability to generate a complex, coordinated, cross-modal affective signal, having gesture, facial expression, vocal prosody, and language content modalities. This ability is compromised in neurological disorders such as Parkinson's disease and autism spectrum disorder (ASD). The long-term goal is to build computer-based interactive systems for remediation of poor affect communication and diagnosis of the underlying neurological disorders based on analysis of affective signals. A requirement for such systems is technology to detect atypical patterns in affective signals. We developed a play situation for eliciting affect and collected audio-visual data from approximately 60 children between 4 and 7 years old, half of them with ASD and the other half constituting a control group of typically developing children. We labeled the data on relevant affective dimensions, developed algorithms for the analysis of affective incongruity, and then tested the algorithms against the labeled data in order to determine their ability to differentiate between ASD and typical development. My role: I created special delexicalized speech stimuli, using a novel delexicalization algorithm that rendered the lexical content of an utterance unintelligible while preserving important acoustic prosodic cues. Preference tests showed that the proposed method preserved substantially more speaker identity and sounded more natural than conventional methods. These delexicalized speech stimuli were used in perceptual tests to exclude the effect of lexical content on affect.
  11. 2009/07/17–2012/06/30: National Institutes of Health 5R21DC010035, "Quantitative Modeling of Segmental Timing in Dysarthria", PI: van Santen (OHSU). The project seeks to apply a quantitative modeling framework to segment durations in sentences produced by speakers with a variety of neurological diagnoses and dysarthrias. My role: I was responsible for software development for custom recording of speech data and for the extension of my previously published hybridization algorithm for the purposes of creating special perceptual speech stimuli.
  12. 2008–2009: Nancy Lurie Marks Family Foundation award, "In Your Own Voice: Personal AAC Voices for Minimally Verbal Children with Autism Spectrum Disorder", PI: van Santen (OHSU). My role: I performed research and development to adapt a text-to-speech voice to sound like a particular child's voice; a task made particularly challenging by the difficulty of extracting reliable acoustic features from children's speech.
  13. 2007/09/01–2011/08/31: National Science Foundation IIS-0713617, "HCC: High-quality Compression, Enhancement, and Personalization of Text-to-Speech Voices", PI: Kain (OHSU). My role: Together with Ph. D. students and co-investigators, I developed text-to-speech (TTS) technologies that focus on elimination of concatenation errors and improved accuracy in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics, using an asynchronous interpolation model that Jan van Santen and I proposed in 2002. These algorithmic advances added to the general acceptability of Speech Generating Devices (SGDs), used by individuals with impaired or absent speech communication.
  14. 2007/01/01–2008/06/30: National Institutes of Health 1R41DC008712, "User Adaptation of AAC Device Voices - Phase 1", PI: van Santen (BioSpeech). Speech communication ability is impaired or absent in millions of Americans due to neurological disorders and diseases and to trauma, including autism, Parkinson's disease, and stroke. Augmentative and Alternative Communication (AAC) devices that are operated via switches, keyboards, and a broad range of other input devices, and that have synthetic speech as output, are often the only manner in which these individuals can communicate. A psychologically important feature that no currently available systems have is the ability to speak with the user's voice, i.e., the ability to produce speech that mimics the individual's pre-morbid speech or speech that the individual may be able to intermittently produce. This project used voice transformation (VT) technology to accomplish this goal. My role: I developed and evaluated voice transformation and prosody modification technologies to customize synthetic voices using concatenative speech synthesis technologies, with the aim of mimicking the individual user's pre-morbid speech.
  15. 2006/09/01–2008/03/31: National Institutes of Health 1R41DC007240, "Voice Transformation for Dysarthria - Phase 1", PI: van Santen (BioSpeech). Dysarthria is a motor speech disorder due to weakness or poor coordination of the speech muscles. Affected muscles include the lungs, larynx, oro- and nasopharynx, soft palate, and articulators (lips, tongue, teeth, and jaw). The degree to which these muscle groups are compromised determines the particular pattern of speech impairment. For example, poor lung function affects the overall volume or loudness, while problems with specific articulators may cause mispronunciations of certain phonemes. There is a great variety of diseases that can cause dysarthria, including Parkinson’s, Multiple Sclerosis, and strokes. My role: I continued development of software that transforms speech compromised by dysarthria into easier-to-understand and more natural-sounding speech. In addition, I designed a hardware configuration that allowed the software to reside on a wearable computer, with a headset microphone as input and powered speaker as output, giving the user full mobility while wearing the speaking-aid.
  16. 2005/01/10–2010/12/31: National Institutes of Health 5R01DC007129, "Expressive crossmodal affect integration in Autism", PI: van Santen (OHSU). Autistic Spectrum Disorders (ASD) form a group of neuropsychiatric conditions whose core behavioral features include impairments in reciprocal social interaction, in communication, and repetitive, stereotyped, or restricted interests and behaviors. The importance of prosodic deficits in the adaptive communicative competence of speakers with ASD, as well as for a fuller understanding of the social disabilities central to these disorders is generally recognized; yet current studies are few in number and have significant methodological limitations. The objective of the proposed project is to detail prosodic deficits in young speakers with ASD through a series of experiments that address these disabilities and related areas of function. My role: I developed a delexicalization algorithm that rendered the lexical content of an utterance unintelligible, while preserving important acoustic prosodic cues.
  17. 2005/01/01–2006/06/30: National Science Foundation IIP-0441125, "STTR Phase 1: Small Footprint Speech Synthesis", PI: Kain (BioSpeech). Text-to-speech (TTS) systems have recognized societal benefits for universal access, education, and information access by voice. For example, TTS-based augmentative devices are available for individuals who have lost their voice; and reading machines for the blind have been available for several decades. My role: I developed and implemented a novel algorithm that led to dramatic decreases in disk and memory requirements at a given speech quality level and minimization of the amount of voice recordings needed to create a new synthetic voice. The latter point enabled building personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds or for individuals who are about to undergo surgery that will irreversibly alter their speech.
  18. 2001/10/01–2005/09/30: National Science Foundation IIS-0117911, "Making Dysarthric Speech Intelligible", PI: van Santen (OHSU). My role: I developed software that transforms speech compromised by dysarthria into easier-to-understand and more natural-sounding speech. The strategy for improving intelligibility is the manipulation of a small set of highly relevant speech features; specifically the energy, pitch, and formant frequencies of an input speech waveform. Pitch and energy are appropriately smoothed, and formant frequencies are mapped with a joint-density Gaussian mixture model, a technique I first introduced in 1998 that since has become the most often used mapping technique in the field. Results from perceptual tests indicated that the transformation improved intelligibility, and that the accompanying removal of the vocal fry improved perceived naturalness.

4.3 Publications/Creative Work

In the following lists, Ph. D. students under my mentorship are underlined.

Peer-reviewed Journal Articles and 4–5 page Conference Papers

  1. S. Mohammadi, A. Kain, “An Overview of Voice Conversion Systems”, Speech Communication, 2017.
  1. S. Mohammadi, A. Kain, “A Voice Conversion Mapping Function based on a Stacked Joint-Autoencoder”, Interspeech, 2016.
  2. B. Snider and A. Kain, “Classification of Respiratory Effort and Disordered Breathing during Sleep from Audio and Pulse Oximetry Signals”, ICASSP, 2016.
  1. M. Langarani, J. van Santen, S. Mohammadi, A. Kain, “Data-driven Foot-based Intonation Generator for Text-to-Speech Synthesis”, Interspeech, 2015.
  2. S. Mohammadi, A. Kain, “Semi-supervised Training of a Voice Conversion Mapping Function using a Joint-Autoencoder”, Interspeech, 2015.
  3. S. Dudy, M. Asgari, and A. Kain, “Pronunciation Analysis for Children with Speech Sound Disorders”, IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015. (PMC4710861).
  1. A. Amano-Kusumoto, J.-P. Hosom, A. Kain, J. Aronoff, “Determining the relevance of different aspects of formant contours to intelligibility”, Speech Communication, vol. 59, April 2014.
  2. K. Tjaden, A. Kain, J. Lam, “Hybridizing Conversational and Clear Speech to Investigate the Source of Increased Intelligibility in Parkinson’s Disease”, Journal of Speech, Language, and Hearing Research, Volume 57, August 2014.
  3. S. Mohammadi, A. Kain, “Voice conversion using Deep Neural Networks with speaker-independent pre-training”, IEEE Spoken Language Technology Workshop (SLT), 2014.
  4. B. Bush, A. Kain, “Modeling Coarticulation in Continuous Speech”, Interspeech, 2014.
  1. S. Mohammadi, A. Kain, “Transmutative Voice Conversion”, ICASSP, 2013.
  2. B. Bush, A. Kain, “Estimating Phoneme Formant Targets and Coarticulation Parameters of Conversational and Clear Speech”, ICASSP, 2013.
  3. B. Snider and A. Kain, “Automatic Classification of Breathing Sounds during Sleep”, ICASSP, 2013.
  1. S. Mohammadi, A. Kain, J. van Santen, “Making Conversational Vowels More Clear”, Proceedings of Interspeech, 2012.
  2. E. Morley, E. Klabbers, J. van Santen, A. Kain, S. Mohammadi, “Synthetic F0 can Effectively Convey Speaker ID in Delexicalized Speech”, Interspeech, 2012.
  1. E. Morley, J. van Santen, E. Klabbers, A. Kain, “F0 Range and Peak Alignment across Speakers and Emotions”, ICASSP, 2011.
  2. B. Bush, J.-P. Hosom, A. Kain, and A. Amano-Kusumoto, “Using a genetic algorithm to estimate parameters of a coarticulation model”, Interspeech, 2011.
  1. A. Kain and T. Leen, “Compression of Line Spectral Frequency Parameters using the Asynchronous Interpolation Model”, Proceedings of 7th ISCA Workshop on Speech Synthesis, September 2010.
  2. A. Kain and J. van Santen, “Frequency-domain delexicalization using surrogate vowels”, Interspeech, 2010.
  3. A. Amano-Kusumoto, J.-P. Hosom, and A. Kain, “Speaking style dependency of formant targets”, Interspeech, 2010.
  4. E. Klabbers, A. Kain, and J. van Santen, “Evaluation of speaker mimic technology for personalizing SGD voices”, Interspeech, 2010.
  1. A. Kain, J. van Santen, “Using Speech Transformation to Increase Speech Intelligibility for the Hearing- and Speaking-impaired”, Proceedings of ICASSP, April 2009.
  2. Q. Miao, A. Kain, J. van Santen, “Perceptual Cost Function for Cross-fading Based Concatenation”, Proceedings of Interspeech, 2009.
  3. R. Moldover, A. Kain, “Compression of Line Spectral Frequency Parameters with Asynchronous Interpolation”, Proceedings of ICASSP, April 2009.
  1. A. Kain, A. Amano-Kusumoto, and J.-P. Hosom, “Hybridizing Conversational and Clear Speech to Determine the Degree of Contribution of Acoustic Features to Intelligibility”, Journal of the Acoustical Society of America, vol. 124, issue 4, October 2008, pp. 2308–2319.
  1. A. Kain, J. Hosom, X. Niu, J. van Santen, M. Fried-Oken, J. Staehely, “Improving the Intelligibility of Dysarthric Speech”, Speech Communication, vol. 49, issue 9, September 2007, pp. 743–759.
  2. E. Klabbers, J. van Santen, A. Kain, “The Contribution of Various Sources of Spectral Mismatch to Audible Discontinuities in a Diphone Database”, IEEE Transactions on Audio, Speech, and Language Processing Journal, Volume 15, Issue 3, pp. 949–956, March 2007.
  3. A. Kusumoto, A. Kain, P. Hosom, and J. van Santen, “Hybridizing Conversational and Clear Speech”, Proceedings of Interspeech, August 2007.
  4. A. Kain, Q. Miao, J. van Santen, “Spectral Control in Concatenative Speech Synthesis”, Proceedings of 6th ISCA Workshop on Speech Synthesis, August 2007.
  5. A. Kain and J. van Santen, “Unit-Selection Text-to-Speech Synthesis Using an Asynchronous Interpolation Model”, Proceedings of 6th ISCA Workshop on Speech Synthesis, August 2007.
  1. X. Niu, A. Kain, J. van Santen, “A Noninvasive, Low-cost Device to Study the Velopharyngeal Port During Speech and Some Preliminary Results”, Proceedings of Interspeech, September 2006.
  1. J. van Santen, A. Kain, E. Klabbers, and T. Mishra, “Synthesis of Prosody using Multi-level Unit Sequences”, Speech Communication Journal, vol. 46, issues 3–4, pp. 365–375, July 2005.
  2. X. Niu, A. Kain, J. van Santen, “Estimation of the Acoustic Properties of the Nasal Tract during the Production of Nasalized Vowels”, Proceedings of EUROSPEECH, September 2005.
  1. A. Kain, X. Niu, J. Hosom, Q. Miao, J. van Santen, “Formant Re-synthesis of Dysarthric Speech”, Proceedings of 5th ISCA Workshop on Speech Synthesis, June 2004.
  2. J. van Santen, A. Kain, and E. Klabbers, “Synthesis by Recombination of Segmental and Prosodic Information”, Speech Prosody 2004, March 2004.
  3. H. Duxans, A. Bonafonte, A. Kain, and J. van Santen, “Including Dynamic and Phonetic Information in Voice Conversion Systems”, Proceedings of ICSLP, October 2004.
  1. J. Hosom, A. Kain, T. Mishra, J. van Santen, M. Fried-Oken, J. Staehely, “Intelligibility of modifications to dysarthric speech”, Proceedings of ICASSP, May 2003.
  2. A. Kain and J. van Santen, “A speech model of acoustic inventories based on asynchronous interpolation”, Proceedings of EUROSPEECH, pp. 329–332, August 2003.
  3. J. van Santen, L. Black, G. Cohen, A. Kain, E. Klabbers, T. Mishra, J. de Villiers, X. Niu, “Applications of computer generated expressive speech for communication disorders”, Proceedings of EUROSPEECH, pp. 1657–1660, August 2003.
  1. A. Kain and J. van Santen, “Compression of Acoustic Inventories using Asynchronous Interpolation”, Proceedings of IEEE Workshop on Speech Synthesis, pp. 83–86, September 2002.
  2. J. van Santen, J. Wouters, and A. Kain, “Modification of Speech: A Tribute to Mike Macon”, Proceedings of IEEE Workshop on Speech Synthesis, September 2002.
  1. A. Kain and M. Macon, “Design and Evaluation of a Voice Conversion Algorithm based on Spectral Envelope Mapping and Residual Prediction”, Proceedings of ICASSP, May 2001.
2000 and earlier
  1. A. Kain and Y. Stylianou, “Stochastic Modeling of Spectral Adjustment for High Quality Pitch Modification”, Proceedings of ICASSP, June 2000, vol. 2, pp. 949–952.
  2. J. House, A. Kain, and J. Hines, “ESP - Metaphor for learning: an evolutionary algorithm”, Proceedings of GECCO 2000, Las Vegas, NV.
  3. A. Kain and M. Macon, “Personalizing a speech synthesizer by voice adaptation”, Third ESCA / COCOSDA International Speech Synthesis Workshop, November 1998, pp. 225–230.
  4. A. Kain and M. Macon, “Text-to-speech voice adaptation from sparse training data”, Proceedings of ICSLP, November 1998, vol. 7, pp. 2847–50.
  5. A. Kain and M. Macon, “Spectral Voice Conversion for Text-to-Speech Synthesis”, Proceedings of ICASSP, May 1998, vol. 1, pp. 285–288.
  6. S. Sutton, R. Cole, J. de Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan, E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, M. Cohen, “Universal Speech Tools: The CSLU Toolkit”, Proceedings of ICSLP, November 1998, vol. 7, pp. 3221–24.
  7. N. Malayath, H. Hermansky, A. Kain and R. Carlson, “Speaker-independent Feature Extraction by Oriented Principal Component Analysis”, Proceedings of EUROSPEECH 1997.


  1. J.-P. Hosom, A. Kain, and B. Bush, “Towards the recovery of targets from coarticulated speech for automatic speech recognition”, The Journal of the Acoustical Society of America, 130(4), page 2407, 2011.
  2. A. Kain, "Speech transformation: Increasing intelligibility and changing speakers", Journal of the Acoustical Society of America, 126(4), page 2205, 2009.

Ph. D. Thesis

“High Resolution Voice Transformation”, OGI School of Science & Engineering, 2001.

Technical Reports

  1. B. R. Snider and A. Kain, “Adaptive Reduction of Additive Noise from Sleep Breathing Sounds”, CSLU-2012-001.
  2. A. Kain, J.-P. Hosom, S. H. Ferguson, B. Bush, “Creating a speech corpus with semi-spontaneous, parallel conversational and clear speech”, CSLU-11-003.


  1. J. van Santen and A. Kain, OHSU. System and Method for Compressing Concatenative Acoustic Inventories for Speech Synthesis.
  2. A. Kain and Y. Stylianou, AT&T Research Laboratories. Stochastic Modeling Of Spectral Adjustment For High Quality Pitch Modification.


OHSU Disclosures

  1. #2275 PyTTS Text-to-Speech software with 16 voices, 05/12/2016, Exclusively Licensed.
  2. #1365 Mexican Spanish female diphone voice, 12/08/2008, Seeking Commercial Partners.
  3. #1364 Mexican Spanish male diphone voice, 12/08/2008, Seeking Commercial Partners.
  4. #1362 American English female diphone voice (AS), 12/08/2008, Seeking Commercial Partners.
  5. #1361 American English male speaker diphone voice, 12/08/2008, Seeking Commercial Partners.
  6. #1360 German male speaker diphone voice, 12/08/2008, Seeking Commercial Partners.
  7. #1359 German female speaker diphone voice, 12/08/2008, Seeking Commercial Partners.
  8. #1358 New Flinger singing synthesis, 12/08/2008, Inactive.
  9. #1195 Clear-Speech Corpus, Speaker JPH, 05/07/2007, Seeking Commercial Partners.
  10. #1065 Controlling Formant Frequencies in Concatenative Speech Synthesis Systems, 05/16/2006, Inactive.
  11. #1061 Noninvasive Nasal Flow Measurement Device and Algorithm, 05/11/2006, Inactive.
  12. #0868 CSLU System and Method for Synthesis Based Speech Enhancement, 09/24/2004, Exclusively Licensed.
  13. #0844 CSLU Voice transformation for Dysarthria with Formant Re-synthesis, 06/03/2004, Exclusively Licensed.
  14. #0665 Voice Transformation (High Resolution), 11/13/2002, Seeking Commercial Partners.
  15. #0566 Method to compress concatenative acoustic inventories for speech synthesis, 07/01/2001, Exclusively Licensed.

4.4 Invited Lectures, Conference Presentations or Professorships

International and National

  1. Conference presentation: “Hybridizing Conversational and Clear Speech to Investigate the Source of Intelligibility Variation in Parkinson’s Disease”, Conference on Motor Speech, Sarasota, Florida, 2014.
  2. Conference presentation: “Transmutative Voice Conversion”, ICASSP, Vancouver, Canada, 2013.
  3. Conference presentation: “Frequency-domain delexicalization using surrogate vowels”, Interspeech, Makuhari, Japan, 2010.
  4. Conference presentation: “Compression of Line Spectral Frequency Parameters using the Asynchronous Interpolation Model”, 7th ISCA Workshop on Speech Synthesis, Kyoto, Japan, 2010.
  5. Invited conference presentation: “Hybridizing Conversational and Clear Speech to Determine the Degree of Contribution of Acoustic Features to Intelligibility”, Meeting of the Acoustical Society of America, San Diego, CA, 2009.
  6. Invited conference presentation for a Special Session on Voice Transformation: “Using Speech Transformation to Increase Speech Intelligibility for the Hearing- and Speaking-impaired”, ICASSP, Taipei, Taiwan, 2009.
  7. Conference presentation: “Compression of Line Spectral Frequency Parameters with Asynchronous Interpolation”, ICASSP, Taipei, Taiwan, 2009.
  8. Conference presentation: “Hybridizing Conversational and Clear Speech”, Interspeech, Antwerp, Belgium, 2007.
  9. Conference presentation: “Spectral Control in Concatenative Speech Synthesis”, 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 2007.
  10. Conference presentation: “Unit-Selection Text-to-Speech Synthesis Using an Asynchronous Interpolation Model”, 6th ISCA Workshop on Speech Synthesis, Bonn, Germany, 2007.
  11. Conference presentation: “Formant Re-synthesis of Dysarthric Speech”, 5th ISCA Workshop on Speech Synthesis, Pittsburgh, PA, USA, 2004.
  12. Conference presentation: “A speech model of acoustic inventories based on asynchronous interpolation”, EUROSPEECH, Geneva, Switzerland, 2003.
  13. Conference presentation: “Compression of Acoustic Inventories using Asynchronous Interpolation”, IEEE Workshop on Speech Synthesis, Santa Monica, CA, 2002.
  14. Conference presentation: “Design and Evaluation of a Voice Conversion Algorithm based on Spectral Envelope Mapping and Residual Prediction”, ICASSP, Salt Lake City, UT, 2001.
  15. Conference presentation: “Spectral Voice Conversion for Text-to-Speech Synthesis”, ICASSP, Seattle, WA, 1998.

Regional and Local

4.5 Awards


5.1 Membership in Professional Societies

5.2 Granting Agency Review Work

5.3 Editorial and Ad Hoc Review Activities

5.4 Committees



5.5 Activities


6.1 Students

6.2 Courses

CS 627 Data Science Programming

This course represents a best-of compilation of concepts, practices, and R- and Python-based software libraries (all free, open-source, and unrestricted) that allow for a relatively rapid, straightforward, and easy-to-maintain implementation of new ideas and scientific questions. Students will gain awareness and initial working knowledge of some of the most fundamental computational tools for performing a wide variety of academic research. As such, it will focus on providing breadth instead of depth, which means that for each concept we will talk about motivation, key ideas, and concrete usage scenarios, but without mathematical background or proofs, which can be acquired in more specialized classes. In this class we will: use R for data exploration and visualization, write programs in Python, perform numeric tasks using numpy and scipy, analyze data using pandas, discuss audio and image processing using scipy.signal and scikit-image, apply machine learning algorithms using scikit-learn, visualize data using matplotlib and pyqtgraph, use Qt to build graphical user interfaces, learn how to version control files with git, address performance issues via compilation/profiling/parallelization tools, and much more.
I have created the curriculum for, and regularly teach, this 3-credit course (12 × 1.5-hour lectures). Creating the curriculum required approximately 100 hours. Students' evaluation scores averaged 5.0/5.0 in Fall 2015 and 5.15/6 in Fall 2016.

EE 658 Speech Signal Processing

Speech systems are becoming commonplace in today's computer systems and Augmentative and Alternative Communication (AAC) devices. Examples are speech recognition systems and Text-to-Speech synthesis systems. This course will introduce the fundamentals of the underlying speech signal processing that enables such systems. Topics include speech production and perception by humans, frequency transforms, filters, linear predictive features, pitch estimation, speech coding, speech enhancement, and prosodic speech modification.
I have created the curriculum for, and regularly teach, this 3-credit course (20 × 1.5-hour lectures). Creating the curriculum required approximately 160 hours. I also created a dozen homework assignments and a project. Grading students' answers and evaluating their project outcomes requires a total of approximately 3 hours per student over the course of the class (unless a TA is available). Students' evaluation scores averaged 4.7/5.0 in Winter 2016.

CS 653 Text-to-Speech Synthesis

This course will introduce students to the problem of synthesizing speech from text input. Speech synthesis is a challenging area that draws on expertise from a diverse set of scientific fields, including signal processing, linguistics, psychology, statistics, and artificial intelligence. Fundamental advances in each of these areas will be needed to achieve truly human-like synthesis quality and advances in other realms of speech technology (like speech recognition, speech coding, speech enhancement). In this course, we will consider current approaches to sub-problems such as text analysis, pronunciation, linguistic analysis of prosody, and generation of the speech waveform. Lectures, demonstrations, and readings of relevant literature in the area will be supplemented by student lab exercises using hands-on tools.
I have created the curriculum for and teach the second half of this 3-credit course (10 × 1.5-hour lectures). Creating the curriculum required approximately 80 hours of my time. In this class, students are expected to create medium-size projects, which I evaluate and grade. Students' evaluation scores averaged 4.4/5.0 in 2015.

CS 606 Computational Approaches to Speech and Language Disorders

This course covers a range of speech and language analysis algorithms that have been developed for measurement of speech or language based markers of neurological disorders, for the creation of assistive devices, and for remedial applications. Topics will include introduction to speech and language disorders, robust speech signal processing, statistical approaches to pitch and timing modeling, voice transformation algorithms, speech segmentation, and modeling of disfluency. The class will use a wide array of clinical data, and will be closely tied to several ongoing research projects.
I have created and taught 2 × 1.5-hour lectures for this course.

6.3 Awards