Current CSLU Research Projects

CSLU conducts a wide range of research projects, including projects focused on core speech processing and natural language processing algorithms (Technology Research Projects) and projects focused on biomedical applications (Biomedical Research Projects), specifically on the creation of diagnostic, remedial, and assistive methods for neurodevelopmental and neurodegenerative disorders and diseases.

TECHNOLOGY RESEARCH PROJECTS

BIOMEDICAL RESEARCH PROJECTS

Autism Projects

Other Neurodevelopmental Disorders
Neurodegenerative Disorders

I. TECHNOLOGY RESEARCH PROJECTS


  • From Text to Pictures
[Richard Sproat and Julia Hirschberg [Columbia U.], PI's; Owen Rambow [Columbia U.], co-PI; NSF]. The researchers are developing new theoretical models and technology to automatically convert descriptive text into 3D scenes representing the text's meaning. They do this via the Scenario-Based Lexical Knowledge Resource (SBLR), a resource they are creating from existing sources (PropBank, WordNet, FrameNet) and from automated mining of Wikipedia and other un-annotated text. In addition to predicate-argument structure and semantic roles, the SBLR includes necessary roles, typical role fillers, contextual elements, and activity poses, which enable analysis of input sentences at a deep level and the assembly of appropriate elements from libraries of 3D objects to depict the fuller scene implied by a sentence. For example, "Terry ate breakfast" does not tell us where he ate (kitchen, dining room, restaurant) or what he ate (cereal, doughnut, or rice, umeboshi, and natto). These elements must be supplied from knowledge about typical role fillers appropriate for the information that is specified in the input. Note that the SBLR has a component that varies by cultural context.
    Textually-generated 3D scenes will have a profound, paradigm-shifting effect in human-computer interaction, giving people unskilled in graphical design the ability to directly express intentions and constraints in natural language -- bypassing standard low-level direct-manipulation techniques. This research will open up the world of 3D scene creation to a much larger group of people and a much wider set of applications. In particular, the research will target middle-school-age students who need to improve their communicative skills, including those whose first language is not English or who have learning difficulties: a field study in a New York after-school program will test whether use of the system can improve literacy skills. The technology also has the potential to interest a more diverse population in computer science at an early age, as interactions with K-12 teachers have indicated.


  • Efficient hidden structure annotation via structural multiple-sequence alignments
[Brian Roark, PI; NSF].  The focus of this project is to develop finite-state syntactic processing models for natural language that use features encoding global structural constraints derived through multiple sequence alignment (MSA) techniques, to significantly improve accuracy without expensive context-free inference. MSAs are widely used in computational biology for building finite-state models that capture long-distance dependencies in sequences (e.g., in RNA secondary structure). Given a large set of functionally aligned sequences in MSA format, finite-state models can be constructed that allow for the efficient alignment of new sequences with the given MSA. In natural language processing (NLP), only very rarely have MSA techniques been used, and then to characterize phonetic or semantic similarity. This project is exploring the definition of a purely syntactic functional alignment between semantically unrelated strings from the same language, to define a structural MSA for constructing finite-state syntactic models. The project has two specific aims. The first aim is to develop natural language sequence processing algorithms and models that can: a) define sequence alignments with respect to syntactic function; b) build structural MSAs based on defined functional alignments; c) derive finite-state models to efficiently align new sequences with the built MSA; and d) extract features from an alignment with the MSA for improved sequence modeling. The second aim is to empirically validate this approach within a number of large-scale text processing applications in multiple domains and languages. The resulting algorithms are expected to provide improved finite-state natural language models that will contribute to the state-of-the-art in critical text processing applications.

  • Discriminative Syntactic Language Modeling: Automatic Feature Selection and Efficient Annotation
[Brian Roark, PI; NSF]. The focus of this NSF-funded project is on the effective use of parser-derived and tagger-derived features within discriminative approaches to language modeling for automatic speech recognition. Discriminative language modeling approaches provide a tremendous amount of flexibility in defining features, but the size of the potential parser-derived feature space requires efficient feature annotation and selection algorithms. The project has four specific aims. The first aim is to develop a set of efficient, general, and scalable syntactic feature selection algorithms for use with various kinds of annotation and several parameter estimation techniques. The second aim is to develop general tree and grammar transformation algorithms designed to preserve selected feature annotations yet lead to faster parsing or even tagging approximations to parsing. The third aim is to evaluate a broad range of feature selection and grammar transformation approaches on a large vocabulary continuous speech recognition (LVCSR) task, namely Switchboard. The final aim is to design and package the algorithms to straightforwardly support future research into other applications, such as machine translation (MT), and into other languages, such as Chinese and Arabic. The algorithms developed as a part of this project are expected to contribute to improvements in LVCSR accuracy and in applications that rely upon this technology. The algorithms are being packaged into a publicly available software library, enabling researchers working in many application areas -- including LVCSR and MT -- and various languages to investigate best practices in syntactic language modeling for their specific task, without having to hand-select and evaluate feature sets.


  • Learning Mixed-Initiative Dialogue Strategies
[Peter Heeman, PI; NSF]. This research project enables next-generation dialogue systems to collaborate with a user without the limitations of system-initiative interaction, in order to solve complex tasks in an optimal manner. The research develops reinforcement learning (RL) strategies to learn dialogue policies that are mixed-initiative. The specific aims are to (a) extend RL to mixed-initiative dialogue interaction; (b) allow the system policy to adapt to different user types, such as people with poor memory or poor problem-solving skills; and (c) simultaneously learn the policy for the simulated user.
    This approach will allow more advanced dialogue systems to be deployed, such as assisting the elderly so they can live independently longer, and helping provide health care information to rural areas. The proposed research project will result in a toolkit that will allow a wide range of users to easily develop dialogue policies. The toolkit will (a) allow students to be effectively trained in this area, (b) lower the barrier for other researchers to contribute to the field, and (c) help transfer this new technology to industry.


  • High-Quality Compression, Enhancement, and Personalization of Text-to-Speech Voices
[Alexander Kain, PI; Todd Leen, co-PI; NSF]. The vast variability of the human speech signal remains a central challenge for Text-to-Speech (TTS) systems. The objective of this research is to develop TTS technologies that focus on elimination of concatenation errors, and accurate speech modifications in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics. The investigators are exploring an asynchronous interpolation model (AIM), which promises to provide for high-quality and flexible TTS. The core idea of AIM is to represent a short region of speech as a composition of several types of features called streams. Each stream is computed by asynchronous interpolation of basis vectors.  Each basis vector is associated with a particular phoneme, allophone, or more specialized unit. Thus, the speech region is described by the varying degrees of influence of several types of preceding and following acoustic features. Using AIM, the investigators are also developing methods to optimally compress the acoustic inventories of TTS systems, given a size or a quality constraint, and to adapt the system to a new voice, given a few training samples. The system being researched forms a hybrid between traditional concatenative and formant-based synthesis, having advantages of both, resulting in a high-quality, optimized TTS system with voice adaptation capabilities. TTS has generally recognized societal benefits for universal access, education, and information access by voice. Our research will make it possible, for example, to build personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds.

  • Multi-Threaded Dialogues For Real-Time Applications
[Peter Heeman, PI; NSF]. The goal of this NSF-funded project is to create a speech interface that supports a user in interacting with multiple real-time devices at the same time, where the interaction with each device is a separate dialogue thread. The first aim is to show, using a human-computer study, that the simple way to implement a speech interface for managing multiple threads is not effective. The second aim is to run a human-human study to show that people can inherently manage multiple dialogue threads, and to determine what conventions they use. The third aim is to build a speech interface that implements the conventions that were found.
    The main impact of this work is the development of a model that accounts for how people deal with multi-threaded dialogues. This model will be demonstrated in an implemented speech interface. This work will create a technology that will be useful in interacting with the pervasive electronic devices that we can expect to see in the future.

  • Small Footprint Speech Synthesis
This NSF Small Business Technology Transfer Phase I project is led by Alexander Kain at BioSpeech Inc., a CSLU startup, and Jan van Santen. The project aims to develop and implement a new algorithm in the area of text-to-speech synthesis (TTS) that will lead to (i) dramatic decreases in disk and memory requirements at a given speech quality level and (ii) minimization of the amount of voice recordings needed to create a new synthetic voice. Most current TTS systems operate by concatenating segments of recorded speech ("acoustic units"). A challenge for TTS is coarticulation: the dependency of the acoustic manifestations of a phoneme on its neighbors. Current TTS systems use multi-phone acoustic units such as diphones, which preserve coarticulatory patterns naturally present in speech. However, this approach requires a large amount of recordings and generates systems with large footprints. BioSpeech proposes a uniphone approach that addresses coarticulation processes with an explicit model. The method uses complex spectral vectors (basis vectors) representing brief segments of speech inside single phonemes, and decomposes these into two components: a formant vector and a spectral balance vector. To generate speech, the formant and spectral balance vectors derived from the basis vectors corresponding to successive phonemes are subjected to separate--and hence generally asynchronous--interpolation operations using time-varying weights; the formant and spectral balance vector trajectories thus created are re-combined to create a trajectory in complex spectral space; finally, this trajectory is converted into output speech with the inverse Fourier transform. Asynchronicity is necessitated by the quasi-independence of articulators underlying different spectral features (e.g., frication, formant frequencies).
    First, the proposed work has implications for other speech technologies, including Automatic Speech Recognition (ASR). Current ASR technologies address coarticulation by using multi-phone units, typically triphones. The number of triphones in English is over 70,000, so these systems require a large amount of training recordings. The proposed model could dramatically reduce the amount of recordings required for system training. Second, TTS has generally recognized societal benefits for universal access, education, and information access by voice. For example, TTS-based augmentative devices are available for individuals who have lost their voice, and reading machines for the blind have been available for several decades. Third, the approach will make higher-quality TTS more available for smaller devices. For example, voice-based caller ID on low-end mobile telephones is currently not possible due to memory limitations. Fourth, it enables voice adaptation with a minimum of recordings. This will enable building personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds, or for individuals who are about to undergo surgery that will irreversibly alter their speech. The method proffered by BioSpeech requires recordings of valid samples of each of fewer than 50 phonemes, instead of each of 2000 or more diphones.
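    The asynchronous interpolation idea can be illustrated with a toy sketch. This is not the BioSpeech implementation: the function names, the linear ramp used as a weight function, and the example vectors are all assumptions made for illustration, and the final re-combination and inverse Fourier transform steps are omitted. The sketch only shows how two streams (formant and spectral balance) can move from one phoneme's basis vector to the next on different schedules.

```python
def ramp(t, start, end):
    """Hypothetical time-varying weight of the following phoneme:
    0 before `start`, 1 after `end`, linear in between."""
    if t <= start:
        return 0.0
    if t >= end:
        return 1.0
    return (t - start) / (end - start)


def interpolate(a, b, w):
    """Blend vectors a and b; w = 0 gives a, w = 1 gives b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def trajectories(prev, nxt, times, formant_window, balance_window):
    """Interpolate the two streams separately, each with its own
    transition window, so the streams are generally asynchronous.

    prev, nxt: dicts with 'formant' and 'balance' vectors (assumed layout).
    Returns one (formant_vector, balance_vector) pair per time point.
    """
    result = []
    for t in times:
        w_formant = ramp(t, *formant_window)
        w_balance = ramp(t, *balance_window)
        result.append((
            interpolate(prev["formant"], nxt["formant"], w_formant),
            interpolate(prev["balance"], nxt["balance"], w_balance),
        ))
    return result


if __name__ == "__main__":
    # Made-up values: two formant frequencies and one balance coefficient.
    prev = {"formant": [500.0, 1500.0], "balance": [0.0]}
    nxt = {"formant": [300.0, 2300.0], "balance": [1.0]}
    for pair in trajectories(prev, nxt, [0.0, 0.5, 1.0], (0.0, 1.0), (0.5, 1.0)):
        print(pair)
```

    In this toy run the formant stream is already halfway to the next phoneme at the midpoint while the spectral balance stream has not yet started to change, which is the kind of asynchrony the description attributes to quasi-independent articulators.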

  • Objective Methods for Predicting and Optimizing Synthetic Speech Quality
[Jan van Santen, PI]. This NSF-funded project focuses on how humans perceive acoustic discontinuities in speech. Current text-to-speech synthesis ("TTS") technology operates by retrieving intervals of stored digitized speech ("units") from a database and splicing ("concatenating") them to form the output utterance. Unavoidably, there are acoustic discontinuities at the time points where the successive speech intervals meet. An unsolved problem is how to predict from the quantitative, acoustic properties of two to-be-concatenated units whether humans will hear a discontinuity. This is of immediate relevance for TTS systems that select units at run time from a large speech corpus. During selection, these systems search through the space of all possible sequences of units that can be used for the utterance and select the sequence that has the lowest overall objective cost measure, such as the Euclidean distance between the final frame and initial frame of two units. However, research has already shown that this method and related methods do not predict well whether humans will hear a discontinuity. The current research, by being explicitly focused on perceptually optimized objective cost measures, will directly contribute to the perceptual accuracy of cost measures and hence to synthesis quality.
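    The baseline selection procedure described above can be sketched as a small dynamic-programming search. This is a minimal illustration, not the project's method: it implements only the plain Euclidean join cost that the text says predicts perception poorly, and the data layout (a unit as a list of frames, a frame as a tuple of floats) and function names are assumptions.

```python
import math


def join_cost(left_unit, right_unit):
    """Euclidean distance between the final frame of the left unit
    and the initial frame of the right unit."""
    return math.dist(left_unit[-1], right_unit[0])


def select_units(candidates):
    """Pick one unit per position so the summed join cost is minimal.

    candidates: list with one entry per target position; each entry is a
    list of candidate units; each unit is a list of frames (tuples of floats).
    Returns (total_cost, chosen_indices).
    """
    # best[j] = (cheapest cost of any path ending in candidate j, that path)
    best = [(0.0, [j]) for j in range(len(candidates[0]))]
    for pos in range(1, len(candidates)):
        new_best = []
        for j, unit in enumerate(candidates[pos]):
            cost, path = min(
                (
                    (prev_cost + join_cost(candidates[pos - 1][i], unit), prev_path)
                    for i, (prev_cost, prev_path) in enumerate(best)
                ),
                key=lambda entry: entry[0],
            )
            new_best.append((cost, path + [j]))
        best = new_best
    return min(best, key=lambda entry: entry[0])


if __name__ == "__main__":
    # Made-up two-dimensional frames, three target positions.
    candidates = [
        [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (5.0, 5.0)]],
        [[(1.0, 1.0), (2.0, 2.0)], [(5.0, 6.0), (0.0, 0.0)]],
        [[(0.0, 0.0), (0.0, 0.0)]],
    ]
    print(select_units(candidates))
```

    In the toy data, the first join is cheapest for candidates 0 and 0, but the sequence with the lowest overall cost uses candidates 1, 1, 0, which is why the search considers whole sequences rather than one join at a time. A perceptually optimized measure would replace `join_cost` while leaving the search unchanged.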

  • Prosody Generation for Child Oriented Speech Synthesis
[Jan van Santen, PI]. This NSF-funded project [joint with Alan Black at Carnegie Mellon University and Richard Sproat at the University of Illinois at Urbana-Champaign / now at CSLU] focuses on innovative algorithms for generating highly expressive synthetic speech. Generating expressive speech involves three hard research problems: (i) computing abstract tags that specify, e.g., which words need emphasis and how the utterance is phrased (e.g., where to pause); (ii) computing a fundamental frequency contour based on these tags; and (iii) severely modifying the stored speech fragments ("acoustic units") to obtain these contours. The central goal of the project is to address these research problems and create a TTS system that will make the next generation of TTS-based language remediation systems viable.

  • Creating the Next Generation of Intelligent Animated Conversational Agents
The goal of this NSF-funded project [Jan van Santen, co-PI; Ron Cole (PI) at the University of Colorado and Javier Movellan, co-PI, at the University of California at San Diego] is to improve reading achievement of children with reading problems by designing computer-based interactive reading tutors that incorporate new speech and language technologies. The reading tutors will help English- and Spanish-speaking children learn to read by providing classroom teachers and reading specialists with tools to instruct and exercise the set of auditory, visual, and linguistic skills needed to read: speech discrimination, speech production, phonological awareness, sound-to-letter mappings, vocabulary, fluency, and comprehension. The tutors will be designed, tested, and refined in collaboration with reading specialists and instructional designers, and tested with children in special education programs in elementary schools in Boulder, Colorado.




II. BIOMEDICAL RESEARCH PROJECTS


AUTISM

  • Expressive and Receptive Prosody in Autism
This NIH-supported project, led by Jan van Santen and Lois Black, and in collaboration with Rhea Paul and Fred Volkmar at Yale's Child Study Center and Larry Shriberg at the University of Wisconsin's Waisman Center, focuses on automated technologies for assessment of prosodic ability in autism. Autistic Spectrum Disorders (ASD) form a group of neuropsychiatric conditions whose core behavioral features include impairments in reciprocal social interaction and in communication, and repetitive, stereotyped, or restricted interests and behaviors. The importance of prosodic deficits in the adaptive communicative competence of speakers with ASD, as well as for a fuller understanding of the social disabilities central to these disorders, is generally recognized; yet current studies are few in number and have significant methodological limitations. The objective of the proposed project is to detail prosodic deficits in young speakers with ASD through a series of experiments that address these disabilities and related areas of function. Key features of the project include: 1) the application of innovative technology: the study will apply computer-based speech and language technologies for quantifying expressive prosody, for computing dialogue structure, and for generating acoustically controlled speech stimuli for measuring receptive prosody; moreover, all experiments will be delivered via computer to ensure consistency of stimuli and accuracy of recording responses; 2) broad coverage of the dimensions of prosody: all three functions of prosody (grammatical, pragmatic, and affective) will be addressed; expressive and receptive tasks are included; and both contextualized tasks (dialogue, story comprehension and memory) and decontextualized tasks (e.g., vocal affect recognition) will be used; 3) inclusion of neuropsychological assessment and classification methodologies to address within-group heterogeneity and obtain a detailed characterization of the groups; 4) inclusion of two comparison groups: children with typical development and those with Developmental Language Disorder; 5) inclusion of an experimental treatment program to enhance the prosodic abilities of speakers with ASD. A student fellowship for this project is supported by Autism Speaks. The software architecture is designed and implemented by Senior Programmer Jacques de Villiers.


  • Expressive crossmodal affect integration in autism
[Lois Black, PI; Jan van Santen, Alexander Kain, Esther Klabbers, Zak Shafran, Investigators; NIH]. Children with autism spectrum disorder (ASD) have often been observed to express affect either weakly, only in one modality at a time (e.g., choice of words) or in multiple modalities but not in a coordinated fashion. These difficulties in crossmodal integration of affect expression may have roots in certain global characteristics of brain structure in autism, specifically atypical interconnectivity between brain areas. Poor crossmodal integration of affect expression may also play a critical role in communication difficulties that are well documented in ASD. Not understanding how e.g., facial expression can be used to modify the interpretation of words undermines social reciprocity. Impairment in crossmodal integration of affect is thus a potentially powerful explanatory concept in ASD.
    The study will provide much-needed data on expressive crossmodal integration impairment in ASD and its association with receptive crossmodal integration impairment, using innovative technologies to create stimuli for a judgmental procedure that makes possible independent assessment of the individual modalities. These technologies are critical because human observers are not able to selectively filter out modalities. In addition, the vocal measures and the audiovisual database lay the essential groundwork for the next step: creation of audiovisual analysis methods for automated assessment of expressive crossmodal integration.
    These methods will be applied to audio-visual recordings of a structured play situation; the child will participate in this play situation twice, once with an examiner and once with a caregiver. This procedure for measuring expressive crossmodal integration will be complemented by a procedure for measuring crossmodal integration of affect processing using dynamic talking-face stimuli in which the audio and video stream are recombined (preserving perfect synchrony of the facial and vocal channels) to create stimuli with congruent vs. incongruent affect expression. Both procedures will be applied to three groups: Children with ASD, children with Developmental Language Disorder (DLD), and typically developing children; ages will be six to ten.
    Our study would be the first to perform a comprehensive analysis of crossmodal integration of affect expression in ASD. If the study confirms the existence of these impairments in ASD, and provides a detailed picture of these impairments, this could (i) guide brain studies to specifically target areas responsible for affect expression; (ii) provide a deeper understanding of impairments in social reciprocity; and (iii) help design remedial programs for intensive training of under-used or uncoordinated expressive modalities. The study thus contributes to etiology, diagnosis, and treatment.


  • Automatic detection of atypical patterns in crossmodal affect
[Jan van Santen, PI; Lois Black, Alexander Kain, and Zak Shafran, Co-PI's; NSF]. The expression of affect in face-to-face situations requires the ability to generate a complex, coordinated, crossmodal affective signal in gesture, facial expression, vocal prosody, and language content modalities. This ability is compromised in certain neurological disorders (e.g., Parkinson's disease, autism spectrum disorder (ASD)). Our long-term goal is to build interactive, agent-based systems for remediation of poor affect communication and for diagnosis of the underlying neurological disorders based on analysis of affective signals. A requirement for such systems is technology to detect atypical patterns in affective signals. Our immediate-term research objective is to develop this technology. Specific aims are (i) to collect and annotate audio-visual data in a play situation designed for eliciting affect from children with typical development (TD) and children with ASD; and (ii) to develop algorithms for the analysis of affective incongruity and evaluate their TD vs. ASD differentiation ability.

  • A Computerized Interactive Game for Remediation of Prosody in Children with Autism
[Lois Black; Autism Speaks]. The proposed project focuses on computer-assisted remediation of expressive and receptive prosody in children with autism spectrum disorders (ASD). Prosody refers to loudness, pitch, timing, melody, and other aspects of speech that illuminate the different meanings of what is verbally communicated. Prosody plays a critical role in an individual's communicative competence and social-emotional reciprocity. The ability to appropriately understand and express prosody may, in fact, be an integral part of the theory of mind deficits considered central to ASD, and play a role in the reported lack of ability to make inferences about others' intentions, desires, and feelings. Yet prosody is an under-explored feature of ASD, both in diagnosis and intervention. The goal of the proposed study is to create a new computer-assisted prosody remediation program and evaluate its efficacy in improving expressive and receptive prosody in children with ASD who have known prosody impairments. The program consists of an interactive, computerized "drama book" that contains a collection of videotaped social scenarios, each consisting of a series of interrelated scenes dramatically enacted by child and adult actors. A scenario will open with one scene, and the next scenes will occur based on what and how the child with ASD, speaking on behalf of an on-screen targeted child, communicates to the other characters. With a therapist's assistance, the child with ASD will be able to practice different things to say -- and so experience how what he says, and how he says it, can change the course of events.

 

  • Detection of Autism in Infants
[Jan van Santen, Lois Black, Co-PI's; the OHSU Foundation]. The project goal is to develop, demonstrate, and validate an automated, objective system for detecting early warning signs of autism in infants. The approach is non-invasive and uses an in-home system comprising low-cost, off-the-shelf equipment in the form of microphones, video cameras, and accelerometers. Data generated by the system are transmitted via internet protocol to a central processing facility where innovative algorithms -- which will be the core contribution of the proposed study -- extract diagnostic profiles. Unlike current diagnosis and detection of autism, which relies on behavioral assessment and subjective clinical judgment along with parent questionnaires, these diagnostic profiles are objective and based on sophisticated computer analysis of voice and movement patterns, and hence are expected to be more reliable, accurate, and information-rich.
    The prototype system exemplifies an exciting new telemedicine model that may be applicable to a broad range of neurodevelopmental disorders in addition to autism (e.g., ADHD, child bipolar disorder, ...) as well as neurodegenerative disorders (e.g., Parkinson's, Alzheimer's, ALS, ...). By replacing expensive direct clinical observation with automated data collection, and by providing the experts with highly informative and accurate diagnostic profiles, significant cost savings and simultaneous increases in diagnostic accuracy and accessibility can be expected.
    Equally exciting about this project is the wealth of data and the powerful algorithms it will create, which will provide leverage for several future research studies on autism that in turn will lead to new generations of methods for diagnosis and intervention.

  • In Your Own Voice: Personal Augmentative and Alternative Communication Voices for Minimally Verbal Children with Autism Spectrum Disorders
[Jan van Santen, Lois Black, PI's; Alexander Kain, Esther Klabbers, Investigators; Nancy Lurie Marks Family Foundation]. Many children with autism who have limited verbal abilities use Augmentative and Alternative Communication (AAC) devices to help them communicate with others. Often, these devices produce speech output. Necessarily, the voice of such a system does not resemble in any way the voice of the child who uses the system. This project is for children who have at least some speech capability, such as saying a few isolated words. The investigators will develop technology that performs a voice transplant of the child's natural voice onto the AAC device, so that the device's voice will sound like the child. The investigators hypothesize that an AAC device with a personalized voice that mimics the child's voice will psychologically reinforce powerful motivational factors and a sense of ownership for communication, so that the frequency and richness of AAC use, and its acceptance by family members and friends, will be enhanced. In addition, as a tool for improving a child's speech capabilities, a system that speaks with a voice similar to the child's own voice is likely to be more effective than a system that speaks with a default synthetic voice, because the computer provides a model that is closer to the child's speech and hence is easier for the child to emulate. To create the system, the investigators will build on the most recent voice transformation, speech synthesis, and other speech technologies that have been developed by our group.

  • Automated Measurement of Dialogue Structure in Autism
[Brian Roark, Lois Black, Jan van Santen; Autism Speaks]. This project seeks to bring the power of machine-based sensing and computation to improve the study of speech patterns in individuals with autism. By combining technologies stemming from natural language processing methods and prosodic analysis methods, the study expects to find aspects of speech that could be used as clinical markers. Current manual measures of narrative coherence are not only difficult to obtain and extremely time-consuming, but it is also unclear whether a human coder can detect the statistical degree of semantic similarity and organization as well as a machine can. This research will analyze recordings being collected from two narrative recall tests that have the potential to uncover a wider range of speech differences between children with ASD and others. The hope is that this will clinically define children with ASD relative to typically developing children and differentiate ASD from other groups who also have communication impairments, i.e., children with developmental language delay (DLD), as well as differentiate speech characteristics or markers that might better discriminate subtypes within the ASD umbrella (e.g., HFA vs. Asperger's). We expect that speech and language technologies will not only make critical diagnostic speech features easier to document but may also uncover distinguishing speech features in autism and autistic subtypes that have previously gone undetected.

  • ERP Based Communication Device for Nonverbal Children on the Autism Spectrum
[Deniz Erdogmus, Lois Black, PI's; Brian Roark, Jan van Santen, Investigators; Nancy Lurie Marks Family Foundation].  Children with Autism Spectrum Disorders (ASD) exhibit varying levels of communication abilities. In this project, the investigators will address the communication needs of the subset that: 1) lack expressive speech and language; 2) lack ability to operate a keyboard, pointing device, or other typical assistive interface; and 3) are assumed to have adequate cognition, literacy, and receptive language understanding. This research aims to develop a communication system for such children. Resulting technology could also benefit other children and adults with adequate cognition but limited communication options. The investigators will develop an assistive communication facilitation device referred to as the RSVP Keyboard. It unites three technologies: 1) Rapid serial visual presentation (RSVP, with individually adjustable presentation rates) of letters/words/phrases; 2) a yes/no intent detection mechanism based on detecting evoked-response potentials (ERP) in the brain to determine which target letter or letters the child wants to convey; 3) a statistical language model based dynamic sequencing optimization procedure that computes which letter needs to be presented next to take advantage of regularities in language. The system will operate by showing the sequence of candidate letters on the screen as well as previously typed text, such that words and phrases are formed naturally by adding selected letters. The first goal is to test the viability of the basic concept of facilitated communication through the RSVP Keyboard System. Upon demonstration of feasibility through neuroimaging and statistical analysis of brain responses to RSVP stimuli sequences, the investigators will evaluate performances of typically developing children and nonverbal children with ASD in three interactive cognitive tasks.
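    The language-model-based sequencing step can be illustrated with a toy sketch. The description above does not specify the model, so everything here is an assumption made for illustration: a character bigram model with add-one smoothing, trained on a tiny made-up corpus, ranks candidate letters so that the letters most likely to follow the typed text are presented first. The ERP-based yes/no detection is not modeled.

```python
from collections import Counter, defaultdict
import string

# Hypothetical symbol set: lowercase letters plus space.
ALPHABET = string.ascii_lowercase + " "


def train_bigrams(text):
    """Count how often each symbol follows each other symbol.
    The text is treated as if it were preceded by a space."""
    counts = defaultdict(Counter)
    prev = " "
    for ch in text.lower():
        if ch not in ALPHABET:
            continue  # skip punctuation and digits
        counts[prev][ch] += 1
        prev = ch
    return counts


def presentation_order(counts, typed):
    """Return all symbols, most probable next symbol first, given the
    last symbol typed so far (add-one smoothed bigram estimate)."""
    prev = typed[-1].lower() if typed else " "
    following = counts[prev]
    total = sum(following.values()) + len(ALPHABET)
    probs = {ch: (following[ch] + 1) / total for ch in ALPHABET}
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(ALPHABET, key=lambda ch: (-probs[ch], ch))


if __name__ == "__main__":
    model = train_bigrams("the then them there")
    print(presentation_order(model, "t")[:3])
```

    With this toy corpus, the letter "h" is ranked first after "t" has been typed. A practical system would presumably use a much stronger language model and would combine its predictions with the evidence from the brain responses.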

  • Differentiating between Autism Spectrum Disorder and Developmental Language Disorders via Story Recall Analysis
[Brian Roark, PI, Medical Research Foundation of Oregon]. The analysis of elicited spoken language samples plays a key role in the diagnosis of a wide  range of linguistic and cognitive impairments, from developmental impairments, such as Developmental Language Disorders (DLD) or Autism Spectrum Disorder (ASD), to degenerative cognitive impairments, such as dementia.  Perhaps the most popular means of  eliciting such a sample is through a narrative recall task, where the subject is told a story of sufficient length to preclude verbatim recall, and then asked, either immediately or after some delay, to retell the story they have been told.  Most clinical uses of such tests involve a very simple scoring mechanism, in which the recall of specific items in the story is noted by the administering clinician (as the story is being re-told), and summary scores are calculated based on the number of these recalled items.  The resulting summary score fails to capture much of the potentially relevant information available in the spoken language sample, e.g., grammatical complexity, pause frequency, or the ordering of recalled items. The long-term objective of the proposed work is to identify multiple complex markers,  derived from open and cued responses to narrative recall tasks, for differentiating between: (1) children broadly diagnosed with ASD; (2) children broadly diagnosed with DLD; and (3)  normally developing children.  In the proposed study, narrative retellings produced by a relatively limited number of children will be analyzed for the feasibility of automatically extracting markers from the spoken language samples to effectively discriminate between the three groups.
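    The difference between the usual summary score and a richer marker such as the ordering of recalled items can be illustrated with a small sketch. The story items and retellings are invented, and the pairwise order-agreement measure is just one possible ordering marker chosen for illustration; the project description does not specify which markers will be used.

```python
def recall_score(story_items, retold_items):
    """Clinical-style summary score: how many story items were recalled."""
    return len(set(story_items) & set(retold_items))


def order_agreement(story_items, retold_items):
    """Fraction of recalled item pairs retold in the original story order.

    Returns None when fewer than two distinct items were recalled.
    """
    position = {item: i for i, item in enumerate(story_items)}
    recalled = []
    for item in retold_items:
        # Keep only story items, and only their first mention.
        if item in position and item not in recalled:
            recalled.append(item)
    pairs = 0
    in_order = 0
    for i in range(len(recalled)):
        for j in range(i + 1, len(recalled)):
            pairs += 1
            if position[recalled[i]] < position[recalled[j]]:
                in_order += 1
    return in_order / pairs if pairs else None


story = ["dog", "park", "ball", "rain", "home"]   # hypothetical story items
retelling_a = ["dog", "ball", "home"]
retelling_b = ["home", "dog", "ball"]
```

    Both retellings receive the same summary score of 3, but the first preserves the story order completely while the second does so for only one of three item pairs, which is exactly the kind of information a count of recalled items discards.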


OTHER NEURODEVELOPMENTAL DISORDERS

  • Investigating the Diagnostic Utility of Spontaneous Measures of Language
[Amy Costanza-Smith (Child Development and Rehabilitation Center, OHSU), mentored by Lois Black and Jan van Santen; Medical Research Foundation of Oregon]. A language disorder is an impairment in communication characterized by poor grammar, poor vocabulary and/or poor social use of language. Language disorders affect nearly 4 million school-age children and put them at risk for further learning disabilities. These disorders are typically diagnosed via standardized assessments (psychometric tests) that bear little resemblance to real-life communication. Collections of real-life communication, called spontaneous language samples, are also used to describe language abilities. Language samples are rarely used in diagnosis because of the time it takes to transcribe and analyze them, even though they provide a rich context in which to assess a child’s language and often give more accurate information than standardized assessments.
    The purpose of this project is to determine the diagnostic ability of spontaneous language measures (e.g., vocabulary, grammar) to differentiate between children with language disorders and typically developing children. It is hypothesized that the real-life richness of spontaneous language samples will provide adequate information to diagnose language disorders. The results of this project will move toward the ultimate goal of developing new markers of language disorder, capitalizing on recent advances in technology to develop automated scoring procedures. These results have broader implications for the study of human communication disorders, including adult-onset disorders such as aphasia, dementia, and Parkinson’s disease.
    This project uses data currently being collected in a larger NIH-funded project on autism and developmental language disorders in children.


  • Computer assisted disfluency counts for stuttered speech - Phase I
[Peter Heeman, NIH; joint with BioSpeech Inc.]. Stuttering is a communication disorder characterized by disfluencies that are frequent and disruptive to communication. Clinicians extensively use disfluency counts to decide whether a client should be treated, to assess treatment progress, and to document treatment outcomes. Clinicians often do disfluency counts in real-time as a speaker is talking. However, these are not very specific, and cannot be re-examined. Clinicians can also use a verbatim transcription approach, in which they first transcribe exactly what was said, and then mark up the transcript with disfluency codes. This method allows more detailed and accurate counts to be obtained. The long term objective of this project is to build a computer tool that will assist clinicians in performing disfluency counts, both real-time and transcript based. The tool will allow both richer use of these counts, and, in the case of transcript-based counts, much less effort to create the counts. In fact, the amount of time should be reduced enough to enable transcript-based counts to be used in clinical practice.
    The goal of this Phase I STTR is to demonstrate the feasibility of a computer tool to assist users in performing both real-time and transcript-based disfluency counts. For real-time counts, we will show that the tool achieves the same reliability and user acceptance as the pencil-and-paper method. We will also investigate whether the real-time counts can be re-examined and corrected (unlike the pencil-and-paper counts). For the transcript-based counts, we will show that the tool, for at least read-speech samples, allows the counts to be done substantially faster and with better reliability than current approaches. This will be achieved by incorporating an Automatic Speech Recognizer (ASR) that will use the story text to assist in transcribing the speech, and by incorporating a powerful user interface that allows the clinician to easily review and correct the automatic transcription. In Phase II, we will demonstrate the increased utility of disfluency counts that are stored in a computer file and time-aligned to the audio signal. We will extend the tool so that it can compare disfluency counts across multiple audio files. This will help clinicians better see the impact of their treatment on the client's disfluency patterns over time. We will also demonstrate that the tool can assist clinicians with transcript-based counts of spontaneous speech. Again, we will incorporate an ASR to assist in the transcription process, and we will show that the tool allows transcript-based counts to be performed in substantially less time than current approaches. Furthermore, we will have several clinicians use the tool with clients over a period of several months. This will demonstrate both the usefulness and practicality of the overall tool, and allow us to determine how to improve and augment it to best suit clinical needs.
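    A transcript-based count of the kind described above can be sketched in a few lines. The bracketed codes ([REP] for repetition, [PRO] for prolongation, [BLK] for block) are a hypothetical coding scheme invented for this example, not the one used by the project or by any particular clinical protocol.

```python
import re
from collections import Counter

# Hypothetical disfluency codes written in square brackets, e.g. [REP].
CODE = re.compile(r"\[([A-Z]+)\]")


def disfluency_counts(marked_transcript):
    """Return (counts per code, number of words, disfluencies per 100 words)."""
    codes = Counter(CODE.findall(marked_transcript))
    # Remove the codes before counting the words that were actually spoken.
    words = CODE.sub(" ", marked_transcript).split()
    total = sum(codes.values())
    rate = 100.0 * total / len(words) if words else 0.0
    return codes, len(words), rate


codes, n_words, rate = disfluency_counts("I [REP] I want [PRO] to go [BLK] home")
# codes has one REP, one PRO and one BLK; n_words is 6; rate is 50.0
```

    Because the counts are derived from a stored transcript rather than tallied live, they can be recomputed, broken down by code, or compared across sessions at no extra cost to the clinician.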

  • Novel Computerized Behavioral Assessment Methods for Attention Deficit Hyperactivity Disorder.
This internally funded exploratory project, conducted by Lois Black, Holly Jimison, Leeza Maron, Misha Pavel, and Jan van Santen (PI), focuses on building a computerized assessment system that has the following features:
  1. A clear understanding of which neuropsychological functions are measured.
  2. Interactivity (the computer adapts its behavior instantly to the subjects' responses, thereby being able to operate at a level of optimal sensitivity).
  3. Instantaneous and timed measurement of a range of behavioral responses including the force dynamics of button pushing and eye movements.
  4. Mathematical modeling of the underlying cognitive processes in order to derive "purer" measures of the neuropsychological functions.
  5. A more motivating and shorter assessment process.
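
    The interactivity in feature 2 can be illustrated with a simple up-down staircase, one common adaptive-testing rule. The project description does not say which adaptation rule the system uses, so the rule, the difficulty scale, and all parameter values here are assumptions made for illustration.

```python
def staircase(responses, start=5, step=1, lowest=1, highest=10):
    """Return the difficulty level presented on each trial.

    responses: sequence of booleans, True for a correct response.
    After a correct response the next trial is harder; after an error
    it is easier, so the level hovers near the subject's threshold.
    """
    level = start
    presented = []
    for correct in responses:
        presented.append(level)
        level = level + step if correct else level - step
        level = max(lowest, min(highest, level))  # stay on the scale
    return presented


levels = staircase([True, True, False, True])  # [5, 6, 7, 6]
```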

  • Pilot Study for Word Recognition of Children with Speech Delay
[John-Paul Hosom, PI; Medical Research Foundation of Oregon]. Children with speech delay of unknown origin (hereafter referred to as "speech delay") are characterized by a number of language problems, including reduced vocabulary size, atypical grammar, and highly unintelligible speech. The long-term objective of the proposed research is to enable children with speech delay to communicate more effectively. This proposal presents only the first step in realizing this long-term objective. In this first step, speech data from a limited number of children with speech delay will be analyzed to evaluate the feasibility of automatically identifying acoustic features in the speech signal that may be used to identify intended phonemes. The hypothesis of the proposed research is that there are correlations between intended phonemes and certain acoustic features of children with speech delay, when the intended phoneme is not the same as the phoneme actually spoken. Such correlations could then be used to assist in the automatic word recognition of an intended utterance.


NEURODEGENERATIVE DISORDERS



  • Quantitative Modeling of Segmental Timing in Dysarthria
[Jan van Santen and Kris Tjaden [University at Buffalo], PI's; NIH]. Quantitative, acoustic models of segmental timing in spoken English, such as have been developed for text-to-speech synthesis (TTS), acknowledge that segment durations in connected speech reflect the combined influence of systematic factors as well as nonsystematic or random factors. Systematic variability in segment durations reflects factors such as context, stress, speaking style or register, and cognitive load. Segment durations also reflect within-speaker variability, termed "random variability", that cannot be attributed to any of these systematic factors. An individual talker's speech duration patterns therefore can be mathematically characterized in terms of the magnitude of the effects of each systematic factor (e.g., amount of lengthening associated with word stress), as well as in terms of the relative and absolute amounts of systematic and random variability. Importantly, this powerful modeling framework can be applied to meaningful sentence productions, and is capable of isolating the effects of individual systematic factors without requiring the use of artificial speech materials. This approach to quantitatively modeling segmental timing in TTS has further proven crucial for successfully synthesizing intelligible, natural-sounding speech.
    Given the importance of this modeling framework for generating high quality speech synthesis, it is surprising that similar modeling efforts have not been applied to dysarthria as a means of understanding the source of reduced intelligibility and naturalness in this speech disorder.  Aberrancies in the temporal patterning of speech are ubiquitous in most persons with dysarthria, and the contribution of speech duration variables to intelligibility and naturalness is suggested in a variety of studies.  The approach used in many existing studies is to document whether speech durations in dysarthria are - on average - atypically short, long or variable as compared to normal speech.  The TTS modeling framework described above, however, goes beyond this type of simple description to identify the relative contribution of specific systematic factors influencing segment durations for an individual speaker as well as the combined relative and absolute contributions of systematic and random factors to segmental timing for that individual.  The TTS modeling framework further allows model parameters for an individual speaker to be manipulated via speech synthesis to determine the impact on intelligibility and naturalness.  The proposed exploratory project seeks to apply such a quantitative modeling framework to segment durations in sentences produced by speakers with a variety of neurological diagnoses and dysarthrias.  The perceptual relevance of model parameters will be further studied via speech resynthesis to determine their impact on judgments of intelligibility and naturalness.
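    The split between systematic and random variability can be illustrated with a deliberately simplified sketch. It replaces the parametric TTS duration models referred to above with plain per-context means, and the contexts and durations are made-up values; it is meant only to show what "relative and absolute amounts of systematic and random variability" means numerically.

```python
from collections import defaultdict
from statistics import fmean


def decompose(observations):
    """Split duration variance into systematic and random parts.

    observations: list of (context, duration_ms) pairs, where context is
    any hashable description of the systematic factors (e.g. stress).
    Returns (between-context variance, within-context variance); the two
    sum to the total variance of the durations.
    """
    groups = defaultdict(list)
    for context, duration in observations:
        groups[context].append(duration)
    n = len(observations)
    grand_mean = fmean(d for _, d in observations)
    group_mean = {c: fmean(ds) for c, ds in groups.items()}
    systematic = sum(
        len(ds) * (group_mean[c] - grand_mean) ** 2 for c, ds in groups.items()
    ) / n
    random_part = sum((d - group_mean[c]) ** 2 for c, d in observations) / n
    return systematic, random_part


# Hypothetical vowel durations (ms) for one speaker.
data = [("stressed", 120), ("stressed", 100), ("unstressed", 60), ("unstressed", 80)]
systematic, random_part = decompose(data)  # 400.0 and 100.0
```

    Here four fifths of the duration variance is attributable to stress. Under this simplified view, a speaker whose timing is disrupted would be expected to show a smaller systematic share, a larger random share, or both.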


  • Measuring Spoken Language Variability in Elderly Individuals
[Brian Roark, PI; John-Paul Hosom, Susan Kemper [University of Kansas], Diane Howieson, co-PI's; NSF]. The focus of this project is to develop techniques to objectively (automatically) measure spoken language variability and change in aging. Many of the most effective methods for cognitive assessment are mediated by observed behavior, particularly spoken language production. These include clinical instruments, e.g., the Mini-Mental State Examination (MMSE), but also less formal assessments involving interviews or dialogs with physicians or even friends and family. Behavioral changes noted through these spoken language interactions could indicate pathological changes associated with a disorder, or the changes may be transient, due to missed medication or depression at the time of assessment. Alternatively, the observed behavior may simply reflect normal change in spoken language due to aging, or may even fall within the range of natural behavioral variation. Understanding normal versus pathological language change with age requires the collection and annotation of repeated samples from both healthy and impaired individuals. This project has three specific aims: 1) to collect and transcribe longitudinal spoken language sample data elicited in multiple ways from diverse elderly adults; 2) to develop algorithms for automatically extracting features from these spoken language samples; and 3) to characterize the variability of feature values across samples of the same individual, and the utility of feature values and even feature variances for discriminating between subject groups. A particular challenge being addressed by this research is to achieve high-quality, efficient automatic annotation of discourse structure for the spoken language samples. The resulting methods are expected to directly contribute to important behavioral assessment applications.


  • New Methods to assess social, cognitive, and physical function in older persons
[Thomas Glass, PI [Johns Hopkins University]; Zak Shafran, Investigator; NIH]. The aim of this 4-year project is to develop and test a new personal monitoring device to measure physical, social and cognitive functioning in continuous time and in real life settings. The proposed device, called the LIFEmeter, combines four technologies: accelerometry (motion sensors), digital audio recording (for capturing speech), automatic speech recognition (for fast efficient analysis of speech), and location identification (to explore environmental influences on function). This light-weight, compact, and wearable device will be tested and validated in three phases of data collection. We will also construct the first automatic speech recognition (ASR) system designed to transcribe and analyze the natural speech of older adults. New metrics and methods will be developed to analyze complex time-embedded data on functioning. The proposed system will overcome biases and limitations found in current self-report techniques. Our team combines expertise from the MIT Media Lab (sensors and wearable computing), the JHU Center for Language and Speech Processing (ASR) and the Bloomberg School of Public Health, which is uniquely able to deliver an innovative approach to measuring complex function with potentially broad applicability.
    The proposed research builds on a previous 5-year cohort study (The Baltimore Memory Study, AG19604) of community-dwelling older adults. Existing data from this study allow us to compare and validate measures of function obtained from our new device against a range of self-reported and clinically-measured outcomes. We will also validate our instrument against the most widely used accelerometer (called ActiGraph). Data gathered using our new device will allow us to study the impact of the built and social environment on functioning with improved precision. A key goal will also be to create and disseminate tools that allow other investigators to adopt, refine and test this new approach.


  • Spoken Language Markers for Social Engagement
[Zak Shafran, PI; Roybal Pilot Grant]. Health, quality of life and treatment outcomes in older adults have been shown to be influenced by their level of social engagement in personal relationships and activities (both positive and negative) with family members, peers, community members, local institutions, and, at the broadest level, society. Current measures of social engagement are based on self-report and rely heavily on the cognitive and memory function of the subject, so they suffer from inaccuracies. Further, they do not provide the fine-grained information necessary to design interventions and treatments. While advances in sensor technology are being exploited to augment self-report-based assessment of the physical abilities of older adults, such advances have not been realized in the assessment of social engagement because of inherent difficulties in characterizing an individual’s network of social support. This network is multi-faceted and is mediated through several different types of communication, including emails, financial transactions, and conversations with a wide range of persons, including family members, friends, medical personnel, and business associates. Of these types of communication, adult humans rely on conversations for most social interactions. Advanced speech and language technology now gives us the capability to characterize these interactions, using conversations as a source of data reflecting social engagement.
    Our long-term goal is to design a computational framework to measure social engagement that accounts for variations in size, type and nature of an individual’s social network using conversations as our data source.
    Our research objective in this proposal is to create algorithms that detect spoken language markers in an older adult’s everyday conversations that are indicative of an individual’s social engagement. The three specific aims of this proposal are:
1. to determine the feasibility and acceptability of collecting conversational speech to assess social engagement of older adults,
2. to design algorithms to detect spoken-language markers of social engagement from conversations, and
3. to identify the spoken-language markers of social engagement.

  • Making Dysarthric Speech Intelligible
[Jan van Santen, PI; John-Paul Hosom, Melanie Fried-Oken, co-PI's]. This NSF-funded project [joint with the Child Development and Rehabilitation Center at the Oregon Health & Science University] will develop new algorithms that will enable dysarthric individuals to be more easily understood. Currently available devices are essentially spectral filters and amplifiers that enhance certain parts of the spectrum. While these can help with certain types of dysarthria, many dysarthric persons suffer from speech problems that require much more profound and complex forms of speech modification, such as: irregular sub-glottal pressure, resulting in loudness bursts that can be difficult to adjust to; absence, or poor control, of voicing; systematic mispronunciation of certain phoneme groups, resulting in certain sounds becoming indistinguishable or unrecognizable; variable mispronunciation; and poor prosody (pitch control, timing, and loudness). For these difficult problems, new approaches are needed that do not merely filter the speech signal but analyze it at acoustic, articulatory, phonetic, and linguistic levels.

  • Automatic spoken language analysis for detecting cognitive impairment
[Brian Roark, PI]. Clinical research into Alzheimer's disease (AD) and the mild cognitive impairment (MCI) that precedes its full onset is increasingly focused on early diagnosis and treatment that can delay or even prevent full onset of AD. Effective diagnosis requires differentiating between changes in cognitive and linguistic abilities that occur during normal aging and those that are due to impairment. Both manual linguistic analyses of spoken language samples and orally administered clinical exams are effective but costly methods for discriminating between healthy and MCI subjects. For widespread testing of the growing elderly population for markers of MCI, automation of testing procedures will be required.
    The objective of the NIH-Roybal-funded project will be to develop statistical speech and language analysis techniques to automatically extract features from spoken language samples recorded during clinical examinations. Healthy and MCI elderly subjects of on-going studies at the Layton Center of OHSU take full neuropsychological examinations annually for life. We will request their permission to record and analyze these sessions, which include several tests of particular interest, including a delayed story recall test and a picture description task. We will transcribe the words and annotate syntactic structure for selected tests, and develop algorithms for automatically deriving features from the spoken language samples. These automatically-derived speech- and language-based features will then be used to build classifiers for discriminating between healthy and MCI subjects. In addition to test automation, the statistical speech and language processing techniques will provide two benefits of primary importance: inclusion of approximations to previously researched manually-derived features; and the use of unexplored features derived from statistical characteristics of the samples, such as a number of entropy-based features.
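    As an illustration of what an entropy-based feature might look like, the sketch below computes the entropy of the word distribution in a transcript. The choice of unigram word entropy is an assumption made for this example; the project description does not say which entropy-based features will be used.

```python
import math
from collections import Counter


def word_entropy(transcript):
    """Entropy, in bits, of the word-frequency distribution of a transcript.

    Higher values indicate a more varied vocabulary; a transcript that
    repeats a single word has entropy 0.
    """
    words = transcript.lower().split()
    if not words:
        return 0.0
    n = len(words)
    counts = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


varied = word_entropy("the boy took the cookie from the jar")
repetitive = word_entropy("the the the the")  # 0 bits
```

    A feature like this can be computed automatically from any transcript and then passed, alongside other features, to a classifier that discriminates between subject groups.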



  • Voice Transformation for Dysarthria - Phase I
[Jan van Santen, PI; Alexander Kain, co-PI; NIH]. A large percentage of the more than 2.5 million adult Americans with significant disability due to chronic neurological impairment present with dysarthria or speech impairment as one of their disabling conditions. There are no cures for speech impairments. Dysarthric individuals report losses to employment, educational opportunities, social integration, and quality of life. Individuals are taught strategies that compensate for their impairments, but the isolation caused by communication impairment is pervasive. The project goal is to develop a system that uses a wearable computer to transform speech compromised by dysarthria into easier-to-understand and more natural-sounding speech, and will thereby enable dysarthric individuals to communicate more effectively by telephone or in face-to-face contexts. 
    Software will be developed, in a collaborative project with BioSpeech Inc. supported by the NIH, that transforms speech compromised by dysarthria into easier-to-understand and more natural-sounding speech. The software will reside on laptop computers, with microphone input and amplified speaker or line output. Such software and hardware solutions will assist individuals with dysarthria to better communicate by voice, whether face-to-face or by telephone; they will also help these individuals when interacting with voice-controlled services and devices, which are increasingly popular. The system operates in "Interpreter Mode", meaning that output will take place after a brief processing delay once the speaker has completed an utterance. The software is based on a multi-step formant re-synthesis process: (i) Robust extraction of formant, energy, spectral balance, and pitch trajectories from input speech; (ii) Modification of extracted trajectories by imposition of smoothness and shape based constraints, and by bringing these trajectories in closer proximity to trajectories of normal speech; (iii) Conversion of the trajectories into a speech signal by formant synthesis. Results obtained with a prototype, personal computer based system show that this process is robust, enhances intelligibility, and completely eliminates "vocal fry", i.e., distortions caused by irregularities in the temporal pattern of the vocal folds.
    In Phase I, the core algorithms performing these steps will be improved and extended, and the software will be ported to a pocketable computer; the resulting system will be evaluated on multiple speakers and listeners; and feedback will be obtained from potential users and their partners about desired features, usability, and functionality. In Phase II, acceptable processing delays will be achieved using known methods for optimizing memory and processing speed; further enhancement capabilities will be added, and the system will be evaluated. The currently targeted product will be the first in a family of speech enhancement products with continually expanding functionality, capitalizing on ongoing algorithmic and hardware improvements. Usage of standard hardware and software platforms, which in turn are compatible with a wide range of headsets and wearable amplified speakers or telephones, puts this software in a strong competitive position.
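    Step (ii) of the re-synthesis process described above, the modification of extracted trajectories, can be sketched as follows. Median smoothing and linear interpolation toward a reference are stand-ins chosen for simplicity; the actual smoothness and shape constraints are not described here, and the formant values are invented.

```python
from statistics import median


def median_smooth(track, width=3):
    """Smooth a trajectory with a sliding median (window shrinks at the edges)."""
    half = width // 2
    return [
        median(track[max(0, i - half): i + half + 1])
        for i in range(len(track))
    ]


def pull_toward(track, reference, weight=0.5):
    """Move each frame part of the way toward a reference trajectory.

    weight=0 leaves the track unchanged; weight=1 replaces it with the
    reference. Both sequences are assumed to be time-aligned.
    """
    return [(1 - weight) * x + weight * r for x, r in zip(track, reference)]


# Hypothetical first-formant track (Hz) with one spurious jump.
f1 = [500, 510, 900, 520, 530]
smoothed = median_smooth(f1)          # [505.0, 510, 520, 530, 525.0]
typical = [520, 520, 520, 520, 520]   # made-up "normal speech" reference
modified = pull_toward(smoothed, typical, weight=0.5)
```

    The smoothing removes the isolated jump at the third frame, and the interpolation then brings the whole trajectory closer to the reference before it would be handed to the synthesis step.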

  • User Adaptation of AAC Device Voices - Phase I
[Jan van Santen, PI; Esther Klabbers, co-PI; NIH; joint with BioSpeech Inc.]. A wide range of individuals cannot communicate by voice. Voice-enabled Augmentative and Alternative Communication (AAC) devices, also known as Speech Generating Devices (SGDs), are often the only channel available by which these individuals can communicate. While many voice-enabled AAC devices are currently available, they lack the important ability to generate customized speech that mimics aspects of the user's past or intermittently available speech. Modern "concatenative" speech synthesis technology can mimic a given speaker's voice by excising speech fragments from a recorded speech database ("acoustic inventory") and recombining these into output speech using sophisticated algorithms. It requires, however, a large amount of recordings and a high degree of consistency of pronunciation from the speaker. Many AAC users cannot meet these requirements because they have already lost the capability to speak or they cannot speak with adequate consistency of pronunciation. A new type of technology, voice transformation (VT) technology, is available that can transform speech spoken by a "source" speaker into speech that is perceived as spoken by a specific "target" speaker. To tune the transformation system, parallel "training recordings" of the same text are needed from the source and target speakers. The amount of training recordings is far less than what is needed for a high-quality acoustic inventory.
    In this joint project with BioSpeech Inc., supported by the NIH, we propose to use VT in combination with speech synthesis to convert the synthesis system's acoustic inventory into an acoustic inventory that mimics the target speaker's voice. The training recordings can consist of old home videos, or fragmented recordings produced during periods of intact speech, provided that they contain at least one sample of each phoneme. In Phase I, we will develop and evaluate a VT based synthesis system. The project will use high-quality and home-video quality recordings from male and female adults and children to create limited acoustic inventories (adequate to generate a specific set of test sentences) and VT training recordings. Perceptual experiments will be conducted to evaluate voice quality and perceived speaker identity. Phase II will focus on developing complete acoustic inventories for several canonical speakers that will be selected to cover a range of speaker characteristics, and on producing portable, user-friendly software.
    
  • Automated Test of Word Recognition - Phase II
[Robert Margolis, University of Minnesota, PI; John-Paul Hosom, Investigator]. Over 5 million word recognition tests are administered annually by audiologists in the United States with an associated cost of more than $100 million. These tests are currently performed manually by highly trained audiologists. This NIH-funded project describes the Phase II development of automated clinical speech recognition tests using clinical test recordings and an automated speech recognition system to score the subjects' responses. A method for automatically interpreting the test scores will also be evaluated. The objectives are to increase the accuracy and efficiency of these clinical tests, substantially reduce the cost, and provide an objective, automatic, evidence-based method for interpreting the results. The automated speech recognition test in combination with the automated pure tone audiogram (currently an STTR Phase II project) will perform diagnostic testing of a majority of audiology patients, freeing the audiologists' time for activities that require their training and skill. Contemporary changes in training and reimbursement patterns create a high demand for automated clinical procedures. The automated procedures are implemented on existing commercial audiometers with a personal computer that controls the audiometer delivery and routing of stimuli. Phase I results were obtained with automatic speech recognizers that were trained on a limited number of subjects (n=9). Estimates of the agreement between human and machine scoring ranged from 82-93%. Additional refinements with benefits that are predictable from prior experience will increase recognizer performance to a level that equals or exceeds human-human agreement and provide the basis for efficient and accurate clinical tests. In Phase II, an automatic speech recognition threshold test will be compared to the manual method used in routine clinical practice. 
Two different recognizer scoring strategies will be developed, one that requires more test time but is independent of individual speaker differences and is easily adaptable to other languages, and one that requires less time but may not be applicable to all patients. A pilot study will test the method on a Spanish-language speech-recognition test.

  • Speech Supplemented Word Prediction Program - Phase II
[Thomas Jakobs, InvoTek, PI; John-Paul Hosom, Investigator]. Commercial speech recognition software offers many people with physical limitations an important computer access method. While this access method is reasonably reliable for people with typical speech, people with motor speech disorders (dysarthria) are presently not able to use this technology reliably. The purpose of this NIH-funded research is to provide these people with a unique assistive-device access method that utilizes their speech. We will accomplish this by continuing to develop a Speech Supplemented Word Prediction Program (SSWPP) that enables people with dysarthria to use their speech capabilities to interact with personal computers, with an emphasis on assisted writing. The central element of the SSWPP is custom speech-recognition software used in conjunction with word prediction. The feasibility results for the SSWPP developed during Phase I are encouraging. The average keystroke savings achieved by people with dysarthria on typical sentences was 68%. Commercially available word prediction programs achieved no better than 47% keystroke savings on the same text. Phase II design activities include improving the speech recognition engine, developing an optimized microphone interface, integrating the SSWPP into Microsoft Word, and developing a speech-to-text display for use in face-to-face communication. People with disabilities will evaluate the new SSWPP. The Speech Supplemented Word Prediction Program is a tool for people with disabilities who also have difficult-to-understand speech. It enables these people to use their speech to reduce the amount of work required to enter text into a computer and to communicate verbally more effectively.
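    The keystroke-savings figures quoted above can be read with the help of a small worked example. It assumes the usual definition of the measure, the percentage of keystrokes avoided relative to typing every character of the text; the description above does not spell out how the figure was computed, and the numbers in the example are made up.

```python
def keystroke_savings(text, keystrokes_used):
    """Percentage of keystrokes saved relative to typing every character."""
    baseline = len(text)  # one keystroke per character without prediction
    return 100.0 * (baseline - keystrokes_used) / baseline


# A 50-character sentence entered with only 16 keystrokes.
sentence = "x" * 50
savings = keystroke_savings(sentence, 16)  # 68.0
```

    Under this definition, 68% savings means roughly one keystroke for every three characters of text produced, compared with roughly one for every two at 47%.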

  • Automated voice-based cognitive assessment and spoken language-based markers for neurodegenerative diseases and Alzheimer's Disease Cooperative Study: Home-Based Assessments
This project (Tamara Hayes, PI; John-Paul Hosom, Investigator), funded under a new program of Intel's Digital Health Group called the Behavioral Assessment and Intervention Commons, is aimed at initiating and accelerating research into behavioral markers of disease, such as changes in walking, speech and performance on computer games, that eventually translate into health-related products and services. CSLU is developing voice-enabled, automated "kiosk"-based versions of standard neurocognitive tasks (e.g., digit span), as well as speech- and language-based markers for neurodegenerative diseases. Kiosk development is also supported by the Alzheimer's Disease Cooperative Study (ADCS; NIH) program (Jeff Kaye, OHSU PI; John-Paul Hosom, Investigator). The software architecture is designed and implemented by Senior Programmer Jacques de Villiers.