Current CSLU Research
Projects
CSLU conducts a wide range of research projects, including
projects focused on core speech processing and natural language
processing algorithms (technology
research projects) and projects focused on biomedical applications (biomedical research
projects), specifically on creation of diagnostic, remedial, and
assistive methods for neurodevelopmental and neurodegenerative
disorders and diseases.
I. Technology Research Projects
- Discriminative Syntactic Language Modeling:
Automatic Feature Selection and Efficient Annotation
[Brian Roark] The focus of this
NSF-funded project is on the effective use of parser-derived
and tagger-derived features within discriminative approaches to
language modeling for automatic speech recognition. Discriminative
language modeling approaches provide a tremendous amount of flexibility
in defining features, but the size of the potential parser-derived
feature space requires efficient feature annotation and selection
algorithms. The project has four specific aims. The first aim is to
develop a set of efficient, general, and scalable syntactic feature
selection algorithms for use with various kinds of annotation and
several parameter estimation techniques. The second aim is to develop
general tree and grammar transformation algorithms designed to preserve
selected feature annotations yet lead to faster parsing or even tagging
approximations to parsing. The third aim is to evaluate a broad range
of feature selection and grammar transformation approaches on a large
vocabulary continuous speech recognition (LVCSR) task, namely
Switchboard. The final aim is to design and package the algorithms to
straightforwardly support future research into other applications, such
as machine translation (MT); and into other languages, such as Chinese
and Arabic. The algorithms developed as a part of this project are
expected to contribute to improvements in LVCSR accuracy and
applications that rely upon this technology. The algorithms are being
packaged into a publicly available software library, enabling
researchers working in many application areas -- including LVCSR and MT
-- and various languages to investigate best practices in syntactic
language modeling for their specific task, without having to
hand-select and evaluate feature sets.
- Multi-Threaded
Dialogues For Real-Time Applications
[ Peter Heeman]. The goal
of this NSF-funded project is to create a speech interface that
supports
a user in interacting with multiple real-time devices at the same time,
where the interaction with each device is a separate dialogue thread.
The first aim is to show, using a human-computer study, that the simple
way to implement a speech interface for managing multiple threads is
not effective. The second aim is to run a human-human study to show
that people can inherently manage multiple dialogue threads, and to
determine what conventions they use. The third aim is to build a speech
interface that implements the conventions that were found.
The main impact of this work is the development of a
model that
accounts for how people deal with multi-threaded dialogues. This model
will be demonstrated in an implemented speech interface. This work will
create a technology that will be useful in interacting with the
pervasive electronic devices that we can expect to see in the future.
- Small Footprint Speech Synthesis
This NSF
Small Business Technology Transfer Phase I project is led by Alexander Kain at Biospeech Inc., a CSLU startup,
and Jan van Santen.
The project aims to develop
and implement a new algorithm in the area of text-to-speech synthesis
(TTS) that will lead to (i) dramatic decreases in disk and memory
requirements at a given speech quality level and (ii) minimization of
the amount of voice recordings needed to create a new synthetic voice.
Most current TTS systems operate by concatenating segments of recorded
speech ([acoustic] units). A challenge for TTS is coarticulation: The
dependency of the acoustic manifestations of a phoneme on its
neighbors. Current TTS systems use multi-phone acoustic units such as
diphones, which preserve coarticulatory patterns naturally present in
speech. However, this approach requires a large amount of recordings
and generates systems with large footprints. Biospeech proposes a
uniphone approach that addresses coarticulation processes with an
explicit model. The method uses complex spectral vectors (basis
vectors) representing brief segments of speech inside single phonemes,
and decomposes these into two components: A formant vector and a
spectral balance vector. To generate speech, the formant and spectral
balance vectors derived from the basis vectors corresponding to
successive phonemes are subjected to separate--and hence generally
asynchronous--interpolation operations using time varying weights; the
formant and spectral balance vector trajectories thus created are
re-combined to create a trajectory in complex spectral space; finally,
this trajectory is converted into output speech with the inverse
Fourier transform. Asynchronicity is necessitated by the
quasi-independence of articulators underlying different spectral
features (e.g., frication, formant frequencies).
The proposed work has implications for other speech technologies,
including Automatic Speech Recognition (ASR). Current ASR technologies
address coarticulation by using multi-phone units, typically triphones.
The number of triphones in English is over 70,000, so training requires a
large amount of recordings. The proposed model could
dramatically reduce the amount of recordings required for system
training. Second, TTS has generally recognized societal benefits for
universal access, education, and information access by voice. For
example, TTS-based augmentative devices are available for individuals
who have lost their voice; and reading machines for the blind have been
available for several decades. Third, the approach will make
higher-quality TTS more available for smaller devices. For example,
voice based caller ID on low-end mobile telephones is currently not
possible due to memory limitations. Fourth, it enables voice adaptation
with a minimum of recordings. This will enable building personalized
TTS systems for individuals with speech disorders who can only
intermittently produce normal speech sounds or for individuals who are
about to undergo surgery that will irreversibly alter their speech. The
method proffered by Biospeech only requires recordings of valid samples
of each of (fewer than 50) phonemes instead of each of (2000 or more)
diphones.
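The asynchronous interpolation step described in this entry can be illustrated with a minimal sketch. It is not Biospeech's implementation: the linear ramps, the window values, and the function names `ramp`, `interpolate`, and `transition` are assumptions made for illustration, and the re-combination into a complex spectrum and the inverse Fourier transform are omitted. It only shows how formant and spectral balance vectors can move between two phonemes on separate time-varying weights.

```python
def interpolate(a, b, w):
    """Linear blend of two equal-length vectors; w=0 gives a, w=1 gives b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def ramp(t, start, end):
    """Time-varying weight: 0 before start, 1 after end, linear in between."""
    if t <= start:
        return 0.0
    if t >= end:
        return 1.0
    return (t - start) / (end - start)


def transition(formant_a, formant_b, balance_a, balance_b, n_frames,
               formant_window=(0.2, 0.6), balance_window=(0.4, 0.9)):
    """Return n_frames (formant, balance) pairs moving from phoneme A to B.

    The two windows are hypothetical; giving each component its own window
    is what makes the two interpolations asynchronous. n_frames must be >= 2.
    """
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # normalized time in [0, 1]
        f = interpolate(formant_a, formant_b, ramp(t, *formant_window))
        s = interpolate(balance_a, balance_b, ramp(t, *balance_window))
        frames.append((f, s))
    return frames


# Made-up formant frequencies (Hz) and a one-dimensional balance value.
frames = transition([500.0, 1500.0], [300.0, 2200.0], [0.0], [1.0], n_frames=11)
```

Halfway through this transition the formant vector has already covered most of its path while the spectral balance vector has barely started, which is the kind of timing offset the entry attributes to quasi-independent articulators.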
- Objective
Methods for Predicting and Optimizing Synthetic Speech Quality
This NSF-funded project focuses on how
humans perceive acoustic discontinuities in speech. Current
text-to-speech
synthesis ("TTS") technology operates by retrieving intervals of stored
digitized
speech ("units") from a database and splicing ("concatenating")
them
to form the output utterance. Unavoidably, there are acoustic
discontinuities at the time points where the successive speech
intervals meet. An unsolved
problem is how to predict from the quantitative, acoustic
properties
of two to-be-concatenated units whether humans will hear a
discontinuity.
This is of immediate relevance for TTS systems that select units at run
time
from a large speech corpus. During selection, the systems search
through
the space of all possible sequences of units that can be used for the
utterance
and select the sequence that has the lowest overall objective cost
measure,
such as the Euclidean distance between the final frame and initial
frame
of two units. However, research has already shown that this method and
related
methods do not predict well whether humans will hear a discontinuity.
The
current research, by being explicitly focused on perceptually optimized
objective
cost measures, will directly contribute to the perceptual accuracy of
cost
measures and hence to synthesis quality.
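The baseline unit selection procedure described in this entry can be sketched as follows, under stated assumptions. The join cost is the Euclidean distance between the final frame of one unit and the initial frame of the next, and a dynamic-programming search picks the sequence with the lowest summed cost. The function names and the data layout (a unit as a list of frames, a frame as a list of floats) are hypothetical, and target costs used by real systems are left out. As the entry notes, this kind of cost does not predict perceived discontinuities well; the sketch only shows the measure that the project aims to improve on.

```python
import math


def join_cost(left_unit, right_unit):
    """Euclidean distance between the last frame of left_unit
    and the first frame of right_unit."""
    return math.dist(left_unit[-1], right_unit[0])


def select_units(candidates):
    """Pick one unit per position so that the summed join cost is minimal.

    candidates: list over utterance positions; each entry is a list of
    candidate units. Returns (total_cost, chosen_indices).
    """
    # best[j] = (cost of the cheapest path ending in candidate j, index path)
    best = [(0.0, [j]) for j in range(len(candidates[0]))]
    for pos in range(1, len(candidates)):
        new_best = []
        for j, unit in enumerate(candidates[pos]):
            cost, path = min(
                ((prev_cost + join_cost(candidates[pos - 1][i], unit), prev_path)
                 for i, (prev_cost, prev_path) in enumerate(best)),
                key=lambda item: item[0],
            )
            new_best.append((cost, path + [j]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Toy two-dimensional "frames"; the values are made up for illustration.
candidates = [
    [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]],
    [[[5.0, 5.0], [2.0, 2.0]], [[1.0, 2.0], [9.0, 9.0]]],
    [[[2.0, 2.0], [0.0, 0.0]], [[9.0, 9.0], [0.0, 0.0]]],
]
total_cost, chosen = select_units(candidates)
```

A perceptually optimized measure would replace `join_cost` while leaving the search unchanged, which is one reason to keep the two functions separate.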
- Prosody
Generation for Child Oriented Speech Synthesis
This NSF-funded project [joint with Alan
Black at Carnegie Mellon University and Richard
Sproat at the University of Illinois at Urbana-Champaign] focuses
on innovative algorithms for
generating
highly expressive synthetic speech. Generating expressive speech
involves
three hard research problems. (i) Computation of abstract tags
that
specify, e.g., which words need emphasis, and phrasing (e.g., where to
pause). (ii) Based on these tags, the system has to compute a
fundamental
frequency contour. (iii) Severe modification of the stored speech
fragments ("acoustic units") to obtain these contours. The central goal
of the project is to address these research problems, and create a TTS
system that will make the next generation of TTS based language
remediation
systems viable.
- Creating
the Next Generation of Intelligent Animated Conversational Agents
The goal of this NSF-funded project [joint with Ron
Cole at the University of Colorado and Javier
Movellan at the University of California at San Diego] is to
improve reading achievement of children with reading problems by
designing
computer-based interactive reading tutors that incorporate new speech
and
language technologies. The reading tutors will help English- and
Spanish-speaking
children learn to read by providing classroom teachers and reading
specialists
with tools to instruct and exercise the set of auditory, visual and
linguistic
skills needed to read: speech discrimination, speech production,
phonological
awareness, sound-to-letter mappings, vocabulary, fluency and
comprehension.
The tutors will be designed, tested and refined in collaboration with
reading
specialists and instructional designers, and tested with children in
special
education programs in elementary schools in Boulder, Colorado.
II. Biomedical Research
Projects
- Expressive and Receptive Prosody in Autism
This NIH-supported
project, led by Jan
van Santen and Lois
Black, and in collaboration with Rhea Paul
and Fred
Volkmar at Yale's Child Study Center
and Larry
Shriberg at the University of Wisconsin's Waisman Center,
focuses on automated technologies for assessment of prosodic ability in
autism. Autistic Spectrum Disorders (ASD) form a group of
neuropsychiatric
conditions whose core behavioral features include impairments in
reciprocal social interaction, in communication, and repetitive,
stereotyped, or restricted interests and behaviors. The importance of
prosodic deficits in the adaptive communicative competence of speakers
with ASD, as well as for a fuller understanding of the social
disabilities central to these disorders is generally recognized; yet
current studies are few in number and have significant methodological
limitations. The objective of the proposed project is to detail
prosodic deficits in young speakers with ASD through a series of
experiments that address these disabilities and related areas of
function. Key features of the project include: 1) the application of
innovative technology. The study will apply computer-based speech and
language technologies for quantifying expressive prosody, for computing
dialogue structure, and for generating acoustically controlled speech
stimuli for measuring receptive prosody; moreover, all experiments will
be delivered via computer to ensure consistency of stimuli and accuracy
of recording responses; 2) broad coverage of the dimensions of prosody.
All three functions of prosody, grammatical, pragmatic, and affective,
will be addressed; expressive and receptive tasks are included; and
both contextualized tasks (dialogue, story comprehension and memory)
and decontextualized tasks (e.g., vocal affect recognition) will be
used; 3) inclusion of neuropsychological assessment and classification
methodologies to address within-group heterogeneity and obtain a
detailed characterization of the groups; 4) inclusion of two comparison
groups: children with typical development and those with Developmental
Language Disorder; 5) inclusion of an experimental treatment program to
enhance the prosodic abilities of speakers with ASD. A student
fellowship for this project is supported by Autism
Speaks.
- In
Your Own Voice: Personal Augmentative and Alternative Communication
Voices for Minimally Verbal Children with Autism Spectrum Disorders
[ Jan van Santen, Lois Black, Nancy Lurie Marks Family
Foundation].
Many children with autism who have limited verbal abilities use
Augmentative and Alternative Communication (AAC) devices to help them
communicate with others. Often, these devices produce speech output.
Necessarily, the voice of such a system does not resemble in any way
the voice of the child who uses the system. This project is for
children who have at least some speech capability, such as saying a few
isolated words. The investigators will develop technology that performs
a voice transplant of the child's natural voice onto the AAC device, so
that the device's voice will sound like the child. The investigators
hypothesize that an AAC device with a personalized voice that mimics
the child's voice will psychologically reinforce powerful motivational
factors and a sense of ownership for communication so that the frequency
and richness of AAC use, and its acceptance by family members and
friends, will be enhanced. In addition, as a tool for improving a
child's speech capabilities, a system that speaks with a voice similar
to the child's own voice is likely to be more effective than a system
that speaks with a default synthetic voice because the computer
provides a model that is closer to the child's speech and hence is
easier to emulate by the child. To create the system, the investigators
will build on the most recent voice transformation, speech synthesis,
and other speech technologies that have been developed in their lab.
- Automated Measurement of Dialogue Structure in Autism
[ Brian Roark, Lois Black, Jan van Santen, AutismSpeaks].
This project seeks to bring the power of machine-based sensing and
computation to improve the study of speech patterns in individuals with
autism. By combining technologies stemming from natural language
processing methods and prosodic analysis methods, the investigators expect to find
aspects of speech that could be used as clinical markers. Current
manual measures of narrative coherence are not only difficult
to obtain and extremely time consuming; it is also unclear whether a
human coder can detect the statistical degree of semantic
similarity that a machine can. This research will analyze recordings
being collected from two narrative recall tests that have the potential
to uncover a wider range of speech differences between ASD and others.
The hope is that this will clinically define children with ASD relative
to typically developing children and differentiate ASD from other
groups who also have communication impairments, i.e., children with
developmental language delay (DLD), as well as differentiate speech
characteristics or markers that might better discriminate subtypes
within the ASD umbrella (e.g., HFA vs. Asperger's). We expect that
speech and language technologies will not only make critical diagnostic
speech features easier to document but also may actually uncover
distinguishing speech features in autism and autistic subtypes that
have previously gone undetected.
- ERP Based Communication Device for Nonverbal Children
on the Autism Spectrum
[ Deniz
Erdogmus, Lois Black,
Nancy Lurie Marks
Family Foundation].
Children with Autism Spectrum Disorders (ASD) exhibit varying levels of
communication abilities. In this project, the investigators will
address
the communication needs of the subset that: 1) lack expressive speech
and language; 2) lack ability to operate a keyboard, pointing device,
or other typical assistive interface; and 3) are assumed to have
adequate cognition, literacy, and receptive language understanding.
This research aims to develop a communication system for such children.
Resulting technology could also benefit other children and adults with
adequate cognition but limited communication options. The investigators
will develop an assistive communication facilitation device referred to
as the RSVP Keyboard. It unites three technologies: 1) Rapid serial
visual presentation (RSVP, with individually adjustable presentation
rates) of letters/words/phrases; 2) a yes/no intent detection mechanism
based on detecting evoked-response potentials (ERP) in the brain to
determine which target letter or letters the child wants to convey; 3)
a statistical language model based dynamic sequencing optimization
procedure that computes which letter needs to be presented next to take
advantage of regularities in language. The system will operate by
showing the sequence of candidate letters on the screen as well as
previously typed text, such that words and phrases are formed naturally
by adding selected letters. The first goal is to test the viability of
the basic concept of facilitated communication through the RSVP
Keyboard System. Upon demonstration of feasibility through neuroimaging
and statistical analysis of brain responses to RSVP stimuli sequences,
the investigators will evaluate performances of typically developing
children and nonverbal children with ASD in three interactive cognitive
tasks.
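The language-model-based sequencing described in this entry can be illustrated with a small sketch. It is not the RSVP Keyboard's actual procedure: it assumes a character bigram model with add-one smoothing, a toy training text, and hypothetical function names, and it ignores the ERP evidence that the real system would combine with the language model. It only shows how regularities in language can decide which letter to present first.

```python
from collections import Counter, defaultdict
import string


def train_bigram_counts(corpus):
    """Count letter-to-letter transitions within words; '^' marks word start."""
    counts = defaultdict(Counter)
    for word in corpus.lower().split():
        prev = "^"
        for ch in word:
            if ch in string.ascii_lowercase:
                counts[prev][ch] += 1
                prev = ch
    return counts


def presentation_order(counts, typed_so_far, alphabet=string.ascii_lowercase):
    """Order candidate letters from most to least likely given the last
    typed character, so likely letters are shown earlier in the sequence."""
    if typed_so_far and typed_so_far[-1].isalpha():
        prev = typed_so_far[-1].lower()
    else:
        prev = "^"  # empty text or a preceding space: treat as word start
    context = counts.get(prev, Counter())
    # Add-one smoothing keeps unseen letters in the sequence; ties are
    # broken alphabetically.
    return sorted(alphabet, key=lambda ch: (-(context[ch] + 1), ch))


# Toy training text, made up for illustration.
counts = train_bigram_counts("the then there that this")
order_after_t = presentation_order(counts, "t")
order_after_th = presentation_order(counts, "th")
order_at_start = presentation_order(counts, "")
```

With this toy text, "h" is presented first after "t" and "e" first after "th", so a child spelling "the" would on average reach the intended letter after fewer presentations than with a fixed alphabetical order.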
- Comparing Standardized and Spontaneous Measures of
Language
[ Amy Costanza-Smith (Child
Development and Rehabilitation Center, OHSU), Lois Black, Jan van Santen,
Medical Research Foundation of Oregon]. This research focuses on
the
markers of childhood language disorders, and on the possibility of
automated scoring of those markers. The manner in which testing for
language disorders typically takes place - standardized assessments
administered by a clinician - often bears little resemblance to
real-world communication. Instead, it is proposed to use children's
real-life utterances to develop new markers - e.g. vocabulary, grammar,
number of errors - that will improve the accuracy of diagnosis. The
ultimate goal is to use recent advances in speech technology to
automate the processing of these utterances, currently performed
through transcription and manual analysis.
- Diagnostic Markers for Childhood Apraxia of Speech
This NIH-supported
project, led by John-Paul
Hosom (PI) and in collaboration with Larry Shriberg
at the University of Wisconsin's Waisman Center,
focuses on automated methods for assessment of Childhood Apraxia of
Speech. This disorder is highly controversial due to a
lack of consensus on the features that define it and the etiologic
conditions that explain its origin. The term Suspected Apraxia of
Speech (sAOS) has been proposed as an interim term for this putative
clinical entity. The point prevalence of sAOS in young children has
been estimated at approximately 0.1%. The long-term objective of this
proposal is to develop a valid, reliable, and efficient means to
classify children as positive for sAOS. In addition to the
contributions to theoretical explication of AOS, the software-based
diagnostic tools resulting from this work will allow any certified
speech-language pathologist to determine if a child's speech includes
prosodic features that fall within a 95% confidence interval supporting
the diagnosis of sAOS. The aim for this first period of planned
programmatic research is to develop automated diagnostic markers for
sAOS with clinically adequate sensitivity and specificity (> 90%
positive and negative likelihood ratios). The four specific aims are:
(a) to automate and improve the sensitivity and specificity of two
existing (manually derived) prosodic markers, (b) to develop four
additional automatic, prosody-based diagnostic markers, (c) to derive a
single diagnostic index based on a statistical derivative from the six
individual markers, and (d) to validate the composite diagnostic marker
using classification data obtained from expert clinical researchers.
Procedures are divided into four phases. In Year 1, automated versions
of existing markers will be developed that determine speech-event
locations using automatic speech recognition (ASR). Based on two pilot
studies, this technique is expected to yield results equivalent to
published data. The sensitivity of the markers will be improved by
methods including normalizing by speaking rate and vowel identity. In
Year 2, new automated markers will be created based on ASR and
speech-signal processing techniques. These markers will measure
variation in interstress timing, linguistic rhythm, speaking rate, and
glottal-source characteristics. In the first part of Year 3, results
from all six markers will be combined into a single diagnostic index
using multi-layer perceptrons. In the latter part of Year 3, per-child
errors will be evaluated to determine relationships between specific
prosodic factors and the diagnosis of sAOS, providing insight into the
features and definition of sAOS.
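The evaluation criteria named in this entry (sensitivity, specificity, and positive and negative likelihood ratios) follow standard definitions, sketched below. The function name and the counts in the example are hypothetical and are not results from the project.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic accuracy measures from a 2x2 table.

    tp, fn: children with the disorder classified positive / negative.
    tn, fp: children without the disorder classified negative / positive.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios; guard against division by zero at the extremes.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_pos,
        "lr_negative": lr_neg,
    }


# Hypothetical counts: 50 affected and 100 unaffected children.
metrics = diagnostic_metrics(tp=45, fp=10, tn=90, fn=5)
```

With these made-up counts, sensitivity and specificity are both 0.9, the positive likelihood ratio is about 9, and the negative likelihood ratio is about 0.11.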
- Voice Transformation for
Dysarthria - Phase I
[ Jan van Santen,
PI; Alexander Kain,
co-PI; NIH]. Software will be developed
in a collaborative project with BioSpeech Inc., supported by the NIH,
that transforms speech compromised by
dysarthria into easier-to-understand and more natural-sounding speech.
The software will reside on laptop computers, with microphone input and
amplified speaker or line output. Such software and hardware solutions
will assist individuals with dysarthria to better communicate by voice,
whether face-to-face or by telephone; it will also help these
individuals when interacting with voice controlled services and
devices, which are increasingly popular. The system operates in
"Interpreter Mode", meaning that output will take place after a brief
processing delay once the speaker has completed an utterance. The
software is based on a multi-step formant re-synthesis process: (i)
Robust extraction of formant, energy, spectral balance, and pitch
trajectories from input speech; (ii) Modification of extracted
trajectories by imposition of smoothness and shape based constraints,
and by bringing these trajectories in closer proximity to trajectories
of normal speech; (iii) Conversion of the trajectories into a speech
signal by formant synthesis. Results obtained with a prototype,
personal computer based system show that this process is robust,
enhances intelligibility, and completely eliminates "vocal fry", i.e.,
distortions caused by irregularities in the temporal pattern of the
vocal folds. In Phase I, the core algorithms performing these steps
will be improved and extended, and the software will be ported to a
pocketable computer; the resulting system will be evaluated on multiple
speakers and listeners; and feedback will be obtained from potential
users and their partners about desired features, usability, and
functionality. In Phase II, acceptable processing delays will be
achieved using known methods for optimizing memory and processing
speed; further enhancement capabilities will be added, and the system
will be evaluated. The currently targeted product will be the first in
a family of speech enhancement products with continually expanding
functionality, by capitalizing on ongoing algorithmic and hardware
improvements. Usage of standard hardware and software platforms, that
in turn are compatible with a wide range of headsets and wearable
amplified speakers or telephones, puts this software in a strong
competitive position. A large percentage of the more than 2.5 million
adult Americans with significant disability due to chronic neurological
impairment present with dysarthria or speech
impairment as one of their disabling conditions. There are no cures for
speech impairments. Dysarthric individuals report losses to employment,
educational opportunities, social integration, and quality of life.
Individuals are taught strategies that compensate for their
impairments, but the isolation caused by communication impairment is
pervasive. The project goal is to develop a system that uses a wearable
computer to transform speech compromised by dysarthria into
easier-to-understand and more natural-sounding speech, and will thereby
enable dysarthric individuals to communicate more effectively by
telephone or in face-to-face contexts.
- User Adaptation of AAC Device Voices - Phase I
[ Jan van Santen,
PI; Esther Klabbers,
co-PI, NIH]. A wide range of individuals
cannot communicate by voice. Voice enabled
Augmentative and Alternative Communication (AAC) devices are often the
only channel available by which these individuals can communicate.
While many voice enabled AAC devices are currently available, they lack
the important ability to generate customized speech that mimics aspects
of the user's past or intermittently available speech. Modern
"concatenative" speech synthesis technology can mimic a given speaker's
voice, by excising speech fragments from a recorded speech data base
("acoustic inventory") and recombining these into output speech using
sophisticated algorithms. It requires, however, a large amount of
recordings and a high degree of consistency of pronunciation of the
speaker. Many AAC users cannot meet these requirements because they
already have lost the capability to speak or they cannot speak with
adequate consistency of pronunciation. A new type of technology, voice
transformation (VT) technology, is available that can transform speech
spoken by a "source" speaker into speech that is perceived as spoken by
a specific "target" speaker. To tune the transformation system,
parallel "training recordings" of the same text are needed from the
source and target speakers. The amount of training recordings is far
less than what is needed for a high-quality acoustic inventory. In this
joint project with BioSpeech Inc., supported by the NIH,
we
propose to use VT in combination with speech synthesis to convert the
synthesis system's acoustic inventory into an acoustic inventory that
mimics the target speaker's voice. The training recordings can consist
of old home videos, or fragmented recordings produced during periods of
intact speech, provided that they contain at least one sample of each
phoneme. In Phase I, we will develop and evaluate a VT based synthesis
system. The project will use high-quality and home-video quality
recordings from male and female adults and children to create limited
acoustic inventories (adequate to generate a specific set of test
sentences) and VT training recordings. Perceptual experiments will be
conducted to evaluate voice quality and perceived speaker identity.
Phase II will focus on developing complete acoustic inventories for
several canonical speakers that will be selected to cover a range of
speaker characteristics, and on producing portable, user-friendly
software. The anticipated commercial offering consists of (i) software
components to be licensed to AAC vendors and (ii) a service consisting
of collection and processing of recordings and creation of personalized
acoustic inventories. Speech communication ability is impaired or
absent in millions of Americans due to neurological disorders and
diseases and to trauma, including autism, Parkinson's disease, and
stroke. Augmentative and Alternative Communication (AAC) devices that
are operated via switches, keyboards, and a broad range of other input
devices, and that have synthetic speech as output, are often the only
manner in which these individuals can communicate. Without AAC devices,
these individuals may suffer from severe social and psychological
isolation, and may be unable to lead productive lives. A
psychologically important feature that no currently available systems
have is the ability to speak with the user's voice, i.e., the ability
to produce speech that mimics the individual's pre-morbid speech or
speech that the individual may be able to intermittently produce. The
proposed project will use voice transformation (VT) technology to
accomplish this goal. VT technology requires recordings of the user to
be available, but there is substantial flexibility as to the nature and
quantity of these recordings; they may consist of home videos or of
fragmentary speech, provided that at least some samples are available
of each speech sound in the language. The goal of the application is to
develop a synthetic voice for an AAC system that sounds like the
individual using the system (before they lost the ability to speak),
without requiring very much recorded data on the part of the original
talker. The system works by first creating a synthetic "base" voice (or
set of base voices) using professional actors who must provide a fairly
large inventory of speech data. Using the base voice and a small sample
from the target talker (i.e., containing at least one instance of each
phoneme), a new synthetic voice is created by essentially modulating
parameters in the base voice so that it takes on characteristics of the
target talker. The ability to create a voice that sounds like the
original talker without much data from the original talker would be a
significant advantage.
- Novel Computerized Behavioral Assessment
Methods for Attention Deficit Hyperactivity Disorder.
This internally funded exploratory project, conducted by Lois Black, Holly
Jimison (Biomedical Engineering
Department and Department of
Medical Informatics and Clinical Epidemiology), Leeza
Maron (Psychiatry), Misha
Pavel (Biomedical Engineering
Department), and Jan van Santen
(PI), focuses on building a computerized assessment system
that has these features:
- A clear understanding of which neuropsychological
functions are
measured.
- Interactivity
(the computer adapts its behavior instantly to the subjects’ responses,
thereby
being able to operate at a level of optimal sensitivity).
- Instantaneous and timed measurement of a range of
behavioral responses
including the force
dynamics of button pushing and eye movements.
- Mathematical
modeling of the underlying cognitive processes in order to derive
“purer”
measures of the neuropsychological functions.
- A more motivating and shorter assessment process.
- Pilot Study for Word Recognition of Children with
Speech Delay
John-Paul
Hosom, PI, Medical Research Foundation of Oregon. Children
with speech delay of unknown origin (hereafter referred to as
“speech
delay”) are characterized by a number of language problems, including
reduced vocabulary size, atypical grammar, and highly unintelligible
speech. The long-term objective of the proposed research is to enable
children with speech delay to communicate more effectively. This
proposal presents only the first step in realizing this long-term
objective. In this first step, speech data from a limited number of
children with speech delay will be analyzed to evaluate the feasibility
of automatically identifying acoustic features in the speech
signal that may be used to identify intended phonemes. The hypothesis
of the
proposed research is that there are correlations between intended
phonemes and certain acoustic features of children with speech delay,
when the intended phoneme is not the same as the phoneme actually
spoken. Such correlations could then be used to assist in the automatic
word recognition of an intended utterance.
- Making
Dysarthric Speech Intelligible
[Jan van Santen, PI]. This NSF-funded project [joint with Melanie
Fried-Oken at the Child Development and Rehabilitation Center
at the Oregon Health & Science University] will develop new
algorithms
that will enable dysarthric individuals to be more easily understood.
Currently
available devices are essentially spectral filters and amplifiers that
enhance certain parts of the spectrum. While these can help certain
types
of dysarthria, many dysarthric persons suffer from speech problems that
require forms of speech modification that are much more profound and
complex
such as: irregular sub-glottal pressure, resulting in loudness bursts
that
can be difficult to adjust to; absence, or poor control, of voicing;
systematic
mispronunciation of certain phoneme groups, resulting in certain sounds
becoming indistinguishable or unrecognizable; variable
mispronunciation;
and poor prosody (pitch control, timing, and loudness). For these
difficult
problems, new approaches are needed that do not merely filter the
speech
signal but analyze it at acoustic, articulatory, phonetic, and
linguistic
levels.
- Differentiating between
Autism Spectrum Disorder and Developmental
Language Disorders via Story Recall Analysis
Brian Roark, PI,
Medical Research Foundation of Oregon. The analysis of elicited spoken
language samples plays a key role in the diagnosis of a wide
range of linguistic and cognitive impairments, from developmental
impairments, such as Developmental Language Disorders (DLD) or Autism
Spectrum Disorder (ASD), to degenerative cognitive impairments, such as
dementia. Perhaps the most popular means of eliciting such
a sample is through a narrative recall task, where the subject is told
a story of sufficient length to preclude verbatim recall, and then
asked, either immediately or after some delay, to retell the story they
have been told. Most clinical uses of such tests involve a very
simple scoring mechanism, in which the recall of specific items in the
story is noted by the administering clinician (as the story is being
re-told), and summary scores are calculated based on the number of
these recalled items. The resulting summary score fails to
capture much of the potentially relevant information available in the
spoken language sample, e.g., grammatical complexity, pause frequency,
or the ordering of recalled items. The long-term objective of the
proposed work is to identify multiple complex markers, derived
from open and cued responses to narrative recall tasks, for
differentiating between: (1) children broadly diagnosed with ASD; (2)
children broadly diagnosed with DLD; and (3) normally developing
children. In the proposed study, narrative retellings produced by
a relatively limited number of children will be analyzed for the
feasibility of automatically extracting markers from the spoken
language samples to effectively discriminate between the three groups.
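The gap between a simple item-count score and richer markers can be illustrated with a minimal sketch. It assumes a retelling has already been reduced to a sequence of recognized story items; the function names and the pairwise ordering measure are hypothetical illustrations, not the scoring or markers used in the project.

```python
from typing import List, Sequence


def recall_summary_score(story_items: Sequence[str], recalled: Sequence[str]) -> int:
    """Count how many distinct story items appear in the retelling."""
    return len(set(story_items) & set(recalled))


def order_preservation(story_items: Sequence[str], recalled: Sequence[str]) -> float:
    """Fraction of recalled item pairs that keep their original story order.

    This is an illustrative ordering measure (an assumption), not a
    clinically validated marker.
    """
    position = {item: i for i, item in enumerate(story_items)}
    # Keep the first mention of each recalled story item, in retelling order.
    seen: List[str] = []
    for item in recalled:
        if item in position and item not in seen:
            seen.append(item)
    pairs = [(a, b) for i, a in enumerate(seen) for b in seen[i + 1:]]
    if not pairs:
        return 1.0  # nothing to compare, so no ordering violations
    in_order = sum(1 for a, b in pairs if position[a] < position[b])
    return in_order / len(pairs)


# Hypothetical story items and two retellings with the same item count.
story = ["anna", "boston", "robbed", "police"]
retelling_a = ["anna", "boston", "police"]
retelling_b = ["boston", "anna", "police"]
```

Both retellings receive the same summary score, but only the ordering measure separates them, which is the kind of information a count of recalled items discards.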
- Automatic spoken language analysis for detecting
cognitive impairment
[Brian Roark, PI]. Clinical
research into Alzheimer's disease (AD), and the mild cognitive
impairment (MCI) that precedes its full onset, is increasingly focused
on early diagnosis and treatment that can delay or even prevent full
onset of AD. Effective diagnosis requires differentiating between
changes in cognitive and linguistic abilities that occur during normal
aging and those that are due to impairment. Both manual linguistic
analyses of spoken language samples and orally administered clinical
exams are effective but costly methods for discriminating between
healthy and MCI subjects. For widespread testing of the growing elderly
population for markers of MCI, automation of testing procedures will be
required.
The objective of this NIH-Roybal-funded project is
to develop statistical
speech and language analysis techniques to automatically extract
features from spoken language samples recorded during clinical
examinations. Healthy and MCI elderly subjects of on-going studies at
the Layton Center of OHSU take full neuropsychological examinations
annually for life. We will request their permission to record and
analyze these sessions, which include several tests of particular
interest, including a delayed story recall test and a picture
description task. We will transcribe the words and annotate syntactic
structure for selected tests, and develop algorithms for automatically
deriving features from the spoken language samples. These
automatically-derived speech- and language-based features will then be
used to build classifiers for discriminating between healthy and MCI
subjects. In addition to test automation, the statistical speech and
language processing techniques will provide two benefits of primary
importance: inclusion of approximations to previously researched
manually-derived features; and the use of unexplored features derived
from statistical characteristics of the samples, such as a number of
entropy-based features.
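As one illustration of an entropy-based feature, the sketch below computes the Shannon entropy of the word distribution in a transcript. The description above does not say which entropy features the project uses, so the choice of unigram word entropy, the whitespace tokenization, and the function name are all assumptions.

```python
import math
from collections import Counter


def word_entropy(transcript: str) -> float:
    """Shannon entropy (in bits) of the unigram word distribution.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made to keep the sketch self-contained.
    """
    words = transcript.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

A transcript that repeats a single word has zero entropy, while one that spreads evenly over many distinct words has high entropy. A feature like this could be passed, alongside others, to a classifier that separates healthy from MCI subjects.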
- Automated Test of Word
Recognition - Phase II
[Robert
Margolis, University of Minnesota, PI]. Over 5 million word
recognition tests are administered annually by
audiologists in the United States with an associated cost of more than
$100 million. These tests are currently performed manually by highly
trained audiologists. This NIH-funded project describes the Phase II
development of automated clinical speech recognition tests using
clinical test recordings and an automated speech recognition system to
score the subjects' responses. A method for automatically interpreting
the test scores will also be evaluated. The objectives are to increase
the accuracy and efficiency of these clinical tests, substantially
reduce the cost, and provide an objective, automatic, evidence-based
method for interpreting the results. The automated speech recognition
test in combination with the automated pure tone audiogram (currently
an STTR Phase II project) will perform diagnostic testing of a majority
of audiology patients, freeing the audiologists' time for activities
that require their training and skill. Contemporary changes in training
and reimbursement patterns create a high demand for automated clinical
procedures. The automated procedures are implemented on existing
commercial audiometers with a personal computer that controls the
audiometer delivery and routing of stimuli. Phase I results were
obtained with automatic speech recognizers that were trained on a
limited number of subjects (n=9). Estimates of the agreement between
human and machine scoring ranged from 82% to 93%. Additional refinements
with benefits that are predictable from prior experience will increase
recognizer performance to a level that equals or exceeds human-human
agreement and provide the basis for efficient and accurate clinical
tests. In Phase II, an automatic speech recognition threshold test will
be compared to the manual method used in routine clinical practice. Two
different recognizer scoring strategies will be developed, one that
requires more test time but is independent of individual speaker
differences and is easily adaptable to other languages, and one that
requires less time but may not be applicable to all patients. A pilot
study will test the method on a Spanish-language speech-recognition
test.
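Agreement between human and machine scoring can be estimated in several ways, and the description above does not say which was used. The sketch below assumes the simplest one, item-level percent agreement over per-word correct/incorrect judgments; the function name and the sample data are hypothetical.

```python
from typing import Sequence


def percent_agreement(human: Sequence[bool], machine: Sequence[bool]) -> float:
    """Percentage of test items scored identically by human and machine."""
    if len(human) != len(machine):
        raise ValueError("both scorers must judge the same number of items")
    if not human:
        raise ValueError("at least one scored item is required")
    matches = sum(1 for h, m in zip(human, machine) if h == m)
    return 100.0 * matches / len(human)


# Hypothetical judgments for five test words (True = response scored correct).
human_scores = [True, True, False, True, False]
machine_scores = [True, False, False, True, False]
agreement = percent_agreement(human_scores, machine_scores)
```

Applying the same function to two human scorers gives a human-human agreement figure against which the recognizer can be compared.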
- Speech Supplemented Word Prediction Program - Phase II
[Thomas Jakobs, InvoTek,
PI]. Commercial speech recognition software offers many people
with physical
limitations an important computer access method. While this access
method is reasonably reliable for people with typical speech, people
with motor speech disorders (dysarthria) are presently not able to use
this technology reliably. The purpose of this NIH-funded research is to
provide
these people with a unique assistive-device access method that utilizes
their speech. We will accomplish this by continuing to develop a Speech
Supplemented Word Prediction Program (SSWPP) that enables people with
dysarthria to use their speech capabilities to interact with personal
computers, with an emphasis on assisted writing. The central element of
the SSWPP is custom speech-recognition software used in conjunction
with word prediction. The feasibility results for the SSWPP developed
during Phase I are exciting. The average keystroke savings achieved by
people with dysarthria on typical sentences was 68%. Commercially
available word prediction programs achieved no better than 47%
keystroke savings on the same text. Phase II design activities include
improving the speech recognition engine, developing an optimized
microphone interface, integrating the SSWPP into Microsoft Word, and
developing a speech-to-text display for use in face-to-face
communication. People with disabilities will evaluate the new SSWPP. The
Speech Supplemented Word Prediction Program is a tool for people with
disabilities who also have difficult-to-understand speech. This tool
enables these people to use their speech to reduce the amount of work
required to enter text into a computer and to communicate verbally more
effectively.
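Keystroke savings is commonly defined as the percentage of keystrokes avoided relative to typing every character of the text. The sketch below uses that common definition; the description above does not state the exact formula used in the study, so treat the formula and the function name as assumptions.

```python
def keystroke_savings(chars_in_text: int, keystrokes_used: int) -> float:
    """Percentage of keystrokes saved relative to typing every character."""
    if chars_in_text <= 0:
        raise ValueError("text must contain at least one character")
    if keystrokes_used < 0:
        raise ValueError("keystroke count cannot be negative")
    return 100.0 * (chars_in_text - keystrokes_used) / chars_in_text


# Hypothetical example: a 100-character sentence entered with 32 keystrokes.
savings = keystroke_savings(100, 32)
```

Under this definition, the reported 68% corresponds to entering text with roughly one third of the keystrokes that unassisted typing would need.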
- Automated voice-based cognitive assessment and spoken
language-based markers for neurodegenerative diseases
This project (Tamara Hayes, PI),
funded under a new program of Intel's Digital Health Group called the Behavioral
Assessment and Intervention Commons, is aimed at initiating and
accelerating research into behavioral markers
of disease, such as changes in walking, speech and performance on
computer games, that eventually translate into health-related products
and services. CSLU is developing voice-enabled, automated, kiosk-based
versions of standard neurocognitive tasks (e.g., digit
span) and speech- and language-based markers for neurodegenerative
diseases. The kiosk is also being developed in the context of the Alzheimer's Disease
Cooperative Study (ADCS) program.