Date: Tue, 12 May 1998 14:13:10 -0400
Reply-To: "Stefan H. Jonsson" <stefanj@POP.PSU.EDU>
Sender: "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU>
From: "Stefan H. Jonsson" <stefanj@POP.PSU.EDU>
Subject: Sequential Analysis part #2
In-Reply-To: <c=US%a=_%p=spss%l=HERMES-980506145448Z-64991@hermes.spss.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
Hi all
After a short brake I am back to business dealing with the
sequential analysis problem I introduced earlier this month.
If the term Levenshtein distance rings any bell it is the key
concept in my analysis
I need a program within SPSS or elsewehre(prefereably in SPSS),
to calculate Levenstein distance between n(n-1)/2 number of pairs
The optimal matching method
Lets assume we have 3 types of events A B and C that happen in
sequences and are measured on, let say, 5 time points 1 2 3 4 5 I
want to measure with the Optical matching technique how similar
these three sequences are to each other.
(The actual problem includes 2700 cases with 12 time points
and 4 different states in addition to missing observation on some
time points for some cases.)
A short list of sequences for demonstration purposes. 1-5 are the
time points and i.-iv. are different sequences
12345
i. AABBC
ii. AAAAC
iii. CAABB
iv. AABBC
The goal is to get similarity/dissimilarity or distance matrix
including distance for every possible pair to import to cluster
analysis software. Some sequences happens more than once while
other are unique
The distance is measured by counting number of changes to
transform one sequence to the other. There are three steps
allowed to make changes.
1) Insert an event, then every events to the right is shifted one
step right
2) Delete and event, every event to the right shifts one step
left. Insert and delete are so similar that they are
sometimes called indel
To change iv. in to iii. first we insert C in at the start
then delete C at the end.
3) Substitution
to change i. to ii. one can substitute B with A at time points
3 and 4.
This distance is one of two forms of the Levenshtein distance. The
other form is identical exept 3 is not permitted. I will allow 1,
2, and 3.
While there are many different ways to change one sequence to
another, the minimum number of changes is required, or if the
algorithm insists, a low number of changes that is most likely
the minimum number.
There are some sources of complexity that I might have to deal
with while the project develops.
The first such source is different sizes of sequences. If
one sequence is shorter than the other, it might be necessary to
standardize the distance by dividing by the length of the longer
sequence.
Substitutions are not all equal . substitution of A for B might
cost less than A for C.
Further is the problem of different costs of indels. This is
highly depended on how the sequence looks like to begin with.
This source adds so much complexity that it is frequently ignored
and all indels are considered equal and set slightly higher than
the most expensive substitute.
Differences between different costs of substitutions and indels
needs to be adjusted after every session as it is a tool to find
meaningful clusters.
Getting the distance (cost, dissimilarity) matrix is on part of
the problem. But having 2700 person in four separated datasets
all of which need to be analyzed, the matrixes are going to be
huge.
The syntax to calculate the Levenshtein distance probably need some
Heuristic methods to try to reduce the complexity of the problem.
There might be problems with the calculation time both in
the distance measure and the cluster analysis. If the Heuristic method
does not reduce teh time spent on the calculation to a reasonable amount
other methods might be needed. One idea is to set up few poles and then
compare all cases (sequences ) to those sequence poles and use that
matrix. Then, instead of having n(n1)/2 matrix we would have n*m where m
is the number of poles. I do not know if such an approach will lead
to a solution that the cluster analysis can deal with. In other
words I am not sure if it solves any problem as the clustering
method probably needs(?) n *(n-1)/2n distance matrix to place
every sequence in the d-dimensional space and come up with a
cluster solution.
Some references
Sankoff and Kruskal "time warps, string edits, and
macromolecule: The theory and practice of Sequence comparison.
Abbot and Hrycak "Measuring Resemblance in Sequence Data: An
Optimal matching Analysis of Musicians Careers" American Journal
of Sociology Volume 96 Number 1 (July 1990) 14485
Renata M .Capocellu Editor "Sequences" Springer Verlag , New
York, Berlin Heidelberg, 1990 ISBN 0387971866
http://indigo2.uia.ac.be/~peter/doctoraat/ali.html
Best regards with hope of postive replies
Stefan Jonsson