```Date: Tue, 12 May 1998 14:13:10 -0400 Reply-To: "Stefan H. Jonsson" Sender: "SPSSX(r) Discussion" From: "Stefan H. Jonsson" Subject: Sequential Analysis part #2 Comments: To: "Nichols, David" In-Reply-To: Content-Type: TEXT/PLAIN; charset=US-ASCII Hi all After a short brake I am back to business dealing with the sequential analysis problem I introduced earlier this month. If the term Levenshtein distance rings any bell it is the key concept in my analysis I need a program within SPSS or elsewehre(prefereably in SPSS), to calculate Levenstein distance between n(n-1)/2 number of pairs The optimal matching method Lets assume we have 3 types of events A B and C that happen in sequences and are measured on, let say, 5 time points 1 2 3 4 5 I want to measure with the Optical matching technique how similar these three sequences are to each other. (The actual problem includes 2700 cases with 12 time points and 4 different states in addition to missing observation on some time points for some cases.) A short list of sequences for demonstration purposes. 1-5 are the time points and i.-iv. are different sequences 12345 i. AABBC ii. AAAAC iii. CAABB iv. AABBC The goal is to get similarity/dissimilarity or distance matrix including distance for every possible pair to import to cluster analysis software. Some sequences happens more than once while other are unique The distance is measured by counting number of changes to transform one sequence to the other. There are three steps allowed to make changes. 1) Insert an event, then every events to the right is shifted one step right 2) Delete and event, every event to the right shifts one step left. Insert and delete are so similar that they are sometimes called indel To change iv. in to iii. first we insert C in at the start then delete C at the end. 3) Substitution to change i. to ii. one can substitute B with A at time points 3 and 4. This distance is one of two forms of the Levenshtein distance. The other form is identical exept 3 is not permitted. I will allow 1, 2, and 3. While there are many different ways to change one sequence to another, the minimum number of changes is required, or if the algorithm insists, a low number of changes that is most likely the minimum number. There are some sources of complexity that I might have to deal with while the project develops. The first such source is different sizes of sequences. If one sequence is shorter than the other, it might be necessary to standardize the distance by dividing by the length of the longer sequence. Substitutions are not all equal . substitution of A for B might cost less than A for C. Further is the problem of different costs of indels. This is highly depended on how the sequence looks like to begin with. This source adds so much complexity that it is frequently ignored and all indels are considered equal and set slightly higher than the most expensive substitute. Differences between different costs of substitutions and indels needs to be adjusted after every session as it is a tool to find meaningful clusters. Getting the distance (cost, dissimilarity) matrix is on part of the problem. But having 2700 person in four separated datasets all of which need to be analyzed, the matrixes are going to be huge. The syntax to calculate the Levenshtein distance probably need some Heuristic methods to try to reduce the complexity of the problem. There might be problems with the calculation time both in the distance measure and the cluster analysis. If the Heuristic method does not reduce teh time spent on the calculation to a reasonable amount other methods might be needed. One idea is to set up few poles and then compare all cases (sequences ) to those sequence poles and use that matrix. Then, instead of having n(n1)/2 matrix we would have n*m where m is the number of poles. I do not know if such an approach will lead to a solution that the cluster analysis can deal with. In other words I am not sure if it solves any problem as the clustering method probably needs(?) n *(n-1)/2n distance matrix to place every sequence in the d-dimensional space and come up with a cluster solution. Some references Sankoff and Kruskal "time warps, string edits, and macromolecule: The theory and practice of Sequence comparison. Abbot and Hrycak "Measuring Resemblance in Sequence Data: An Optimal matching Analysis of Musicians Careers" American Journal of Sociology Volume 96 Number 1 (July 1990) 14485 Renata M .Capocellu Editor "Sequences" Springer Verlag , New York, Berlin Heidelberg, 1990 ISBN 0387971866 http://indigo2.uia.ac.be/~peter/doctoraat/ali.html Best regards with hope of postive replies Stefan Jonsson ```

Back to: Top of message | Previous page | Main SPSSX-L page