[algorithm][string] Edit distance algorithm

Hi, I would like to know if there is any interest in an implementation of the edit distance algorithm (http://en.wikipedia.org/wiki/Edit_distance)? I have uploaded an implementation of the algorithm in the vault (http://minilien.com/?Wr90MGb6LB). Let me know. JD

On 5/11/07, JD <jean.daniel.michaud@gmail.com> wrote:
Hi,
I would like to know if there is any interest in an implementation of the edit distance algorithm
I'd be interested in a general pattern recognition library - something that covers different flavors of edit distance and different flavors of hidden markov models. The two are not that distant algorithmically - see (shameless self-reference): http://tinyurl.com/2qyv9n and http://tinyurl.com/2qv45o The first paper draws some common ground between dynamic programming alignment algorithms (such as edit distance) and hidden markov models, and the second develops a formal model (semantic network model) which capitalizes on the relationship. If I have time, I might spend a part of the summer on re-implementing the general SNM framework - doing it in a way that also covers basic HMM and DPA functionality might be doable. I'll take a look at your implementation. Stjepan

Stjepan Rajko wrote:
The first paper draws some common ground between dynamic programming alignment algorithms (such as edit distance) and hidden markov models, and the second develops a formal model (semantic network model) which capitalizes on the relationship.
If I have time, I might spend a part of the summer on re-implementing the general SNM framework - doing it in a way that also covers basic HMM and DPA functionality might be doable. I'll take a look at your implementation.
Wow that's just the algorithm found on wikipedia, re-written in c++ and boostified for introduction in boost/algorithm/string. Nothing fancy here, it is definitely basic dynamic programming. My question was more to know if there is potential "users" interested and if it's the appropiate place to propose it. JD

Wow that's just the algorithm found on wikipedia, re-written in c++ and boostified for introduction in boost/algorithm/string. Nothing fancy here, it is definitely basic dynamic programming. My question was more to know if there is potential "users" interested and if it's the appropiate place to propose it.
I know... I was just pointing out that by implementing edit distance, you're not far from a slew of other useful things. For example, can easily add support for a ton of related problems (maximum common subsequence distance, for example), just by allowing the user to specify how the top row and leftmost column of the dynamic programming matrix are initialized. In your current implementation, you initialize d[0][i]=d[i][0]=i: if instead you set d[0][i]=d[i][0]=0 (if I remember correctly), you get maximum common subsequence distance. Specifying the cost of insertion / deletion is another useful thing to allow (hard coded to 1 in your case right now), and providing a custom comparison cost function between characters is another (right now it is hard-coded to a cost of 1 if they are different and 0 if they are equal). The next layer of difficulty/usefulness is allowing affine insertion/deletion costs (e.g., an insertion of length N is given a cost of aN + b), but that requires two tables. And then, you're not far from a decent probabilistic pattern recognition mechanism :-) This is all especially useful if you also add traceback extraction, i.e. extracting the calculation path which yielded the optimal result. This gives you the string alignment which yields optimal edit distance, or yields the maximum common subsequence in the other initialization case. I'm not saying you should go ahead and implement all this, but there is definitelly a part of it that requires a small amount of effort but makes it a ton more useful - I'd say at least allowing the user to specify custom insertion, deletion, and comparison function costs, and extracting the optimal alignment in the end. Stjepan

OK, I was a little off about the maximum common subsequence thing - getting the length of the maximum common subsequence would require: - initialization of top row and left column of d to 0 - instead of min you use max (so instead of "costs" things become "scores") - this is another useful thing to let the user specify - "insertion" and "deletion" scores (moving horizontally or vertically through table) become 0 - comparison score function (moving diagonally through table) is 1 for equal characters and 0 for unequal I _think_ that should be right... but I'm still talking off the top of my head :-) sorry for any mistakes. S On 5/12/07, Stjepan Rajko <stipe@asu.edu> wrote:
Wow that's just the algorithm found on wikipedia, re-written in c++ and boostified for introduction in boost/algorithm/string. Nothing fancy here, it is definitely basic dynamic programming. My question was more to know if there is potential "users" interested and if it's the appropiate place to propose it.
I know... I was just pointing out that by implementing edit distance, you're not far from a slew of other useful things. For example, can easily add support for a ton of related problems (maximum common subsequence distance, for example), just by allowing the user to specify how the top row and leftmost column of the dynamic programming matrix are initialized. In your current implementation, you initialize d[0][i]=d[i][0]=i: if instead you set d[0][i]=d[i][0]=0 (if I remember correctly), you get maximum common subsequence distance. Specifying the cost of insertion / deletion is another useful thing to allow (hard coded to 1 in your case right now), and providing a custom comparison cost function between characters is another (right now it is hard-coded to a cost of 1 if they are different and 0 if they are equal).
The next layer of difficulty/usefulness is allowing affine insertion/deletion costs (e.g., an insertion of length N is given a cost of aN + b), but that requires two tables. And then, you're not far from a decent probabilistic pattern recognition mechanism :-)
This is all especially useful if you also add traceback extraction, i.e. extracting the calculation path which yielded the optimal result. This gives you the string alignment which yields optimal edit distance, or yields the maximum common subsequence in the other initialization case.
I'm not saying you should go ahead and implement all this, but there is definitelly a part of it that requires a small amount of effort but makes it a ton more useful - I'd say at least allowing the user to specify custom insertion, deletion, and comparison function costs, and extracting the optimal alignment in the end.
Stjepan

Hi, There were some edit distance implementations in sandbox, but that was long time ago (even before I started to work on the string_algo). I think, that this one might be a useful addition to string algorithms. Altough it is probably not directly related to any of algorithms presented there, I cannot imagine a better place for it in boost. I haven't look at your code, I'll try to do it tomorrow. Regards, Pavol. JD wrote:
Hi,
I would like to know if there is any interest in an implementation of the edit distance algorithm (http://en.wikipedia.org/wiki/Edit_distance)? I have uploaded an implementation of the algorithm in the vault (http://minilien.com/?Wr90MGb6LB).
Let me know.
JD
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Hi, I would like to know if a library for reflection might be of interest. A first release is available on http://sourceforge.net/projects/crd/ and it uses, as data description, a class which is very close to boost::vector, without explicitely referring to it, for the moment. Thanks. RC

On 5/11/07, remi.chateauneu@gmx.de <remi.chateauneu@gmx.de> wrote:
Hi,
I would like to know if a library for reflection might be of interest. A first release is available on http://sourceforge.net/projects/crd/ and it uses, as data description, a class which is very close to boost::vector, without explicitely referring to it, for the moment.
How is this different from something like Boost.Serialization? Do you have more readable documentation about the library? -- Dean Michael C. Berris http://cplusplus-soup.blogspot.com/ mikhailberis AT gmail DOT com +63 928 7291459

Dean Michael Berris a écrit :
On 5/11/07, remi.chateauneu@gmx.de <remi.chateauneu@gmx.de> wrote:
I would like to know if a library for reflection might be of interest. A first release is available on http://sourceforge.net/projects/crd/ and it uses, as data description, a class which is very close to boost::vector, without explicitely referring to it, for the moment.
How is this different from something like Boost.Serialization?
Thanks for your interest. This library does not specifically aim at serialization, but more generally at providing for a given class, the list of its members. Given such a list - a kind of tuple - it is possible to apply algorithms which basically iterates over the members of this class. For example: struct MyClass { type1 _Field1 ; type2 _Field2 ; }; SQL queries: "select <Field1>, <Field2> ... from table etc..." Comparisons: x._Field1 == y._Field1 && x._Field2 == y._Field2 ... Serialization: std::cout << _Field1 << _Field2 ... XDR marshalling: xdr_type1( &xdr, &_Field1 ); xdr_type2( &xdr, &_Field2 ); etc... Most of these features are implemented and running. It is designed to add more 'plugins' of this type - and of course Boost.Serialization is worth trying. A class can have as many members list as needed, and does not have to be changed because the members lists are different objects. The need came out of applications with dozens of business classes, where the database persistence routines had to be re-written all the time.
Do you have more readable documentation about the library?
Unfortunately not for the moment, although the code is short (1100 lines in four files, without test programs) and well-documented, with examples.

On 5/12/07, remi.chateauneu@gmx.de <remi.chateauneu@gmx.de> wrote:
This library does not specifically aim at serialization, but more generally at providing for a given class, the list of its members. Given such a list - a kind of tuple - it is possible to apply algorithms which basically iterates over the members of this class. For example:
Hmmm... I distinctly remember Joel de Guzman having something like this made part of Fusion -- which allows you to create aggregate types where you can iterate over the elements of the type. That required some preprocessor programming, which was called BOOST_FUSION_STRUCT (or something similar) instead of just the normal "struct". How is this different?
Most of these features are implemented and running. It is designed to add more 'plugins' of this type - and of course Boost.Serialization is worth trying. A class can have as many members list as needed, and does not have to be changed because the members lists are different objects. The need came out of applications with dozens of business classes, where the database persistence routines had to be re-written all the time.
Have you seen how Soci (http://soci.sourceforge.net/) does it for the SQL/DB stuff?
Do you have more readable documentation about the library?
Unfortunately not for the moment, although the code is short (1100 lines in four files, without test programs) and well-documented, with examples.
I personally would like to see a writeup with a clear statement of the rationale and what it's intended to do. -- Dean Michael C. Berris http://cplusplus-soup.blogspot.com/ mikhailberis AT gmail DOT com +63 928 7291459

There is currently an SoC project involving reflection, being done by Mariano Consoni. You may want to contact him. On 5/12/07, Dean Michael Berris <mikhailberis@gmail.com> wrote:
On 5/12/07, remi.chateauneu@gmx.de <remi.chateauneu@gmx.de> wrote:
This library does not specifically aim at serialization, but more generally at providing for a given class, the list of its members. Given such a list - a kind of tuple - it is possible to apply algorithms which basically iterates over the members of this class. For example:
Hmmm... I distinctly remember Joel de Guzman having something like this made part of Fusion -- which allows you to create aggregate types where you can iterate over the elements of the type. That required some preprocessor programming, which was called BOOST_FUSION_STRUCT (or something similar) instead of just the normal "struct".
How is this different?
Most of these features are implemented and running. It is designed to add more 'plugins' of this type - and of course Boost.Serialization is worth trying. A class can have as many members list as needed, and does not have to be changed because the members lists are different objects. The need came out of applications with dozens of business classes, where the database persistence routines had to be re-written all the time.
Have you seen how Soci (http://soci.sourceforge.net/) does it for the SQL/DB stuff?
Do you have more readable documentation about the library?
Unfortunately not for the moment, although the code is short (1100 lines in four files, without test programs) and well-documented, with examples.
I personally would like to see a writeup with a clear statement of the rationale and what it's intended to do.
-- Dean Michael C. Berris http://cplusplus-soup.blogspot.com/ mikhailberis AT gmail DOT com +63 928 7291459 _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 5/12/07, Jeremy Pack <rostovpack@gmail.com> wrote:
There is currently an SoC project involving reflection, being done by Mariano Consoni. You may want to contact him.
Thanks Jeremy, I'll do that. -- Dean Michael C. Berris http://cplusplus-soup.blogspot.com/ mikhailberis AT gmail DOT com +63 928 7291459

Dean Michael Berris a écrit :
On 5/12/07, remi.chateauneu@gmx.de <remi.chateauneu@gmx.de> wrote:
This library does not specifically aim at serialization, but more generally at providing for a given class, the list of its members. Given such a list - a kind of tuple - it is possible to apply algorithms which basically iterates over the members of this class. For example:
Hmmm... I distinctly remember Joel de Guzman having something like this made part of Fusion -- which allows you to create aggregate types where you can iterate over the elements of the type. That required some preprocessor programming, which was called BOOST_FUSION_STRUCT (or something similar) instead of just the normal "struct". How is this different?
I do not know enough of this great library 'Fusion' to make a serious analysis, and BOOST_FUSION_ADAPT_STRUCT seems very close to what I did, although the whole library seems more oriented towards compile-time operations ? My specific needs are: - Iterations at run-time on class members, not only at compile-time, because some operations might be difficult to code (Trivial example: Add the latest ')' in a SQL query string). This implies, for example, that manipulation on a list of class members are done with plain STL algorithms. - The possibility to add dynamic attributes for class members, indexed for example by typeid (Column names in a SQL table, or specific format for displaying a given class member), which may change from one object to another of the same class. - The possibility to consider as a plain data member, any method (getter/setter ...) of a class, even if there is no underlying member, or if they are private, and also base classes.
Most of these features are implemented and running. It is designed to add more 'plugins' of this type - and of course Boost.Serialization is worth trying. A class can have as many members list as needed, and does not have to be changed because the members lists are different objects. The need came out of applications with dozens of business classes, where the database persistence routines had to be re-written all the time.
Have you seen how Soci (http://soci.sourceforge.net/) does it for the SQL/DB stuff?
This is great too. Please let me quote http://soci.sourceforge.net/doc/exchange.html v.set("ID", p.id); v.set("FIRST_NAME", p.firstName); v.set("LAST_NAME", p.lastName); ... sql << "insert into person(id, first_name, last_name) " "values(:ID, :FIRST_NAME, :LAST_NAME)", use(p);" Yes, it would be useful to generate the function 'to' and the 'insert' query, using any all-purpose form of class members list. As an illustration, I would generate the query with a members list defined this way: struct Person { int id; std::string firstName; }; struct Person_Def { template< size_t > struct at ; static const size_t size = 2 ; }; template<> struct Person_Def::at<0> : public crd::member_t< Person, typeof( Person().id ), &Person::id > {}; template<> struct Person_Def::at<1> : public crd::member_t< Person, typeof( Person().firstName ), &Person::firstName > {}; Hope it answers your question.
Do you have more readable documentation about the library?
Unfortunately not for the moment, although the code is short (1100 lines in four files, without test programs) and well-documented, with examples.
I personally would like to see a writeup with a clear statement of the rationale and what it's intended to do.
Agreed - btw, what are the recommended tools for editing Boost documentation ? Thanks for your attention.

On 05/12/2007 04:23 PM, remi.chateauneu@gmx.de wrote: [snip]
I do not know enough of this great library 'Fusion' to make a serious analysis, and BOOST_FUSION_ADAPT_STRUCT seems very close to what I did,
FWIW, I found the following two links useful for comparison: For crd, see lines 29-42 of: http://crd.cvs.sourceforge.net/crd/crd/README?revision=1.4&view=markup For fusion, see lines 39-43 of: http://boost.cvs.sourceforge.net/boost/boost/libs/fusion/test/sequence/adapt_struct.cpp?revision=1.2&view=markup&pathrev=HEAD
although the whole library seems more oriented towards compile-time operations ?
Well, according to: the library is a "fusion" of compile time metaprogramming with runtime programming. from the page: libs/fusion/doc/html/fusion/preface.html it should do both. Of course I realize "seems more oriented towards compile-time operations" is a "judgment call". However, it seems that the "basic operations" in both crd and fusion is accessing the I-th member in a tuple (or struct) and that the specific needs you mention below can be implemented using either the fusion "basic operations" or the crd "basic operations". If that's true (i.e. the "basic operations" are the "basis of comparison") then fusion is just as runtime oriented as crd.
My specific needs are: - Iterations at run-time on class members, not only at compile-time, because some operations might be difficult to code (Trivial example: Add the latest ')' in a SQL query string). This implies, for example, that manipulation on a list of class members are done with plain STL algorithms. - The possibility to add dynamic attributes for class members, indexed for example by typeid (Column names in a SQL table, or specific format for displaying a given class member), which may change from one object to another of the same class. - The possibility to consider as a plain data member, any method (getter/setter ...) of a class, even if there is no underlying member, or if they are private, and also base classes.
[snip]

compile-time operations" is a "judgment call". However, it seems that the "basic operations" in both crd and fusion is accessing the I-th member in a tuple (or struct) and that the specific needs you mention below can be implemented using either the fusion "basic operations" or the crd "basic operations". [snip]
- The possibility to add dynamic attributes for class members, indexed for example by typeid (Column names in a SQL table, or specific format for displaying a given class member), which may change from one object to another of the same class. - The possibility to consider as a plain data member, any method (getter/setter ...) of a class, even if there is no underlying member, or if they are private, and also base classes. OOPS. I didn't read the last two about dynamic sttributes and
On 05/13/2007 08:43 AM, Larry Evans wrote: [snip] the "consider as plain data member". I don't have any idea how fusion could be adapted to handle these needs, especially since I'm unclear about what is the "consider as plain data member" need. Could you explain more or provide examples? -regards, Larry

- The possibility to consider as a plain data member, any method (getter/setter ...) of a class, even if there is no underlying member, or if they are private, and also base classes.
OOPS. I didn't read the last two about dynamic sttributes and the "consider as plain data member". I don't have any idea how fusion could be adapted to handle these needs, especially since I'm unclear about what is the "consider as plain data member" need. Could you explain more or provide examples?
Please let me give you a silly example: struct Person { std::string _name ; int _age ; bool is_retired(void) const { return _age >= 65 ; }; }; When serializing such an object, many users would like to have 'IsRetired' saved like the other members. This may sound silly, but sometimes users are always right :):):) . So, CRD allows to have in a definition list, not only class members, but also methods of a given signature.

Larry Evans a écrit :
On 05/12/2007 04:23 PM, remi.chateauneu@gmx.de wrote: [snip]
I do not know enough of this great library 'Fusion' to make a serious analysis, and BOOST_FUSION_ADAPT_STRUCT seems very close to what I did,
FWIW, I found the following two links useful for comparison:
For crd, see lines 29-42 of:
http://crd.cvs.sourceforge.net/crd/crd/README?revision=1.4&view=markup
For fusion, see lines 39-43 of:
Yes, crd's representation is heavier, although preprocessor macros could easily lighten it (but this is not the point I guess). Crd is a very modest library compared to Fusion and simply tries to focus on the needs I presented. Although it is very probably possible to do that with Fusion, crd is modelled after the need of some business applications to have an all-purpose definition of classes on which many commonly used algorithms would be 'plugged' onto. For example, this motivated the choice to use std::type_info * as unique key for all manipulation of members lists. This defines, for example, the column names of a table, with a std::map< const std::type_info *, std::string >. There are, as well, two specific features I wish to implement: - Loading classes definition from a compiled object files with debug information ( Please see this: http://www.garret.ru/~knizhnik/cppreflection/docs/reflect.html or what is returned by readelf), or from shareable libraries. This would target existing applications which cannot be modified for organisational or budget reasons. - Given a dynamic class definition (list of members), creating an object with these members, but with a blob as underlying storage. The goal is to be able to manipulate class-like objets, with the same interface as a real class. ( ... maybe std::map< const std::type_info *, boost::any > , with type_info referring a class'member ? ) I agree that the concept is not quite clear at the moment, but again this is related to communicating data with software components which cannot be re-compiled at each modification of a user's class. Thanks for your attention.
although the whole library seems more oriented towards compile-time operations ?
Well, according to:
the library is a "fusion" of compile time metaprogramming with runtime programming.
from the page:
libs/fusion/doc/html/fusion/preface.html
it should do both. Of course I realize "seems more oriented towards compile-time operations" is a "judgment call". However, it seems that the "basic operations" in both crd and fusion is accessing the I-th member in a tuple (or struct) and that the specific needs you mention below can be implemented using either the fusion "basic operations" or the crd "basic operations". If that's true (i.e. the "basic operations" are the "basis of comparison") then fusion is just as runtime oriented as crd.
My specific needs are: - Iterations at run-time on class members, not only at compile-time, because some operations might be difficult to code (Trivial example: Add the latest ')' in a SQL query string). This implies, for example, that manipulation on a list of class members are done with plain STL algorithms. - The possibility to add dynamic attributes for class members, indexed for example by typeid (Column names in a SQL table, or specific format for displaying a given class member), which may change from one object to another of the same class. - The possibility to consider as a plain data member, any method (getter/setter ...) of a class, even if there is no underlying member, or if they are private, and also base classes.
[snip]
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 05/13/2007 04:06 PM, remi.chateauneu@gmx.de wrote: [snip]
- Given a dynamic class definition (list of members), creating an object with these members, but with a blob as underlying storage.
By blob do you mean Binary Large OBject: http://en.wikipedia.org/wiki/Binary_large_object
The goal is to be able to manipulate class-like objets, with the same interface as a real class. ( ... maybe std::map< const std::type_info *, boost::any > , with type_info referring a class'member ? ) I agree that the concept is not quite clear at the moment, but again this is related to communicating data with software components which cannot be re-compiled at each modification of a user's class.
This must be different than fusion's tuple because you mention boost::any, but wait, why not just use boost::any since it can hold any time and your wouldn't need the std::map. I must be missing somehting :( -regards, Larry

Larry Evans a écrit :
By blob do you mean Binary Large OBject:
Sorry - in fact I meant any opaque data type.
to be able to manipulate class-like objets, with the same interface as a real class. ( ... maybe std::map< const std::type_info *, boost::any > , with type_info referring a class'member ? ) I agree that the concept is not quite clear at the moment, but again this is related to communicating data with software components which cannot be re-compiled at each modification of a user's class.
This must be different than fusion's tuple because you mention boost::any, but wait, why not just use boost::any since it can hold any time and your wouldn't need the std::map.
I must be missing somehting :(
I was not clear: One of the aspects of my problem is that I wish to process all the members of a class, in a homogeneous way, while keeping the type information. A way to do that would be - in a way (This code does not make sense, it is just here to explain with I try to to with CRD) : boost::tuple< Class:Member1, Class::Member2, ... > ... with an actual 'Class' object ... ... generating a container of : std::pair< const std::type_info *, boost::any > Each of the boost::any object would actually store the value of a member of the 'Class' object. But, all std::pairs have the same type which is convenient for dynamic processing.
The possibility to add dynamic attributes for class members, indexed for example by typeid (Column names in a SQL table, or specific format for displaying a given class member), which may change from one object to another of the same class.
Now that we have, for example, a std::map indexed with const std::type_info *, we can add to it extra properties, for example fprintf format directives, as a std::map< const std::type_info *, std::string >. It is even possible to stack on to of boost::any, other properties based on the template parameters: This still allows to carry type-related information, but with the same type for all members of the class. -
-regards, Larry
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On 05/15/2007 04:45 AM, remi.chateauneu@gmx.de wrote: [snip]
... generating a container of : std::pair< const std::type_info *, boost::any >
But boost::any::type() returns std::type_info const&: http://www.boost.org/doc/html/boost/any.html#id717459-bb wouldn't that make the above pair contain redundant information? [snip]

Larry Evans a écrit :
... generating a container of : std::pair< const std::type_info *, boost::any >
But boost::any::type() returns std::type_info const&:
http://www.boost.org/doc/html/boost/any.html#id717459-bb
wouldn't that make the above pair contain redundant information?
In this specific case, yes (Using type_info & as a key is mandatory only when 'assembling' different properties types at run-time) . Now, imagine that you wish to apply another operation than 'any' to all members of an object, the first example to jumps to my mind is the transformation into a XML tag. You would have something like that: template< typename Struct, typename Type, typename Struct::*Member > struct ToXml { static std::string make_tag( const Struct & anObj ) { return "<" + std::string( typeid(Member) + ">" + to_string( anObj->*Member ) + "</" + std::string( typeid(Member) + ">" ; }; Now, you can put these objects in the same list because they have the same type: std::pair< const std::type_info *, &ToXml< myStruct, Type1, Struct::Field1 >::make_tag > std::pair< const std::type_info *, &ToXml< myStruct, Type2, Struct::Field2 >::make_tag > With such a list, plus an object of type Struct, you can apply 'toXml' to all members of the object. And if you need a very specific processing for such or such field, you just need to replace an element in the list. But crd creates the initial list with the compile-time definitions of members (A kind of boost::mpl::tuple ) plus the definition of 'ToXml'.

JD wrote:
I would like to know if there is any interest in an implementation of the edit distance algorithm
I have an implementation of the "diff" (i.e. edit script) algorithm here: http://svn.anyterm.org/anyterm/trunk/common/diff.hh http://svn.anyterm.org/anyterm/trunk/common/diff.cc The "edit distance" is in there somewhere. I would welcome this sort of thing in Boost. But it needs to be the best-known algorithm. IIRC, there was one improvement that I found in the literature but didn't implement because I didn't fully understand it. Please correct me if I'm wrong, but I get the impression that your code is O(NM) where N and M are the sizes of the two input strings. The algorithm that I implemented is O((N+M)D) where D is the edit distance in the worst case, and typically O(N+M+D^2). The algorithm is described in the .cc file linked above. Also, your code seems to have O(NM) space requirement; I'm not sure what the lower bound is, but I think it's probably O(N+M). Regards, Phil.

Hello everybody, I briefly read the code posted by JD and I noted that in the cycle body you just access the last two rows of the dynamic table by typing d[i-1] and d[i-2]. Thus, one should avoid to store the full table and just deal with three rows or less if one consider well how the comparisons are done. I hope that I undestood well the code....and may be I will later post a simple library which I wrote during my "Algorithm and data structure" course at University, which should implement a quite similar algorithm (even if it doesn't use the Range Iterator Concept) Best Regards, Manuel Fiorelli

Phil,
JD wrote:
I would like to know if there is any interest in an implementation of the edit distance algorithm
I have an implementation of the "diff" (i.e. edit script) algorithm here:
http://svn.anyterm.org/anyterm/trunk/common/diff.hh http://svn.anyterm.org/anyterm/trunk/common/diff.cc
The "edit distance" is in there somewhere.
I would welcome this sort of thing in Boost. But it needs to be the best-known algorithm. IIRC, there was one improvement that I found in the literature but didn't implement because I didn't fully understand it.
Please correct me if I'm wrong, but I get the impression that your code is O(NM) where N and M are the sizes of the two input strings. The algorithm that I implemented is O((N+M)D) where D is the edit distance in the worst case, and typically O(N+M+D^2). The algorithm is described in the .cc file linked above. Also, your code seems to have O(NM) space requirement; I'm not sure what the lower bound is, but I think it's probably O(N+M).
I improved the algorithm and now the space requirements falls to O(m). Computational is still the same, however. Having a very quick look at your algo (in diff.cc) I can see it's more complex in a way that it is not understandable at first read. As I can see in the Boost Guideline (http://www.boost.org/more/lib_guide.htm#Guidelines): "Aim first for clarity and correctness; optimization should be only a secondary concern in most Boost libraries." So I think unless there is an obvious way to reduce computational cost to O(m) (the same way it was obvious to reduce space usage to the same), I think we should stick to the original algorithm. Thanks for your insights and further remarks, Pavol, Did you have time to take a look? If not, I've just uploaded the new version to the Vault here (http://minilien.com/?gSIPLYyR08). Thanks for your remarks, JD

JD wrote:
I have an implementation of the "diff" (i.e. edit script) algorithm here:
http://svn.anyterm.org/anyterm/trunk/common/diff.hh http://svn.anyterm.org/anyterm/trunk/common/diff.cc
The "edit distance" is in there somewhere.
I would welcome this sort of thing in Boost. But it needs to be the best-known algorithm. IIRC, there was one improvement that I found in the literature but didn't implement because I didn't fully understand it.
Please correct me if I'm wrong, but I get the impression that your code is O(NM) where N and M are the sizes of the two input strings. The algorithm that I implemented is O((N+M)D) where D is the edit distance in the worst case, and typically O(N+M+D^2). The algorithm is described in the .cc file linked above. Also, your code seems to have O(NM) space requirement; I'm not sure what the lower bound is, but I think it's probably O(N+M).
I improved the algorithm and now the space requirements falls to O(m). Computational is still the same, however. Having a very quick look at your algo (in diff.cc) I can see it's more complex in a way that it is not understandable at first read.
Indeed it's quite complex, but in the comments I refer to a couple of papers which I suggest that you read. The operation is much easier to understand if you can visualise it, and my attempts at ASCII-art diagrams are not as good as the ones in those papers. There is probably other material in the literature; does Knuth have a chapter on this subject?
As I can see in the Boost Guideline (http://www.boost.org/more/lib_guide.htm#Guidelines):
"Aim first for clarity and correctness; optimization should be only a secondary concern in most Boost libraries."
So I think unless there is an obvious way to reduce computational cost to O(m) (the same way it was obvious to reduce space usage to the same), I think we should stick to the original algorithm.
I would disagree with that view. Let's quantify quite how the performance of the algorithms differ. My application was to compare the contents of a 25x80-character terminal window before and after the user has changed something; typically they have just typed one or two characters. So N = M = 25x80 = 2000, and D is typically 2. Your O(NM) algorithm takes 4 000 000 time, while mine takes typically O(N+M+D^2) = 4004 i.e. 1000 times faster, worst case O((N+M)D) = 8000 i.e. 500 times faster. I am not sure what the author of the guideline that you have quoted had in mind, but I would have thought that that was not thinking of large big-O complexity issues. My personal approach to optimisation has always been to make sure that the big-O complexity is right; it is then rarely necessary to consider optimising anything else. Regards, Phil.

Hi, JD wrote:
Pavol,
Did you have time to take a look? If not, I've just uploaded the new version to the Vault here (http://minilien.com/?gSIPLYyR08).
Thanks for your remarks,
So I had a look. Code seems reasonably simple. I cannot comment the algorithm itself, since I don't know the details. Just one quick question. Why do you require size specification? Range have the facilities to determine the size. Why not use it? Regards, Pavol.

Hi, when I tried to compile your example (in libs/algorithm/string/example/), I got an error regarding the unavailability of the 'min' function. I don't add your library to my source tree, instead I modified the #include in the edit_distance_example.cpp; Here is my gcc -v output Using built-in specs. Target: i386-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java- 1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux Thread model: posix gcc version 4.1.1 20070105 (Red Hat 4.1.1-51) I am not sure that it is either an error of your library or an error in how I used it ...However, I "solved" the problem by editing your header as follows: 1) insert at beginning #include <algorithm> 2) replace every reference to min with std::min Best regards, Manuel Fiorelli

Hi,
Hi, when I tried to compile your example (in libs/algorithm/string/example/), I got an error regarding the unavailability of the 'min' function. I don't add your library to my source tree, instead I modified the #include in the edit_distance_example.cpp; Here is my gcc -v output Using built-in specs. Target: i386-redhat-linux Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --enable-plugin --with-java-home=/usr/lib/jvm/java- 1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux Thread model: posix gcc version 4.1.1 20070105 (Red Hat 4.1.1-51)
I am not sure that it is either an error of your library or an error in how I used it ...However, I "solved" the problem by editing your header as follows: 1) insert at beginning #include <algorithm> 2) replace every reference to min with std::min
Sorry for that, I'm using visual which defines min and max... I should have paid more attention to the guidelines : http://www.boost.org/more/lib_guide.htm#Guidelines Thanks for your help, I've uploaded a new version of the code following Pavol's and your remmarks: http://minilien.com/?BXMB4vXUdS Thanks. JD

Hi Pavol, Pavol Droba wrote:
Just one quick question. Why do you require size specification? Range have the facilities to determine the size. Why not use it?
I don't know, I though it would be bad practice, for some reason... Anyway, it's fixed now, check out the new code uploaded here : http://minilien.com/?BXMB4vXUdS Thanks you, JD

Hi, JD wrote:
Hi Pavol,
Pavol Droba wrote:
Just one quick question. Why do you require size specification? Range have the facilities to determine the size. Why not use it?
I don't know, I though it would be bad practice, for some reason... Anyway, it's fixed now, check out the new code uploaded here : http://minilien.com/?BXMB4vXUdS
Function signature looks fine now, but if you want to be range-compliant, there are few issues in your implementation. 1. instead of std::size_t Arg1Size = Arg1End - Arg1It; use boost::range_size<Arg1>::type Arg1Size = size(Arg1); The reason for this is that not all ranges have to use std::size_t and not all ranges have random access iterators, therefors the simple substraction of Arg1End - Arg1It might not work. Better way would be to use std::distance, but the best way is to use "size" since it can take advantage of internal range representation. 2. I don't like the usage of plane C-Arrays. Even if you wanna use them, you should wrap the pointers to something like boost::scoped_array. Currently your code does not provide even the basic exeption quarantie (you will unconditionaly leak memory if an exception is thrown) and that's bad for this kind of algorithm. Regards, Pavol.

Function signature looks fine now, but if you want to be range-compliant, there are few issues in your implementation.
1. instead of std::size_t Arg1Size = Arg1End - Arg1It; use boost::range_size<Arg1>::type Arg1Size = size(Arg1);
Done.
2. I don't like the usage of plane C-Arrays. Even if you wanna use them, you should wrap the pointers to something like boost::scoped_array. Currently your code does not provide even the basic exeption quarantie (you will unconditionaly leak memory if an exception is thrown) and that's bad for this kind of algorithm.
Ok, I just move to std::vector. New version uploaded: http://minilien.com/?rrDciLH4g9 Regards, JD

JD wrote:
Function signature looks fine now, but if you want to be range-compliant, there are few issues in your implementation.
1. instead of std::size_t Arg1Size = Arg1End - Arg1It; use boost::range_size<Arg1>::type Arg1Size = size(Arg1);
Done.
2. I don't like the usage of plane C-Arrays. Even if you wanna use them, you should wrap the pointers to something like boost::scoped_array. Currently your code does not provide even the basic exeption quarantie (you will unconditionaly leak memory if an exception is thrown) and that's bad for this kind of algorithm.
Ok, I just move to std::vector. New version uploaded: http://minilien.com/?rrDciLH4g9
As far as I can tell, the code seems fine. Now the question is if we need any kind of review to add this to the boost. Unfortunately, I'm not the one that can give an aswer to this. Also I would consider the suggestions that were proposed by Phil Endecott. Big-O complexity difference seems quite significant.

I read the article pointed out by Phil Endecott: An O(NP) Sequence Comparison Algorithm where P=D/2 when M=N and D is the actual edit distance, as said in the abstract: Let A and B be two sequences of length M and N respectively, where without loss of generality N > M, and let D be the length of a shortest edit script between them. A parameter related to D is the number of deletions in such a script, P = D/2 − (N −M)/2. We present an algorithm for finding a shortest edit distance of A and B whose worst case running time is O(NP) and whose expected running time is O(N + PD). The algorithm is simple and is very efficient whenever A is similar to a subsequence of B. It is nearly twice as fast as the O(ND) algorithm of Myers [9], and much more efficient when A and B differ substantially in length. the space required is O(M+N) the pseudocode is enougth simple and it's almost straightforward to translate it in C++ Algorithm Compare Begin fp[−(M+1)..(N+1)] := −1; p := −1; Repeat Begin p := p + 1; For k := −p to D−1 do fp[k] := snake( k, max (fp[k−1] + 1, fp[k+1]) ); For k := D + p downto D + 1 by −1 do fp[k] := snake( k, max (fp[k−1] + 1, fp[k+1]) ); fp[D] := snake( D, max (fp[D−1] + 1, fp[D+1]) ); End Until fp[D] = N; Write "The edit distance is: " D+2p; End Function snake( k, y: int) : int Begin x := y − k; While x < M and y < N and A[x+1] = B[y+1] do Begin x := x + 1; y := y + 1; End snake := y; End FWIW, the article provides same performance comparison against the Miller's algorithm (O(ND)), the one implemented by Phil I guess. number of edit number of comparisons execution time M N deletions distance O(NP) O(ND) O(NP) O(ND) 4000 5000 10 1020 21564 526506 0.13 2.67 4000 5000 50 1100 59520 614391 0.41 3.12 4000 5000 100 1200 121635 737748 0.83 4.02 4000 5000 200 1400 255157 1004952 1.62 5.87 4000 5000 400 1800 600216 1693377 3.68 8.94 4000 5000 600 2200 1016433 2523687 6.41 14.85 5000 5000 200 400 49202 93139 0.33 0.61 5000 5000 600 1200 398499 791815 2.34 4.65 In the attached file there is a draft but working implementation. this is a link to the article: http://www.cs.arizona.edu/~gene/PAPERS/np_diff.ps in reading it pay attention that the x-y coordinates are swapped and this is a bit annoying. Best Regards, Marco -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

on Fri May 18 2007, Marco <mrcekets-AT-gmail.com> wrote:
I read the article pointed out by Phil Endecott: An O(NP) Sequence Comparison Algorithm where P=D/2 when M=N and D is the actual edit distance, as said in the abstract:
This is cool! I suggest you create a Trac ticket so it doesn't get forgotten: http://svn.boost.org -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

On Sun, 20 May 2007 21:17:58 +0200, David Abrahams <dave@boost-consulting.com> wrote:
on Fri May 18 2007, Marco <mrcekets-AT-gmail.com> wrote:
I read the article pointed out by Phil Endecott: An O(NP) Sequence Comparison Algorithm where P=D/2 when M=N and D is the actual edit distance, as said in the abstract:
This is cool! I suggest you create a Trac ticket so it doesn't get forgotten:
I'll do it. This is the first time I create a ticket, do I need an account ? Marco -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Marco wrote:
On Sun, 20 May 2007 21:17:58 +0200, David Abrahams <dave@boost-consulting.com> wrote:
on Fri May 18 2007, Marco <mrcekets-AT-gmail.com> wrote:
I read the article pointed out by Phil Endecott: An O(NP) Sequence Comparison Algorithm where P=D/2 when M=N and D is the actual edit distance, as said in the abstract: This is cool! I suggest you create a Trac ticket so it doesn't get forgotten:
I'll do it. This is the first time I create a ticket, do I need an account ?
An account is preferred as non-anonymous reports are much better for the management of the issues since we have someone to respond to. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - grafikrobot/yahoo
participants (12)
-
David Abrahams
-
Dean Michael Berris
-
JD
-
Jeremy Pack
-
Larry Evans
-
Manuel Fiorelli
-
Marco
-
Pavol Droba
-
Phil Endecott
-
remi.chateauneu@gmx.de
-
Rene Rivera
-
Stjepan Rajko