[boost] [GSoC] Some new ideas for this year's GSoC

6 Apr 2010

      Hello, I am Stefan and this is my first year at GSoC as it's the first 
time I'm eligible to participate as a student. Since I have been using 
C++ and the Boost libraries for quite some time, I would be thrilled to 
be able to code for Boost.
I have quite a few ideas, and I was wondering which ones you consider to 
benefit the most, therefore I'm awaiting for your feedback. Also, please 
let me know if any of these ideas are already implemented in any of the 
libraries. Here we go:

* Selection algorithm (find the kth smallest element in a container, 
O(n) worst-case, very easy to implement)

* Optimize Boost String Algorithm finders to use an efficient substring 
matching algorithm as opposed to the naive algorithm. Most likely 
because of the fact that boost has implemented the finders for strings 
as generic algorithms, they haven't used efficient algorithms such as 
KMP, Boyer-Moore, Rabin-Karp etc.
  \see boost/algorithm/string/detail/finder.hpp
  The idea is to implement efficient algorithms such as the ones 
mentioned above to replace the current naive implementation. Note: it 
may be difficult to find nth matching substring position using some of 
these algorithms.
  Difficulty: easy-medium

* Implement a string algorithm for generating the suffix array. The 
suffix array, after being precomputed on a certain string, allows for 
very efficient substring searches in the future. This is useful when the 
string doesn't change too often but there are plenty of substring to 
search against.

* Yet another efficient associative data structure, skip lists?

* Space partitioning data structures? kd-trees? quadtrees? octtrees? 
collision detectors? closest neighbors?

* A parser capable of evaluating mathematical expressions which contains 
operands,operators,function calls,variable names and constant names 
(also considering priority of operators):
  1. a list of valid operators described by: a) number of operands 
(unary/binary/ternary?); b) allowed operand types; c) the precedence 
(priority of the operator)
2. a list of existent functions described by: a) function name; b) 
number of parameters; c) type of each parameter and return type; d) a 
callback (maybe just automatically deduce type based on the callback's 
type if possible)
3. a list of constants
4. a list of allowed variable names and types

Here's a pseudo-C++ code sample on how it should look like:
namespace m = boost::math_expressions_parser;
m::parser p;
p.add_operators()
  ('+', 1, ret<int32_t>(_1+_2)) // binary +, precedence 1
  ('*', 2, ret<int32_t>(_1*_2)) // binary -, precedence 2
  ('+', 3, ret<int32_t>(_1)) // unary +, precedence 3
  ('-', 3, ret<int32_t>(-_1)); // unary - (also a funny face), precedence 3
p.add_functions()
  ("sin", ret<int32_t>(sin(_1)));
p.add_constants()
  ("PI", 3);
p.add_allowed_variables()
  (allowed_var<int32_t>("x"));

m::expression E("1 + 3*sin(PI*x)", p); // evaluates mathematical 
expression E against parser p
if (!E.valid()) { /* ... */ }
else {
  std::cout << E.evaluate( ("x", 3) ); // evaluate this expression by 
letting x=3
  E.let() ("x",4);
  std::cout << E.evaluate();
}
//Maybe an optimization so only the parts that depend on the changed 
variable will be recalculated every time?. E.g.: for expression f(x)+y, 
f(x) is recomputed only when x is changed. This may cause problems for 
functions that do not always yield the same results for the same input. 
Maybe make this optimization optional, only if the user marks the 
function/operator as "volatile" or something
Difficulty: hard-very hard

* Someone suggested on this ml earlier that a crypto suite would be 
useful for boost. Maybe, but it would take a lot of time and dedication 
to implement. I believe that we should at least add some Cryptographic 
hash functions (such as md5, sha1, sha256 etc.). No one wants to install 
Crypto++ or something similar just to be able to compute the SHA1 sum of 
a file. It is a common necessity to want to compute a hash function of a 
string.
  Difficulty: very easy if you copy the implementation and just create a 
boost-ish interface

* In-place radix sort? Radix sort is a very efficient algorithm which 
performs better than std::sort (my implementation) (also asymptotically 
better) for some particular types such as: uint8_t, uint16_t, uint32_t, 
unsigned char, unsigned char[2], unsigned char[4] etc. Radix sorting 
takes linear time, but unfortunately, linear memory. It is very useful 
for sorting very large amounts of numbers tho (or genetic codes and 
maybe some other stuff).

I am able to implement some of these stuff and I'm waiting for your 
opinion in what would be most useful to be part of boost. Let me know if 
you believe that none will have any use.

Yours sincerely,
  Stefan