[boost] [GSOC2010] HEAPS & QUEUES draft proposal

5 Apr 2010

      Hello everyone,

My name is Alexandru Damian. I am a second year student at the Polytechnic
University of Bucharest, Romania. I am very interested in the Heaps & Queues
idea, and with both a good programming and mathematical background, I
believe that I can offer valuable ideas to this project.
Below is a draft proposal for this. I didn't know what you would expect from
such a proposal (talked to grafikrobot on freenode and he redirected me
here), so the structures that are to be implemented are rather vague. If the
proposal is on track, I will detail them. I have also talked so some guys on
the irc channel about a more generic approach to trees and tries. What's
your opinion on that (given that tree structures are already implemented in
/intrusive)?

I hope I'm not too annoying :).

        ***STUDENT PROPOSAL: HEAPS & QUEUES***

NOTE: My Student Proposal follows the template listed on:
https://svn.boost.org/trac/boost/wiki/SoCSubmissionTemplate

    **PERSONAL DETAILS**

NAME: Alexandru Damian
UNIVERSITY: Polytechnic University of Bucharest
MAJOR: Computer Science and Engineering
DEGREE PROGRAM: B.Sc.
EMAIL: alexdamian1989@gmail.com
AVAILABILITY: According to the official timeline, there will be 11-12 weeks
at my disposal: March 24 - August 9 (16). However, I plan to contract the
"Community Bonding Period" to 2-3 weeks, so that I may have at least one
week advance in coding. This will help me better organize my time during the
summer, when I will be unavailable for about 2 weeks (in June) because of my
exams.

    **BACKGROUND INFORMATION:**

EDUCATIONAL & WORK BACKGROUND:
    I have been preparing in the fields of Computer Science and Mathematics
since high-school when I have competed in numerous national and
international competitions, gaining important experience and always
finishing among the best. My interest in personal and professional growth
has continued at university as well, with even better results at contests
and scientific research.
    My work background and coursework have supplied me with many skills in
theoretical computer science, data structures, algorithms and programming
languages. In addition, I have broad experience in Mathematics and problem
solving techniques. In the year 2009, I have worked at different projects
for the Faculty of Automatic Control and Computers, including the research
and development of a way to store and illustrate the temperature ratio in
the University server room. The same year, I worked as a volunteer student
consultant and I offered support in Mathematics for students taking the
admission examinations.

EXPRESSION OF INTEREST:
    Boost C++ Libraries has caught my attention in 2008, when I had my first
important encounters with the Open Source Community. A library that offers
more advanced data structures and algorithms than STL is needed on a daily
basis when optimizing complex code. It is only natural for me to choose the
"Heaps & Queues" project, as I have valuable experience in the algorithms
that power these structures and their real-world applications.
    My plans for further support and development of the Boost C++ Libraries
will extend even after the GSOC deadline, with plans to implement more
algorithms supplementary to the work done during the summer and in other
areas (such as the BigInt idea, or AI algorithms: neural networks, genetic
algorithms, fuzzy systems etc.).

PROGRAMMING SKILLS:

    Strong background in theoretical computer science, algorithms, data
stuctures and optimization .
    Strong background in advanced mathematics.
    Programming Languages: C (4/5), Java (4/5), C++ (3/5), C# (3/5), Shell
scripting (2/5), ASM (2/5).
    Other programming languages and their proficency levels are illustrated
in the Resume.

    **PROJECT PROPOSAL**

ABSTRACT

    The purporse of this project is to create a library that involves heap
and queue creation and manipulation. This will be very useful in
implementing more complex algorithms or data structures that take advantage
of these structures (a basic example is Dijkstra's shortest path algorithm).
The library will include basic implementations of heap structures (i.e.
binary heap) and queues, as well as the more complex (and more relevant)
structures.

PROJECT CONTENTS

    The current version of the Libraries has little support for heaps and
queues. Therefore, the following will be implemented, with a discussion on
every structure's difficulty, speed and memory used.
    Firstly, a well-known table of complexities for the binary
min(max)-heap:
findMin(findMax) : O(1) | deleteMin(deleteMax) : O(log N) | insert : O(log
N) | merge : O(N)

* D-ARY HEAP : a more general implementation of the binary heap (2-ary
heap).
    The main advantage a d-ary heap has is that insertion will be
significantly faster, requiring a theoretical O(log_d N) operations.
However, finding an element's parent is rather difficult if d is not a power
of 2, since we can not use bitwise shifting operations. However, delete-min
is slower, because of the d-1 comparisons to find the minimum value, using a
standard algorithm (probable fix: use a min-heap).

* BINOMIAL HEAP, BINOMIAL TREES : a mergeable heap; although the complexity
for findMin(findMax) is O(log N), this heap, along with the ones below (also
mergeable), offer a better complexity for merge : O(log N). Since binomial
heaps use binomial trees, this data structure will also be implemented.

* FIBONACCI HEAP : another mergeable heap, with better amortized time
complexity than the binomial heap

* RELAXED HEAP, RELAXED FIBONACCI HEAP (an idea at:
http://www.google.ro/url?sa=t&source=web&ct=res&cd=1&ved=0CAYQFjAA&url=http%3A%2F%2Fwww.pmg.lcs.mit.edu%2F~chandra%2Fpublications%2Fheap.pdf&ei=YOa5S5rEFIylOK3CtKEL&usg=AFQjCNFVgc1o8BRokpgu5iU0f1gXhkQDAw&sig2=qzEJWD56cflm1CnRf3esbA)

* SKEW HEAP : a structure that is easier to merge than a binary heap, but is
unbalanced. This could be considered optional, as it is a median option
between d-ary heaps and mergeable heaps (binomial, fibonacci).

* TREAP : although treaps are already implemented, they should be adapted to
use the new heap library.

* QUEUE : a basic implementation of the queue structure (possibly use the
STL container)

* DEQUEUE : although an implementation already exists /fusion/container,
that has to be documented. A new implementation using the generic queue
structure mentioned above.

* PRIORITY QUEUE : implemented with the use of a heap.

* BUFFER (CIRCULAR QUEUE) : fixed size, circular queue (if queue is full,
the oldest data is overwritten). Such a queue could be used to implement a
structure similar to RRDs.

* (BONUS) SKIPLISTS : randomized search structure.

* (BONUS) B-TREES : extended binary trees with excellent complexity when
comparisons are not the most time-consuming operations.

* (BONUS) INTERVAL TREES : extended complete binary trees supporting fast
segment intersection queries.

    Some of these structures can use a fixed size array (use /boost/array).
However, it would be good to eliminate the constraints of fixed size arrays.
This will require using STL's <vector> or, should it be decided not to
include a binary, a local implementation of it, i.e. ARRAY LIST.

TIMELINE

    As stated above, I would like to start with an advance of one week (May
17).

WEEKS -2, -1, 0    (April 26 - May 16) :    Community Bonding Period
    These three weeks will offer me the possibility to accomodate with the
code, the coding conventions and the community. During these 3 weeks, I will
also discuss with the organization whether binary libraries should be
included or proprietary structures should be implemented (variable sized
arrays, basic queue, basic priority queue (heap)). Modifications in the
timeline (reordering, additions, deletions) can also occur. Finally, I will
decide if I clean up the implementations from /pending (i.e. queue,
fibonacci_heap, relaxed_heap, etc.) or create new structures, based on the
newly implemented structures.
    Implement some simple structures so as to better get used to the code
(probably ARRAY LIST or BINARY HEAP).

WEEK 1         (May 17 - May 23) :     Implement D-ARY HEAP, BINARY HEAP
(2-ARY HEAP).
    Decide whether to first implement the d-ary heap and then the binary
heap as a particular case, or implement the binary heap first, and after
that the d-ary heap, associating a binary min(max)-heap to each array of
children; solve the excess memory.

WEEKS 2, 3     (May 24 - June 6) :     Implement BINOMIAL TREES, BINOMIAL
HEAPS, FIBONACCI HEAPS (don't adapt, usage for relaxed fibonacci heaps),
RELAXED HEAPS and RELAXED FIBONACCI HEAPS.
    Official start of the coding phase.
    All of these heaps follow the same basic ideas, therefore progress
should be relatively easier after the binomial heaps are implemented.
Initial algorithms will probably be those specified (and variations) in
Cormen or Knuth, as well as the idea suggested in the above mentioned link.

WEEKS 4, 5    (June 7 - June 20) :     University Exams

WEEKS 6, 7     (June 21 - July 4) :     Fix any problems with previous
heaps. Implement SKEW HEAPS. Adapt TREAPS.

WEEKS 8, 9    (July 5 - July 18/16) : Add mutable behaviour to the heaps.
Apply optimizations. Create some automated tests.
    Week 8 will mark the mid-term evaluation deadline. The objective is to
complete HEAP structures. The timeline offers roughly 2 weeks for bug
fixing, so that should be enough.

WEEK 10        (July 19 - July 25) :     Implement QUEUE, DEQUEUE.

WEEK 11        (July 26 - August 1) :    Implement CIRCULAR QUEUE, PRIORITY
QUEUE.

WEEK 12     (August 2 - August 8) :    Fix bugs related to the QUEUE
structures. Add mutable behaviour to the queues. Apply optimizations and
custom algorithms for fixed-length data types. Create automated tests.

WEEK 13        (August 9 - August 16): Finalize code and documentation.

NOTES:
    1. The Exams period is, actually, 4 weeks (June 7 - July 5), comprising
5 exams. However, I should be able to compress the total time to only 2
weeks by moving some of my studying in May.
    2. Weeks 6 and 8 are reserved for fixing heap bugs. Should there be less
problems than expected, I can gain one bonus week.
    3. If we choose to implement the structures using STL binaries, I could
gain one more bonus week from weeks 1, 10 and 11.
    4. With (2) and (3) in mind, I could implement any of the bonuses
written above (at the end, if everything else is complete).
        a) SKIPLISTS : a randomized search structure with very good
practical complexity.
        b) INTERVAL TREES : given the already implemented Interval structure
(/numeric/interval), it would be nice to have such a structure, for easy
geometric intersection queries.
        c) B-TREES : adaptation of the Balanced Binary Trees, typically used
in situations where the comparison operation is not the most time consuming
one (disk seeking).

WORK FLOW
    I plan to organize my time and coding in such a manner that I am able to
post changes on a forum / blog every 4 days (except the period with exams).
That means roughly 2 times per week, as I will have no problem working in
weekends.

TODO

Cheers,

-- 
Alexandru Damian

Polytechnic University of Bucharest
  Faculty of Automatic Control and Computers