Monday, February 2, 2015

The poor man's garbage collector

While c++ is not my everyday thing, every time I take on it I miss some high level features like a garbage collector. Indeed it is a very polarizing topic for c++.
Smart pointers do come to give some help but they annoy me as they've their own rules to follow and they re not that simple or straightforward to use as a full-fledge gc

This weekend I had a bit of inspiration on it and come up with something to try, but be warned: I am still validating the concept, looking at different angles and frankly not entirely sure it works ... yet :)

So let's start with the basic: reference counters and the cycle problem, I wont go into details, there's plenty of reading around about how the reference counting work and the cycle problem. In my experiment I take them as the starting point, let's call my smart pointer GcPtr (which is as you guessed a c++ template, very much likely the existing ones in the c++ libraries).

It works as you may expect, so in order to break cycles I introduce the concept of 'Context', what is it? well just the function body where you are, in code a function looks like:

void do_something_interesting()
{
    Context context;
    GcPtr<Node> node(new Node());

        //a reference to itself is the most basic case of a cycle
    node->set_parent(node);
}



Well, that's not enough for showing much, but we see that starting a function with a context object allocated in the stack gives me the possibility to execute something in the destructor of the context, just when the function is going to go out of scope.
The idea here is that the GcPtr gets attached to the active context, hence if the context is destroyed everything attached to it becomes unreachable, if it wasn't claimed due to its reference count getting to 0 then it can still be reclaimed because it goes out of the context where it was allocated.
Yes, that is part of the story but not all, because a pointer can be 'passed down' to another function / level / context, let's see a more complex example:

void create_child(GcPtr<Node>& parent)
    Context context;
    GcPtr<Node> child(new Node());

    child->set_parent(parent);
}


void do_something_interesting()
{
    Context context;
    GcPtr<Node> node(new Node());

        create_child(node);
}

 
Now we got a parent-child circular reference, child was created in context 2, while node was created in context 1.
What happens internally is that inside the node class we have:

void Node::set_parent(GcPtr<Node>& parent)
{
    m_parent = parent
    parent->set_child(GcPtr<Node>(this))
    ...
 

void Node::set_child(GcPtr<Node>& child)
{
    m_child = child;
    ...

What happens here is subtle, m_child was attached to context 1 (it was created in do_something_interesting) so when it is assigned something attached to context 2 it gets promoted to context 1.
Again, child was created in context 2, but at the point we assign something that is reachable from context 1, it gets promoted, meaning context 2 does not know of it anymore and is added to context 1
Because of that, after returning from create_child, the child would still be alive (there is one reference in the node instance that references it)
Finally, after context 1 is destroyed, even as the reference count is 1 for node and 1 for child, it is destroyed as it leaves the context is attached too.

What would be the rules for using this smart pointer? pretty simple:
  • keep references inside of GcPtr
  • If you're creating objects in a function / method, start the function with a context creation as above
  • dont touch GcPtr's in your destructors (as a rule of thumbs, I need to think it more deeply)

Neat? maybe, I did a proof of concept and works like a charm, on the positive side it is very deterministic, it does not rely on background threads neither stop-the-world syndromes as some complex GC do, there's nothing complicated like generations and so, the extra overhead is quite small.
It is still possible if desired to offload the deallocation to a background thread but that is in itself a tradeoff that will benefit some cases while punish others so it is not a win-win thing.
Another beauty is that you can go selectively using it, it is not an all-or-nothing approach.

Yet, as I mentioned before, I still have to look at it from different edges and angles, I'm not personally convinced it will handle all the cases, what if for example in the above example childs was a vector? how would the smart pointer realize the right context? not sure yet, and an elegant solution needs more poking around...

I'll publish the example code I created later on, it is just quite messy at the moment and needs heavy massaging but so far is about 50 lines of code or so

If you happen to have something to point out please do! I'm sure I'm missing stuff out.

-Mat


UPDATE:

Heh, too good to be true, not long after I wrote I realized that a class that holds a GcPtr would have to have the lowest context of any of it's reachable fields, so it still needs to climb up and traverse a chain of references, while it does not invalidate the whole concept it does put more complexity, it starts to look like a class will need a reserved field _base_context or so.
Well then, as I was suspected it was too simple to be fully working, surely I'm still missing something else.