Plugging leaks in Python

Python applications do leak memory. Not due to Python itself, but due to application bugs. Though recent versions of Python have a true garbage collector that breaks cyclical references, you may still leak a lot of memory by keeping object references in forgotten corners of your code.

Another common cause of memory leaks is the presence of a __del__ method in a class, which prevents the garbage collector from breaking cycles that involve instances of that class. (This is the behaviour of Python 2 and of Python 3 before 3.4; later versions can collect most such cycles.) The uncollected object then keeps references to others, which keep references to others, and suddenly 90% of your object pool cannot go away.

Unfortunately my application was leaking so much memory this way that it became sluggish within half an hour of use. So I had to find out which objects were not being freed, and why. I improved the situation a lot by breaking references "manually" (setting all references to other objects to None when the class had an explicit unload method), until I found the real culprit: three classes that had __del__ methods.
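Breaking references by hand might look like the sketch below. `Document` and `View` are hypothetical classes, not the ones from the application described here, and the collector is switched off only to show that plain reference counting frees the object once the cycle is broken.

```python
import gc
import weakref


class Document(object):
    def __init__(self):
        self.view = View(self)      # Document -> View

    def unload(self):
        # Break the cycle by hand: drop our references to other objects.
        if self.view is not None:
            self.view.document = None
            self.view = None


class View(object):
    def __init__(self, document):
        self.document = document    # View -> Document (closes the cycle)


gc.disable()    # only reference counting is at work from here on
try:
    doc = Document()
    probe = weakref.ref(doc)
    doc.unload()
    del doc
    freed_without_gc = probe() is None
finally:
    gc.enable()

print("freed without the collector: %s" % freed_without_gc)
```

Without the call to unload(), the two objects would keep each other alive until the cycle collector ran.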

The technique I put together (with the help of a lot of Googling) exploits some features of the garbage collector module (gc).

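import gc

# Force a full collection first, so collectable cycles are gone
gc.collect()

# Snapshot of every object tracked by the collector
objects = gc.get_objects()

# IDs of the objects in the snapshot, for fast identity lookups
objects_id = {}
for o in objects:
    objects_id[id(o)] = True

# Objects the collector found but could not free end up in gc.garbage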

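In this code I force a garbage collection, so I won't see collectable cyclic references, and then I get the complete pool of live objects that the collector tracks. There will be several thousand of them at minimum, since nearly everything is there: functions, methods, classes, modules, instances, lists, dictionaries, etc. (Atomic objects such as numbers and strings are not tracked, since they cannot hold references to other objects.)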
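With thousands of objects in the pool, a summary by type is usually the first thing worth looking at. The helper below is a sketch; `count_objects_by_type` is a name made up for this example.

```python
import gc
from collections import Counter


def count_objects_by_type(limit=10):
    """Return the `limit` most common type names among tracked objects."""
    gc.collect()
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(limit)


for name, count in count_objects_by_type():
    print("%8d  %s" % (count, name))
```

Taking this summary twice, before and after exercising the application, and comparing the counts gives a rough idea of which types keep growing.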

The gc.garbage list contains the objects that gc found unreachable but could not free because it didn't know how to break the cycle of references. This typically happens when a class has a __del__ method, which means the developer should have cleared the references by hand, but didn't. It is a very good place to begin searching for leaks.
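On Python 3.4 and later, cycles involving __del__ are normally collected, so gc.garbage tends to stay empty. One way to still see what the collector finds is the gc.DEBUG_SAVEALL flag, which makes gc.collect() store every unreachable object in gc.garbage instead of freeing it. This is a different route from the one described above, sketched here with a made-up `Node` class and a made-up helper name.

```python
import gc


class Node(object):
    """Made-up class whose instances form a reference cycle."""

    def __init__(self):
        self.other = None


def collect_and_list_garbage():
    """Run a collection with DEBUG_SAVEALL and return what was found."""
    old_flags = gc.get_debug()
    gc.set_debug(gc.DEBUG_SAVEALL)
    try:
        gc.collect()
        found = list(gc.garbage)
    finally:
        gc.set_debug(old_flags)
        del gc.garbage[:]       # do not keep the garbage alive ourselves
    return found


gc.collect()                    # start from a clean slate
a, b = Node(), Node()
a.other, b.other = b, a         # a <-> b cycle
del a, b                        # now unreachable, but not yet freed

found = collect_and_list_garbage()
node_count = sum(1 for o in found if isinstance(o, Node))
print("unreachable Node instances: %d" % node_count)
```

Everything listed this way was only reachable through cycles, which makes it a list of candidates for manual reference breaking.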

But my application was also keeping objects alive through true references (not cyclical ones), and I needed to find out who was referring to those leaked objects. To do that, I wrote the following code:

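# l is the list of objects suspected of leaking; verbose is the
# verbosity level. Both are defined elsewhere in the application.
# show_referrers() is defined below and must exist before this loop runs.
for o in l:
    print(o)
    if verbose >= 2:
        # compare by identity: a custom __eq__ could fool "o in gc.garbage"
        if any(o is g for g in gc.garbage):
            print("    In gc.garbage (possible cause: "
                  "presence of __del__ method)")
        else:
            cold_trail, lines = show_referrers(o, [id(o)], 1)
            for line in lines:
                print(line)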
 
import types
from types import ModuleType, FunctionType

# BufferType exists only in Python 2; elsewhere, match nothing
BufferType = getattr(types, "BufferType", ())


def show_referrers(initial_object, backrefs, level):
    """Return (cold_trail, lines) describing who refers to initial_object.

    backrefs holds the IDs of the objects already on the trail.
    """
    cold_trail = True
    lines = []

    for o in gc.get_referrers(initial_object):
        bump = 0
        if id(o) in backrefs:
            # cyclical reference to an object of the trail
            continue
        elif id(o) not in objects_id:
            # object created within this very routine
            continue

        if isinstance(o, (type, ModuleType, FunctionType)):
            # dead end, but at least we are 100% sure
            # this trail does not lead to a cycle
            #
            # lines.append("  " * (level + 1) + str(type(o)) +
            #              " " + str(o)[0:80])
            cold_trail = False
            continue

        if isinstance(o, BufferType):
            # uninteresting to print, but must be followed
            pass
        else:
            lines.append("  " * (level + 1) + str(type(o)) + " " +
                         str(o)[0:80])
            bump = 1

        if len(backrefs) < 8:
            backrefs_new = backrefs[:]
            backrefs_new.append(id(o))
            referrers_are_cold_trails, referrers_lines = \
                show_referrers(o, backrefs_new, level + bump)
            lines.extend(referrers_lines)
            cold_trail = cold_trail and referrers_are_cold_trails

    if cold_trail:
        # our introspection was worthless because
        # it only led to cyclical refs
        lines = []

    return (cold_trail, lines)

It is centered around the gc.get_referrers() function, which returns the objects that hold references to a given object. Since the primary reference is most likely held by a list or a dictionary, we then need to find who refers to that list or dict, and so on.
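The two-step walk described above (object, then its container, then the container's owner) can be tried in isolation. In this sketch `Leaked`, `registry` and `owners` are made-up names.

```python
import gc


class Leaked(object):
    """Made-up class standing in for a leaked object."""


suspect = Leaked()
registry = {"forgotten": suspect}   # first-level referrer: a dict
owners = [registry]                 # second-level referrer: a list

# Who refers to the object?  The dict shows up among the referrers.
first_level = gc.get_referrers(suspect)
dict_found = any(r is registry for r in first_level)

# Who refers to that dict?  The list does.
second_level = gc.get_referrers(registry)
list_found = any(r is owners for r in second_level)

print("dict found: %s, list found: %s" % (dict_found, list_found))
```

Expect other entries in both results as well, such as the module namespace that holds the names `suspect` and `registry`; filtering that noise is most of the work in a real hunt.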

Of course, one object may be referred to by many others, and some references end up forming a cycle. Those cycles can be ignored because, if they were the only problem, the GC would have done away with the object (except in the __del__ cases). What keeps the object alive is a non-cyclic reference. So my code tries to detect and ignore referral paths that lead only to a cycle, calling each one a "cold trail".

When the ultimate referrer to an object is a module or a function, it may or may not be helpful to print it. In my case it was not, so I commented out the code that annotates such objects. If no referrer to the object is listed, try enabling this annotation too.

I used object IDs in objects_id and backrefs since "object in list" may give wrong answers if some class involved implements a custom __eq__. And, due to the low-level nature of those operations, I felt more comfortable using IDs, as if they were C++ pointers.
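A small sketch shows how a custom __eq__ can mislead a membership test and how comparing IDs avoids it. `Weird` is a deliberately extreme, made-up class.

```python
class Weird(object):
    """Made-up class whose __eq__ claims to equal everything."""

    def __eq__(self, other):
        return True

    __hash__ = object.__hash__   # keep instances hashable on Python 3


a = Weird()
b = Weird()
pool = [a]

# Equality-based membership is fooled: b was never put into pool.
by_equality = b in pool

# Identity-based membership compares IDs and gives the right answer.
pool_ids = set(id(o) for o in pool)
by_identity = id(b) in pool_ids

print("by equality: %s, by identity: %s" % (by_equality, by_identity))
```

IDs are only unique among objects that are alive at the same time, so this works as long as something (here `pool`, in the code above the `objects` snapshot) keeps the objects alive while their IDs are in use.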
