Back in the day, we had to keep track of every dynamically allocated object and array we created and make sure each one was eventually deallocated, or we’d have memory leaks all over the place. But so-called “managed” programming frameworks like .NET did away with that. Now all we have to do is forget about the object, and the garbage collector will magically detect that it is no longer being used and reclaim its memory, all transparently behind the scenes.
Or so the theory goes…
I recently ran into a scenario, though, where the garbage collector just didn’t cut it. The objects were “forgotten” (meaning no strong references to them existed anymore), yet the memory used by the application continued to rise as new objects were created in their place, almost as if there were a memory leak. Here’s a short description of my application:
My Application
It’s actually a quasi-personal project. What happened was at work one day, for some unknown reason, our internet was going insanely slow. We could tell by our network appliance that it was because we were using all our bandwidth, but we had no way of knowing who or what was using it. There was a sense of urgency to find out, since it was hindering people’s work.
Meanwhile, one of my personal projects involved using sockets to sniff packets to spy on network activity. So I got an idea. I quickly wrote a program using my packet sniffing code to calculate how much bandwidth each IP address on the network was using. Then I hooked up an old network hub (not a switch or router, a hub, because it repeats all traffic to all ports) in series with one of the physical network cables that all the internet traffic was going through, and hooked up my work computer to that hub. With all the internet traffic being forwarded to my machine, I used my software to determine which computer was using the bandwidth.
After those events, my boss expressed a desire for a long-term solution that could be used if that scenario ever happened again. He never officially assigned me to work on it, though, and it is based on code from a personal project, which is why I call it quasi-personal. In any case, we have since added a second network adapter to our development server and configured the network switch to “mirror” the port going to the network appliance, allowing us to do packet analysis on a long-term basis.
However, my software was in great need of improvement. Up until that point, it was just a class library that I was invoking using PowerShell with no user interface of its own. My vision for it was simple: Something like Task Manager’s processes tab where there’s a list of IP addresses and the bandwidth usage of each and you can sort by one of the columns and it will update the sort order periodically.
So that is my application. A simple network traffic usage monitor.
Why does it need so much memory?
This is one of those things that seems like it should be easy, but it is in fact much more involved to implement. The application needs to process every single packet traveling across the network. For each packet, it needs to store information about it for a time in a buffer and categorize it by IP address.
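To give a rough idea of the per-address bookkeeping, here is a minimal sketch. It is not the code from my program: the names TrafficTable, AddressStats, Record and TryGet are made up for illustration, and it assumes every captured buffer starts directly with an IPv4 header (no Ethernet framing).

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical per-address record; not the class from the real program.
public sealed class AddressStats
{
    public long Bytes;
    public long Packets;
    public DateTime LastSeenUtc;
}

public sealed class TrafficTable
{
    private readonly Dictionary<IPAddress, AddressStats> _stats =
        new Dictionary<IPAddress, AddressStats>();

    public int Count { get { return _stats.Count; } }

    // Assumption: 'packet' begins with an IPv4 header.
    // Returns false if the buffer does not look like an IPv4 packet.
    public bool Record(byte[] packet, DateTime nowUtc)
    {
        if (packet == null || packet.Length < 20) return false; // shorter than a minimal IPv4 header
        if ((packet[0] >> 4) != 4) return false;                 // version field is not 4

        // Total length is stored big-endian in bytes 2-3; the source address is in bytes 12-15.
        int totalLength = (packet[2] << 8) | packet[3];
        byte[] sourceBytes = new byte[4];
        Array.Copy(packet, 12, sourceBytes, 0, 4);
        IPAddress source = new IPAddress(sourceBytes);

        AddressStats stats;
        if (!_stats.TryGetValue(source, out stats))
        {
            // A new object is allocated the first time an address is seen.
            stats = new AddressStats();
            _stats.Add(source, stats);
        }

        stats.Bytes += totalLength;
        stats.Packets++;
        stats.LastSeenUtc = nowUtc;
        return true;
    }

    public bool TryGet(IPAddress address, out AddressStats stats)
    {
        return _stats.TryGetValue(address, out stats);
    }
}
```

Every newly seen address costs at least one allocation here, and the LastSeenUtc field is the kind of information needed to decide when an address has gone quiet.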
The problem has to do with the objects that are created when a new IP address is encountered. At first, just to get it working, I did not implement any logic to “forget” these objects once the address is no longer active on the network, although I knew this would have to be implemented eventually. Without that logic, after a 24-hour period the application was using gigabytes of memory! So I implemented the logic to remove objects related to addresses that have been inactive for a configurable period of time. This improved things drastically. But I used a PowerShell script to log the CPU and memory usage of the program every minute, and I noticed a disturbing trend:
Those areas of higher CPU usage are during business hours, when network activity is higher. You’ll notice that during those periods the memory usage rises significantly, which is to be expected. But it does not fall completely afterward! Instead of remaining about the same, it climbs higher and higher over time. It’s over 300MB after 3 days, when it started out around 50MB.
So what’s the deal? My program was properly forgetting the objects. Why wasn’t the garbage collector working its magic? Well, the other peculiar thing about this chart is that the CPU usage is never much below 10%. This makes sense, because it is constantly processing network packets. It never gets a break.
I wish I could remember the article I read on .NET garbage collection so I could link to it. But in any case, the reason this is significant is this: The garbage collector needs to move objects around in memory in order to compact the heap. Thus, it would be dangerous to have any other thread in the program touching those objects while this takes place. So during a blocking collection the garbage collector actually suspends every managed thread in the program (except its own, of course) while it does its collection magic. As I understood the article, it therefore tries to run at a convenient time; namely, when the program is fairly idle. However, as you can see above, this never really happens with this program. (Strictly speaking, collections are triggered mainly by how much has been allocated and by memory pressure rather than by idleness, so treat this part of the explanation as my best guess.) Those sharp decreases in memory usage in the chart are likely when a full collection runs, and as you can see, they are hours apart. The garbage collector also does not check every object every time it runs. It is generational: objects that survive a collection are promoted to an older generation, and older generations are collected far less often. My per-address objects presumably live long enough to be promoted, so once they are forgotten they can linger until the next full collection. The less frequently full collections run, then, the worse the collector is at reclaiming the majority of unreachable objects.
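If you want to watch this behavior rather than take my word for it, the System.GC class exposes a few counters. The helper below is only a sketch (GcObserver and its method names are my own invention); it reports how many times each generation has been collected and which generation a given object currently sits in.

```csharp
using System;

public static class GcObserver
{
    // Number of collections that have occurred so far, indexed by generation.
    public static int[] CollectionCounts()
    {
        int[] counts = new int[GC.MaxGeneration + 1];
        for (int generation = 0; generation <= GC.MaxGeneration; generation++)
        {
            counts[generation] = GC.CollectionCount(generation);
        }
        return counts;
    }

    // One-line summary suitable for periodic logging.
    public static string Describe(object tracked)
    {
        return string.Format(
            "collections per generation: [{0}]; tracked object is in generation {1}; heap is about {2} bytes",
            string.Join(", ", CollectionCounts()),
            GC.GetGeneration(tracked),
            GC.GetTotalMemory(false));
    }
}
```

Logging GcObserver.Describe(someLongLivedObject) once a minute alongside the memory figures would show whether the older generations really are collected only rarely in a program like this one.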
What’s the solution?
Well, the first thing that comes to mind is the Collect method of System.GC, which forces a collection to take place. But there is a reason the garbage collector normally does not want to run when the application is busy. It could adversely affect performance—in this case, the program’s ability to process every packet on the network. Some might be skipped because the thread was frozen when the event took place.
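For reference, forcing a collection looks roughly like the following. This is a sketch only, with a made-up helper name (ForcedCollection.CollectAndMeasure), and I did not end up using this approach in my program.

```csharp
using System;

public static class ForcedCollection
{
    // Forces a full, blocking collection and returns the approximate number
    // of bytes by which the managed heap shrank (can be negative).
    public static long CollectAndMeasure()
    {
        long before = GC.GetTotalMemory(false);

        GC.Collect();                   // collect all generations
        GC.WaitForPendingFinalizers();  // let queued finalizers run
        GC.Collect();                   // reclaim objects released by those finalizers

        long after = GC.GetTotalMemory(false);
        return before - after;
    }
}
```

The calling thread, and any other managed thread, can be paused while this runs, which is exactly the cost I was worried about.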
Now I know many of you will cringe when you read this, but I implemented recyclable managed objects. Here’s the code:
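using System;
using System.Collections.Generic;

public interface IRecyclable
{
    // Called when the object is recycled. Releases non-reusable resources and
    // returns true if the object may be reused in its current state.
    bool Recycle();

    // Called when the object is taken back out of the bin for reuse.
    void Reclaim();
}

// NOTE: This is simplified code. See below for an improved version.
public class RecycleBin<T>
    where T : class, IRecyclable
{
    private Queue<WeakReference> _Queue;

    public RecycleBin()
    {
        _Queue = new Queue<WeakReference>();
    }

    public void Recycle(T @object)
    {
        if (@object.Recycle())
        {
            _Queue.Enqueue(new WeakReference(@object));
        }
    }

    public T Reclaim()
    {
        T @object = null;
        while (_Queue.Count > 0)
        {
            WeakReference reference = _Queue.Dequeue();
            // Read Target once instead of checking IsAlive first: the object
            // could be collected between the check and the read.
            @object = reference.Target as T;
            if (@object != null)
            {
                break;
            }
        }
        if (@object != null)
        {
            @object.Reclaim();
        }
        return @object;
    }
}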
First is a simple interface called IRecyclable that defines two methods. Recycle notifies the object that it is being recycled, allowing it to release any resources that are not reusable, and returns a boolean indicating whether or not it should be recycled in its current state. Reclaim indicates that the object is being reclaimed and should be put back into a usable state.
The RecycleBin class then implements the recycling logic. You’ll notice that it keeps weak references to recycled objects. This way, if the garbage collector does run, it will collect the recycled objects. The RecycleBin class, then, will only reclaim recycled objects if they haven’t been collected. Thus, it works in harmony with the garbage collector.
The idea is that the recycle bin is checked for a reusable object before a new one is created. Therefore, between garbage collections, objects are reused if possible instead of creating new ones.
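Here is a sketch of what "check the bin first" can look like in calling code. AddressTracker and TrackerFactory are hypothetical names, not classes from my program, and the interface and bin are repeated in condensed form only so that the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Condensed copies of the types shown above.
public interface IRecyclable
{
    bool Recycle();
    void Reclaim();
}

public class RecycleBin<T> where T : class, IRecyclable
{
    private readonly Queue<WeakReference> _queue = new Queue<WeakReference>();

    public void Recycle(T item)
    {
        if (item.Recycle()) _queue.Enqueue(new WeakReference(item));
    }

    public T Reclaim()
    {
        while (_queue.Count > 0)
        {
            T item = _queue.Dequeue().Target as T;
            if (item != null) { item.Reclaim(); return item; }
        }
        return null; // bin is empty or everything in it was collected
    }
}

// Hypothetical recyclable object that tracks one address.
public sealed class AddressTracker : IRecyclable
{
    public string Address;
    public long Bytes;

    // Clear per-address state so the instance can be handed out again.
    public bool Recycle() { Address = null; Bytes = 0; return true; }

    // Nothing needs to be re-acquired in this sketch.
    public void Reclaim() { }
}

public static class TrackerFactory
{
    private static readonly RecycleBin<AddressTracker> Bin = new RecycleBin<AddressTracker>();

    // Reuse a recycled instance if one survived; otherwise allocate a new one.
    public static AddressTracker Create(string address)
    {
        AddressTracker tracker = Bin.Reclaim() ?? new AddressTracker();
        tracker.Address = address;
        return tracker;
    }

    // Call this instead of simply dropping the last reference.
    public static void Release(AddressTracker tracker)
    {
        Bin.Recycle(tracker);
    }
}
```

The only change for the rest of the program is that objects are obtained through Create and handed back through Release instead of being constructed and dropped directly.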
Now before you bash me too much for going against the core principles of managed code, check out this chart:
Although the scale could be reduced significantly to offer more detail, I kept it the same as the above chart to illustrate the improvement. Where the former code reached close to the top of the chart at over 300MB, this code peaks at about 30MB, and it does not continue to rise over time. Note that the only code change between the gathering of these two data sets was the use of the above recycling code to reuse the objects that handle a given IP address. Interestingly, the CPU usage is also significantly lower, especially during the periods of low network usage. This is presumably because there is less allocation of new objects taking place, and the garbage collector has fewer objects to analyze, less memory to reclaim, fewer objects to move around, etc.
Improvements
I have since made a few improvements to the above code, although they were not running when the above data was gathered. First, I replaced WeakReference with GCHandle. WeakReference uses GCHandle internally to do its thing, but WeakReference is a reference type, so each one is itself a heap allocation that the garbage collector must eventually clean up, which to a small extent defeats the purpose of recycling. GCHandle is a value type, so its data is stored within the queue object’s array and doesn’t require reclamation by the garbage collector as long as the queue is in use. However, this also means each handle has to be manually freed when it is no longer needed. Also, I added to the IRecyclable interface, allowing objects to determine whether they are still referring to the same incarnation of a recycled object.
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public interface IRecyclable
{
    // Identifies the current incarnation of the object.
    int RecycleGeneration { get; }

    // True while the object is sitting in a recycle bin.
    bool IsRecycled { get; }

    bool Recycle();
    void Reclaim();
}

public sealed class RecycleBin<T> : IDisposable
    where T : class, IRecyclable
{
    private Queue<GCHandle> _Queue;
    private bool _IsDisposed;

    public RecycleBin()
    {
        _Queue = new Queue<GCHandle>();
    }

    public void Dispose()
    {
        if (!_IsDisposed)
        {
            // Weak handles are not freed automatically; release each one.
            while (_Queue.Count > 0)
            {
                GCHandle handle = _Queue.Dequeue();
                if (handle.IsAllocated)
                {
                    handle.Free();
                }
            }
            _IsDisposed = true;
        }
        // Nothing is left for the finalizer to do once Dispose has run.
        GC.SuppressFinalize(this);
    }

    ~RecycleBin()
    {
        Dispose();
    }

    public void Recycle(T @object)
    {
        CheckDisposed();
        if (@object.Recycle())
        {
            _Queue.Enqueue(GCHandle.Alloc(@object, GCHandleType.Weak));
        }
    }

    public T Reclaim()
    {
        CheckDisposed();
        T @object = null;
        while (_Queue.Count > 0)
        {
            GCHandle handle = _Queue.Dequeue();
            if (handle.IsAllocated)
            {
                // Target is null if the object has already been collected.
                @object = (T)(handle.Target);
                handle.Free();
                if (@object != null)
                {
                    break;
                }
            }
        }
        if (@object != null)
        {
            @object.Reclaim();
        }
        return @object;
    }

    private void CheckDisposed()
    {
        // ObjectDisposedException derives from InvalidOperationException.
        if (_IsDisposed) throw new ObjectDisposedException(GetType().Name);
    }
}
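To show how RecycleGeneration and IsRecycled might be used, here is one possible implementation. It is a sketch under my own assumptions: AddressTracker and TrackerRef are hypothetical names, and bumping the generation inside Reclaim is just one reasonable choice. The interface is repeated so the example compiles on its own.

```csharp
using System;

public interface IRecyclable
{
    int RecycleGeneration { get; }
    bool IsRecycled { get; }
    bool Recycle();
    void Reclaim();
}

// Hypothetical recyclable object.
public sealed class AddressTracker : IRecyclable
{
    public int RecycleGeneration { get; private set; }
    public bool IsRecycled { get; private set; }
    public long Bytes;

    public bool Recycle()
    {
        if (IsRecycled) return false; // already in a bin; do not queue it twice
        Bytes = 0;
        IsRecycled = true;
        return true;
    }

    public void Reclaim()
    {
        IsRecycled = false;
        RecycleGeneration++; // every reuse is a new incarnation
    }
}

// Remembers which incarnation it was given, so a holder can detect
// that the object has since been recycled and reused by someone else.
public struct TrackerRef
{
    private readonly AddressTracker _target;
    private readonly int _generation;

    public TrackerRef(AddressTracker target)
    {
        _target = target;
        _generation = target.RecycleGeneration;
    }

    public bool IsStale
    {
        get { return _target.IsRecycled || _target.RecycleGeneration != _generation; }
    }
}
```

Code that holds on to a tracker for a while can keep a TrackerRef instead and check IsStale before trusting it. This only detects a stale reference after the fact; it does not prevent one from being kept.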
Problems
I am not advocating this solution, however. I know it is fraught with problems. First of all, there is no way to know whether all references to the recyclable object have been released before the object is recycled. Therefore, a reference kept somewhere could end up referring to an object whose normally immutable properties have changed. This could wreak havoc on a lot of common design patterns. If only .NET gave us more flexibility when it comes to memory management instead of implementing it for us in a completely fixed and unconfigurable way. In any case, though, this solution did manage to work wonders in my application…