The importance of profiling

I am currently working on a project where I have to parse XML files of roughly 50 megabytes each. Because I still think that computer run time is less expensive than my programming time, I am writing this in Python. I started coding with the xml.sax library and just filled in the gaps. But the more I added, the slower the code became; in one case it took the parser about 10 minutes to create all the structures I need from the XML file. After getting the functionality right with a subset of the data, I started profiling to find out why it was taking so long. I used Python's excellent profile module:

profile.py -o statsout readRepro.py
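For anyone who wants to read such a stats file afterwards, here is a minimal sketch using the standard pstats module. It assumes the file is in the usual format written by Python's profilers; the slow_concat workload and the helper names are made up for illustration and stand in for the real parser. It uses cProfile rather than profile only because it adds less overhead.

```python
import cProfile
import io
import os
import pstats
import tempfile


def slow_concat(n):
    """Hypothetical workload standing in for the real parser."""
    text = ""
    for i in range(n):
        text += str(i)
    return text


def profile_to_file(func, arg, stats_path):
    """Run func(arg) under the profiler and dump the raw stats to stats_path."""
    profiler = cProfile.Profile()
    profiler.runcall(func, arg)
    profiler.dump_stats(stats_path)


def top_functions(stats_path, limit=5):
    """Return a text report of the most expensive calls, by cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(stats_path, stream=out)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


# Write the stats to a temporary file named like the one in the post.
stats_path = os.path.join(tempfile.mkdtemp(), "statsout")
profile_to_file(slow_concat, 10000, stats_path)
report = top_functions(stats_path)
print(report)
```

Sorting by cumulative time puts the functions that account for most of the run at the top, which is usually the quickest way to spot a surprise.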

So the first thing I found was in this method:

def characters(self, data): self.tempData += data

This is called every time I encounter the data section of an XML element. Because it kept appending strings to strings, this took ages: Python strings are immutable, so each += can end up copying everything accumulated so far. I replaced it with

def characters(self, data): self.tmpbuff.append(data)

and the run time went down from over 10 minutes to 7.2 seconds. How amazing is that? I would never have guessed that this would take up so much time, and it was one little change. If I had had to guess what was taking so long, I would have optimized the parsing.

After more work I found another method that I had built in just for debugging. It basically checked that I was coping with all the elements I would encounter. Useful while coding, but commenting out one line improved the run time from 55.7 to 6.6 seconds.

This just proves that we as programmers have no idea where our programs spend most of their time. If you are coding and you are thinking of optimizing something, chances are you are wrong. I know a lot of people say this and a lot of people are against it, but my finding is that actually measuring where your program spends its time is worth far more than guessing.
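To show the list-based characters() method in context, here is a minimal sketch of a complete handler. The post does not show where the buffer gets turned back into a string, so joining it in endElement is my assumption, and the class name TextCollector and the texts attribute are made up for this example.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of every element."""

    def __init__(self):
        super().__init__()
        self.tmpbuff = []  # chunks of character data seen so far
        self.texts = []    # (element name, joined text) pairs

    def startElement(self, name, attrs):
        # Start with an empty buffer for each new element.
        self.tmpbuff = []

    def characters(self, data):
        # Appending to a list is cheap; no string is copied here.
        self.tmpbuff.append(data)

    def endElement(self, name):
        # Join once per element instead of concatenating on every chunk.
        self.texts.append((name, "".join(self.tmpbuff)))
        self.tmpbuff = []


handler = TextCollector()
xml.sax.parseString(
    b"<root><item>hello</item><item>world</item></root>", handler
)
print(handler.texts)
```

The parser may deliver the text of one element in several chunks, which is why characters() only collects and the join happens once at the end.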
Lesson learned for life: "Use a tool to see where you are spending your time"
It might be in some debug method you don't really need.
