Cern Week 17

The first few days of the week were spent trying to figure out why getpwent() kept failing. I had to do quite a lot of debugging and read the Perl source code to find out how the wrapper works. In the end it turned out to be something insanely easy and stupid. When getpwent() realizes that you are going to loop through the whole list, it fetches all the user names and then queries one of them each time you call the function. It keeps the connection for those per-user queries open, but in the config file I had specified that no connection should stay open for more than 30 seconds. For reference, getting the whole list takes:

real 0m4.536s

So the connection had been closed, but the program still tried to read data from it, which was basically a deadlock. After adjusting the limit the problem disappeared.

Marco and I also looked into using the Coda file system for our laptops. We have now requested a server, and hopefully we can start installing next week. This should be really cool, as Coda is a networked file system that syncs when it reconnects. So you can take your laptop home, work offline, and when you come back to work you can carry on working on your big work PC.

I also did some research into shadow-utils and userlib. Without going into too much detail, userlib is really nice, and I don't really understand why so many people still use shadow-utils. I am currently lobbying for userlib to become the standard at Cern.

I started thinking about disaster recovery and disaster management. I wrote a script that runs on a server, queries the Ldap server every 15 minutes for its entries, and then creates /etc/passwd, /etc/group and /etc/shadow. So in the unlikely event that Ldap goes down while Kerberos is still up, the files can just be copied to all the machines and users can still use them.
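
To give an idea of what the file generation step of such a script could look like, here is a minimal Python sketch. It is not the actual script: the sample entries, the function names and the use of RFC 2307 attribute names (uid, uidNumber, memberUid and so on) are assumptions, and the Ldap query itself is left out. The shadow password field is filled with a "*" placeholder on the assumption that Kerberos does the password checking.

```python
import os
import tempfile

# Hypothetical sample data shaped like RFC 2307 posixAccount / posixGroup
# entries; a real script would fetch these from the Ldap server instead.
USERS = [
    {"uid": "alice", "uidNumber": 1001, "gidNumber": 100,
     "gecos": "Alice Example", "homeDirectory": "/home/alice",
     "loginShell": "/bin/bash"},
]
GROUPS = [
    {"cn": "users", "gidNumber": 100, "memberUid": ["alice"]},
]


def passwd_line(user):
    # name:password:UID:GID:GECOS:home:shell ("x" means "see the shadow file")
    return "{uid}:x:{uidNumber}:{gidNumber}:{gecos}:{homeDirectory}:{loginShell}".format(**user)


def group_line(group):
    # name:password:GID:comma-separated member list
    members = ",".join(group.get("memberUid", []))
    return "{}:x:{}:{}".format(group["cn"], group["gidNumber"], members)


def shadow_line(user):
    # Assumption: Kerberos checks passwords, so no hash is stored here.
    # "*" never matches a password; the remaining seven fields are left empty.
    return ":".join([user["uid"], "*"] + [""] * 7)


def write_atomically(path, lines, mode):
    # Write to a temporary file in the same directory, then rename it into
    # place so that nobody ever copies a half-written file.
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp, mode)
    os.replace(tmp, path)


def write_snapshot(directory, users, groups):
    # Produce the three files in a staging directory, not directly in /etc.
    write_atomically(os.path.join(directory, "passwd"),
                     [passwd_line(u) for u in users], 0o644)
    write_atomically(os.path.join(directory, "group"),
                     [group_line(g) for g in groups], 0o644)
    write_atomically(os.path.join(directory, "shadow"),
                     [shadow_line(u) for u in users], 0o600)
```

Calling write_snapshot() with a staging directory of your choice leaves passwd, group and shadow files there, ready to be copied out to the machines. Something like a cron job could run it every 15 minutes.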

I started to have a look at the quattor sendmail component, which automatically configures the sendmail program. The syntax of the sendmail config file is really horrible, but more about this later. While writing this I am waiting for my sendmail patches to be committed to the test cluster. Through some minor changes I reduced the run time from a bit over a minute (real 1m12.017s) to under a second (real 0m0.875s).

I also attended quite a few meetings and a talk about the new Castor scheduler.

I was quite happy to hear that the average uptime is 99.73% for the machines my department manages.
