So this is what you get for my efforts: a description of something that took me more than a couple of hours to fix and that, in my opinion, was not that easy to find a solution to.
I’ve been installing a new mail server, the usual suspects involved, postfix + cyrus. In the past they have always played well together, and they continue to do so today. Both, I think, are excellent choices when going for mail processing and end user delivery.
However, when we got down to brass tacks, this particular installation did not work quite right.
The installation was done on a normal Intel server running the Ubuntu 6.06.1 LTS release, with nothing special installed, pretty much vanilla. One thing that was out of the ordinary was using an LDAP backend for the whole thing to authenticate against. However, as things turned out, that didn't cause any issues.
After everything was installed, it was tested by the truly faithful who volunteered to move their mail spools over to the new server. It had new features, more powerful hardware, and updated code, so it was an upgrade from the aging existing server. The problems started showing up after the server was put under some load. Postfix started reporting odd bounced messages, with a most peculiar error.
Jul 18 14:52:48 HOST postfix/lmtp: 9E7451008B: to=<...>, orig_to=<...>,
relay=/var/run/cyrus/socket/lmtp[/var/run/cyrus/socket/lmtp], delay=1, status=bounced
(host /var/run/cyrus/socket/lmtp[/var/run/cyrus/socket/lmtp] said: 250 2.1.5 ok (in reply to DATA command))
Hey postfix, cyrus said OK and you bounced the email. Are you guys listening to each other?
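If you want to check whether your own server is doing the same thing, here is a minimal sketch in Python that picks out log lines where postfix bounced a message even though the reply it quotes is a 2xx success code. The function name and the regular expression are my own illustration, not anything shipped with postfix, and it assumes each log entry sits on a single line in the format shown above (with or without a [pid] after postfix/lmtp).

```python
import re

# Hypothetical pattern: an lmtp delivery that ended in status=bounced
# although the quoted server reply begins with a 2xx (success) code.
CONTRADICTORY_BOUNCE = re.compile(
    r"postfix/lmtp(?:\[\d+\])?: (?P<queue_id>[0-9A-F]+): "
    r".*status=bounced \(.*said: (?P<code>2\d\d)[ -]"
)


def find_contradictory_bounces(lines):
    """Return (queue_id, reply_code) pairs for bounces that quote a 2xx reply."""
    hits = []
    for line in lines:
        match = CONTRADICTORY_BOUNCE.search(line)
        if match:
            hits.append((match.group("queue_id"), match.group("code")))
    return hits
```

You would feed it the lines of your mail log, for example with open("/var/log/mail.log") as an assumed location, and look up any queue IDs it returns. An ordinary bounce that quotes a 5xx reply is ignored.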
So I went to Google, searched for the error, and came up with two interesting theories from other admins who were having the same issue. I must admit that I would be defensive too, but Wietse and Viktor got a little crabby. Anyway, one says this is the fault of postfix pipelining and the other says this is the fault of cyrus duplicate suppression. One of them was partially right, in my opinion.
So rather than disable pipelining, which would affect performance for almost everything dealing with postfix (although I think I understand now that you can limit it for one particular place within postfix and let other clients keep using it), I went and messed with duplicate suppression.
In imapd.conf I set, per the docs, duplicatesuppression: 0. I restarted cyrus, and postfix for good measure, and yay, it seemed to be fixed. However, under load the problem came back. Boo. I went ahead and re-enabled duplicate suppression in cyrus, since it was something I wanted to keep anyway. I knew it had something to do with the problem, though, so I kept digging in that general vicinity.
Turns out that Jon was right (in the above neohapsis post) and duplicate suppression was causing the problem, but not in the way he thought. The error that cyrus reported back in debug was the key: cyrus kept reporting how many lockers there were on the duplicate db, and that was causing postfix to abort. Lockers, it turns out, relate more to the db than to cyrus duplicate delivery.
In this case, with the default setting, cyrus was using Berkeley DB as the backend for the duplicate database. I switched this setting over to skiplist (duplicate_db: skiplist) in imapd.conf and put the server under load while watching the logs: the errors did not reappear.
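To double check which backend an imapd.conf actually selects for the duplicate database, something like the following rough sketch could help. It only understands plain "option: value" lines and # comments, which is a simplification of the real file format, and both function names are made up for this example.

```python
def parse_imapd_conf(text):
    """Parse simple 'option: value' lines, skipping blanks and # comments."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not an 'option: value' line, ignore it
        options[key.strip()] = value.strip()
    return options


def duplicate_db_backend(text):
    """Return the configured duplicate_db value, or None if it is not set."""
    return parse_imapd_conf(text).get("duplicate_db")
```

Calling duplicate_db_backend() on the contents of your imapd.conf (often /etc/imapd.conf, but that path is an assumption) returns None when the option is not set explicitly, which means cyrus is falling back to whatever default it was built with.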
Doing some more research on this, I found out that BDB 4.2, which is what dapper is running and what cyrus is compiled with, has a bug in it. Recompiling cyrus with BDB 4.3 or switching the duplicate suppression to use something that does not involve BDB 4.2 fixed the issue, even under load. Yay!
My pain, your gain.