Yahoo Groups archive

Milter-greylist

Limiting resident memory usage

2006-11-02 by Jonathan Perkin

Hi,

I'm trialling milter-greylist on the BBC mail infrastructure, which
receives around 1 million emails per day.  Recently I added

  acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
  acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
  acl greylist domain /[0-9]{12}/

to the config to greylist anything which looks like a dynamic address,
and since making that change my monitoring has shown milter-greylist
to fail an awful lot more.

The milter-greylist processes are sitting at around 600M resident
memory, and are causing the system to swap.

  1. Can I limit the amount of memory milter-greylist will use to
     cache lookups?  Obviously with a large number of connections this
     is going to grow, but I cannot add more memory to the MX easily.

  2. Why is the increased load causing more failures?  I test the
     filter with something similar to

       acl greylist from /greylist-test.*@host/

     and generate a random string after "greylist-test" for MAIL FROM
     so that it won't get cached.  Today the number of failures for
     this test has been extremely high (previously I saw a number of
     cases where it wasn't being greylisted, but it appears to get
     worse with load).

This is sendmail 8.13.7 with security fixes, milter-greylist 2.0.2,
Solaris 9 and everything compiled with Sun Studio 11.

Thanks,

-- 
Jonathan Perkin                             Unix Systems Administrator
Formerly BBC Technology                  http://www.siemens.co.uk/sbs/
Siemens Business Services Ltd,  Maiden House, Vanwall Road, Maidenhead
                                 -=-
This email (and any attachments) contains confidential information and
is for the exclusive use of  the addressee(s).  Any views contained in
this e-mail are not the views of Siemens Business Services, ORS unless
specifically stated.  If you are not the addressee then any distribut-
ion, copying or use of this email is prohibited.  If received in error
please advise the sender and delete / destroy it immediately.  We acc-
ept no liability  for any loss  or damage suffered by any person aris-
ing from  use of this e-mail / fax.  Please note that Siemens Business
Services ORS monitors e-mails sent or received.  Further communication
will signify your consent to this.
                                 -=-
Siemens Business Services Ltd          Registered No: 04128934 England
Registered Office: Siemens House, Oldbury, Bracknell, Berks.  RG12 8FZ

Re: [milter-greylist] Limiting resident memory usage

2006-11-02 by eclark

No offense, but that is an insane rule. You might want to try either rewriting 
your rule to be more reasonable, or use one of the varied rbl servers which 
specifically handle dynamic ips. This is definitely not the right way to go. 
Even better, just greylist _everything_, and set exclusions as appropriate. 
The way you are doing this is the complete opposite of how you should be, in 
my opinion.
On Thursday 02 November 2006 10:59 am, Jonathan Perkin wrote:
> Hi,
>
> I'm trialling milter-greylist on the BBC mail infrastructure, which
> receives around 1 million emails per day.  Recently I added
>
>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
>   acl greylist domain /[0-9]{12}/
>
> to the config to greylist anything which looks like a dynamic address,
> and since making that change my monitoring has shown milter-greylist
> to fail an awful lot more.
>
> The milter-greylist processes are sitting at around 600M resident
> memory, and are causing the system to swap.
>
>   1. Can I limit the amount of memory milter-greylist will use to
>      cache lookups?  Obviously with a large number of connections this
>      is going to grow, but I cannot add more memory to the MX easily.
>
>   2. Why is the increased load causing more failures?  I test the
>      filter with something similar to
>
>        acl greylist from /greylist-test.*@host/
>
>      and generate a random string after "greylist-test" for MAIL FROM
>      so that it won't get cached.  Today the number of failured for
>      this test has been extremely high (previously I saw a number of
>      cases where it wasn't being greylisted, but it appears to get
>      worse with load).
>
> This is sendmail 8.13.7 with security fixes, milter-greylist 2.0.2,
> Solaris 9 and everything compiled with Sun Studio 11.
>
> Thanks,

Re: Limiting resident memory usage

2006-11-02 by Jonathan Perkin

* On 2006-11-02 at 16:45 GMT, eclark wrote:

> You might want to try either rewriting your rule to be more
> reasonable

..which would look like..?

> or use one of the varied rbl servers which specifically handle
> dynamic ips.

Possibly, but won't that end up doing the same thing?  I don't want to
block dynamic IPs, just greylist them.

> Even better, just greylist _everything_, and set exclusions as
> appropriate. 

That is an inappropriate policy for the BBC, I would end up
whitelisting millions of domains.

-- 
Jonathan Perkin                             Unix Systems Administrator
Formerly BBC Technology                  http://www.siemens.co.uk/sbs/
Siemens Business Services Ltd,  Maiden House, Vanwall Road, Maidenhead

Re: [milter-greylist] Limiting resident memory usage

2006-11-02 by Matthias Scheler

On Thu, Nov 02, 2006 at 03:59:36PM +0000, Jonathan Perkin wrote:
> I'm trialling milter-greylist on the BBC mail infrastructure, which
> receives around 1 million emails per day.  Recently I added
> 
>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
>   acl greylist domain /[0-9]{12}/
> 
> to the config to greylist anything which looks like a dynamic address,
> and since making that change my monitoring has shown milter-greylist
> to fail an awful lot more.

I used a rule like that in "milter-regex" for a while and it generated
a lot of false hits for hosts like "87-237-56-54.northerncolo.co.uk"
or "static-64-201-182-187.ptr.terago.ca"

A better way to do this is:

dnsrbl "SORBS DUN" dnsbl.sorbs.net 127.0.0.10
acl greylist dnsrbl "SORBS DUN" delay 12h

That will require milter-greylist 3.0RC6.
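
The false hits mentioned above can be reproduced with a quick check. This is only a sketch: it assumes the patterns are applied unanchored to the reverse DNS name, and it uses Python's `re` in place of the regex library milter-greylist is built against. `looks_dynamic` and the `mail.example.org` control host are hypothetical names added for illustration.

```python
import re

# The three patterns from the original post. This subset of syntax behaves
# the same way in Python's `re` as in POSIX extended regexes.
DYNAMIC_PATTERNS = [
    re.compile(r"[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*"),
    re.compile(r"[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*"),
    re.compile(r"[0-9]{12}"),
]


def looks_dynamic(hostname: str) -> bool:
    """Return True if any pattern matches anywhere in the hostname."""
    return any(p.search(hostname) for p in DYNAMIC_PATTERNS)


for host in (
    "87-237-56-54.northerncolo.co.uk",      # from the message above
    "static-64-201-182-187.ptr.terago.ca",  # from the message above
    "mail.example.org",                     # hypothetical control host
):
    print(host, looks_dynamic(host))
```

Both hosts named in the message match the first pattern, even though the second one advertises itself as static.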

	Kind regards

-- 
Matthias Scheler                                  http://zhadum.org.uk/

Re: [milter-greylist] Limiting resident memory usage

2006-11-02 by manu@netbsd.org

eclark <eclark@...> wrote:

> No offense, but that is an insane rule. You might want to try either rewriting
> your rule to be more reasonable, or use one of the varied rbl servers which
> specifically handle dynamic ips. This is definitely not the right way to go.
> Even better, just greylist _everything_, and set exclusions as appropriate.
> The way you are doing this is the complete opposite of how you should be, in
> my opinion.

What eclark said. And domain clauses in ACL might help here.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] Limiting resident memory usage

2006-11-02 by Matt Kettler

eclark wrote:
> No offense, but that is an insane rule. You might want to try either rewriting 
> your rule to be more reasonable, or use one of the varied rbl servers which 
> specifically handle dynamic ips. This is definitely not the right way to go. 
> Even better, just greylist _everything_, and set exclusions as appropriate. 
> The way you are doing this is the complete opposite of how you should be, in 
> my opinion.


Personally, I think this is much more sane than greylisting everything.

And you can still create exclusions as appropriate. I do.

So where's the "insanity" of this when compared to acl greylist default?

As long as you don't consider this to be your "first" rule, and are willing to
add appropriate exclusions you should be no worse off than greylisting by
default. Actually, you should be better off, at least in terms of the number of
exclusions you need to add.

That said, you might be better off with the RBLs, but not all of us can make the
RBL enabled builds work right now. It blows up on my system, for example.


> On Thursday 02 November 2006 10:59 am, Jonathan Perkin wrote:
>> Hi,
>>
>> I'm trialling milter-greylist on the BBC mail infrastructure, which
>> receives around 1 million emails per day.  Recently I added
>>
>>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
>>   acl greylist domain /[0-9]{12}/

This could be made a bit more efficient. *'s can be expensive.

[0-9][0-9]* could be replaced by [0-9]+ or [0-9]{1,} with 100% equivalent behavior.

Might I suggest these two rules to replace the 3 above?

acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
acl greylist domain /[0-9]{12}/

It won't likely help your memory problems very much, but it is more efficient.
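
A rough way to compare the rule sets, sketched with Python's `re` (an assumption: milter-greylist uses its own regex library, and the sample hostnames other than the first are made up). It checks that `[0-9][0-9]*` and `[0-9]+` accept the same strings, and it shows that the two-rule replacement is not an exact substitute for the original three rules: it needs a separator after the third number and accepts mixed `-` and `.` separators.

```python
import re

OLD_RULES = [
    r"[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*",
    r"[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*",
    r"[0-9]{12}",
]
NEW_RULES = [
    r"[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]",
    r"[0-9]{12}",
]


def matches(rules, host):
    """True if any rule matches anywhere in the host name."""
    return any(re.search(rule, host) for rule in rules)


def same_language(samples):
    """True if X X* and X+ agree on every sample string."""
    return all(
        bool(re.fullmatch(r"[0-9][0-9]*", s)) == bool(re.fullmatch(r"[0-9]+", s))
        for s in samples
    )


print(same_language(["", "7", "2006", "a1", "12b"]))

for host in (
    "87-237-56-54.northerncolo.co.uk",  # matched by both rule sets
    "host-1-2-3",                       # hypothetical: old rules only
    "1.2-3.example.net",                # hypothetical: new rules only
):
    print(host, matches(OLD_RULES, host), matches(NEW_RULES, host))
```

Whether the difference matters depends on the reverse DNS names actually seen, so it is worth running real log data through both sets before switching.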


>>
>> to the config to greylist anything which looks like a dynamic address,
>> and since making that change my monitoring has shown milter-greylist
>> to fail an awful lot more.
>>
>> The milter-greylist processes are sitting at around 600M resident
>> memory, and are causing the system to swap.
>>
>>   1. Can I limit the amount of memory milter-greylist will use to
>>      cache lookups?  Obviously with a large number of connections this
>>      is going to grow, but I cannot add more memory to the MX easily.

I'm not sure if it will help, but 2.1.1 added bucketed in-memory databases.
2.1.4 made some fixes to that, and some improvements to the ACL code.

Re: [milter-greylist] Limiting resident memory usage

2006-11-02 by AIDA Shinra

At Thu, 2 Nov 2006 15:59:36 +0000,
Jonathan Perkin wrote:
> 
> Hi,
> 
> I'm trialling milter-greylist on the BBC mail infrastructure, which
> receives around 1 million emails per day.  Recently I added
> 
>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
>   acl greylist domain /[0-9]{12}/
> 
> to the config to greylist anything which looks like a dynamic address,
> and since making that change my monitoring has shown milter-greylist
> to fail an awful lot more.
> 
> The milter-greylist processes are sitting at around 600M resident
> memory, and are causing the system to swap.
> 
>   1. Can I limit the amount of memory milter-greylist will use to
>      cache lookups?  Obviously with a large number of connections this
>      is going to grow, but I cannot add more memory to the MX easily.
> 
>   2. Why is the increased load causing more failures?  I test the
>      filter with something similar to
> 
>        acl greylist from /greylist-test.*@host/
> 
>      and generate a random string after "greylist-test" for MAIL FROM
>      so that it won't get cached.  Today the number of failured for
>      this test has been extremely high (previously I saw a number of
>      cases where it wasn't being greylisted, but it appears to get
>      worse with load).
> 
> This is sendmail 8.13.7 with security fixes, milter-greylist 2.0.2,
> Solaris 9 and everything compiled with Sun Studio 11.

Frankly speaking, don't do that for now, because:

* milter-greylist has not been designed for such highly loaded
servers. For example, it holds everything in core. Scalability and
performance improvements are important TODOs.

* There is a known bug in libmilter which leads to information loss in
greylist.db when stopping or restarting milter-greylist. I hope it
is fixed in sendmail 8.13.9.

* There is also a known bug in all versions of milter-greylist when
handling mail addresses such as <foo@[ip.add.re.ss]>. It will be
partially fixed in the next release but the full fix will be available
in 3.1.x.

* There are known race conditions in all versions of milter-greylist.
Nobody has reported problems due to these bugs but I can't tell what
happens on heavily loaded multiple CPU servers. We already have a new
threading implementation, which will be available in 3.1.x.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by eclark

Jon, please refer to Matthias' previous email, regarding use of rbls to do 
greylisting, not blacklisting. Specifically these bits:

dnsrbl "SORBS DUN" dnsbl.sorbs.net 127.0.0.10
acl greylist dnsrbl "SORBS DUN" delay 12h

We get about 110 million emails a day inbound to our network at peak, with 
averages around 90 both out and in. We greylist everything and are pretty 
much not having any issues with our million+ mail addresses. The occasional 
domain may need to be whitelisted. While I am sure the BBC is a vast 
organization with many millions of messages being handled daily, the issue at 
hand is more one of how you decide to pursue a resolution. I see that you 
handle about 1 million emails a day inbound. To be completely and totally 
frank with you, this is peanuts. In my humble opinion, you really should be 
running a much more draconian policy that utilizes third party databases of 
dynamic ips and overall broad greylisting, and only make exceptions to that 
policy for particularly vociferous clients. It's far easier to greylist 
everything under 
the sun with varied durations and whitelist one problem user, than white list 
everyone and try to force a handful regular expressions to compensate for 
your overly lenient policy. In short, I would again suggest the following:

greylist everything for some minute time period (1-5 minutesish)
use dnsbl.sorbs to extend greylist durations as appropriate (2-8hours)
whitelist per complainy client
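
The three-step policy above can be modelled in a few lines. This is a sketch of the decision logic only, not milter-greylist configuration: `Policy`, `greylist_delay` and the example domain are hypothetical, and the delays are single values picked from the suggested ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    default_delay_s: int = 60            # "1-5 minutesish"; 1 minute assumed
    dun_delay_s: int = 6 * 3600          # within the suggested 2-8 hours
    whitelist: frozenset = frozenset()   # sender domains exempted on request


def greylist_delay(policy: Policy, sender_domain: str, listed_as_dynamic: bool) -> int:
    """Return the greylist delay in seconds for one connection."""
    if sender_domain in policy.whitelist:
        return 0                          # whitelisted: no greylisting
    if listed_as_dynamic:
        return policy.dun_delay_s         # DNSBL says dynamic: long delay
    return policy.default_delay_s         # everyone else: short delay


policy = Policy(whitelist=frozenset({"partner.example"}))
print(greylist_delay(policy, "partner.example", False))
print(greylist_delay(policy, "isp.example", True))
print(greylist_delay(policy, "isp.example", False))
```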

While the regex hack Matt supplied does indeed cut memory costs, it is still 
by far the most inefficient way of doing things, simply because it 
will have to come before your default whitelist, causing everything that's not 
on a shortened greylist to be matched against it. The greylist all + 1min 
delay on non-dynamic ips (we greylist those in sorbs for 6 hours) has slashed 
our network bandwidth costs by 30%, and has knocked out a clean 78-85% of our 
inbound mail traffic. The principle behind the extremely short duration 
greylist is to obliterate botnets. Our broken-MTA whitelist numbers a meager 
84 entries; the combination of these settings allows us to handle the mail as 
previously mentioned with very little effect on performance (actually, we 
have seen a performance _gain_, due to the effect of not having to deal with 
mail delivery overhead), and run additional mail filtering in the way of 
spamassassin. 

Again, these are all just suggestions, but I very strongly feel your overly 
broad and resource intensive regex approach to the issue will ultimately bite 
you in the ass.
On Thursday 02 November 2006 11:48 am, Jonathan Perkin wrote:
> * On 2006-11-02 at 16:45 GMT, eclark wrote:
> > You might want to try either rewriting your rule to be more
> > reasonable
>
> ..which would look like..?
>
> > or use one of the varied rbl servers which specifically handle
> > dynamic ips.
>
> Possibly, but won't that end up doing the same thing?  I don't want to
> block dynamic IPs, just greylist them.
>
> > Even better, just greylist _everything_, and set exclusions as
> > appropriate.
>
> That is an inappropriate policy for the BBC, I would end up
> whitelisting millions of domains.

Re: Limiting resident memory usage

2006-11-02 by Jonathan Perkin

* On 2006-11-02 at 17:53 GMT, eclark wrote:

> Jon, please refer to Matthias' previous email, regarding use of rbls
> to do greylisting, not blacklisting. Specifically these bits:
>
> dnsrbl "SORBS DUN" dnsbl.sorbs.net 127.0.0.10
> acl greylist dnsrbl "SORBS DUN" delay 12h

They are being greylisted.  Or am I missing the point?  I'm hoping
milter-greylist doesn't blacklist _anything_ else I'm going to have
serious issues.

I'll definitely take a look at this, but it'll have to wait until a
version which supports it is released as stable, as I don't have time
to track beta releases at the moment.

> Its far easier to greylist everything under the sun with varied
> durations and whitelist one problem user, than white list everyone
> and try to force a handful regular expressions to compensate for
> your overly lenient policy.

Unfortunately this simply isn't possible in an organisation like the
BBC.  You say one problem user; the reality is that this is likely to
be a thousand, all with broadcast critical (literally) issues.  We
have such a diverse set of requirements and policies that it is very
tricky to balance the spam issue.

> The greylist all + 1min delay on non-dynamic ips (we greylist those
> in sorbs for 6 hours) has slashed our network bandwidth costs by
> 30%, and has knocked out a clean 78-85% of our inbound mail traffic.
> The principle behind the extremely short duration greylist is to
> obliterate botnets.

Indeed, I'd love to be able to implement rulesets like this, but in a
broadcast organisation you simply cannot afford delays on legitimate
email.  Even with a 1 minute greylist on *, you are going to hit
issues with clients which retry after 15 minutes, 30 minutes, or
longer.  If you're broadcasting Chris Moyles who wants people to email
in about a particular topic, but they don't get emails through until
after the show has finished, they're not going to be happy.  Same goes
for breaking news stories for News24, etc.  You get the idea...
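
The retry concern can be put into numbers with a small sketch. It assumes the simplest model of greylisting (the first attempt is deferred, and the first retry arriving after the delay has elapsed is accepted) and ignores everything else; `delivery_time` and the retry schedules are hypothetical.

```python
def delivery_time(greylist_delay_s, retry_offsets_s):
    """Seconds after the first attempt at which the message gets through.

    retry_offsets_s lists when the sending MTA retries, measured from its
    first (deferred) attempt. Returns None if no retry falls after the delay.
    """
    for offset in sorted(retry_offsets_s):
        if offset >= greylist_delay_s:
            return offset
    return None


one_minute = 60
print(delivery_time(one_minute, [15 * 60, 30 * 60, 60 * 60]))  # retries from 15 min
print(delivery_time(one_minute, [30 * 60, 60 * 60]))           # retries from 30 min
```

With a 1 minute greylist the delay seen by the sender is set entirely by its own retry schedule: 900 seconds in the first case, 1800 in the second.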

This was only a test, and it didn't work very well.  I'll re-evaluate
the situation in light of other emails on this thread and try again
another time.

Thanks,

-- 
Jonathan Perkin                             Unix Systems Administrator
Formerly BBC Technology                  http://www.siemens.co.uk/sbs/
Siemens Business Services Ltd,  Maiden House, Vanwall Road, Maidenhead

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by AIDA Shinra

At Thu, 2 Nov 2006 12:45:11 -0500,
eclark wrote:
> 
> Jon, please refer to Matthias' previous email, regarding use of rbls to do 
> greylisting, not blacklisting. Specifically these bits:
> 
> dnsrbl "SORBS DUN" dnsbl.sorbs.net 127.0.0.10
> acl greylist dnsrbl "SORBS DUN" delay 12h
> 
> We get about 110 million emails a day inbound to our network at peak, with 
> averages around 90 both out and in. We greylist everything and are pretty 
> much not having any issues with our million+ mail addresses. The occasional 
> domain may need to be whitelisted. While I am sure the BBC is a vast 
> organization with many millions of messages being handled daily, the issue at 
> hand is more one of how you decide to persue a resolution. I see that you 
> handle about 1 million emails a day inbound. To be completely and totally 
> frank with you, this is peanuts. In my humble opinion, you really should be 
> running a much more draconian policy that utilizes third party databases of 
> dynamic ips and overall broad greylisting, and only except that policy for 
> particularly vociferous clients. Its far easier to greylist everything under 
> the sun with varied durations and whitelist one problem user, than white list 
> everyone and try to force a handful regular expressions to compensate for 
> your overly lenient policy. In short, I would again suggest the following:
> 
> greylist everything for some minute time period (1-5 minutesish)
> use dnsbl.sorbs to extend greylist durations as appropriate (2-8hours)
> whitelist per complainy client
> 
> While Matt's regex hack he supplied does indeed cut memory costs, it is still 
> by far the absolute most ineffecient way of doing things, simply because it 
> will have to come before your default whitelist, causing everything thats not 
> on a shortened greylist to be matched against it. The greylist all + 1min 
> delay on non-dynamic ips (we greylist those in sorbs for 6 hours) has slashed 
> our network bandwidth costs by 30%, and has knocked out a clean 78-85% of our 
> inbound mail traffic. The principle behind the extremely short duration 
> greylist is to obliterate botnets. Our broke mta whitelist numbers a meager 
> 84 entries; the combination of these settings allow us to handle the mail as 
> previously mentioned with very little effect on performance (actually, we 
> have seen a performance _gain_, due to the effect of not having to deal with 
> mail delivery overhead), and run additional mail filtering in the way of 
> spamassassin. 
> 
> Again, these are all just suggestions, but I very strongly feel your overly 
> broad and resource intensive regex approach to the issue will ultimately bite 
> you in the ass.

I strongly object. Your "default to greylist" approach produces too
many false positives, and I'm sure the BBC does not want even one.
Note that false positives are most likely at big ISPs' outgoing MXes.
In contrast, small hosts, which often lack rDNS or have numeric rDNS,
tend to run well-known MTAs in simple configurations, so they are less
likely to suffer false positives. Perkin's "greylist end-user-looking
hosts only" approach is quite reasonable.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by eclark

Aida, why even bother with greylisting at all then? Every big name, reliable 
mail filtering appliance on the market uses it in some fashion or another. If 
you look at something like Ironport, or Mailhurdle, these all use sender 
verification or greylisting to handle spam issues. If you can not have even 
one false positive, you absolutely can not filter at all. 

When the two choices are a complete failure of the system, or a minute failure 
of the system that can be addressed with acl statements, I feel that it is 
obtuse not to greylist by default. In this case, he is seeing a complete 
failure of the system, so your objection is null and void anyway. It would be 
different if it was actually functioning as expected. 
>
> I strongly object you. Your "default to greylist" approach makes too
> many false positives. I'm sure BBC do not want even one false
> positive. It is important that false positives are most likely at big
> ISP's outgoing mxes. In contrast, small hosts, which often lack rDNS
> or have numeric rDNS, are tend to run well-known MTAs in simple
> configurations. They are less likely to suffer false positives.
> Perkin's "greylist end-user-looking hosts only" approach is quite
> reasonable.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by Matt Kettler

eclark wrote:
> Jon, please refer to Matthias' previous email, regarding use of rbls to do 
> greylisting, not blacklisting. Specifically these bits:
> 
> dnsrbl "SORBS DUN" dnsbl.sorbs.net 127.0.0.10
> acl greylist dnsrbl "SORBS DUN" delay 12h
> 

<snip>

> 
> Again, these are all just suggestions, but I very strongly feel your overly 
> broad and resource intensive regex approach to the issue will ultimately bite 
> you in the ass.
> 

Wait.. you think the *regex* is too resource intensive, but advocate using RBLs
instead?

Are you completely out of your MIND???!!!


An RBL is a NETWORK TEST. You have to create a UDP socket, send a request, wait
for a reply, parse the reply..

That's by FAR more resource intensive than the regex is. Probably by a factor of
at least 10, and more along the lines of several thousand times more expensive.
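
For readers who have not looked at what such a lookup involves, here is a minimal sketch. The query-name convention (reversed octets prepended to the zone) is the standard one for DNS blacklists; `dnsbl_query_name` and `is_listed` are hypothetical helpers, not milter-greylist code, and 192.0.2.25 is a documentation address. `is_listed` performs a real, blocking DNS lookup and only sees one of the returned addresses.

```python
import socket


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """Build the DNSBL query name: reversed octets, then the zone."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4: str, zone: str, expected: str) -> bool:
    """Resolve the query name and compare the answer with `expected`.

    Blocks until the resolver answers or gives up.
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ipv4, zone))
    except OSError:
        return False  # no such name (not listed) or resolver failure
    return answer == expected


print(dnsbl_query_name("192.0.2.25", "dnsbl.sorbs.net"))
```

Calling `is_listed("192.0.2.25", "dnsbl.sorbs.net", "127.0.0.10")` would correspond to the "SORBS DUN" check quoted earlier in the thread.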

I'll admit that all of the numbers below are educated guesses on my part.
However, they are likely to be fairly close to real. Certainly much closer than
the viewpoint that RBLs are cheaper than regexes.


Time:   Regex - about 0.1 microseconds
        RBL - tens of milliseconds ( > 10000 microseconds)
        RBL is: about one hundred thousand times slower

Memory: Regex - about 100 bytes, including annotations
        RBL - with the socket structures, buffers to store packets in, etc.,
              probably about 2000 bytes
        RBL uses: approximately 20 times the RAM

CPU:    Regex - a few hundred clock cycles
        RBL - about ten thousand clock cycles (remember, you have to format
              the query and parse the response here, PLUS you have the
              overhead of creating a UDP socket, IP stack processing,
              network interface driver, etc.)
        RBL uses: approximately 50 times the CPU

IO:     Regex - RAM only
        RBL - RAM + bus access to the NIC registers + busmastering to RAM
              by the NIC
        RBL uses: at least 10 times the IO bus time. Bus accesses to NIC
              registers are considerably slower than CPU-to-RAM accesses,
              and are not cacheable.


If you think you're saving resources using RBLs, do yourself a favor and
re-think that viewpoint.

Perhaps you mistakenly got this viewpoint from the bigevil.cf or
sa-blacklist-uri.cf vs surbl.org DNS issues with SpamAssassin.

However, in that case, bigevil contains HUNDREDS of VERY complicated regexes,
plus SpamAssassin adds lots of overhead beyond just the regex itself.

sa-blacklist-uri contains 540+ regexes like this one:

m/\b0(?:204-qazwsxma\.biz|2319\.com|241\.com|242\.com|243\.com|25ma\.com|284\.com|287\.com|28jsh\.com|2bikes\.com|2cruises\.com|2energydeals\.com|2host\.com|2optrix\.com|2owsk\.info|2refi\.net|2techno\.com|3-shopper-value\.com|32439\.com|345fjh\.com|35171246\.net|3l\.net|3newsletter-server1\.com|449\.com|491\.com|4aol\.com|4lyrics\.com|4newsletter-server1\.com|4olympics\.com|4rivival\.com|5100\.com|512ly\.com|534star\.com|571che\.com|58\.cn|5988\.com\.cn|5cars\.com|5m0rt\.com|5m0rt\.net|5mort\.com|5startlogic\.com|6100\.com|657\.com|683\.com|684\.com|685\.com|693\.com|695\.com|697\.com|6chip\.com|6q\.com)\b/i


Yes.. 500 regexes that are over 600 bytes each in text form is REALLY slow and
REALLY CPU intensive..

So yes, there is a point where it's good to replace regexes with RBLs if you can
replace a LOT of regexes with one RBL. The cost of an RBL query is fixed no
matter how many entries exist in it. Regexes on the other hand start adding up
the more of them you have.

However, using a RBL to replace 2 short lightweight regexes is NOT a performance
gain. It's a massive performance loss.
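
The ratios quoted above, and a rough break-even point, follow from simple arithmetic. The sketch below uses only the guessed figures from this message; where the text gives a range ("a few hundred", "tens of milliseconds") a single value is assumed, and the break-even figure further assumes every regex costs as much as one short one.

```python
# Guessed per-lookup costs from the message above (not measurements).
REGEX = {"time_ns": 100, "mem_bytes": 100, "cpu_cycles": 200}
RBL = {"time_ns": 10_000_000, "mem_bytes": 2000, "cpu_cycles": 10_000}

# How many times more expensive one RBL query is than one short regex.
ratios = {key: RBL[key] // REGEX[key] for key in REGEX}
print(ratios)

# Number of short regexes whose combined CPU cost equals one RBL query.
break_even_regexes = RBL["cpu_cycles"] // REGEX["cpu_cycles"]
print(break_even_regexes)
```

On these assumptions an RBL only starts to pay off in CPU terms once it replaces around 50 short regexes, which is consistent with the bigevil.cf example and with the conclusion for 2 regexes.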

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by manu@netbsd.org

Matt Kettler <mkettler@...> wrote:

> Wait.. you think the *regex* is too resource intensive, but advocate using
> RBLs instead?
> 
> Are you completely out of your MIND???!!!
> 
> 
> An RBL is a NETWORK TEST. You have to create a UDP socket, send a request,
> wait for a reply, parse the reply..

That's it: you wait. That means the thread is sleeping and the CPU works
somewhere else. Perhaps in another milter-greylist thread, perhaps in
another process. 

Indeed, a DNS lookup increases latency, but it does not load the CPU as
regexp computation does.
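
The difference between waiting and computing can be illustrated with a small measurement. This is a sketch under a strong assumption: `time.sleep` stands in for the resolver wait, so it shows only that time spent blocked is not charged as CPU time. It says nothing about the CPU spent in the resolver library or the kernel.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) spent running fn()."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def fake_dns_wait():
    # Stand-in for a 50 ms resolver round trip (assumed figure).
    time.sleep(0.05)


wall, cpu = measure(fake_dns_wait)
print(f"wall {wall * 1000:.1f} ms, cpu {cpu * 1000:.1f} ms")
```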

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by manu@netbsd.org

Jonathan Perkin <jon.perkin@...> wrote:

> They are being greylisted.  Or am I missing the point?  I'm hoping
> milter-greylist doesn't blacklist _anything_ else I'm going to have
> serious issues.

Yes, acl greylist means the matching messages are greylisted.
 
> I'll definitely take a look at this, but it'll have to wait until a
> version which supports it is released as stable, as I don't have time
> to track beta releases at the moment.

I skipped the beta stage due to the lack of feedback on the alpha
snapshots... 

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by Matt Kettler

manu@... wrote:
> Matt Kettler <mkettler@...> wrote:
> 
>> Wait.. you think the *regex* is too resource intensive, but advocate using
>> RBLs instead?
>>
>> Are you completely out of your MIND???!!!
>>
>>
>> An RBL is a NETWORK TEST. You have to create a UDP socket, send a request,
>> wait for a reply, parse the reply..
> 
> That's it: you wait. That means the thread is sleeping and the CPU works
> somewhere else. Perhaps in another milter-greylist thread, perhaps in
> another process. 

Yes, I know that no CPU is being used during this time. I addressed both time
AND real CPU clock cycles.

> 
> Indeed a DNS lookup increases lattency, but it does not load the CPU as
> regexp computation does.
> 

I'd argue it does load the CPU more than a regex, although to a lesser degree
than it increases latency. (ie: yes, it's obvious there's a massive increase in
latency, but I say there's also a smaller increase in CPU load).

Yes, you do wait while waiting for the response to come back, and while doing
that you are consuming no CPU cycles.

However, the number of actual CPU cycles burned building the query, sending it,
receiving the reply and parsing that is by far higher than the regex is.

Remember, you don't just count the code that milter-greylist is running.
Consider all the code in the resolver library, OS IP stack, and NIC driver. That
all has to run too. And that all takes CPU time as well.

Sure, regexes are expensive compared to a binary compare, but they're not *THAT*
expensive.

Parsers are expensive too, and that's exactly what the resolver is going to have
to do with the DNS reply it gets.

Think about it. Picture in your head all the things your computer does to create
a DNS query in a UDP packet, send it, receive a response, and process the response.

No, really. Think about all of it.

Sending the query:
buffer allocation, DNS query formatting, context switch to kernel, IP/UDP header
addition, ethernet header addition, NIC programming.

<sleep for free>

Receiving the response:

interrupt handler, kernel thread wake, (possible memcpy depending on NIC and
kernel behaviors), ethernet header parsing, IP header parsing, UDP header
parsing, match against existing socket handles, wake user thread , context
switch to user space, buffer allocation, context switch to kernel, memcpy data
to user app buffer, context switch to user space, DNS response format parsing,
buffer deallocation.

There's a LOT of work going on under the covers here. That's not cheaper than a
pair of short regexes. If you think it is, you're ignoring a large number of
these steps which are all wrapped up in a library for you.

All the context switches alone are likely on-par in clock-cycles burned with the
regex evaluation. Those are not at all inexpensive because the entire CPU state
has to be saved off into a task descriptor. There's at least 3 context switches
involved here, and that's before you actually do any real work.
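
To make the "DNS query formatting" step above concrete, here is a minimal sketch that builds the bytes of a single-question DNS query by hand. It covers only the user-space formatting; the socket calls, kernel work and reply parsing listed above are not shown. The function names, the zone `dnsbl.example` and the address 192.0.2.10 are placeholders chosen for illustration, not anything taken from milter-greylist or a real blacklist.

```python
import struct


def dnsbl_name(ip: str, zone: str) -> str:
    """Reverse the IPv4 octets and append the list's zone (the usual DNSBL convention)."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Format a single-question DNS query (qtype 1 = A record, class IN)."""
    # Header: ID, flags (only "recursion desired" set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Question trailer: QTYPE and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


# Hypothetical zone and a documentation-range address, for illustration only.
name = dnsbl_name("192.0.2.10", "dnsbl.example")
packet = build_query(name)
print(name, len(packet), "bytes")
```

Even this small step involves several allocations and copies, which is the kind of hidden work the list above is pointing at. How it compares with a regex match depends on the platform and would have to be measured.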

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by Matt Kettler

Matt Kettler wrote:
> manu@... wrote:
>> Matt Kettler <mkettler@...> wrote:
>>
>>> Wait.. you think the *regex* is too resource intensive, but advocate using
>>> RBLs instead?
>>>
>>> Are you completely out of your MIND???!!!
>>>
>>>
>>> An RBL is a NETWORK TEST. You have to create a UDP socket, send a request,
>>> wait for a reply, parse the reply..
>> That's it: you wait. That means the thread is sleeping and the CPU works
>> somewhere else. Perhaps in another milter-greylist thread, perhaps in
>> another process. 
> 
> Yes, I know that no CPU is being used during this time. I addressed both time
> AND real CPU clock cycles.
> 
>> Indeed a DNS lookup increases lattency, but it does not load the CPU as
>> regexp computation does.
>>
> 
> I'd argue it does load the CPU more than a regex, although to a lesser degree
> than it increases latency. (ie: yes, it's obvious there's a massive increase in
> latency, but I say there's also a smaller increase in CPU load).

As a side note, I made a quick test with queryperf of 100 PTR queries to
sbl-xbl.spamhaus.org and ran it through time. The records were just incrementing
queries typical of an IP, ie:
	1.1.1.1.sbl-xbl.spamhaus.org PTR
	1.1.1.2.sbl-xbl.spamhaus.org PTR

This would cause each lookup to be unique, although the basic NS records for the
sbl-xbl.spamhaus.org would be cached. This was directed at a resolving ( Yes,
resolving, ie: no forwarders and a root.hint zone) bind nameserver running on
localhost.

cat querytest.sbl-xbl | time queryperf

1.48user 0.85system 0:05.11elapsed 45%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (191major+26minor)pagefaults 0swaps

2.33 seconds of real CPU (1.48 user + 0.85 system) was used to complete those
100 queries. 5.11 seconds of wall time elapsed.

So a DNS query, with recursion, burns about 2ms of actual CPU. Slightly less due
to queryperf having to parse text input and generate text output.

Even with the whole batch cached in the local resolver, re-running it results in
0.05s = 50ms of real CPU time being used and 0.35s clock-time passing, so even a
cached query is 0.05ms = 50us.


Of course, this is all dependent on CPU, but I ran this on a single CPU 2ghz
intel box. Not the fastest but not too shabby.

The numbers are also quite crude, and probably have significant errors in them.

That said, do you really think your regex execution is THAT slow that it will
take longer than 50us of real CPU time to evaluate one on a 2ghz processor?
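
The per-query figures depend on how many queries were in the batch, so it is worth redoing the division. A minimal sketch using only the totals quoted above; the function name and the two batch sizes tried are assumptions for illustration.

```python
def per_query_cpu(user_s: float, system_s: float, n_queries: int) -> float:
    """CPU seconds (user + system) spent per query in a batch."""
    return (user_s + system_s) / n_queries


uncached_total = (1.48, 0.85)  # user and system seconds from the time(1) output
cached_total = 0.05            # total CPU seconds reported for the cached re-run

for n in (100, 1000):
    uncached_ms = per_query_cpu(*uncached_total, n) * 1e3
    cached_us = cached_total / n * 1e6
    print(f"{n} queries: {uncached_ms:.2f} ms uncached, {cached_us:.0f} us cached per query")
```

The "about 2ms" and "50us" figures line up with a batch of 1000 queries. For a batch of 100, the same totals work out to roughly 23ms uncached and 500us cached per query.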

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-02 by eclark

Matt, that depends entirely on how greedy the regex in question is, and how efficient one stack is versus another. I would bet money that the regex evaluation in the milter is vastly inferior to the DNS resolution stack that the milter is compiled against. Moreover, network latency is a non-issue when you locally mirror RBLs inside your own network, or populate local sendmail databases on a regular basis with content pulled in an automatic fashion.

The bigger point is, the broader the stroke you can cut, the more you can knock out at once, which ultimately means less resource consumption, even if you are using remote network connectivity to do some of your work (as remote connectivity is limited to what got past the initial greylist in the first place).

I can speak firsthand about the ailments that are caused by RBLs; we have seen mailservers brought to their knees running sendmail RBLs as if it were nothing, with no additional filtering at all. Drop nameservice to a mailserver, and your RBLs will wait for an extended period of time until process death, and back up the runqueue in no time. No, RBLs are not an end-all solution, and were not suggested as one. They were pointed out as a replacement for an unknown number of expensive regexes.

You are making the fatal error of assuming the only expressions being evaluated were the two pasted; I seriously doubt this is the case, and figure there are probably many more in his conf as well. At what point would you concede that use of alternate, broader methods of checking would be superior to a list of expressions? 10 greedy regexes? 15? 5? If replacing an entire greylisting mechanism made of 15 or more greedy expressions with one locally based hostname lookup in a mirror in your immediate network is considered less effective, then you truly have me stumped as to what might be considered more efficient.
On Thu, 02 Nov 2006 16:14:21 -0500, Matt Kettler <mkettler@...> wrote:
> manu@... wrote:
>> Matt Kettler <mkettler@...> wrote:
>>
>>> Wait.. you think the *regex* is too resource intensive, but advocate
> using
>>> RBLs instead?
>>>
>>> Are you completely out of your MIND???!!!
>>>
>>>
>>> An RBL is a NETWORK TEST. You have to create a UDP socket, send a
> request,
>>> wait for a reply, parse the reply..
>>
>> That's it: you wait. That means the thread is sleeping and the CPU works
>> somewhere else. Perhaps in another milter-greylist thread, perhaps in
>> another process.
> 
> Yes, I know that no CPU is being used during this time. I addressed both
> time
> AND real CPU clock cycles.
> 
>>
>> Indeed a DNS lookup increases lattency, but it does not load the CPU as
>> regexp computation does.
>>
> 
> I'd argue it does load the CPU more than a regex, although to a lesser
> degree
> than it increases latency. (ie: yes, it's obvious there's a massive
> increase in
> latency, but I say there's also a smaller increase in CPU load).
> 
> Yes, you do wait while waiting for the response to come back, and while
> doing
> that you are consuming no CPU cycles.
> 
> However, the number of actual CPU cycles burned building the query,
> sending it,
> receiving the reply and parsing that is by far higher than the regex is.
> 
> Remember, you don't just count the code that milter-greylist is running.
> Consider all the code in the resolver library, OS IP stack, and NIC
> driver. That
> all has to run too. And that all takes CPU time as well.
> 
> Sure, regexes are expensive compared to a binary compare, but they're not
> *THAT*
> expensive.
> 
> Parsers are expensive too, and that's exactly what the resolver is going
> to have
> to do with the DNS reply it gets.
> 
> Think about it. Picture in your head all the things your computer does to
> create
> a DNS querry in a UDP packet, send it, receive a response, and process the
> response.
> 
> No, really. think about all of it.
> 
> Sending the query:
> buffer allocation, DNS query formatting, context switch to kernel, IP/UDP
> header
> addition, ethernet header addition, NIC programming.
> 
> <sleep for free>
> 
> Receiving the response:
> 
> interrupt handler, kernel thread wake, (possible memcpy depending on NIC
> and
> kernel behaviors), ethernet header parsing, IP header parsing, UDP header
> parsing, match against existing socket handles, wake user thread , context
> switch to user space, buffer allocation, context switch to kernel, memcpy
> data
> to user app buffer, context switch to user space, DNS response format
> parsing,
> buffer deallocation.
> 
> There's a LOT of work going on under the covers here. That's not cheaper
> than a
> pair of short regexes. If you think it is, you're ignoring a large number
> of
> these steps which are all wrapped up in a library for you.
> 
> All the context switches alone are likely on-par in clock-cycles burned
> with the
> regex evaluation. Those are not at all inexpensive because the entire CPU
> state
> has to be saved off into a task descriptor. There's at least 3 context
> switches
> involved here, and that's before you actually do any real work.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Matt Kettler

eclark wrote:
> Matt, that depends entirely on how greedy the regex in question is, and how efficient one stack is versus another.

Agreed.

> I would bet money that the regex evaluation in the milter is vastly inferior
> to the dns resolution stack that the milter is compiled against.

Agreed.

Fortunately, I'm talking about 2 orders of magnitude CPU usage difference. Even
if the regex eval in the milter is a factor of 2 slower than it should be, it
would still end up with 50 times less CPU usage than the RBL query.


I don't care how fast your DNS resolver is, or how local it is, it's still going
to involve context switches. Context switches HURT. BADLY. Talk to someone who
writes schedulers sometime. Even a good scheduler is bound by how long it takes
the processor to save and load context.

If context switches were fast, the HZ (rate of timer interrupts and basis for
when the scheduler runs) in a standard linux kernel would be about 1 million not
100 or 1000.

A regex involves no context switches and no I/O, it's all memory operations in
your process space.

A RBL query involves at least 3 context switches and may involve network I/O. On
a damn good OS and CPU, a context switch is about 1 us by itself. That's the
kind of context-switch overhead folks BRAG about.

In my testing even locally cached queries over lo take about 50us a pop.


> Moreover, network latency is a nonissue when you locally mirror RBLs inside your
> own network, or populate local sendmail databases on a regular basis with
> content pulled in an automatic fashion.

True. I'm speaking mostly to the CPU usage side, not the latency. The latency
numbers are so hugely different it's not even worth discussing.

From here forward, unless otherwise specified, every reference to SPEED or FAST
refers to EXECUTION TIME on the processor, not clock time on the wall.

Also, not everyone mirrors the RBLs in question inside their own networks, so
there you're making a broad, possibly false, assumption.

But check my other post. Even locally cached DNS queries are NOT nearly as fast
as you seem to think they are.

Sure they're "fast" compared to running out over the wire. But they're not fast
compared to a regex.

> The bigger point is, the broader the stroke you can cut, the more you can knock out at once, which ultimately means less resource consumption, even if you are using remote network connectivity to do some of your work (as remote connectivity is limited to what got past the initial greylist in the first place).

That statement is patently false. It would only be true if the two tests making
the "cut" are of equal complexity. In that case, yes the broader brush works better.

In this case the broader brush is only about twice as effective, and consumes
about 100 times the CPU as a small handful of regexes.


> I can speak firsthand about the ailments that are caused by rbls; we have seen
> mailservers brought to their knees running sendmail RBLs as if it were nothing,
> with no additional filtering at all. Drop nameservice to a mailserver, and your
> RBLs will wait for an extended period of time until process death, and back up
> the runqueue in no time. No, RBLs are not an end-all solution, and were not
> suggested as one.

I agree.

> They were pointed out as a replacement for an unknown number of expensive regexes.

And I contest the claim those regexes are "expensive" when compared to the RBL.

Most folks who talk about how "expensive" regexes are are talking in comparison
to memcmp() operations. memcmp() is essentially 1 clock cycle per 4 bytes, plus
memory I/O overhead. It's hard to be faster than that.

Folks who talk about how "expensive" regexes are are NOT talking in comparison
to disk or network i/o, or parsing through a handful of recursive queries in a
DNS resolver.

> You are making the fatal error of assuming the only expressions being evaluated
> were the two pasted; I seriously doubt this is the case, and figure there are
> probably many more in his conf as well.

Ok, fair enough. I have a lot more too. I'd still venture to guess it would take
100 regexes of similar complexity to hit the "break even" point with the RBL in
terms of CPU usage.

That said, I'd love to ditch some of my regexes for a RBL query.. but not
because it's faster. I'd do it because it's more accurate, and I'm willing to
increase my CPU usage, and substantially increase my latency, in order to do it.

> At what point would you concede that use of alternate, broader methods of
> checking would be superior to a list of expressions? 10 greedy regexes? 15? 5?

Depends on how greedy you're talking. About 100 of my revised versions of those
regexes. About 50 of the original ones. About 10 that have .* in them without
much good fixed-match text at the start (ie: /a.*b/).

And of course, 1 really badly written one could do it. Anyone can write a regex
that is more-or-less a DOS attack by itself. Use a ton of back references, make
it about 5 miles long, lots of .*'s.. yeah, it gets ugly.
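
A small sketch of the "one really badly written regex" case. It uses Python's `re` module, which is a backtracking engine and not the POSIX regex(3) library milter-greylist links against, so the absolute numbers are only illustrative. The pattern `(a+)+b` is a textbook example chosen for the demonstration, not one from anyone's config.

```python
import re
import time

# Nested quantifiers: a run of 'a's can be split between the inner and outer
# '+' in exponentially many ways, and a backtracking engine tries them all.
nested = re.compile(r"(a+)+b")
# Accepts the same strings ("one or more a, then b") without the nesting.
flat = re.compile(r"a+b")


def cpu_seconds(pattern, text):
    """Return (CPU seconds used, match object or None) for one search."""
    start = time.process_time()
    result = pattern.search(text)
    return time.process_time() - start, result


for n in (14, 16, 18, 20):
    text = "a" * n  # no 'b' anywhere, so every attempt has to fail
    slow, _ = cpu_seconds(nested, text)
    fast, _ = cpu_seconds(flat, text)
    print(f"n={n}: nested {slow:.4f}s, flat {fast:.6f}s")
```

On a backtracking engine the nested form should roughly double in cost for each extra character, which is why a single pattern like this can outweigh any number of well-behaved ones.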

> If replacing an entire greylisting mechanism made of 15 or more greedy
> expressions with one locally based hostname lookup in a mirror in your
> immediate network is considered less effective, then you truly have me stumped
> as to what might be considered more efficient.

Well given that it took me 2 weeks of work to even make a RBL-enabled build of
milter-greylist that doesn't instantly segfault.

And given that milter-greylist's usage of RBLs is new, and probably pretty
suboptimally written.

And given that I'm pretty reasonable at regex tuning. I don't use * or + and I
rarely use . in my regexes for milter-greylist.

Given all that, Yes. I do consider the regexes more efficient, as long as you
keep their numbers down to a sane count, and keep them reasonably written.


RBL queries are REALLY SLOW by comparison to short, reasonably written regexes.
Period.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Oliver Fromme

Hi,

I'm very sorry, this will become a somewhat lengthy mail.
I hope your spam filter doesn't drop it.  ;-)
But the issue at hand is _not_ trivial and cannot be
discussed in a single paragraph.

Matt Kettler wrote:
 > That said, do you really think your regex execution is THAT slow
 > that it will take longer than 50us of real CPU time to evaluate
 > one on a 2ghz processor?

Do you know how regular expressions work?  The expression
itself (i.e. the string from the configuration file) is
not stored in memory, neither is it used directly for
matching.

Instead, during parsing, the regex is "compiled" into a
finite state automaton.  That automaton is basically a
data structure consisting of several tables (arrays),
typically several hundred bytes in size, sometimes even
thousands.  The size depends on structure of the regexp
and the actual implementation of the regular expression
library, and how it trades off speed and size.  I assume
that milter-greylist uses the system library (regex(3)),
which can behave quite differently on different platforms,
depending on whether the authors turned their attention
to speed or to memory consumption.

When the automaton is applied ("executed") to a string,
the speed also depends very much on the structure of the
regular expression.  If the automaton only contains simple
states (e.g. fixed characters and unlimited repetitions),
it is almost as fast as a plain string comparison, i.e.
negligible.  On the other hand, complicated states such
as repetition ranges or back-references require recursion
and will need significantly more CPU time.  If you even
nest them in multiple levels, CPU consumption will
skyrocket exponentially.

Bottom line:  When using regular expressions, it's worth
understanding how they work, and then crafting them in a way
that is most efficient.  Perform benchmarks if necessary
(using the same regex library, of course).  It can make a
huge difference.
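
A minimal sketch of the "compile once, match per connection" pattern together with a crude benchmark. It uses the three patterns from the original post, but runs them through Python's `re` module, which compiles to a backtracking matcher and not a true finite state automaton, so it says nothing definite about regex(3) on any particular platform. The function name and the sample hostname are made up for the example.

```python
import re
import timeit

# Compiled once, as would happen when the config file is parsed.
ACL_PATTERNS = [
    re.compile(r"[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*"),
    re.compile(r"[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*"),
    re.compile(r"[0-9]{12}"),
]


def looks_dynamic(hostname: str) -> bool:
    """Per-connection check: only matching happens here, no compilation."""
    return any(p.search(hostname) for p in ACL_PATTERNS)


sample = "host-192-0-2-10.dsl.example.net"  # hypothetical reverse-DNS name
runs = 100_000
total = timeit.timeit(lambda: looks_dynamic(sample), number=runs)
print(f"{total / runs * 1e6:.2f} us per hostname (wall clock, this machine only)")
```

To compare libraries fairly, the same loop would have to be written in C against the regex library the milter is actually built with.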

UDP is a very efficient protocol, especially when run
over the loopback interface (as is the case when you have
a local caching nameserver, which you definitely should
have when running an MTA).  Of course it depends on the
implementation of the IP stack in the kernel.  UDP is
connection-less, doesn't have to care about re-ordering,
retransmits etc., and the loopback interface doesn't have
to produce an ethernet frame and doesn't have to perform
fragmentation and re-assembly.  So it's several orders of
magnitude less complicated and more efficient than a TCP
connection to a remote host.

Basically, the resolver library generates a request packet
(typically between 60 and 80 bytes).  That packet is copied
around once or twice in the kernel, maybe not even once if
the kernel is well optimized for that case.  It doesn't
matter much anyway for 80 bytes.  For DNS black lists, the
reply is usually not much larger ("real" DNS replies are
noticeably larger, typically between 100 and 400 bytes,
depending on how many RRs and glue records the server
associates with the request, but that's still not much).
The processing on the server side depends a lot on the
efficiency of the implementation of the caching nameserver
(as far as I know, BIND is quite good in that regard).

If the answer to the query is not already cached, then it
has to be fetched from a parent (forwarder), or from an
authoritative nameserver directly.  Of course, this will
lead to a much higher latency, but during that time the
local CPU is free to do other things, as Emmanuel already
pointed out, so it doesn't matter at all.  On my FreeBSD 4
machine the overhead is near zero and requires quite
sophisticated benchmarks to even be able to measure it.

In fact, you can probably run a caching nameserver on a
different machine (or load-balance on several machines) in
the same LAN (gigabit) instead of the local machine, and
won't notice a difference.

Some final notes:  First, when using a packet filter (IPFW
or PF on BSD systems, IPF on Solaris, IP tables/chains on
Linux or whatever), make sure that loopback and/or UDP/53
traffic is short-cut right at the beginning of your rules,
or even completely exempt from filtering if possible.  If
your DNS traffic has to pass through hundreds of rules,
it can add noticeable latency _and_ CPU processing.  The
rule of thumb from regular expressions also applies to
packet filters:  Craft the rules very carefully, make
short-cuts for the majority of them wherever possible.
If possible, use a different machine as a filtering bridge
or similar, and don't use packet filtering at all on the
MTA machines.

Second:  Note the fact that sendmail performs a DNS lookup
on every incoming connection anyway.  The result from that
lookup is passed to milter-greylist for matching with the
"acl domain" feature.  So even when using regexps on domain
names and no BLs at all, you cannot avoid DNS lookups
completely.

A note on memory efficiency:  For each regular expression,
the finite state automaton has to be stored in memory, but
no additional data has to be stored for every connection
from a remote host (except for the data that is present
anyway, no matter if regexps are used or not).  However,
for DNS BL lookups, the result is always cached per remote
host (by your name server, not by milter-greylist).  So
using regular expressions scales better in that regard.
However, if the size of your named process isn't critical,
then it doesn't matter.  The size of the milter-greylist
process is dominated by the size of the database, but not
by the number of DNS lookups or regular expressions.

So, the bottom line is:  Whether regular expressions or
DNS BLs are more efficient for someone depends on a whole
lot of things.  You cannot generically say that one is
always more efficient than the other.  I think on really
large servers it's worth investigating to find the best
mix of the two.  For example, in a particular case it
might be best to first filter a bunch of domains out with
a broad set of _simple_ regular expressions, then perform
DNS BL lookups on the rest, and apply further refined
regexps (possibly less efficient ones) depending on the
outcome of the lookup.  (I'm not sure if the current
version of milter-greylist supports such constructs; I
haven't given it a try yet because of personal resource
constraints on my side.)
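
The staged approach described above could be sketched roughly as follows. This is not milter-greylist configuration or code; it is a standalone illustration in Python. The `decide` function, the `REFINED` pattern and the policy of what each stage returns are all assumptions, and the blacklist lookup is passed in as a plain callable so that no real DNS traffic is involved.

```python
import re
from typing import Callable

# Stage 1: a cheap, simple pattern (the revised regex from this thread).
CHEAP = re.compile(r"[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]")
# Stage 3: a hypothetical, more specific pattern applied only when needed.
REFINED = re.compile(r"(dsl|dyn|pool|ppp)[-.0-9]", re.IGNORECASE)


def decide(hostname: str, ip: str, is_listed: Callable[[str], bool]) -> str:
    """Return "greylist" or "accept" using progressively costlier checks."""
    if CHEAP.search(hostname):
        return "greylist"          # caught by the cheap regex, no lookup needed
    if is_listed(ip):
        return "greylist"          # stage 2: the (expensive) blacklist lookup
    # Stage 3: refined regex for hosts that the lookup did not flag.
    return "greylist" if REFINED.search(hostname) else "accept"


lookups = []


def fake_is_listed(ip: str) -> bool:
    """Stand-in for a DNS BL query; records which addresses were looked up."""
    lookups.append(ip)
    return ip == "192.0.2.10"


print(decide("192-0-2-99.example.net", "192.0.2.99", fake_is_listed))
print(decide("mail.example.org", "198.51.100.1", fake_is_listed))
print("lookups performed:", lookups)
```

The point of the ordering is that the lookup only runs for hosts the cheap pattern did not already decide. Whether that saves anything in practice depends on the traffic mix.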

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

Perl is worse than Python because people wanted it worse.
        -- Larry Wall

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Matt Kettler

Oliver Fromme wrote:
> Hi,

Hello.

> 
> I'm very sorry, this will become a somewhat lengthy mail.
> I hope your spam filter doesn't drop it.  ;-)
> But the issue at hand is _not_ trivial and cannot be
> discussed in a single paragraph.

Agreed, many of my posts here have been very long.

> 
> Matt Kettler wrote:
>  > That said, do you really think your regex execution is THAT slow
>  > that it will take longer than 50us of real CPU time to evaluate
>  > one on a 2ghz processor?
> 
> Do you know how regular expressions work? 

Yes, I'm *extensively* familiar with how regexes work.


> The expression
> itself (i.e. the string from the configuration file) is
> not stored in memory, neither is it used directly for
> matching.

I know.

> 
> Instead, during parsing, the regex is "compiled" into a
> finite state automaton.  

Yes, true. This should happen when the config file is parsed, not on a
per-connection basis.

> That automaton is basically a
> data structure consisting of several tables (arrays),
> typically several hundred bytes in size, sometimes even
> thousands.

True. The regexes in question are 791 bytes total.

>   The size depends on structure of the regexp
> and the actual implementation of the regular expression
> library, and how it trades off speed and size.  I assume
> that milter-greylist uses the system library (regex(3)),
> which can behave quite differently on different platforms,
> depending on whether the authors turned their attention
> to speed or to memory consumption.

Agreed that regexes CAN be slow. However, this whole thread is about two
specific regexes.

> 
> When the automaton is applied ("executed") to a string,
> the speed also depends very much on the structure of the
> regular expression.  If the automaton only contains simple
> states (e.g. fixed characters and unlimited repetitions),
> it is almost as fast as a plain string comparison, i.e.
> negligible.  On the other hand, complicated states such
> as repetition ranges or back-references require recursion
> and will need significantly more CPU time.  If you even
> nest them in multiple levels, CPU consumption will sky-
> rocket exponentially.
> 
> Bottom line:  When using regular expressions, it's worth
> to understand how they work, and then craft them in a way
> that is most efficient.

Yes, exactly. Read my posts. All of them point out you need to write your
regexes WELL.

In fact, this thread started with me offering performance tuning suggestions to
someone's regexes.

All I'm arguing against is the concept that DNS is faster than a small number of
well written regexes. That's utter nonsense, but it's the view that "eclark" has
been espousing.

Let me re-quote myself from an earlier post:

------------------
However, using a RBL to replace 2 short lightweight regexes is NOT a performance
gain. It's a massive performance loss.
------------------

That's my point. Not that regexes are always faster. But that modest numbers of
them are faster.  Let's not take this further than it belongs.

Really.. eclark blasted me for suggesting that someone use the following two
regexes:

acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
acl greylist domain /[0-9]{12}/

And suggested they'd be better off using a DNS query. Even if they have 10 more
regexes just like those, they won't be better off with the DNS query, except
perhaps a slight reduction in memory usage.

Those two are NOT slower than a DNS lookup. Period. Even if the DNS is locally
cached. They're structured to prevent back-tracking, a common "killer" in regexes.

Let's look at them in perl, which is going to do a memory-large compilation
trying to tune for speed:

$perl -mre=debug -e '/[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/'
size 73 Got 588 bytes for offset annotations.

$perl -mre=debug -e '/[0-9]{12}/'
size 14 Got 116 bytes for offset annotations.

Do you really think you can complete a DNS query with 791 bytes of data storage?
Including recursion and parsing? AND use fewer CPU clock cycles?

I don't think you believe that, and nor do I. But apparently some folks do.

> So, the bottom line is:  Whether regular expressions or
> DNS BLs are more efficient for someone depend on a whole
> lot of things.  

Agreed.. Which is why my posts have been very careful to constrain the situation
to comparing locally cached DNS against a small number of well written regexes.


> You cannot generically say that one is
> always more efficient than the other. 

Agreed 100%. I've never made that claim.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by eclark

Matt, bit incorrect here. My point had nothing to do with this:


> acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
> acl greylist domain /[0-9]{12}/

But this:

  acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
  acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/

which I still firmly believe are terrible for performance. And more
specifically, the point was about using kludges to greylist purported dynamic
ips over a maintained list of them. How is it a kludge? There are definitely
dynamic ip providers out there who do not use 1-2-3-4.provider.com or similar
to denote addresses in their space. Many do yes, but not all. RBLs were
suggested over what was/is potentially a wide variety of regexes similar to
the ones originally posted, as the original poster pointedly stated that
greylisting by default and using domain based acls was totally unacceptable.

However, the overall thread did illuminate some useful information regardless;
the general opinion here is that 100 regexes are almost definitely worse off
than a single RBL call, as few as two or three expressions can totally nuke
your box if they are poorly written, and that the best option will vary
totally, but likely to contain some mix of rbl or sendmail db references tied
with expression based acls as necessary, and to very carefully build
expressions to prevent excessive backtracking, the agreed bane of this
discussion.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by eclark

And.... I like chocolate miiiiiilk.
On Friday 03 November 2006 12:37 pm, eclark wrote:
> Matt, bit incorrect here. My point had nothing to do with this:
> > acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
> > acl greylist domain /[0-9]{12}/
>
> But this:
>
>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
>
> which I still firmly believe are terrible for performance. And more
> specifically, the point was about using kuldges to greylist purported
> dynamic ips over a maintained list of them. How is it a kludge? There are
> definitely dnamic ip providers out there who do not use
> 1-2-3-4.provider.com or similiar to denote addresses in their space. Many
> do yes, but not all. RBLs were suggested over what was/is potentially a
> wide variety of regexs similiar to the ones originally posted, as the
> original poster pointedly stated that greylisting by default and using
> domain based acls was totally unacceptable.
>
> However, the overall thread did illuminate some useful information
> regardless; the general opinion here is that 100 regexs are almost
> definitely worse off than a single RBL call, as few as two or three
> expressions can totally nuke your box if they are poorly written, and that
> the best option will vary totally, but likely to contain some mix of rbl or
> sendmail db references tied with expression based acls as neccesary, and to
> very carefully build expressions to prevent excessive backtracking, the
> agreed bane of this discussion.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Matt Kettler

eclark wrote:
> Matt, bit incorrect here. My point had nothing to do with this:
> 
> 
>> acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
>> acl greylist domain /[0-9]{12}/

I disagree, but I'll accept the change of point.

> 
> But this:
> 
>   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
>   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
> 
> which I still firmly believe are terrible for performance.

Agreed. Those could be terrible, at least for some inputs.

Also, since there's two of them, instead of one (the first one in my quote
replaces those two), you're at least doing twice as much work, and possibly much
more due to the use of *.

More reason why it's important to think very hard about your regexes, and ask
others for tuning advice.

Of course, there are much worse things to do in a regex, particularly if your
parser is dumb:

http://regexadvice.com/blogs/dneimke/archive/2004/07/28/239.aspx

Perhaps it might be worth adding PCRE support to milter-greylist. At least then
those who link against it would know they were using a reasonably fast regex
library. Compared with the posix library which might not be so well tuned.

Some basic tips which we all should try to use when writing regexes:
- Avoid * and + if you can. Use {x,y} instead whenever possible.
- Try to combine regexes when you can, provided it's not massively increasing
complexity. ie: what I did above by using [-.] to combine the two rules.
- Use [] instead of (|) whenever possible. ie: [ab] instead of (a|b).
- when using (|) try to move as much common text out of the () as possible.
ie: instead of (saturday|sunday) do s(atur|un)day.
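
A quick way to sanity-check rewrites like the ones in these tips is to run old and new patterns over a handful of sample strings. A minimal sketch in Python's `re` (not regex(3)); the helper name and all sample strings are made up. It also shows that folding the "-" and "." rules together with [-.] and {1,3} is close to, but not exactly, the same test as the original pair.

```python
import re


def same_matches(pat_a: str, pat_b: str, samples) -> bool:
    """True if both patterns accept or reject every sample identically (full match)."""
    return all(
        bool(re.fullmatch(pat_a, s)) == bool(re.fullmatch(pat_b, s)) for s in samples
    )


days = ["saturday", "sunday", "monday", "sday", "saturnday"]
print(same_matches(r"(saturday|sunday)", r"s(atur|un)day", days))
print(same_matches(r"(a|b)", r"[ab]", ["a", "b", "c", "ab", ""]))

# The combined rule versus the two original rules, on hypothetical hostnames.
combined = re.compile(r"[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]")
dash_only = re.compile(r"[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*")
dot_only = re.compile(r"[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*")


def compare(host: str):
    """Return (original pair matches, combined rule matches) for one hostname."""
    old = bool(dash_only.search(host) or dot_only.search(host))
    new = bool(combined.search(host))
    return old, new


for host in ("1-2-3.example.com", "1.2-3.example.com", "host-1-2-3"):
    print(host, compare(host))
```

The combined rule also accepts mixed separators and requires a separator after the third number, so it is worth testing against real hostnames from the logs before swapping it in.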



> And more
> specifically, the point was about using kuldges to greylist purported dynamic 
> ips over a maintained list of them. How is it a kludge?

Agreed, it is a bit of a kludge. But it's useful for folks who want to greylist
some small chunk of their traffic in a lightweight manner. Just don't go
overboard on the regexes.


> 
> However, the overall thread did illuminate some useful information regardless; 
> the general opinion here is that 100 regexs are almost definitely worse off 
> than a single RBL call, as few as two or three expressions can totally nuke 
> your box if they are poorly written, and that the best option will vary 
> totally, but likely to contain some mix of rbl or sendmail db references tied 
> with expression based acls as neccesary, and to very carefully build 
> expressions to prevent excessive backtracking, the agreed bane of this 
> discussion.

Agreed.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by eclark

> Perhaps it might be worth adding PCRE support to milter-greylist. At least
> then those who link against it would know they were using a reasonably fast
> regex library. 

What's everyone's idea on this? It seems like it would be a really good idea to 
me at the very least.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Emmanuel Dreyfus

On Fri, Nov 03, 2006 at 02:13:48PM -0500, eclark wrote:
> > Perhaps it might be worth adding PCRE support to milter-greylist. At least
> > then those who link against it would know they were using a reasonably fast
> > regex library. 
> What's everyone's idea on this? It seems like it would be a really good idea to 
> me, at the very least.

I don't care as I don't have performance issues :-)
Such a feature can go in if someone contributes it, but wait for 3.0 to 
be released first.

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Oliver Fromme

Emmanuel Dreyfus wrote:
 > eclark wrote:
 > > > Perhaps it might be worth adding PCRE support to milter-greylist. At least
 > > > then those who link against it would know they were using a reasonably fast
 > > > regex library. 
 > > What's everyone's idea on this? It seems like it would be a really good idea to 
 > > me, at the very least.
 > 
 > I don't care as I don't have performance issues :-)
 > Such a feature can go in if someone contributes it, but wait for 3.0 to 
 > be released first.

A quick test on FreeBSD 6 reveals that pcre is about
2.3 times slower than the native regex(3) library.
Maybe that's because of the many bloated features of
perl REs that are not present in POSIX REs.  Therefore
I would recommend against using pcre, at least not
without carefully benchmarking it on different
platforms.

Oh by the way, I wasn't able to measure any difference
between "*" and "{,}".  They perform exactly at the
same speed, at least with regex(3) on FreeBSD 6.

(If anything, I would have expected "{,}" to be slower
because it requires either more states in the DFA/NFA,
or additional bookkeeping within the state, compared
to "*".)

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"C is quirky, flawed, and an enormous success."
        -- Dennis M. Ritchie.

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Matt Kettler

Oliver Fromme wrote:
>
> 
> Oh by the way, I wasn't able to measure any difference
> between "*" and "{,}".  They perform exactly at the
> same speed, at least with regex(3) on FreeBSD 6.

You shouldn't notice any difference unless the input is patterned in the right
way to cause a lot of backtracking. In that case {n,m} will limit your backtrack
to at most m characters. * is only limited by the amount of matching input.


> 
> (If anything, I would have expected "{,}" to be slower
> because it requires either more states in the DFA/NFA,
> or additional bookkeeping within the state, compared
> to "*".)

In the simple case, yes * can be faster. In the worst case, it can be much
slower. Thousands of times slower.
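
To illustrate the point about bounded repetition, here is a minimal sketch of
a toy greedy backtracking matcher that counts how many match attempts it
makes. It is not how regex(3) or any particular library is implemented; the
token format, the function names and the test input are all assumptions made
up for this example. It compares [0-9]*x with [0-9]{0,3}x on a long run of
digits that contains no "x", so every attempt fails.

```python
def match_here(tokens, text, pos, counter):
    """Greedy backtracking match of tokens at text[pos:], counting attempts."""
    counter[0] += 1
    if not tokens:
        return True
    chars, lo, hi = tokens[0]
    # Greedily consume as many characters as the quantifier allows.
    n = 0
    while (pos + n < len(text) and text[pos + n] in chars
           and (hi is None or n < hi)):
        n += 1
    # Back off one character at a time until the rest of the pattern matches.
    while n >= lo:
        if match_here(tokens[1:], text, pos + n, counter):
            return True
        n -= 1
    return False


def search_steps(tokens, text):
    """Try every start position; return (found, number of match attempts)."""
    counter = [0]
    found = any(match_here(tokens, text, start, counter)
                for start in range(len(text) + 1))
    return found, counter[0]


DIGITS = set("0123456789")
# Each token is (allowed characters, minimum count, maximum count or None).
star = [(DIGITS, 0, None), ({"x"}, 1, 1)]   # [0-9]*x
bounded = [(DIGITS, 0, 3), ({"x"}, 1, 1)]   # [0-9]{0,3}x

text = "1" * 200  # digits only, no "x"
found_star, steps_star = search_steps(star, text)
found_bounded, steps_bounded = search_steps(bounded, text)
print(steps_star, steps_bounded)
```

In this toy model the unbounded version does work that grows with the square
of the input length, while the bounded version grows linearly. Whether a real
library shows the same gap depends on how it is implemented.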

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-04 by AIDA Shinra

At Fri, 03 Nov 2006 14:04:14 -0500,
Matt Kettler wrote:
> > 
> > But this:
> > 
> >   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
> >   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
> > 
> > which I still firmly believe are terrible for performance.
> 
> Agreed. Those could be terrible, at least for some inputs.

A little off-topic: they are not too bad. No backtracking for any
input. * doesn't do harm if the * part and the part following it are
mutually exclusive. For example:
\[.*\] causes backtracking
\[[^]]*\] doesn't cause any backtracking
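
The difference between the two patterns can be seen in what they match. The
sketch below uses Python's re module, which is a backtracking engine (using
it in place of regex(3) is an assumption, and the sample line is made up).
The greedy .* runs to the end of the line and then has to give characters
back until it finds a "]", whereas the negated class can never consume a "]"
and so stops at the first one by itself.

```python
import re

# Hypothetical input line with two bracketed sections.
line = "[first] some text [second] trailing"

# Greedy dot-star: overshoots, then backs off to the LAST closing bracket.
greedy = re.search(r"\[.*\]", line).group(0)

# Negated class: [^\]] is the Python spelling of the POSIX [^]] used above.
# It cannot match "]", so it stops at the FIRST closing bracket.
exclusive = re.search(r"\[[^\]]*\]", line).group(0)

print(greedy)
print(exclusive)
```

Note that the two patterns are therefore not interchangeable when a line can
contain more than one bracketed section.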

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-07 by Oliver Fromme

AIDA Shinra wrote:
 > Matt Kettler wrote:
 > > > 
 > > > But this:
 > > > 
 > > >   acl greylist domain /[0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]*/
 > > >   acl greylist domain /[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/
 > > > 
 > > > which I still firmly believe are terrible for performance.
 > > 
 > > Agreed. Those could be terrible, at least for some inputs.

No, they're perfectly OK.

 > A little off-topic: they are not too bad. No backtracking for any
 > input. * doesn't do harm if the * part and the part following it are
 > mutually exclusive. For example:
 > \[.*\] causes backtracking
 > \[[^]]*\] doesn't cause any backtracking

It depends entirely on the implementation of the regex
library.  You can implement them in a way that an NFA
is built that _never_ requires any backtracking for "*"
(including both cases mentioned above).

See any good book on formal languages and automata.
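
As a rough illustration of that idea, here is a minimal sketch of the
textbook state-set simulation of an NFA for the dashed pattern discussed in
this thread. It is not taken from any regex library; the token format and
the names closure and nfa_search are assumptions invented for this example,
and the host names are made up. Every input character is examined exactly
once, and no position is ever revisited.

```python
DIGITS = set("0123456789")
DASH = {"-"}

# [0-9][0-9]*\-[0-9][0-9]*\-[0-9][0-9]* as (allowed characters, starred) tokens.
DASHED = [
    (DIGITS, False), (DIGITS, True), (DASH, False),
    (DIGITS, False), (DIGITS, True), (DASH, False),
    (DIGITS, False), (DIGITS, True),
]


def closure(states, tokens):
    """Add states reachable without consuming input (skipping starred tokens)."""
    result = set(states)
    for i, (_, starred) in enumerate(tokens):
        if i in result and starred:
            result.add(i + 1)
    return result


def nfa_search(tokens, text):
    """Unanchored search that tracks the set of live NFA states."""
    accept = len(tokens)
    current = closure({0}, tokens)
    if accept in current:
        return True
    for ch in text:
        nxt = {0}  # a match may also start at the next position
        for i in current:
            if i == accept:
                continue
            chars, starred = tokens[i]
            if ch in chars:
                # A starred token loops on itself; a plain one advances.
                nxt.add(i if starred else i + 1)
        current = closure(nxt, tokens)
        if accept in current:
            return True
    return False


print(nfa_search(DASHED, "host-12-34-56.example.net"))
print(nfa_search(DASHED, "mail.example.com"))
```

The work per character is bounded by the number of tokens in the pattern, so
the running time is linear in the input length regardless of what the input
looks like.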

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"C++ is over-complicated nonsense. And Bjorn Shoestrap's book
a danger to public health. I tried reading it once, I was in
recovery for months."
        -- Cliff Sarginson
