Just a further comment on this. The images are coming through with
multiple background and text colors with matching grey levels and lint
speckling over them. This is designed to make OCR more difficult. I
don't know how successful the OCR has been, but I can tell by looking at
the images that the spammers are trying to circumvent that tool.

---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst

<hoogendyk@...>

---------------

Erdös 4

Oliver Fromme wrote:
> Emmanuel Dreyfus wrote:
> > I'd like to fight better the image spams. The only solution I have heard
> > about is the OCR plug-in for spamassassin. I don't want to run spamassassin
> > because it's too heavyweight. I'm looking for a way to convert the image
> > into text and to run a regex filter on it.
> >
> > Is there something lightweight and reliable that does that? If not I'm
> > going to develop it.
> >
> > My idea would be to create a new milter that would perform OCR on images
> > contained in the message and would attach the obtained text at the end
> > of the message so that other tools (milter-regex for instance) could work
> > on it.
> >
> > Any opinion on that approach?
>
> The problem is that OCR itself is heavyweight. I've worked
> with quite a few OCR systems in the past 15 years, and all
> of them require a serious amount of processing performance.
> And even then you have a certain error rate. The good
> systems (those that have a low error rate) even require a
> "training" process in advance for recognizing the fonts
> being used for the documents being scanned.
>
> Therefore I think that running OCR on embedded images on a
> mail server isn't an option, I'm afraid. Unless you're
> prepared that processing of every email requires a long
> time, and that there are errors in the recognized text
> (so that your regex will have trouble matching). It will
> even make it easier to run a denial-of-service attack
> against your mail server, by simply sending many emails
> that contain multiple large images with random pixel
> patterns.
>
> I think a better approach (i.e. much faster and more
> reliable) would be to create a public database of such
> images. Spammers aren't generating new images for every
> single mail, so that should be feasible.
>
> First, when someone identifies a mail as spam, a hash of
> the image (e.g. an MD5 checksum) is submitted to the
> database. This could be pretty much automated by a script
> or little tool, so a single hotkey from within your MUA
> will do it.
>
> Second, if somewhere else email is received with one or
> more images attached, the MD5 checksum will be calculated
> and verified with the database. If it had been reported
> before, it is tagged as spam.
>
> In fact, the database could be implemented as a DNS black
> list, so milter-greylist would already support it. ;-)
> The DNS query would ask for the MD5 checksum (as a string
> of 32 hex digits). Maybe the reply could also contain some
> indication of how many people reported this image as spam
> already, so a higher number would indicate a better
> reliability that it is indeed spam (i.e. higher "score").
>
> Just an idea. Anyway, OCR is not a solution, I think.
>
> Best regards
> Oliver
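
For anyone who wants to experiment with the checksum idea quoted above, here is a rough Python sketch of the lookup side. It is only an illustration, not anything milter-greylist ships: the zone name imagehash.dnsbl.example is made up (no such list is named in this thread), and the function names are hypothetical. It hashes each decoded image part with MD5 and builds the 32-hex-digit DNS name described above. Keep in mind that an exact checksum only matches byte-identical image files.

```python
import hashlib
import socket
from email import message_from_bytes
from typing import Callable, List

# Hypothetical zone name, used for illustration only.
EXAMPLE_ZONE = "imagehash.dnsbl.example"


def image_md5_digests(raw_message: bytes) -> List[str]:
    """Return the MD5 hex digest of every decoded image/* part."""
    msg = message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload:
            digests.append(hashlib.md5(payload).hexdigest())
    return digests


def dnsbl_query_name(digest: str, zone: str = EXAMPLE_ZONE) -> str:
    """Build the DNS name to query: <32 hex digits>.<zone>."""
    digest = digest.lower()
    if len(digest) != 32 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError("expected an MD5 digest of 32 hex digits")
    return digest + "." + zone


def is_listed(
    digest: str,
    zone: str = EXAMPLE_ZONE,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Treat any successful A-record lookup as 'reported as spam'."""
    try:
        resolve(dnsbl_query_name(digest, zone))
    except OSError:  # socket.gaierror (name not found) is a subclass
        return False
    return True
```

The resolver is passed in as a parameter so the logic can be tried without a real zone. The report count suggested in the quoted message could be encoded in the returned address, but that is left out here.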
Re: [milter-greylist] [off-topic] OCR milter?
2006-11-02 by Chris Hoogendyk