stripping logos from scanned PDF files

Post by **musicprog** » Sat Jan 03, 2009 3:41 pm

I have created a small tool that can detect a STAMP in a large image. It is a modification of the find_obj sample program from the OpenCV computer vision library, and it reports the coordinates of the STAMP within the large image. Using the ImageMagick convert tool, the STAMP can then be removed from the large image based on those coordinates. The tool works on PNG images only, but the coordinates can also be used to clean a TIFF image.

Usage:

find_obj.exe STAMP.png ScannedImage.png

--------------------------------------------------------------------
A batch file can be used to remove STAMP from an image
--------------------------------------------------------------------
Remove.bat STAMP.png ScannedImage.tiff
--------------------------------------------------------------------
Remove.bat source:

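rem Make a monochrome PNG copy, because find_obj.exe reads PNG input
convert %2 -monochrome temp.png

rem Capture the STAMP coordinates printed by find_obj.exe
find_obj.exe %1 temp.png > a.txt
set /p COORD=<a.txt
del a.txt
del temp.png

rem Paint a white rectangle over the STAMP and rewrite the TIFF in place
convert %2 -fill white -draw "rectangle %COORD%" -monochrome -compress group4 %2

rem Second pass from the original script: invert the image and rewrite it again
convert %2 -negate -monochrome -compress group4 %2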

---------------------------------------------------------------------
The accuracy is very good, but it's a bit slow.
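The rectangle-erasing step of Remove.bat can be illustrated with a minimal Python sketch. It is not part of the posted tool. It assumes the coordinates arrive as a string in the form `x0,y0 x1,y1` (the form the `-draw "rectangle ..."` primitive accepts), and the function names `parse_rectangle` and `erase_rectangle` are made up for illustration. The image is a plain list of rows of gray values.

```python
def parse_rectangle(coord):
    """Parse 'x0,y0 x1,y1' (assumed format) into (left, top, right, bottom)."""
    first, second = coord.split()
    x0, y0 = (int(v) for v in first.split(","))
    x1, y1 = (int(v) for v in second.split(","))
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


def erase_rectangle(image, coord, white=255):
    """Paint the rectangle (both corners inclusive) white, clipped to the image."""
    left, top, right, bottom = parse_rectangle(coord)
    for y in range(max(top, 0), min(bottom, len(image) - 1) + 1):
        for x in range(max(left, 0), min(right, len(image[0]) - 1) + 1):
            image[y][x] = white
    return image


# Toy 6x4 all-black "page"; erase the block from (1,1) to (3,2).
page = [[0] * 6 for _ in range(4)]
erase_rectangle(page, "1,1 3,2")
```

On a real scan the same coordinates would simply be handed to convert, as the batch file does.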

URL of Binary and source

http://www.4shared.com/dir/11558491/16f ... aring.html

Post by **horndude77** » Sun Jan 11, 2009 11:29 pm

Interesting. Do you know how it is searching for the 'stamp'? How long does it take?

Lately I've also been messing with the Leptonica library to do some other miscellaneous cleaning functions (http://www.leptonica.com). It is more document-oriented than OpenCV (I believe Google uses it on Google reader). It also includes a sample code file for a similar function: http://github.com/horndude77/leptonica/ ... pattern1.c. It does this with the hit-and-miss transform. In any case it's fast and much better than my previous approach. I want to figure out how to make it work well enough for mass logo removal soon. (I do have it working well for general page clean-up already: deskew, grayscale->bilevel, centering, noise removal.)
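For readers unfamiliar with the hit-and-miss transform mentioned above, here is a minimal pure-Python sketch of the idea. It does not use or imitate the Leptonica API. The function name `hit_miss`, the toy pattern and the toy page are all assumptions made for illustration. A location matches only if every "hit" offset lands on ink and every "miss" offset lands on background.

```python
def hit_miss(image, hits, misses):
    """Return the set of (row, col) origins where the pattern matches.

    image  : list of equal-length lists of 0/1 (1 = ink)
    hits   : offsets (dr, dc) that must be ink
    misses : offsets (dr, dc) that must be background
    Offsets that fall outside the image count as background.
    """
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    matches = set()
    for r in range(rows):
        for c in range(cols):
            hit_ok = all(pixel(r + dr, c + dc) == 1 for dr, dc in hits)
            miss_ok = all(pixel(r + dr, c + dc) == 0 for dr, dc in misses)
            if hit_ok and miss_ok:
                matches.add((r, c))
    return matches


# Toy pattern: a 2x2 ink block with clear pixels directly left and right of it.
HITS = [(0, 0), (0, 1), (1, 0), (1, 1)]
MISSES = [(0, -1), (1, -1), (0, 2), (1, 2)]

page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
found = hit_miss(page, HITS, MISSES)
```

Only the isolated 2x2 block matches. The wider 2x3 block is rejected because the "miss" pixels next to it are ink, which is what makes the transform selective about shape rather than just about the presence of ink.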

I've been working on some rudimentary Ruby bindings to make the library slightly easier to use and quicker to try stuff out (http://github.com/horndude77/leptonica- ... ree/master). As a side benefit, I'm hoping that packaging it as a gem will make it relatively easy to try out (I'm not quite there yet).

Post by **musicprog** » Mon Jan 12, 2009 2:11 pm

This program uses the SURF (Speeded-Up Robust Features) algorithm to search for the STAMP. Normally it takes 30-45 seconds to find the coordinates of the stamp in a 300dpi A4 page. You can try this program, as the accuracy level is really good. The STAMP image can contain any number of connected components. The extracted features are also size/orientation invariant.

This program also supports uncompressed grayscale TIFF images.

Post by **Leonard Vertighel** » Mon Jan 12, 2009 3:14 pm

I've been looking at some of the OM scores. Here are some of the issues I found:

* Some logos are not rectangular, and they may be very close to other elements of the page. The algorithm should therefore be able to match and erase non-rectangular shapes.
* Some logos are cut off at the top and/or the sides. Partial matching would be necessary to deal with those cases.
* Some color logos are dithered, and the dithering pattern may change from file to file. There also seems to be some JPEG noise at least in some of the files. In order to deal with those logos, the algorithm would need to perform some kind of fuzzy matching.

I'm not sure how well the proposed algorithms can handle these cases. Musicprog's approach for example seems to erase only rectangles, so it may not (yet) be general enough to handle all of the OM logos.

Post by **musicprog** » Tue Jan 13, 2009 12:37 am

The original find_obj program of OpenCV is actually capable of detecting all four corners of a STAMP, so a STAMP that does not sit in an axis-aligned rectangle can still be located. My program is simplified to report only the two diagonally opposite corners. I need to check with actual samples to find out how usable SURF is.
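If all four corners were reported, the erased region would not have to be the bounding rectangle. The following is a minimal sketch under that assumption, not code from the tool: `inside_convex` and `erase_quad` are hypothetical names, the corners are assumed to be given in order around a convex quadrilateral, and the image is a plain list of rows of gray values. Logos with truly irregular outlines would need a pixel mask instead.

```python
def inside_convex(poly, x, y):
    """True if (x, y) lies inside or on the edge of the convex polygon `poly`."""
    signs = []
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Cross product tells which side of the edge the point is on.
        cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
        if cross != 0:
            signs.append(cross > 0)
    # Inside means the point is on the same side of every edge.
    return all(signs) or not any(signs)


def erase_quad(image, corners, white=255):
    """Paint every pixel inside the quadrilateral white, clipped to the image."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    for y in range(max(min(ys), 0), min(max(ys), len(image) - 1) + 1):
        for x in range(max(min(xs), 0), min(max(xs), len(image[0]) - 1) + 1):
            if inside_convex(corners, x, y):
                image[y][x] = white
    return image


# Toy 5x5 black page with a diamond-shaped (rotated) stamp region.
page = [[0] * 5 for _ in range(5)]
erase_quad(page, [(2, 0), (4, 2), (2, 4), (0, 2)])
```

The corners of the page stay untouched, whereas a bounding-rectangle erase would have wiped the whole 5x5 area.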

Post by **Carolus** » Fri Jan 16, 2009 7:49 pm

First of all, let me say how very impressive it is to see the level of thought and problem-solving being applied to this issue (the automated removal of logos). Here's another challenge for the collective wisdom here:

As many of you know, Google has scanned a fair number of music scores from the Harvard Library, University of Michigan and other places. In addition to populating some of these files with their logo (which is really very simple to remove with an Acrobat plug-in called PitStop), metatags, etc., Google has done a very strange thing whereby the majority of the scan is 600 dpi monochrome, but isolated systems (not even a complete page) appear in 150 dpi grayscale. When it comes to printing these files, this bizarro-world feature slows things down tremendously, sometimes even causing a printer crash.

Is there any way to process the Google scans in a way so that they would all be a uniform 600 dpi (or even 300 dpi) monochrome? As I mentioned before, there is a nice collection of scores many of which would be excellent additions to the archive here. An automated method of processing the Google scores would certainly be a help along with one for stripping logos. Microsoft scans, which aren't nearly as numerous, are more like CDSM in that the embedded logo is not easily removed.

Post by **Leonard Vertighel** » Fri Jan 16, 2009 9:20 pm

One thing is clear: detail which is lost due to too low resolution cannot be restored, period. It may be possible to adopt some more sophisticated algorithm to get smoother edges than one would get with simple thresholding, but I'm not sure if there is anything which a) can be applied automatically without manual tweaking, and b) which actually improves legibility.

Have you tried what results can be obtained with thresholding? You may have to play with the threshold level to find a good value which doesn't destroy more details. The quality of the result can probably best be judged by actually printing it, as the impression on the screen may be misleading.

(Apparently the software Google is using is not suited for music scores. My guess would be that it tries to distinguish between text (600dpi monochrome) and images (150dpi greyscale), but it gets confused by musical notation.)

Post by **horndude77** » Fri Jan 16, 2009 9:50 pm

For 150dpi grayscale the best thing you can do is upsample (with bicubic interpolation if possible, but I think linear interpolation would work OK) to 600dpi and then convert the image to monochrome. It won't look as good as if it were scanned at the higher resolution, as Leonard said, but it might be acceptable. This morning I compared a scan I did at 300dpi grayscale, then upsampled to 600dpi and converted to monochrome, with a scan of the same page at 600dpi grayscale converted to monochrome. It wasn't terribly easy to tell the difference. My wife thought that the 600dpi source image did look slightly better, but I couldn't tell the difference.
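The upsample-then-threshold idea can be sketched in a few lines of pure Python. This is only an illustration with linear (bilinear) interpolation rather than bicubic. The names `upsample_bilinear` and `to_monochrome`, and the cutoff of 128, are assumptions. Going from 150dpi to 600dpi corresponds to a factor of 4.

```python
def upsample_bilinear(gray, factor):
    """Enlarge a grayscale image (list of rows) by an integer factor."""
    rows, cols = len(gray), len(gray[0])
    out = []
    for r in range(rows * factor):
        # Map the output pixel centre back into source coordinates.
        sy = min(max((r + 0.5) / factor - 0.5, 0.0), rows - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, rows - 1)
        fy = sy - y0
        line = []
        for c in range(cols * factor):
            sx = min(max((c + 0.5) / factor - 0.5, 0.0), cols - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, cols - 1)
            fx = sx - x0
            top = gray[y0][x0] * (1 - fx) + gray[y0][x1] * fx
            bottom = gray[y1][x0] * (1 - fx) + gray[y1][x1] * fx
            line.append(top * (1 - fy) + bottom * fy)
        out.append(line)
    return out


def to_monochrome(gray, cutoff=128):
    """Return 1 for ink (darker than the cutoff) and 0 for paper."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


# Toy 2x2 "150dpi" patch: one dark pixel, three white ones.
low = [[0, 255], [255, 255]]
high = upsample_bilinear(low, 4)   # 8x8, as if rescanned at 600dpi
mono = to_monochrome(high)
```

Interpolating before thresholding is what softens the staircase edges. Thresholding first and enlarging afterwards would only produce bigger square pixels.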

Do you have an example from Google that you want fixed? I don't think Google's logos should be that difficult to remove.

Also I'd bet that Leonard is right that Google is trying to automatically differentiate between images and text, but music confuses it.

Post by **Carolus** » Fri Jan 16, 2009 11:05 pm

OK, take a look at p.16 (page no.11) of the file at the following link:
http://books.google.com/books?id=8f4QAA ... t=ALLTYPES

Notice the oddity on the second system down. Removing the logo itself is easy, but the weirdness of the grayscale on that particular page makes an older HP LaserJet choke. I am also curious as to what possible benefit there could be in randomly having a segment appear in grayscale while most of the scan is a straightforward 600 dpi monochrome.

Post by **horndude77** » Sat Jan 17, 2009 2:47 am

When I view the pdf that particular system is just missing. (I've been wondering why full pages were missing in this one.) My linux pdf viewers choke on these google pdfs, but my wife's mac seems to handle them ok.

One nice thing about these files is that the google logo is actually a separate image on top of the original image!

In any case I can't do much with it. I might try to file a bug report with some Linux PDF viewer developers (if I can figure out where the problem is. Any ideas? okular? evince? xpdf? jbig-dec?). Or perhaps with Google for misclassifying music.

Post by **horndude77** » Sat Jan 17, 2009 4:14 am

For those interested, it seems this bug has been fixed, but Ubuntu doesn't include the fix yet: https://bugs.freedesktop.org/show_bug.cgi?id=15629.

Post by **Leonard Vertighel** » Sat Jan 17, 2009 8:06 am

Pdfimages 3.00 can't handle it: The grayscale parts are returned as blank images. Even if they were returned correctly, pdfimages would lose the positioning information. At the moment, I wouldn't even know where to start with those files...

Carolus wrote: I am also curious as to what possible benefit could there be in randomly having a segment appear in grayscale while most the scan is a straightforward 600 dpi monochrome.

As I was saying above (and horndude agreed), this is presumably an error of the scanning software. If the segment actually was an image (as presumably the software mistakenly assumed), then it would actually make sense to have it in grayscale rather than monochrome.

Post by **Leonard Vertighel** » Sat Jan 17, 2009 9:17 am

OK, here's a "proof of concept":
First I converted the page to a single 600dpi pixel image with GraphicsMagick:

Code:

gm convert -density 600x600 "Petite_suite.pdf[15]" test.png

Then I thresholded it in GIMP (could have done it in GraphicsMagick, but I wanted the visual feedback for testing) at 178/255 (which also "magically" erases the Google logo), converted to 1-bit indexed and saved as TIFF with G4 compression. Finally I made a PDF from it:

Code:

tiff2pdf -o test.pdf test.tiff

The second system is clearly more jagged than the rest, but printed on a 600dpi laser printer I find it's still fairly legible.

Have a look:
http://www.ducklairtower.de/temp/test.pdf
What do you think?

[Edit] Here's how to generate the TIFF in one step:

Code:

gm convert -density 600x600 -threshold 69% -compress Group4 "Petite_suite.pdf[15]" test.tiff

It needs quite some resources however: on my notebook (1GHz) it took around 15 seconds for a single page. Seems to require lots of memory, too.
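The 69% in the one-step command is just the GIMP level of 178 expressed as a fraction of 255 and rounded down. The sketch below shows that arithmetic and what such a threshold does to pixel values. It is plain Python for illustration, not GraphicsMagick code, and the names `level_to_percent` and `threshold` are made up. It assumes, as the result above suggests, that the logo pixels are lighter than the cutoff.

```python
def level_to_percent(level, max_level=255):
    """Express an 8-bit gray level as a percentage of full scale."""
    return 100.0 * level / max_level


def threshold(gray, percent, max_level=255):
    """Pixels brighter than the cutoff become white, all others black."""
    cutoff = percent / 100.0 * max_level
    return [[max_level if value > cutoff else 0 for value in row]
            for row in gray]


percent = level_to_percent(178)            # about 69.8

# Toy row: dark ink (100), a gray just above the cutoff (176), pale gray (200).
row = threshold([[100, 176, 200]], 69)
```

Anything paler than roughly 176 out of 255 turns into paper white, which would explain why a light gray logo disappears along with the background noise while the darker notes survive.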

Post by **Carolus** » Fri Jan 23, 2009 5:15 am

That's pretty impressive, Leonard. It certainly prints out much faster on my equipment. I've been uploading quite a few Google scores in the past few days, so anyone who wishes to work on them should feel free to do so. They have already been stripped of logos, etc.

There's a nice Belaieff full score for Scriabin's Symphony No.3, for example. I suppose the ultimate thing for dealing with scanned music would be to find a way to convert the images to a vector-based format, which would allow for scaling and generally enhance the existing image - filling in gaps in staff-lines, etc. Of course, that's probably as far off as really functional music OCR.

Post by **naja** » Mon Jan 26, 2009 9:06 pm

Hi,

As we all know, scanning books and processing crap PDF files is a lot of work. I figured we should combine our knowledge and share it with new users. To this end I'm working on a wiki site to collect this kind of knowledge. Please have a look at it. This is a guide on scanning that I have written:

http://filesharing.wikidot.com/guide:sharing-documents

If you want to write similar guides, or add a section about removing logos or about doing it with open source software, etc., feel welcome to join the site and/or contribute.

greets,
naja

IMSLP Forums

STAMP remover based on OpenCV find_obj.exe