stripping logos from scanned PDF files

Advice and Help

Moderator: kcleung

horndude77
active poster
Posts: 293
Joined: Sun Apr 23, 2006 5:08 am
notabot: YES
notabot2: Bot
Location: Phoenix, AZ

Post by horndude77 »

With GIMP, I usually use the rectangle select tool (press 'r'), select the region, then press Ctrl-'.' to fill the selection with the background color, which blanks the rectangle. It's faster than the eraser.

One tool that would be useful to make this a snap would be to give a sample of what to erase. Then it would automatically look for things that look like the sample picture and erase them. I don't think we need rotation or scaling so that should make it fairly simple. It seems very doable.
kcleung
Copyright Reviewer
Posts: 127
Joined: Fri Sep 12, 2008 9:38 pm

Post by kcleung »

First of all, thanks for your hint on gimp.
horndude77 wrote: One tool that would be useful to make this a snap would be to give a sample of what to erase. Then it would automatically look for things that look like the sample picture and erase them. I don't think we need rotation or scaling so that should make it fairly simple. It seems very doable.
That tool would be great! However, it would require some sort of computer vision algorithm (at least shape or object recognition), which would in turn either need huge computational resources or be slow, whereas if we do it manually we can process an image within 5 seconds. I may be wrong, though.

Anyway, have you seen such a tool?

Post by horndude77 »

No, I haven't seen such a tool. Yes, it would be processor intensive, but if it worked I wouldn't mind letting it run over a weekend to get a full CD done while I'm out at the park.

In any case I started a stupid script to work on this problem which just blanks rectangles. It only works with one of the files I have. I don't think it could be made more general, but perhaps it could be a good starting point.

Code:

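#!/bin/sh
# Blank the logo rectangle on every page image of a scanned PDF.
# Coordinates are "x0,y0 x1,y1" in pixels and only fit one particular file.
RIGHT_COORDINATES="1890,240 2400,370"
LEFT_COORDINATES="310,270 815,400"
CENTER_COORDINATES="1115,140 1725,370"

PDF=$1
FILENAME=testing
PREFIX=prefix
# TITLE was never set; fall back to the output name unless the environment supplies one.
TITLE=${TITLE:-$FILENAME}

# Extract the page images from the PDF as files named PREFIX-NNN.<ext>.
pdfimages "$PDF" "$PREFIX"

LAST=right
for i in "$PREFIX"*
do
    # PPM pages use the center rectangle. After that, alternate between
    # left and right, starting with left.
    EXT=$(echo "$i" | sed -e 's_^[^.]*__')
    if [ "$EXT" = '.ppm' ]
    then
        COORD=$CENTER_COORDINATES
        LAST=right
    elif [ "$LAST" = 'right' ]
    then
        COORD=$LEFT_COORDINATES
        LAST=left
    else
        COORD=$RIGHT_COORDINATES
        LAST=right
    fi
    # Paint the rectangle white and save as a bi-level, Group 4 compressed TIFF.
    OUT_FILE=$(echo "$i" | sed -e 's_\.[^.]*_.tiff_')
    convert "$i" -fill white -draw "rectangle $COORD" -monochrome -compress group4 "$OUT_FILE"
done

# Join the pages into one multi-page TIFF, then wrap it in a PDF.
tiffcp "$PREFIX"*.tiff out.tiff
tiff2pdf -t "$TITLE" -z -o "$FILENAME.pdf" out.tiff

# Remove the extracted images and the per-page TIFFs.
rm "$PREFIX"*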
(Looking at it now I noticed that I forgot to set the dpi in the output images.)

Post by kcleung »

Thanks a lot! I will look into your script too. But won't your program also accidentally remove rectangles which are meant to be there? (For example, an instrument name may be enclosed by a rectangle.)

Post by horndude77 »

http://github.com/horndude77/image-scripts/tree/master

I spent a half hour putting together a simple program which searches a bi-level image for a smaller image and removes that section of the target image (i.e. logo removal). Yes, it's slow, but on the images I've tried it only takes a minute or two.

Shortcomings:
- It only works on PBMs right now.
- Search space is hard coded.
- Search is not directed (just try everything).
- There is no concept of 'good enough' in the search. For example, if only 30 pixels differ in a section then it is probably the desired section (which contains thousands of pixels).

I'll hopefully clean it up a bit more soon, but an approach like this is much better than removing them by hand.
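
A minimal sketch of the kind of search described above, including the missing 'good enough' tolerance. This is not the Java program in the linked repository: it is an illustration in Python, and the function names, the list-of-rows image representation (0 = white, 1 = black) and the `max_mismatch` parameter are all assumptions made for the example.

```python
def find_matches(page, logo, max_mismatch=0):
    """Return the (row, col) of every position where logo matches page.

    page and logo are lists of rows of 0 (white) / 1 (black) pixels.
    A position counts as a match when at most max_mismatch pixels differ.
    """
    page_h, page_w = len(page), len(page[0])
    logo_h, logo_w = len(logo), len(logo[0])
    hits = []
    # Undirected search: try every possible top-left corner.
    for r in range(page_h - logo_h + 1):
        for c in range(page_w - logo_w + 1):
            mismatches = 0
            for y in range(logo_h):
                page_row = page[r + y]
                logo_row = logo[y]
                for x in range(logo_w):
                    if page_row[c + x] != logo_row[x]:
                        mismatches += 1
                        if mismatches > max_mismatch:
                            break  # too different, stop comparing this row
                if mismatches > max_mismatch:
                    break  # give up on this position early
            if mismatches <= max_mismatch:
                hits.append((r, c))
    return hits


def remove_logo(page, logo, max_mismatch=0):
    """Blank (set to white) every matched region in place; return the hits."""
    hits = find_matches(page, logo, max_mismatch)
    for r, c in hits:
        for y in range(r, r + len(logo)):
            for x in range(c, c + len(logo[0])):
                page[y][x] = 0
    return hits


# Tiny demo: a 3x2 "logo" pasted at row 1, column 2 with one pixel flipped.
logo = [[1, 0, 1],
        [0, 1, 0]]
page = [[0] * 8 for _ in range(5)]
page[1][2:5] = [1, 0, 1]
page[2][2:5] = [0, 0, 0]  # the middle pixel should be 1

print(find_matches(page, logo))     # [] : an exact search misses it
print(remove_logo(page, logo, 1))   # [(1, 2)] : one differing pixel is tolerated
```

The early exit once the mismatch budget is exceeded is what keeps a brute-force search like this tolerable: most positions are rejected after a handful of pixel comparisons.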
tilmaen
forum adept
Posts: 85
Joined: Fri Nov 21, 2008 2:10 pm

Post by tilmaen »

@ horndude:
awesome! the program is written in java - but is there any way of getting a precompiled version of that program? would it be possible to port this either to gimp or make it a pdfsam plugin?

i also just realized that it might be possible to remove the logos using virtualdub, after making all the images into a single movie. for virtualdub there are scripts like logoaway (http://www.voidon.republika.pl/virtualdub/) that could be utilized.
another addition: i found these plugins to be more promising:
http://www.compression.ru/video/image_r ... ex_en.html and
http://www.compression.ru/video/logo_re ... ex_en.html
for the latter, one would have to separate the even and odd pages so that the logo appears in the same area on every image.

greetings
tilmaen

Post by horndude77 »

Removing a logo from a bi-level static image is very different from removing a logo from video. Also I believe the programs you linked to rely on the logo being in the same place from one frame to the next. This isn't always the case for images.

(I wonder how a program like this would work with watermark removal? Often these tv logos look similar to watermarks to me.)

As for making a plugin... yes, it is possible, but my goal was a command-line solution with few dependencies. The fewer buttons I have to press the better.

For now I don't have a precompiled version. Are you on Windows, Linux or OS X? I could put together some build instructions, though they will amount to something like: install the JDK, install Ant, type 'ant'.

Post by tilmaen »

hi!

no hurry! i don't have my hands on the CDs (yet) so i couldn't work with it.
i will get the cds some time next year i guess. i'm running windows, but i also have a kubuntu linux installed.
the second link (image restoration) might work - i don't know for sure but their approach of using masks could actually work for a logo.
alternatively maybe one could run an ocr software over the pdfs, detect "CD-Rom Library" and delete that.
thanks for the program though - once i get started with converting the cds that will totally increase orchestra part availability.

greetings
tilmaen

Post by tilmaen »

some people uploaded logo-infested files to this protected imslp server - i would "donate" some computing power to get the files done and uploaded. if you don't mind i'd love to use your program to do so.
no hurry, but it'd be great to get some assistance as soon as you have the time.
I'm running a kubuntu 8.10 linux.

greetings
tilmaen
aldona
active poster
Posts: 385
Joined: Mon Apr 16, 2007 11:09 pm
notabot: 42
notabot2: Human
Location: Melbourne, Australia

Post by aldona »

I guess this is sort of on a similar subject...removing logos/ watermarks from files....

There is a wealth of material from the Mendelssohn Complete Works appearing on the website of the Münchener Digitalisierungszentrum.

http://www.muenchener-digitalisierungszentrum.de/

They also have a whole mountain of interesting scores already. I don't have the tools or experience to try to strip the logos/ watermark but this could be an assignment for someone who is interested.

aldona
“all great composers wrote music that could be described as ‘heavenly’; but others have to take you there. In Schubert’s music you hear the very first notes, and you know that you’re there already.” - Steven Isserlis
Leonard Vertighel
Groundskeeper
Posts: 553
Joined: Fri Feb 16, 2007 8:55 am

Post by Leonard Vertighel »

Hi aldona, I had only a quick look at the site. Are the scans only available as separate images, or did I miss something? Is the image you get by clicking "150%" the best available quality? And is there anything else besides the small "BSB" symbol in the top left corner that needs to be removed?

Post by aldona »

As far as I can see, they are only available as separate images. If you click on "Miniaturansicht", you can get 5 images to a screen view, which you can then right-click and do other things to.

As for your other questions, my educated guess would be that yes, the 150% image looks like the best available quality, and no, I can't see anything apart from the "BSB" symbol that needed removing.

I tried to save all of the individual images for one piece (T. Boehm, Rondo a la mazurka for flute & piano, Op.36), then combine them and convert to black-&-white PDF, but the quality of the finished images was very poor (almost to the point of being unreadable). I'm sure there are tools and techniques available to get around this, but I don't have them (or the skills).

Good luck!

Aldona
kalliwoda
active poster
Posts: 504
Joined: Fri Dec 19, 2008 8:36 pm
notabot: YES
notabot2: Bot
Location: Berlin, Germany

Bavarian State Library (BSB)

Post by kalliwoda »

Hallo Aldona,

I had a look at this collection several weeks ago; too bad they don't follow the good example of the Danish National Library. Saving a large work as single jpeg pages is a pain in the ***
(And there are only two titles featuring the oboe (Onslow), both already available at IMSLP in better resolution...)

A few comments: many works also offer a 200% view option, and the resulting jpeg has twice the pixel dimensions (2000 x 2500 pixels) of the 100% view (1000 x 1250 pixels). In miniature view your downloads are also 1000 x 1250 pixels.
The larger size works out to about 250 dpi on an A4 page. Apparently their master scans (not available on the web) are only at 300dpi.
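
A quick check of that arithmetic, as a sketch: it assumes the 2000 x 2500 pixel image is scaled to fill an A4 sheet (210 x 297 mm) exactly, which the post does not state. The function name is made up for the example.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # width, height


def effective_dpi(pixels, paper_mm=A4_MM):
    """Return (horizontal, vertical) dpi if the image fills the sheet exactly."""
    return tuple(px / (mm / MM_PER_INCH) for px, mm in zip(pixels, paper_mm))


dpi_w, dpi_h = effective_dpi((2000, 2500))
print(round(dpi_w), round(dpi_h))  # 242 214
```

Under that assumption the figure comes out somewhat below 250 dpi, with the page height as the limiting dimension.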

With converting to greyscale and adjusting contrast in photoshop (try contrast 60, brightness 20) I got quite decent prints on my laserprinter, but converting to B/W with a threshold of 50% usually removes too much from the image.
You can convert greyscale to pdf...

But maybe the best solution would be to request a different setup for the printed music (after all, the interests of a performing musician are quite different from those of someone interested in a rare book with over 1000 pages).
I'll try my luck suggesting some changes...

Kalliwoda
Lyle Neff
active poster
Posts: 702
Joined: Wed Mar 14, 2007 3:21 pm
notabot: 42
notabot2: Human
Location: Delaware, USA

Re: Bavarian State Library (BSB)

Post by Lyle Neff »

kalliwoda wrote:[...] Saving a large work as single jpeg pages is a pain in the *** [...]
:lol: :lol: :lol: :lol: :lol: :lol: :lol: :lol:

("***" was César Cui's pseudonym in the Russian press for many years at the beginning of his side-career as a music critic.)

:lol: :lol: :lol: :lol:
"A libretto, a libretto, my kingdom for a libretto!" -- Cesar Cui (letter to Stasov, Feb. 20, 1877)

Re: Bavarian State Library (BSB)

Post by Leonard Vertighel »

kalliwoda wrote:With converting to greyscale and adjusting contrast in photoshop (try contrast 60, brightness 20) I got quite decent prints on my laserprinter, but converting to B/W with a threshold of 50% usually removes too much from the image.
You can convert greyscale to pdf...
If a threshold of 50% removes too much, you should try moving the threshold as close to white as possible, without adding too much noise to the background. That may well be a value around 80-90% (assuming that 100% corresponds to pure white).

I'm not saying that this will necessarily work for all files; but if it does, B/W has the advantage of smaller PDFs (with group4 or better compression), meaning less storage space, faster downloads, and potentially faster printing.
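
To make the threshold convention concrete, here is a small sketch in Python. It assumes 8-bit greyscale values where 0 is black and 255 is white, and the function name and sample values are invented for the illustration; a real conversion would be done in an image editor or a tool such as ImageMagick.

```python
def to_bilevel(gray_rows, white_threshold=0.85):
    """Convert 8-bit greyscale rows (0 = black, 255 = white) to 0/1 pixels.

    A pixel becomes black (1) when it is darker than white_threshold * 255,
    so a threshold closer to 1.0 keeps more of the faint strokes.
    """
    cutoff = white_threshold * 255
    return [[1 if value < cutoff else 0 for value in row] for row in gray_rows]


# One row running from a solid stroke to a nearly white background.
row = [[30, 140, 200, 235, 250]]
print(to_bilevel(row, 0.50))  # [[1, 0, 0, 0, 0]] : faint strokes are lost
print(to_bilevel(row, 0.85))  # [[1, 1, 1, 0, 0]] : faint strokes survive
```

Pushing the threshold further toward 1.0 would eventually turn the background values (235, 250) black as well, which is the noise to watch for.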