FTP Server

Messages from and Discussions about IMSLP

Moderator: kcleung

Leonard Vertighel
Groundskeeper
Posts: 553
Joined: Fri Feb 16, 2007 8:55 am

Post by Leonard Vertighel »

Only logos were removed (I'm sure horndude will confirm this). All page headers (every single page has a header, not only the title pages) remained unchanged in the process. All of these headers were evidently digitally inserted into the image files, i.e. they are not part of the original scan.

Horndude is basically using a "search and replace" algorithm, which searches each page for a copy of the logo image and replaces it with white pixels (his software repository is linked in one of his earlier posts). Clearly you can't do this for the headers, as they are different for each file.
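
To make the "search and replace" idea concrete, here is a minimal sketch. It is not horndude's actual code (his tool builds on Leptonica); it assumes pages and the logo are plain black-and-white bitmaps stored as nested lists, and the names `find_logo` and `erase_logo` are made up for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: 0 = white pixel, 1 = black pixel


def find_logo(page: Image, logo: Image) -> List[Tuple[int, int]]:
    """Return the top-left (row, col) of every exact copy of `logo` in `page`."""
    page_h, page_w = len(page), len(page[0])
    logo_h, logo_w = len(logo), len(logo[0])
    hits = []
    for r in range(page_h - logo_h + 1):
        for c in range(page_w - logo_w + 1):
            # Compare the window row by row against the logo
            if all(page[r + i][c:c + logo_w] == logo[i] for i in range(logo_h)):
                hits.append((r, c))
    return hits


def erase_logo(page: Image, logo: Image) -> int:
    """Overwrite every copy of `logo` with white pixels; return how many were erased."""
    hits = find_logo(page, logo)
    logo_h, logo_w = len(logo), len(logo[0])
    for r, c in hits:
        for i in range(logo_h):
            page[r + i][c:c + logo_w] = [0] * logo_w
    return len(hits)


logo = [[1, 0, 1],
        [0, 1, 0]]
page = [[0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 0, 0]]
print(erase_logo(page, logo))  # number of logo copies that were whited out
print(page)                    # the bottom row (real content) is left untouched
```

This only works because the logo is pixel-identical on every page. A header whose text changes from file to file gives the search nothing fixed to look for.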
ras1
active poster
Posts: 164
Joined: Thu Jul 26, 2007 8:28 pm

Post by ras1 »

Here's an idea:

All titles added by the company are in the same font (and presumably at the same size). Would it be plausible to have an algorithm search for every letter of that font (lowercase and uppercase) separately and replace them with white pixels, which would ultimately delete the whole title? It might be too much work, but a lot of the programming would probably be copy/paste-able.
tilmaen
forum adept
Posts: 85
Joined: Fri Nov 21, 2008 2:10 pm

Post by tilmaen »

Why worry about it? I don't think they copyrighted the font, and they probably can't get a trademark/copyright on the name of the piece.
Everything but the logos should be considered trivial changes/edits, don't you think?

greetings
Leonard Vertighel
Groundskeeper
Posts: 553
Joined: Fri Feb 16, 2007 8:55 am

Post by Leonard Vertighel »

I don't think that ras1's proposal is realistically feasible. Horndude mentioned a processing time of 1-2 minutes per search. Thus, searching for all letters of the alphabet, both upper and lower case (52 separate searches, one after the other), would take roughly 1-2 hours per page, corresponding to a throughput of maybe 10-20 pages per day.
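
The back-of-the-envelope estimate above can be spelled out as follows. The only inputs taken from the thread are the 52 letters and the 1-2 minutes per search; running one machine around the clock is my assumption.

```python
LETTERS = 26 * 2            # upper and lower case, one search per letter
MINUTES_PER_DAY = 24 * 60   # assumption: a single machine running non-stop


def pages_per_day(minutes_per_search: float, searches: int = LETTERS) -> float:
    """Pages processed per day if every search runs one after the other."""
    minutes_per_page = minutes_per_search * searches
    return MINUTES_PER_DAY / minutes_per_page


for minutes in (1, 2):
    print(f"{minutes} min/search: {minutes * LETTERS} min/page, "
          f"{pages_per_day(minutes):.1f} pages/day")
```

That gives 52 to 104 minutes per page, i.e. somewhere around 14 to 28 pages per day in the best case, which is the same order of magnitude as the rough figure above.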

And this is without taking into account that in fact the font size isn't always the same (not sure about the font family), and that there could possibly be diacritical marks and letters from non-English alphabets, etc. And there might well be other problems, like slightly different pixel renderings of the same letter depending on its alignment with respect to the pixel grid, which would call for some kind of fuzzy matching algorithm...
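
For what it's worth, "fuzzy matching" could be as simple as tolerating a small fraction of differing pixels. The sketch below is only an illustration of that idea on black-and-white bitmaps stored as nested lists; the function names and the tolerance value are made up, and nothing like this exists in horndude's program as far as I know.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumption: 0 = white pixel, 1 = black pixel


def mismatch_fraction(page: Image, template: Image, top: int, left: int) -> float:
    """Fraction of pixels that differ between `template` and the page window at (top, left)."""
    h, w = len(template), len(template[0])
    differing = sum(
        page[top + i][left + j] != template[i][j]
        for i in range(h)
        for j in range(w)
    )
    return differing / (h * w)


def fuzzy_find(page: Image, template: Image, tolerance: float) -> List[Tuple[int, int]]:
    """Return every window position whose mismatch fraction is at most `tolerance`."""
    h, w = len(template), len(template[0])
    return [
        (r, c)
        for r in range(len(page) - h + 1)
        for c in range(len(page[0]) - w + 1)
        if mismatch_fraction(page, template, r, c) <= tolerance
    ]


letter = [[1, 1, 1],
          [1, 0, 1],
          [1, 1, 1]]
# The same shape rendered with one corner pixel missing
page = [[0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 1, 0, 1, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0]]
print(fuzzy_find(page, letter, tolerance=0.0))   # exact matching finds nothing
print(fuzzy_find(page, letter, tolerance=0.15))  # one differing pixel out of nine is accepted
```

The catch is the one mentioned above: the looser the tolerance, the greater the risk of erasing note heads or other real content that happens to resemble a letter.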

In short, I don't believe that horndude's program is adaptable for this kind of task. And even if it were, it would leave us with files with no headings at all, and I'm not sure we want that.

I tend to agree with tilmaen...
horndude77
active poster
Posts: 293
Joined: Sun Apr 23, 2006 5:08 am
Location: Phoenix, AZ

Post by horndude77 »

Removing the top matter automatically isn't infeasible, just annoying. You'd have to do the same thing as for the logos, but separately for every PDF, since the header is different in each one. At that point you may as well edit it by hand (almost). I don't think removing by font would be easy because it's essentially an OCR problem. It might be possible to pick out similar components that are near the top of each page and remove them. This, however, would require multiple pages in the PDF and could possibly remove actual content.

Also I sped up the search somewhat by skipping initial blank lines (duh!). It takes 10-15 seconds per page I'd guess.
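
Roughly, the blank-line shortcut looks like the sketch below. This is an illustration on bitmaps stored as nested lists rather than the actual code, the helper names are hypothetical, and it assumes the top row of the logo contains at least one black pixel (otherwise a match could legitimately start inside the blank margin).

```python
from typing import List

Image = List[List[int]]  # assumption: 0 = white pixel, 1 = black pixel


def first_nonblank_row(page: Image) -> int:
    """Index of the first row containing a black pixel, or len(page) if the page is blank."""
    for r, row in enumerate(page):
        if any(row):
            return r
    return len(page)


def candidate_rows(page: Image, logo_height: int) -> range:
    """Rows where the logo's top edge could sit, skipping the leading blank rows."""
    start = first_nonblank_row(page)
    return range(start, len(page) - logo_height + 1)


page = [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
print(list(candidate_rows(page, logo_height=2)))  # only the rows below the blank margin remain
```

On a scanned page with a wide top margin this cuts out a large share of the window comparisons before the search even starts.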
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Post by Carolus »

It's not what I'd term a high priority, aesthetically desirable as it may be. That Edgar score I did was cleaned up manually, except for the new page numbers.
Leonard Vertighel
Groundskeeper
Posts: 553
Joined: Fri Feb 16, 2007 8:55 am

Post by Leonard Vertighel »

Carolus wrote:It's not what I'd term a high priority, aesthetically desirable as it may be.
Well, I agree that in other collections I have seen some pretty ugly additions with utterly unsuited fonts, but as far as the OM scores are concerned, I thought they were rather decent (though certainly not perfect). What exactly is it that you dislike about the OM scores?
Carolus
Site Admin
Posts: 2249
Joined: Sun Dec 10, 2006 11:18 pm

Post by Carolus »

Actually, I was thinking of other collections rather than the OM scores, which I am not as familiar with.
Leonard Vertighel
Groundskeeper
Posts: 553
Joined: Fri Feb 16, 2007 8:55 am

Post by Leonard Vertighel »

Carolus wrote:Actually, I was thinking of other collections rather than the OM scores, which I am not as familiar with.
Sorry for the misunderstanding then - from the context I had assumed this was about the OM scores.

It seems that some of the other collections will be very challenging, even if we settle for trademark removal only. For example, there are currently some Beethoven scores on the server which have a logo scaled to different sizes and even different aspect ratios - not sure if we can ever automatically clean them up. It would certainly require more sophisticated tools than what we have available right now...
Generoso
active poster
Posts: 266
Joined: Mon Mar 12, 2007 1:49 pm

Post by Generoso »

I have uploaded to the FTP server some more files that need some stamps removed. The folder name is "CelloCD Works".

Thanks to horndude77 for his magic in removal of these unwanted artifacts in the previous 550 or so files.

More to come after these.
horndude77
active poster
Posts: 293
Joined: Sun Apr 23, 2006 5:08 am
Location: Phoenix, AZ

Post by horndude77 »

More files with the logos removed have been uploaded to the FTP server. I didn't get any major errors when running the script, so I assume that all logos have been removed. The PDFs probably need to be proofread to make sure nothing slipped through.
kcleung
Copyright Reviewer
Posts: 127
Joined: Fri Sep 12, 2008 9:38 pm

Post by kcleung »

Thanks Horndude77!

I also didn't see anything that slipped through. You are a real hero in this project.

I have slightly amended your script to allow automated batch processing of OM files, volume by volume.

ftp://imslp.org/OP Project/batch_script/

I've included the pre-compiled liblept.so.1.60 for both x86 and amd64 for Ubuntu 8.10 onwards and updated documentation (README) explaining how to set up and run the script to batch-process OM stuff.

Please have a good look at my README and tell me what you think. The script is designed to process each full volume in one go. As explained in the README, I tried to standardize instrument names and avoid spaces and non-ASCII characters in names to keep different filesystems (and my script) happy.

On a C2Q computer, it should take around 1.5 hours for this script to process a volume without user intervention.
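
To illustrate the kind of name clean-up meant here (no spaces, no non-ASCII characters), a small sketch follows. It is not taken from the batch script, which renames the folders by hand; `safe_name` and the example instrument names are hypothetical.

```python
import re
import unicodedata


def safe_name(name: str) -> str:
    """Return an ASCII-only version of `name` with no spaces."""
    # Split accented letters into base letter + combining mark, then drop anything non-ASCII
    ascii_only = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Turn runs of characters outside a conservative safe set into a single space
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", " ", ascii_only)
    # Finally replace the remaining whitespace runs with underscores
    return re.sub(r"\s+", "_", cleaned.strip())


for folder in ("Violoncelle 1", "Flûte & Hautbois", "Cor anglais (Fa)"):
    print(folder, "->", safe_name(folder))
```

Characters without an ASCII base letter (for example "ß") are simply dropped by this approach, so the results still deserve a quick look before a volume is processed.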
horndude77
active poster
Posts: 293
Joined: Sun Apr 23, 2006 5:08 am
Location: Phoenix, AZ

Post by horndude77 »

Great! I'm glad you built leptonica to make it easier for others to use. The README looks good. I'm sorry I don't have much time to review in depth right now.

A couple other observations from working with the script:
- Look at line 63 of clean_pdf.rb. I added a pause so that the files could be checked/edited before the images are compiled into a pdf. I found it useful on scores where removal didn't work well. (musicprog has been asking me to take a look for a while now. It's slower from what I understand, but I'm wondering if it will fix these problem files. I still need to do this.)
- I found the 'harp and others' volumes are somewhat problematic with this approach. Manual work is required in renaming files.
- The utf-8 characters don't seem to work well in the pdf titles. I haven't investigated this much.

Good luck!
kcleung
Copyright Reviewer
Posts: 127
Joined: Fri Sep 12, 2008 9:38 pm

Post by kcleung »

horndude77 wrote:Great! I'm glad you built leptonica to make it easier for others to use. The README looks good. I'm sorry I don't have much time to review in depth right now.

A couple other observations from working with the script:
- Look at line 63 of clean_pdf.rb. I added a pause so that the files could be checked/edited before the images are compiled into a pdf. I found it useful on scores where removal didn't work well. (musicprog has been asking me to take a look for a while now. It's slower from what I understand, but I'm wondering if it will fix these problem files. I still need to do this.)
- I found the 'harp and others' volumes are somewhat problematic with this approach. Manual work is required in renaming files.
- The utf-8 characters don't seem to work well in the pdf titles. I haven't investigated this much.

Good luck!
- The UTF-8 problem caused me to ditch file_mapper.rb and just use the original name of the file (minus .pdf) as the name of the work (KISS principle). In the batch script, I regroup the music by work, and each PDF file has its instrument name appended to the original stem. Yagan said that the copyright reviewers should be clever enough to work out the identities of the scores.

- Yes, some renaming of the instrument folders is needed to keep the filesystem and the bash scripts happy (removing spaces and non-ASCII characters), but each folder only needs to be renamed once per volume, so even though this renaming is manual, there is little work on the user's part.

- I noticed that at later stages you added a couple of extra patterns. They are really useful, and so far I haven't seen any wrong files, but if users are paranoid, they can uncomment line 63 of clean_pdf.rb to allow manual checking.
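
The regrouping by work described in the first point can be sketched like this. It is only an illustration of the naming idea, not the batch script itself; the `instrument/work.pdf` folder layout, the underscore separator, and the function name are assumptions.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def regroup_by_work(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each work (original file stem) to its parts, named '<stem>_<instrument>.pdf'."""
    grouped = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        instrument = path.parent.name  # assumption: the folder name is the instrument
        work = path.stem               # original file name minus ".pdf"
        grouped[work].append(f"{work}_{instrument}{path.suffix}")
    return {work: sorted(names) for work, names in grouped.items()}


volume = [
    "Violin_1/Symphony_5.pdf",
    "Cello/Symphony_5.pdf",
    "Cello/Overture.pdf",
]
for work, parts in regroup_by_work(volume).items():
    print(work, parts)
```

Keeping the original stem untouched means no title mapping is needed at all, which is what sidesteps the UTF-8 trouble with the PDF titles.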