Automatic manga translation/scanlation

13 years ago
Posts: 173
Spurred by the recent closing of yet another scanlation group and the worrisome future of manga scanlation, I've tried to look into this subject.
For the past couple of hours, I've tried, unsuccessfully, to find some kind of working solution for grabbing Japanese characters from an image. The idea came from eroge games, where an automated translation method already exists: you extract text from a running game (with Agth) and paste it into a translator (Atlas).
The horrible mess resulting from that automated translation is often more than adequate to grasp plot-lines and dialogue, though subtleties are often lost. Still, the fact that you're not 100% reliant on translators to enjoy the games makes up for it hugely.
If it wasn't clear already, let me make it simple. You need two parts for this to work:
- A way to extract the raw Japanese characters from the medium.
- A way to translate the extracted text.
The more automated the hand-off from step 1 to step 2 is, and the less manual input it needs, the better.
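To make the two parts concrete, here is a minimal sketch of how they could be chained so that nothing has to be copied by hand in between. Everything named here is an assumption for illustration: `read_raw_page`, `fake_ocr` and `fake_translate` are hypothetical, and the two stand-in stages would have to be swapped for a real OCR tool and a real translator.

```python
from typing import Callable


def read_raw_page(image_path: str,
                  extract_text: Callable[[str], str],
                  translate: Callable[[str], str]) -> str:
    """Run step 1 (extraction) and feed its output straight into step 2 (translation)."""
    japanese = extract_text(image_path).strip()
    if not japanese:
        # Fail loudly instead of sending an empty string to the translator.
        raise ValueError(f"no text recognised in {image_path}")
    return translate(japanese)


# Stand-in stages (hypothetical): a real setup would call an OCR program
# and a machine translator here instead.
def fake_ocr(image_path: str) -> str:
    return "こんにちは"


def fake_translate(text: str) -> str:
    known = {"こんにちは": "Hello"}
    return known.get(text, "[untranslated] " + text)


print(read_raw_page("page_001.png", fake_ocr, fake_translate))
```

Keeping the two stages as plain functions means either one can be replaced later without touching the other, which matters here because the extraction step is the part that is still unsolved.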
As already mentioned, there are multiple ways to translate the text. Even Google Translate is sufficient. The real problem is finding a way to extract the Japanese characters from the image, and this is where I'm stumped.
So far, the way I've tried to make this work is with an application called JOCR (OCR = optical character recognition). With it, you can capture any image and the program will try to recognize any characters or text in it. To work, JOCR needs Microsoft Office Document Imaging (MODI). MODI is included in MS Office 2003 and 2007, but not in 2010. Also, if you use 2007, it's not installed automatically: you need to go into the Control Panel, Add/Remove Programs, select Office, then Change, find MODI under Tools and then install for all computers.
But it's not enough! If you haven't already, you also need to install the Japanese version of the Microsoft Office Multi-Language Pack. Only after selecting Japanese in MODI should JOCR be able to work.
Needless to say, I still can't make it work even after all that, and I don't know why. JOCR refuses to recognize the characters, which is why I'm stuck. It may very well be that I failed somewhere along the way, so it may work for you.
One possible reason could be that the raw I used for testing has its characters written vertically, but even that would be fine, since I could just rearrange the text manually afterwards, as long as the actual extraction works.
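The rearranging itself would not have to be manual. As a rough sketch, assuming an OCR step that reports each recognised character together with the position of its centre on the page (that output format is an assumption, not something JOCR is known to provide), vertical text can be put back into reading order by grouping characters into columns, reading the columns right to left and each column top to bottom. The function name and the `column_tolerance` parameter are made up for this example.

```python
from typing import List, Tuple

# (character, x_centre, y_centre) in image coordinates, y growing downwards.
# This tuple layout is a hypothetical OCR output format.
Glyph = Tuple[str, float, float]


def vertical_to_horizontal(glyphs: List[Glyph], column_tolerance: float = 10.0) -> str:
    """Reorder vertically written glyphs into a single left-to-right string."""
    columns: List[List[Glyph]] = []
    # Visit glyphs from the right edge to the left edge so that columns
    # are created in Japanese reading order.
    for glyph in sorted(glyphs, key=lambda g: -g[1]):
        # A glyph joins the current column if its x position is close to
        # the x position of the first glyph placed in that column.
        if columns and abs(columns[-1][0][1] - glyph[1]) <= column_tolerance:
            columns[-1].append(glyph)
        else:
            columns.append([glyph])
    # Inside each column, read from top to bottom.
    return "".join(
        ch for column in columns for ch, _, _ in sorted(column, key=lambda g: g[2])
    )


# Two columns in scrambled order: the right column holds 今日は, the left one 晴れ.
bubble = [
    ("れ", 61, 30), ("今", 100, 10), ("晴", 60, 10),
    ("は", 98, 50), ("日", 100, 30),
]
print(vertical_to_horizontal(bubble))  # 今日は晴れ
```

The tolerance would need tuning per scan, since column spacing depends on font size and resolution.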
I've written this partly as a reference for other people interested in this, and also as a way to encourage other people to try to find a solution and post it here. I will of course update my post with any significant advancements or working methods.
Just imagine: you have a raw that you desperately want to read, but it may never be scanlated and you don't speak an ounce of Japanese. Wouldn't it be wonderful to just translate it yourself, slowly and with a couple of programs, but still? It could be the future, now that fewer and fewer scanlators are active.
13 years ago
Posts: 56
Quote from RilleL
Even Google Translate is sufficient.
No, it's not, and VN machine translations are considered unacceptable by any serious translation group.
Manga is a visual medium: if I look at the pretty pictures I can get an idea of what is going on. A terribly bad Google machine translation would confuse me more than not knowing what the moon runes mean.
What part of "Please do not put in huge images!" did you fail to understand?
13 years ago
Posts: 55
Quote from Drakron
Quote from RilleL
Even Google Translate is sufficient.
No, it's not, and VN machine translations are considered unacceptable by any serious translation group.
Manga is a visual medium: if I look at the pretty pictures I can get an idea of what is going on. A terribly bad Google machine translation would confuse me more than not knowing what the moon runes mean.
Indeed, but it's a good point to start with. When you've got three big steps to take (character extraction, translation, character insertion), it's better to take them one at a time. Automatic translation is already a big subject that is being researched, so one should probably focus on something else first.
Who knows, maybe Google Translate will one day become so good that it produces non-gibberish text which actually makes sense. ^^"
13 years ago
Posts: 56
Quote from JustPassingBy
Indeed, but it's a good point to start with.
The problem is that it just makes editing take much longer, because the gibberish pretty much forces the editors to translate the original text themselves to make sense of it. The only potential time saving would be in typesetting.
Automatic translators like the Star Trek Universal Translator are still in the realm of science fiction, and creating a program that translates scanned pages but still spews out gibberish is not the way to go, since it would only be a translation in the broadest sense of the word.
13 years ago
Posts: 257
Sounds like about as much effort as just looking up the kanji would be. Which is really the only difficult part of reading raws when you're not fluent.
Learning kana can take anywhere from a day to a few months, depending on how diligent you are. If you can read kana (you could even just use a chart, but that would be reaaally slow going) you can read any manga with furigana by the kanji, as long as you have a dictionary and maybe a grammar site open in another tab.
The problem is that it just makes editing take much longer, because the gibberish pretty much forces the editors to translate the original text themselves to make sense of it. The only potential time saving would be in typesetting.
This is mostly the reason why it would be such a pain. Whatever the case, with our current technology you'll only get sensible material from an actual translation. But it's really not that hard - even when I had only just learned kana, I was able to read through raws of Yotsuba!. It was a struggle, sure, but I managed it.
However, the idea of a character extractor would still be very useful for this more traditional method. Like I said, the biggest road-block is kanji, especially if you're reading a manga without furigana. Kanji can be confusing to look up, since most dictionaries use radicals or number of strokes and other things that would generally require you to actually have some knowledge about kanji.
If you could just extract the kanji and then copy/paste it into a dictionary, I'm sure many non-Japanese readers or Japanese-learners would have an easier time translating. So I definitely think it's an interesting idea to develop!
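Once the text exists as actual characters rather than pixels, picking out the kanji is the easy part. A small sketch, assuming the extracted sentence is already a normal Unicode string: it keeps every character in the main CJK Unified Ideographs block (U+4E00 to U+9FFF) and skips kana and punctuation. The function name is made up, and rarer kanji outside that block would be missed.

```python
from typing import List


def extract_kanji(text: str) -> List[str]:
    """Return each distinct kanji in the text, in order of first appearance."""
    found: List[str] = []
    for ch in text:
        # Main CJK Unified Ideographs block only; kana and punctuation fall outside it.
        if "\u4e00" <= ch <= "\u9fff" and ch not in found:
            found.append(ch)
    return found


sentence = "今日は学校に行きます。今日も。"
print(extract_kanji(sentence))  # ['今', '日', '学', '校', '行']
```

Each item in the result can then be pasted into a dictionary one at a time, without needing to know radicals or stroke counts.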

13 years ago
Posts: 838
It'd be kind of easy... if all manga used the same font and the scans had a fairly similar resolution. But different fonts could trip up any program made to recognize the characters, and making one able to read any font... it's a nice dream.

13 years ago
Posts: 173
It seems people have misunderstood me. I did not propose that automated scanlation replace traditional scanlation. The quality is not even close. However, the possibility of reading any raw yourself without being reliant on a translator is a huge freedom. It's a method for yourself, not a way to mass produce scanlated manga. Google Translate or Atlas is sufficient in the sense that it's possible to grasp what the text is largely about, as I wrote earlier, not sufficient as in an acceptable translation from a scanlator's point of view.
Also, the fact that it's suggested to just learn the language is a bit laughable to me. If it was so easy and everyone could do it why scanlate at all.. -.-
While it initially seems troublesome to set up a working method, everyone knows that the beginning is the hardest. Do you remember your first time registering and learning IRC? How easy is it to leech now? Or setting up Agth and Atlas? The eventual benefits are well worth the effort.
13 years ago
Posts: 27
There's only one thing that poses a problem with this, and that is, as mentioned, autoTLs. I've seen Google TL do its job well; the only problem that makes manga incompatible with autoTLs is that mangaka use almost a whole nother form of Japanese altogether. The formality used in manga is so far off the beaten path that it almost never translates correctly, at all. That's why people say to learn it: currently, there's nothing that can do it correctly aside from the human mind.
13 years ago
Posts: 257
Quote from RilleL
It seems people have misunderstood me. I did not propose that automated scanlation replace traditional scanlation. The quality is not even close. However, the possibility of reading any raw yourself without being reliant on a translator is a huge freedom. It's a method for yourself, not a way to mass produce scanlated manga. Google Translate or Atlas is sufficient in the sense that it's possible to grasp what the text is largely about, as I wrote earlier, not sufficient as in an acceptable translation from a scanlator's point of view.
Also, the fact that it's suggested to just learn the language is a bit laughable to me. If it was so easy and everyone could do it why scanlate at all.. -.-
Well, I don't think I misunderstood you. I've tried the method of using Google (I've never used Atlas, maybe it's better?) to read things, and it's really such a pain. I ended up having to look up each word by itself and then look up grammar rules so I could piece them together, because Google just mangles everything. Even "grasp what the text is largely about" is a stretch most of the time, unless every sentence is something like "Hello!" "What?" or "Baka!" And yes, I tried this for personal enjoyment and not mass distribution.
I'm not saying everyone should learn the whole language, just that doing exactly what I said above (looking up words and grammar rules) was easier for me than relying entirely on Google. Learning the language implies memorizing it and being able to translate it in your head. They're different things.

13 years ago
Posts: 161
http://www.youtube.com/watch?v=ae01yz5z99E
Now... wait a few decades for them to finish up on Japanese. 😛
I understand what you are saying:
Scanlations > autoTL's > RAWS.
With the first option becoming less and less available, I believe that the future of technology will allow autoTL's to improve significantly. Already some programs are becoming smarter by detecting grammatical mistakes in addition to misspellings. Even if someone learns kana, the most difficult part to master in any language is the huge vocabulary they hold, and simply not everyone is committed to doing that.
With some of the remaining scanlators ranging from bad to dictators (e.g., 50+ posts required per chapter for a one-day-long link), this fallback option seems like a good solution that just needs a little more work to come into common usage. d('-')b

13 years ago
Posts: 173
So I realise I'm reviving a dinosaur, but I figured I should update this topic.
I recently found out about a JOCR method that actually works. I tested it myself: I cropped a sentence from a raw, uploaded it, and it worked.
From then on you could copy that sentence and try to use Google Translate or something, but the important part is that for the first time I've used an OCR successfully, which makes me really enthusiastic for the future.
The OCR in question is:
[url]http://maggie.ocrgrid.org/nhocr/[/url]
Again, if you have a better method, please post it, like an OCR able to read top-to-bottom (vertical) text.

13 years ago
Posts: 402
I can only be amazed at the dedication of someone able to read any significant amount of computer translated text (especially Japanese!). Surely even porn isn't that interesting. 🙂
12 years ago
Posts: 6
Yes it is, if you want to know what the character says.

12 years ago
Posts: 761
In my experience, Japanese text translated by Google Translate usually turns out to be total gibberish, and it's very hard, or impossible, to even get its general meaning.