DjVu Files

John C:
Before I start, I want to point out that I'm not agitating for change.  I just noticed something neat and figured some of the rest of you might want to poke around and experiment, too.  That said...

Has anybody out there (I guess scanners, primarily) done any work with the DjVu format?  It was shopped around by AT&T when PDF was looking terrible (err...more terrible), but never really went anywhere except the Internet Archive.  See more at http://djvu.org, particularly the page of resources.

It looks to me like there aren't particularly good tools--what I found for free was DjVu Solo, which is discontinued, clunky, and very slow--but the format shows some promise with respect to comics.

First, there's the size.  For experimental purposes, I took a few recent downloads, unzipped the archive, then stuffed the images into a DjVu file.  I was actually kind of shocked at the results.

Triumph Adventure Comics #1, as a RAR archive, clocks in at 12.2MB.  Telling DjVu Solo that the images were photographs (which I assume is the best quality) at 300dpi, it dropped to 8.3MB.  Willing to lose a bit of fidelity?  Going to a black-and-white image gives us lots of jaggies, but the file is less than a megabyte and still mostly readable!

Phantom Lady #17 went from 24.9MB to 10.3MB, again, as a "photograph," which is a more dramatic reduction of nearly 60%.  And I can't tell the difference between the compressed pages and the originals, except that they render progressively.  That is, a blocky image shows up immediately, which is then refined over time, like in old web browsers.

Likewise, Wham Comics #2 went from 49.1MB down to 20.2MB.  The pages can also be saved as "scans" instead of "photographs," which sacrifices a fair amount of quality (some regions are weirdly blurry, whereas some lines are absurdly sharp--I guess that's the wavelet compression) but is still nicely readable, and the resulting file drops to...3.5MB.  Feel free to do a double-take, because that's about 7% of the original size!
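
For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic in Python.  The sizes are the ones quoted in this post; the function name and the dictionary labels are made up for illustration.

```python
def percent_of_original(before_mb, after_mb):
    """Return the converted size as a percentage of the original size."""
    return 100.0 * after_mb / before_mb


# (original archive size, DjVu size) in MB, as reported in the post.
sizes = {
    "Triumph Adventure Comics #1 (photo)": (12.2, 8.3),
    "Phantom Lady #17 (photo)": (24.9, 10.3),
    "Wham Comics #2 (photo)": (49.1, 20.2),
    "Wham Comics #2 (scan)": (49.1, 3.5),
}

for title, (before, after) in sizes.items():
    kept = percent_of_original(before, after)
    # "kept" is what remains; 100 - kept is the reduction.
    print(f"{title}: {kept:.1f}% of original, {100 - kept:.1f}% smaller")
```

Note that "a reduction of nearly 60%" and "about 7% of the original size" measure opposite things: the first is how much went away, the second is how much is left.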

And different pages can actually be compressed differently, then merged, so the tradeoffs could be made fairly intelligently, I think.
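
One way the per-page idea might be scripted is sketched below.  This assumes the open-source DjVuLibre command-line tools rather than DjVu Solo (which is what was actually used for the numbers above): c44 for colour pages, cjb2 for black-and-white pages, and djvm to bundle the results.  The file names, the output name, and the helper function are hypothetical, and the script only builds the command lines without running anything.

```python
from pathlib import Path

# Assumption: DjVuLibre is the toolset. c44 is its wavelet encoder for
# colour images; cjb2 is its encoder for bitonal (black-and-white) images.
ENCODERS = {"photo": "c44", "bitonal": "cjb2"}


def build_commands(pages, output="book.djvu", dpi=300):
    """Return the command lines to encode each page and bundle them.

    `pages` is a list of (image_path, mode) pairs, where mode is a key
    of ENCODERS. Nothing is executed here; each list could be handed to
    subprocess.run once the tools are installed.
    """
    commands = []
    encoded = []
    for image, mode in pages:
        page_file = str(Path(image).with_suffix(".djvu"))
        commands.append([ENCODERS[mode], "-dpi", str(dpi), image, page_file])
        encoded.append(page_file)
    # djvm -c creates one multi-page document from single-page files.
    commands.append(["djvm", "-c", output] + encoded)
    return commands


# Hypothetical file names, purely for illustration.
plan = build_commands([("cover.jpg", "photo"), ("page02.pbm", "bitonal")])
for cmd in plan:
    print(" ".join(cmd))
```

Keeping the mode as a per-page label means a colour cover and a line-art interior can get different treatment in the same book, which is the tradeoff described above.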

(There's also a "clean" mode for saving files, but my poor little machine runs out of memory whenever I try it, so I don't know how useful it might be.  Or it could be that it chokes on calling a JPEG image "clean," for all I know.  It sounds promising for scans of line art, though.)

The other feature impressing me, at least conceptually, is that it's apparently possible to associate the images of text on the page with actual text for screen readers, translations, or copy-and-paste operations.  I wasn't able to figure out HOW, mind you, with the tools at hand, but there's an article on making it happen with handwritten pages and the Internet Archive files are all copy-able, so it's presumably workable.

I might continue messing around, so if anybody has advice on software, processes, and so forth, or warnings about why nobody should ever be using such a thing, I'd like to hear it.  Oh, and if anybody wants copies of those files for comparison purposes, I'll post them up somewhere.

GeneYas:

--- Quote from: John C on June 13, 2010, 01:43:37 PM ---
Has anybody out there (I guess scanners, primarily) done any work with the DjVu format?

--- End quote ---

The problem with odd formats like that is they never get fully supported going forward, and you risk not being able to open them in the future when companies create new software. If you really start to investigate graphic formats, JPG & GIF are extremely outdated. JP2 (JPEG2000) allows far better compression while still retaining image quality.  I'd be willing to bet that Lura Document Files (LDF) are more advanced than DJVU, but the disadvantage there is that the format is proprietary.

Gene

John C:
I agree with that (and I've been there, with critical design documents on disks that nobody had touched in ten years, in software I had never heard of that was no longer available).  But DjVu does have the weight of AT&T and the Internet Archive behind it, and Wikipedia is internally debating whether to use it across the board (which I stumbled on, and which pointed me to the Solo software).  So it's not like it's the pet of some kid in a basement to distribute his band's zine, right?

The tools (that I found, at least, which is why I asked the question) are awful, but that's the sort of thing that changes with adoption.  There was a time when Flash looked like a really stupid idea to most people, whereas today it's a thriving platform that's supported pretty much everywhere and very few users complain other than fogeys like myself.  I can even vaguely remember statements that JPEG would never take off, so don't bother installing the software!

And of course, I realize that a big part of the choice is (probably) the size of the audience.  Any computer produced after, say, 1994 is going to be able to unzip files and view JPEGs inside.  The same can't be said of other formats with decent compression.  Heck, I'm surprised RAR caught on.

Again, I mostly ask for my own purposes and curiosity, rather than agitation.  I don't expect anybody to suggest that we dig in and convert any comics to another format or even suggest scanners supply new books any other way.

Simply:  Are there better tools (or better ways to use the tools) than what I found, for further experiments?  Are there serious technical flaws with the format that I haven't noticed?  If the Internet Archive and Wikipedia are giving it the thumbs up, does that mean it has advantages beyond what I've found?

JonTheScanner:
I can do the same within JPG format.  If you take one of our large files and resave in JPG with more compression the results *can* be indistinguishable on screen. I suspect that would not hold if you printed them.  And I'm not sure saving as "photo" produces the least compression.  Often B&W line art has a default with less compression in order to avoid the jaggies you mentioned.

GeneYas:
DJVU has been around a long time and I suspect it isn't going to catch on now if it hasn't already. It would be nice if the developers of the newer compression formats would donate their algorithm to the web community. They'd rather it rot in oblivion and try to squeeze every penny out of people. I really expect SVG to take off, but it'd be more for the web & 3D images. George Lucas has a new 3D image format that may take off because of its use in Hollywood. I may be wrong, but I thought DJVU was developed for maps. Perhaps it was developed for documents. I know Irfanview will view DJVU files and I use it already. I save all my master scans as PNG and convert to other formats as necessary. PNG is like an advanced version of GIF using the full color palette like JPG. It provides good results if you convert to either. File sizes tend to be larger, but it does have compression and I don't really get into how small you can make the file. Compression degrades image quality and I always want the best quality regardless.

Gene
