By Ctein
It's spring, and this man's fancy turns to…mortality.
My very first column for The Online Photographer was about the death of Tee Corinne. Tee's estate commissioned me to organize her computer files. My task was to eliminate the chaos and duplication and make the files accessible to future scholars at the University of Oregon. I had Tee's external hard drive, floppies, Zip disks, CDs, and DVDs. All told, there were about 70 GB of data in about 75,000 files: Tee's writings, correspondence and photographs.
Tee had used a variety of word processing programs, some quite ancient. Early stuff was named in the traditional "8.3" DOS fashion we had to work with in those dim, dark days. For example, a file might be named "INVISIBL.LNS," shorthand for "Invisible Lines." As you might expect, modern operating systems get flummoxed by this.
I reconfigured all the names and appended the file creation dates. MacLink Plus converted them en masse from their original formats into DOC files, a format so widely used I can be sure programs capable of reading it will be available into the indefinite future. INVISIBL.LNS became INVISIBL_LNS_05161994.DOC. Done!
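The renaming rule can be sketched in a few lines of Python. This is only an illustration, not the tool used for the project: the function name `archival_name` is hypothetical, and the creation date of 16 May 1994 is inferred from the MMDDYYYY stamp in the example above.

```python
from datetime import date
from pathlib import PurePath


def archival_name(original: str, created: date, new_ext: str = "DOC") -> str:
    """Fold the old 8.3 extension into the name and append the creation date.

    Hypothetical helper: mirrors the INVISIBL.LNS -> INVISIBL_LNS_05161994.DOC
    pattern described in the text, with the date written as MMDDYYYY.
    """
    p = PurePath(original)
    parts = [p.stem]
    old_ext = p.suffix.lstrip(".")
    if old_ext:
        # Keep the old extension as part of the name so no information is lost.
        parts.append(old_ext)
    parts.append(created.strftime("%m%d%Y"))
    return "_".join(parts) + "." + new_ext


print(archival_name("INVISIBL.LNS", date(1994, 5, 16)))
# prints INVISIBL_LNS_05161994.DOC
```

Keeping the old extension inside the new name means two files that differed only by extension do not collide after conversion.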
The photos took 95% of the work. First I had to get all the files into my computer. CD and DVD reading and writing isn't foolproof. A single read of Tee's original disks didn't copy all the photos successfully. Adobe Bridge usually signaled when a photo was broken by either refusing to display a thumbnail or displaying a low-quality thumbnail with a black border. Sadly, it wasn't foolproof.
To get 100% reliable results, I pointed Photoshop's Contact Sheet II function at the hard drive and told it, "Make me contact sheets of everything!" Every time it encountered a file that wasn't 100% kosher, it halted and let me know. Then I'd go back to Tee's original disks. In all but two cases (out of over 15,000 unique photographs), I could come up with a good version of the file. It took a long time for the system to crunch that many photos, but it didn't require constant attention.
What does it take to sum up an artist's life? More than one screen shot, for sure, but this is a record of the last half-decade of Tee's life. Each barely-visible dot, 144 to a sheet, is a unique photograph.
The contents of Tee's disks and hard drive overlapped but weren't identical. In addition, there were numerous duplicates where Tee copied a photograph to a particular project folder to work on. Finished work was sometimes duplicated in several places for different projects. Sometimes the file names were kept the same, sometimes not. In at least two cases, there were files that had identical names and identical file lengths in different folders that were entirely different photographs.
My only option was to look at each and every photo...several times. Multiply that by around 25,000.
One single contact sheet from the 130 pictured above.
The first step I took was to identify the unique contents of the hard drive and the disks and merge them into a new folder. I opened up two full-screen light table windows in Adobe Bridge, showing corresponding folders from the hard drive and the disks, and started comparing the thumbnails. Boy, I have never been so happy to have dual monitors with lotsa pixels!
It went faster than you might think. The contents of the two data sets were similar, and I could scan a whole screen quite quickly. Anything that was a dupe (same look, same name, same file size) got deleted from the disk folder. Once I'd finished purging the folder I copied the remaining disk files over to the corresponding folder for the hard drive. I preserved the unique disk contents as well, so future scholars could see which photos Tee had offloaded from her hard drive. Who knows; it might be of interest?
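Because files with the same name and the same length occasionally turned out to be different photographs, a name-and-size match cannot be trusted on its own. A minimal Python sketch of a stricter check follows. It is an assumption-laden illustration, not what was done for this project: the function names are hypothetical, it compares only same-named files in two corresponding folders, and it uses a SHA-256 hash of the file contents as the final test.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large photos need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def true_duplicates(disk_folder: Path, drive_folder: Path) -> list[str]:
    """Names of files in disk_folder that are byte-identical to the
    same-named file in drive_folder (hypothetical helper)."""
    dupes = []
    for disk_file in sorted(disk_folder.iterdir()):
        if not disk_file.is_file():
            continue
        twin = drive_folder / disk_file.name
        if not twin.is_file():
            continue
        # Cheap test first: different sizes can never be duplicates.
        if disk_file.stat().st_size != twin.stat().st_size:
            continue
        # Same name and size is not enough; compare the actual contents.
        if sha256_of(disk_file) == sha256_of(twin):
            dupes.append(disk_file.name)
    return dupes
```

A check like this only confirms exact copies. It will not catch a duplicate that was renamed, resaved, or edited, so the eyeball comparison of thumbnails is still needed for those.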
Now I was down to a mere 45 GB of data! It was time to weed out the superfluous, as I'll talk about next time.
____________________
Ctein
DOC files are often incompatible between different versions of MS Word, let alone reliably convertible to other word processors. They can also carry MS Word viruses and tend to be bloated due to a feature of Word that I recommend turning off--"Fast Saves."
I usually tell my students to convert their DOC files to RTF format, if they want to send me files. That keeps them smaller for e-mail, safe from viruses, and more readable by more word processors, so I don't have to worry about compatibility issues between their version of Word and mine.
I keep a copy of Word for dealing with files that people send me and for preparing manuscripts for publication, but I really write in Nota Bene, which is still the best software for academic writing around (www.notabene.com for more info on the current version).
Posted by: David A. Goldfarb | Saturday, 22 March 2008 at 08:16 PM
Sorry David, but I have to disagree. The .doc format is pretty much ubiquitous. If your version of a .doc format is unrecognizable in one version of Word, then that means the .doc format is more than three versions older than the one you are currently using as Microsoft has legacy support back three generations. (So, 2008 will recognize XP versions of Office.) There are also converter packs you can get for earlier versions.
I have worked in several universities and only once have I come across a professor using Nota Bene. The entire Helpdesk staff chortled at his request for support. Nota Bene is NOT a mainstream program, nor in my experience is it very common even in higher education. The schools that I have had some exposure to include Alfred University, SUNY Cortland (and by extension all SUNY and CUNY schools), Skidmore College, Gettysburg College, Suffolk County Community College, University of Missouri, Washington University, University of Colorado, Denver University, St. Clair County Community College, University of Toledo, College of Charleston, Charleston University, Vanderbilt, Appalachian State, UNC Chapel Hill, Clemson, USC, and probably a few more. The one instance of Nota Bene was at CofC and the state of SC is woefully ranked either 49th or 50th in education depending on which poll you reference.
The .doc format is probably the closest to a ubiquitous formatted editor there is given the saturation of the Microsoft Office applications through the PC marketplace. If you wanted to pick a format that is entirely ubiquitous, then go with the txt format.
Posted by: Jason | Saturday, 22 March 2008 at 10:43 PM
For anybody tackling a project like this again, check out this free program, 'Double Killer'. It searches for and displays duplicate files based on name, size, date, modified date and a checksum it generates (or any combination of the above).
It then displays a list of the doubles it finds so you can confirm its findings before deleting the doubles.
It won't do ALL of the work for you, but might take care of a large chunk of the more obvious ones. I'm bad with keeping things organized, so find that running this once in a while keeps things a bit cleaner.
http://www.bigbangenterprises.de/en/doublekiller/
Posted by: Curt Bousquet | Saturday, 22 March 2008 at 10:49 PM
You should look into a program called Dup Detector. It can find duplicate images based on the actual content of the image. After building a catalog, it will present pairs of images to you that have x percent correspondence (with x being a value you can set). It's really great for detecting identical and similar images on your computer.
Posted by: Zaan | Saturday, 22 March 2008 at 11:17 PM
I didn't claim that NotaBene was widely used or mainstream, just that it is the only serious academic word processor out there, even if most of my colleagues don't realize it. Word is widely used, but is more oriented toward business applications. Try to call Microsoft to help you sort out your footnote formatting problem or making your document conform to an academic style manual, and they'll generally have no idea what you're talking about or why it's important. I've never run into Bill Gates at the MLA Conference, but I have spoken with Steve Seibert, the President of NotaBene there, and the company's help desk has always been helpful when I've called.
If the DOC format is only supported for three generations, well, then it's a moving target, and is in no way a long term archival solution, nor can it be called a single file format. Which DOC format is ubiquitous? There are several of them. Usually the problem is that one version of Word will open a DOC file from another version of Word, but there will be lots of garbage appended to the file, and the formatting won't be as the author intended. This never happens with RTF. The RTF format is much more stable, and it's easy to convert to it in Word.
Posted by: David A. Goldfarb | Saturday, 22 March 2008 at 11:17 PM
I've long relied on d'peg to remove duplicates.
It has color content and pattern matching to find duplicates of the same image, despite rotation, cropping, color correction, borders, text added, you name it.
I seem to be happier with my older version than the latest, but in case anyone is interested:
http://www.gotdupes.com/
Posted by: Stephen S. | Saturday, 22 March 2008 at 11:38 PM
Having said the above, I forgot completely the main point of the article which was to save things in a logical format and try to use descriptive file names and current software. You not only will be doing yourselves a favor in file management, but eventually your heirs as well. My heart goes out to Ctein for the extensive work he had to put in on this project. Great idea also to use Bridge to view a thumbnail of everything - had never thought to use it for these purposes, but it makes perfect sense. Now, excuse me while I go organize my own hard drive(s)! :)
Posted by: Jason | Saturday, 22 March 2008 at 11:50 PM
Guys, enough with the software debate. This isn't a software forum.
Thanks--
Mike J.
Posted by: Mike Johnston | Sunday, 23 March 2008 at 01:19 AM
Dear David,
In your zeal to damn Microsoft, you've missed the point, and you're wrong in detail.
The DOCs are compatible with every version of Word as far back as 6. So many documents exist in this one format that readers *will* be available into the indefinite future even if Microsoft disappeared tomorrow. That's all that's important.
The virus argument is ludicrous. MacLink Plus doesn't attach viruses to files. The Fast Save argument is equally silly, for much the same reason.
I don't use Word either; I hate it. I happen to much prefer Nisus. So what? This isn't about me. It isn't even about you.
Please save the rants for appropriate venues.
Hopefully this will get to be the last word on this unfortunate diversion.
pax / Ctein
Posted by: Ctein | Sunday, 23 March 2008 at 05:20 AM
Dear Folks,
Double Killer, Dup Detector and d'peg all sound seriously worthwhile. I will investigate them. Thanks so much for the suggestions! It's great when I can write a column like this and learn so much.
pax / Ctein
Posted by: Ctein | Sunday, 23 March 2008 at 05:25 AM
Ms. Corinne's fans are very fortunate to have someone of Ctein's calibre to save her work from being lost in obscurity.
The vast bulk of this century's digital image archives on personal computers will likely be lost in oblivion because there won't be anyone with enough skill or caring among the departed's friends and associates who could make the time and effort to wade through the digital swamp.
I make a print of whatever I think is a 'good' photo in my own archives, so that subsequent to my death someone can find them easily. I wouldn't wish it on anyone to sit at my workstation and trawl through the tens of thousands of images (and perhaps hundreds of thousands if I'm fortunate to live that long).
Posted by: Craig Norris | Sunday, 23 March 2008 at 06:50 AM
I think the core lesson here is about organizing your own data. Not for your executor - most of us will not merit that - but because I do not think many folks are really able to make good use of their digital collections. Sloppy data habits that were limited by how much film you could shoot become nightmarish with 8 gig memory cards. Fred Picker advised throwing away marginal negatives once a year so that you would not waste time with them, rather than shooting and moving forward. While I do not do that, the wisdom in it is more acute for the digital world. I do ruthlessly prune my digital files, then use a filing program to categorize them and include context info.
Posted by: Ed Richards | Sunday, 23 March 2008 at 09:43 AM
We have an interest around here in things that are archive-quality (or "archival", an adjective I'm not fond of).
For physical objects, we have some idea what makes things age-resistant. Pigments instead of dyes; acid-free paper; proper storage.
For information, we are still groping our way around. It's as much a problem of library science as of computer science ... but for librarians and computer scientists both, it's uncharted territory to some degree.
However, there are some aspects of data conservation that are well understood due to their similarities to other interoperability problems. A good file format for conservation is described by an *open specification*. This specification has *no intellectual property encumbrances* and exists in *multiple independent implementations*.
RTF fits the bill. Word does not. This is not about Microsoft-bashing; if we were talking about AppleWorks or WordPerfect files, I'd say the same thing.
I think people were unfairly grumpy in David's direction. The concern he brought up was relevant, and the solution he proposed was easy, cheap, and, I believe, correct.
Posted by: Ben Rosengart | Sunday, 23 March 2008 at 12:48 PM
Sorry for the off-topic comments about word processing software. I think you're assimilating my skepticism about the archival suitability of DOC files to the general anti-Microsoft sentiment out there or the PC/Mac wars. I've done all of my writing on a PC since around 1985 after a year or so of using a Xerox 3030.
My point is that RTF (Rich Text Format), which is a standard developed by Microsoft, is much more stable than DOC, which changes with every version of Word. RTF files are stored in ASCII format and are readable in any text editor, and are easily convertible to XML-based formats, which are the real library standard for digital archives of formatted text documents.
Posted by: David A. Goldfarb | Sunday, 23 March 2008 at 01:16 PM
Only one nit to pick, really. Why on earth didn't you use ISO dates?
http://en.wikipedia.org/wiki/Iso_date
Also, and more a personal preference, I'd put them at the start of the filename, eg.
19700101_unixepoch.txt
20080324_wordsucks.doc
20080324_nittygritty.doc
and so on. This means that things will appear neatly in chronological order without any special effort.
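The sorting claim is easy to check with a small Python sketch. The helper name `iso_prefixed` is hypothetical, and the dates are simply the ones in the example filenames above; plain string sorting stands in for what a file browser does when it sorts by name.

```python
from datetime import date


def iso_prefixed(name: str, created: date) -> str:
    """Prefix a filename with its date as YYYYMMDD (ISO 8601 basic format)."""
    return f"{created.strftime('%Y%m%d')}_{name}"


files = [
    ("wordsucks.doc", date(2008, 3, 24)),
    ("unixepoch.txt", date(1970, 1, 1)),
    ("nittygritty.doc", date(2008, 3, 24)),
]

# A plain alphabetical sort of the names is also a chronological sort,
# because year, month, and day appear from most to least significant.
for name in sorted(iso_prefixed(n, d) for n, d in files):
    print(name)
```

With a month-first stamp such as MMDDYYYY, the same alphabetical sort groups files by month rather than by year, so 12312007 would land after 03242008.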
The date format you have selected shows significant lack of foresight.
Posted by: john slee | Sunday, 23 March 2008 at 07:51 PM
Dear folks,
Re: file formats, we're just going to have to agree to disagree. There's nothing wrong with RTF but I do not accept that DOC is any less useful. What will determine the readability of formats far into the future is not whether or not they conform to an open standard but how much information of import is stored in them. There are a number of open-standard data formats out there which have turned out to be so little-used that I am certain that most computers will be unable to read them in 30 years. While open code allows for the writing of translators at any time, the practical reality is that those activities consume time and money and those are always in short supply. Regardless of the future of Microsoft, the existence of a gazillion important documents in DOC form ensures widespread readability for a very long time.
The same can be said of RTF. But this does not make the use of DOC inappropriate. It is merely that there is more than one way to address this problem satisfactorily. I could actually trot out some arguments about why DOC is better for my purposes, but frankly I think they boil down to a kind of nit-picking that I find both boring and useless. So I won't. Because both formats work. Pick the one that you think will work better for you. You can't guess wrong in this case.
Re: date codes, I did put some substantial thought into the question of whether the date should be at the beginning or end of the file name. I decided end of file name was better because it was not as important to be able to sort the documents into chronological order easily as it was to be able to easily collect documents that belonged to the same project. For example, a book manuscript of Tee's spanned a considerable number of years in various versions, addenda and subsections. They tended to fall nicely together if organized by original filename. Sorted by date, though, they were dispersed among the myriad other documents that had been created over the same period. Since scholars would also have access to Tee's original data and could look at the chronology of her file and folder structures there, I decided it was not important to emphasize this in my translations; it was more important to make it easy for scholars to find all the documents that might belong in a particular group of interest.
But one could just as easily argue it the other way under slightly different circumstances. I'm only saying that this was not a thoughtless decision.
~ pax \ Ctein
[ please excuse any word salad. MacSpeech in training! ]
======================================
-- Ctein's online Gallery http://ctein.com
-- Digital restorations http://photorepair.com
======================================
Posted by: Ctein | Monday, 24 March 2008 at 04:16 AM