Tuesday, October 11, 2005

More on Word Metadata

This past summer, I gave a couple of presentations, one of which covered file metadata. I got to thinking...I've parsed Event Log files and Registry files in binary format...why not do the same with Word documents and see what else is in there besides what the MS API is telling me? After all, a particular value that references "hidden" data may be set to 0 (or 0x0000), but the actual data itself may still be there.

Remember the issue with Blair's gov't? When I found this, I tried using the MS API (via OLE) to retrieve the metadata concerning the last 10 authors of the file, and I simply could not get it to work. However, Richard Smith had no trouble doing so.

I started looking around and found the MS Word 97 binary file format (BFF) (here in HTML). I haven't had any trouble parsing the file information block, but I am having a bit of trouble locating the beginning of the table stream. Many of the values I'm interested in are listed as "offset in the table stream", indicating (to me, anyway) that the offset is measured from the beginning of the table stream.
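
For what it's worth, one way to read the format description is that the table stream is not a region inside the main document stream at all, but a separate stream in the OLE compound file, named either 0Table or 1Table, with a flag bit in the file information block saying which one is in use. Under that reading, an "offset in the table stream" is simply a byte offset from the start of that stream. Here's a minimal Python sketch of the idea. The field offsets (wIdent at 0x00, nFib at 0x02, the flags word at 0x0A, bit 0x0200 selecting the stream) are my reading of the format description and should be checked against real files, and the function names are made up for illustration. It assumes the WordDocument stream and the table stream have already been pulled out of the compound file as byte strings by some OLE reader.

```python
import struct

WORD_MAGIC = 0xA5EC          # wIdent expected at the start of the WordDocument stream
F_WHICH_TBL_STM = 0x0200     # assumed flag bit: set means "1Table", clear means "0Table"


def parse_fib_header(word_document_stream: bytes) -> dict:
    """Read a few fixed-offset fields from the start of the file information block."""
    if len(word_document_stream) < 0x0C:
        raise ValueError("stream too short to hold the start of a FIB")

    # All values are little-endian 16-bit words.
    w_ident, n_fib = struct.unpack_from("<HH", word_document_stream, 0x00)
    (flags,) = struct.unpack_from("<H", word_document_stream, 0x0A)

    if w_ident != WORD_MAGIC:
        raise ValueError(f"unexpected wIdent 0x{w_ident:04X}")

    return {
        "nFib": n_fib,
        "flags": flags,
        "table_stream": "1Table" if flags & F_WHICH_TBL_STM else "0Table",
    }


def read_from_table_stream(table_stream: bytes, fc: int, lcb: int) -> bytes:
    """Return lcb bytes starting at offset fc, counted from the start of the table stream."""
    if fc < 0 or lcb < 0 or fc + lcb > len(table_stream):
        raise ValueError("offset/length pair falls outside the table stream")
    return table_stream[fc:fc + lcb]
```

With the stream name in hand, each offset/length pair found in the file information block can be passed to read_from_table_stream along with the bytes of whichever table stream the flag points to.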

If anyone has any information on this, I'd greatly appreciate the help.

Also, for testing/verification purposes, I was wondering if anyone out there with a Mac would do me a favor and create a couple of simple Word documents on that platform, zip them up, and send them to me. Some of the metadata within the Word document tells you whether the file was created or revised on a Mac. When you send the files, if you could specify the platform and version numbers (of the OS and the application), I'd appreciate it. Thanks!
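
To make the Mac check concrete, here's a rough Python sketch of how those two indicators might be read. It assumes, based on my reading of the Word 97 format description, that the byte at offset 0x12 of the file information block (envr) is 1 when the file was created by Mac Word, and that the low bit of the byte at offset 0x13 (fMac) is set when the file was last saved on a Mac. Those offsets are assumptions to verify against known Mac-created documents, and the function name is made up for illustration.

```python
def platform_hints(word_document_stream: bytes) -> dict:
    """Pull the assumed creation/last-save platform indicators out of the FIB."""
    if len(word_document_stream) < 0x14:
        raise ValueError("stream too short to hold the platform bytes")

    envr = word_document_stream[0x12]           # assumed: 0 = Win Word, 1 = Mac Word
    f_mac = word_document_stream[0x13] & 0x01   # assumed: 1 = last saved on a Mac

    return {
        "created_on_mac": envr == 1,
        "last_saved_on_mac": f_mac == 1,
    }
```

Comparing the output for documents of known origin is a quick way to confirm or refute the assumed offsets.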

3 comments:

Anonymous said...

A great resource would be the OpenOffice source code. They've come a long way to being almost completely compatible with Word documents.

Ryan Sommers

Anonymous said...

There are many references for extracting data from Office/OLE documents. I recommend you check out the OLE libraries on CPAN (for ideas), the LAOLA project (for Perl code), and as mentioned, the OpenOffice project for documentation. While the code from the latter may not be directly applicable for you, I'm sure their comments will be quite illuminating.

H. Carvey said...

I recommend you check out the OLE libraries on CPAN...

Been doing that...OLE::Storage and OLE::PropertySet are old, and very, very poorly documented. In fact, it's pretty clear from looking at the POD for OLE::PropertySet that the author reused the POD from OLE::Storage, without really attempting to clean it up. Module dependencies aren't well documented, either.


...LAOLA project (for Perl code)...

Yeah, I've gotta go back and try and figure that one out.

...the OpenOffice source code...

Ugh! "Check the source code...". ;-) Oh, well, maybe I'll have to...

Thanks, and thanks to those who've sent some docs in. I have enough Word docs from Macs...thanks.