Sunday, October 16, 2005

Yet, even more on Word metadata

While awaiting information on the binary format of shortcut (LNK) files, I decided to try to learn more about structured storage and metadata in Word documents. The best example I've seen that describes some of the metadata in Word documents is available at the Computerbytesman site, and addresses an issue that Tony Blair's government had a while back. While I was researching my book, Richard Smith was kind enough to share with me his code for retrieving the last 10 authors from within a Word document. Since that time, I've thought about taking another look at the sort of metadata that one can retrieve from within a Word document.

I included a Perl script for retrieving Word metadata with my book. The code is on the CD that accompanies the book, in the code directory for chapter 3. The script is called "meta.pl" and uses the Win32::OLE module to create an instance of Word, and then uses the Word API to retrieve metadata. Well, as I've seen with the work that I did on reading Event Log files, the API doesn't always get everything. Also, I've been looking for something a little more platform-independent.

Thanks to Richard Smith, I dug into the OLE::Storage module a bit, and found exactly what I was looking for. First, a quick caveat...the POD for this module, as well as some of the supporting modules, is a bit out of date. However, by using some of the accompanying examples (such as ldat, written by Martin Schwartz, copyright '96-'97) and simply trying some things out, I was able to figure things out. So the script uses that module, and a couple of others...but only after it opens the file in binary mode to retrieve other information from the file.

Okay, on to the output. I started with the Blair document from the Computerbytesman site, and got the same information (I didn't include the VBA Macro information, though). I downloaded a couple of arbitrary Word documents from the Web, via Google, and found some interesting info:

--------------------
Statistics
--------------------
File = d:\cd\wd\04_007.doc
Size = 322560 bytes
Magic = 0xa5ec (Word 8.0)
Version = 193
LangID = English (US)

Document has picture(s).

Document was created on Windows.

Magic Created : MS Word 97
Magic Revised : MS Word 97

--------------------
Last Author(s) Info
--------------------
1 : Susan and Shawn Sutherland :
2 : Susan and Shawn Sutherland :
3 : Susan and Shawn Sutherland :
4 : Susan and Shawn Sutherland :
5 : picketb :
6 : padilld :
7 : ONR :
8 : John T. McCain :
9 : horvats :
10 : arbaizd :

--------------------
Summary Information
--------------------
Title : I
Subject :
Authress : PICKETB
LastAuth : arbaizd
RevNum : 2
AppName : Microsoft Word 10.0
Created : 08.12.2003, 16:11:00
Last Saved : 08.12.2003, 16:11:00
Last Printed : 08.12.2003, 16:11:00

--------------------
Document Summary Information
--------------------
Organization : Office of Naval Research

Pretty cool, eh? Again, I found this document on the web. In my previous post, I asked some folks to send me documents written on the Mac platform, and I received a couple. Here's what the output looks like:

--------------------
Statistics
--------------------
File = d:\cd\wd\ex1.doc
Size = 21504 bytes
Magic = 0xa5ec (Word 8.0)
Version = 193
LangID = English (US)

Document was created on a Mac.
File was last saved on a Mac.

Magic Created : Word 98 Mac
Magic Revised : Word 98 Mac

--------------------
Last Author(s) Info
--------------------
1 : : Macintosh HD:Users:name:Desktop:Ex1.doc

--------------------
Summary Information
--------------------
Title : The quick brown fox jumps over the lazy dog
Subject :
Authress : name
LastAuth : name
RevNum : 1
AppName : Microsoft Word 10.1
Created : 12.10.2005, 02:51:00
Last Saved : 12.10.2005, 02:58:00
Last Printed :

--------------------
Document Summary Information
--------------------
Organization :

Okay, I made a couple of obvious changes, but the point is that there is information within the binary contents of the file information block (FIB) that tells you the platform that a document was created on...for example, if it was created on a Mac, or on a Windows platform. Pretty cool, eh?
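
To make the FIB fields a bit more concrete, here is a minimal Python sketch (not the Perl script that produced the output above) that decodes a handful of values from the first bytes of the WordDocument stream. The byte offsets are assumptions based on the published Word 97 binary format layout as I understand it (wIdent at 0, nFib at 2, lid at 6, a flags word at 10, envr at 18, the fMac bit at 19, wMagicCreated and wMagicRevised at 34 and 36), and the function name and dictionary keys are made up for illustration. It expects the bytes of the stream itself, not the raw .doc file, since the stream sits inside the structured storage container.

```python
import struct


def parse_fib_header(stream):
    """Decode a few fields from the start of a WordDocument stream.

    Offsets are assumed from the Word 97 binary format layout.
    """
    if len(stream) < 38:
        raise ValueError("need at least 38 bytes of the FIB")
    w_ident, n_fib = struct.unpack_from("<HH", stream, 0)
    lid, = struct.unpack_from("<H", stream, 6)
    flags, = struct.unpack_from("<H", stream, 10)
    envr, mac_flags = struct.unpack_from("<BB", stream, 18)
    magic_created, magic_revised = struct.unpack_from("<HH", stream, 34)
    return {
        "magic": w_ident,                         # 0xA5EC for Word 8.0
        "version": n_fib,                         # e.g. 193
        "lang_id": lid,                           # 0x0409 is English (US)
        "has_pictures": bool(flags & 0x0008),     # fHasPic bit
        "created_on_mac": envr == 1,              # 0 = Windows, 1 = Mac
        "last_saved_on_mac": bool(mac_flags & 0x01),  # fMac bit
        "magic_created": magic_created,
        "magic_revised": magic_revised,
    }
```

Calling parse_fib_header() on the stream bytes returns a plain dictionary, so printing a report like the "Statistics" section above is just a matter of formatting the values. Mapping the two "magic" values to product names is left out here, since I'd only be guessing at the table.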

So...what do you think? I'll be posting the script soon, along with a couple of other scripts...for example, I'm going to include one that I used for troubleshooting, which simply writes all of the structured storage streams to files on the system. After all, MS describes structured storage as "a file system within a file", so wouldn't you like to see the contents of each of those files? I'm not entirely sure of the usefulness of this with regard to forensic analysis, but someone might find it useful.

An offshoot of all this involves the MergeStreams application (here's something I found at UTulsa) that I've used in some of my presentations. This application allows you to merge an Excel spreadsheet into a Word document, resulting in a much larger, but otherwise unchanged Word doc. However, if you change the resulting file's extension to ".xls", and double-click on it, you'll see the entire, unmodified contents of the spreadsheet. This is due to the streams being merged, and handled by the appropriate application (no, this is not steganography!!). Whenever I've presented on this, I've been asked how this sort of thing can be detected, and up until now, the only solutions I've been able to come up with have included the use of 'strings' and 'find'. With this module, however, you can dump the names of the streams from an OLE document, and if you see a stream named "Workbook" inside a Word document, you can be pretty sure that you've got an embedded document. This is a more accurate method than using 'strings'.
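
The stream-name check described above can be sketched in a few lines of Python. This is a rough, standard-library-only illustration and not the OLE::Storage-based script: it reads the compound file header, builds the FAT from the 109 sector numbers stored in the header, walks the directory chain, and collects the entry names. The function names are hypothetical, the header offsets are assumptions taken from the published compound file layout, and very large files that need extra DIFAT sectors are not handled. It also lists every stream in the file without regard to nesting, so a spreadsheet embedded as an ordinary OLE object would be reported as well.

```python
import struct

OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
MAX_REGULAR_SECTOR = 0xFFFFFFFA  # larger values are markers (free, end of chain)


def list_ole_entries(data):
    """Return (name, type) pairs from the directory of an OLE compound file.

    Types: 1 = storage, 2 = stream, 5 = root entry.
    """
    if data[:8] != OLE_SIGNATURE:
        raise ValueError("not an OLE compound file")
    sector_size = 1 << struct.unpack_from("<H", data, 30)[0]
    first_dir_sector, = struct.unpack_from("<I", data, 48)

    def sector(n):
        # Sector 0 starts right after the header, which occupies one sector.
        start = (n + 1) * sector_size
        return data[start:start + sector_size]

    # Build the FAT from the sector numbers listed in the header.
    fat = []
    for fat_sector in struct.unpack_from("<109I", data, 76):
        if fat_sector <= MAX_REGULAR_SECTOR:
            fat.extend(struct.unpack("<%dI" % (sector_size // 4),
                                     sector(fat_sector)))

    # Walk the directory chain; each directory entry is 128 bytes.
    entries, seen, sect = [], set(), first_dir_sector
    while sect <= MAX_REGULAR_SECTOR and sect < len(fat) and sect not in seen:
        seen.add(sect)
        block = sector(sect)
        for off in range(0, len(block) - 127, 128):
            name_len, kind = struct.unpack_from("<HB", block, off + 64)
            if kind in (1, 2, 5) and 2 <= name_len <= 64:
                raw = block[off:off + name_len - 2]  # drop the trailing NUL
                entries.append((raw.decode("utf-16-le", "replace"), kind))
        sect = fat[sect]
    return entries


def foreign_streams(data):
    """Return Excel-style stream names found alongside a WordDocument stream."""
    names = {name for name, kind in list_ole_entries(data) if kind == 2}
    if "WordDocument" not in names:
        return []
    return sorted(names & {"Workbook", "Book"})
```

Reading a suspect file with open(path, "rb").read() and passing the bytes to foreign_streams() gives back a list of any "Workbook" (or older "Book") streams sitting in a file that also contains a WordDocument stream; an empty list means none were seen.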

I'll be releasing the scripts soon...there are a couple of things I need to clean up, and I'm having a small issue with the compiled EXE version of the main script (above) that I'm trying to clear up.
