Monday, September 25, 2006

MetaData and eDiscovery

Yesterday's CyberSpeak podcast raised some issues with Office document metadata and eDiscovery. Several commercially available tools came up, and I wanted to point out that there are freeware tools available as well.

First off, let me say that the tool I'll mention is one of my own...I'll be up front about that. It's a Perl module (File::MSWord) that I posted on CPAN, and it ships with a sample script called "testwd.pl". On Windows, if you're using ActiveState's ActivePerl, installation of the module is simple. Download the archive and extract the MSWord.pm file to \perl\site\lib\File. To install the modules that it depends on, use the following commands:

ppm install OLE-Storage
ppm install Startup
ppm install Unicode-Map

The sample script pulls out the data in a crude format...the original script that I based this module on (wmd.pl) did a better job of extracting the information in a pretty format. As an example, I'll use the Blair document:

C:\Perl>wmd.pl d:\cd\blair.doc
--------------------
Statistics
--------------------
File = d:\cd\blair.doc
Size = 65024 bytes
Magic = 0xa5ec (Word 8.0)
Version = 193
LangID = English (US)

Document was created on Windows.

Magic Created : MS Word 97
Magic Revised : MS Word 97

--------------------
Last Author(s) Info
--------------------
1 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
2 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
3 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
4 : JPratt : C:\TEMP\Iraq - security.doc
5 : JPratt : A:\Iraq - security.doc
6 : ablackshaw : C:\ABlackshaw\Iraq - security.doc
7 : ablackshaw : C:\ABlackshaw\A;Iraq - security.doc
8 : ablackshaw : A:\Iraq - security.doc
9 : MKhan : C:\TEMP\Iraq - security.doc
10 : MKhan : C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc

--------------------
Summary Information
--------------------
Title : Iraq- ITS INFRASTRUCTURE OF CONCEALMENT, DECEPTION AND INTIMIDATION
Subject :
Authress : default
LastAuth : MKhan
RevNum : 4
AppName : Microsoft Word 8.0
Created : 03.02.2003, 09:31:00
Last Saved : 03.02.2003, 11:18:00
Last Printed : 30.01.2003, 21:33:00

--------------------
Document Summary Information
--------------------
Organization : default

Notice the bolded line above...this is extracted from the binary data of the file.
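For context on the Statistics section: as I understand the Word binary format, the Magic, Version, and LangID values sit in the first eight bytes of the WordDocument stream. Here is a minimal Python sketch of decoding them. The function name and the sample bytes are made up for illustration; they are not part of the module or taken from the Blair document itself.

```python
import struct

def parse_fib_start(data):
    """Decode the first four 16-bit little-endian fields of a WordDocument stream.

    Assumed layout: magic number, version, one unused field, language ID.
    """
    magic, version, _unused, lang_id = struct.unpack_from("<HHHH", data, 0)
    return {"magic": magic, "version": version, "lang_id": lang_id}

# Constructed sample bytes (not read from a real file) that mirror the
# values reported above: 0xa5ec, 193, and 0x0409 for English (US).
sample = bytes.fromhex("eca5c10000000904")
info = parse_fib_start(sample)
print(hex(info["magic"]), info["version"], hex(info["lang_id"]))
```

To use it on a real document you would first have to pull the WordDocument stream out of the OLE container, which is the part the Perl modules handle.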

The module extracts the information, it just needs to be prettied up a bit. Another benefit of the module is that it extracts additional information from the OLE contents of the file. First off, it extracts information about the OLE "trash bins", where useful data could be hidden:

Trash Bin      Size
BigBlocks      0
SystemSpace    940
SmallBlocks    0
FileEndSpace   1450

Also, the module collects information about the OLE streams within the file:

Stream : ☺CompObj
Stream : WordDocument
Stream : ♣DocumentSummaryInformation
Stream : ObjectPool
Stream : 1Table
Stream : ♣SummaryInformation

At this point, you're probably thinking, "yeah...so?" Well, there's a freeware utility available called MergeStreams that allows you to merge an Excel spreadsheet into a Word document. The resulting file is slightly smaller than the sum of both file sizes, and the file extension is ".doc"...so if you double-click the file, it will open in Word and all of the Word data will be visible. However, if you change the file extension to ".xls" and double-click the file, it will open in Excel, with none of the Word data/information visible. It's still there...it's just not being parsed by Excel.

Why is this important? Well, if I wanted to smuggle information out of an organization, I might put the information in a spreadsheet for easy access and searching and then merge it into an innocuous Word document and copy it to my thumb drive (or laptop hard drive). On the off chance that anyone were to search me or my devices, they'd see the Word document. If they double-clicked it, they'd see the innocuous, boring content I'd put there...and wave me on my merry way. The same could be true for email attachments.

The example that I use that gets the LEOs sitting up in their seats is to take three illicit images and paste them into a Word document. Merge the document with an Excel spreadsheet that may be widely circulated throughout the company...financial forecasts, etc. Only those folks who know that the images are there will know to change the file extension to ".doc" so that they can view the images.
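This is where the stream listing earns its keep. Assuming a merged file carries both the Word stream ("WordDocument") and an Excel stream ("Workbook", or "Book" in older Excel versions), a check on the stream names is enough to flag it. The Python sketch below is hypothetical and works on a list of names like the one shown earlier; it does not parse the file itself.

```python
# Hypothetical helper: flag an OLE file whose stream list looks like both
# a Word document and an Excel workbook. The stream names are assumptions
# about what a merged file contains, not output from MergeStreams.
WORD_STREAMS = {"WordDocument"}
EXCEL_STREAMS = {"Workbook", "Book"}  # "Book" appears in older Excel files

def normalize(name):
    """Drop leading control characters (a console may show them as symbols)."""
    return name.lstrip("\x01\x05\u263a\u2663")

def looks_merged(stream_names):
    """Return True when both Word and Excel stream names are present."""
    names = {normalize(n) for n in stream_names}
    return bool(names & WORD_STREAMS) and bool(names & EXCEL_STREAMS)

# Stream names from the listing above, written with the control characters.
blair = ["\x01CompObj", "WordDocument", "\x05DocumentSummaryInformation",
         "ObjectPool", "1Table", "\x05SummaryInformation"]

print(looks_merged(blair))                 # False
print(looks_merged(blair + ["Workbook"]))  # True
```

A file that trips this check isn't necessarily malicious, but it is worth opening under both extensions.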

Interesting stuff. Like I said before, if you have a situation like the one mentioned in the podcast (i.e., you have to search a lot of files for specific metadata, such as the last author, or one of the last 10 authors), then something like the Perl module provides the necessary framework. Combine it with any of several ways to enumerate the files in question (read the contents of a directory, read the file list from a file, etc.) and Perl's regular expressions, and you can output to any format you like (HTML, XML, spreadsheet, database, text file, etc.).
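As a rough illustration of that workflow, here is a Python sketch (not the Perl module itself) that enumerates .doc files, matches a regular expression against each file's last-author list, and writes the hits as CSV. The `extract` callable is a hypothetical stand-in for whatever pulls the last-author records out of a file, and all of the function names are made up for this example.

```python
import csv
import io
import os
import re

def find_docs(root):
    """Yield the paths of .doc files found under the directory `root`."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".doc"):
                yield os.path.join(dirpath, name)

def search_authors(paths, extract, pattern):
    """Return (file, position, author, saved_path) rows whose author matches.

    `extract` is a hypothetical callable standing in for the metadata
    extractor; it must return a list of (author, saved_path) pairs.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    rows = []
    for path in paths:
        for pos, (author, saved) in enumerate(extract(path), start=1):
            if rx.search(author):
                rows.append((path, pos, author, saved))
    return rows

def to_csv(rows):
    """Render the matching rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["file", "position", "author", "saved_path"])
    writer.writerows(rows)
    return buf.getvalue()

# Canned, abbreviated records standing in for real extraction.
def fake_extract(path):
    return [("JPratt", r"A:\Iraq - security.doc"),
            ("MKhan", r"C:\TEMP\Iraq - security.doc")]

print(to_csv(search_authors(["blair.doc"], fake_extract, r"^mkhan$")))
```

Swapping `to_csv` for an HTML or database writer changes the output format without touching the search step, and `find_docs` can be replaced by a reader for a file list.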

9 comments:

Anonymous said...

I posted a HOWTO for anyone who wants to run your tool on Linux and does not know how to prepare the Perl side of things. You can find it at:

http://chicago-ediscovery.com/computer-forensic-howtos/howto-extract-metadata-microsoft-word-linux.html

Andrew Hoog
Chicago Electronic Discovery
http://chicago-ediscovery.com

Md. Abdur Rahman said...

Hi, it's impressive. If I want to use your Perl file as is and call it from my own metadata extraction tool written in PHP, would you mind? What restrictions might there be? Here is the usage scenario:
1. I have a user who will upload a file in .doc extension
2. I will use PHP to call your perl file to extract the metadata for me
3. I will receive it and use it for subsequent part of my system.

Can you please be generous on this issue? I want to thank you in advance.

H. Carvey said...

Please, feel free...thanks!

Sunny said...

Hi Harlan,

I cannot use wmd.pl with .docx files. It's giving me the following error:

"C:\Perl\site\lib>wmd.pl f:\hi.docx
Use of assignment to $[ is deprecated at C:/Perl/site/lib/OLE/Storage.pm line 61.
Use of assignment to $[ is deprecated at C:/Perl/site/lib/OLE/PropertySet.pm line 409.
--------------------
Statistics
--------------------
File = f:\hi.docx
Size = 50501 bytes
Magic = 0x0 ()
Version = 0
LangID = Unknown


Document was created on Windows.

Magic Created :
Magic Revised :

Can't call method "directory" without a package or object reference at C:\Perl\site\lib\wmd.pl line 59."

Please help

H. Carvey said...

Sunny,

wmd.pl isn't intended for .docx files...it was written for .doc files. The Office 2007+ file formats are different and as such require the use of different tools.
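One way to tell the two apart before picking a tool is to look at the first few bytes of the file rather than the extension. A minimal Python sketch, assuming the usual signatures (OLE compound files start with D0 CF 11 E0 A1 B1 1A E1, and the Office 2007+ formats are ZIP archives that start with "PK"); the function names are made up for this example:

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header

def container_type(header):
    """Classify a file by its leading bytes: 'ole', 'zip', or 'unknown'."""
    if header.startswith(OLE_MAGIC):
        return "ole"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"

def sniff(path):
    """Read the first eight bytes of `path` and classify the container."""
    with open(path, "rb") as fh:
        return container_type(fh.read(8))
```

A result of "zip" for a file named .doc or .docx means the OLE-based scripts won't be able to parse it.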

SurfKahuna said...

Hello Mr. Carvey. My computer forensics class (MSIA program, 600-level course, will remain anonymous for now) is using one of your textbooks along with some of your Perl scripts. Thank you very much for the easy-to-follow, chock-full-of-information resources! Anyways, I have a lingering question that I thought I would bring to the man himself. In fact, I am going to paste the exact question I posted in our class forum:


Prof./Class,

Does anyone know exactly what the Trash Bin metadata is in a Word Document? I searched for about an hour on the topic just now. I found articles on Word metadata, articles on the severity of security leaks caused by such metadata, etc. I then looked up the Perl modules that were used to pull the trash information. However, the closest I came to finding what the trash bins really are is the following article: http://windowsir.blogspot.com/2006/09/metadata-and-ediscovery.html

The article is from Carvey's blog. He notes:

"The module extracts the information, it just needs to be prettied up a bit. Another benefit of the module is that it extracts additional information from the OLE contents of the file. First off, it extracts information about the OLE "trash bins", where useful data could be hidden:


Trash Bin Size
BigBlocks 0
SystemSpace 940
SmallBlocks 0
FileEndSpace 1450"

OK.... so what does that mean? What exactly are the four items noted and how does one go about trying to extract data stored in those "trash bins?" I also found http://www.cpan.org/authors/id/H/HC/HCARVEY/File-MSWord-0.1.readme, where Carvey discusses trash bins. He notes:

"%hash = $word->readTrash()

Reads the trash bins in an OLE/compound/structured storage document.
Returns a hash of hashes with the names of the trash bins as keys, and
the size and contents of the bins as subkeys."

Again, this does not really tell me what I would like to know about the trash bins. I would really like to know what those buggers are, so any feedback is appreciated. Thanks prof/all!

Can you please shed more light on the trash bins for me?

--- said...

can we modify the tool to handle pptx and xlsx files as well?

Regards

H. Carvey said...

Nope, they're completely different formats.

Try ExifTool.