What steps are you using to perform data reduction? What are you doing to sort the wheat from the chaff, as it were?
Some of the data reduction steps I'm aware of include:
- Hash sets - look for known good or known bad files
- File signature analysis - look for files whose header (the "magic number" in the first few bytes) doesn't match the file extension
- File version info - parse binary files for file version info, and flag those that don't have any
- Keyword searches - depending on the case, look for files/sectors containing certain keywords
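To make the file signature item concrete, here's a rough sketch of how I'd picture that check. The signature table is a tiny illustrative sample that I put together (not a complete or authoritative list), and the names `SIGNATURES`, `detect_extensions`, and `signature_mismatch` are made up for this example:

```python
from pathlib import Path

# Tiny illustrative sample of well-known file signatures mapped to the
# extensions they usually appear with. A real tool would use a far larger table.
SIGNATURES = {
    b"\xff\xd8\xff": {".jpg", ".jpeg"},
    b"\x89PNG\r\n\x1a\n": {".png"},
    b"%PDF-": {".pdf"},
    b"MZ": {".exe", ".dll", ".sys"},
    b"PK\x03\x04": {".zip", ".docx", ".xlsx", ".pptx", ".jar"},
}


def detect_extensions(header):
    """Return the set of extensions expected for this header, or None if unknown."""
    for magic, extensions in SIGNATURES.items():
        if header.startswith(magic):
            return extensions
    return None


def signature_mismatch(path):
    """Return True when the header is recognized but the extension doesn't fit it."""
    path = Path(path)
    with open(path, "rb") as fh:
        header = fh.read(16)
    expected = detect_extensions(header)
    if expected is None:
        # Unrecognized header: this sketch can't say anything either way.
        return False
    return path.suffix.lower() not in expected
```

A file named `vacation.jpg` that starts with `MZ` would get flagged here, which is exactly the sort of thing worth a closer look. Files with an unrecognized header are simply skipped, so this reduces the pile rather than clearing anything.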
Hash sets can be used to sift through the hundreds or thousands of operating system files that we know are good and therefore aren't interested in. You can also use hash sets to flag known bad files.
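Here's a minimal sketch of what that hash set filtering might look like. I'm assuming SHA-256 and plain Python sets of lowercase hex digests for the known good and known bad lists; the function names `file_sha256` and `triage` are just ones I picked for illustration:

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(paths, known_good, known_bad):
    """Sort files into good / bad / unknown buckets by hash set membership.

    known_good and known_bad are assumed to be sets of lowercase hex digests.
    """
    buckets = {"good": [], "bad": [], "unknown": []}
    for path in paths:
        file_hash = file_sha256(path)
        if file_hash in known_bad:
            # Check the bad set first so a hit is never masked.
            buckets["bad"].append(path)
        elif file_hash in known_good:
            buckets["good"].append(path)
        else:
            buckets["unknown"].append(path)
    return buckets
```

The "unknown" bucket is what's left to examine once the known good files are set aside. Whatever algorithm you hash with has to match the one used to build your hash sets, so swap out `hashlib.sha256` as needed.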
What else are folks doing?