New XML-Based Files: Implications for Forensics
For more than 20 years, programs such as Microsoft Word have stored their documents in binary file formats. That's changing as Microsoft, Sun Microsystems, and other developers migrate to new XML-based formats for document files. Document files are of critical interest to forensic practitioners because of the data they contain; they're also a rich topic for forensic research. Although most investigations concern themselves solely with a document's surface content, some go deeper, examining the metadata or deleted material that's still present in the file. Investigators can, for instance, use metadata to identify individuals potentially responsible for unauthorized file modification, establish plagiarism, or even indicate falsification of evidence. Unfortunately, metadata can also be modified to implicate innocent people -- and the ease of modifying these new files means that it's far easier to make malicious modifications that are difficult (if not impossible) to detect. With so many aspects to consider, we present a forensic analysis of the two rival XML-based office document file formats: Office Open XML (OOX), which Microsoft adopted for its Office software suite, and the OpenDocument Format (ODF), used by Sun's OpenOffice software. We detail how forensic tools can exploit features in these file formats and show how these formats could cause problems for forensic practitioners. For additional information on the development and increased use of these two file formats, see the "Background" sidebar.
To begin our analysis, we created multiple ODF and OOX files using Microsoft Office 2007 for Windows, Microsoft Office 2008 for Macintosh, OpenOffice 2.3.1, and NeoOffice 2.2.2 (a version of OpenOffice that runs under MacOS).
Overall, we found that ODF and OOX files tend to be smaller than equivalent legacy non-XML files, almost certainly a result of ZIP compression. Although it's trivial to add parts to or remove parts from a ZIP archive after its creation, we found that in many cases doing so corrupted the file so that Microsoft Office or OpenOffice couldn't process it.
The ZIP structure for these files is useful when performing data recovery or file carving. (File carving is the process of recognizing files by their content, rather than file system metadata. Carving is frequently used for recovering files from devices that have hardware errors, have been formatted, or have been partially overwritten.) Because each part of the archive includes a multibyte signature and a 32-bit cyclic redundancy check (CRC32) for validation, we can recover parts of a ZIP archive even when other parts of it are damaged, missing, or otherwise corrupted. We can also use the CRC32 and relative offsets within the archive to automatically reassemble fragmented ZIP files. We can then manually process recovered parts or insert them into other OOX/ODF files to view the data. ODF and OOX both contain a ZIP directory as the last structure in the file. We can examine this directory using standard tools, such as the Unix unzip command or Sun's JAR.
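The carving idea described above can be illustrated with a minimal Python sketch. It scans a raw byte buffer for the ZIP local file header signature, decompresses each candidate part, and keeps only parts whose CRC32 matches the value in the header. The function names carve_zip_parts and _read_part are hypothetical, and the sketch assumes the CRC32 and sizes are stored in the local header itself (parts that defer them to a trailing data descriptor are skipped). It does not attempt to reassemble fragmented files.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"                    # local file header signature
LOCAL_HDR = struct.Struct("<4sHHHHHIIIHH")   # fixed 30-byte header layout


def _read_part(image: bytes, pos: int):
    """Return (name, data) for the part whose header starts at pos, or None."""
    if pos + LOCAL_HDR.size > len(image):
        return None
    (_sig, _ver, flags, method, _time, _date,
     crc, csize, _usize, name_len, extra_len) = LOCAL_HDR.unpack_from(image, pos)
    if flags & 0x08:          # CRC and sizes live in a trailing data descriptor
        return None           # not handled in this sketch
    name_start = pos + LOCAL_HDR.size
    data_start = name_start + name_len + extra_len
    payload = image[data_start:data_start + csize]
    if name_len == 0 or len(payload) < csize:
        return None           # implausible header or truncated part
    try:
        if method == 0:       # stored
            data = payload
        elif method == 8:     # deflate (raw stream, no zlib wrapper)
            data = zlib.decompressobj(-zlib.MAX_WBITS).decompress(payload)
        else:
            return None
    except zlib.error:
        return None
    if zlib.crc32(data) != crc:
        return None           # damaged part or false-positive signature
    name = image[name_start:name_start + name_len].decode("utf-8", "replace")
    return name, data


def carve_zip_parts(image: bytes):
    """Yield (offset, name, data) for every part that passes its CRC32 check."""
    pos = image.find(LOCAL_SIG)
    while pos != -1:
        part = _read_part(image, pos)
        if part is not None:
            yield (pos, *part)
        pos = image.find(LOCAL_SIG, pos + 1)
```

Because each part is validated independently, the sketch can return intact parts from an image whose central directory is missing, which a standard ZIP reader would refuse to open.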
ODF has a second directory, an XML data structure stored in the META-INF/manifest.xml part, that lists the document's parts. OOX files store references to the additional document parts in the [Content_Types].xml and .rels parts, alongside the parts that hold the document contents themselves.
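As a rough illustration of reading these second-level directories, the following Python sketch returns the parts a package declares. The function name declared_parts is hypothetical. For OOX, the sketch reads only the Override entries of [Content_Types].xml and ignores both the extension-based Default entries and the .rels parts, so it is a starting point rather than a complete resolver. The standard-library XML parser is used for brevity; a tool handling untrusted evidence would want a hardened parser.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ODF_MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
OOX_TYPES_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def declared_parts(package: bytes) -> dict:
    """Map each declared part name to its media type for an ODF or OOX package."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        names = set(zf.namelist())
        if "META-INF/manifest.xml" in names:            # ODF manifest
            root = ET.fromstring(zf.read("META-INF/manifest.xml"))
            return {
                entry.get(f"{{{ODF_MANIFEST_NS}}}full-path"):
                    entry.get(f"{{{ODF_MANIFEST_NS}}}media-type")
                for entry in root.iter(f"{{{ODF_MANIFEST_NS}}}file-entry")
            }
        if "[Content_Types].xml" in names:              # OOX content types
            root = ET.fromstring(zf.read("[Content_Types].xml"))
            return {
                override.get("PartName"): override.get("ContentType")
                for override in root.iter(f"{{{OOX_TYPES_NS}}}Override")
            }
    raise ValueError("neither an ODF manifest nor an OOX content-types part found")
```

Comparing the returned names against the ZIP directory is one way a tool might flag parts that are present in the archive but never declared.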
Both file formats include a special XML file that contains the document's main flow. In ODF, this file is called content.xml. The primary contents of an OOX word processing document created with Microsoft Office 2007 or 2008 reside in the document.xml part, although the standard allows a different name to be specified in the [Content_Types].xml part.
Forensic tools should extract text from the content parts, but tool developers must understand that text can be present in other document parts as well. For example, Microsoft Word allows other Word documents to be embedded within a Word document using the "Insert/Object..." menu command. These documents are embedded as a named .docx file inside the ZIP archive, as Figure 1 shows. In such an instance, where files are embedded within other files, investigators should analyze files recursively using a special forensic tool.
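A recursive analysis of this kind might look like the following Python sketch. The function name extract_text, the "!" separator used to show nesting, and the max_depth guard are all assumptions made for illustration. The sketch treats any part that is itself a valid ZIP archive as an embedded package and descends into it; it extracts text from every .xml part rather than only from the main content part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(package: bytes, prefix: str = "", max_depth: int = 5):
    """Yield (part_path, text) for every XML part, descending into embedded packages."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            path = prefix + name
            if name.lower().endswith(".xml"):
                try:
                    root = ET.fromstring(data)
                except ET.ParseError:
                    continue                  # damaged part; skipped in this sketch
                text = "".join(root.itertext()).strip()
                if text:
                    yield path, text
            elif max_depth > 0 and zipfile.is_zipfile(io.BytesIO(data)):
                # An embedded document is itself a ZIP package: recurse into it.
                yield from extract_text(data, path + "!", max_depth - 1)
```

The depth limit is there because a hostile file could nest archives deeply; a real tool would also want limits on decompressed size.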
The most straightforward way for forensic practitioners to handle these new compound document formats is to save the file and then open it with a compliant program. Although this approach works, it raises several potential problems:
The compound document might contain active content that the forensic investigator doesn't wish to execute. (Despite assurances from Microsoft and others that these file formats are safer, both ODF and OOX have provisions for storing active content [3] and therefore can carry viruses.)
Links to external Web sites, if followed when the file is opened, can reveal that someone has captured the file and is analyzing it.
If parts of the file are overwritten or missing, applications such as Word or OpenOffice might be unable to open the files.
Desktop applications can overlook or ignore critical information of interest to the forensic investigator.
Forensic tools that parse these files directly can avoid these problems. To this end, we tested both Guidance's EnCase 6.11 and AccessData's Forensic ToolKit 1.8 and determined that they could display and search for text inside ODF files, OOX files, and OOX files embedded as objects inside other OOX files.
Both the compressed nature of ODF and OOX files and the multiple codings for the strings possible within XML represent a significant problem for forensic program developers. Because all the text is compressed, it's no longer possible to find it by scanning for strings within raw disk or document images. And because XML allows strings to be coded in hexadecimal or even interrupted by comment characters, any forensic tool that takes shortcuts in decoding the ZIP archive or implementing the full XML schema could return false negatives when performing searches.
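Both effects can be seen in a small Python sketch. The part below is a made-up example, not taken from a real document: the word "secret" occurs three times, once spelled out, once with a hexadecimal character reference, and once interrupted by a comment. The function names are hypothetical. A raw byte scan finds nothing in the compressed archive and only one occurrence in the uncompressed XML, whereas decompressing and parsing the XML finds all three.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical part: three occurrences of "secret", only one spelled literally.
PART = (
    b"<doc>"
    b"<p>a secret plan</p>"
    b"<p>a se&#x63;ret plan</p>"            # hexadecimal character reference
    b"<p>a sec<!-- split -->ret plan</p>"   # comment inside the text
    b"</doc>"
)


def build_archive(part: bytes) -> bytes:
    """Wrap the part in a deflate-compressed ZIP archive, as ODF and OOX do."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", part)
    return buf.getvalue()


def count_raw(data: bytes, needle: str) -> int:
    """Naive search: count literal byte matches without any decoding."""
    return data.count(needle.encode("utf-8"))


def count_parsed(archive: bytes, needle: str) -> int:
    """Decompress every XML part, parse it, and search the decoded text."""
    hits = 0
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                root = ET.fromstring(zf.read(name))
                hits += "".join(root.itertext()).count(needle)
    return hits
```

With these definitions, count_raw(build_archive(PART), "secret") reports no matches, count_raw(PART, "secret") reports one, and count_parsed(build_archive(PART), "secret") reports three.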
Document files are fundamentally container files -- that is, single files (a consecutive stream of bytes) that contain multiple data objects. A typical Microsoft Word file might contain data streams associated with the summary info, the main text, tables, and embedded images. The file also contains numerous forms of metadata -- both for the document and for the container itself.
Sun Microsystems submitted the OpenOffice OpenDocument Format (ODF) to the Organization for the Advancement of Structured Information Standards (Oasis). The ODF was approved as an Oasis standard on 1 May 2005 and adopted as ISO/IEC 26300 the following year.
Because of the verbose nature of XML, ODF calls for the XML files to be compressed. Parsing XML can also be time-consuming, so ODF represents a single document as multiple XML files bundled together into a single ZIP archive. Images and other binary objects aren't coded as XML but are stored natively as binary sections in the ZIP archive.
Following the introduction of ODF, Microsoft introduced its own XML-based document file formats called WordprocessingML, SpreadsheetML, and PresentationML. Like ODF, Office Open XML (OOX) is a ZIP archive file consisting of multiple XML document elements (unless the file is encrypted, in which case it's an OLE compound file). Microsoft refers to the file as a package, with each file within the archive referred to as a part. As with ODF, structured information is first encoded into XML and compressed; embedded images are stored as binary objects within their own parts.
Because Microsoft's XML languages are defined in terms of behaviors built into Microsoft Office, OOX files can't be readily translated into ODF or vice versa.
Microsoft's Office 2003 allowed these formats to be used as alternative document file formats; with Microsoft Office 2007, the XML-based document formats became the default file format. Native support for Office Open XML is provided today in Microsoft Office 2007 for Windows and Office 2008 for Macintosh. Additionally, several other programs have the ability to read or write Word 2007 files.
ZIP files consist of one or more file sections followed by a central directory. Each file section consists of a local file header that includes metadata such as the file's directory and filename, time stamp, compression method used, and additional information, followed by the actual file data and a data descriptor that includes a 32-bit checksum. The Central Directory Record contains the names of all the files, their offsets within the file, and their time stamps.
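The central directory fields described above can be read with Python's standard zipfile module, much as the unzip command would list them. The following is a minimal sketch; the function name list_directory and the record keys are chosen for illustration.

```python
import io
import zipfile


def list_directory(package: bytes):
    """Return one record per central directory entry, in directory order."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return [
            {
                "name": info.filename,
                "local_header_offset": info.header_offset,
                "timestamp": info.date_time,      # (year, month, day, h, m, s)
                "crc32": f"{info.CRC:08x}",
                "method": info.compress_type,     # 0 = stored, 8 = deflate
                "compressed": info.compress_size,
                "uncompressed": info.file_size,
            }
            for info in zf.infolist()
        ]
```

Note that this relies on the central directory being intact; it fails on an archive whose final structure is missing or damaged.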
The new XML-based file formats have several advantages when compared with binary file formats:
Because they're compressed, files in the new format are typically smaller than files in the legacy format.
Programs that process document files need only extract the sections that they're concerned with and can ignore the rest.
Only sections that could contain computer viruses need to be scanned for them.
Even if parts of the file are corrupted, complete ZIP sections can still be recovered. This could allow embedded images or even content to be recovered under some circumstances.
Existing tools for handling ZIP files and XML documents make it easier for developers to write programs that can automatically process data stored in XML document files than to process legacy Word documents. However, because these are ZIP files of XML documents, they're far easier to modify. With off-the-shelf tools, an attacker can open one of these files and selectively add or remove information.
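To give a sense of how little effort such a modification takes, and to help tool developers build test cases, here is a minimal Python sketch that copies a package while swapping the contents of one part. The function name replace_part is hypothetical. The sketch reuses each part's original directory entry, so the stored time stamp is carried over unchanged while the CRC32 and sizes are recomputed for the new data.

```python
import io
import zipfile


def replace_part(package: bytes, part_name: str, new_data: bytes) -> bytes:
    """Return a copy of the package with one part's contents swapped out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
         zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = new_data if info.filename == part_name else src.read(info)
            # Reusing the original ZipInfo keeps the name, time stamp, and
            # compression method; zipfile recomputes the CRC32 and sizes.
            dst.writestr(info, data)
    return out.getvalue()
```

The resulting archive passes a standard integrity check, which suggests that neither the CRC32 values nor the ZIP time stamps should be treated as evidence that a part is unmodified.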
Both ODF and OOX are still relatively rare, but their numbers are increasing. We performed Google searches by file type in March, July, and September 2008, as well as January 2009, and saw the number of OOX files nearly triple during this study period.
The "Save preview picture" option under "Advanced Options" in the "Save" dialog box isn't checked by default in Word and Excel 2007, as it is in PowerPoint.
Embedded thumbnails can be valuable in forensic practice. If the thumbnail doesn't match the document, then someone modified the thumbnail or the document after the file's creation. If the file is damaged, corrupted, or can't be completely recovered, the thumbnail might still give the investigator some idea of what the document contained.
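Pulling thumbnails out of a package for inspection can be sketched in a few lines of Python. The function name find_thumbnails is hypothetical, and matching on the word "thumbnail" in the part name is a heuristic assumed for this example rather than a rule taken from either specification.

```python
import io
import zipfile


def find_thumbnails(package: bytes) -> dict:
    """Return {part_name: raw_bytes} for parts that look like thumbnails."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            # Heuristic (assumption): thumbnail parts carry "thumbnail" in their name.
            if "thumbnail" in name.lower() and not name.endswith("/")
        }
```

The returned bytes can be written to disk and viewed, or compared against a fresh rendering of the document to check whether the two still match.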
For completeness, we also examined the thumbnail images for metadata. The .jpg thumbnails created by Microsoft Office contained metadata for only the image size and resolution, whereas the .pdf thumbnails created by NeoOffice filled in the PDF's creator, producer, and creation date. However, these values merely indicated the program that created the thumbnail, not the user who ran the program, as Figure 2 shows.
Unique identifiers stored within documents can play an important role in many forensic investigations. Because unique identifiers remain the same even when the document is edited, we can use them to track the movement of documents through or between organizations. By correlating unique identifiers found on multiple hard drives, it's possible to find previously unknown social networks. We can also use unique identifiers that survive copying and pasting to show plagiarism.
Unique identifiers can also raise privacy concerns. We found many unique identifiers stored within the ODF and OOX files. Some of them were "unique" in that they didn't occur elsewhere within a specific XML part or within the ZIP file: primarily, these were 32-bit numbers stored in hexadecimal. Others were 128-bit numbers unique for a particular generation of a particular document. We didn't find any unique identifiers that appeared to be unique for a specific machine.
For example, OOX defines revision identifiers for paragraphs (rsidP and rsidR). Microsoft Word uses these identifiers to determine the editing session in which a user added a paragraph to the main document, to aid in Word's "Compare Documents" feature. According to the specification, each rsidR value should be unique to one editing session within a document: instances with the same value within a single document indicate that modifications occurred during the same editing session.
The primary value of these identifiers to forensic examiners is document tracking. Using these numbers, it's possible to show that one file probably resulted from editing another file (although there is, of course, a one in four billion chance that two of these 32-bit numbers will be the same). However, the new XML-based formats also make it easier to change unique IDs, making it much easier to maliciously implicate an innocent computer user or create the appearance of a false correlation.
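A comparison of this kind can be sketched in Python as follows. The function names are hypothetical, the default part name word/document.xml follows the Office 2007 and 2008 behavior described earlier, and collecting every attribute whose local name begins with "rsid" is an assumption made to keep the example short. Given how easily these values can be altered, an overlap is an indication worth investigating, not proof of a shared history.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def collect_rsids(package: bytes, part: str = "word/document.xml") -> set:
    """Collect every rsid* attribute value found in the given part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(part))
    rsids = set()
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            local = attr.rsplit("}", 1)[-1]    # drop the "{namespace}" prefix
            if local.startswith("rsid"):
                rsids.add(value.upper())       # hexadecimal, so ignore case
    return rsids


def shared_rsids(a: bytes, b: bytes) -> set:
    """Return the revision identifiers that two packages have in common."""
    return collect_rsids(a) & collect_rsids(b)
```

A non-empty result from shared_rsids suggests the two files may descend from a common document and are candidates for closer examination.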