DOC2TXT(1)
NAME
doc2txt, xls2txt, olefs, mswordstrings, msexceltables − extract printable strings from Microsoft Office documents
SYNOPSIS
doc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings /mnt/doc/WordDocument
aux/msexceltables [ -n ] [ -t ] [ -a ] [ -d delim ] /mnt/doc/Workbook
DESCRIPTION
Doc2txt is a shell script that uses olefs and mswordstrings to extract the printable text from the body of a Microsoft Word document. Xls2txt performs a similar function for Microsoft Excel documents.
Microsoft Office documents are stored in OLE (Object Linking and Embedding) format, which is a scaled-down version of Microsoft’s FAT file system. Olefs presents the contents of an Office document as a file system on mtpt, which defaults to /mnt/doc. Mswordstrings or msexceltables may then be used to parse the files inside, extracting a text stream. Msexceltables accepts the following options to control the formatting of its output:
-n Disable padding of fields to the column width.
-t Truncate fields to the column width.
-a Attempt conversion of non-tabular sheets in the workbook, such as charts.
-d delim
Set the interfield delimiter to the string delim; the default is a single space.
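The combined effect of -n, -t and -d on a single row can be sketched as follows. This is an illustrative Python model only, not the code of msexceltables: the function format_row, its parameters, and the choice of left-justified padding are assumptions made for the example, and the real program may align or size columns differently.

```python
def format_row(fields, widths, delim=" ", pad=True, truncate=False):
    """Format one row of cell strings (hypothetical model of the options).

    pad=False     models -n (no padding to the column width)
    truncate=True models -t (cut fields down to the column width)
    delim         models -d delim (string placed between fields)
    """
    out = []
    for text, width in zip(fields, widths):
        if truncate:
            text = text[:width]       # drop characters past the column width
        if pad:
            text = text.ljust(width)  # assumption: pad on the right with spaces
        out.append(text)
    return delim.join(out)


if __name__ == "__main__":
    row = ["Widget", "12", "in stock"]
    widths = [8, 4, 5]
    print(repr(format_row(row, widths)))                         # default behaviour
    print(repr(format_row(row, widths, pad=False)))              # like -n
    print(repr(format_row(row, widths, truncate=True)))          # like -t
    print(repr(format_row(row, widths, delim="|", pad=False)))   # like -n -d '|'
```

In this model a field longer than its column is printed in full unless truncation is requested, so columns stay aligned only when -t is also given.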
SOURCE
/sys/src/cmd/aux/mswordstrings.c
/sys/src/cmd/aux/msexceltables.c
/sys/src/cmd/aux/olefs.c
/rc/bin/xls2txt
/rc/bin/doc2txt
BUGS
Msexceltables cannot parse files containing rich text field descriptions or Asian phonetic pronunciation hints, because these formats are not documented. It has been tested only on BIFF8 files generated by MS Office 97. Caveat emptor.
SEE ALSO
strings(1)
“Microsoft Word 97 Binary File Format”, available online at Microsoft’s developer home page.
“LAOLA Binary Structures”, http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
“OpenOffice.org’s Excel Documentation”, http://sc.openoffice.org/excelfileformat.pdf
Plan 9 — January 06, 2005