DOC2TXT(1)

NAME

doc2txt, xls2txt, olefs, mswordstrings, msexceltables − extract printable strings from Microsoft Office documents

SYNOPSIS

doc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings /mnt/doc/WordDocument
aux/msexceltables [ -n ] [ -t ] [ -a ] [ -d delim ] /mnt/doc/Workbook

DESCRIPTION

Doc2txt is a shell script that uses olefs and mswordstrings to extract the printable text from the body of a Microsoft Word document. Xls2txt performs a similar function for Microsoft Excel documents.

Microsoft Office documents are stored in OLE (Object Linking and Embedding) format, which is a scaled-down version of Microsoft’s FAT file system. Olefs presents the contents of an Office document as a file system on mtpt, which defaults to /mnt/doc. Mswordstrings or msexceltables may then be used to parse the files inside, extracting a text stream. Msexceltables may be given options to control the formatting of its output.
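As a rough illustration of what an OLE container looks like on disk, the following Python sketch checks for the eight-byte signature that begins an OLE compound file. It is not part of the Plan 9 tools and does not reflect how olefs is implemented; the function names are hypothetical and chosen only for this example.

```python
# Signature found in the first eight bytes of an OLE compound file.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if data begins with the OLE compound-file signature."""
    return data[:8] == OLE_MAGIC


def file_looks_like_ole(path: str) -> bool:
    """Read the first eight bytes of the file at path and test them."""
    with open(path, "rb") as f:
        return looks_like_ole(f.read(8))
```

A check of this kind can tell whether a file is worth handing to olefs at all; it says nothing about whether the streams inside (WordDocument, Workbook) are present or parseable.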

-n Disable field padding to the column width.

-t Truncate fields to the column width.

-a Attempt conversion of non-tabular sheets in the workbook (for example, charts).

-d delim
Set the interfield delimiter to the string delim; the default is a single space.
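The effect of the formatting options on a single row can be sketched in Python as follows. This is only an illustration of the semantics described above, not the msexceltables source: the function format_row and its parameters are hypothetical, the column widths are passed in by the caller, and left-justified padding with spaces is an assumption.

```python
def format_row(fields, widths, pad=True, truncate=False, delim=" "):
    """Join one row of text fields into a line.

    pad=False     corresponds to -n (no padding to the column width)
    truncate=True corresponds to -t (cut fields to the column width)
    delim         corresponds to -d delim (default: a single space)
    """
    out = []
    for text, width in zip(fields, widths):
        if truncate:
            text = text[:width]
        if pad:
            # Assumption: fields are left-justified and padded with spaces.
            text = text.ljust(width)
        out.append(text)
    return delim.join(out)


row = ["ab", "cdef"]
widths = [4, 3]
print(repr(format_row(row, widths)))                         # padded
print(repr(format_row(row, widths, pad=False)))              # like -n
print(repr(format_row(row, widths, truncate=True)))          # like -t
print(repr(format_row(row, widths, pad=False, delim="|")))   # like -n -d '|'
```

Disabling padding together with a distinctive delimiter gives output that is easier to split mechanically, while the padded default is easier to read on a terminal.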

SOURCE

/sys/src/cmd/aux/mswordstrings.c
/sys/src/cmd/aux/msexceltables.c
/sys/src/cmd/aux/olefs.c
/rc/bin/xls2txt
/rc/bin/doc2txt

BUGS

Msexceltables cannot parse files containing rich text field descriptions or Asian phonetic pronunciation hints, due to a lack of documentation on these formats. It has only been tested on BIFF8 files generated by MS Office 97; caveat emptor.

SEE ALSO

strings(1)
“Microsoft Word 97 Binary File Format”, available online at Microsoft’s developer home page.
“LAOLA Binary Structures”, http://snake.cs.tu-berlin.de:8081/~schwartz/pmh
“OpenOffice.Org’s Excel Documentation”, http://sc.openoffice.org/excelfileformat.pdf

Plan 9  —  January 06, 2005
