DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP),
OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE),
ODFXML (FODP, FODS, FODT), PDF, EML and HTML documents to plain text.
Extracts metadata and annotations.
对于解析像office2007这类的文件,doctotext只是识别出来格式是OOXML类型,并没有细分是word还是execl。
namespace doctotext
{enum TableStyle { TABLE_STYLE_TABLE_LOOK, TABLE_STYLE_ONE_ROW, TABLE_STYLE_ONE_COL, };enum UrlStyle { URL_STYLE_TEXT_ONLY, URL_STYLE_EXTENDED, URL_STYLE_UNDERSCORED, };class ListStyle {};struct FormattingStyle {TableStyle table_style;UrlStyle url_style;ListStyle list_style;}; enum XmlParseMode {PARSE_XML, FIX_XML, STRIP_XML};
}
class PlainTextExtractor
{//文件类型的枚举enum ParserType{......}//实现结构体struct Implementation; //实现结构体私有变量Implementation *impl;
}
implementation中实现的函数列表
isRTF [PlainTextExtractor::Implementation]
isODFOOXML [PlainTextExtractor::Implementation]
isXLS [PlainTextExtractor::Implementation]
isDOC [PlainTextExtractor::Implementation]
isPPT [PlainTextExtractor::Implementation]
isHTML [PlainTextExtractor::Implementation]
isIWork [PlainTextExtractor::Implementation]
isXLSB [PlainTextExtractor::Implementation]
isPDF [PlainTextExtractor::Implementation]
isEML [PlainTextExtractor::Implementation]
isODFXML [PlainTextExtractor::Implementation]
parseRTF [PlainTextExtractor::Implementation]
parseODFOOXML [PlainTextExtractor::Implementation]
parseXLS [PlainTextExtractor::Implementation]
parseDOC [PlainTextExtractor::Implementation]
parsePPT [PlainTextExtractor::Implementation]
parseHTML [PlainTextExtractor::Implementation]
parseIWork [PlainTextExtractor::Implementation]
parseXLSB [PlainTextExtractor::Implementation]
parsePDF [PlainTextExtractor::Implementation]
parseTXT [PlainTextExtractor::Implementation]
parseEML [PlainTextExtractor::Implementation]
parseODFXML [PlainTextExtractor::Implementation]
parseRTFMetadata [PlainTextExtractor::Implementation]
parseODFOOXMLMetadata [PlainTextExtractor::Implementation]
parseXLSMetadata [PlainTextExtractor::Implementation]
parseDOCMetadata [PlainTextExtractor::Implementation]
parsePPTMetadata [PlainTextExtractor::Implementation]
parseHTMLMetadata [PlainTextExtractor::Implementation]
parseIWorkMetadata [PlainTextExtractor::Implementation]
parseXLSBMetadata [PlainTextExtractor::Implementation]
parsePDFMetadata [PlainTextExtractor::Implementation]
parseEMLMetadata [PlainTextExtractor::Implementation]
parseODFXMLMetadata [PlainTextExtractor::Implementation]
PlainTextExtractor [PlainTextExtractor]
~PlainTextExtractor [PlainTextExtractor]
setVerboseLogging [PlainTextExtractor]
setLogStream [PlainTextExtractor]
setFormattingStyle [PlainTextExtractor]
setXmlParseMode [PlainTextExtractor]
setManageXmlParser [PlainTextExtractor]
parserTypeByFileExtension [PlainTextExtractor]
parserTypeByFileExtension [PlainTextExtractor]
parserTypeByFileContent [PlainTextExtractor]
parserTypeByFileContent [PlainTextExtractor]
parserTypeByFileContent [PlainTextExtractor]
processFile [PlainTextExtractor]
processFile [PlainTextExtractor]
processFile [PlainTextExtractor]
processFile [PlainTextExtractor]
根据输入参数选项指定文件类型
