Are you searching for a method to extract text from DOCX or ODT files using PHP? In this article I will show you how to do it. This technique can be used to build a web crawler that indexes document files by their content, i.e. it can be used to create a document repository. The technique doesn't involve any third-party plugins or software. It will work in PHP 5.2+ and the only requirement is the ZIP extension: php_zip.dll on Windows, or the --enable-zip configure parameter on Linux. DOCX and ODT files are really ZIP archives that use the .docx or .odt extension instead of .zip. Hence we need a ZIP library for PHP in order to extract the data from them.
You can verify this fact yourself. Just try to open any DOCX or ODT file with a ZIP utility. Check out the screenshot below:
The text data is in word/document.xml for a DOCX file and in content.xml for an ODT file. To extract the text, all we need to do is read the contents of word/document.xml (for DOCX) or content.xml (for ODT) and then display that content after filtering out the XML tags present in it.
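To check from PHP which of the data files named above an archive actually contains, its entries can be listed with ZipArchive. The following is only a minimal sketch: the helper name listArchiveEntries() is hypothetical, and the demo archive is a tiny stand-in built on the fly, not a valid Word document.

```php
<?php
// Hypothetical helper (not part of the original article):
// return the names of all entries in a ZIP-based document.
function listArchiveEntries($filename) {
    $names = array();
    $zip = new ZipArchive;
    if (true !== $zip->open($filename)) {
        return $names; // not a readable ZIP archive
    }
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}

// Assumption: a stand-in archive so the sketch runs without a real document.
$demo = sys_get_temp_dir() . '/demo_' . uniqid() . '.docx';
$zip = new ZipArchive;
$zip->open($demo, ZipArchive::CREATE);
$zip->addFromString('[Content_Types].xml', '<Types/>');
$zip->addFromString('word/document.xml', '<w:document/>');
$zip->close();

$entries = listArchiveEntries($demo);
print_r($entries);
unlink($demo);
```

Pointing listArchiveEntries() at a real DOCX file should show word/document.xml among the entries, and a real ODT file should show content.xml.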
<?php
/* Name of the document file */
$document = 'attractive_prices.docx';

/**
 * Function to extract text
 */
function extracttext($filename) {
    // Check for extension
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // If it's a docx file
    if ($ext == 'docx')
        $dataFile = "word/document.xml";
    // else it must be an odt file
    else
        $dataFile = "content.xml";

    // Create a new ZIP archive object
    $zip = new ZipArchive;

    // Open the archive file
    if (true === $zip->open($filename)) {
        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it to a string
            $text = $zip->getFromIndex($index);

            // Close the archive file; its contents are already in memory
            $zip->close();

            // Load XML from a string (on an instance, not statically)
            // Ignore errors and warnings
            $xml = new DOMDocument();
            $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);

            // Remove XML formatting tags and return the text
            return strip_tags($xml->saveXML());
        }
        // Data file not present: close the archive file
        $zip->close();
    }

    // In case of failure return a message
    return "File not found";
}

echo extracttext($document);
?>
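One thing to keep in mind about the strip_tags() call in the function above: it removes the tags without putting anything in their place, so consecutive paragraphs can run together in the output. A possible refinement is sketched below. It is not part of the original technique, the helper name xmlToPlainText() is hypothetical, and it assumes paragraphs end with </w:p> in DOCX and </text:p> in ODT.

```php
<?php
// Hypothetical helper (not part of the original article):
// turn document XML into plain text with one line per paragraph.
function xmlToPlainText($xml) {
    // Assumption: paragraphs close with </w:p> (DOCX) or </text:p> (ODT).
    // Add a newline after each closing paragraph tag before stripping.
    $xml = str_replace(
        array('</w:p>', '</text:p>'),
        array("</w:p>\n", "</text:p>\n"),
        $xml
    );
    // Remove the tags, then decode entities such as &amp; left in the text.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES, 'UTF-8');
    return trim($text);
}

// A tiny hand-written fragment in the shape of word/document.xml.
$sample = '<w:document><w:body>'
        . '<w:p><w:r><w:t>Hello</w:t></w:r></w:p>'
        . '<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>'
        . '</w:body></w:document>';

echo xmlToPlainText($sample);
```

To try this with the function above, the return line could pass $xml->saveXML() to xmlToPlainText() instead of calling strip_tags() directly.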