How to extract text from DOCX and ODT files using PHP

Are you looking for a way to extract text from DOCX or ODT files using PHP? In this article I will show you how. The technique can be used to build a web crawler that indexes document files by their content, for example to create a searchable document repository. It does not involve any third-party plugins or software. It works in PHP 5.2+, and the only requirement is the ZIP extension: php_zip.dll on Windows, or the --enable-zip configure parameter on Linux. DOCX and ODT files are in fact ZIP archives that carry a .docx or .odt extension instead of .zip, so we need PHP's ZIP library to read the data inside them.
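Since everything below depends on the ZIP extension, it can help to check for it up front. The following is only a minimal sketch; zipSupportAvailable() is a hypothetical helper name that does not appear in the original script.

```php
<?php
// Hypothetical helper (not part of the original script): reports whether
// the zip extension is loaded and the ZipArchive class is usable.
function zipSupportAvailable()
{
    return extension_loaded('zip') && class_exists('ZipArchive');
}

if (!zipSupportAvailable()) {
    // Point the reader back at the requirement described above.
    echo "ZIP support is missing: enable php_zip.dll (Windows) or build PHP with --enable-zip (Linux).\n";
}
```

Running the check once at the start of a script avoids a fatal "class not found" error later, when ZipArchive is first instantiated.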

You can verify this yourself: open any .docx or .odt file with a ZIP utility and you will see the XML files it contains.
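The same check can be done from PHP itself. Here is a small sketch that lists the entries of a document archive with ZipArchive; listArchiveEntries() is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper (illustration only): returns the names of all entries
// inside a DOCX/ODT file, or false if the file cannot be opened as a ZIP archive.
function listArchiveEntries($filename)
{
    $zip = new ZipArchive;

    // open() returns true on success and an error code otherwise
    if ($zip->open($filename) !== true) {
        return false;
    }

    $names = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        // getNameIndex() gives the entry name stored at position $i
        $names[] = $zip->getNameIndex($i);
    }

    $zip->close();

    return $names;
}
```

Calling print_r(listArchiveEntries('attractive_prices.docx')) on a DOCX file should show word/document.xml among the entries, while an ODT file should show content.xml.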

The text data is stored in word/document.xml for DOCX files and in content.xml for ODT files. To extract the text, all we need to do is read the contents of word/document.xml (for DOCX) or content.xml (for ODT) and then display it after filtering out the XML tags it contains.

<?php

/* Name of the document file */
$document = 'attractive_prices.docx';

/**
 * Extract the text of a DOCX or ODT file.
 */
function extracttext($filename) {
    // Determine the extension (case-insensitive)
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // DOCX keeps its text in word/document.xml;
    // anything else is assumed to be an ODT file (content.xml)
    if ($ext == 'docx') {
        $dataFile = "word/document.xml";
    } else {
        $dataFile = "content.xml";
    }

    // Create a new ZIP archive object
    $zip = new ZipArchive;

    // Open the archive file
    if (true === $zip->open($filename)) {
        $text = false;

        // If successful, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // Index found! Now read it into a string
            $text = $zip->getFromIndex($index);
        }

        // Close the archive file before returning
        $zip->close();

        if ($text !== false && $text !== '') {
            // Load the XML from the string, ignoring errors and warnings.
            // loadXML() is called on an instance, and LIBXML_NOENT is left out
            // so that untrusted documents cannot pull in external entities.
            $xml = new DOMDocument();
            if ($xml->loadXML($text, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                // Remove XML formatting tags and return the text
                return strip_tags($xml->saveXML());
            }
        }
    }

    // In case of failure return a message
    return "File not found";
}

echo extracttext($document);

?>
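One limitation of stripping the tags directly is that paragraphs run together, because paragraph boundaries exist only as tags. Here is a minimal sketch of one way around this. It assumes that paragraphs are closed with </w:p> in DOCX and </text:p> in ODT (the conventional namespace prefixes), and xmlToPlainText() is a hypothetical helper that is not part of the script above.

```php
<?php
// Hypothetical helper (illustration only): converts document XML to plain text
// while keeping one line per paragraph.
// Assumption: paragraphs end with </w:p> (DOCX) or </text:p> (ODT).
function xmlToPlainText($xml)
{
    // Add a newline after every closing paragraph tag so that the
    // paragraph boundary survives once the tags are stripped.
    $xml = str_replace(
        array('</w:p>', '</text:p>'),
        array("</w:p>\n", "</text:p>\n"),
        $xml
    );

    // Remove all tags and trim the leading/trailing whitespace
    return trim(strip_tags($xml));
}
```

Inside extracttext(), the return statement could then pass $xml->saveXML() through xmlToPlainText() instead of calling strip_tags() directly. Empty self-closing paragraphs and ODT headings are not handled by this sketch.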

Source: Botskool. By: Srivastava Shashwat

AzorWeb - Fabrício Azor


Wednesday, June 15, 2011
