README

This PHP class converts your PDF files to HTML using poppler-utils.

Thanks

Special thanks to Mochamad Gufron (mgufrone)! This package is based on his package (https://github.com/mgufrone/pdf-to-html).

Important Notes

Please read the usage instructions below.

Installation

From your application's root directory, run this command to add the package to your app:

  composer require tonchik-tm/pdf-to-html:~1

Or add this package to the require section of your composer.json:

{
  "require": {
    "tonchik-tm/pdf-to-html": "~1"
  }
}

Requirements

1. Install Poppler-Utils

Debian/Ubuntu

sudo apt-get install poppler-utils

Mac OS X

brew install poppler

Windows

This package can also be used on Windows. First, download poppler-utils for Windows from http://blog.alivate.com.au/poppler-windows/ and get the latest binary release.

After downloading, extract the archive.

2. Find out where the utilities are located

Debian/Ubuntu

$ whereis pdftohtml
pdftohtml: /usr/bin/pdftohtml

$ whereis pdfinfo
pdfinfo: /usr/bin/pdfinfo

Mac OS X

$ which pdfinfo
/usr/local/bin/pdfinfo

$ which pdftohtml
/usr/local/bin/pdftohtml

Windows

Go to the extracted directory. It contains a directory named bin, which holds the executables we need.
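
Before passing these locations to the class, it can help to confirm that PHP can actually see and run the binaries. The following is a minimal sketch, not part of the package: findUnusableTools is a hypothetical helper, and the paths shown are the Debian/Ubuntu ones from above, so adjust them for your system.

```php
<?php
// Hypothetical helper (not part of the package): reports which tool paths
// are missing or not executable before they are passed to the Pdf class.
function findUnusableTools(array $paths): array
{
    $problems = [];
    foreach ($paths as $name => $path) {
        if (!is_file($path)) {
            $problems[$name] = "not found: $path";
        } elseif (!is_executable($path)) {
            $problems[$name] = "not executable: $path";
        }
    }
    return $problems;
}

// Assumed locations; replace with the output of whereis/which on your machine.
$problems = findUnusableTools([
    'pdftohtml_path' => '/usr/bin/pdftohtml',
    'pdfinfo_path'   => '/usr/bin/pdfinfo',
]);

foreach ($problems as $name => $message) {
    echo "$name $message", PHP_EOL;
}
```

An empty result means both paths point at executable files.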

3. PHP configuration: enable shell access
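
Shell access is usually blocked through the disable_functions directive in php.ini. The sketch below checks whether the common command-execution functions are callable. Which of them the package actually uses is not stated here, so the list is an assumption, and unavailableShellFunctions is a hypothetical helper.

```php
<?php
// Returns the subset of $functions that PHP cannot call, either because they
// are listed in the disable_functions ini directive or are otherwise undefined.
function unavailableShellFunctions(array $functions): array
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    $unavailable = [];
    foreach ($functions as $function) {
        if (in_array($function, $disabled, true) || !function_exists($function)) {
            $unavailable[] = $function;
        }
    }
    return $unavailable;
}

// Assumption: these are the usual candidates; the package may need only one of them.
$missing = unavailableShellFunctions(['exec', 'shell_exec', 'proc_open']);
echo $missing === []
    ? 'Shell functions are available' . PHP_EOL
    : 'Disabled: ' . implode(', ', $missing) . PHP_EOL;
```

If a function is reported as disabled, remove it from disable_functions in php.ini and restart PHP.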

Usage

Example

<?php
// if you are using composer, just use this
include 'vendor/autoload.php';

// initiate
$pdf = new \TonchikTm\PdfToHtml\Pdf('test.pdf', [
    'pdftohtml_path' => '/usr/bin/pdftohtml',
    'pdfinfo_path' => '/usr/bin/pdfinfo'
]);

// example for windows
// $pdf = new \TonchikTm\PdfToHtml\Pdf('test.pdf', [
//     'pdftohtml_path' => '/path/to/poppler/bin/pdftohtml.exe',
//     'pdfinfo_path' => '/path/to/poppler/bin/pdfinfo.exe'
// ]);

// get pdf info
$pdfInfo = $pdf->getInfo();

// get the number of pages
$countPages = $pdf->countPages();

// get content from one page
$contentFirstPage = $pdf->getHtml()->getPage(1);

// get content from all pages and loop over them
foreach ($pdf->getHtml()->getAllPages() as $page) {
    echo $page . '<br/>';
}
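
Instead of echoing each page, you may want to write the pages to disk. The sketch below assumes that each item returned by getAllPages() can be cast to a string, as the echo in the loop above suggests. savePages is a hypothetical helper, not part of the package.

```php
<?php
// Hypothetical helper: writes each page's HTML to page-1.html, page-2.html, ...
// $pages is any iterable of string-castable items, such as the result of getAllPages().
function savePages(iterable $pages, string $dir): array
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    $written = [];
    $number = 1;
    foreach ($pages as $html) {
        $file = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . "page-{$number}.html";
        file_put_contents($file, (string) $html);
        $written[] = $file;
        $number++;
    }
    return $written;
}

// Usage with the $pdf object from the example above:
// $files = savePages($pdf->getHtml()->getAllPages(), __DIR__ . '/output');
```

The function returns the list of files it wrote, in page order.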

Full list of settings

<?php

$full_settings = [
    'pdftohtml_path' => '/usr/bin/pdftohtml', // path to pdftohtml
    'pdfinfo_path' => '/usr/bin/pdfinfo', // path to pdfinfo

    'generate' => [ // settings for generating html
        'singlePage' => false, // we want separate pages
        'imageJpeg' => false, // we want png image
        'ignoreImages' => false, // we need images
        'zoom' => 1.5, // scale pdf
        'noFrames' => false, // we want separate pages
    ],

    'clearAfter' => true, // auto clear output dir (if removeOutputDir==false then output dir will remain)
    'removeOutputDir' => true, // remove output dir
    'outputDir' => '/tmp/'.uniqid(), // output dir

    'html' => [ // settings for processing html
        'inlineCss' => true, // replaces css classes with inline css rules
        'inlineImages' => true, // looks for images in html and replaces the src attribute with base64-encoded data
        'onlyContent' => true, // takes from html body content only
    ]
];
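
If you keep a shared set of defaults in your own code and only want to change a few nested values, PHP's array_replace_recursive is one way to combine them before passing the result to the constructor. This is a sketch under assumptions: $defaults copies only a subset of the settings above, and whether the package merges partial settings on its own is not stated here.

```php
<?php
// Assumed defaults for illustration, copied from a subset of the settings list.
$defaults = [
    'generate' => ['singlePage' => false, 'zoom' => 1.5],
    'html'     => ['inlineCss' => true, 'inlineImages' => true],
];

// Override only what differs; nested keys not mentioned keep their defaults.
$overrides = [
    'generate' => ['zoom' => 2.0],
    'html'     => ['inlineImages' => false],
];

$settings = array_replace_recursive($defaults, $overrides);
```

Here $settings keeps singlePage and inlineCss from the defaults while zoom and inlineImages take the overridden values.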

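Feedback & Contribute

Open an issue to suggest an improvement or report a bug. I am happy to help and to solve problems others run into. Thanks 👍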
