README

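This is a helper component that simplifies the creation of custom website scrapers. It implements some basic logic for iterating over listing pages and downloading items. To use this component, you must first implement your own ItemListDownloader (by extending AbstractItemListDownloader) and ItemDownloader (by extending AbstractItemDownloader or AbstractJsonItemDownloader) to suit your particular website.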

Installation

This package requires php >= 7.4. To install the component, use composer:

composer require unique/scraper

Usage

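To use this component, you must first implement your own ItemListDownloader and ItemDownloader for your particular website, because most scraping (at least in my own use) consists of iterating over a list and scraping the items in it.
I may extend the component someday as new needs arise, but for now all scrapers follow this same approach.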

Suppose we have a classified ads website with a listing of ads. The listing is split into pages of 20 ads each, and we need to scrape all of the ads.

We start by creating a class that represents a scraped ad. It must implement SiteItemInterface.

    class SiteItem implements \unique\scraper\interfaces\SiteItemInterface {
        
        protected $id;
        protected $url;
        protected $title;
        
        // @todo: implement setter and getters for $id, $url, $title
    }
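The @todo above could be filled in roughly as follows. This is only a sketch: the accessor names (getId(), setId(), and so on) are assumptions, because this README does not list the methods that SiteItemInterface actually requires. For the same reason the class below is a stand-in named SiteItemSketch and leaves out the `implements` clause so that it runs on its own.

```php
<?php

// Hypothetical stand-in for SiteItem; accessor names are assumptions,
// check SiteItemInterface for the methods it really requires.
class SiteItemSketch {

    protected ?string $id = null;
    protected ?string $url = null;
    protected ?string $title = null;

    public function getId(): ?string {
        return $this->id;
    }

    public function setId( ?string $id ): void {
        $this->id = $id;
    }

    public function getUrl(): ?string {
        return $this->url;
    }

    public function setUrl( ?string $url ): void {
        $this->url = $url;
    }

    public function getTitle(): ?string {
        return $this->title;
    }

    public function setTitle( ?string $title ): void {
        // Scraped text is often surrounded by whitespace, so trim it here.
        $this->title = $title === null ? null : trim( $title );
    }
}

$item = new SiteItemSketch();
$item->setId( '15' );
$item->setTitle( '  Red bicycle  ' );
echo $item->getTitle(), PHP_EOL;
```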

Then we implement the ItemListDownloader:

    class ItemListDownloader extends \unique\scraper\AbstractItemListDownloader {
        
        protected function getNumberOfItemsInPage( \Symfony\Component\DomCrawler\Crawler $doc ): ?int {

            // Or we could implement some logic of checking the website for the actual number.
            return 20;
        }

        protected function hasNextPage( \Symfony\Component\DomCrawler\Crawler $doc, int $current_page_num ): bool {

            // We could implement some logic of checking the page's paginator,
            // or we can just return true and let the scraper go through all of the listing
            // pages until it finds one, that has no items in it. It will then stop automatically.
            
            return true;
        }

        function getListUrl( ?int $page_num ): string {

            return 'https://some.website.here/?page_num=' . $page_num;
        }

        function getTotalItems( \Symfony\Component\DomCrawler\Crawler $doc ): ?int {

            // If possible, we could find the total number of items (that's in all of the listing pages)
            return null;
        }

        function getItems( \Symfony\Component\DomCrawler\Crawler $doc ): iterable {

            // We define a selector, where each item will be a unique ad.
            // The scraper will iterate these items and get all of them.
            // It doesn't need to be <a> tag, you define your own logic of how to get
            // to the actual item page.
            
            return $doc->filter( 'a.ad-item' );
        }

        function getItemUrl( \DOMElement $item ): ?string {

            // Here, $item is the item from the getItems() method,
            // we analyze it and return the url for scraping the item itself.
            return $item->getAttribute( 'href' );
        }

        function getItemId( string $url, \DOMElement $item ): string {

            // We return some string by which we can uniquely identify the ad.
            // This can later be used to skip the ads, that we already have in DB, for example.
            return $item->getAttribute( 'data-id' );
        }

        function getItemDownloader( string $url, string $id ): ?\unique\scraper\AbstractItemDownloader {

            return new ItemDownloader( 'https://some.website.here/' . $url, $id, $this, new SiteItem() );
        }
    }
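In the example above, getItemUrl() returns the raw href and getItemDownloader() prepends the site address by plain concatenation. If a site mixes absolute and relative links, a small helper can normalize them first. The function resolveItemUrl() below is hypothetical and not part of the library; it is a minimal sketch that treats every relative link as relative to the site root.

```php
<?php

// Hypothetical helper, not part of unique/scraper.
// Turns an href taken from a listing page into an absolute URL.
function resolveItemUrl( string $base, string $href ): string {

    // Already absolute: return as is.
    if ( preg_match( '#^https?://#i', $href ) === 1 ) {
        return $href;
    }

    // Protocol-relative link (//host/path): reuse the scheme of the base URL.
    if ( strpos( $href, '//' ) === 0 ) {
        $scheme = parse_url( $base, PHP_URL_SCHEME ) ?: 'https';
        return $scheme . ':' . $href;
    }

    // Anything else is treated as a path under the base URL.
    return rtrim( $base, '/' ) . '/' . ltrim( $href, '/' );
}

echo resolveItemUrl( 'https://some.website.here/', '/ads/15' ), PHP_EOL;
```

A getItemDownloader() implementation could call such a helper instead of concatenating strings directly.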

Then we create the downloader for the ad itself:

    class ItemDownloader extends \unique\scraper\AbstractItemDownloader {
        
        protected function assignItemData( \Symfony\Component\DomCrawler\Crawler $doc ) {

            // We set all the attributes we need for our custom SiteItem object,
            // which can be accessed by the $this->item attribute.
            $this->item->setTitle( $doc->filter( 'h1' )->text() );
        }
    }

Alternatively, if you receive the ad data as JSON, you can extend AbstractJsonItemDownloader.

    class ItemDownloader extends \unique\scraper\AbstractJsonItemDownloader {

        protected function assignItemData( array $json ) {

            // We set all the attributes we need for our custom SiteItem object,
            // which can be accessed by the $this->item attribute.
            $this->item->setTitle( $json['title'] );
        }
    }
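The JSON variant above reads `$json['title']` directly, which raises a notice (a warning in PHP 8) if the key is missing. A defensive accessor is one way to avoid that. The helper jsonString() is hypothetical and not part of the library, and the sample payload is made up for illustration.

```php
<?php

// Hypothetical helper, not part of unique/scraper.
// Returns a trimmed string for a scalar value under $key, or null otherwise.
function jsonString( array $json, string $key ): ?string {

    // Missing keys, nulls, arrays and objects all yield null.
    if ( !isset( $json[ $key ] ) || !is_scalar( $json[ $key ] ) ) {
        return null;
    }

    $value = trim( (string) $json[ $key ] );

    // Treat an empty string the same as a missing value.
    return $value === '' ? null : $value;
}

// Made-up payload, decoded to an associative array as assignItemData() expects.
$json = json_decode( '{"title": "  Red bicycle ", "price": 120, "tags": ["a"]}', true );

var_dump( jsonString( $json, 'title' ) );
```

Inside assignItemData( array $json ) the call would then look like `$this->item->setTitle( jsonString( $json, 'title' ) )`, assuming your setter accepts null.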

That completes the scraping logic. All that remains is to create a sample command script
that starts the scrape.

    class ScraperController implements \unique\scraper\interfaces\ConsoleInterface {
        
        // @todo implement stdOut() and stdErr() methods for logging.
        
        public function actionRun() {
            
            $transport = new \GuzzleHttp\Client();
            $log_container = new LogContainerConsole( $this );
            $downloader = new ItemListDownloader( SiteItem::class, $transport, $log_container );

            $downloader->on( \unique\scraper\AbstractItemListDownloader::EVENT_ON_ITEM_END, function ( \unique\scraper\events\ItemEndEvent $event ) {
                
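                // getSiteItem() returns null if the item could not be scraped.
                $site_item = $event->getSiteItem();
                if ( $site_item ) {

                    // save() is assumed to be a persistence method you add to SiteItem.
                    $site_item->save();
                }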
            } );

            $downloader->scrape();
        }
    }

You can use the optional LogContainerConsole to log to the console. It relies on two methods, stdOut() and stdErr(), which you need to implement yourself.
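The two logging methods could be implemented along these lines. The signatures used here (a single string argument, no return value) are an assumption, since this README does not show how ConsoleInterface declares them. The class ConsoleWriterSketch is a stand-in that does not implement the interface, so that it runs on its own.

```php
<?php

// Hypothetical stand-in; the real ConsoleInterface signatures may differ.
class ConsoleWriterSketch {

    /** @var resource */
    private $out;

    /** @var resource */
    private $err;

    // Streams are injectable so the output can be redirected, e.g. to a log file.
    public function __construct( $out = null, $err = null ) {
        $this->out = $out ?? fopen( 'php://stdout', 'w' );
        $this->err = $err ?? fopen( 'php://stderr', 'w' );
    }

    public function stdOut( string $text ): void {
        fwrite( $this->out, $text );
    }

    public function stdErr( string $text ): void {
        fwrite( $this->err, $text );
    }
}

$console = new ConsoleWriterSketch();
$console->stdOut( "Scraping started\n" );
```

In ScraperController the same two method bodies could be placed directly in the class.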

Documentation

Events

You can subscribe to the various events triggered by AbstractItemListDownloader using the on( string $event_name, callable $handler ) method. Each handler receives an EventObject whose class depends on the event type.
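To make the subscription pattern concrete, here is a stand-in emitter that mimics the on() call. It is not the library's implementation: the class EventEmitterSketch, its trigger() method, and the page_num property on the event are all hypothetical and exist only for this illustration.

```php
<?php

// Hypothetical stand-in that mimics on(); not the library's implementation.
class EventEmitterSketch {

    /** @var array<string, callable[]> */
    private array $handlers = [];

    public function on( string $event_name, callable $handler ): void {
        // Several handlers may be registered for the same event name.
        $this->handlers[ $event_name ][] = $handler;
    }

    public function trigger( string $event_name, object $event ): void {
        // Call handlers in the order they were registered.
        foreach ( $this->handlers[ $event_name ] ?? [] as $handler ) {
            $handler( $event );
        }
    }
}

$emitter = new EventEmitterSketch();
$seen = [];

$emitter->on( 'on_list_begin', function ( object $event ) use ( &$seen ) {
    $seen[] = $event->page_num;
} );

$emitter->trigger( 'on_list_begin', (object) [ 'page_num' => 1 ] );
$emitter->trigger( 'on_list_end', (object) [ 'page_num' => 1 ] ); // no handler, ignored
```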

`on_list_begin`

The event object will be a ListBeginEvent. This is a "breakable" event (read on for more information). Methods:

getPageNum(): int returns the page number.

`on_list_end`

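The event object will be a ListEndEvent. Methods:

getItemCount(): ItemCount returns details about the page number, page size and total number of items.
willContinue(): bool returns true if the scraper will continue to the next page.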

`on_item_begin`

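The event object will be an ItemBeginEvent. This is a "breakable" event (read on for more information). Methods:

getId(): string returns the item's id.
getUrl(): string returns the item's url.
getDomElement(): \DOMElement returns the corresponding \DOMElement.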

`on_item_end`

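The event object will be an ItemEndEvent. Methods:

getItemCount(): ItemCount returns details about the page number, page size and total number of items.
getState(): int returns one of the state constants found in AbstractItemListDownloader::STATE_*.
getSiteItem(): ?SiteItemInterface provides the scraped item's data if no errors were encountered.
getDomElement(): \DOMElement returns the corresponding \DOMElement.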

`on_item_missing_url`

The event object will be an ItemMissingUrlEvent. Methods:

getUrl(): ?string returns the item's url.
setUrl( ?string $url ) allows a handler to set a new url.
getDomElement(): \DOMElement returns the corresponding \DOMElement.
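A handler for `on_item_missing_url` might try to recover a url from another attribute of the element. The sketch below uses a stand-in event class, MissingUrlEventSketch, with the three methods listed above; the `data-href` attribute and the sample markup are made up. How the library uses the url after the handler returns is not described here, so treat this as an illustration of the handler only.

```php
<?php

// Hypothetical stand-in exposing the three methods documented for ItemMissingUrlEvent.
class MissingUrlEventSketch {

    private ?string $url = null;
    private \DOMElement $element;

    public function __construct( \DOMElement $element ) {
        $this->element = $element;
    }

    public function getUrl(): ?string {
        return $this->url;
    }

    public function setUrl( ?string $url ): void {
        $this->url = $url;
    }

    public function getDomElement(): \DOMElement {
        return $this->element;
    }
}

// Handler: fall back to a (made-up) data-href attribute when no url was found.
$handler = function ( $event ): void {
    // getAttribute() returns an empty string when the attribute is absent.
    $fallback = $event->getDomElement()->getAttribute( 'data-href' );
    if ( $fallback !== '' ) {
        $event->setUrl( $fallback );
    }
};

$dom = new \DOMDocument();
$dom->loadHTML( '<div class="ad-item" data-href="/ads/15">Ad</div>' );
$element = $dom->getElementsByTagName( 'div' )->item( 0 );

$event = new MissingUrlEventSketch( $element );
$handler( $event );
```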

`on_break_list`

The event object will be a BreakListEvent. Methods:

getCausingEvent(): ?EventObjectInterface returns the event object that requested breaking off the list scraping.

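Breakable events

These are events that implement BreakableEventInterface. They can instruct the scraper to abort processing of the current item
or to terminate scraping of the entire list. In PHP terms, these correspond to continue and break inside a while loop.
Breakable event objects therefore implement the following methods:

shouldSkip(): bool - returns true if the list item should be skipped.
shouldBreak(): bool - returns true if scraping of the list should be terminated.
continue() - instructs the scraper to continue processing the current item.
skip() - instructs the scraper to skip the current item but continue processing the list.
break() - instructs the scraper to terminate the list and stop scraping.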
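The skip and break semantics can be sketched with a plain loop. Everything here is a stand-in: BreakableEventSketch only mirrors the five methods listed above, and processIds() is a hypothetical loop, not the library's scraping code. The sketch uses skip() to pass over an id that is already known and break() to stop at a sentinel value.

```php
<?php

// Hypothetical stand-in mirroring the methods of a breakable event.
class BreakableEventSketch {

    private bool $skip = false;
    private bool $break = false;

    public function shouldSkip(): bool {
        return $this->skip;
    }

    public function shouldBreak(): bool {
        return $this->break;
    }

    public function continue(): void {
        $this->skip = false;
        $this->break = false;
    }

    public function skip(): void {
        $this->skip = true;
        $this->break = false;
    }

    public function break(): void {
        $this->break = true;
    }
}

// Hypothetical loop: asks the handler about each id before processing it.
function processIds( array $ids, callable $on_item_begin ): array {

    $processed = [];
    foreach ( $ids as $id ) {

        $event = new BreakableEventSketch();
        $on_item_begin( $id, $event );

        if ( $event->shouldBreak() ) {
            break;      // stop the whole list
        }
        if ( $event->shouldSkip() ) {
            continue;   // skip this item only
        }
        $processed[] = $id;
    }

    return $processed;
}

$known = [ 'b' ];
$result = processIds( [ 'a', 'b', 'c', 'stop', 'd' ], function ( string $id, BreakableEventSketch $event ) use ( $known ) {

    if ( $id === 'stop' ) {
        $event->break();
    } elseif ( in_array( $id, $known, true ) ) {
        $event->skip();
    }
} );
```

With the real library, the same decision would be made inside an `on_item_begin` handler, for example by checking getId() against ids already stored in your database.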
