ssnepenthe/recipe-scraper

A recipe scraping library.

0.4.4 2020-05-21 04:13 UTC

This package is auto-updated.

Last update: 2024-09-06 03:51:44 UTC


README

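A library that makes it simple to scrape recipes from popular sites around the web.

Site support is still limited. The full list of supported sites is in SITE-SUPPORT.md.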

Requirements

Composer and PHP 7.0 or later.

Installation

composer require ssnepenthe/recipe-scraper

Usage

Scrapers operate on Symfony DomCrawler instances. You can create these however you like, but the simplest approach is to use a BrowserKit implementation such as Goutte:

$client = new Goutte\Client;
$crawler = $client->request('GET', 'http://allrecipes.com/recipe/139917/joses-shrimp-ceviche/');

If you only need to scrape recipes from a single site, use the corresponding class in src/Scrapers:

$scraper = new RecipeScraper\Scrapers\AllRecipesCom;

If you want to scrape recipes from any supported site, use the Factory class to create a DelegatingScraper:

$scraper = RecipeScraper\Factory::make();

Use the ->supports() method to check whether the scraper supports a given crawler:

$scraper->supports($crawler); // true

Finally, scrape the recipe by passing the crawler to the ->scrape() method:

$recipe = $scraper->scrape($crawler);

The following keys are guaranteed to be set on the $recipe array:

$recipe['author'] // string|null
$recipe['categories'] // string[]|null
$recipe['cookingMethod'] // string|null
$recipe['cookTime'] // string|null
$recipe['cuisines'] // string[]|null
$recipe['description'] // string|null
$recipe['image'] // string|null
$recipe['ingredients'] // string[]|null
$recipe['instructions'] // string[]|null
$recipe['name'] // string|null
$recipe['notes'] // string[]|null
$recipe['prepTime'] // string|null
$recipe['publisher'] // string|null
$recipe['totalTime'] // string|null
$recipe['url'] // string|null
$recipe['yield'] // string|null
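
Because the list-valued keys are typed string[]|null, looping over them directly requires a null check first. One way around that is sketched below. Note that normalizeRecipeLists() is a hypothetical helper and not part of this library, and the input array is stand-in data shaped like a scrape() result.

```php
<?php

// Hypothetical helper (not part of recipe-scraper): replaces null with []
// for the keys documented above as string[]|null.
function normalizeRecipeLists(array $recipe): array
{
    $listKeys = ['categories', 'cuisines', 'ingredients', 'instructions', 'notes'];

    foreach ($listKeys as $key) {
        // The ?? operator also covers a key that is missing entirely.
        $recipe[$key] = $recipe[$key] ?? [];
    }

    return $recipe;
}

// Stand-in data; a real array would come from $scraper->scrape($crawler).
$recipe = normalizeRecipeLists([
    'name' => "Jose's Shrimp Ceviche",
    'ingredients' => ['1 cup fresh lime juice'],
    'notes' => null,
]);

// Safe to iterate without a null check, even though notes was null.
foreach ($recipe['notes'] as $note) {
    echo $note, PHP_EOL;
}
```

Scalar keys such as name or cookTime are left untouched, so they can still be null after normalization.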

If the ->scrape() method is called with an unsupported crawler instance, every value in $recipe will be null.
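
If a call to ->supports() was skipped, an all-null result can still be detected after the fact. The following is a minimal sketch. It assumes the array shape described above, and isEmptyRecipe() is a hypothetical helper, not something the library provides.

```php
<?php

// Hypothetical helper (not part of recipe-scraper): returns true when
// every value in the recipe array is null.
function isEmptyRecipe(array $recipe): bool
{
    foreach ($recipe as $value) {
        if ($value !== null) {
            return false;
        }
    }

    return true;
}

// Stand-in arrays shaped like scrape() results.
$unsupported = ['name' => null, 'ingredients' => null, 'url' => null];
$supported = ['name' => "Jose's Shrimp Ceviche", 'ingredients' => null, 'url' => null];

var_dump(isEmptyRecipe($unsupported));
var_dump(isEmptyRecipe($supported));
```

Checking ->supports() before scraping remains the more direct approach. This check is only a fallback.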

All together

$scraper = RecipeScraper\Factory::make();
$client = new Goutte\Client;
$url = 'http://allrecipes.com/recipe/139917/joses-shrimp-ceviche/';
$crawler = $client->request('GET', $url);

if ($scraper->supports($crawler)) {
    var_dump($scraper->scrape($crawler));
} else {
    var_dump("{$url} not currently supported!");
}

Output

array(16) {
    'author' => string(9) "carrielee"
    'categories' => array(2) {
        [0] => string(21) "Appetizers and Snacks"
        [1] => string(5) "Spicy"
    }
    'cookingMethod' => NULL
    'cookTime' => string(5) "PT10M"
    'cuisines' => NULL
    'description' => string(336) ""I've looked all over the net and haven't found a shrimp ceviche quite like this one! My friends absolutely love it and beg me for the recipe! You can always double it for larger parties--it goes FAST! Serve as a dip with tortilla chips or as a topping on a tostada spread with mayo. The fearless palate might like this with hot sauce.""
    'image' => string(66) "https://images.media-allrecipes.com/userphotos/560x315/1364063.jpg"
    'ingredients' => array(9) {
        [0] => string(41) "1 pound peeled and deveined medium shrimp"
        [1] => string(22) "1 cup fresh lime juice"
        [2] => string(23) "10 plum tomatoes, diced"
        [3] => string(27) "1 large yellow onion, diced"
        [4] => string(49) "1 jalapeno pepper, seeded and minced, or to taste"
        [5] => string(28) "2 avocados, diced (optional)"
        [6] => string(31) "2 ribs celery, diced (optional)"
        [7] => string(31) "chopped fresh cilantro to taste"
        [8] => string(24) "salt and pepper to taste"
    }
    'instructions' => array(2) {
        [0] => string(294) "Place shrimp in a glass bowl and cover with lime juice to marinate (or 'cook') for about 10 minutes, or until they turn pink and opaque. Meanwhile, place the plum tomatoes, onion and jalapeno (and avocados and celery, if using) in a large, non-reactive (stainless steel, glass or plastic) bowl."
        [1] => string(200) "Remove shrimp from lime juice, reserving juice. Dice shrimp and add to the bowl of vegetables. Pour in the remaining lime juice marinade. Add cilantro and salt and pepper to taste. Toss gently to mix."
    }
    'name' => string(21) "Jose's Shrimp Ceviche"
    'notes' => NULL
    'prepTime' => string(5) "PT45M"
    'publisher' => NULL
    'totalTime' => string(5) "PT55M"
    'url' => string(62) "https://www.allrecipes.com/recipe/139917/joses-shrimp-ceviche/"
    'yield' => string(2) "20"
}
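
The time values in the sample output (PT10M, PT45M, PT55M) look like ISO 8601 durations, although that format is an inference from this one example rather than a documented guarantee. Assuming it holds, PHP's built-in DateInterval can turn them into minutes. The helper name durationToMinutes() is hypothetical.

```php
<?php

/**
 * Hypothetical helper (not part of recipe-scraper): converts an ISO 8601
 * duration such as "PT45M" to whole minutes.
 *
 * Returns null for null input or for a string DateInterval cannot parse.
 * Years, months, and seconds are ignored in this sketch.
 *
 * @param string|null $duration
 * @return int|null
 */
function durationToMinutes($duration)
{
    if ($duration === null) {
        return null;
    }

    try {
        $interval = new DateInterval($duration);
    } catch (Exception $e) {
        // Some sites may expose free-form text instead of a duration.
        return null;
    }

    return ($interval->d * 24 + $interval->h) * 60 + $interval->i;
}

var_dump(durationToMinutes('PT45M'));
var_dump(durationToMinutes('PT1H30M'));
```

Returning null on a parse failure keeps the helper consistent with the string|null typing of cookTime, prepTime, and totalTime.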

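Limitations

Scraping is (unfortunately) the best available option for extracting structured recipes from the sites this library supports.

Keep in mind that any template update on a target site is likely to break the scraper for that site.

If you notice that support for a particular site is broken, please open an issue or submit a pull request.

For the same reason, it may make more sense to track the master branch of this repository, although I will try to tag a new release after any site-specific update.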