inetprocess/neuralyzer

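This package is abandoned and no longer maintained. The author suggests using the edyan/neuralyzer package instead.

Data anonymization library and CLI tool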

v4.1 2021-05-13 18:16 UTC

README

edyan/neuralyzer

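Summary

This project is a library and a command-line tool that anonymizes a database by updating existing data or generating fake data (update vs. insert). It uses Faker to generate data according to rules defined in a configuration file.

Because it can work row by row or use a batch mechanism, you can load tables with tens of millions of fake records.

It uses Doctrine DBAL to abstract interactions with the database, so it should work with any database type. It has currently been tested with MySQL, PostgreSQL and SQL Server.

Neuralyzer used to have an option to clean tables by injecting a DELETE FROM with a WHERE condition before starting the anonymization (config parameters delete and delete_where). That option is deprecated; cleaning is now handled by pre- and post-actions: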

entities:
    books:
        cols:
            title: { method: sentence, params: [8], unique: true }
        action: update
        pre_actions:
            - db.query("DELETE FROM books")
        post_actions:
            - db.query("DELETE FROM books WHERE title LIKE '%war%'")

Install as a library

composer require edyan/neuralyzer

Install as an executable

You can also download the executable directly (v4.0 in this example):

$ wget https://github.com/edyan/neuralyzer/raw/v4.0/neuralyzer.phar
$ sudo mv neuralyzer.phar /usr/local/bin/neuralyzer
$ sudo chmod +x /usr/local/bin/neuralyzer
$ neuralyzer

Usage

The easiest way to use the tool is to start with the command-line tool. After cloning the project and running composer install, try:

$ bin/neuralyzer

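Generate the configuration automatically

Neuralyzer can read a database and generate the configuration for you. The config:generate command accepts the following options: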

Options:
    -D, --driver=DRIVER              Driver (check Doctrine documentation to have the list) [default: "pdo_mysql"]
    -H, --host=HOST                  Host [default: "127.0.0.1"]
    -d, --db=DB                      Database Name
    -u, --user=USER                  User Name [default: "www-data"]
    -p, --password=PASSWORD          Password (or it'll be prompted)
    -f, --file=FILE                  File [default: "neuralyzer.yml"]
        --protect                    Protect IDs and other fields
        --ignore-table=IGNORE-TABLE  Table to ignore. Can be repeated (multiple values allowed)
        --ignore-field=IGNORE-FIELD  Field to ignore. Regexp in the form "table.field". Can be repeated (multiple values allowed)

Example

bin/neuralyzer config:generate --db test_db -u root -p root --ignore-table config --ignore-field ".*\.id.*"

This produces a file that looks like this:

entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update # Will update existing data, "insert" would create new data
        pre_actions: {  }
        post_actions: {  }

    books:
        cols:
            name: { method: sentence, params: [8] }
            date_modified: { method: date, params: ['Y-m-d H:i:s', now] }
        action: update
        pre_actions: {  }
        post_actions: {  }

guesser: Edyan\Neuralyzer\Guesser
guesser_version: '3.0'
language: en_US

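You must edit the file to adjust its configuration. For example, if you need to delete data while anonymizing and change the language (see Faker's documentation for the available languages), do the following: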

# be careful that some languages have only a few methods.
# Example : https://github.com/FakerPHP/Faker/tree/v1.14.1/src/Faker/Provider/fr_FR
language: fr_FR

Info: you can also use a delete on its own, without anonymizing the entity. The following deletes everything in books:

entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update
    books:
        pre_actions:
            - db.query("DELETE FROM books")

If you want to delete everything and then insert 1000 new books:

guesser_version: '3.0'
entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update
    books:
        cols:
            name: { method: sentence, params: [8] }
        action: insert
        pre_actions:
            - db.query("DELETE FROM books")
        limit: 1000

Run the anonymizer

To run the anonymizer, the command is simply "run", and it expects:

Options:
    -D, --driver=DRIVER      Driver (check Doctrine documentation to have the list) [default: "pdo_mysql"]
    -H, --host=HOST          Host [default: "127.0.0.1"]
    -d, --db=DB              Database Name
    -u, --user=USER          User Name [default: "www-data"]
    -p, --password=PASSWORD  Password (or prompted)
    -c, --config=CONFIG      Configuration File [default: "neuralyzer.yml"]
    -t, --table=TABLE        Do a single table
        --pretend            Don't run the queries
    -s, --sql                Display the SQL

    -m, --mode=MODE          Set the mode : batch or queries [default: "batch"]

Example

bin/neuralyzer run --db test_db -u root -p root

This produces output of this kind:

Anonymizing authors
 2/2 [============================] 100%

Queries:
UPDATE authors SET first_name = 'Don', last_name = 'Wisoky' WHERE id = '1'
UPDATE authors SET first_name = 'Sasha', last_name = 'Denesik' WHERE id = '2'

....

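Warning: on a large table, --sql produces a huge amount of output. Use it for debugging only.

The library is designed to be integrated with any tool, such as the CLI tool. It contains:

  • A configuration reader and a configuration writer
  • A guesser
  • A database anonymizer

Guesser

The guesser is the core component of the config generator. It guesses which Faker method to apply from the field name or the field type.

Because it has to be injected into the writer, it is easy to extend.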
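The idea of guessing by name first and by type second can be sketched in a few lines. This is a standalone illustration only: the function guessFakerMethod and its mapping tables are hypothetical and are not Neuralyzer's Guesser class or API. The Faker method names it returns (firstName, lastName, email, sentence, date, randomNumber) are real Faker methods.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical illustration, NOT Neuralyzer's Guesser class:
 * pick a Faker method from the column name first, then fall back on its type.
 */
function guessFakerMethod(string $column, string $type): array
{
    // Name-based rules take priority: regex on the column name => Faker rule.
    $byName = [
        '/first_?name/i' => ['method' => 'firstName'],
        '/last_?name/i'  => ['method' => 'lastName'],
        '/e?mail/i'      => ['method' => 'email'],
    ];
    foreach ($byName as $regex => $rule) {
        if (preg_match($regex, $column) === 1) {
            return $rule;
        }
    }

    // Type-based fallback when no name rule matched.
    $byType = [
        'string'   => ['method' => 'sentence', 'params' => [8]],
        'datetime' => ['method' => 'date', 'params' => ['Y-m-d H:i:s', 'now']],
        'integer'  => ['method' => 'randomNumber', 'params' => [9]],
    ];

    // Unknown types get a generic sentence in this sketch.
    return $byType[$type] ?? ['method' => 'sentence', 'params' => [8]];
}

// Name rule wins even though the type is "string".
echo json_encode(guessFakerMethod('first_name', 'string')), PHP_EOL;
// No name rule matches, so the type decides.
echo json_encode(guessFakerMethod('date_modified', 'datetime')), PHP_EOL;
```

A real extension would presumably subclass or replace the guesser passed to the writer; check the Guesser class in the package for the methods it actually exposes before relying on this shape.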

Configuration Writer

The writer helps generate a YAML file containing all database tables and fields. Basic usage:

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => '127.0.0.1',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$writer = new \Edyan\Neuralyzer\Configuration\Writer;
$data = $writer->generateConfFromDB($dbUtils, new \Edyan\Neuralyzer\Guesser);
$writer->save($data, 'neuralyzer.yml');

If needed, you can protect some columns or tables with regular expressions.

<?php
// ...
$writer = new \Edyan\Neuralyzer\Configuration\Writer;
$writer->protectCols(true); // will protect primary keys
// define cols to protect (must be prefixed with the table name)
$writer->setProtectedCols([
    '.*\.id',
    '.*\..*_id',
    '.*\.date_modified',
    '.*\.date_entered',
    '.*\.date_created',
    '.*\.deleted',
]);
// Define tables to ignore, also with regexp
$writer->setIgnoredTables([
    'acl_.*',
    'config',
    'email_cache',
]);
// Write the configuration
$data = $writer->generateConfFromDB($dbUtils, new \Edyan\Neuralyzer\Guesser);
$writer->save($data, 'neuralyzer.yml');
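To check which columns a set of patterns would cover before generating a configuration, a small standalone test can help. The sketch below assumes each pattern must match the whole "table.column" string; the helper isProtected is hypothetical and not part of Neuralyzer, whose actual matching logic may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of Neuralyzer: returns true when
 * "table.column" fully matches at least one of the given patterns.
 * Patterns must not contain "/" because it is used as the regex delimiter here.
 */
function isProtected(string $table, string $column, array $patterns): bool
{
    $name = $table . '.' . $column;
    foreach ($patterns as $pattern) {
        // Anchor the pattern so it has to match the whole "table.column" string.
        if (preg_match('/^' . $pattern . '$/', $name) === 1) {
            return true;
        }
    }

    return false;
}

$patterns = ['.*\.id', '.*\..*_id', '.*\.date_modified'];

var_dump(isProtected('books', 'id', $patterns));        // bool(true)
var_dump(isProtected('books', 'author_id', $patterns)); // bool(true)
var_dump(isProtected('books', 'title', $patterns));     // bool(false)
```

Note that '.*\.id' alone does not cover author_id, because the literal dot must sit directly before "id"; the separate '.*\..*_id' pattern is what catches foreign keys.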

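Configuration Reader

The configuration reader is the exact opposite of the writer. Its main job is to validate that the YAML file's configuration is correct, then to provide methods for accessing its parameters. Example: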

<?php
require_once 'vendor/autoload.php';

// will throw an exception if it's not valid
$reader = new Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml');
$tables = $reader->getEntities();

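Database anonymizer

The only anonymizer currently available is the database one. It is built from an Expression utility and a configured DBUtils object (both taken from the container), and it receives a configuration reader through setConfiguration():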

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
$expression = $container->get('Edyan\Neuralyzer\Utils\Expression');
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => '127.0.0.1',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$db = new \Edyan\Neuralyzer\Anonymizer\DB($expression, $dbUtils);
$db->setConfiguration(
    new \Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml')
);

Once initialized, the method that anonymizes a table is:

<?php
public function processEntity(string $entity, callable $callback = null): array;

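Parameters

  • Entity: for example the table name (required)
  • Callback (callable, optional): for example to drive a progress bar

Some options can be set by calling the following methods: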

<?php
// Limit of fake generated records for updates and creates.
// Default : 0 = everything to update / nothing to insert
public function setLimit(int $limit);
// Don't do anything, default true
public function setPretend(bool $pretend);
// Return or not a result, default false
public function setReturnRes(bool $returnRes);

Full example

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
$expression = $container->get('Edyan\Neuralyzer\Utils\Expression');
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => 'mysql',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$reader = new \Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml');

$db = new \Edyan\Neuralyzer\Anonymizer\DB($expression, $dbUtils);
$db->setConfiguration($reader);
$db->setPretend(false);
// Get tables
$tables = $reader->getEntities();
foreach ($tables as $table) {
    $total = $dbUtils->countResults($table);

    if ($total === 0) {
        fwrite(STDOUT, "$table is empty" . PHP_EOL);
        continue;
    }
    $db->processEntity($table);

    fwrite(STDOUT, "$table anonymized" . PHP_EOL);
}

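Pre and post actions

You can set an array of pre_actions and post_actions; these actions are executed before and after Neuralyzer anonymizes an entity.

These actions are Symfony expressions (see Symfony Expression Language) that rely on services. The services are loaded from the Service/ directory.

Currently there is only one service, Database, which exposes a single method, query, used like this: db.query("DELETE FROM table")

Configuration reference

bin/neuralyzer config:example prints a default configuration in which every parameter is explained: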

config:

    # Set the guesser class
    guesser:              Edyan\Neuralyzer\Guesser

    # Set the version of the guesser the conf has been written with
    guesser_version:      '3.0'

    # Faker's language, make sure all your methods have a translation
    language:             en_US

    # List all entities, theirs cols and actions
    entities:             # Required, Example: people

        # Prototype
        -

            # Either "update" or "insert" data
            action:               update

            # Should we delete data with what is defined in "delete_where" ?
            delete:               ~ # Deprecated (delete and delete_where have been deprecated. Use now pre and post_actions)

            # Condition applied in a WHERE if delete is set to "true"
            delete_where:         ~ # Deprecated (delete and delete_where have been deprecated. Use now pre and post_actions), Example: '1 = 1'
            cols:

                # Examples:
                first_name:
                    method:              firstName
                last_name:
                    method:              lastName

                # Prototype
                -

                    # Faker method to use, see doc : https://fakerphp.github.io/
                    method:               ~ # Required

                    # Set this option to true to generate unique values for that field (see faker->unique() generator)
                    unique:               false

                    # Faker's parameters, see Faker's doc
                    params:               []

            # Limit the number of written records (update or insert). 100 by default for insert
            limit:                0

            # The list of expressions language actions to executed before neuralyzing. Be careful that "pretend" has no effect here.
            pre_actions:          []

            # The list of expressions language actions to executed after neuralyzing. Be careful that "pretend" has no effect here.
            post_actions:         []

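Custom application logic

When custom Doctrine types are used, Doctrine raises an error saying the type is unknown. This can be solved by providing a bootstrap file that registers the custom Doctrine type.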

bootstrap.php

<?php

require_once '../vendor/autoload.php';

\Doctrine\DBAL\Types\Type::addType('custom_type', 'Namespace\Of\The\Custom\Type');

Then pass the bootstrap file to the run command:

bin/neuralyzer run --db test_db -u root -p root -b bootstrap.php

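Development

Neuralyzer uses Robo to run its tests through Docker and to build its phar.

Clone the project, run composer install, then...

Run the tests

  • If you get many errors because the database is not ready yet, change the --wait option.
  • Change the --php option to 7.2 or 7.4.
  • To disable PHPUnit code coverage, pass --no-coverage.

With MySQL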

$ vendor/bin/robo test --php 7.2 --wait 10 --db mysql --db-version 5
$ vendor/bin/robo test --php 7.3 --wait 10 --db mysql --db-version 8
$ vendor/bin/robo test --php 7.4 --wait 10 --db mysql --db-version 8
$ vendor/bin/robo test --php 8.0 --wait 10 --db mysql --db-version 8

With PostgreSQL 9, 10 and 11 (12 is supported as well)

$ vendor/bin/robo test --php 7.2 --wait 10 --db pgsql --db-version 10
$ vendor/bin/robo test --php 7.3 --wait 10 --db pgsql --db-version 11
$ vendor/bin/robo test --php 7.4 --wait 10 --db pgsql --db-version 12
$ vendor/bin/robo test --php 8.0 --wait 10 --db pgsql --db-version 13

With SQL Server

Warning: 2 tests fail because of strange behavior in SQL Server ... or Doctrine / DBAL. PHPUnit cannot compare the 2 datasets because the fields are not in the same order.

$ vendor/bin/robo test --php 7.2 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 7.3 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 7.4 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 8.0 --wait 15 --db sqlsrv

Build a release (with a phar and a git tag)

$ php -d phar.readonly=0 vendor/bin/robo release

Build the phar only

$ php -d phar.readonly=0 vendor/bin/robo phar

Improve code quality with phpinsights

docker run -it --rm -v $(pwd):/app nunomaduro/phpinsights analyse --fix

Update dependencies to ensure compatibility with PHP 7.2

vendor/bin/robo composer:update