edyan/neuralyzer

Data anonymization library and CLI tool

v4.1 2021-05-13 18:16 UTC

Summary

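This project is a library and a command-line tool that anonymizes a database by updating existing data or by generating fake data (update vs. insert). It uses Faker to generate data from rules defined in a configuration file.

Because it can process rows one by one or use a batch mechanism, you can load tables with hundreds of millions of fake records.

It uses Doctrine DBAL to abstract database interactions, so it should work with any database type. So far it has been extensively tested with MySQL, PostgreSQL and SQL Server.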

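Neuralyzer used to have an option to clean tables before anonymizing, by injecting a DELETE FROM with a WHERE clause (the deprecated delete and delete_where configuration parameters). Cleaning is now handled by pre and post actions: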

entities:
    books:
        cols:
            title: { method: sentence, params: [8], unique: true }
        action: update
        pre_actions:
            - db.query("DELETE FROM books")
        post_actions:
            - db.query("DELETE FROM books WHERE title LIKE '%war%'")

Installation as a library

composer require edyan/neuralyzer

Installation as an executable

You can also download the executable directly (v4.0 in this example):

$ wget https://github.com/edyan/neuralyzer/raw/v4.0/neuralyzer.phar
$ sudo mv neuralyzer.phar /usr/local/bin/neuralyzer
$ sudo chmod +x /usr/local/bin/neuralyzer
$ neuralyzer

Usage

The easiest way to get started is with the command-line tool. After cloning the project and running composer install, try:

$ bin/neuralyzer

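Generate the configuration automatically

Neuralyzer can read a database and generate the configuration for you. The config:generate command accepts the following options: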

Options:
    -D, --driver=DRIVER              Driver (check Doctrine documentation to have the list) [default: "pdo_mysql"]
    -H, --host=HOST                  Host [default: "127.0.0.1"]
    -d, --db=DB                      Database Name
    -u, --user=USER                  User Name [default: "www-data"]
    -p, --password=PASSWORD          Password (or it'll be prompted)
    -f, --file=FILE                  File [default: "neuralyzer.yml"]
        --protect                    Protect IDs and other fields
        --ignore-table=IGNORE-TABLE  Table to ignore. Can be repeated (multiple values allowed)
        --ignore-field=IGNORE-FIELD  Field to ignore. Regexp in the form "table.field". Can be repeated (multiple values allowed)

Example

bin/neuralyzer config:generate --db test_db -u root -p root --ignore-table config --ignore-field ".*\.id.*"

This generates a file that looks like this:

entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update # Will update existing data, "insert" would create new data
        pre_actions: {  }
        post_actions: {  }

    books:
        cols:
            name: { method: sentence, params: [8] }
            date_modified: { method: date, params: ['Y-m-d H:i:s', now] }
        action: update
        pre_actions: {  }
        post_actions: {  }

guesser: Edyan\Neuralyzer\Guesser
guesser_version: '3.0'
language: en_US

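Edit the file to change its configuration, for example to delete data while anonymizing or to change the language (see Faker's documentation for the available languages). To change the language: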

# be careful that some languages have only a few methods.
# Example : https://github.com/FakerPHP/Faker/tree/v1.14.1/src/Faker/Provider/fr_FR
language: fr_FR

INFO: You can also use deletion alone on a table, without anonymizing it. Here, everything in books is removed:

entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update
    books:
        pre_actions:
            - db.query("DELETE FROM books")

To delete everything and then insert 1000 new books:

guesser_version: '3.0'
entities:
    authors:
        cols:
            first_name: { method: firstName, unique: false }
            last_name: { method: lastName, unique: false }
        action: update
    books:
        cols:
            name: { method: sentence, params: [8] }
        action: insert
        pre_actions:
            - db.query("DELETE FROM books")
        limit: 1000

Running the anonymizer

The command that runs the anonymizer is simply run. It accepts the following options:

Options:
    -D, --driver=DRIVER      Driver (check Doctrine documentation to have the list) [default: "pdo_mysql"]
    -H, --host=HOST          Host [default: "127.0.0.1"]
    -d, --db=DB              Database Name
    -u, --user=USER          User Name [default: "www-data"]
    -p, --password=PASSWORD  Password (or prompted)
    -c, --config=CONFIG      Configuration File [default: "neuralyzer.yml"]
    -t, --table=TABLE        Do a single table
        --pretend            Don't run the queries
    -s, --sql                Display the SQL

    -m, --mode=MODE          Set the mode : batch or queries [default: "batch"]

Example

bin/neuralyzer run --db test_db -u root -p root

This produces output like the following:

Anonymizing authors
 2/2 [============================] 100%

Queries:
UPDATE authors SET first_name = 'Don', last_name = 'Wisoky' WHERE id = '1'
UPDATE authors SET first_name = 'Sasha', last_name = 'Denesik' WHERE id = '2'

....

WARNING: On large tables, --sql produces a very large output. Use it for debugging only.

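The library is designed to be integrated into any tool, such as a CLI tool. It contains:

  • A configuration Reader and a configuration Writer
  • A Guesser
  • A database anonymizer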

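Guesser

The guesser is the central piece of the configuration generator. It guesses which Faker method to apply from the field name or the field type.

It is easy to extend, since it is injected into the Writer.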
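
To make the "name first, then type" idea concrete, here is a minimal standalone sketch. It is not the library's Guesser: the function guessFakerMethod and both mapping tables are hypothetical and only illustrate how a column name pattern can take precedence over a fallback based on the column type. The Faker method names used (email, firstName, lastName, sentence, randomNumber, date) are real Faker methods.

```php
<?php

// Simplified illustration of name-then-type guessing.
// This is NOT Neuralyzer's code; the mappings below are hypothetical examples.
function guessFakerMethod(string $colName, string $colType): array
{
    // Patterns on the column name are tried first.
    $byName = [
        '.*email.*'   => ['method' => 'email'],
        'first_?name' => ['method' => 'firstName'],
        'last_?name'  => ['method' => 'lastName'],
    ];
    // The column type is only used when no name pattern matches.
    $byType = [
        'string'   => ['method' => 'sentence', 'params' => [8]],
        'integer'  => ['method' => 'randomNumber', 'params' => [9]],
        'datetime' => ['method' => 'date', 'params' => ['Y-m-d H:i:s', 'now']],
    ];

    foreach ($byName as $pattern => $rule) {
        // Anchored, case-insensitive match on the whole column name.
        if (preg_match('/^(?:' . $pattern . ')$/i', $colName) === 1) {
            return $rule;
        }
    }

    if (isset($byType[$colType])) {
        return $byType[$colType];
    }

    throw new \InvalidArgumentException("No Faker method known for type '$colType'");
}

// A name match wins over the type; an unknown name falls back to the type.
echo json_encode(guessFakerMethod('contact_email', 'string')) . PHP_EOL;
echo json_encode(guessFakerMethod('title', 'string')) . PHP_EOL;
```

A real guesser would carry far larger tables; the point is only the lookup order, which explains why renaming a column can change the generated configuration.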

Configuration Writer

The writer helps generate a YAML file that lists all tables and fields. Basic usage could look like this:

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => '127.0.0.1',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$writer = new \Edyan\Neuralyzer\Configuration\Writer;
$data = $writer->generateConfFromDB($dbUtils, new \Edyan\Neuralyzer\Guesser);
$writer->save($data, 'neuralyzer.yml');

If needed, you can protect some columns (with regular expressions) or tables:

<?php
// ...
$writer = new \Edyan\Neuralyzer\Configuration\Writer;
$writer->protectCols(true); // will protect primary keys
// define cols to protect (must be prefixed with the table name)
$writer->setProtectedCols([
    '.*\.id',
    '.*\..*_id',
    '.*\.date_modified',
    '.*\.date_entered',
    '.*\.date_created',
    '.*\.deleted',
]);
// Define tables to ignore, also with regexp
$writer->setIgnoredTables([
    'acl_.*',
    'config',
    'email_cache',
]);
// Write the configuration
$data = $writer->generateConfFromDB($dbUtils, new \Edyan\Neuralyzer\Guesser);
$writer->save($data, 'neuralyzer.yml');
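
Before generating a configuration, it can help to check which columns a set of patterns would cover. The sketch below is a standalone helper, not part of Neuralyzer: isProtected is a hypothetical name, and it assumes each pattern must match the whole "table.column" name. If the library matches differently, the results for edge cases such as books.identifier may differ.

```php
<?php

// Hypothetical helper, not part of Neuralyzer.
// Assumption: a pattern has to match the whole "table.column" name.
function isProtected(string $qualifiedName, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        // '~' is the delimiter here, so patterns must not contain that character.
        if (preg_match('~^' . $pattern . '$~', $qualifiedName) === 1) {
            return true;
        }
    }

    return false;
}

// Two of the patterns used in the example above.
$patterns = ['.*\.id', '.*\..*_id'];

$columns = ['authors.id', 'books.author_id', 'books.identifier', 'books.title'];
foreach ($columns as $name) {
    $status = isProtected($name, $patterns) ? 'protected' : 'anonymized';
    echo $name . ': ' . $status . PHP_EOL;
}
```

Under that assumption, the first pattern only covers columns named exactly id, which is why a second pattern is needed for foreign keys ending in _id.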

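Configuration Reader

The configuration Reader is the exact opposite of the Writer. Its main job is to validate that the YAML configuration is correct, then to provide methods to access its parameters. Example: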

<?php
require_once 'vendor/autoload.php';

// will throw an exception if it's not valid
$reader = new Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml');
$tables = $reader->getEntities();

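Database anonymizer

The only anonymizer currently available is the database one. Its constructor expects an Expression utility and a configured DBUtils object, and the configuration Reader is passed through setConfiguration():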

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
$expression = $container->get('Edyan\Neuralyzer\Utils\Expression');
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => '127.0.0.1',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$db = new \Edyan\Neuralyzer\Anonymizer\DB($expression, $dbUtils);
$db->setConfiguration(
    new \Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml')
);

Once initialized, the method that anonymizes a table is:

<?php
public function processEntity(string $entity, callable $callback = null): array;

Parameters

  • Entity: for example a table name (required)
  • Callback (callable, optional): for example to drive a progress bar
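
As an illustration of the callback parameter, here is a small standalone sketch of a progress reporter. makeProgressCallback is a hypothetical helper, and it assumes the callback is invoked once per processed row. The closure declares no parameters, so it works whatever arguments it is called with.

```php
<?php

// Hypothetical helper: builds a callback that reports progress every $step calls.
// Assumption: the anonymizer invokes the callback once per processed row.
function makeProgressCallback(string $entity, int $step = 100): callable
{
    $count = 0;

    // Any arguments passed by the caller are ignored.
    return function () use (&$count, $entity, $step): int {
        $count++;
        if ($count % $step === 0) {
            // Progress goes to STDERR so it does not mix with regular output.
            fwrite(STDERR, "$entity: $count rows processed" . PHP_EOL);
        }

        return $count;
    };
}

// Standalone demonstration: simulate five processed rows.
$callback = makeProgressCallback('authors', 2);
for ($i = 0; $i < 5; $i++) {
    $callback($i);
}
```

With an initialized anonymizer, the callback would be passed as the second argument, for example $db->processEntity('authors', makeProgressCallback('authors')).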

Some options can be set by calling:

<?php
// Limit of fake generated records for updates and creates.
// Default : 0 = everything to update / nothing to insert
public function setLimit(int $limit);
// Don't do anything, default true
public function setPretend(bool $pretend);
// Return or not a result, default false
public function setReturnRes(bool $returnRes);

Full example

<?php

require_once 'vendor/autoload.php';

// Create a container
$container = Edyan\Neuralyzer\ContainerFactory::createContainer();
$expression = $container->get('Edyan\Neuralyzer\Utils\Expression');
// Configure DB Utils, required
$dbUtils = $container->get('Edyan\Neuralyzer\Utils\DBUtils');
// See Doctrine DBAL configuration :
// https://www.doctrine-project.org/projects/doctrine-dbal/en/2.7/reference/configuration.html
$dbUtils->configure([
    'driver' => 'pdo_mysql',
    'host' => 'mysql',
    'dbname' => 'test_db',
    'user' => 'root',
    'password' => 'root',
]);

$reader = new \Edyan\Neuralyzer\Configuration\Reader('neuralyzer.yml');

$db = new \Edyan\Neuralyzer\Anonymizer\DB($expression, $dbUtils);
$db->setConfiguration($reader);
$db->setPretend(false);
// Get tables
$tables = $reader->getEntities();
foreach ($tables as $table) {
    $total = $dbUtils->countResults($table);

    if ($total === 0) {
        fwrite(STDOUT, "$table is empty" . PHP_EOL);
        continue;
    }
    // Report success only once the entity has actually been processed
    $db->processEntity($table);
    fwrite(STDOUT, "$table anonymized" . PHP_EOL);
}

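Pre and post actions

You can set an array of pre_actions and post_actions. They are executed before and after an entity is anonymized.

These actions are Symfony expressions (see Symfony Expression Language) that rely on services. The services are loaded from the Service/ directory.

For now there is a single service, Database, which exposes one method, query: db.query("DELETE FROM table")

Configuration reference

bin/neuralyzer config:example prints a default configuration with every parameter explained: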

config:

    # Set the guesser class
    guesser:              Edyan\Neuralyzer\Guesser

    # Set the version of the guesser the conf has been written with
    guesser_version:      '3.0'

    # Faker's language, make sure all your methods have a translation
    language:             en_US

    # List all entities, theirs cols and actions
    entities:             # Required, Example: people

        # Prototype
        -

            # Either "update" or "insert" data
            action:               update

            # Should we delete data with what is defined in "delete_where" ?
            delete:               ~ # Deprecated (delete and delete_where have been deprecated. Use now pre and post_actions)

            # Condition applied in a WHERE if delete is set to "true"
            delete_where:         ~ # Deprecated (delete and delete_where have been deprecated. Use now pre and post_actions), Example: '1 = 1'
            cols:

                # Examples:
                first_name:
                    method:              firstName
                last_name:
                    method:              lastName

                # Prototype
                -

                    # Faker method to use, see doc : https://fakerphp.github.io/
                    method:               ~ # Required

                    # Set this option to true to generate unique values for that field (see faker->unique() generator)
                    unique:               false

                    # Faker's parameters, see Faker's doc
                    params:               []

            # Limit the number of written records (update or insert). 100 by default for insert
            limit:                0

            # The list of expressions language actions to executed before neuralyzing. Be careful that "pretend" has no effect here.
            pre_actions:          []

            # The list of expressions language actions to executed after neuralyzing. Be careful that "pretend" has no effect here.
            post_actions:         []

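Custom application logic

When custom Doctrine types are used, Doctrine raises an error saying that the type is unknown. This can be solved by providing a bootstrap file that registers the custom Doctrine type.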

bootstrap.php

<?php

require_once '../vendor/autoload.php';

\Doctrine\DBAL\Types\Type::addType('custom_type', 'Namespace\Of\The\Custom\Type');

Then pass the bootstrap file to the run command:

bin/neuralyzer run --db test_db -u root -p root -b bootstrap.php

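Development

Neuralyzer uses Robo to run its tests and build its phar through Docker.

Clone the project, run composer install, then...

Run the tests

  • If you get many errors because the database is not ready yet, increase the --wait option.
  • Change the --php option to use another PHP version, for example 7.2 or 7.4.
  • Set --no-coverage if you want to disable PHPUnit code coverage.

With MySQL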

$ vendor/bin/robo test --php 7.2 --wait 10 --db mysql --db-version 5
$ vendor/bin/robo test --php 7.3 --wait 10 --db mysql --db-version 8
$ vendor/bin/robo test --php 7.4 --wait 10 --db mysql --db-version 8
$ vendor/bin/robo test --php 8.0 --wait 10 --db mysql --db-version 8

With PostgreSQL 9, 10 and 11 (12 also works)

$ vendor/bin/robo test --php 7.2 --wait 10 --db pgsql --db-version 10
$ vendor/bin/robo test --php 7.3 --wait 10 --db pgsql --db-version 11
$ vendor/bin/robo test --php 7.4 --wait 10 --db pgsql --db-version 12
$ vendor/bin/robo test --php 8.0 --wait 10 --db pgsql --db-version 13

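With SQL Server

Warning: 2 tests fail because of odd behavior in SQL Server... or in Doctrine DBAL. PHPUnit cannot compare the two data sets because the fields come back in a different order.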

$ vendor/bin/robo test --php 7.2 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 7.3 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 7.4 --wait 15 --db sqlsrv
$ vendor/bin/robo test --php 8.0 --wait 15 --db sqlsrv

Build a release (phar and git tag included)

$ php -d phar.readonly=0 vendor/bin/robo release

Build the phar only

$ php -d phar.readonly=0 vendor/bin/robo phar

Improve code quality with phpinsights

docker run -it --rm -v $(pwd):/app nunomaduro/phpinsights analyse --fix

Update the dependencies while making sure they stay compatible with PHP 7.2

vendor/bin/robo composer:update