zeeml/dataset

A multi-purpose DataSet for training machine learning algorithms

dev-master 2017-08-29 14:41 UTC

This package is not auto-updated.

Last update: 2024-09-15 01:25:14 UTC



DataSet

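A multi-purpose DataSet for training machine learning algorithms.

Creating a DataSet

To create a DataSet for machine learning with Zeeml, you need to specify a source: either a CSV file or an array.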

Creating a DataSet from a CSV file

$dataSet = DataSetFactory::create('/path/to/csv', ['name', 'Gender'], ['Height']);

The keys set in the header (the first line of the CSV file) are used as the keys of the DataSet.
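
To illustrate what "header keys become DataSet keys" means, here is a minimal plain-PHP sketch that does not use Zeeml at all. The helper name `readCsvWithHeader` is hypothetical, and how `DataSetFactory` actually parses the file is not shown in this README.

```php
<?php
// Hypothetical helper, not part of Zeeml: reads a CSV file and uses the
// first row (the header) as the keys of every following row.
function readCsvWithHeader(string $path): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $header = fgetcsv($handle, 0, ',', '"', '\\');
    if ($header === false) {
        fclose($handle);
        throw new RuntimeException("No header row in $path");
    }

    $rows = [];
    while (($line = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($line === [null]) {
            continue; // fgetcsv returns [null] for a blank line
        }
        $rows[] = array_combine($header, $line);
    }
    fclose($handle);

    return $rows;
}

// Usage with a temporary file standing in for '/path/to/csv'.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "name,Gender,Height\nZac,Male,180\nEmily,Female,177\n");
$rows = readCsvWithHeader($path);
unlink($path);
```

Note that values read this way are strings (`'180'`, not `180`); whether the library casts them is not stated here.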

Creating a DataSet from an array

$dataSet =  DataSetFactory::create(
    [
        ['name' => 'Zac',    'gender' => 'Male',    'height' => 180],
        ['name' => 'Emily',  'gender' => 'Female',  'height' => 177],
        ['name' => 'Edward', 'gender' => 'Male',    'height' => 175],
        ['name' => 'Mark',   'gender' => 'Male',    'height' => 183],
        ['name' => 'Lesly',  'gender' => 'Female',  'height' => 170],
    ]
);

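Any other array format will throw an exception.

Specifying inputs and outputs

The prepare method must be called before any other method; otherwise an exception is thrown.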

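$mapper = new Mapper(['name', 'gender'], ['height']);
$dataSet->prepare($mapper);

where ['name', 'gender'] are the indexes used as inputs and ['height'] is the index used as output.

Any number of inputs and outputs can be selected from the entries.

An exception is thrown if a key does not exist.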
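
The idea of splitting each row into an input part and an output part, and failing on an unknown key, can be sketched in plain PHP as follows. This is an illustration only: `splitRows` is a hypothetical helper, and the exception type Zeeml actually throws is not documented in this README.

```php
<?php
// Hypothetical helper, not Zeeml's implementation: splits each row into an
// input part and an output part, given the keys to use for each.
function splitRows(array $rows, array $inputKeys, array $outputKeys): array
{
    $inputs = [];
    $outputs = [];
    foreach ($rows as $i => $row) {
        // Reject any key that the row does not contain.
        foreach (array_merge($inputKeys, $outputKeys) as $key) {
            if (!array_key_exists($key, $row)) {
                throw new InvalidArgumentException("Key '$key' not found in row $i");
            }
        }
        $inputs[]  = array_intersect_key($row, array_flip($inputKeys));
        $outputs[] = array_intersect_key($row, array_flip($outputKeys));
    }

    return [$inputs, $outputs];
}

$rows = [
    ['name' => 'Zac',   'gender' => 'Male',   'height' => 180],
    ['name' => 'Emily', 'gender' => 'Female', 'height' => 177],
];
[$inputs, $outputs] = splitRows($rows, ['name', 'gender'], ['height']);
```

Calling `splitRows($rows, ['name', 'gendre'], ['height'])` with a misspelled key would throw in this sketch.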

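Manipulating the DataSet

To manipulate and change the values of the DataSet (cleaning, renaming, etc.), you can apply a "policy".

Policies are specified when creating the Mapper. Multiple policies can be defined for each column.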

$dataSet = DataSetFactory::create(
      [
          [180, 'Male'],
          [177, 'Female'],
          [170, ''],
          [183, 'Male'],
      ]
);
$mapper = new Mapper(
    [
        0 => [Policy::replaceWithAvg(), Policy::rename('height')], 
    ], 
    [
        1 => [Policy::skip()]
    ]
);
$dataSet->prepare($mapper);

Supported policies

  • Policy::skip() : if the value at the corresponding index is empty (NULL, false, ''), the whole row is skipped.

    Example

    $data = [
        [1, 2, 3],
        [4, null, 5],
        [6, 7, null],
        [null, 8, 9],
    ];
    
    $dataSet =  DataSetFactory::create($data);
    $mapper = new Mapper([0, 1 => Policy::skip()], [2 => Policy::skip()]);
    $dataSet->prepare($mapper);
    
    This will use the following inputs/outputs:
    
    Inputs:                                             
    [                                                 
        [1, 2],                                        
        [null, 8], //No policy applied on 0           
    ]                                                
    
    Outputs:    
    [
        [3],
        [9],
    ]   
    
  • Policy::replaceWith() : if the value at the corresponding index is empty (NULL, false, ''), it is replaced with the given value.

    Example

    $data = [
        [1, 2, 3],
        [4, null, 5],
        [6, 7, null],
        [null, 8, 9],
    ];
    
    $dataSet =  DataSetFactory::create($data);
    $mapper = new Mapper([0, 1 => Policy::replaceWith('Unknown')], [2 => Policy::replaceWith(-1)]);
    $dataSet->prepare($mapper);
    
    This will use the following inputs/outputs:
    
    Inputs:                                           
    [                                                  
        [1, 2],                                          
        [4, 'Unknown'],                                  
        [6, 7],                                          
        [null, 8], //No policy applied on 0              
    ]                                                  
    
    Outputs:                                          
    [
        [3],
        [5], 
        [-1],
        [9]
    ] 
    
  • Policy::replaceWithAvg() : replaces empty values with the average of the column, computed from the original DataSet.

    Example

    $data = [
        [1, 2, 3],
        [4, null, 5],
        [6, 7, null],
        [null, 8, 9],
    ];
    
    $dataSet =  DataSetFactory::create($data);
    $mapper = new Mapper([0 => Policy::replaceWithAvg(), 1 => Policy::skip()], [2 => Policy::replaceWithAvg()]);
    $dataSet->prepare($mapper);
    
    This will use the following inputs/outputs:
    
    Inputs:
    [
        [1, 2],
        [6, 7],
        [2.75, 8], // Avg(0) = (1 + 4 + 6 + 0) / 4 = 2.75
    ]
    
    Outputs:
    [
        [3],
        [4.25], // Avg(2) = (3 + 5 + 0 + 9) / 4 = 4.25
        [9],
    ]
    
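  • Policy::replaceWithMostCommon() : replaces empty values with the most common value (the one that occurs most often). If several values have the same frequency, one of them is chosen at random.

    Example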

    $data = [
        [1, 2, 3],
        [1, null, 5],
        [6, 7, null],
        [null, 8, 9],
    ];
    
    $dataSet =  DataSetFactory::create($data);
    $mapper = new Mapper([0=> Policy::replaceWithMostCommon(), 1 => Policy::skip()], [2]);
    $dataSet->prepare($mapper);
    
    This will use the following inputs/outputs:
    
    Inputs:                                 
    [                                        
        [1, 2],                               
        [6, 7],                               
        [1, 8],                                
    ]                                      
    
    Outputs:
    [
        [3],
        [null],
        [9],
    ]
    
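  • Policy::custom() : create your own policy.

    The callable is invoked only when the value is empty. The callable must:

    • accept, by reference, the column's value for the current iteration
    • accept the row as a second argument
    • return true to keep the row, or false to skip it

    Example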

    $data = [
        [180, 'Male'],
        [177, 'Female'],
        [170, ''],
        [183, 'Male'],
    ];
    
    $dataSet =  DataSetFactory::create($data);
    
    $genderCleaner = function(&$value, $line) {
        if ($line[0] > 175) {
            $value = 'Male' ;
        } else {
            $value = 'Female';
        }
        
        return true;
    };
    
    $mapper = new Mapper([0], [1 => Policy::custom($genderCleaner)]);
    $dataSet->prepare($mapper);
    
    This will use the following inputs/outputs:
    
    Inputs:                                 
    [                                        
        [180],                               
        [177],                               
        [170],                                
        [183],                                
    ]                                      
    
    Outputs:
    [
        ['Male'],                               
        ['Female'],                               
        ['Female'],                                
        ['Male'],
    ]
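
The empty-value rules described in the policy list above can be sketched in plain PHP without the library. This is only an illustration under stated assumptions: every function and variable name below is hypothetical, the average counts empty values as 0 (the convention shown in the Avg(0) comment), and ties for the most common value go to the first value seen rather than a random one.

```php
<?php
// Plain-PHP sketch of the policy rules; none of these names are Zeeml's API.

// A value counts as empty when it is null, false or ''.
function isEmptyValue($value): bool
{
    return $value === null || $value === false || $value === '';
}

// Average of a column over the original rows; empty values count as 0.
function columnAverage(array $rows, $column): float
{
    $sum = 0;
    foreach ($rows as $row) {
        $sum += isEmptyValue($row[$column]) ? 0 : $row[$column];
    }
    return $sum / count($rows);
}

// Most frequent non-empty value of a column (int or string values only).
// Assumes at least one non-empty value; the first value seen wins a tie.
function columnMostCommon(array $rows, $column)
{
    $values = [];
    foreach ($rows as $row) {
        if (!isEmptyValue($row[$column])) {
            $values[] = $row[$column];
        }
    }
    $counts = array_count_values($values);
    return array_search(max($counts), $counts);
}

// Applies one rule per column. A rule is a callable that receives the value
// by reference plus the original row, and returns true to keep the row or
// false to skip it. Rules only fire on empty values.
function applyRules(array $rows, array $rules): array
{
    $kept = [];
    foreach ($rows as $row) {
        $original = $row;
        $keep = true;
        foreach ($rules as $column => $rule) {
            if (isEmptyValue($row[$column]) && !$rule($row[$column], $original)) {
                $keep = false;
                break;
            }
        }
        if ($keep) {
            $kept[] = $row;
        }
    }
    return $kept;
}

// Rule that drops the row.
$skip = function (&$value, array $line): bool {
    return false;
};

// Rule factory that fills in a fixed replacement.
$replaceWith = function ($replacement): callable {
    return function (&$value, array $line) use ($replacement): bool {
        $value = $replacement;
        return true;
    };
};

$data = [
    [1, 2, 3],
    [4, null, 5],
    [6, 7, null],
    [null, 8, 9],
];

$rows = applyRules($data, [
    0 => $replaceWith(columnAverage($data, 0)),
    1 => $skip,
    2 => $replaceWith(columnAverage($data, 2)),
]);
// Under this sketch's convention, $rows is
// [[1, 2, 3], [6, 7, 4.25], [2.75, 8, 9]]
```

Modelling every rule as a "value by reference, row, return bool" callable mirrors the contract given for Policy::custom(), which is why skipping and replacing can share one loop in this sketch.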
    

Renaming DataSet keys

You can rename the keys of the DataSet.

$data = [
    ['Zac',    'Male',    180],
    ['Emily',  'Female',  177],
    ['Edward', 'Male',    175],
    ['Mark',   'Male',    183],
    ['Lesly',  'Female',  170],
];    
    
$dataSet =  DataSetFactory::create($data);
    
$mapper = new Mapper([0, 1], [2]);
$dataSet->prepare($mapper);
    
$dataSet->rename([0 => 'Name', 1 => 'Gender', 2 => 'Height']);

The input/output matrices used are:

Inputs :
[
    ['Name' => 'Zac',    'Gender' => 'Male'],
    ['Name' => 'Emily',  'Gender' => 'Female'],
    ['Name' => 'Edward', 'Gender' => 'Male'],
    ['Name' => 'Mark',   'Gender' => 'Male'],
    ['Name' => 'Lesly',  'Gender' => 'Female'],
]

Outputs :
[
    ['Height' => 180],
    ['Height' => 177],
    ['Height' => 175],
    ['Height' => 183],
    ['Height' => 170],
]
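
The effect of the rename step on each row can be sketched in plain PHP. `renameKeys` is a hypothetical helper written for illustration; it is not how `$dataSet->rename()` is implemented, and keeping unmapped keys unchanged is an assumption of this sketch.

```php
<?php
// Hypothetical helper, not Zeeml's implementation: renames the keys of every
// row according to an [oldKey => newKey] map; keys missing from the map are kept.
function renameKeys(array $rows, array $map): array
{
    $renamed = [];
    foreach ($rows as $row) {
        $newRow = [];
        foreach ($row as $key => $value) {
            $newRow[$map[$key] ?? $key] = $value;
        }
        $renamed[] = $newRow;
    }

    return $renamed;
}

$inputs = [
    ['Zac',   'Male'],
    ['Emily', 'Female'],
];
$renamed = renameKeys($inputs, [0 => 'Name', 1 => 'Gender']);
// $renamed[0] is ['Name' => 'Zac', 'Gender' => 'Male']
```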