zeeml / dataset
A versatile DataSet for training machine learning algorithms
dev-master
2017-08-29 14:41 UTC
Requires
- php: >=7.0.0
- league/csv: 8.2.1
Requires (Dev)
- phpunit/phpunit: 6.0.*
This package is not auto-updated.
Last update: 2024-09-15 01:25:14 UTC
README
DataSet
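A versatile DataSet for training machine learning algorithms.
Creating a DataSet
To create a DataSet for machine learning with Zeeml, you need to specify a source: either a CSV file or an array.
Creating a DataSet from a CSV file
$dataSet = DataSetFactory::create('/path/to/csv', ['name', 'Gender'], ['Height']);
The keys set in the header (the first line of the CSV file) are used as the keys of the DataSet.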
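To make the header-to-key mapping concrete, here is a minimal sketch in plain PHP that does not use the Zeeml API. The `rowsFromCsv` helper and the sample CSV text are hypothetical and exist only for illustration; the package itself depends on league/csv and may handle parsing differently.

```php
<?php
// Hypothetical helper, not part of zeeml/dataset: turns CSV text into rows
// keyed by the header line, mirroring the behaviour described above.
function rowsFromCsv(string $csv): array
{
    // Split into trimmed, non-empty lines.
    $lines = array_values(array_filter(array_map('trim', explode("\n", $csv)), 'strlen'));

    // The first line holds the column names.
    $header = str_getcsv(array_shift($lines), ',', '"', '\\');

    $rows = [];
    foreach ($lines as $line) {
        // Pair each value with the column name taken from the header.
        $rows[] = array_combine($header, str_getcsv($line, ',', '"', '\\'));
    }

    return $rows;
}

$rows = rowsFromCsv("name,gender,height\nZac,Male,180\nEmily,Female,177\n");
print_r($rows[0]);
```

Note that values parsed this way are strings (for example `'180'`), so any numeric conversion would be a separate step.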
Creating a DataSet from an array
$dataSet = DataSetFactory::create(
[
['name' => 'Zac', 'gender' => 'Male', 'height' => 180],
['name' => 'Emily', 'gender' => 'Female', 'height' => 177],
['name' => 'Edward', 'gender' => 'Male', 'height' => 175],
['name' => 'Mark', 'gender' => 'Male', 'height' => 183],
['name' => 'Lesly', 'gender' => 'Female', 'height' => 170],
]
);
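Any other array format will throw an exception.
Specifying inputs and outputs
The prepare method must be called before any other method; otherwise an exception is thrown.
$mapper = new Mapper(['name', 'gender'], ['height']);
$dataSet->prepare($mapper);
Here ['name', 'gender'] are the indexes used as inputs and ['height'] is the index used as output.
You can select any number of inputs and outputs from the entries.
If a key does not exist, an exception is thrown.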
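The split into inputs and outputs can be sketched in plain PHP as follows. This is an illustration under assumptions, not the library's implementation: the `splitRows` helper and the choice of `InvalidArgumentException` are hypothetical.

```php
<?php
// Hypothetical helper, not the library's code: splits rows into input and
// output matrices by key and fails when a requested key is missing.
function splitRows(array $rows, array $inputKeys, array $outputKeys): array
{
    $inputs = [];
    $outputs = [];

    foreach ($rows as $row) {
        foreach (array_merge($inputKeys, $outputKeys) as $key) {
            if (!array_key_exists($key, $row)) {
                throw new InvalidArgumentException("Unknown key: $key");
            }
        }
        // array_flip turns the key list into a lookup for array_intersect_key.
        $inputs[] = array_intersect_key($row, array_flip($inputKeys));
        $outputs[] = array_intersect_key($row, array_flip($outputKeys));
    }

    return [$inputs, $outputs];
}

$rows = [
    ['name' => 'Zac', 'gender' => 'Male', 'height' => 180],
    ['name' => 'Emily', 'gender' => 'Female', 'height' => 177],
];
list($inputs, $outputs) = splitRows($rows, ['name', 'gender'], ['height']);

// A misspelled key is rejected instead of being silently ignored.
try {
    splitRows($rows, ['name', 'gendre'], ['height']);
    $failed = false;
} catch (InvalidArgumentException $e) {
    $failed = true;
}
```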
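Manipulating the DataSet
To manipulate and change the values of the DataSet (cleaning, renaming, etc.), you can apply a "policy".
Policies are invoked when the Mapper is created. Several policies can be defined for each column.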
$dataSet = DataSetFactory::create(
[
[180, 'Male'],
[177, 'Female'],
[170, ''],
[183, 'Male'],
]
);
$mapper = new Mapper(
[
0 => [Policy::replaceWithAvg(), Policy::rename('height')],
],
[
1 => [Policy::skip()]
]
);
$dataSet->prepare($mapper);
### Supported policies
- Policy::skip(): if the value at the corresponding index is empty (NULL, false, ''), the whole row is skipped.
Example
$data = [
    [1, 2, 3],
    [4, null, 5],
    [6, 7, null],
    [null, 8, 9],
];
$dataSet = DataSetFactory::create($data);
$mapper = new Mapper([0, 1 => Policy::skip()], [2 => Policy::skip()]);
$dataSet->prepare($mapper);
This uses the following inputs/outputs:
Inputs:
[
    [1, 2],
    [null, 8], // No policy applied on 0
]
Outputs:
[
    [3],
    [9],
]
- Policy::replaceWith(): if the value at the corresponding index is empty (NULL, false, ''), it is replaced with the given value.
Example
$data = [
    [1, 2, 3],
    [4, null, 5],
    [6, 7, null],
    [null, 8, 9],
];
$dataSet = DataSetFactory::create($data);
$mapper = new Mapper([0, 1 => Policy::replaceWith('Unknown')], [2 => Policy::replaceWith(-1)]);
$dataSet->prepare($mapper);
This uses the following inputs/outputs:
Inputs:
[
    [1, 2],
    [4, 'Unknown'],
    [6, 7],
    [null, 8], // No policy applied on 0
]
Outputs:
[
    [3],
    [5],
    [-1],
    [9],
]
- Policy::replaceWithAvg(): replaces empty values with the average of the column, computed from the original DataSet.
Example
$data = [
    [1, 2, 3],
    [4, null, 5],
    [6, 7, null],
    [null, 8, 9],
];
$dataSet = DataSetFactory::create($data);
$mapper = new Mapper([0 => Policy::replaceWithAvg(), 1 => Policy::skip()], [2 => Policy::replaceWithAvg()]);
$dataSet->prepare($mapper);
This uses the following inputs/outputs:
Inputs:
[
    [1, 2],
    [6, 7],
    [2.75, 8], // Avg(0) = (1 + 4 + 6 + 0) / 4 = 2.75
]
Outputs:
[
    [3],
    [4.25], // Avg(2) = (3 + 5 + 0 + 9) / 4 = 4.25, by the same convention
    [9],
]
- Policy::replaceWithMostCommon(): replaces empty values with the most common value (the one that occurs most often). If several values share the same frequency, one of them is chosen at random.
Example
$data = [
    [1, 2, 3],
    [1, null, 5],
    [6, 7, null],
    [null, 8, 9],
];
$dataSet = DataSetFactory::create($data);
$mapper = new Mapper([0 => Policy::replaceWithMostCommon(), 1 => Policy::skip()], [2]);
$dataSet->prepare($mapper);
This uses the following inputs/outputs:
Inputs:
[
    [1, 2],
    [6, 7],
    [1, 8],
]
Outputs:
[
    [3],
    [null],
    [9],
]
- Policy::custom(): create your own policy.
The callable is invoked only when the value is empty. The callable must:
- accept, by reference, the value of the column at each iteration
- accept the row as a second parameter
- return true to keep the row or false to skip it
Example
$data = [
    [180, 'Male'],
    [177, 'Female'],
    [170, ''],
    [183, 'Male'],
];
$dataSet = DataSetFactory::create($data);
// Guess the missing gender from the height in column 0.
$genderCleaner = function (&$value, $line) {
    if ($line[0] > 175) {
        $value = 'Male';
    } else {
        $value = 'Female';
    }
    return true;
};
$mapper = new Mapper([0], [1 => Policy::custom($genderCleaner)]);
$dataSet->prepare($mapper);
This uses the following inputs/outputs:
Inputs:
[
    [180],
    [177],
    [170],
    [183],
]
Outputs:
[
    ['Male'],
    ['Female'],
    ['Female'],
    ['Male'],
]
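The policy contract described in this list (run only on empty values, modify the value by reference, return false to drop the row) can be sketched in plain PHP. Everything here is a hypothetical stand-in for illustration: `isEmptyValue`, `skipPolicy`, `replaceWithPolicy`, `columnAverage` and `applyPolicies` are not names from the library. The average counts empty cells as 0, which is an assumption taken from the Avg(0) comment in the replaceWithAvg example.

```php
<?php
// Hypothetical stand-ins for the policies described above; not the library's code.
function isEmptyValue($value): bool
{
    return $value === null || $value === false || $value === '';
}

// Returns a policy that drops the whole row.
function skipPolicy(): callable
{
    return function (&$value, array $row): bool {
        return false;
    };
}

// Returns a policy that substitutes a fixed value and keeps the row.
function replaceWithPolicy($replacement): callable
{
    return function (&$value, array $row) use ($replacement): bool {
        $value = $replacement;
        return true;
    };
}

// Average over the original column, counting empty cells as 0.
function columnAverage(array $rows, $column): float
{
    $sum = 0;
    foreach ($rows as $row) {
        $sum += isEmptyValue($row[$column]) ? 0 : $row[$column];
    }
    return $sum / count($rows);
}

function applyPolicies(array $rows, array $policies): array
{
    $kept = [];
    foreach ($rows as $row) {
        $keep = true;
        foreach ($policies as $column => $policy) {
            if (!isEmptyValue($row[$column])) {
                continue; // policies only run on empty values
            }
            $value = $row[$column];
            $keep = $policy($value, $row); // $value is passed by reference
            $row[$column] = $value;
            if (!$keep) {
                break;
            }
        }
        if ($keep) {
            $kept[] = $row;
        }
    }
    return $kept;
}

$data = [[1, 2, 3], [4, null, 5], [6, 7, null], [null, 8, 9]];
$clean = applyPolicies($data, [
    0 => replaceWithPolicy(columnAverage($data, 0)),
    1 => skipPolicy(),
    2 => replaceWithPolicy(-1),
]);
print_r($clean);
```

The average is computed before any row is dropped, which matches the statement that it is derived from the original DataSet.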
Renaming DataSet keys
You can rename the keys of the DataSet.
$data = [
['Zac', 'Male', 180],
['Emily', 'Female', 177],
['Edward', 'Male', 175],
['Mark', 'Male', 183],
['Lesly', 'Female', 170],
];
$dataSet = DataSetFactory::create($data);
$mapper = new Mapper([0, 1], [2]);
$dataSet->prepare($mapper);
$dataSet->rename([0 => 'Name', 1 => 'Gender', 2 => 'Height']);
The input and output matrices used are then:
Inputs:
[
['Name' => 'Zac', 'Gender' => 'Male'],
['Name' => 'Emily', 'Gender' => 'Female'],
['Name' => 'Edward', 'Gender' => 'Male'],
['Name' => 'Mark', 'Gender' => 'Male'],
['Name' => 'Lesly', 'Gender' => 'Female'],
]
Outputs:
[
['Height' => 180],
['Height' => 177],
['Height' => 175],
['Height' => 183],
['Height' => 170],
]
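The effect of the renaming on each row can be illustrated with a short plain-PHP sketch. The `renameKeys` helper is hypothetical and is not the library's `rename()` method; it only shows the old-key to new-key mapping applied row by row.

```php
<?php
// Hypothetical helper, not the library's rename(): re-keys every row using
// an old-key => new-key map and leaves unmapped keys unchanged.
function renameKeys(array $rows, array $map): array
{
    return array_map(function (array $row) use ($map) {
        $renamed = [];
        foreach ($row as $key => $value) {
            $newKey = isset($map[$key]) ? $map[$key] : $key;
            $renamed[$newKey] = $value;
        }
        return $renamed;
    }, $rows);
}

$data = [
    ['Zac', 'Male', 180],
    ['Emily', 'Female', 177],
];
$renamed = renameKeys($data, [0 => 'Name', 1 => 'Gender', 2 => 'Height']);
print_r($renamed[0]);
```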