Radoop Documentation, Release 2.1
-- Please do not change the first line of the script. The default script does
-- nothing, just delivers the input data set on the output port. You may
-- refer to other input data sets in a similar way to the last line.
CREATE VIEW ##outputtable## AS
SELECT *
FROM ##inputtable1##
• auto validate (expert): Validate the script automatically using the remote Hive connection. This is required for
appropriate metadata generation at design time.
– Default value: true
• user defined functions (expert): Add User-Defined Functions (UDFs) that can be used in the script. Each temporary function is defined by its name and the name of the class that implements it. Please note that the class must
exist both in Hadoop’s classpath and Hive’s classpath.
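Building on the default template above, a Hive script that derives its output from more than one input might look like the following sketch. The `##inputtable2##` keyword follows the numbering convention noted in the template's comments; the column names `id` and `price` are purely hypothetical and stand in for attributes of your own data sets.

```sql
-- Hypothetical example: join the first two input data sets and keep only
-- rows above a threshold. Column names (id, price) are illustrative, not
-- provided by the operator.
CREATE VIEW ##outputtable## AS
SELECT t1.*
FROM ##inputtable1## t1
JOIN ##inputtable2## t2 ON t1.id = t2.id
WHERE t1.price > 100
```

As with the default script, the `CREATE VIEW ##outputtable## AS` line should be left unchanged so that Radoop can deliver the result on the output port.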
13.17.2 Pig Script
Synopsis
Runs an arbitrary Pig script.
Description
This operator is for advanced users who want to write their Pig scripts to manipulate their data directly in the process
data flow. This operator also enables Pig experts to integrate their existing Pig Latin code into a Radoop process. To
be able to do this, please note the following instructions about handling input and output data in your Pig script.
As a Pig Latin script may work on multiple inputs and produce multiple outputs, the operator may have arbitrary
number of inputs and outputs. Just connect an input example set to the free input port if you want to use it in your Pig
script. Similarly, you can connect an output port if you want to produce another output with this operator. Your Pig
script should specify the data on these output ports.
The first input data set should be referred to in the Pig script using the following keywords: ##inputfile1##, ##inputstorage1##, ##inputcolumns1##. Before running the operator, Radoop will replace these keywords with the appropriate
values to produce a valid Pig script. The ##inputfile1## keyword refers to the directory that contains the data of the
first input example set. The ##inputstorage1## keyword will be replaced by the appropriate Pig storage handler class
(with its arguments, such as the field separator) that the software determines automatically for this input data set. The
##inputcolumns1## keyword refers to the list of column name and column type pairs of the input example table. The
conversion of RapidMiner (and Hive) column types to Pig data types is done automatically. The default Pig script of
the operator shows a simple line that loads an input example set using these keywords. The relation name here can be
any arbitrary name.
operator_input1 = LOAD '##inputfile1##' USING ##inputstorage1## AS (##inputcolumns1##);
You can load all input example sets the same way, just use the next integer number in the keywords instead of 1. Only
in very rare cases should you consider changing this template for loading your input data.
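For instance, loading a second input example set follows the same pattern with the index incremented (a sketch; the relation name on the left-hand side is arbitrary):

```pig
-- Load the second connected input example set; Radoop substitutes the
-- ##...2## keywords before the script runs.
operator_input2 = LOAD '##inputfile2##' USING ##inputstorage2## AS (##inputcolumns2##);
```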
You can later insert a column list of your first input example set into the script with the keyword ##inputcolumnaliases1##. For example, this may be used in a FOREACH expression, as in the following default script line.
operator_output1 = FOREACH operator_input1 GENERATE ##inputcolumnaliases1##;
Otherwise, you may refer to the columns of an example set by their RapidMiner attribute names, provided that you load
your data with the default template (##inputcolumns1##).
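To illustrate such column references, the following sketch filters the first input relation on a hypothetical numeric attribute called `price` before projecting all columns; the attribute name is an assumption and must match one of your own RapidMiner attributes.

```pig
-- Hypothetical example: 'price' stands for one of your RapidMiner
-- attribute names; it is not defined by the operator itself.
operator_input1 = LOAD '##inputfile1##' USING ##inputstorage1## AS (##inputcolumns1##);
filtered = FILTER operator_input1 BY price > 100;
operator_output1 = FOREACH filtered GENERATE ##inputcolumnaliases1##;
```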