.txt" );inner(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"maptest_a.txt"),¥'tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"maptest_b.txt"),¥'tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat,"maptest_c.txt"))String CompositeInputFormat.compose(String op, Class inf,Path.path)This method is identical to the String variant except that Path objects instead of String objectsprovide the table paths.Building and Running a JoinThere are two critical pieces of engaging the join behavior: the input format must be set toCompositeInputFormat.class, and the key mapred.join.expr must have a value that is a validjoin specification.Optionally, the mapper, reducer, reduce count, and output key/valueclasses may be set.The mapper key class will be the key class of the leftmost data source, and the key classesof all data sources should be identical.The mapper value class will be TupleWritable for inner,outer, and user-defined join operators.For the override join operator, the mapper value classwill be the value class of the data sources.In Listing 8-7, note that the quote characters surrounding the path names are escaped.271CHAPTER 8 %ÿþ ADVANCED AND ALTERNATE MAPREDUCE TECHNIQUESListing 8-7.Synthetic Example of Configuring a Join Map Job/** All of the outputs are Text.*/conf.setOutputFormat(TextOutputFormat.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(Text.class);conf.setMapperClass(MyMap.class);/** setting the input format to {@link CompositeInputFormat}* is the trigger for the map-side join behavior.*/conf.setInputFormat(CompositeInputFormat.class);conf.set("mapred.join.expr","override(tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, ¥'\"maptest_a.txt\"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, ¥'\"maptest_b.txt\"),tbl(org.apache.hadoop.mapred.KeyValueTextInputFormat, ¥'\"maptest_c.txt\"))");Synthetic Example of Configuring a Join Map Job Using the Compose Helper/** All of the outputs are Text.*/conf.setOutputFormat(TextOutputFormat.class);conf.setOutputKeyClass(Text.class);conf.setOutputValueClass(Text.class);conf.setMapperClass(MyMap.class);/** setting the input format to {@link CompositeInputFormat}* is the trigger for the map-side join behavior.*/conf.setInputFormat(CompositeInputFormat.class);conf.set("mapred.join.expr",CompositeInputFormat.compose("override",¥'KeyValueTextInputFormat.class, "maptest_a.txt",¥'"maptest_b.txt", "maptest_c.txt"));The Magic of the TupleWritable in the Mapper.map() MethodThe map method for the inner and outer join has a value class of TupleWritable, and each callto the map method presents one join result row.The TupleWritable class provides a numberof ways to understand the shape of the join result row.Listing 8-8 provides a sample mapperthat demonstrates the use of TupleWritable.size(), TupleWriter.iterator(), TupleWritable.has(), and TupleWritable.get() methods.Table 8-13 provides a description of these methods.272CHAPTER 8 %ÿþ ADVANCED AND ALTERNATE MAPREDUCE TECHNIQUESTable 8-13.TupleWritable Methods for Interacting with the Join Result RowMethod Argument Descriptionboolean has(int i) The ordinal number of Returns true if that dataset provides a value toa dataset.this result row.Writable get(int i) The ordinal number Returns the value object that the dataset hasof a dataset.provided to this result row.The object returnedby get will be reinitialized on the next call toget.The application will need to make a copyof the contents before calling get() again ifthe contents need to exist past the next call 
... > 'C:\tmp\dataset'

head /cygdrive/c/tmp/dataset
0c065a60:0c065a6a
0c065a60;
0c067f4d:0c067f57
0c067f4d;
0c06e9c7:0c06e9d1
0c06e9c7;
0c06ef2d:0c06ef37
0c06ef2d;
0c1e1694:0c1e169e
0c1e1694;

In the command shown in Listing 9-1, a dataset was prepared with converted IP addresses from an Apache log file. Listing 9-2 runs a streaming job to see how the records will actually be sorted by the default comparator. As you can see from the Listing 9-2 output, the search space records (0c065a60:0c065a6a) sort before a search request record that starts with the same address (0c065a60;). Success! This is the pattern we were hoping to achieve.

Note: Cygwin users are likely to always see an error message that starts with cygpath: cannot create short name of c:\Documents and Settings\Jason\My Documents\Hadoop Source\hadoop-0.19\logs. This error may be ignored. Listing 9-2 is structured to run from the Hadoop installation directory.

Listing 9-2. Running a Streaming Job to Verify Comparator Ordering

bin/hadoop jar contrib/streaming/hadoop-0.19-streaming.jar \
    -D mapred.job.tracker=local -D fs.default.name=file:/// \
    -input 'C:\tmp\dataset' -output 'C:\tmp\sorted' \
    -mapper 'C:\cygwin\bin\cat' -reducer 'C:\cygwin\bin\cat' \
    -numReduceTasks 1

jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
mapred.FileInputFormat: Total input paths to process : 1
streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-Jason/mapred/local]
streaming.StreamJob: Running job: job_local_0001
streaming.StreamJob: Job running in-process (local Hadoop)
mapred.FileInputFormat: Total input paths to process : 1
mapred.MapTask: numReduceTasks: 1
mapred.MapTask: io.sort.mb = 1
mapred.MapTask: data buffer = 796928/996160
mapred.MapTask: record buffer = 2620/3276
streaming
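The ordering verified here falls out of plain byte-by-byte comparison, which is what the default comparator performs on Text keys: the colon (byte 0x3a) sorts before the semicolon (byte 0x3b). A standalone sketch (not from the book) that checks this assumption with Hadoop's Text type:

import org.apache.hadoop.io.Text;

/** Confirms the byte ordering the streaming job above relies on. */
public class ComparatorOrderCheck {
  public static void main(String[] args) {
    Text searchSpace = new Text("0c065a60:0c065a6a");  // search space record key
    Text searchRequest = new Text("0c065a60;");        // search request record key
    // Text.compareTo() compares the underlying bytes lexicographically,
    // just as the default MapReduce comparator does. Because ':' is 0x3a
    // and ';' is 0x3b, this prints a negative number: the search space
    // record sorts ahead of the request record for the same address.
    System.out.println(searchSpace.compareTo(searchRequest));
  }
}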