distcp

Sergey Ostapov

DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.

More information can be found in the Hadoop DistCp Guide.

The usage is as follows:

$ hadoop distcp [OPTIONS] <src> <dst>

Arguments
-append	Reuses existing data in target files and appends new data to them if possible
-async	Runs `distcp` asynchronously (non-blocking). Without this option, the command blocks until the copy finishes
-atomic	Commits all changes or none
-bandwidth <arg>	Specifies the bandwidth per map in MB/second. Fractional values are accepted
-blocksperchunk <arg>	If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination. By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements the `getBlockLocations` method and the target file system implements the `concat` method.
-copybuffersize <arg>	The size of the copy buffer to use (in bytes). Defaults to 8192
-delete	Deletes files on the target that are missing in the source. Applicable only together with the `-update` or `-overwrite` option
-diff <arg>	Uses the snapshot diff report to identify the difference between source and target
-f <arg>	Specifies a list of files to copy
-filters <arg>	The path to a file containing a list of strings for paths to be excluded from the copy
-i	Ignores failures during copy
-log <arg>	Specifies a directory on DFS where `distcp` execution logs are saved
-m <arg>	The maximum number of concurrent maps to use for copy
-numListstatusThreads <arg>	The number of threads to use for building file listing (max 40)
-overwrite	Overwrites target files unconditionally, even if they exist
-p <arg>	Preserves status. The flags are `rbugpcaxt`: replication, block size, user, group, permission, checksum type, ACL, XATTR, timestamps. If `-p` is specified with no <arg>, it preserves replication, block size, user, group, permission, checksum type, and timestamps
-rdiff <arg>	Uses the target snapshot diff report to identify changes made on the target
-skipcrccheck	Skips CRC checks between source and target paths
-strategy <arg>	The copy strategy to use. Valid values are `dynamic` and `uniformsize`. The default, `uniformsize`, divides work based on file sizes
-tmp <arg>	Intermediate work path to be used for atomic commits
-update	Updates the target, copying only missing files and overwriting the files that are different from source
-v	Logs additional info (path, size) to the SKIP/COPY log
-xtrack <arg>	Saves information about missing source files to the specified directory
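
As an illustration of the `-f` option, the sketch below prepares a source list with one fully qualified URI per line and passes it to `distcp` in place of the individual source paths. The NameNode addresses, the `/srclist` location, and the `RUN` guard variable are assumptions made for this example only. Nothing is sent to the cluster unless `RUN=1` is set.

```shell
# Hypothetical local file holding one source URI per line
SRCLIST="$(mktemp)"
printf '%s\n' \
  "hdfs://nn1:8020/foo/a" \
  "hdfs://nn1:8020/foo/b" > "$SRCLIST"

# Show the list for review
cat "$SRCLIST"

# Upload the list and start the copy only when explicitly requested
if [ "${RUN:-0}" = "1" ]; then
  hadoop fs -put -f "$SRCLIST" hdfs://nn1:8020/srclist
  hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
fi
```

Keeping the list in a file is convenient when the set of sources is long or generated by another job.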

Examples:

$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
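
Several options from the table above are often combined for a repeated synchronization. The following is a minimal sketch, not a recommended configuration: it reuses the cluster addresses from the examples, and the map count (20), the per-map bandwidth (10 MB/second), and the `RUN` guard variable are illustrative assumptions. The command is printed first and is executed only when `RUN=1` is set.

```shell
# Source and target reused from the examples above
SRC="hdfs://nn1:8020/foo/bar"
DST="hdfs://nn2:8020/bar/foo"

# -update copies only missing or changed files,
# -delete removes target files that no longer exist in the source,
# -m and -bandwidth limit the load (values are illustrative)
DISTCP_CMD="hadoop distcp -update -delete -m 20 -bandwidth 10 $SRC $DST"

# Print the command for review
echo "$DISTCP_CMD"

# Execute only when explicitly requested
if [ "${RUN:-0}" = "1" ]; then
  $DISTCP_CMD
fi
```

Because `-delete` removes data on the target, reviewing the printed command before running it is a reasonable precaution.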
