distcp

A tool for large inter-cluster and intra-cluster copying. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.

More information can be found in the Hadoop DistCp Guide.

The usage is as follows:

$ hadoop distcp [OPTIONS] <src> <dst>
Arguments

-append

Reuses existing data in target files and appends new data to them if possible

-async

Runs distcp without blocking: the command returns as soon as the Hadoop job is launched instead of waiting for the copy to finish

-atomic

Commits all changes or none

-bandwidth <arg>

Specifies the bandwidth per map in MB/second. Fractional values are accepted

-blocksperchunk <arg>

If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination. By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements the getBlockLocations method and the target file system implements the concat method.

-copybuffersize <arg>

The size of the copy buffer to use (in bytes). Defaults to 8192

-delete

Deletes files on the target that are missing in the source. Applicable only together with the -update or -overwrite option

-diff <arg>

Uses the snapshot diff report to identify the difference between source and target

-f <arg>

Specifies the URI of a file that lists the source paths to copy

-filters <arg>

The path to a file containing a list of strings for paths to be excluded from the copy

-i

Ignores failures during copy

-log <arg>

Specifies a directory on DFS where distcp execution logs are saved

-m <arg>

The maximum number of concurrent maps to use for copy

-numListstatusThreads <arg>

The number of threads to use for building file listing (max 40)

-overwrite

Overwrites target files unconditionally, even if they exist

-p <arg>

Preserves status (rbugpcaxt): replication, block size, user, group, permission, checksum type, ACL, XATTR, timestamps. If -p is specified with no <arg>, it preserves replication, block size, user, group, permission, checksum type, and timestamps.

-rdiff <arg>

Uses the target snapshot diff report to identify changes made on the target

-skipcrccheck

Skips CRC checks between source and target paths

-strategy <arg>

The copy strategy to use. The default is dividing work based on file sizes

-tmp <arg>

Intermediate work path to be used for atomic commits

-update

Updates the target, copying only missing files and overwriting the files that are different from source

-v

Logs additional info (path, size) to the SKIP/COPY log

-xtrack <arg>

Saves information about missing source files to the specified directory
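
The -f and -filters options both take a file rather than inline values, so a short sketch of preparing those files may help. The host names, the /srclist location, and the exclusion patterns below are placeholders chosen for illustration. The sketch also assumes one entry per line in both files and a filters file on the local file system; check the Hadoop DistCp Guide for the behavior of your version.

```shell
#!/bin/sh
# Sketch: prepare a source list for -f and an exclusion list for -filters.
# All hosts, paths, and patterns are placeholders, not values from a real cluster.
workdir=$(mktemp -d)

# Source list for -f: one source URI per line.
cat > "$workdir/srclist" <<'EOF'
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
EOF

# Exclusion list for -filters: one pattern per line (assumed to be regular expressions).
cat > "$workdir/filters" <<'EOF'
.*\.tmp$
.*/_temporary/.*
EOF

# The list is referenced by URI, so it is uploaded first (assumed location):
#   hadoop fs -put "$workdir/srclist" hdfs://nn1:8020/srclist

# With -f, no <src> argument is given; only the destination follows the options.
distcp_cmd="hadoop distcp -f hdfs://nn1:8020/srclist -filters $workdir/filters hdfs://nn2:8020/bar/foo"
echo "$distcp_cmd"
```

The script only prints the command so that it can be reviewed before it is run against a cluster.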

Examples:

$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
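
For a recurring synchronization, several of the options above are typically combined. The following is a minimal sketch, not a recommended configuration: sync_dir and DRY_RUN are hypothetical names introduced for this example, and the values 20 (maps) and 50 (MB/second per map) are arbitrary.

```shell
#!/bin/sh
# Sketch: wrap a recurring distcp sync in a function with a dry-run switch.
# sync_dir and DRY_RUN are hypothetical names; 20 and 50 are illustrative values.
sync_dir() {
  src=$1
  dst=$2
  # -update    copies only missing or changed files
  # -delete    removes target files that are missing in the source (requires -update or -overwrite)
  # -m         caps the number of concurrent maps
  # -bandwidth caps the bandwidth per map in MB/second
  set -- hadoop distcp -update -delete -m 20 -bandwidth 50 "$src" "$dst"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"    # dry run: print the command only
  else
    "$@"         # real run: execute distcp
  fi
}

sync_dir hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

By default the function only prints the command. Setting DRY_RUN=0 in the environment executes it, which is worth doing only after checking the printed line, because -delete removes data on the target.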