# Hadoop Developer's Guide

> Note: The scripts in this guide are written for CentOS. On other operating systems, adapt the scripts before running them.

## 1. Create Hadoop Client Node

UHadoop offers two access modes: a client node and SSH. The client node is the recommended mode; for details, see [Cluster Access](/docs/uhadoop/developer/access).


## 2. HDFS

HDFS is a highly fault-tolerant and high-throughput distributed file system. It is designed to be scalable and easy to use, suitable for storing massive files.

#### 2.1 Basic HDFS Operations

- Query Files
  ```
  Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path>]
  ```
- Upload Files
  ```
  Usage: hadoop fs [generic options] -put [-f] [-p] [-l] <localsrc> ... <dst>
  ```
- Download Files
  ```
  Usage: hadoop fs [generic options] -get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
  ```
For more details, run `hadoop fs -help`.
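
As a concrete illustration of the three commands above, the following sketch uploads a local file, lists the target directory, and downloads the file again. The function name `hdfs_roundtrip` and the `HADOOP_CMD` variable are hypothetical helpers introduced for this example, not part of Hadoop. Setting `HADOOP_CMD=echo` only prints the commands, which allows checking them without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: HADOOP_CMD defaults to the real CLI; set it to "echo" for a dry run.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# hdfs_roundtrip <local-file> <hdfs-dir>
hdfs_roundtrip() {
  local file="$1" dir="$2" name
  name=$(basename "$file")
  "$HADOOP_CMD" fs -put -f "$file" "$dir/" &&         # upload, overwriting an existing copy
  "$HADOOP_CMD" fs -ls -h "$dir" &&                   # list with human-readable sizes
  "$HADOOP_CMD" fs -get "$dir/$name" "./$name.copy"   # download under a new local name
}
```

For example, `hdfs_roundtrip uhadoop.txt /tmp` would upload `uhadoop.txt` to `/tmp/` and fetch it back as `./uhadoop.txt.copy`.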

#### 2.2 WebHDFS

WebHDFS exposes HDFS through a RESTful HTTP interface that can be used to operate on HDFS files. A WebHDFS client first contacts the NameNode to obtain the address of the DataNode that holds the file, and then exchanges data directly with that DataNode.

###### 2.2.1 Upload File

A UHadoop cluster is configured with two Master nodes by default. Only one NameNode is in the Active state at any moment; the other is in Standby. The examples below assume that the NameNode on uhadoop-\*\*\*\*\*\*-master1 is Active.
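
Because either Master may be the Active one, a script can probe both before issuing requests. The sketch below is one possible approach, not an official UHadoop tool: it sends a cheap `GETFILESTATUS` read to each host and skips any host that answers with the Standby message "Operation category READ is not supported in state standby". The function names are hypothetical, and treating every other non-empty response as Active is a simplifying assumption.

```shell
# Succeeds if the response text on stdin is the Standby rejection.
is_standby_response() {
  grep -q "not supported in state standby"
}

# pick_active_namenode <host>... : print the first host that answers without the Standby error.
pick_active_namenode() {
  local host body
  for host in "$@"; do
    # GETFILESTATUS on "/" is a lightweight read; 50070 is the NameNode HTTP port used in this guide.
    body=$(curl -s --max-time 5 "http://${host}:50070/webhdfs/v1/?op=GETFILESTATUS") || continue
    [ -n "$body" ] || continue
    if ! printf '%s' "$body" | is_standby_response; then
      echo "$host"
      return 0
    fi
  done
  return 1
}
```

A call such as `pick_active_namenode 'uhadoop-******-master1' 'uhadoop-******-master2'` would print the host to use in the requests that follow.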

- Data Preparation

  ```
    touch uhadoop.txt
    echo "uhadoop" > uhadoop.txt
  ```

- Create File Request

  ```
    curl -i -X PUT "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=CREATE"
  ```

  > Note:
  > 1. The host entries of all cluster nodes must be added to the machine that runs this command (for example, in `/etc/hosts`).
  > 2. If the response is "Operation category READ is not supported in state standby", retry against uhadoop-\*\*\*\*\*\*-master2.

  The response contains a `Location` header, which is the DataNode address the file should be written to:

  ```
  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
  Content-Length: 0
  ```

- Upload File Using the Above Location Address

  ```
    curl -i -X PUT -T uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=CREATE&namenoderpcaddress=Ucluster&overwrite=false"
  ```
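
The two requests above can be chained so that the `Location` value does not have to be copied by hand. The following is a minimal sketch under the assumption that the NameNode replies with a 307 redirect as shown; `extract_location` and `webhdfs_put` are hypothetical helper names, not part of Hadoop or curl.

```shell
# Print the value of the first Location header found on stdin (header name is case-insensitive).
extract_location() {
  awk 'tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; exit }'
}

# webhdfs_put <namenode-host> <local-file> <hdfs-path>
webhdfs_put() {
  local namenode="$1" file="$2" path="$3" location
  # Step 1: ask the NameNode where to write; no file data is sent yet.
  location=$(curl -s -i -X PUT "http://${namenode}:50070/webhdfs/v1${path}?op=CREATE" | extract_location)
  if [ -z "$location" ]; then
    echo "no Location header in NameNode response" >&2
    return 1
  fi
  # Step 2: send the file body to the DataNode URL returned in step 1.
  curl -i -X PUT -T "$file" "$location"
}
```

For example, `webhdfs_put 'uhadoop-******-master1' uhadoop.txt /tmp/uhadoop.txt`. HTTP headers end in a carriage return, so the helper strips it before the URL is reused.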

###### 2.2.2 Append File

- Data Preparation

  ```
    touch  append_uhadoop.txt
    echo "test_content" > append_uhadoop.txt
  ```

- Get the Address of the File to be Appended

  ```
    curl -i -X POST "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=APPEND"
  ```
  The response contains a `Location` header, which is the DataNode address of the file:
  ```
  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
  Content-Length: 0
  ```

- Append File

  ```
    curl -i -X POST -T append_uhadoop.txt "http://uhadoop-******-core*:50075/webhdfs/v1/tmp/uhadoop.txt?op=APPEND&namenoderpcaddress=Ucluster"
  ```

###### 2.2.3 Open and Read Files

```
  curl -i -L "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=OPEN"
```

###### 2.2.4 Delete Files
```
  curl -i -X DELETE "http://uhadoop-******-master1:50070/webhdfs/v1/tmp/uhadoop.txt?op=DELETE"
```

#### 2.3 HttpFS

HttpFS is an HTTP gateway for HDFS, originally provided by Cloudera, that reads from and writes to HDFS through the WebHDFS REST API. Unlike WebHDFS, HttpFS does not require the client to reach every node of the cluster; the client only needs access to the single machine that runs the HttpFS service (by default, UHadoop starts HttpFS on master1:14000). Because HttpFS runs as a web application in an embedded Tomcat and all data passes through it, its performance is more constrained than that of WebHDFS.
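
Since every HttpFS request goes to the same host and port and must carry `user.name`, the URL can be assembled by a small helper. The sketch below is illustrative only: `httpfs_url` is a hypothetical function, port 14000 is the UHadoop default mentioned above, and no URL-encoding is performed, so it assumes paths without special characters.

```shell
# httpfs_url <host> <hdfs-path> <op> <user> [extra-query]
# Print an HttpFS URL that always includes user.name.
httpfs_url() {
  local host="$1" path="$2" op="$3" user="$4" extra="${5:-}"
  local url="http://${host}:14000/webhdfs/v1${path}?op=${op}&user.name=${user}"
  if [ -n "$extra" ]; then
    url="${url}&${extra}"   # optional extra parameters, e.g. data=true for uploads
  fi
  printf '%s\n' "$url"
}
```

For example, `curl -i -L "$(httpfs_url 'uhadoop-******-master1' /tmp/httpfs_uhadoop.txt OPEN root)"` reads a file through HttpFS.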

###### 2.3.1 Upload File

- Data Preparation

  ```
    touch httpfs_uhadoop.txt
    echo "httpfs_uhadoop" > httpfs_uhadoop.txt
  ```

- Upload Data

  ```
    curl -i -X PUT -T httpfs_uhadoop.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=CREATE&user.name=root&data=true"
  ```

  > Note:
  > 1. The host entry of master1 must be added to the machine that runs this command.
  > 2. `user.name` must be included in the URL; otherwise the request fails with "HTTP Status 401 - Authentication required".

###### 2.3.2 Append File

- Data Preparation

  ```
    touch append_httpfs.txt
    echo "append_httpfs" > append_httpfs.txt
  ```

- Append File

  ```
    curl -i -X POST -T append_httpfs.txt --header "Content-Type: application/octet-stream" "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=APPEND&user.name=root&data=true"
  ```

###### 2.3.3 Open and Read File
  ```
    curl -i -L "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=OPEN&user.name=root"
  ```

###### 2.3.4 Delete File

```
  curl -i -X DELETE "http://uhadoop-******-master1:14000/webhdfs/v1/tmp/httpfs_uhadoop.txt?op=DELETE&user.name=root"
```

#### 2.4 MapReduce Job

This section uses TeraSort as an example to demonstrate how to submit a MapReduce job.

- Generate the TeraSort input dataset

  ```
    hadoop jar /home/hadoop/hadoop-examples.jar teragen 100 /tmp/terasort_input
  ```

- Submit Task

  ```
    hadoop jar /home/hadoop/hadoop-examples.jar terasort /tmp/terasort_input /tmp/terasort_output
  ```
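
The first argument to `teragen` is a row count, and each TeraGen row is 100 bytes, so `teragen 100` above produces only about 10,000 bytes of input. The sketch below converts a target size into a row count and chains the three TeraSort stages. The function names and the `input`/`output`/`report` subdirectories are hypothetical, and it assumes the examples jar also bundles `teravalidate`, as the stock Hadoop examples jar does.

```shell
# teragen_rows_for_bytes <bytes> : each TeraGen row is 100 bytes, so rows = bytes / 100.
teragen_rows_for_bytes() {
  echo $(( $1 / 100 ))
}

# run_terasort <input-bytes> <hdfs-base-dir> : generate, sort, then validate the output.
run_terasort() {
  local jar="/home/hadoop/hadoop-examples.jar" bytes="$1" base="$2"
  hadoop jar "$jar" teragen "$(teragen_rows_for_bytes "$bytes")" "$base/input" &&
  hadoop jar "$jar" terasort "$base/input" "$base/output" &&
  hadoop jar "$jar" teravalidate "$base/output" "$base/report"
}
```

For example, `run_terasort 1000000000 /tmp/terasort` would generate roughly 1 GB of input before sorting it.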

#### 2.5 HDFS Daily Operations

###### 2.5.1 Restart Service

Restart NameNode: `service hadoop-hdfs-namenode restart`

Restart DataNode: `service hadoop-hdfs-datanode restart`

Restart ResourceManager: `service hadoop-yarn-resourcemanager restart`

Restart NodeManager: `service hadoop-yarn-nodemanager restart`

Restart the entire Hadoop service: use the cluster service management page of the console.

###### 2.5.2 Check HDFS status and node information

```
  hdfs dfsadmin -report
```

###### 2.5.3 Modify the Number of Replicas of HDFS Files

```
  hdfs dfs -setrep -R [replication-factor] [targetDir]
```

> Example: to set the number of replicas of all files under the HDFS root directory to 2, run `hdfs dfs -setrep -R 2 /`.

###### 2.5.4 View HDFS File System Status

```
  hadoop fsck /
```

Sample output:

```
 Total size:    455660769497 B (Total open files size: 44723814 B)
 Total dirs:    47975
 Total files:   70456
 Total symlinks:        0 (Files currently being written: 11)
 Total blocks (validated):  69916 (avg. block size 6517260 B) (Total open file blocks (not validated): 10)
 Minimally replicated blocks:   69916 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:   87 (0.12443504 %)
 Mis-replicated blocks:     0 (0.0 %)
 Default replication factor:    3
 Average block replication: 3.0011585
 Corrupt blocks:        0
 Missing replicas:      522 (0.24815665 %)
 Number of data-nodes:      4
 Number of racks:       1
FSCK ended at Thu Nov 24 16:08:12 CST 2016 in 2044 milliseconds

The filesystem under path '/' is HEALTHY
```

`HEALTHY` in the last line indicates that the HDFS file system is in a normal state, with no corrupt blocks and no data loss.
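
For routine checks, the fsck summary can be parsed instead of read by eye. Note that in the sample output the file system is reported as HEALTHY even though 87 blocks are under-replicated, so it can be worth tracking that counter separately. The following sketch assumes the summary format shown above; `fsck_is_healthy` and `fsck_field` are hypothetical helper names.

```shell
# Succeeds if the fsck output on stdin reports a HEALTHY file system.
fsck_is_healthy() {
  grep -q "is HEALTHY"
}

# fsck_field <label> : print the first value after "<label>:" in the fsck summary on stdin.
fsck_field() {
  awk -F':' -v label="$1" '
    {
      key = $1
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the label
      if (key == label) {
        split($2, parts, " ")            # first whitespace-separated token is the count
        print parts[1]
        exit
      }
    }'
}
```

For example, after `hadoop fsck / > fsck.out`, running `fsck_is_healthy < fsck.out && fsck_field "Under-replicated blocks" < fsck.out` prints the under-replicated block count only when the file system is healthy.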