
Word Count with the CDH Example Programs

The wordcount program is the classic "Hello World" of Hadoop. CDH ships with a wordcount example that can be used to verify that the deployment works.

# Unpack the prepared complete works of Shakespeare
[sujx@elephant ~]$ gzip -d shakespeare.txt.gz

# Upload it to the Hadoop file system
[sujx@elephant ~]$ hdfs dfs -mkdir /user/sujx/input
[sujx@elephant ~]$ hdfs dfs -put shakespeare.txt /user/sujx/input

# List the example programs that are available
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
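
Invoking one of these programs without arguments prints its usage string, which is a quick way to check the expected parameters, for example:

[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount
# prints a short usage message along the lines of: Usage: wordcount <in> [<in>...] <out>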

# Run the MapReduce job; the output directory is created automatically
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/sujx/input/shakespeare.txt /user/sujx/output/
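
The number of part-r-* files in the output equals the number of reducers. If a single result file is preferred, the examples accept the usual generic options, so the reducer count can be set on the command line (the output path /user/sujx/output_single below is just an assumed name and must not already exist):

# Hypothetical variant: force a single reducer so only one part-r-00000 is produced
[sujx@elephant ~]$ hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount -D mapreduce.job.reduces=1 /user/sujx/input/shakespeare.txt /user/sujx/output_single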

# List the output files
[sujx@elephant ~]$ hdfs dfs -ls /user/sujx/output
Found 4 items
-rw-r--r--   3 sujx supergroup          0 2020-03-09 02:47 /user/sujx/output/_SUCCESS
-rw-r--r--   3 sujx supergroup     238211 2020-03-09 02:47 /user/sujx/output/part-r-00000
-rw-r--r--   3 sujx supergroup     236617 2020-03-09 02:47 /user/sujx/output/part-r-00001
-rw-r--r--   3 sujx supergroup     238668 2020-03-09 02:47 /user/sujx/output/part-r-00002

# View the tail of one of the output files
[sujx@elephant ~]$ hdfs dfs -tail /user/sujx/output/part-r-00000
.       3
writhled        1
writing,        4
writings.       1
writs   1
written,        3
wrong   112
wrong'd-        1
wrong-should    1
wrong.  39
wrong:  1
wronged 11
wronged.        3
wronger,        1
wronger;        1
wrongfully?     1
wrongs  40
wrongs, 9
wrongs; 9
wrote?  1
wrought,        4
…………
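
The three part files can also be pulled down and concatenated into a single local file with -getmerge (the local filename wordcount_full.txt is just an example):

# Merge all part files of the job output into one local file
[sujx@elephant ~]$ hdfs dfs -getmerge /user/sujx/output wordcount_full.txt
[sujx@elephant ~]$ wc -l wordcount_full.txt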
Getting Started with HDFS

HDFS (Hadoop Distributed File System) is a scalable, fault-tolerant, high-performance distributed file system. It replicates data asynchronously, follows a write-once, read-many model, and its main role is storage. Its concepts and background are covered in [1], and more Hadoop commands are listed in [2]. Here we run a simple experiment to look at its file-management features.
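
As a quick reference, the file operations used below all follow the POSIX-like hdfs dfs syntax; a few common forms (the paths here are only placeholders):

hdfs dfs -ls /user/sujx              # list a directory
hdfs dfs -put local.txt /user/sujx   # upload a local file
hdfs dfs -get /user/sujx/local.txt . # download into the current directory
hdfs dfs -du -h /user/sujx           # show space used, human readable
hdfs dfs -rm -r /user/sujx/old       # delete a directory recursively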

Creating a User

In the lab environment it is not a good idea to log in and work as root, so we first create a regular account.

# Run on the elephant host
# Install ansible
[root@elephant ~]# yum install -y ansible
# Add all the cluster hostnames to /etc/ansible/hosts
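# For example, /etc/ansible/hosts could contain a group whose name matches the
# playbook's "hosts: hadoop" line (only elephant and lion are named in this
# article; the remaining hostnames depend on your cluster):
#   [hadoop]
#   elephant
#   lion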
# Create the ansible playbook
[root@elephant ~]# mkdir playbook
[root@elephant ~]# vim ./playbook/useradd.yaml
---
- hosts: hadoop
  remote_user: root
  vars_prompt:
    - name: user_name
      prompt: Enter Username
      private: no
    - name: user_passwd
      prompt: Enter Password
      encrypt: "sha512_crypt"
      confirm: yes
  tasks:
    - name: create user
      user:
        name: "{{user_name}}"
        password: "{{user_passwd}}"
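
Before creating the account everywhere, the playbook can be validated first; both flags are standard ansible-playbook options:

# Check the YAML syntax, then do a dry run without changing the hosts
[root@elephant ~]# ansible-playbook ./playbook/useradd.yaml --syntax-check
[root@elephant ~]# ansible-playbook ./playbook/useradd.yaml --check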

# Run the playbook
[root@elephant ~]# ansible-playbook ./playbook/useradd.yaml
Enter Username: sujx
Enter Password:
confirm Enter Password:

PLAY [hadoop] *****************************************************************************
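
A quick ad-hoc check with ansible's command module confirms that the account now exists on every node:

# Verify the new account on all hosts in the hadoop group
[root@elephant ~]# ansible hadoop -m command -a "id sujx"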

Working with Files in HDFS

First we upload a prepared 481 MB file (access_log) to the user's home directory and look at how HDFS stores it.

[root@elephant ~]# su hdfs
[hdfs@elephant root]$ hadoop fs -mkdir /user/sujx
[hdfs@elephant root]$ hadoop fs -chown sujx /user/sujx
[hdfs@elephant root]$ su - sujx
[sujx@elephant ~]$ hdfs dfs -put access_log weblog

Then, in the HDFS web console, we can see how the file is stored: it is kept as 3 replicas and split into 128 MB blocks spread across the datanodes. Since 481 MB / 128 MB ≈ 3.8, the file occupies 4 blocks.
[Screenshots: HDFS web console showing the uploaded file and its block list]
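
The same block information is available from the command line with hdfs fsck, run as a user that can read the file (typically the hdfs superuser); /user/sujx/weblog is the path the file was uploaded to above:

# Show the files, blocks and replica locations of the uploaded file
[hdfs@elephant root]$ hdfs fsck /user/sujx/weblog -files -blocks -locations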

We can also find the stored block files on the lion host.

[root@lion ~]# tree /dfs
/dfs
`-- dn
    |-- current
    |   |-- BP-752680285-192.168.174.131-1582986010714
    |   |   |-- current
    |   |   |   |-- VERSION
    |   |   |   |-- dfsUsed
    |   |   |   |-- finalized
    |   |   |   |   `-- subdir0
    |   |   |   |       |-- subdir0
    |   |   |   |       |-- subdir1
    |   |   |   |       |-- subdir2
    |   |   |   |       |-- subdir3
    |   |   |   |       |   |-- blk_1073742734
    |   |   |   |       |   |-- blk_1073742734_1910.meta
    |   |   |   |       `-- subdir4
    |   |   |   `-- rbw
    |   |   |-- scanner.cursor
    |   |   `-- tmp
    |   `-- VERSION
    `-- in_use.lock
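
Going the other way, a block file found on a datanode's disk can be traced back to the HDFS file it belongs to with fsck's -blockId option (assuming the Hadoop version shipped with CDH supports it; the block ID is the one visible in the tree above):

# Look up which HDFS file this block belongs to and where its replicas live
[hdfs@elephant root]$ hdfs fsck -blockId blk_1073742734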