
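Hadoop 2.6.0 Study Notes (2): Accessing HDFS

Programming Diary, 2025-01-19 17:00:00

Work notes by 鲁春利 (Lu Chunli). Who says programmers can't have a literary streak?

Accessing HDFS through the Hadoop shell and the Java API

The previous work note, on setting up a Hadoop 2.6 cluster, got the cluster environment running. What follows are some HDFS operations.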

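1. Accessing HDFS from the shell

HDFS is designed mainly for processing massive amounts of data, which means it stores large numbers of files. HDFS splits these files into blocks and stores the blocks on different DataNodes. It also provides a shell interface that hides the internal details of block storage; all Hadoop operations are launched through the bin/hadoop script.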
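To make the splitting a bit more concrete, here is a small sketch of the arithmetic involved. It assumes a block size of 128 MB, which is the dfs.blocksize default in Hadoop 2.x (a cluster may be configured differently). BlockLayout and its methods are names I made up for illustration; they are not part of the Hadoop API.

```java
/** Illustrative only: BlockLayout is a made-up helper, not a Hadoop class. */
class BlockLayout {
    // Assumed block size: 128 MB, the dfs.blocksize default in Hadoop 2.x.
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    /** Number of blocks needed to hold fileLength bytes. */
    static long blockCount(long fileLength, long blockSize) {
        if (fileLength == 0) {
            return 0; // an empty file occupies no blocks
        }
        return (fileLength + blockSize - 1) / blockSize; // ceiling division
    }

    /** Size in bytes of the final, possibly partial, block. */
    static long lastBlockSize(long fileLength, long blockSize) {
        if (fileLength == 0) {
            return 0;
        }
        long remainder = fileLength % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long length = 300L * 1024 * 1024; // a hypothetical 300 MB file
        System.out.println(blockCount(length, DEFAULT_BLOCK_SIZE) + " blocks, last block holds "
                + lastBlockSize(length, DEFAULT_BLOCK_SIZE) + " bytes");
    }
}
```

Under these assumptions a 300 MB file needs three blocks, and the last one holds only the remaining 44 MB rather than a full 128 MB.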

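Running the hadoop command without any arguments prints a description of every subcommand. The file-related HDFS operations live under hadoop fs (the other subcommands of the hadoop script are not covered here).

[hadoop@nnode ~]$ hadoop fs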

Usage: hadoop fs [generic options]

[-appendToFile ... ]

[-cat [-ignoreCrc] ...]

[-checksum ...]

[-chgrp [-R] GROUP PATH...]

[-chmod [-R] PATH...]

[-chown [-R] [OWNER][:[GROUP]] PATH...]

[-copyFromLocal [-f] [-p] [-l] ... ]

[-copyToLocal [-p] [-ignoreCrc] [-crc] ... ]

[-count [-q] [-h] ...]

[-cp [-f] [-p | -p[topax]] ... ]

[-createSnapshot []]

[-deleteSnapshot ]

[-df [-h] [ ...]]

[-du [-s] [-h] ...]

[-expunge]

[-get [-p] [-ignoreCrc] [-crc] ... ]

[-getfacl [-R] ]

[-getfattr [-R] {-n name | -d} [-e en] ]

[-getmerge [-nl] ]

[-help [cmd ...]]

[-ls [-d] [-h] [-R] [ ...]]

[-mkdir [-p] ...]

[-moveFromLocal ... ]

[-moveToLocal ]

[-mv ... ]

[-put [-f] [-p] [-l] ... ]

[-renameSnapshot ]

[-rm [-f] [-r|-R] [-skipTrash] ...]

[-rmdir [--ignore-fail-on-non-empty]

[-setfacl [-R] [{-b|-k} {-m|-x } ]|[--set ]]

[-setfattr {-n name [-v value] | -x name} ]

[-setrep [-R] [-w] ...]

[-stat [format] ...]

[-tail [-f] ]

[-test -[defsz] ]

[-text [-ignoreCrc] ...]

[-touchz ...]

[-usage [cmd ...]]

Generic options supported are

-conf specify an application configuration file

-D use value for given property

-fs specify a namenode

-jt specify a ResourceManager

-files specify comma separated files to be copied to the map reduce cluster

-libjars specify comma separated jar files to include in the classpath.

-archives specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

Hadoop 2.6 flags hadoop fs as "Deprecated, use hdfs dfs instead." (I have not worked with versions before 2.6, so I did not dig into which release introduced this notice; hadoop fs still works, though.)

[hadoop@nnode ~]$ hdfs dfs

Usage: hadoop fs [generic options]

[-appendToFile ... ]

[-cat [-ignoreCrc] ...]

[-checksum ...]

[-chgrp [-R] GROUP PATH...]

[-chmod [-R] PATH...]

[-chown [-R] [OWNER][:[GROUP]] PATH...]

[-copyFromLocal [-f] [-p] [-l] ... ]

[-copyToLocal [-p] [-ignoreCrc] [-crc] ... ]

[-count [-q] [-h] ...]

[-cp [-f] [-p | -p[topax]] ... ]

[-createSnapshot []]

[-deleteSnapshot ]

[-df [-h] [ ...]]

[-du [-s] [-h] ...]

[-expunge]

[-get [-p] [-ignoreCrc] [-crc] ... ]

[-getfacl [-R] ]

[-getfattr [-R] {-n name | -d} [-e en] ]

[-getmerge [-nl] ]

[-help [cmd ...]]

[-ls [-d] [-h] [-R] [ ...]]

[-mkdir [-p] ...]

[-moveFromLocal ... ]

[-moveToLocal ]

[-mv ... ]

[-put [-f] [-p] [-l] ... ]

[-renameSnapshot ]

[-rm [-f] [-r|-R] [-skipTrash] ...]

[-rmdir [--ignore-fail-on-non-empty]

[-setfacl [-R] [{-b|-k} {-m|-x } ]|[--set ]]

[-setfattr {-n name [-v value] | -x name} ]

[-setrep [-R] [-w] ...]

[-stat [format] ...]

[-tail [-f] ]

[-test -[defsz] ]

[-text [-ignoreCrc] ...]

[-touchz ...]

[-usage [cmd ...]]

Generic options supported are

-conf specify an application configuration file

-D use value for given property

-fs specify a namenode

-jt specify a ResourceManager

-files specify comma separated files to be copied to the map reduce cluster

-libjars specify comma separated jar files to include in the classpath.

-archives specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

For example:

[hadoop@nnode ~]$ hdfs dfs -ls -R /user/hadoop

-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz

-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 /user/hadoop/20130913160307.txt.gz

-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 /user/hadoop/apache-hive-1.2.0-bin.tar.gz

-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 /user/hadoop/httpInterceptor_192.168.1.101_1_20130913160307.txt

-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 /user/hadoop/lucl.gz

-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt

-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 /user/hadoop/scalog.txt

-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 /user/hadoop/scalog.txt.gz

-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 /user/hadoop/test.txt.gz

-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 /user/hadoop/zookeeper.out

[hadoop@nnode ~]$

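# The dot here means the current directory; I am working as the hadoop user, so it is equivalent to /user/hadoop

# By default HDFS uses /user/{hadoop-user} as the home directory, but you can also create your own directories under / with the mkdir command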

[hadoop@nnode ~]$ hdfs dfs -ls -R .

-rw-r--r-- 2 hadoop hadoop 2297 2015-06-29 14:44 20130913152700.txt.gz

-rw-r--r-- 2 hadoop hadoop 211 2015-06-29 14:45 20130913160307.txt.gz

-rw-r--r-- 2 hadoop hadoop 93046447 2015-07-18 18:01 apache-hive-1.2.0-bin.tar.gz

-rw-r--r-- 2 hadoop hadoop 4139112 2015-06-28 22:54 httpInterceptor_192.168.1.101_1_20130913160307.txt

-rw-r--r-- 2 hadoop hadoop 240 2015-05-30 20:54 lucl.gz

-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 lucl.txt

-rw-r--r-- 2 hadoop hadoop 9994248 2015-06-29 14:12 scalog.txt

-rw-r--r-- 2 hadoop hadoop 2664495 2015-06-28 20:54 scalog.txt.gz

-rw-r--r-- 3 hadoop hadoop 28026803 2015-06-24 21:16 test.txt.gz

-rw-r--r-- 2 hadoop hadoop 28551 2015-05-27 23:54 zookeeper.out

[hadoop@nnode ~]$
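The two listings above show that a relative path such as . is resolved against the user's home directory. A rough sketch of that rule, assuming the home directory is always /user/ plus the user name (as it is on my cluster), looks like this. HdfsPathSketch is a hypothetical helper and deliberately ignores details such as .. segments.

```java
/** Illustrative only: a simplified model of how relative HDFS paths are resolved. */
class HdfsPathSketch {
    /** Resolves path for the given user; relative paths hang off /user/<user>. */
    static String resolve(String user, String path) {
        String home = "/user/" + user; // assumed home directory layout
        if (path.startsWith("/")) {
            return path;               // absolute paths are used as they are
        }
        if (path.isEmpty() || path.equals(".")) {
            return home;               // "." is the home directory itself
        }
        return home + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("hadoop", "."));
        System.out.println(resolve("hadoop", "lucl.txt"));
        System.out.println(resolve("hadoop", "/user/hadoop/lucl.txt"));
    }
}
```

The same rule explains why the Java examples later in this post can open lucl.txt without spelling out /user/hadoop.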

If you are unsure about the details of an hdfs command, check its help text:

[hadoop@nnode ~]$ hdfs dfs -help ls

-ls [-d] [-h] [-R] [ ...] :

List the contents that match the specified file pattern. If path is not

specified, the contents of /user/ will be listed. Directory entries are of the form:

permissions - userId groupId sizeOfDirectory(in bytes)

modificationDate(yyyy-MM-dd HH:mm) directoryName

and file entries are of the form:

permissions numberOfReplicas userId groupId sizeOfFile(in bytes)

modificationDate(yyyy-MM-dd HH:mm) fileName

-d Directories are listed as plain files.

-h Formats the sizes of files in a human-readable fashion rather than a number of bytes.

-R Recursively list the contents of directories.

[hadoop@nnode ~]$
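The help text describes the column layout of each -ls entry, which makes the output easy to pick apart in code. The following is only a sketch: LsLine is a class I made up for this note, and it assumes the whitespace-separated layout shown above (directories print - in the replica column).

```java
/** Illustrative only: LsLine is a made-up helper, not part of Hadoop. */
class LsLine {
    final String permissions;
    final int replication;   // 0 for directories, which print "-" in this column
    final String owner;
    final String group;
    final long size;
    final String modified;   // "yyyy-MM-dd HH:mm"
    final String path;

    private LsLine(String[] f) {
        permissions = f[0];
        replication = f[1].equals("-") ? 0 : Integer.parseInt(f[1]);
        owner = f[2];
        group = f[3];
        size = Long.parseLong(f[4]);
        modified = f[5] + " " + f[6];
        path = f[7];
    }

    /** Parses one entry line of "hdfs dfs -ls" output. */
    static LsLine parse(String line) {
        // Limit the split to 8 fields so a path containing spaces stays in one piece.
        String[] f = line.trim().split("\\s+", 8);
        if (f.length != 8) {
            throw new IllegalArgumentException("not an -ls entry: " + line);
        }
        return new LsLine(f);
    }

    public static void main(String[] args) {
        LsLine e = parse("-rw-r--r-- 2 hadoop hadoop 63 2015-05-27 23:55 /user/hadoop/lucl.txt");
        System.out.println(e.path + " has " + e.size + " bytes and " + e.replication + " replicas");
    }
}
```

For anything beyond a quick script, the FileStatus objects returned by the Java API covered in the next section are a sturdier choice than parsing text.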

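2. Accessing HDFS through the Java API

In Hadoop, the DataNodes store the data while the NameNode records where that data is stored. Communication between Hadoop components is built on RPC. The NameNode acts as the RPC server (dfs.namenode.rpc-address gives the host name and port of the RPC endpoint), and the FileSystem class provided by Hadoop is the abstraction for the RPC client side.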
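On a single-NameNode setup, the host and port in an hdfs:// URI are the NameNode's RPC address. As a quick illustration, the standard java.net.URI class can split such a URI into its parts. The URI below is the one used in the Java examples in this note; NameNodeAddress is just a name I picked for the sketch.

```java
import java.net.URI;

/** Illustrative only: splits an hdfs:// URI into scheme, host, port and path. */
class NameNodeAddress {
    static String describe(String hdfsUri) {
        URI uri = URI.create(hdfsUri);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort() + " | " + uri.getPath();
    }

    public static void main(String[] args) {
        // scheme | NameNode host | NameNode RPC port | path inside HDFS
        System.out.println(describe("hdfs://nnode:8020/user/hadoop/lucl.txt"));
    }
}
```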

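a.) Reading HDFS data through java.net.URL

For a Java program to recognize Hadoop's hdfs URLs, a handler factory has to be registered through URL.setURLStreamHandlerFactory(...).

This method can be called only once per Java virtual machine, so it is usually invoked in a static initializer block.

package com.invic.hdfs;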

import java.io.IOException;

import java.io.InputStream;

import java.io.OutputStream;

import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

import org.apache.hadoop.io.IOUtils;

/**
 * @author lucl
 * Reads specific data on HDFS through the Java API.
 */
public class MyHdfsOfJavaApi {
    static {
        /*
         * For Java to recognize Hadoop's hdfs URLs, an extra URLStreamHandlerFactory has to be registered.
         * The JVM allows this call only once; if other code has already registered a factory,
         * this program will not be able to read data from Hadoop.
         */
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws IOException {
        String path = "hdfs://nnode:8020/user/hadoop/lucl.txt";
        InputStream in = new URL(path).openStream();
        OutputStream ou = System.out;
        int buffer = 4096;
        boolean close = false; // keep System.out open after copying
        try {
            IOUtils.copyBytes(in, ou, buffer, close);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
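The once-per-JVM restriction is easy to see with the JDK alone. In the sketch below, a do-nothing factory stands in for Hadoop's FsUrlStreamHandlerFactory so that the example runs without any Hadoop jars; FactoryOnce and install() are names made up for this note.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

/** Illustrative only: shows that a second factory registration is rejected. */
class FactoryOnce {
    /** Tries to register a factory; returns false if the JVM already has one. */
    static boolean install() {
        // Stand-in factory: returning null lets the JDK fall back to its built-in handlers.
        URLStreamHandlerFactory noop = protocol -> null;
        try {
            URL.setURLStreamHandlerFactory(noop);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals "factory already defined" with java.lang.Error.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("first call installed:  " + install());
        System.out.println("second call installed: " + install());
    }
}
```

On a fresh JVM the first call should succeed and the second should be rejected. This is why a library or container that registers its own factory first can leave the static block above unable to register Hadoop's.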

b.) Accessing HDFS through Hadoop's FileSystem

Hadoop has an abstract notion of a file system, and HDFS is just one implementation of it. The Java abstract class org.apache.hadoop.fs.FileSystem defines the file system interface in Hadoop.

java.lang.Object

org.apache.hadoop.conf.Configured

org.apache.hadoop.fs.FileSystem

|--org.apache.hadoop.fs.FilterFileSystem

|----org.apache.hadoop.fs.ChecksumFileSystem

|------org.apache.hadoop.fs.LocalFileSystem

|--org.apache.hadoop.fs.ftp.FTPFileSystem

|--org.apache.hadoop.fs.s3native.NativeS3FileSystem

|--org.apache.hadoop.fs.RawLocalFileSystem

|--org.apache.hadoop.fs.viewfs.ViewFileSystem

package com.invic.hdfs;

import java.io.IOException;

import java.io.OutputStream;

import java.net.URI;

import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FSDataInputStream;

import org.apache.hadoop.fs.FSDataOutputStream;

import org.apache.hadoop.fs.FileStatus;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.FileUtil;

import org.apache.hadoop.fs.LocatedFileStatus;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.fs.PathFilter;

import org.apache.hadoop.fs.RemoteIterator;

import org.apache.hadoop.io.IOUtils;

import org.apache.hadoop.util.Progressable;

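/**
 * @author lucl
 * Access through the FileSystem API:
 * FileSystem get(Configuration) picks the file system from the core-site.xml found on the classpath; defaults to the local file system
 * FileSystem get(URI, Configuration) picks the file system from the URI
 * FileSystem get(URI, Configuration, user) accesses the file system as the given user, which matters a great deal for security
 */
public class MyHdfsOfFS {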

private static String HOST = "hdfs://nnode";

private static String PORT = "8020";

private static String NAMENODE = HOST + ":" + PORT;

public static void main(String[] args) throws IOException {

Configuration conf = new Configuration();

String path = NAMENODE + "/user/";

        /*
         * The URI points at Hadoop's user directory; by default files are looked up
         * under the user's home directory on HDFS.
         */
        String user = "hadoop";

FileSystem fs = null;

try {

fs = FileSystem.get(URI.create(path), conf, user);

} catch (InterruptedException e) {

e.printStackTrace();

}

if (null == fs) {

return;

}

        /*
         * Create directories recursively
         */
        boolean mkdirs = fs.mkdirs(new Path("invic/test/mvtech"));
        if (mkdirs) {
            System.out.println("Dir 'invic/test/mvtech' created successfully.");
        }

        /*
         * Check whether the directory exists
         */
        boolean exists = fs.exists(new Path("invic/test/mvtech"));
        if (exists) {
            System.out.println("Dir 'invic/test/mvtech' exists.");
        }

        /*
         * FSDataInputStream supports random access.
         * Without an explicit user, lucl.txt would be looked up at /user/Administrator/lucl.txt,
         * because I run this from Eclipse on Windows.
         * Since the get call above names a user, the path becomes /user/<that user>/lucl.txt.
         */
        FSDataInputStream in = fs.open(new Path("lucl.txt"));
        OutputStream os = System.out;
        int buffSize = 4096;
        boolean close = false;
        IOUtils.copyBytes(in, os, buffSize, close);
        System.out.println("\r\nSeeking back to the start and reading the file again...");
        in.seek(0);
        IOUtils.copyBytes(in, os, buffSize, close);
        IOUtils.closeStream(in);

        /*
         * Create a file
         */
        FSDataOutputStream create = fs.create(new Path("sample.txt"));
        create.write("This is my first sample file.".getBytes());
        create.flush();
        create.close();

        /*
         * Copy a file from the local disk
         */
        fs.copyFromLocalFile(new Path("F:\\Mvtech\\ftpfile\\cg-10086.com.csv"),
                new Path("cg-10086.com.csv"));

        /*
         * Append to a file.
         * Note: writeChars writes every char as two bytes, unlike write(getBytes()) above.
         */
        FSDataOutputStream append = fs.append(new Path("sample.txt"));
        append.writeChars("\r\n");
        append.writeChars("New day, new World.");
        append.writeChars("\r\n");
        IOUtils.closeStream(append);

        /*
         * Using a Progressable callback
         */
        FSDataOutputStream progress = fs.create(new Path("progress.txt"),
                new Progressable() {
                    @Override
                    public void progress() {
                        System.out.println("write is in progress......");
                    }
                });

        // Write keyboard input to HDFS until the user types quit
        Scanner sc = new Scanner(System.in);
        System.out.print("Please type your input: ");
        String name = sc.nextLine();
        while (!"quit".equals(name)) {
            // Skip blank lines, but always read the next line so the loop cannot spin forever
            if (!name.trim().isEmpty()) {
                progress.writeChars(name);
            }
            System.out.print("Please type your input: ");
            name = sc.nextLine();
        }
        // Close the stream so the data is actually written out to HDFS
        IOUtils.closeStream(progress);

        /*
         * List files recursively
         */
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(path), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            System.out.println(loc.getPath().getName() + "|" + loc.getLen() + "|"
                    + loc.getOwner());
        }

        /*
         * File or directory metadata: file length, block size, replication,
         * modification time, owner and permissions
         */
        FileStatus status = fs.getFileStatus(new Path("lucl.txt"));
        System.out.println(status.getPath().getName() + "|"
                + status.getPath().getParent().getName() + "|" + status.getBlockSize() + "|"
                + status.getReplication() + "|" + status.getOwner());

        /*
         * listStatus lists the files in a directory; if the argument is a file,
         * it returns an array holding a single FileStatus
         */
        fs.listStatus(new Path(path));
        fs.listStatus(new Path(path), new PathFilter() {
            @Override
            public boolean accept(Path tmpPath) {
                // Keep only entries whose name ends with .txt
                return tmpPath.getName().endsWith(".txt");
            }
        });

        // An array of paths can be passed in; the results are merged into one array
        // fs.listStatus(Path[] files);
        FileStatus[] mergeStatus = fs.listStatus(new Path[] {new Path("lucl.txt"),
                new Path("progress.txt"), new Path("sample.txt")});
        Path[] listPaths = FileUtil.stat2Paths(mergeStatus);
        for (Path p : listPaths) {
            System.out.println(p);
        }

        /*
         * File pattern matching (glob)
         */
        FileStatus[] patternStatus = fs.globStatus(new Path("*.txt"));
        for (FileStatus stat : patternStatus) {
            System.out.println(stat.getPath());
        }

        /*
         * Delete data
         */
        boolean recursive = true;
        fs.delete(new Path("demo.txt"), recursive);

        fs.close();
    }
}
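One detail in the listing above deserves a closer look. FSDataOutputStream extends java.io.DataOutputStream, so write(byte[]) and writeChars(String) produce different bytes for the same text. The sketch below uses a plain DataOutputStream over an in-memory buffer as a stand-in for the HDFS stream; WriteCharsDemo is a name made up for this note.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Illustrative only: compares the bytes written by writeChars and by write(byte[]). */
class WriteCharsDemo {
    static byte[] viaWriteChars(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeChars(s); // two bytes per char, high byte first
        }
        return buf.toByteArray();
    }

    static byte[] viaWriteBytes(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.write(s.getBytes(StandardCharsets.UTF_8)); // one byte per ASCII char
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "New day";
        System.out.println("writeChars: " + viaWriteChars(text).length + " bytes");
        System.out.println("write:      " + viaWriteBytes(text).length + " bytes");
    }
}
```

For a file that is later read with hdfs dfs -cat or other text tools, write(s.getBytes(...)) with an explicit charset is probably the safer choice, since writeChars pads every ASCII character with a zero byte.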

c.) Accessing an HDFS cluster

package com.invic.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.LocatedFileStatus;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.fs.RemoteIterator;

import org.apache.log4j.Logger;

/**
 * @author lucl
 * Accesses HDFS by connecting to the Hadoop cluster.
 */
public class MyClusterHdfs {

public static void main(String[] args) throws IOException {

System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");

Logger logger = Logger.getLogger(MyClusterHdfs.class);

Configuration conf = new Configuration();

conf.set("fs.defaultFS", "hdfs://cluster");

conf.set("dfs.nameservices", "cluster");

conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");

conf.set("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");

conf.set("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");

conf.set("dfs.client.failover.proxy.provider.cluster",

"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

FileSystem fs = FileSystem.get(conf);

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            logger.info(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
        }

/*for (int i = 0; i

String str = "the sequence is " + i;

logger.info(str);

}*/

try {

Thread.sleep(10);

} catch (InterruptedException e) {

e.printStackTrace();

}

System.exit(0);

    }
}
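In this listing hdfs://cluster names a logical nameservice rather than a host, and the client finds the real NameNode addresses by following the other keys. The sketch below models only that key lookup, with a plain Map standing in for Hadoop's Configuration. HaAddressSketch is a hypothetical helper, not the code of the failover proxy provider, and it leaves out the failover logic entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: follows the HA keys to list the NameNode RPC addresses. */
class HaAddressSketch {
    /** Reads dfs.ha.namenodes.<ns>, then dfs.namenode.rpc-address.<ns>.<id> for each id. */
    static List<String> rpcAddresses(Map<String, String> conf, String nameservice) {
        List<String> result = new ArrayList<>();
        String ids = conf.get("dfs.ha.namenodes." + nameservice);
        if (ids == null) {
            return result; // not an HA nameservice in this configuration
        }
        for (String id : ids.split(",")) {
            String address = conf.get("dfs.namenode.rpc-address." + nameservice + "." + id.trim());
            if (address != null) {
                result.add(address);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same values as in the listing above
        Map<String, String> conf = new HashMap<>();
        conf.put("dfs.nameservices", "cluster");
        conf.put("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.put("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
        conf.put("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
        System.out.println(rpcAddresses(conf, "cluster"));
    }
}
```

The key names must match the nameservice and NameNode ids exactly; a typo in any of them leaves the client with no address to try.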

Note: System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");

# Set the Hadoop home path on the first line of the main method; otherwise, on Windows, errors like the following may appear:

15/07/19 22:05:54 DEBUG util.Shell: Failed to detect a valid hadoop home directory

java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)

at org.apache.hadoop.util.Shell.(Shell.java:327)

at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)

at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)

at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:170)

at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)

at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)

15/07/19 22:05:54 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)

at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)

at org.apache.hadoop.util.Shell.(Shell.java:363)

at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)

at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)

at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:170)

at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)

at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)
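The odd path null\bin\winutils.exe in the second error is what you get when the Hadoop home directory is unset and the path is still assembled from it. The sketch below is my simplified reading of that behaviour, not Hadoop's actual Shell code: it assumes the hadoop.home.dir system property is consulted first and the HADOOP_HOME environment variable second, as the first error message suggests. HadoopHomeSketch is a made-up name.

```java
/** Illustrative only: shows how an unset home directory turns into "null\bin\winutils.exe". */
class HadoopHomeSketch {
    /** Builds the winutils path from the system property value, falling back to the environment value. */
    static String winutilsPath(String propertyValue, String envValue) {
        String home = propertyValue != null ? propertyValue : envValue;
        // String concatenation renders a null reference as the text "null".
        return home + "\\bin\\winutils.exe";
    }

    public static void main(String[] args) {
        // Neither hadoop.home.dir nor HADOOP_HOME is set
        System.out.println(winutilsPath(null, null));
        // hadoop.home.dir set, as in the System.setProperty call above (path shortened for the example)
        System.out.println(winutilsPath("E:\\Hadoop\\hadoop-2.6.0", null));
    }
}
```

Setting either the property or the environment variable, and placing winutils.exe under its bin directory, should make both errors go away.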
