
Environment
  VM: VMware 10
  Linux: CentOS-6.5-x86_64
  Client: Xshell 4
  FTP: Xftp 4
  JDK 1.8
  Scala 2.10.4 (requires JDK 1.8)
  Spark 1.6

1. Creating a DataFrame by reading a JSON file
Notes:
(1) The JSON data in the file must not contain nested JSON.
(2) A DataFrame is underneath an RDD of Row objects; convert with df.rdd() / df.javaRDD().
(3) A JSON file can be read in either of two ways:
sqlContext.read().format("json").load("path")
sqlContext.read().json("path")
(4) df.show() displays the first 20 rows by default.
(5) The DataFrame's native API can operate on the DataFrame directly (less convenient than SQL).
(6) When a DataFrame is registered as a temp table, its columns are displayed in ASCII order by default.

Data (json):
{"name":"zhangsan","age":"20"}
{"name":"lisi"}
{"name":"wangwu","age":"18"}

Sample code:
Java:

package com.wjy.df;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * Create a DataFrame by reading a JSON file.
 *
 * Note: the JSON file must not contain nested JSON.
 *
 * 1. There are two ways to read a JSON file.
 * 2. df.show() displays the first 20 rows by default; use df.show(n) for more.
 * 3. df.javaRDD() (Scala: df.rdd) converts a DataFrame to an RDD.
 * 4. df.printSchema() prints the DataFrame's schema.
 * 5. The DataFrame's own API can operate on it directly, but is rarely used.
 * 6. To query with SQL, first register the DataFrame as a temp table,
 *    df.registerTempTable("jtable"), then call sqlContext.sql("<sql>").
 * 7. Nested JSON files cannot be read.
 * 8. Once loaded, the columns are sorted in ASCII order.
 */
public class CreateDFFromJsonFile {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromJsonFile");
        SparkContext sc = new SparkContext(conf); // note: SparkContext here, not JavaSparkContext
        SQLContext sqlContext = new SQLContext(sc);

        /*
         * A DataFrame is backed by an RDD whose element type is Row.
         * Either of the following reads a JSON file such as:
         * {"name":"zhangsan","age":"20"}
         * {"name":"lisi"}
         * {"name":"wangwu","age":"18"}
         */
        DataFrame df = sqlContext.read().format("json").load("./data/json");
        // show() displays the first 20 rows by default (use show(n) for more).
        // With multiple columns, they are displayed in ASCII order.
        df.show();

        DataFrame df2 = sqlContext.read().json("./data/json");
        df2.show();
        /*
         * +----+--------+
         * | age|    name|
         * +----+--------+
         * |  20|zhangsan|
         * |null|    lisi|
         * |  18|  wangwu|
         * +----+--------+
         */

        // Convert the DataFrame to an RDD
        JavaRDD<Row> javaRDD = df.javaRDD();

        // Print the schema as a tree
        df.printSchema();
        /*
         * root
         *  |-- age: string (nullable = true)
         *  |-- name: string (nullable = true)
         */

        // The DataFrame's own API; clumsy, and rarely used.
        // SELECT name FROM table
        df.select("name").show();
        /*
         * +--------+
         * |    name|
         * +--------+
         * |zhangsan|
         * |    lisi|
         * |  wangwu|
         * +--------+
         */

        // SELECT name, age+10 AS addage FROM table
        df.select(df.col("name"), df.col("age").plus(10).alias("addage")).show();
        /*
         * +--------+------+
         * |    name|addage|
         * +--------+------+
         * |zhangsan|  30.0|
         * |    lisi|  null|
         * |  wangwu|  28.0|
         * +--------+------+
         */

        // SELECT name, age FROM table WHERE age > 19
        df.select(df.col("name"), df.col("age")).where(df.col("age").gt(19)).show();
        /*
         * +--------+---+
         * |    name|age|
         * +--------+---+
         * |zhangsan| 20|
         * +--------+---+
         */

        // SELECT age, count(*) FROM table GROUP BY age
        df.groupBy(df.col("age")).count().show();
        /*
         * +----+-----+
         * | age|count|
         * +----+-----+
         * |  18|    1|
         * |  20|    1|
         * |null|    1|
         * +----+-----+
         */

        // Register the DataFrame as a temporary table. The table is registered in
        // memory as a logical table; it is never materialized to disk. This is the
        // most common way to work with a DataFrame.
        df.registerTempTable("person");
        DataFrame df3 = sqlContext.sql("select age, count(*) as gg from person group by age");
        df3.show();
        DataFrame df4 = sqlContext.sql("select age, name from person");
        df4.show();

        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object CreateDFFromJsonFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromJsonFile")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df1 = sqlContext.read.json("./data/json")
    df1.show()
    val df2 = sqlContext.read.format("json").load("./data/json")
    df2.show()

    val rdd = df1.rdd
    df1.printSchema()

    // select name from table
    df1.select(df1.col("name")).show()
    // select name, age from table where age > 19
    df1.select(df1.col("name"), df1.col("age")).where(df1.col("age").gt(19)).show()
    // select age, count(*) from table group by age
    df1.groupBy(df1.col("age")).count().show()

    // Register a temporary table
    df1.registerTempTable("person")
    val df3 = sqlContext.sql("select * from person")
    df3.show()
    /*
     * +----+--------+
     * | age|    name|
     * +----+--------+
     * |  20|zhangsan|
     * |null|    lisi|
     * |  18|  wangwu|
     * +----+--------+
     */
    sc.stop()
  }
}

 


2. Creating a DataFrame from an RDD of JSON strings
The RDD's element type is String, and each element must be a valid JSON document.
Sample code:
Java:

package com.wjy.df;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CreateDFFromJsonRDD {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromJsonRDD");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<String> javaRDD1 = sc.parallelize(Arrays.asList(
                "{\"name\":\"zhangsan\",\"age\":\"18\"}",
                "{\"name\":\"lisi\",\"age\":\"19\"}",
                "{\"name\":\"wangwu\",\"age\":\"20\"}"));
        JavaRDD<String> javaRDD2 = sc.parallelize(Arrays.asList(
                "{\"name\":\"zhangsan\",\"score\":\"100\"}",
                "{\"name\":\"lisi\",\"score\":\"200\"}",
                "{\"name\":\"wangwu\",\"score\":\"300\"}"));

        DataFrame namedf = sqlContext.read().json(javaRDD1);
        namedf.show();
        DataFrame scoredf = sqlContext.read().json(javaRDD2);
        scoredf.show();

        // Using the DataFrame's own API:
        // SELECT t1.name, t1.age, t2.score FROM t1, t2 WHERE t1.name = t2.name
        namedf.join(scoredf, namedf.col("name").$eq$eq$eq(scoredf.col("name")))
              .select(namedf.col("name"), namedf.col("age"), scoredf.col("score"))
              .show();

        // Register temp tables
        namedf.registerTempTable("name");
        scoredf.registerTempTable("score");
        // With a hand-written SQL query, the result columns come back
        // in the order the fields were selected.
        DataFrame result = sqlContext.sql("select name.name,name.age,score.score "
                + "from name join score "
                + "on name.name = score.name");
        result.show();
        /*
         * +--------+---+-----+
         * |    name|age|score|
         * +--------+---+-----+
         * |zhangsan| 18|  100|
         * |    lisi| 19|  200|
         * |  wangwu| 20|  300|
         * +--------+---+-----+
         */
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object CreateDFFromJsonRDD {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromJsonRDD")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val rdd1 = sc.makeRDD(Array(
      "{\"name\":\"zhangsan\",\"age\":18}",
      "{\"name\":\"lisi\",\"age\":19}",
      "{\"name\":\"wangwu\",\"age\":20}"))
    val rdd2 = sc.makeRDD(Array(
      "{\"name\":\"zhangsan\",\"score\":100}",
      "{\"name\":\"lisi\",\"score\":200}",
      "{\"name\":\"wangwu\",\"score\":300}"))

    val namedf = sqlContext.read.json(rdd1)
    val scoredf = sqlContext.read.json(rdd2)

    namedf.registerTempTable("name")
    scoredf.registerTempTable("score")

    val result = sqlContext.sql(
      "select name.name,name.age,score.score from name,score where name.name = score.name")
    result.show()
    sc.stop()
  }
}

 

 

3. Creating a DataFrame from a non-JSON RDD
(1) By reflection (not recommended)
(1.1) The custom class must be serializable (note: fields marked transient are not serialized, and neither are static fields).
Note the following cases in Java where data is not serialized:
  1.1.1 If the serialVersionUID does not match at deserialization time, deserialization fails.
  1.1.2 If a subclass implements Serializable but its superclass does not, the superclass's fields are not serialized; after deserialization they come back as their defaults (null for references).
  Note: if the superclass implements Serializable and the subclass does not, the subclass still serializes normally.
  1.1.3 Fields marked transient are not serialized.
  1.1.4 Static fields belong to the class, not to a method or an instance, so they are not serialized.
Also: when the same object is written to a file with writeObject more than once, the later writes store only a reference back to the first copy; every read returns the object as it was first written.
(1.2) The custom class must be public.
(1.3) After the RDD is converted to a DataFrame, the mapped fields are sorted in ASCII order.
(1.4) When converting the DataFrame back to an RDD, a Row's fields can be read in two ways: by index, e.g. row.getInt(0) (not recommended), or by name, row.getAs("colName") (recommended).
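The serialization rules above (the transient and static rules, and the repeated-writeObject note) can be checked with plain JDK serialization, independent of Spark. The Demo class below is a minimal hypothetical example, not part of the Spark code:

```java
import java.io.*;

// Hypothetical demo class: only plain instance fields survive serialization.
class Demo implements Serializable {
    private static final long serialVersionUID = 1L;
    static String staticField = "initial";      // static: belongs to the class, never serialized
    transient String transientField = "skip";   // transient: skipped by serialization
    String normalField;                         // serialized normally

    Demo(String v) { normalField = v; }
}

public class SerializationRules {
    // Serialize an object, mutate it, write it again, then read both copies back.
    static Demo[] roundTrip() throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Demo d = new Demo("first");
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(d);
            d.normalField = "second";
            out.writeObject(d); // same object: only a back-reference is written
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return new Demo[] { (Demo) in.readObject(), (Demo) in.readObject() };
        }
    }

    public static void main(String[] args) throws Exception {
        Demo[] read = roundTrip();
        System.out.println(read[0].normalField);    // "first"
        System.out.println(read[0].transientField); // null: transient field was not written
        System.out.println(Demo.staticField);       // "initial": static state is class-level
        // Repeated writeObject: both reads yield the SAME object,
        // with the state from the FIRST write.
        System.out.println(read[0] == read[1]);     // true
        System.out.println(read[1].normalField);    // "first"
    }
}
```

The transient field comes back null, the static field is untouched by deserialization, and the second writeObject stored only a back-reference to the first copy, so even the mutated object reads back with its first-written state.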
Sample code:
Java:

package com.wjy.df;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

/**
 * Convert a non-JSON RDD to a DataFrame by reflection.
 * Note: this approach is not recommended.
 */
public class CreateDFFromRDDWithReflect {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromRDDWithReflect");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        /*
         * 1,zhansan,18
         * 2,lisi,19
         * 3,wangwu,20
         */
        JavaRDD<String> lineRDD = sc.textFile("./data/person.txt");
        JavaRDD<Person> personRDD = lineRDD.map(new Function<String, Person>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Person call(String line) throws Exception {
                String[] ss = line.split(",");
                Person p = new Person();
                p.setId(ss[0]);
                p.setName(ss[1]);
                p.setAge(Integer.valueOf(ss[2]));
                return p;
            }
        });

        /*
         * Passing Person.class lets sqlContext build the DataFrame by reflection:
         * under the hood it reflects over all of Person's fields and combines
         * them with the RDD itself to produce the DataFrame.
         */
        DataFrame df1 = sqlContext.createDataFrame(personRDD, Person.class);
        df1.show();
        df1.printSchema();

        df1.registerTempTable("person");
        DataFrame ret = sqlContext.sql("select name,id,age from person where id = 2");
        ret.show();
        /*
         * +----+---+---+
         * |name| id|age|
         * +----+---+---+
         * |lisi|  2| 19|
         * +----+---+---+
         */

        /*
         * Convert the DataFrame back to a JavaRDD.
         * Note:
         * 1. row.getInt(0), row.getString(1), ... fetch values of a Row by index,
         *    but you must watch the column order -- rarely used.
         * 2. row.getAs("colName") fetches a value by column name -- preferred.
         */
        JavaRDD<Row> javaRDD = ret.javaRDD();
        JavaRDD<Person> map = javaRDD.map(new Function<Row, Person>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Person call(Row row) throws Exception {
                // Columns are in the same order as in ret
                Person p = new Person();
                // p.setId(row.getString(1));
                // p.setName(row.getString(0));
                // p.setAge(row.getInt(2));
                p.setId(row.getAs("id"));
                p.setName(row.getAs("name"));
                p.setAge(row.getAs("age"));
                return p;
            }
        });
        map.foreach(new VoidFunction<Person>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Person p) throws Exception {
                System.out.println(p);
            }
        });
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// A case class is serializable by default (it implements Serializable),
// and its constructor parameters are public.
case class Person(id: String, name: String, age: Integer)

object CreateDFFromRDDWithReflect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromRDDWithReflect")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val lineRDD = sc.textFile("./data/person.txt")
    val personRDD = lineRDD.map { x =>
      Person(x.split(",")(0), x.split(",")(1), Integer.valueOf(x.split(",")(2)))
    }

    // Implicitly convert the RDD to a DataFrame
    import sqlContext.implicits._
    val df = personRDD.toDF()
    df.show()
    /*
     * +---+-------+---+
     * | id|   name|age|
     * +---+-------+---+
     * |  1|zhansan| 18|
     * |  2|   lisi| 19|
     * |  3| wangwu| 20|
     * +---+-------+---+
     */

    // Convert the DataFrame back to an RDD
    val rdd = df.rdd
    val result = rdd.map { x =>
      Person(x.getAs("id"), x.getAs("name"), x.getAs("age"))
    }
    result.foreach(println)
    /*
     * Person(1,zhansan,18)
     * Person(2,lisi,19)
     * Person(3,wangwu,20)
     */
    sc.stop()
  }
}

 

(2) Building the schema dynamically
Sample code:
Java:

package com.wjy.df;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Convert a non-JSON RDD to a DataFrame by building the schema dynamically.
 */
public class CreateDFFromRDDWithStruct {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromRDDWithStruct");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<String> lineRDD = sc.textFile("./data/person.txt");
        // Convert to an RDD of Row
        JavaRDD<Row> rowRDD = lineRDD.map(new Function<String, Row>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Row call(String line) throws Exception {
                String[] ss = line.split(",");
                return RowFactory.create(ss[0], ss[1], Integer.valueOf(ss[2]));
            }
        });

        // Build the DataFrame's metadata dynamically. The field definitions can
        // come from a string or from an external database.
        List<StructField> fields = Arrays.asList(
                DataTypes.createStructField("id", DataTypes.StringType, true),
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.IntegerType, true));
        // Create the schema from the metadata
        StructType schema = DataTypes.createStructType(fields);

        // Create the DataFrame from the rows and the schema
        DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
        df.show();
        /*
         * +---+-------+---+
         * | id|   name|age|
         * +---+-------+---+
         * |  1|zhansan| 18|
         * |  2|   lisi| 19|
         * |  3| wangwu| 20|
         * +---+-------+---+
         */
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDFFromRDDWithStruct {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromRDDWithStruct")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val lineRDD = sc.textFile("./data/person.txt")
    // rows
    val rowRDD = lineRDD.map { x =>
      val ss = x.split(",")
      RowFactory.create(ss(0), ss(1), Integer.valueOf(ss(2)))
    }

    // schema
    val schema = StructType(List(
      StructField("id", StringType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))

    // Create the DataFrame from the rows and the schema
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.show()
    sc.stop()
  }
}

 


4. Creating a DataFrame by reading a Parquet file
Notes:
A DataFrame can be saved as a Parquet file in either of two ways:
df.write().mode(SaveMode.Overwrite).format("parquet").save("./sparksql/parquet");
df.write().mode(SaveMode.Overwrite).parquet("./sparksql/parquet");
SaveMode specifies what happens when the target already exists:
  Overwrite: replace the existing data
  Append: append to it
  ErrorIfExists: fail with an error
  Ignore: do nothing


Sample code:
Java:

package com.wjy.df;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class CreateDFFromParquet {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromParquet");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        JavaRDD<String> jsonRDD = sc.textFile("./data/json");
        DataFrame dataFrame = sqlContext.read().json(jsonRDD);
        dataFrame.show();

        /*
         * Save the DataFrame as a Parquet file.
         * SaveMode specifies what happens when the target already exists:
         *     Overwrite: replace the existing data
         *     Append: append to it
         *     ErrorIfExists: fail with an error
         *     Ignore: do nothing
         * There are two ways to save:
         */
        // Option 1: save
        dataFrame.write().mode(SaveMode.Overwrite).format("parquet").save("./data/parquet");
        // Option 2: parquet
        dataFrame.write().mode(SaveMode.Ignore).parquet("./data/parquet");
        /*
         * Initialized Parquet WriteSupport with Catalyst schema:
         * { "type": "struct",
         *   "fields": [
         *     { "name": "age",  "type": "string", "nullable": true, "metadata": {} },
         *     { "name": "name", "type": "string", "nullable": true, "metadata": {} } ] }
         * and corresponding Parquet message type:
         * message spark_schema {
         *   optional binary age (UTF8);
         *   optional binary name (UTF8);
         * }
         */

        /*
         * Load a Parquet file into a DataFrame.
         * There are two ways to load:
         */
        // Option 1: load
        DataFrame df1 = sqlContext.read().format("parquet").load("./data/parquet");
        df1.show();
        // Option 2: parquet
        DataFrame df2 = sqlContext.read().parquet("./data/parquet");
        df2.show();
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CreateDFFromParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromParquet")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val jsonRDD = sc.textFile("./data/json")
    val df = sqlContext.read.json(jsonRDD)
    df.show()

    // Save the DataFrame as a Parquet file
    df.write.mode(SaveMode.Overwrite).format("parquet").save("./data/parquet")
    df.write.mode(SaveMode.Ignore).parquet("./data/parquet")

    // Read the Parquet file back
    val df1 = sqlContext.read.format("parquet").load("./data/parquet")
    df1.show()
    val df2 = sqlContext.read.parquet("./data/parquet")
    df2.show()
    sc.stop()
  }
}

 


5. Creating a DataFrame from JDBC data (MySQL as the example)
There are two ways to load a MySQL table into a DataFrame:
the first passes all connection options to the reader at once;
the second sets the options one by one on a DataFrameReader.
Sample code:
Java:

package com.wjy.df;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

public class CreateDFFromMysql {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("CreateDFFromMysql");
        // Number of partitions used when shuffling data for joins or aggregations
        conf.set("spark.sql.shuffle.partitions", "1");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        /*
         * First way: pass all the connection options at once.
         */
        Map<String, String> options = new HashMap<String, String>();
        options.put("url", "jdbc:mysql://134.32.123.101:3306/spark");
        options.put("driver", "com.mysql.jdbc.Driver");
        options.put("user", "root");
        options.put("password", "123456");
        options.put("dbtable", "person");
        DataFrame df1 = sqlContext.read().format("jdbc").options(options).load();
        df1.show();
        df1.registerTempTable("person1");

        /*
         * Second way: set the options one by one on a DataFrameReader.
         */
        DataFrameReader reader = sqlContext.read().format("jdbc");
        reader.option("url", "jdbc:mysql://134.32.123.101:3306/spark");
        reader.option("driver", "com.mysql.jdbc.Driver");
        reader.option("user", "root");
        reader.option("password", "123456");
        reader.option("dbtable", "score");
        DataFrame df2 = reader.load();
        df2.show();
        df2.registerTempTable("score1");

        DataFrame dataFrame = sqlContext.sql("select person1.id,person1.name,person1.age,score1.score "
                + "from person1,score1 "
                + "where person1.name = score1.name");
        dataFrame.show();

        /*
         * Write the result back to MySQL.
         * SaveMode:
         *     Overwrite: replace
         *     Append: append
         *     ErrorIfExists: fail if it exists
         *     Ignore: do nothing if it exists
         */
        Properties properties = new Properties();
        properties.setProperty("user", "root");
        properties.setProperty("password", "123456");
        dataFrame.write().mode(SaveMode.Overwrite)
                .jdbc("jdbc:mysql://134.32.123.101:3306/spark", "result", properties);
        System.out.println("----Finish----");
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import java.util.HashMap
import java.util.Properties

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode

object CreateDFFromMysql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("CreateDFFromMysql")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // First way: pass all the connection options at once
    val options = new HashMap[String, String]()
    options.put("url", "jdbc:mysql://134.32.123.101:3306/spark")
    options.put("driver", "com.mysql.jdbc.Driver")
    options.put("user", "root")
    options.put("password", "123456")
    options.put("dbtable", "person")
    val df1 = sqlContext.read.format("jdbc").options(options).load()
    df1.show()
    df1.registerTempTable("person")

    // Second way: set the options one by one on a DataFrameReader
    val reader = sqlContext.read.format("jdbc")
    reader.option("url", "jdbc:mysql://134.32.123.101:3306/spark")
    reader.option("driver", "com.mysql.jdbc.Driver")
    reader.option("user", "root")
    reader.option("password", "123456")
    reader.option("dbtable", "score")
    val df2 = reader.load()
    df2.show()
    df2.registerTempTable("score")

    val result = sqlContext.sql(
      "select person.id,person.name,score.score from person,score where person.name = score.name")
    result.show()

    // Write the result to a MySQL table
    val properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    result.write.mode(SaveMode.Overwrite)
      .jdbc("jdbc:mysql://134.32.123.101:3306/spark", "result", properties)
    sc.stop()
  }
}

 


6. Loading Hive data into a DataFrame
HiveContext is a subclass of SQLContext; use HiveContext to connect to Hive.
Since there is no local Hive environment, the job must be submitted to the cluster. Submit command:

./spark-submit \
--master spark://node1:7077,node2:7077 \
--executor-cores 1 \
--executor-memory 2G \
--total-executor-cores 1 \
--class com.bjsxt.sparksql.dataframe.CreateDFFromHive \
/root/test/HiveTest.jar

 

Sample code:
Java:

package com.wjy.df;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

/**
 * To read data from Hive, use HiveContext.
 * HiveContext.sql(sql) can operate on Hive tables as well as on temp tables.
 */
public class CreateDFFromHive {
    public static void main(String[] args) {
        // No setMaster("local") here: the job must be packaged as a jar and run on the cluster
        SparkConf conf = new SparkConf().setAppName("CreateDFFromHive");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // HiveContext is a subclass of SQLContext; it runs Hive SQL
        HiveContext hiveContext = new HiveContext(sc);
        hiveContext.sql("USE spark"); // switch to the spark database

        // Table student_infos
        hiveContext.sql("drop table if exists student_infos");
        hiveContext.sql("CREATE TABLE IF NOT EXISTS student_infos (name STRING, age INT) "
                + "row format delimited fields terminated by '\t'");
        hiveContext.sql("load data local inpath '/root/test/student_infos' into table student_infos");

        // Table student_scores
        hiveContext.sql("DROP TABLE IF EXISTS student_scores");
        hiveContext.sql("CREATE TABLE IF NOT EXISTS student_scores (name STRING, score INT) "
                + "row format delimited fields terminated by '\t'");
        hiveContext.sql("LOAD DATA LOCAL INPATH '/root/test/student_scores' INTO TABLE student_scores");

        // Query the tables into DataFrames
        DataFrame student_infos = hiveContext.table("student_infos");
        student_infos.show();
        DataFrame student_scores = hiveContext.table("student_scores");
        student_scores.show();

        DataFrame goodStudentsDF = hiveContext.sql("SELECT si.name, si.age, ss.score "
                + "FROM student_infos si "
                + "JOIN student_scores ss "
                + "ON si.name=ss.name "
                + "WHERE ss.score>=80");
        goodStudentsDF.show();
        goodStudentsDF.registerTempTable("goodStudent");
        DataFrame result = hiveContext.sql("select * from goodStudent");
        result.show();

        // Save the result into the Hive table good_student_infos
        hiveContext.sql("DROP TABLE IF EXISTS good_student_infos");
        goodStudentsDF.write().mode(SaveMode.Overwrite).saveAsTable("good_student_infos");

        DataFrame table = hiveContext.table("good_student_infos");
        Row[] rows = table.collect();
        for (Row row : rows) {
            System.out.println(row);
        }
        sc.stop();
    }
}

 

Scala:

package com.wjy.df

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

object CreateDFFromHive {
  def main(args: Array[String]): Unit = {
    // Depends on Hive, so local mode cannot be used
    val conf = new SparkConf().setAppName("CreateDFFromHive")
    val sc = new SparkContext(conf)
    // HiveContext is a subclass of SQLContext
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("use spark")

    // student_infos
    hiveContext.sql("drop table if exists student_infos")
    hiveContext.sql("create table if not exists student_infos (name string, age int) row format delimited fields terminated by '\t'")
    hiveContext.sql("load data local inpath '/root/test/student_infos' into table student_infos")
    val df1 = hiveContext.table("student_infos")
    df1.show()

    // student_scores
    hiveContext.sql("drop table if exists student_scores")
    hiveContext.sql("create table if not exists student_scores (name string, score int) row format delimited fields terminated by '\t'")
    hiveContext.sql("load data local inpath '/root/test/student_scores' into table student_scores")
    val df2 = hiveContext.table("student_scores")
    df2.show()

    val df = hiveContext.sql("select si.name,si.age,ss.score from student_infos si, student_scores ss where si.name = ss.name")
    df.show()

    // Write the result into the Hive table good_student_infos
    hiveContext.sql("drop table if exists good_student_infos")
    df.write.mode(SaveMode.Overwrite).saveAsTable("good_student_infos")
    sc.stop()
  }
}

 

Appendix: Spark on Hive configuration
1. Configure Hive access on the Spark client
Under spark-1.6.0/conf in the Spark client installation, create the file hive-site.xml
and point it at Hive's metastore:

<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node1:9083</value>
  </property>
</configuration>

 

2. Start Hive's metastore service:
hive --service metastore

3. Start the ZooKeeper cluster, the HDFS cluster, and the Spark cluster.

4. Start spark-shell, read the row count of a Hive table, and compare the time against the same count query run directly in Hive.

./spark-shell \
--master spark://node1:7077,node2:7077 \
--executor-cores 1 \
--executor-memory 1g \
--total-executor-cores 1

......

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hc = new HiveContext(sc)
scala> hc.sql("show databases").show()
scala> hc.sql("use default").show()
scala> hc.sql("select count(*) from jizhan").show()

 

Note:
If a Spark on Hive query fails with the error Caused by: java.net.UnknownHostException:....
the HDFS cluster path cannot be resolved; set the HDFS configuration path in conf/spark-env.sh on the client machine:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

 


Reposted from: https://www.cnblogs.com/cac2020/p/10717889.html
