Day 11 Redo

Study date: 2.15-2.16

Summary of what I learned

Day 11 tasks

Since it has been a while since I last used Scala for data processing, I plan to spend these two days redoing the data and exercises our teacher gave us before.

package subject4

import org.apache.spark.sql.SparkSession

object dataGet {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("dataGet4")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.max.dynamic.partitions", "100000")
.config("hive.exec.max.dynamic.partitions.pernode", "100000")
.getOrCreate()
import spark.sql
val date = "20230215"
spark.read.format("jdbc")
.option("url","jdbc:mysql://172.16.1.77/shtd_store")
.option("driver","com.mysql.jdbc.Driver")
.option("user","root")
.option("password","root")
.option("dbtable","CUSTOMER").load().createOrReplaceTempView("customer")

mysqlGet(spark,"NATION")
mysqlGet(spark,"PART")
mysqlGet(spark,"PARTSUPP")
mysqlGet(spark,"REGION")
mysqlGet(spark,"SUPPLIER")

sql(
s"""
|insert into table ods.customer partition (etldate=$date)
|select * from customer
|""".stripMargin)

sql(
s"""
|insert into table ods.nation partition (etldate=$date)
|select * from nation
|""".stripMargin)

sql(
s"""
|insert into table ods.part partition (etldate=$date)
|select * from part
|""".stripMargin)

sql(
s"""
|insert into table ods.partsupp partition (etldate=$date)
|select * from partsupp
|""".stripMargin)



def mysqlGet(mysqlSpark:SparkSession,mysqlName:String): Unit ={
val url = "jdbc:mysql://172.16.1.77:3306/shtd_store?serverTimezone=GMT&useUnicode=true"
val driver="com.mysql.jdbc.Driver"
val user = "root"
val password = "root"
mysqlSpark.read.format("jdbc")
.option("url",url)
.option("driver",driver)
.option("user",user)
.option("password",password)
.option("dbtable",mysqlName).load().createTempView(mysqlName.toLowerCase)
}
}
}

Bugs

After packaging the jar and submitting it with spark-submit, it fails with an error that points to JDBC. I spent a whole day on it without fixing it and couldn't find the cause.

[screenshot: image-20230329174401734]

Daily summary

These two days got merged into one entry because today was spent entirely on bug fixing. I thought the problem would be minor, but both of us get exactly the same error on submit, so it feels like a server configuration issue; even code I had submitted successfully before still fails. I have no real leads. Starting Hive over JDBC didn't work either, so the problem is probably in the connection between Hive and MySQL. Tomorrow I'll ask someone to take a look; once one cleaning run goes through, things should be fine from then on.

Spark Data Cleaning Bugs

Study date: 2.17-2.20

Summary of what I learned

Continued with the bug from the end of the previous entry.

Bugs

I spent several days on this, so I won't post screenshots of the later failed runs; briefly, the error said the Hive database and table could not be found, even though I had just created them. Since queries occasionally did succeed before, I suspected there were two Hive instances. I asked Gong Chunqiang, who suggested clearing the IDEA cache; that didn't help, but swapping hive-site.xml back to the original one fixed it. Could it really be the xml file? Then I swapped it back again and it still worked, so I can't make sense of it.

[screenshot: image-20230329174418261]

Extended learning

None; I was fixing bugs and didn't do a single algorithm problem.

Daily summary

These few days of bug fixing were mostly directionless because the error messages were vague and CSDN searches turned up unrelated answers. At first the MySQL connection itself failed; swapping the mysql-connector-java version didn't help, but adding the jar to Spark fixed it. After that I couldn't connect to Hive at all. I then remembered that when I learned Hadoop I had installed Hive 3 and Spark 3; the versions are supposed to match, but the tables still couldn't be found and apparently extra configuration is needed, which was too much trouble, so I simply downgraded. After downgrading, other bugs appeared. The last one, the Hive table not being found, took a day and a half, and in the end I'm still not completely sure how it was fixed; what matters is that I can practice now. Most likely my machine still had cached files from the previous Hive install, so it kept pointing at the server's old Hive.

April Task 1 and a Review of Functions

Study date: 2.21-2.22

Summary of what I learned

Completed task 1 of the April combined task book.

Code log


Data cleaning

package gz04.subject1

import org.apache.spark.sql.SparkSession

object dataClear {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("clear4")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.max.dynamic.partitions", "1000000")
.config("hive.exec.max.dynamic.partitions.pernode", "1000000")
.getOrCreate()
import spark.sql

sql(
"""
|insert overwrite table dwd.dim_customer partition (etldate)
|select
|custkey,
|name ,
|address ,
|nationkey ,
|phone ,
|acctbal ,
|mktsegment,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from
|ods.customer
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_part partition (etldate)
|select
|partkey ,
|name ,
|mfgr ,
|brand ,
|type ,
|size ,
|container ,
|retailprice ,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from ods.part
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_nation partition (etldate)
|select
|nationkey ,
|name ,
|regionkey ,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from
|ods.nation
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_region partition (etldate)
|select
|regionkey ,
|name ,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from
|ods.region
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.fact_orders partition (etldate)
|select
|orderkey ,
|custkey ,
|orderstatus ,
|totalprice ,
|orderdate ,
|orderpriority ,
|clerk ,
|shippriority ,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from
|ods.orders
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.fact_lineitem partition (etldate)
|select
|orderkey ,
|partkey ,
|suppkey ,
|linenumber ,
|quantity ,
|extendedprice ,
|discount ,
|tax ,
|returnflag ,
|linestatus ,
|shipdate ,
|commitdate ,
|receiptdate ,
|shipinstruct ,
|shipmode ,
|comment ,
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|'user1',
|date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss'),
|etldate
|from
|ods.lineitem
|""".stripMargin)

print(
"""
|
|┌───┐ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┐ ┌───┬───┬───┬───┐ ┌───┬───┬───┐
|│Esc│ │ F1│ F2│ F3│ F4│ │ F5│ F6│ F7│ F8│ │ F9│F10│F11│F12│ │P/S│S L│P/B│ ┌┐ ┌┐ ┌┐
|└───┘ └───┴───┴───┴───┘ └───┴───┴───┴───┘ └───┴───┴───┴───┘ └───┴───┴───┘ └┘ └┘ └┘
|┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───────┐ ┌───┬───┬───┐ ┌───┬───┬───┬───┐
|│~ `│! 1│@ 2│# 3│$ 4│% 5│^ 6│& 7│* 8│( 9│) 0│_ -│+ =│ BacSp │ │Ins│Hom│PUp│ │N L│ / │ * │ - │
|├───┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─────┤ ├───┼───┼───┤ ├───┼───┼───┼───┤
|│ Tab │ Q │ W │ E │ R │ T │ Y │ U │ I │ O │ P │{ [│} ]│ | \ │ │Del│End│PDn│ │ 7 │ 8 │ 9 │ │
|├─────┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴┬──┴─────┤ └───┴───┴───┘ ├───┼───┼───┤ + │
|│ Caps │ A │ S │ D │ F │ G │ H │ J │ K │ L │: ;│" '│ Enter │ │ 4 │ 5 │ 6 │ │
|├──────┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴────────┤ ┌───┐ ├───┼───┼───┼───┤
|│ Shift │ Z │ X │ C │ V │ B │ N │ M │< ,│> .│? /│ Shift │ │ ↑ │ │ 1 │ 2 │ 3 │ │
|├─────┬──┴─┬─┴──┬┴───┴───┴───┴───┴───┴──┬┴───┼───┴┬────┬────┤ ┌───┼───┼───┐ ├───┴───┼───┤ E││
|│ Ctrl│ │Alt │ Space │ Alt│ │ │Ctrl│ │ ← │ ↓ │ → │ │ 0 │ . │←─┘│
|└─────┴────┴────┴───────────────────────┴────┴────┴────┴────┘ └───┴───┴───┘ └───────┴───┴───┘
|
|
|""".stripMargin)
}
}

Metric computation

package gz04.subject1

import org.apache.spark.sql.SparkSession

object dataMath {
def main(args: Array[String]): Unit = {

def mysqlExtract(sparkSession: SparkSession, mysqlTableName: String): Unit = {
val url = "jdbc:mysql://172.16.1.77/shtd_store?useSSL=false&useUnicode=true&characterEncoding=utf-8"
val driver = "com.mysql.jdbc.Driver"
val user = "root"
val password = "123456Admin@123"
sparkSession.read.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("user", user)
.option("password", password)
.option("dbtable", mysqlTableName).load().createTempView("mysql_" + mysqlTableName.toLowerCase)
}

val spark: SparkSession = SparkSession.builder().master("local[*]").appName("clear4")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.max.dynamic.partitions", "1000000")
.config("hive.exec.max.dynamic.partitions.pernode", "1000000")
.getOrCreate()
import spark.sql

sql(
"""
|WITH t1 AS(
|SELECT
|dc.cust_key as customerkey,
|dn.nation_key as nationkey,
|dn.name as nationname,
|dr.region_key as regionkey,
|dr.name as regionname,
|date_format(order_date,'yyyy') as year,
|date_format(order_date,'MM') as month,
|fo.total_price
|FROM
|dwd.fact_orders AS fo
|LEFT JOIN dwd.dim_customer dc ON fo.cust_key = dc.cust_key
|LEFT JOIN dwd.dim_nation dn ON dc.nation_key = dn.nation_key
|LEFT JOIN dwd.dim_region dr ON dn.region_key = dr.region_key
|)
|SELECT
|nationkey,
|max(nationname),
|max(regionkey),
|max(regionname),
|sum(total_price),
|count(1) AS totalorder,
|year,
|month
|FROM
|t1
|GROUP BY
|nationkey,year,month
|""".stripMargin).createTempView("math1")

mysqlExtract(spark,"nationeverymonth")

sql(
"""
|insert overwrite table mysql_nationeverymonth
|select * from math1
|""".stripMargin)


spark.conf.set("spark.sql.crossJoin.enabled","true")
sql(
"""
|with t1 as (
|select c.cust_key as custkey,
|n.nation_key as nationkey,
|n.name as nationname,
|total_price
|from dwd.fact_orders fo
|left join dwd.dim_customer c on fo.cust_key=c.cust_key
|left join dwd.dim_nation n on c.nation_key=n.nation_key),
|t2 as (
|select custkey,max(nationkey) as nationkey, max(nationname) as nationname,
|sum(total_price) as sumprice
|from t1
|group by t1.custkey),
|t3 as (
|select nationkey,
|max(nationname) as nationname,
|round(avg(sumprice),2) as avgprice
|from t2
|group by nationkey),
|t4 as (
|select round(avg(sumprice),2) as allavgprice from t2)
|select t3.*,t4.* from t3,t4
|""".stripMargin).show()

val d: Double = sql(
"""
|select avg(allnationavgconsumption)
|from
|(
|SELECT
|sum(total_price) as allnationavgconsumption,
|count(1)
|FROM
|dwd.fact_orders fo
|group by
|fo.cust_key)
|""".stripMargin).collect()(0).getDouble(0)
print(d)
sql(
s"""
|with t1 as (
|SELECT
|dn.nation_key AS nationkey,
|max(dn.name) AS nationname,
|sum(total_price)/count(distinct(fo.cust_key)) as nationavgconsumption
|FROM
|dwd.fact_orders fo
|LEFT JOIN dwd.dim_customer dc on fo.cust_key = dc.cust_key
|LEFT JOIN dwd.dim_nation dn on dc.nation_key = dn.nation_key
|group by dn.nation_key)
|select t1.*,$d as allnationavgconsumption from t1
|""".stripMargin).createTempView("hive2") // register the result as hive2 so the insert below can read it


mysqlExtract(spark,"nationavgcmp")
sql(
"""
|insert overwrite table mysql_nationavgcmp
|select h.*,
|(case
|when h.nationavgconsumption > h.allnationavgconsumption then '高'
|when h.nationavgconsumption = h.allnationavgconsumption then '相同'
|when h.nationavgconsumption < h.allnationavgconsumption then '低' end
|)
|from hive2 h
|""".stripMargin)



sql(
"""
|with t1 as (
|select cust_key,
|cast(date_format(order_date,'yyyyMM') as int) as month,
|sum(total_price) as total_price,
|count(1) as total_order
|from dwd.fact_orders
|group by cust_key,cast(date_format(order_date,'yyyyMM') as int)
|),
|t2 as (
|select
|a.cust_key,
|concat(a.month,'_',b.month) as month,
|a.total_price + b.total_price as total_price,
|a.total_order + b.total_order as total_order
|from t1 a
|left join t1 b
| on --判断连续两个月增长
|a.cust_key=b.cust_key
|where
|a.month + 1 = b.month
|and
|a.total_price < b.total_price)
|select c.cust_key,c.name,month,total_price,total_order
|from t2
|left join dwd.dim_customer c on c.cust_key=t2.cust_key
|""".stripMargin).coalesce(50).createTempView("hiveOrders")

mysqlExtract(spark, "usercontinueorder")

sql(
"""
|insert overwrite table mysql_usercontinueorder
|select * from hiveOrders
|""".stripMargin)
}
}

Bugs

On question 7 of module B, querying the ods.orders table threw an error. On checking I found that the partition column name in my Hive table differs from the one in the task book; I went with my own, since I already ran into this while preparing last year and it didn't come up in the actual competition.

[screenshot: image-20230222145255409]

Extended learning

SQL functions used

distinct

distinct deduplicates the values of a column; combined with count it directly returns how many distinct values the column contains.
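For example, counting distinct customers in the fact table used above (a small sketch run from the same Spark session; spark is the SparkSession built in the task code):

// counts how many different cust_key values appear in dwd.fact_orders
spark.sql("select count(distinct cust_key) from dwd.fact_orders").show()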

greatest

greatest(a, b, c, d) returns the largest of its arguments.
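A quick check with literal values (again via spark.sql so it can be run from the same session):

// prints 9, the largest of the four arguments
spark.sql("select greatest(3, 9, 5, 1)").show()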

coalesce

select coalesce(success_cnt, 1) from tableA

When success_cnt is null this returns 1; otherwise it returns the actual value of success_cnt.

select coalesce(success_cnt, period, 1) from tableA

When success_cnt is not null, it is returned regardless of whether period is null (success_cnt is the first argument). When success_cnt is null and period is not, period is returned. Only when both success_cnt and period are null does it return 1.

regexp_replace

Replaces characters; it takes three arguments.

--usage
regexp_replace(orderdate, "-", "")
--turns every "-" in the orderdate column into an empty string

date_format

A date function that formats a date.

select date_format('2019-12-12','yyyyMMdd');
--20191212; note that the first argument must be a date/timestamp in a standard form such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss

from_unixtime

Converts a unix timestamp (in seconds) into a time string of the given format.

select from_unixtime(cast(1567896035000 / 1000 as bigint),'yyyy-MM-dd');
--2019-09-08
--the raw value is in milliseconds, so divide by 1000 first; from_unixtime expects seconds, and the millisecond value overflows int

current_timestamp

--get the current time
select current_timestamp;
--2022-09-07 15:58:36.102
--the format can be changed with date_format
select date_format(current_timestamp,'yyyy-MM-dd HH:mm:ss');
--2022-09-07 15:58:36

unix_timestamp

--with no argument it returns the current unix timestamp
--if the date argument is already in yyyy-MM-dd HH:mm:ss form, unix_timestamp(string date) converts it directly
--otherwise the date's format has to be passed as the second argument before converting
select unix_timestamp('2009-03-20') --1237507200
select unix_timestamp('2009-03-20 00:00:00', 'yyyy-MM-dd HH:mm:ss') --1237507200
select unix_timestamp('2009-03-20 00:00:01', 'yyyy-MM-dd HH:mm:ss') --1237507201

truncate table

Clears the data in a table while keeping its structure. When practicing on my own it's best to run it every time, to avoid errors from the packaged jar; data that is already there would otherwise get duplicated.

alter table ods.orders drop partition (etldate < '19980731')
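A small sketch of what I mean: truncate (or drop old partitions) before re-running the jar, so repeated runs don't pile up duplicate rows. ods.customer here is just one of the ods tables above used as an example.

// empty the table but keep its structure before the next practice run
spark.sql("truncate table ods.customer")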

Daily summary

Today was the first time in a while that I properly worked through a national-competition task book again. Modules B and C are actually fine: as long as I understand the questions, the SQL isn't hard. The metric-computation part, which I hadn't touched in a long time, took much longer to finish, and I made a small mistake by not deduplicating users. But I did learn a new trick: previously, whenever a single value had to appear as a column, I had to enable cross joins in Spark, which was a hassle; now I just pull the value out into a Scala val (see the sketch below).
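A minimal sketch of that trick, using the tables from the task code above (the cast to double is only there so getDouble works regardless of the column's original type):

// pull the single aggregated value into a Scala variable...
val avgAll: Double = spark.sql("select cast(avg(total_price) as double) from dwd.fact_orders").collect()(0).getDouble(0)
// ...then interpolate it into the next query, so no cross join is needed
spark.sql(s"select cust_key, avg(total_price) as cust_avg, $avgAll as all_avg from dwd.fact_orders group by cust_key").show()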

Hive 50 Practice Questions

Study date: 2.23-2.24

Summary of what I learned

--1.查询课程编号为“01”的课程比“02”的课程成绩高的所有学生的学号(重点):
select student.*,s.s_score as score1,b.s_score as score2
from student
left join score s on student.s_id = s.s_id and s.c_id = '1'
left join score b on student.s_id = b.s_id and b.c_id = '2'
where b.s_score<s.s_score;

--2、查询"01"课程比"02"课程成绩低的学生的信息及课程分数(重点):
select student.*,c.s_score as score2,s.s_score as score1
from student
left join score s on student.s_id = s.s_id and s.c_id = '1'
left join score c on student.s_id = c.s_id and c.c_id = '2'
where c.s_score > s.s_score;

--3、查询平均成绩大于等于60分的同学的学生编号和学生姓名和平均成绩(重点):
select stu.s_id,stu.s_name,round(avg(s.s_score),1) as avgScore
from student stu
left join score s on stu.s_id = s.s_id
group by stu.s_id,stu.s_name
having avg(s.s_score) >= 60;


set hive.strict.checks.cartesian.product = true;
set hive.mapred.mode = nonstrict;

--4、查询平均成绩小于60分的同学的学生编号和学生姓名和平均成绩:
select stu.s_id,stu.s_name,round(avg(s.s_score),1) as avgScore
from student stu
left join score s on stu.s_id = s.s_id
group by stu.s_id, stu.s_name
having round(avg(s.s_score),1) < 60
union all
select stu2.s_id,stu2.s_name,0
from student stu2
where stu2.s_id not in
(select sc.s_id from score sc);


--5、查询所有同学的学生编号、学生姓名、选课总数、所有课程的总成绩:
select stu.s_id,stu.s_name,count(s.c_id),sum(s.s_score)
from student stu
left join score s on stu.s_id = s.s_id
group by stu.s_id, stu.s_name;

--6、查询"李"姓老师的数量:
select t_name,count(1)
from teacher
where teacher.t_name like '李%'
group by t_name;

--7、查询学过"张三"老师授课的同学的信息(重点):
select stu.*
from student stu
left join score s on stu.s_id = s.s_id
left join course c on s.c_id = c.c_id
left join teacher t on c.t_id = t.t_id
where t.t_name = '张三';


--8、查询没学过"张三"老师授课的同学的信息(重点):
select student.*,tmp.s_id from student
left join (select s_id from score
join course on course.c_id=score.c_id
join teacher on course.t_id=teacher.t_id and t_name='张三')tmp
on student.s_id =tmp.s_id
where tmp.s_id is null;

--9、查询学过编号为"01"并且也学过编号为"02"的课程的同学的信息(重点):
select stu.*
from student stu
join score s on stu.s_id = s.s_id and s.c_id='2'
join score c on stu.s_id = c.s_id and c.c_id='1';
--10、查询学过编号为"01"但是没有学过编号为"02"的课程的同学的信息(重点):
set hive.strict.checks.cartesian.product = true;
set hive.mapred.mode = nonstrict;
select stu.*
from student stu
join score c on stu.s_id = c.s_id and c.c_id='1'
left join score s on stu.s_id = s.s_id and s.c_id='2'
where s.s_id is null;

--11、查询没有学全所有课程的同学的信息(重点):
select student.*
from student
join (Select count(c_id) num1 from course) tmp1
left join (select s_id,count(1) num2 from score group by s_id) tmp2
on tmp1.num1 = tmp2.num2 and student.s_id=tmp2.s_id
where tmp2.s_id is null;

--12、查询至少有一门课与学号为"01"的同学所学相同的同学的信息(重点):
with t1 as (
select c_id
from
score
where s_id='1'
)
select stu.*
from student stu
join score s on stu.s_id = s.s_id
join t1 on t1.c_id = s.c_id
where stu.s_id <> '1'
group by stu.s_id,s_name,s_sex,s_birth;

--13、查询和"01"号的同学学习的课程完全相同的其他同学的信息(重点):
with t1 as (
select c_id
from
score
where s_id='1'
)
select stu.*,count(s.c_id)
from student stu
join score s on stu.s_id = s.s_id
join t1 on t1.c_id = s.c_id
where stu.s_id <> '1'
group by stu.s_id,s_name,s_sex,s_birth;

--14、查询没学过"张三"老师讲授的任一门课程的学生姓名(重点):
select stu.s_name
from
student stu
left join (select s.c_id,s_id from score s
join course c on s.c_id = c.c_id
join teacher t on c.t_id = t.t_id and t.t_name = '张三'
) tmp on tmp.s_id = stu.s_id
where tmp.s_id is null;


--15、查询两门及其以上不及格课程的同学的学号,姓名及其平均成绩(重点):
with t1 as (
select stu.s_id,stu.s_name,count(s.s_score) as countBadScore
from student stu
left join score s on stu.s_id = s.s_id and s.s_score < 60
group by stu.s_id, stu.s_name
)
select t1.s_id,t1.s_name,round(avg(s_score),1)
from t1
left join score s on s.s_id=t1.s_id
where countBadScore >= 2
group by t1.s_id,t1.s_name;

--16、检索"01"课程分数小于60,按分数降序排列的学生信息
select stu.*,tmp.scores FROM student stu
join (select sid,scores from score where cid=1 and scores<60)tmp
on tmp.sid=stu.id
order by tmp.scores;

--17、按平均成绩从高到低显示所有学生的所有课程的成绩以及平均成绩
select a.sid,tmp1.scores as c1,tmp2.scores as c2,tmp3.scores as c3,
round(avg (a.scores),2) as avgScore
from score a
left join (select sid,scores from score s1 where cid='01')tmp1 on tmp1.sid=a.sid
left join (select sid,scores from score s2 where cid='02')tmp2 on tmp2.sid=a.sid
left join (select sid,scores from score s3 where cid='03')tmp3 on tmp3.sid=a.sid
group by a.sid,tmp1.scores,tmp2.scores,tmp3.scores
order by avgScore desc;

--18.查询各科成绩最高分、最低分和平均分:以如下形式显示:课程ID,课程name,最高分,最低分,平均分,及格率,中等率,优良率,优秀率

select course.cid,course.cname,tmp.maxScore,tmp.minScore,tmp.avgScore,tmp.passRate,tmp.moderate,tmp.goodRate,tmp.excellentRates from course
join(select
cid,
max(scores) as maxScore,
min(scores) as minScore,
round(avg(scores),2) avgScore,
round(sum(case when scores>=60 then 1 else 0 end)/count(cid),2)passRate,
round(sum(case when scores>=60 and scores<70 then 1 else 0 end)/count(cid),2) moderate,
round(sum(case when scores>=70 and scores<80 then 1 else 0 end)/count(cid),2) goodRate,
round(sum(case when scores>=80 and scores<90 then 1 else 0 end)/count(cid),2) excellentRates
from score group by cid) tmp on tmp.cid=course.cid;

--19、按各科成绩进行排序,并显示排名:
-- row_number() over() 分组排序功能(mysql没有该方法)
select cou.cname,stu.name,scores,rank()
over(partition by cou.cname order by scores desc) from score
left join student stu on stu.id=score.sid
left join course cou on cou.cid=score.cid;

--20、查询学生的总成绩并进行排名
select stu.name,sum(scores) sum_sco,rank()
over(order by sum(scores) desc) from score
left join student stu on stu.id=score.sid
group by stu.name;


--21、查询不同老师所教不同课程平均分从高到低显示


select teacher.tname,course.cname,round(avg(scores),2) avg_scores
from score
join course on score.cid=course.cid
join teacher on teacher.tid=course.tid
group by course.cname,teacher.tname
order by avg_scores desc;

--22、查询所有课程的成绩第2名到第3名的学生信息及该课程成绩
select course.cname,stu.*,tmp.scores,tmp.cno from
student stu join
(select *,row_number() over (partition by cid order by scores desc)cno from score )tmp
on stu.id=tmp.sid
join course on course.cid=tmp.cid
where tmp.cno between 2 and 3;

--23、统计各科成绩各分数段人数:课程编号,课程名称,[100-85],[85-70],[70-60],[0-60]及所占百分比

select score.cid,course.cname,
round(sum(case when score.scores>=85 and score.scores<=100 then 1 else 0 end)/count(score.scores),2) as 100and85,
round(sum(case when score.scores>=70 and score.scores<85 then 1 else 0 end)/count(score.scores),2) as 85and70,
round(sum(case when score.scores>=60 and score.scores<70 then 1 else 0 end)/count(score.scores),2) as 70and60,
round(sum(case when score.scores>=0 and score.scores<60 then 1 else 0 end)/count(score.scores),2) as 60and0
from score
left join course on score.cid=course.cid
group by score.cid,course.cname;


--24、查询学生平均成绩及其名次
select student.name,nvl(round(avg(scores),2),0.0),rank()
over(order by avg(scores) desc)
from score right join student on student.id=score.sid
group by sid,student.name;

--25、查询各科成绩前三名的记录

select course.cname,stu.*,tmp.scores,tmp.cno from student stu
join(select cid,sid,scores,row_number()
over(partition by cid order by scores desc) cno
from score )tmp
on stu.id=tmp.sid
join course on course.cid=tmp.cid
where cno<=3;

--26、查询每门课程被选修的学生数
select course.cname,count(sid) from score
join course on course.cid=score.cid
group by score.cid,course.cname;

--27、查询出只有两门课程的全部学生的学号和姓名
SELECT student.*,count(cid) from score
join student on student.id=score.sid
group by student.name,id,birthday,sex
having count(cid)=2;

--28、查询男生、女生人数
SELECT sex,count(sex) from student
group by sex;

--29、查询名字中含有"风"字的学生信息
select * from student
where name like("%风%");

--30、查询同名同性学生名单,并统计同名人数

select name,sex,count(name) from student
group by name,sex;

--31、查询1990年出生的学生名单
SELECT *,count(*) over() from student
where substring(birthday,1,4) = '1990'
group by student.id,name,sex,birthday;

select * from student
where birthday like("1990%");

select * from student
where year(birthday)=1990;

--32、查询每门课程的平均成绩,结果按平均成绩降序排列,平均成绩相同时,按课程编号升序排列
select cid,round(avg(scores),2) as avgs,row_number()
over(order by round(avg(scores),2) desc,cid)
from score
group by cid;

--33、查询平均成绩大于等于85的所有学生的学号、姓名和平均成绩
select id,name,tmp.cno from student
join (select sid,avg(scores) cno from score group by sid having avg(scores) >=85) tmp
on tmp.sid=student.id;

--34、查询课程名称为"数学",且分数低于60的学生姓名和分数
select stu.name,sco.scores
from student stu
join score sco
join course cor
on stu.id=sco.sid and sco.cid=cor.cid
where cor.cname='数学' and sco.scores<60;

--35、查询所有学生的课程及分数情况
SELECT id,name,tmp.Chinese,tmp.English,tmp.Math from student left join(
select sco.sid,
sum(case cor.cname when '语文' then sco.scores else 0 end) as Chinese,
sum(case cor.cname when '英语' then sco.scores else 0 end) as English,
sum(case cor.cname when '数学' then sco.scores else 0 end) as Math
from score sco
join course cor on cor.cid=sco.cid
group by sco.sid) tmp
on tmp.sid=student.id;

--36、查询任何一门课程成绩在70分以上的学生姓名、课程名称和分数
select student.name,course.cname,scores from score
join student on student.id=score.sid
join course on score.cid=course.cid
where scores>=70;


--37、查询课程不及格的学生
select student.name,course.cname,scores from score
join student on student.id=score.sid
join course on score.cid=course.cid
where scores<60;


--38、查询课程编号为01且课程成绩在80分以上的学生的学号和姓名

select student.id,student.name,scores from score
join student on student.id=score.sid
where scores>=80 and cid=1;


--39、求每门课程的学生人数
select cid,count(sid) from score
group by cid;


--40、查询选修"张三"老师所授课程的学生中,成绩最高的学生信息及其成绩

select student.*,scores ,RANK()
over(order by scores desc) from score
join student on student.id=score.sid
join course on course.cid=score.cid
join teacher on teacher.tid=course.tid
where teacher.tname = "张三";


--41、查询不同课程成绩相同的学生的学生编号、课程编号、学生成绩
select s1.* from
score s1,score s2
where s1.cid<>s2.cid and s1.scores=s2.scores
order by s1.sid;



--42、查询每门课程成绩最好的前三名
select tmp.cid,stu.*,tmp.scores,tmp.cno from
student stu join
(select cid,sid,scores,row_number() over(partition by cid order by scores desc) cno
from score) tmp
on stu.id=tmp.sid
where tmp.cno<=3;


--43、统计每门课程的学生选修人数(超过5人的课程才统计):
--要求输出课程号和选修人数,查询结果按人数降序排列,若人数相同,按课程号升序排列

select cid,count(sid) as cc
from score
group by cid
having count(sid)>=5
order by cc desc ,cid asc;

--44、检索至少选修两门课程的学生学号
select sid,count(cid) as cc
from score
group by sid
having cc >= 2;

--45、查询选修了全部课程的学生信息
select stu.* from score
join student stu on stu.id=score.sid
group by sid,stu.id,name,sex,birthday
having count(cid)=3;


--46 查询各学生的年龄(周岁):按照出生日期来算,当前月日 < 出生年月的月日则,年龄减一 ?

--sum 分组完 必须聚合 否则语法不对
select name,
sum(case month(student.birthday)<month(current_date()) when true then tmp.tage-1 else tmp.tage end) s_age
from student
join (select id,year(current_date())-year(birthday) as tage from student) tmp on tmp.id = student.id
group by student.name;


--47、查询本周过生日的学生:

select *
from student
where weekofyear(concat(year(current_date()),'-',date_format(birthday,'MM-dd')))= --把学生的年份换今年 月日不变 求第几周
weekofyear(current_date());


--48、查询下周过生日的学生:
select *
from student
where weekofyear(concat(year(current_date()),'-',date_format(birthday,'MM-dd')))=
weekofyear(current_date())+1;


--49、查询本月过生日的学生:
select * from student
where month(CURRENT_DATE())=month(birthday);


--50、查询12月份过生日的学生:
select * from student
where month(birthday)='12';

Bugs

Hive log error

From the error message it seems two parameters need to be set; setting them is supposed to allow cartesian-product joins.

set hive.strict.checks.cartesian.product = true;
set hive.mapred.mode = nonstrict;

[screenshot: image-20230223083040890]

Extended learning

Lanqiao Cup preparation; for time reasons I only did one official past problem released by Lanqiao.

I submitted several times without passing every case. Looking at the problem again, a, b and n can be as large as 10^18, so I suspect the failing cases were the ones that exceed the int range.

[screenshot: image-20230224135108347]

import java.util.Scanner;
// 1:无需package
// 2: 类名必须Main, 不可修改
public class Main {
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
//在此输入您的代码...
long a = scan.nextLong();
long b = scan.nextLong();
long n = scan.nextLong();
long weekNumber = a *5 + b * 2;
long week = n / weekNumber;
int day = 0;
long sum = 0;
while(sum < n - week * weekNumber){
day +=1;
if(day % 7 == 6 || day %7 == 0){
sum += b;
}else{
sum += a;
}
}
System.out.println(day+week*7);
scan.close();
}
}

Daily summary

Today was self-directed training. Before redoing last year's past papers I assigned myself a set of 50 Hive questions, which still had some difficulty. If I can write the queries Hive needs without hesitation, the later metric computation won't be a problem. A few questions still gave me trouble today and there was one small bug, which I solved. Fifteen questions in one day is acceptable; last year I could only manage four or five a day. For some of the harder ones I still referred to other people's approaches, so I need more practice until I'm clear about what a table looks like after each kind of join. I also squeezed in one algorithm problem, which felt easy, but the ones in the real competition are much harder.

July Task 1

Study date: 2.27

Summary of what I learned

Extraction

package gz07.subject1.GZ07

import org.apache.spark.sql.SparkSession

object dataChouQu {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("dataChouQu")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.max.dynamic.partitions", "100000")
.config("hive.exec.max.dynamic.partitions.pernode", "100000")
.getOrCreate()
import spark.sql
val date = "20230226"
mysqlChouQu(spark,"user_info","mysql_user")
mysqlChouQu(spark,"sku_info","mysql_sku")
mysqlChouQu(spark,"base_province","mysql_province")
mysqlChouQu(spark,"base_region","mysql_region")
mysqlChouQu(spark,"order_info","mysql_order")
mysqlChouQu(spark,"order_detail","mysql_detail")
sql(
"""
|select * from user_info
|where create_time = '2020-04-26 18:57:55' and (operate_time <= '2020-04-06 18:57:55' or operate_time is null)
|""".stripMargin)
sql(
s"""
|insert into table ods.user_info partition (etl_date='$date')
|select * from mysql_user limit 20
|""".stripMargin)

sql(
s"""
|insert into table ods.user_info partition (etl_date='$date')
|select * from mysql_user m
|where
|greatest(coalesce(unix_timestamp(m.create_time),0),coalesce(unix_timestamp(m.operate_time),0)) >=
|(select greatest(coalesce(max(unix_timestamp(create_time)),0),coalesce(max(unix_timestamp(operate_time)),0)) from ods.user_info)
|""".stripMargin)

sql(
s"""
|insert into table ods.sku_info partition (etl_date=$date)
|select * from mysql_sku m
|where
|m.create_time not in
|(select create_time from ods.sku_info)
|""".stripMargin)

sql(
s"""
|insert into table ods.base_province partition (etl_date=$date)
|select *,
|current_timestamp()
|from
|mysql_province m
|where
|m.id not in
|(select id from ods.base_province)
|""".stripMargin)
sql(
s"""
|insert into table ods.base_region partition (etl_date=$date)
|select *,
|current_timestamp()
|from
|mysql_region m
|where
|m.id not in
|(select id from ods.base_region)
|""".stripMargin)

sql(
s"""
|insert into table ods.order_info partition (etl_date=$date)
|select *
|from
|mysql_order m
|where
|greatest(coalesce(unix_timestamp(m.create_time),0),coalesce(unix_timestamp(m.operate_time),0)) >=
|(select greatest(coalesce(max(unix_timestamp(create_time)),0),coalesce(max(unix_timestamp(operate_time)),0)) from ods.order_info)
|""".stripMargin)

sql(
s"""
|insert into table ods.order_detail partition (etl_date=$date)
|select * from mysql_detail m
|where
|m.create_time not in
|(select create_time from ods.order_detail)
|""".stripMargin)
spark.stop()
def mysqlChouQu(sparkSession: SparkSession,mysqlName:String,tempName:String): Unit = {
val url = "jdbc:mysql://172.16.1.77/shtd_store"
val driver = "com.mysql.jdbc.Driver"
val user = "root"
val password = "123456Admin@123"
sparkSession.read.format("jdbc")
.option("url",url)
.option("driver",driver)
.option("user",user)
.option("password",password)
.option("dbtable",mysqlName).load().createTempView(tempName)
}
}
}

Cleaning

package gz07.subject1.GZ07

import org.apache.spark.sql.SparkSession

object dataClear {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("dataClear")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.exec.max.dynamic.partitions", "100000")
.config("hive.exec.max.dynamic.partitions.pernode", "100000")
.getOrCreate()
import spark.sql

sql(
s"""
|insert into table dwd.dim_user_info partition (etl_date)
|select
|id,login_name,nick_name,passwd,name,phone_num,
|email,head_img,user_level,birthday,gender,
|date_format(create_time,'yyyy-MM-dd HH:mm:ss'),
|coalesce(date_format(operate_time,'yyyy-MM-dd HH:mm:ss'),date_format(create_time,'yyyy-MM-dd HH:mm:ss')),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|etl_date
|from ods.user_info
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_sku_info partition (etl_date)
|select
|id,spu_id,price,sku_name,sku_desc,
|weight,tm_id,category3_id,sku_default_img,
|date_format(create_time,'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),etl_date
|from ods.sku_info
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_province partition (etl_date)
|select id,name,region_id,area_code,iso_code,create_time,
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),etl_date
|from ods.base_province
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.dim_region partition (etl_date)
|select id,region_name,create_time,
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),etl_date
|from ods.base_region
|""".stripMargin)

sql(
"""
|insert overwrite table dwd.fact_order_info partition (etl_date)
|select
|id,consignee,consignee_tel,final_total_amount,order_status,
|user_id,delivery_address,order_comment,out_trade_no,trade_body,
|date_format(create_time,'yyyyMMdd'),
|coalesce(operate_time,create_time),
|expire_time,tracking_no,parent_order_id,img_url,
|province_id,benefit_reduce_amount,original_total_amount,feight_fee,
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),etl_date
|from ods.order_info
|""".stripMargin)


sql(
"""
|insert overwrite table dwd.fact_order_detail partition (etl_date)
|select
|id,order_id,sku_id,sku_name,img_url,order_price,sku_num,
|date_format(create_time,'yyyyMMdd'),
|source_type,source_id,
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),
|'user1',date_format(current_timestamp(),'yyyy-MM-dd HH:mm:ss'),etl_date
|from ods.order_detail
|""".stripMargin)

spark.stop()
}
}

Bugs

None for now.

Daily summary

Today's task was the July task book. Once again Hive had no data, so I had nothing to check my answers against; tomorrow I'll load some fake data into Hive. Because I'm not very familiar with the July table structures, the metric-computation part went slowly and some requirements stayed fuzzy. All I could do was compare answers with others and see whether our approaches matched.

July Task Book: Loading Fake Data into Hive

Study date: 2.28

Summary of what I learned

I first used insert into to copy some data from MySQL into Hive, and found that Hive cannot export .sql files, only txt or csv. When the column counts line up, the data can be imported with load data local inpath '/opt/order_info.txt' into table ods.order_info; but for a partitioned table this no longer works, because the last column cannot be treated as the partition. The workaround is as follows.

Create a temporary staging table to receive the txt data, then insert from the staging table into the partitioned table with partition(partition_column) (a sketch follows). In the end this is less convenient than simply creating the partitioned table and importing from MySQL directly.
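A sketch of that staging-table route via spark.sql (ods.order_info_txt is a hypothetical non-partitioned staging table with the same columns as ods.order_info plus etl_date as an ordinary last column; the file path is the one mentioned above and has to be local to the driver):

// 1. load the exported txt file into the plain staging table
spark.sql("load data local inpath '/opt/order_info.txt' into table ods.order_info_txt")
// 2. allow dynamic partitioning, then move the rows into the real partitioned table
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("insert into table ods.order_info partition (etl_date) select * from ods.order_info_txt")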

DSL style

val maxTime: Timestamp = sql("select greatest(max(create_time),max(operate_time)) from ods.user_info").first().getTimestamp(0)
mysqlExtract(spark,"user_info","mysql_info")
sql(
s"""
|select *,"20230227" as etl_date
|from mysql_info
|where operate_time > '$maxTime' or create_time > '$maxTime'
|""".stripMargin).show()

It doesn't have to be pure SQL; DSL and SQL can be combined. The point to watch is that $maxTime has to be wrapped in quotes inside the SQL so the timestamp stays well-formed.

Bugs

sql(
"""
|select * from mysql_order
|where (create_time <= '2020-04-26 18:00:00' or create_time is null)
|and
|(operate_time <= '2020-04-26 18:00:00' or operate_time is null)
|""".stripMargin)
.withColumn("etl_date",lit("20230212"))
.write.format("parquet")
.partitionBy("etl_date")
.saveAsTable("ods.order_info")

Importing with DSL-style code like this creates the table with its full structure; the problem is that select then throws an error. In the screenshot, the query error is at the top and the auto-created table structure is below.

After repeated testing it turned out to be a column-type issue: querying the bigint and string columns works fine, but as soon as a decimal column is queried it errors out. I haven't found a fix, so I plan to add the data by creating the tables myself, package the fake-data code into a jar, and run that class before starting the exercises.

[screenshot: image-20230329174341766]

Extended learning

LeetCode 2363

// imports needed when running outside the LeetCode editor
import java.util.*;

class Solution {
public List<List<Integer>> mergeSimilarItems(int[][] items1, int[][] items2) {
TreeMap<Integer, Integer> map = new TreeMap<>();
for (int[] item : items1) {
map.put(item[0], map.getOrDefault(item[0], 0) + item[1]);
}
for (int[] item : items2) {
map.put(item[0], map.getOrDefault(item[0], 0) + item[1]);
}
List<List<Integer>> ans = new ArrayList<>();
for (Map.Entry<Integer, Integer> entry : map.entrySet()) {
ans.add(new ArrayList<>(Arrays.asList(entry.getKey(), entry.getValue())));
}
return ans;
}
}

Daily summary

Today I plugged the earlier gap of Hive having no data: a single Scala class now handles it, with the create-table statements and the extraction of part of the MySQL data all wrapped inside. After that I did task book 1 (the April one); no big problems, though a few places in the metric computation still took me a while to think through. Honestly, cleaning and extraction only need to be practiced once every two or three days per task book; the real focus and difficulty is the metric computation.