2019独角兽企业重金招聘Python工程师标准>>>
1. 目的
从string类型的字段中解析并汇总每种category类型的总amount
2. 素材
表名:test_table
order_no hotel_seq discount_detail
D8662EF4E 10212527 NULL
45C024849 01801434 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":10.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}
268A58R09 01224011 {"detail":[{"discount_type":"返现","discount_name":"bonus_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"bonus_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"coupon_back_rate","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"coupon_commission_subtract","amount":0.0,"category":"定价"},{"discount_type":"直减","discount_name":"upto_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"firstnight_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"firstnight_commission_subtract","amount":0.0,"category":"营销"},{"discount_type":"返现","discount_name":"discount_back_rate","amount":0.0,"category":"营销"},{"discount_type":"直减","discount_name":"discount_commission_subtract","amount":0.0,"category":"营销"}]}
3. UDF编写
注意:hive环境中的NULL值传递给python的形态是\N
#!/usr/bin/env python
# -- coding: utf-8 --
import sys
import json
reload(sys);
sys.setdefaultencoding('utf-8');for line in sys.stdin:line_sp = line.strip().split('\t')order_no = line_sp[0]hotel_seq = line_sp[1]#判断解析字段是否为空if line_sp[2] != '\N':str_json = line_sp[2].split('[')[1].split(']')[0]str_json_sp=str_json.replace('},{','}&{').split('&')marketing = 0.0pricing = 0.0other = 0.0#累加各类别的amountfor single_json in str_json_sp:djson = json.loads(single_json)if djson['category'] == '营销':marketing = marketing+djson['amount']elif djson['category'] == '定价':pricing = pricing + djson['amount']elif djson['category'] == '其他':other = other + djson['amount']else:marketing = 0.0pricing = 0.0other = 0.0print order_no,'\t',hotel_seq,'\t',marketing,'\t',pricing,'\t',other
4. UDF使用
测试:
#可以将待处理字段复制在文本中,在shell环境中预先测试
cat test20181109a.txt|python test20181109.py
结果示例:
0B0CFED94 91219624 0.0 11.0 0.0
D83EC96F4 91253482 4.0 0.0 0.0
C11D6127E 90932407 0.0 0.0 0.0
A54AE774E 10212527 0.0 0.0 0.0
sql调用:
#调用编写的UDF(本地路径非HDFS)
add FILE /tmp/parseUDF.py#使用UDF解析
SELECTTRANSFORM(order_no, hotel_seq,discount_detail) USING 'python test20181109.py' AS (order_no, hotel_seq, marketing, pricing, other)
FROM test_table
LIMIT 10;
结果示例:
45C024849 01801434 10.0 0.0 0.0
268A8A6B9 01224011 0.0 0.0 0.0
E071C63CA 01224011 20.0 42.0 0.0
7C42DC06E 01105013 0.0 10.0 0.0
6B04415D6 00701122 0.0 0.0 0.0