
Kaggle forecasting

Two Kaggle forecasting competitions

1. Web Traffic Time Series Forecasting: https://www.kaggle.com/c/web-traffic-time-series-forecasting/overview

Arthur Suilin (1st place in this competition)

github:https://github.com/sjvasquez/web-traffic-forecasting

My model is basically an RNN seq2seq (encoder+decoder) with some twists. Key ideas:
  • There are two main information sources for prediction: A) year/quarter seasonality and B) past trend. A good model should use both sources and combine them intelligently.
  • Minimal feature engineering. A deep learning model is powerful enough to discover and use features on its own. My task is just to help the model use incoming data in a meaningful way.

I'll describe the implementation problems I encountered and their solutions.

1. Learning takes too much time.

RNNs are inherently sequential and hard to parallelize. Today's most efficient RNN implementation is the CuDNN fused kernels, created by NVIDIA experts. By default, Tensorflow uses its own generic but slow sequential RNNCell. Surprisingly, TF also supports CuDNN kernels (hard to find in the documentation and poorly described). I spent some time figuring out how to use the classes in the tf.contrib.cudnn_rnn module and got an amazing result: a ~10x decrease in computation time! I also used GRU instead of classical LSTM: it gives better results and computes ~1.5x faster. Of course, CuDNN can be used only for the encoder. In the decoder, each next step depends on customized processing of the outputs from the previous step, so the decoder is a Tensorflow GRUBlockCell. GRUBlockCell is again slightly faster than the standard GRUCell (~1.2x).

2. Long short-term memory is not so long.

The practical memory limit for LSTM-type cells is 100-200 steps. If we use a longer sequence, the LSTM/GRU just forgets what was at the beginning. But to exploit yearly seasonality, we should use at least 365 steps. The conventional method to overcome this memory limit is attention. We can take encoder outputs from the distant past and feed them as inputs into the current decoder step. My first, very basic positional attention model: take encoder outputs from steps current_day - 365 (year seasonality) and current_day - 92 (quarter seasonality), squeeze them through an FF layer (to reduce dimensionality and extract useful features), concatenate them and feed them into the decoder. To compensate for random walk noise and deviations in year and quarter lengths (leap/non-leap year, different number of days in months), I take a weighted average (in proportion 0.25:0.5:0.25) of 3 decoder outputs around the chosen step. Then I realized that 0.25:0.5:0.25 is just a 1D convolution kernel of size 3, and my model can learn the most effective kernel weights and attention offsets on its own. This learnable convolutional attention significantly improved model results.
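The fixed 0.25:0.5:0.25 smoothing described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the outputs are plain scalars rather than vectors, the kernel is fixed rather than learned, and the names `smoothed_lookup` and `encoder_outputs` are hypothetical.

```python
def smoothed_lookup(outputs, step, kernel=(0.25, 0.5, 0.25)):
    """Weighted average of the values around `step`, i.e. a size-3 1D convolution."""
    half = len(kernel) // 2
    total = 0.0
    for k, weight in enumerate(kernel):
        # Clamp the index so windows near the edges stay in range (an assumption).
        idx = min(max(step + k - half, 0), len(outputs) - 1)
        total += weight * outputs[idx]
    return total


# Hypothetical scalar "encoder outputs", one value per day.
encoder_outputs = [float(day) for day in range(400)]
current_day = 399

year_feature = smoothed_lookup(encoder_outputs, current_day - 365)
quarter_feature = smoothed_lookup(encoder_outputs, current_day - 92)
```

In a learnable version, the three kernel weights would become trainable parameters instead of constants.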

But what if we just use lagged pageviews (year or quarter lag) as additional input features? Can lagged pageviews supplement or even replace attention? Yes, they can. When I added 4 additional features (3,6,9,12 months lagged pageviews) to inputs, I got roughly the same improvement as from attention.
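A minimal sketch of such lag features, assuming that 3, 6, 9 and 12 months correspond to 91, 182, 273 and 365 days and that lags reaching before the start of the series are filled with 0. Both choices, and the name `lagged_features`, are illustrative assumptions rather than details from the post.

```python
def lagged_features(pageviews, day, lags_in_days=(91, 182, 273, 365)):
    """Return the pageviews observed `lag` days before `day`, one value per lag."""
    features = []
    for lag in lags_in_days:
        source_day = day - lag
        # Assumption: a lag that points before the series start contributes 0.
        features.append(pageviews[source_day] if source_day >= 0 else 0.0)
    return features


series = list(range(500))  # hypothetical daily pageviews
features_day_400 = lagged_features(series, 400)
features_day_100 = lagged_features(series, 100)
```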

3. Overfitting.

I decided to limit the number of days used for training to 100..400 and use the remaining days to generate different samples for training. Example: if we have 500 days of data and use a 200-day window for training and 60 days for prediction, then the first 240 days are 'free space' from which to randomly choose a starting day for training. Each starting day will produce a different time series. 145K pages x 240 starting days = 34.8M unique time series, not bad! For stage 2, this number is even higher. This is an effective kind of data augmentation: models using a random starting point show very little overfitting, even without any regularization. With dropout and slight L2 regularization, overfitting is almost nonexistent.
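The random starting point can be sketched as follows. The function name `sample_window` and the use of Python's `random.Random` are assumptions for illustration; the post does not show how the sampling was implemented.

```python
import random


def sample_window(series, train_len, predict_len, rng):
    """Cut one (train, target) pair from `series` at a random starting day."""
    free_space = len(series) - train_len - predict_len
    if free_space < 0:
        raise ValueError("series is too short for the requested window")
    # Any start in 0..free_space leaves room for both the train and target parts.
    start = rng.randrange(free_space + 1)
    split = start + train_len
    return series[start:split], series[split:split + predict_len]


rng = random.Random(0)
days = list(range(500))  # hypothetical series: 500 days of data
train, target = sample_window(days, train_len=200, predict_len=60, rng=rng)
```

Calling `sample_window` again with the same `rng` yields a different window of the same page, which is what makes this work as data augmentation.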

4. How can the model decide what to use: seasonality, past trend, or both?

Autocorrelation coefficient to the rescue. It turned out to be a very important input feature. If year-to-year (lag 365) autocorrelation is positive and high, the model should use mostly year-to-year seasonality; if it's low or negative, the model should use mostly past trend information (or quarter seasonality if that is high). An RNN can't compute autocorrelation on its own (this would require an additional pass over all steps), so this is the only hand-crafted input feature in my models. It's important not to include leading/trailing zeros/NaNs in the autocorrelation calculation (the page either didn't exist yet on the leading-zero days or had been deleted by the trailing-zero days).
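One way to compute such a feature is sketched below. It assumes the autocorrelation is the Pearson correlation between the series and a copy of itself shifted by `lag` days, and that any NaNs inside the active range have already been filled; the post does not give the exact formula, so both points are assumptions.

```python
import math


def trim_inactive(series):
    """Drop leading and trailing zeros/NaNs (page not yet created, or deleted)."""
    def inactive(value):
        return value == 0 or math.isnan(value)

    start, end = 0, len(series)
    while start < end and inactive(series[start]):
        start += 1
    while end > start and inactive(series[end - 1]):
        end -= 1
    return series[start:end]


def autocorrelation(series, lag):
    """Pearson correlation between the trimmed series and itself shifted by `lag`."""
    values = trim_inactive(series)
    if len(values) <= lag:
        return 0.0  # assumption: too little data means "no usable seasonality"
    recent, shifted = values[lag:], values[:-lag]
    mean_r = sum(recent) / len(recent)
    mean_s = sum(shifted) / len(shifted)
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(recent, shifted))
    var_r = sum((r - mean_r) ** 2 for r in recent)
    var_s = sum((s - mean_s) ** 2 for s in shifted)
    if var_r == 0 or var_s == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_r * var_s)


# Hypothetical page with a weekly pattern, padded with inactive days.
weekly = [0, 0] + [float(day % 7 + 1) for day in range(70)] + [0, float("nan")]
weekly_autocorr = autocorrelation(weekly, lag=7)
```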

5. High variance

I used the following variance reduction methods:

  1. SGD weights averaging, decay=0.99. It didn't really reduce observable variance, but it improved prediction quality by ~0.2 SMAPE points.
  2. Checkpoints were created every 100 training steps, and the predictions of the models at the last 10 checkpoints were averaged.
  3. The same model was trained with 3 different random seeds, and the predictions were averaged. Again, this slightly improved prediction quality.
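The three methods above reduce to two small operations, sketched here on plain Python lists. Reading "weights averaging, decay=0.99" as an exponential moving average of the weights is an assumption, and the function names are hypothetical.

```python
def update_average(avg_weights, new_weights, decay=0.99):
    """One step of an exponential moving average over model weights."""
    return [decay * avg + (1.0 - decay) * new
            for avg, new in zip(avg_weights, new_weights)]


def average_predictions(predictions):
    """Element-wise mean of several prediction vectors (checkpoints or seeds)."""
    count = len(predictions)
    return [sum(values) / count for values in zip(*predictions)]


# Hypothetical predictions from three seeds for a 2-day horizon.
seed_predictions = [[10.0, 12.0], [11.0, 13.0], [12.0, 14.0]]
ensemble = average_predictions(seed_predictions)
```

The same `average_predictions` helper covers both the last-10-checkpoints average and the 3-seed average, since both are plain means over prediction vectors.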

Prediction quality (predicting the last 60 days) of my models on Stage 2 data was ~35.2-35.5 SMAPE if autocorrelation was calculated over all available data (including the prediction interval) and ~36 SMAPE if autocorrelation was calculated on all data excluding the prediction interval. Let's see if the model will hold the same quality on future data.
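For reference, SMAPE on a 0-200 scale can be computed as in the sketch below. Treating a day where both the actual and predicted values are 0 as zero error is an assumed convention; it should be checked against the competition's evaluation page.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, on a 0..200 scale."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        denominator = abs(y) + abs(y_hat)
        if denominator > 0:  # assumed convention: 0 vs 0 counts as no error
            total += 2.0 * abs(y - y_hat) / denominator
    return 100.0 * total / len(actual)
```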

Tips from the winning solutions

Congratulations to "all winners"! (including organizers) Thank you so much for creating, maintaining, competing, and sharing your solutions! Let me summarize what I learned from the top solutions:

  • Use medians as features.

  • Use log1p to transform data, and MAE as the evaluation metric.

  • XGBoost and deep learning models such as MLP, CNN, RNN work. However, the performance hugely depends on how we create and train models.

  • For these deep learning models, skip connection works.

  • Best trick to me: clustering these time-series based on the performance of the best model. Then training different models for each cluster.

  • The period of stage 2 is easier for prediction than the period of stage 1. This affects how we will choose our best model (should it capture the weird behavior of stage 1 or not?).

  • Don't wait until the last hour to submit models. I overslept, so I couldn't submit my best model =o= That model might have given me a gold (it improved my CV by a margin of 0.5) :D
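The log1p tip from the list above can be sketched as follows: train and measure MAE in log space, then map predictions back with expm1. The helper names are hypothetical.

```python
import math


def to_log(pageviews):
    """log(1 + x) keeps zero counts finite and compresses large values."""
    return [math.log1p(value) for value in pageviews]


def from_log(values):
    """Inverse of to_log, used to turn predictions back into pageview counts."""
    return [math.expm1(value) for value in values]


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


views = [0.0, 9.0, 99.0]  # hypothetical daily pageviews
restored = from_log(to_log(views))
```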


Various solutions (including 1st, 3rd, 4th,... places): https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39367

2nd place solution: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/39395

6th place:

https://github.com/sjvasquez/web-traffic-forecasting

2. Corporación Favorita Grocery Sales Forecasting


1st place solution


Congrats to all winning teams and to the new grandmaster sjv. Thanks to Kaggle for hosting and to Favorita for sponsoring this great competition. Special thanks to @sjv, @senkin13, @tunguz, @ceshine: we built our models based on your kernels.

  • https://github.com/sjvasquez/web-traffic-forecasting/blob/master/cnn.py
  • https://www.kaggle.com/senkin13/lstm-starter/code
  • https://www.kaggle.com/tunguz/lgbm-one-step-ahead-lb-0-513
  • https://www.kaggle.com/ceshine/lgbm-starter

As in the Rossmann competition, the private leaderboard was shaken up again this time. I think luck was finally on our side.


Sample Selection

We used only 2017 data to extract features and construct samples.

Training data: 20170531 - 20170719 or 20170614 - 20170719; different models were trained on different data sets.

Validation: 20170726 - 20170810

In fact, we tried to use more data but failed. The gap between the public and private leaderboards is not very stable. If we train a single model on the data for all 16 days, the gap is smaller (0.002-0.003).

Preprocessing

We just filled missing or negative promotion and target values with 0.
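A minimal sketch of that rule on a plain list, assuming missing values appear as `None` or NaN. The helper name `clean_value` is hypothetical.

```python
import math


def clean_value(value):
    """Map missing (None/NaN) or negative values to 0; keep everything else."""
    if value is None or math.isnan(value) or value < 0:
        return 0.0
    return float(value)


unit_sales = [3.0, None, -2.0, float("nan"), 5.0]  # hypothetical raw targets
cleaned = [clean_value(value) for value in unit_sales]
```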

Feature Engineering

  1. basic features
    • category features: store, item, family, class, cluster...
    • promotion
    • dayofweek(only for model 3)
  2. statistical features: we compute statistics of several targets for different keys over different time windows
    • time windows
      • nearest days: [1,3,5,7,14,30,60,140]
      • equal time windows: [1] * 16, [7] * 20...
    • key:store x item, item, store x class
    • target: promotion, unit_sales, zeros
    • method
      • mean, median, max, min, std
      • days since last appearance
      • difference of mean value between adjacent time windows(only for equal time windows)
  3. useless features
    • holidays
    • other keys such as: cluster x item, store x family...
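The statistical features above can be sketched for a single key (for example one store x item series of daily unit sales). The feature names, the use of the population standard deviation, and the reading of "days since last appearance" as days since the last non-zero sale are all assumptions made for illustration.

```python
import statistics

NEAREST_DAYS = [1, 3, 5, 7, 14, 30, 60, 140]


def window_features(sales, windows=NEAREST_DAYS):
    """Summary statistics of the most recent `days` values, for each window."""
    features = {}
    for days in windows:
        recent = sales[-days:]
        features[f"mean_{days}"] = statistics.fmean(recent)
        features[f"median_{days}"] = statistics.median(recent)
        features[f"max_{days}"] = max(recent)
        features[f"min_{days}"] = min(recent)
        features[f"std_{days}"] = statistics.pstdev(recent)
        # Days since the last non-zero sale; the window length if there is none.
        since = next((i for i, v in enumerate(reversed(recent)) if v != 0),
                     len(recent))
        features[f"days_since_last_sale_{days}"] = since
    return features


def adjacent_window_mean_diffs(sales, width, count):
    """Mean differences between adjacent equal-width windows, newest first."""
    means = []
    for i in range(count):
        end = len(sales) - i * width
        if end <= 0:
            break
        means.append(statistics.fmean(sales[max(end - width, 0):end]))
    return [newer - older for newer, older in zip(means, means[1:])]


sales = [0, 2, 4, 0]  # hypothetical daily unit sales for one store x item
features = window_features(sales, windows=[1, 3])
diffs = adjacent_window_mean_diffs([1, 1, 3, 3], width=2, count=2)
```

The same functions would be applied per key (store x item, item, store x class) and per target (promotion, unit_sales, zeros).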

Single Model

  • model_1: 0.506 / 0.511, 16 lgb models, one trained for each day. Source code: https://www.kaggle.com/shixw125/1st-place-lgb-model-public-0-506-private-0-511
  • model_2: 0.507 / 0.513, 16 nn models, one trained for each day. Source code: https://www.kaggle.com/shixw125/1st-place-nn-model-public-0-507-private-0-513
  • model_3: 0.512 / 0.515, 1 lgb model for all 16 days with almost the same features as model_1
  • model_4: 0.517 / 0.519, 1 nn model based on @sjv's code

Ensemble

Stacking didn't work well this time; our best model is a linear blend of the 4 single models.

final submission = 0.42*model_1 + 0.28 * model_2 + 0.18 * model_3 + 0.12 * model_4

public = 0.504 , private = 0.509
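The blend is a per-row weighted sum. The sketch below uses the weights from the formula above; the dictionary layout and the function name `blend` are assumptions.

```python
WEIGHTS = {"model_1": 0.42, "model_2": 0.28, "model_3": 0.18, "model_4": 0.12}


def blend(predictions, weights=WEIGHTS):
    """Weighted sum of per-model prediction lists, row by row."""
    names = list(weights)
    length = len(predictions[names[0]])
    return [sum(weights[name] * predictions[name][row] for name in names)
            for row in range(length)]


# Hypothetical predictions: every model outputs the same two rows.
same_predictions = {name: [1.0, 2.0] for name in WEIGHTS}
blended = blend(same_predictions)
```

Because the weights sum to 1, the blend stays on the same scale as the individual model predictions.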
