Time Series Preprocessing and Trend Forecasting with an ARIMA Model - Big Data ML Sample Set Case Studies in Practice...

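Copyright notice: This technical column series is a summary and distillation of the day-to-day work of the author, Qin Kaixin (秦凯新). It draws on cases taken from real commercial environments and offers tuning advice for commercial applications, cluster capacity planning, and related topics. Please keep following this blog series. QQ email: 1120746959@qq.com; feel free to get in touch for any academic exchange.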

1 Data Preprocessing

  • Generating time series data

      import datetime as dt

      import numpy as np
      import pandas as pd

      # date_range takes a start time, a number of periods, and a frequency.
      # Frequency aliases: H = hour, D = day, M = month.
      # Accepted date strings include:
      #   2016 Jul 1, 7/1/2016, 1/7/2016, 2016-07-01, 2016/07/01
      rng = pd.date_range('2016-07-01', periods=10, freq='3D')
      rng
      # DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
      #                '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
      #                '2016-07-25', '2016-07-28'],
      #               dtype='datetime64[ns]', freq='3D')

      # 20 daily observations of standard normal noise, starting 2016-01-01
      time = pd.Series(np.random.randn(20),
                       index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
      print(time)
      # 2016-01-01   -0.129379
      # 2016-01-02    0.164480
      # 2016-01-03   -0.639117
      # 2016-01-04   -0.427224
      # 2016-01-05    2.055133
      # 2016-01-06    1.116075
      # 2016-01-07    0.357426
      # 2016-01-08    0.274249
      # 2016-01-09    0.834405
      # 2016-01-10   -0.005444
      # 2016-01-11   -0.134409
      # 2016-01-12    0.249318
      # 2016-01-13   -0.297842
      # 2016-01-14   -0.128514
      # 2016-01-15    0.063690
      # 2016-01-16   -2.246031
      # 2016-01-17    0.359552
      # 2016-01-18    0.383030
      # 2016-01-19    0.402717
      # 2016-01-20   -0.694068
      # Freq: D, dtype: float64

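
The date spellings listed in the comment above do not all denote the same day under pandas' default parsing. The following is a minimal sketch, assuming the default month-first reading of slash-separated dates; the variable names are illustrative and not part of the original notes.

```python
import pandas as pd

# Unambiguous spellings of 1 July 2016
iso = pd.Timestamp('2016-07-01')
named = pd.Timestamp('2016 Jul 1')
slashed_ymd = pd.Timestamp('2016/07/01')

# Slash-separated day/month strings are read month-first by default
month_first = pd.to_datetime('7/1/2016')                 # 1 July 2016
ambiguous = pd.to_datetime('1/7/2016')                   # 7 January 2016
day_first = pd.to_datetime('1/7/2016', dayfirst=True)    # 1 July 2016

print(iso, named, slashed_ymd, month_first, ambiguous, day_first)
```

When the input uses day-first dates, passing `dayfirst=True` (or an explicit `format=`) avoids silently swapping day and month.
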
  • Filtering with truncate

      # Drop everything before 2016-01-10 (the boundary date itself is kept)
      time.truncate(before='2016-1-10')
      # 2016-01-10   -0.005444
      # 2016-01-11   -0.134409
      # 2016-01-12    0.249318
      # 2016-01-13   -0.297842
      # 2016-01-14   -0.128514
      # 2016-01-15    0.063690
      # 2016-01-16   -2.246031
      # 2016-01-17    0.359552
      # 2016-01-18    0.383030
      # 2016-01-19    0.402717
      # 2016-01-20   -0.694068
      # Freq: D, dtype: float64

      # Drop everything after 2016-01-10 (the boundary date itself is kept)
      time.truncate(after='2016-1-10')
      # 2016-01-01   -0.129379
      # 2016-01-02    0.164480
      # 2016-01-03   -0.639117
      # 2016-01-04   -0.427224
      # 2016-01-05    2.055133
      # 2016-01-06    1.116075
      # 2016-01-07    0.357426
      # 2016-01-08    0.274249
      # 2016-01-09    0.834405
      # 2016-01-10   -0.005444
      # Freq: D, dtype: float64

      # Label-based slicing with date strings; both endpoints are included
      print(time['2016-01-15':'2016-01-20'])
      # 2016-01-15    0.063690
      # 2016-01-16   -2.246031
      # 2016-01-17    0.359552
      # 2016-01-18    0.383030
      # 2016-01-19    0.402717
      # 2016-01-20   -0.694068
      # Freq: D, dtype: float64

      # Month-end dates between two bounds
      # (pandas 2.2+ deprecates the alias 'M' here in favour of 'ME')
      data = pd.date_range('2010-01-01', '2011-01-01', freq='M')
      print(data)
      # DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
      #                '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
      #                '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
      #               dtype='datetime64[ns]', freq='M')

      # Use a date range as the index of a Series
      rng = pd.date_range('2016 Jul 1', periods=10, freq='D')
      rng
      pd.Series(range(len(rng)), index=rng)
      # 2016-07-01    0
      # 2016-07-02    1
      # 2016-07-03    2
      # 2016-07-04    3
      # 2016-07-05    4
      # 2016-07-06    5
      # 2016-07-07    6
      # 2016-07-08    7
      # 2016-07-09    8
      # 2016-07-10    9
      # Freq: D, dtype: int32

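
As a small check on how `truncate` relates to plain label slicing, the sketch below uses a deterministic series (an assumption made for reproducibility; the notes above use random values) and compares the two approaches.

```python
import pandas as pd

s = pd.Series(range(20), index=pd.date_range('2016-01-01', periods=20, freq='D'))

# truncate keeps the rows between the two labels, both ends included
window = s.truncate(before='2016-01-10', after='2016-01-15')

# Label-based slicing on a DatetimeIndex is inclusive on both ends as well
sliced = s['2016-01-10':'2016-01-15']

print(window.equals(sliced))
print(len(window))
```

On a sorted index the two forms select the same rows, so the choice is mostly a matter of readability.
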
  • Specifying the index

      # A Series can also be indexed by periods (time spans) instead of timestamps
      periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
      ts = pd.Series(np.random.randn(len(periods)), index=periods)
      ts  # three random values indexed by the monthly periods 2016-01 to 2016-03

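
To make the idea of a period more concrete, here is a short sketch of a few `pd.Period` properties. It uses fixed values rather than random ones, and `pd.period_range` as an assumed shortcut for the hand-written list of periods above.

```python
import numpy as np
import pandas as pd

p = pd.Period('2016-01')            # monthly frequency is inferred from the string
print(p.freqstr)
print(p + 1)                        # the following month
print(p.start_time, p.end_time)     # first and last instant covered by the period

# period_range builds the same kind of index as a hand-written list of Periods
idx = pd.period_range('2016-01', periods=3, freq='M')
ts = pd.Series(np.arange(3), index=idx)
print(ts)
```
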
  • Timestamps and periods can be converted into each other

      ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
      ts
      # 2016-07-10 08:00:00    0
      # 2016-07-10 09:00:00    1
      # 2016-07-10 10:00:00    2
      # 2016-07-10 11:00:00    3
      # 2016-07-10 12:00:00    4
      # 2016-07-10 13:00:00    5
      # 2016-07-10 14:00:00    6
      # 2016-07-10 15:00:00    7
      # 2016-07-10 16:00:00    8
      # 2016-07-10 17:00:00    9
      # Freq: H, dtype: int32

      # Convert the timestamp index into hourly periods
      ts_period = ts.to_period()
      ts_period
      # 2016-07-10 08:00    0
      # 2016-07-10 09:00    1
      # 2016-07-10 10:00    2
      # 2016-07-10 11:00    3
      # 2016-07-10 12:00    4
      # 2016-07-10 13:00    5
      # 2016-07-10 14:00    6
      # 2016-07-10 15:00    7
      # 2016-07-10 16:00    8
      # 2016-07-10 17:00    9
      # Freq: H, dtype: int32

      # Period slicing keeps the 08:00 period because that hour contains 08:30
      ts_period['2016-07-10 08:30':'2016-07-10 11:45']
      # 2016-07-10 08:00    0
      # 2016-07-10 09:00    1
      # 2016-07-10 10:00    2
      # 2016-07-10 11:00    3
      # Freq: H, dtype: int32

      # Timestamp slicing drops 08:00 because that instant lies before 08:30
      ts['2016-07-10 08:30':'2016-07-10 11:45']
      # 2016-07-10 09:00:00    1
      # 2016-07-10 10:00:00    2
      # 2016-07-10 11:00:00    3
      # Freq: H, dtype: int32

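
The conversion works in both directions. The sketch below, which assumes a small daily series that straddles a month boundary, maps timestamps to monthly periods with `to_period` and back with `to_timestamp`.

```python
import pandas as pd

daily = pd.Series(range(5), index=pd.date_range('2016-07-29', periods=5, freq='D'))

# Timestamps -> periods: each day is mapped to the month that contains it
monthly = daily.to_period('M')

# Periods -> timestamps: each period is mapped to its first instant
back = monthly.to_timestamp()

print(monthly.index)
print(back.index)
```

The round trip is lossy: after converting to months and back, every row carries the first day of its month rather than the original date.
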

2 Resampling

  • Resampling converts time series data from one frequency to another

  • Downsampling: moving to a lower frequency (for example, daily to monthly), which requires an aggregation

  • Upsampling: moving to a higher frequency (for example, every 3 days to daily), which leaves gaps to fill

      rng = pd.date_range('1/1/2011', periods=90, freq='D')
      ts = pd.Series(np.random.randn(len(rng)), index=rng)
      ts.head()
      # 2011-01-01   -1.025562
      # 2011-01-02    0.410895
      # 2011-01-03    0.660311
      # 2011-01-04    0.710293
      # 2011-01-05    0.444985
      # Freq: D, dtype: float64

      # Downsampling: daily values summed per month
      # (pandas 2.2+ deprecates the alias 'M' here in favour of 'ME')
      ts.resample('M').sum()
      # 2011-01-31    2.510102
      # 2011-02-28    0.583209
      # 2011-03-31    2.749411
      # Freq: M, dtype: float64

      # Downsampling: daily values summed per 3-day bin
      ts.resample('3D').sum()
      # 2011-01-01    0.045643
      # 2011-01-04   -2.255206
      # 2011-01-07    0.571142
      # 2011-01-10    0.835032
      # 2011-01-13   -0.396766
      # 2011-01-16   -1.156253
      # 2011-01-19   -1.286884
      # 2011-01-22    2.883952
      # 2011-01-25    1.566908
      # 2011-01-28    1.435563
      # 2011-01-31    0.311565
      # 2011-02-03   -2.541235
      # 2011-02-06    0.317075
      # 2011-02-09    1.598877
      # 2011-02-12   -1.950509
      # 2011-02-15    2.928312
      # 2011-02-18   -0.733715
      # 2011-02-21    1.674817
      # 2011-02-24   -2.078872
      # 2011-02-27    2.172320
      # 2011-03-02   -2.022104
      # 2011-03-05   -0.070356
      # 2011-03-08    1.276671
      # 2011-03-11   -2.835132
      # 2011-03-14   -1.384113
      # 2011-03-17    1.517565
      # 2011-03-20   -0.550406
      # 2011-03-23    0.773430
      # 2011-03-26    2.244319
      # 2011-03-29    2.951082
      # Freq: 3D, dtype: float64

      # Downsampling: mean of each 3-day bin
      day3Ts = ts.resample('3D').mean()
      day3Ts
      # 2011-01-01    0.015214
      # 2011-01-04   -0.751735
      # 2011-01-07    0.190381
      # 2011-01-10    0.278344
      # 2011-01-13   -0.132255
      # 2011-01-16   -0.385418
      # 2011-01-19   -0.428961
      # 2011-01-22    0.961317
      # 2011-01-25    0.522303
      # 2011-01-28    0.478521
      # 2011-01-31    0.103855
      # 2011-02-03   -0.847078
      # 2011-02-06    0.105692
      # 2011-02-09    0.532959
      # 2011-02-12   -0.650170
      # 2011-02-15    0.976104
      # 2011-02-18   -0.244572
      # 2011-02-21    0.558272
      # 2011-02-24   -0.692957
      # 2011-02-27    0.724107
      # 2011-03-02   -0.674035
      # 2011-03-05   -0.023452
      # 2011-03-08    0.425557
      # 2011-03-11   -0.945044
      # 2011-03-14   -0.461371
      # 2011-03-17    0.505855
      # 2011-03-20   -0.183469
      # 2011-03-23    0.257810
      # 2011-03-26    0.748106
      # 2011-03-29    0.983694
      # Freq: 3D, dtype: float64

      # Upsampling: 3-day means back to daily; days without a value become NaN
      print(day3Ts.resample('D').asfreq())
      # 2011-01-01    0.015214
      # 2011-01-02         NaN
      # 2011-01-03         NaN
      # 2011-01-04   -0.751735
      # 2011-01-05         NaN
      # 2011-01-06         NaN
      # 2011-01-07    0.190381
      # 2011-01-08         NaN
      # 2011-01-09         NaN
      # 2011-01-10    0.278344
      # 2011-01-11         NaN
      # 2011-01-12         NaN
      # 2011-01-13   -0.132255
      # 2011-01-14         NaN
      # 2011-01-15         NaN
      # 2011-01-16   -0.385418
      # 2011-01-17         NaN
      # 2011-01-18         NaN
      # 2011-01-19   -0.428961
      # 2011-01-20         NaN
      # 2011-01-21         NaN
      # 2011-01-22    0.961317
      # Freq: D, Length: 88, dtype: float64

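
Because the series above is random, the arithmetic is hard to follow by eye. The sketch below repeats the same downsample-then-upsample steps on nine fixed values (an assumption made purely for illustration) so that each bin can be checked by hand.

```python
import numpy as np
import pandas as pd

daily = pd.Series(np.arange(1, 10, dtype=float),
                  index=pd.date_range('2011-01-01', periods=9, freq='D'))

# Downsampling: nine daily values collapse into three 3-day bins
sums = daily.resample('3D').sum()      # 1+2+3, 4+5+6, 7+8+9
means = daily.resample('3D').mean()

# Upsampling: the 3-day series is expanded back to daily rows;
# days without an observation become NaN
upsampled = means.resample('D').asfreq()

print(sums)
print(upsampled)
```

Note that the upsampled series stops at the last bin label (2011-01-07), so it has seven rows rather than the original nine.
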
  • ffill: fill a missing value with the preceding value

  • bfill: fill a missing value with the following value

  • interpolate: fill missing values by linear interpolation

      # Forward fill, copying each value forward by at most one row
      day3Ts.resample('D').ffill(limit=1)
      # 2011-01-01    0.015214
      # 2011-01-02    0.015214
      # 2011-01-03         NaN
      # 2011-01-04   -0.751735
      # 2011-01-05   -0.751735
      # 2011-01-06         NaN
      # 2011-01-07    0.190381
      # 2011-01-08    0.190381
      # 2011-01-09         NaN
      # 2011-01-10    0.278344
      # 2011-01-11    0.278344

      # Backward fill, copying each value backward by at most one row
      day3Ts.resample('D').bfill(limit=1)
      # 2011-01-01    0.015214
      # 2011-01-02         NaN
      # 2011-01-03   -0.751735
      # 2011-01-04   -0.751735
      # 2011-01-05         NaN
      # 2011-01-06    0.190381
      # 2011-01-07    0.190381
      # 2011-01-08         NaN
      # 2011-01-09    0.278344
      # 2011-01-10    0.278344
      # 2011-01-11         NaN
      # 2011-01-12   -0.132255
      # 2011-01-13   -0.132255

      # Linear interpolation between the known points
      day3Ts.resample('D').interpolate('linear')
      # 2011-01-01    0.015214
      # 2011-01-02   -0.240435
      # 2011-01-03   -0.496085
      # 2011-01-04   -0.751735
      # 2011-01-05   -0.437697
      # 2011-01-06   -0.123658
      # 2011-01-07    0.190381
      # 2011-01-08    0.219702
      # 2011-01-09    0.249023
      # 2011-01-10    0.278344
      # 2011-01-11    0.141478
      # 2011-01-12    0.004611
      # 2011-01-13   -0.132255
      # 2011-01-14   -0.216643
      # 2011-01-15   -0.301030

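
The three fill strategies are easier to compare on round numbers. This is a minimal sketch on an assumed three-point series with values 0, 3 and 6 spaced three days apart.

```python
import pandas as pd

coarse = pd.Series([0.0, 3.0, 6.0],
                   index=pd.date_range('2011-01-01', periods=3, freq='3D'))

forward = coarse.resample('D').ffill(limit=1)         # copy forward by at most one day
backward = coarse.resample('D').bfill(limit=1)        # copy backward by at most one day
linear = coarse.resample('D').interpolate('linear')   # straight line between known points

print(pd.DataFrame({'ffill': forward, 'bfill': backward, 'linear': linear}))
```

With `limit=1` each gap of two days is only half filled, which is why one NaN per gap survives in the `ffill` and `bfill` columns; omitting `limit` fills every gap completely.
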

3 Sliding Windows

  • Sliding-window computation

      %matplotlib inline
      import matplotlib.pylab
      import numpy as np
      import pandas as pd

      # 600 days of standard normal noise (a Series, despite the name df)
      df = pd.Series(np.random.randn(600),
                     index=pd.date_range('7/1/2016', freq='D', periods=600))
      df.head()
      # 2016-07-01   -0.192140
      # 2016-07-02    0.357953
      # 2016-07-03   -0.201847
      # 2016-07-04   -0.372230
      # 2016-07-05    1.414753
      # Freq: D, dtype: float64

      # 10-day window; the first 9 results are NaN because the window is not yet full
      r = df.rolling(window=10)
      # Other aggregations: r.max, r.median, r.std, r.skew, r.sum, r.var
      print(r.mean())
      # 2016-07-01         NaN
      # 2016-07-02         NaN
      # 2016-07-03         NaN
      # 2016-07-04         NaN
      # 2016-07-05         NaN
      # 2016-07-06         NaN
      # 2016-07-07         NaN
      # 2016-07-08         NaN
      # 2016-07-09         NaN
      # 2016-07-10    0.300133
      # 2016-07-11    0.284780
      # 2016-07-12    0.252831
      # 2016-07-13    0.220699
      # 2016-07-14    0.167137
      # 2016-07-15    0.018593
      # 2016-07-16   -0.061414
      # 2016-07-17   -0.134593
      # 2016-07-18   -0.153333
      # 2016-07-19   -0.218928
      # 2016-07-20   -0.169426
      # 2016-07-21   -0.219747
      # 2016-07-22   -0.181266
      # 2016-07-23   -0.173674
      # 2016-07-24   -0.130629
      # 2016-07-25   -0.166730
      # 2016-07-26   -0.233044
      # 2016-07-27   -0.256642
      # 2016-07-28   -0.280738
      # 2016-07-29   -0.289893
      # 2016-07-30   -0.379625
      #               ...
      # 2018-01-22   -0.211467
      # 2018-01-23    0.034996
      # 2018-01-24   -0.105910
      # 2018-01-25   -0.145774
      # 2018-01-26   -0.089320
      # 2018-01-27   -0.164370
      # 2018-01-28   -0.110892
      # 2018-01-29   -0.205786
      # 2018-01-30   -0.101162
      # 2018-01-31   -0.034760
      # 2018-02-01    0.229333
      # 2018-02-02    0.043741
      # 2018-02-03    0.052837
      # 2018-02-04    0.057746
      # 2018-02-05   -0.071401
      # 2018-02-06   -0.011153
      # 2018-02-07   -0.045737
      # 2018-02-08   -0.021983
      # 2018-02-09   -0.196715
      # 2018-02-10   -0.063721
      # 2018-02-11   -0.289452
      # 2018-02-12   -0.050946
      # 2018-02-13   -0.047014
      # 2018-02-14    0.048754
      # 2018-02-15    0.143949
      # 2018-02-16    0.424823
      # 2018-02-17    0.361878
      # 2018-02-18    0.363235
      # 2018-02-19    0.517436
      # 2018-02-20    0.368020
      # Freq: D, Length: 600, dtype: float64

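
The leading NaN values in the rolling mean come from windows that are not yet full. The sketch below shows this on six fixed values with a window of 3, and uses `min_periods` (a parameter not used in the notes above) to relax the requirement.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
              index=pd.date_range('2016-07-01', periods=6, freq='D'))

strict = s.rolling(window=3).mean()                   # NaN until 3 values are available
lenient = s.rolling(window=3, min_periods=1).mean()   # averages whatever is available

print(pd.DataFrame({'value': s, 'strict': strict, 'lenient': lenient}))
```
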
  • Visualization

      import matplotlib.pyplot as plt
      %matplotlib inline

      plt.figure(figsize=(15, 5))
      df.plot(style='r--')                            # raw series, red dashed line
      df.rolling(window=10).mean().plot(style='b')    # 10-day rolling mean, blue line


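4 Forecasting with ARIMA

  • Data preprocessing

      import datetime

      import matplotlib.pyplot as plt
      import pandas as pd
      import pandas_datareader
      import seaborn as sns
      # Legacy ARIMA class (removed in statsmodels 0.13; newer releases
      # provide ARIMA in statsmodels.tsa.arima.model instead)
      from statsmodels.tsa.arima_model import ARIMA
      from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

      plt.style.use('ggplot')
      plt.rcParams['font.sans-serif'] = ['SimHei']    # font that can render Chinese labels
      plt.rcParams['axes.unicode_minus'] = False      # render minus signs correctly with that font

      stockFile = 'data/T10yr.csv'
      # Use the first column as the index and parse it as dates
      stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
      stock.head(10)
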
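
The file `data/T10yr.csv` is not reproduced in these notes, so the effect of `index_col=0, parse_dates=[0]` can only be sketched on a stand-in. The column names and values below are hypothetical placeholders, not the real contents of the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/T10yr.csv; the real columns and values are not shown in the text
csv_text = """Date,Open,Close
2000-01-03,1.0,1.5
2000-01-04,2.0,2.5
2000-01-05,3.0,3.5
"""

# index_col=0 makes the first column the index; parse_dates=[0] parses it as dates
stock = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(type(stock.index).__name__)
print(stock)
```

Having a `DatetimeIndex` is what makes the later `resample` call and the year-based slicing possible.
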

    # Weekly mean of the closing price, with weeks ending on Monday
    stock_week = stock['Close'].resample('W-MON').mean()
    # Training window: 2000 through 2015
    stock_train = stock_week['2000':'2015']
    stock_train.plot(figsize=(12, 8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Stock Close")
    sns.despine()

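
How `resample('W-MON')` assigns days to weeks is easy to get wrong, so here is a small sketch on fourteen made-up daily values. It assumes pandas' default weekly binning, in which each bin ends on a Monday and is labelled with that Monday.

```python
import pandas as pd

# Tuesday 2000-01-04 through Monday 2000-01-17: exactly two Monday-ending weeks
days = pd.date_range('2000-01-04', periods=14, freq='D')
close = pd.Series(range(1, 15), index=days, dtype=float)

weekly = close.resample('W-MON').mean()
print(weekly)

# Partial-string slicing by year, as used for the training window
print(weekly['2000':'2000'])
```

The first bin covers Tuesday to Monday (values 1 to 7) and is labelled 2000-01-10; the second covers values 8 to 14 and is labelled 2000-01-17.
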

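    # First-order difference: y[t] - y[t-1]; the first row has no predecessor, so drop it
    stock_diff = stock_train.diff()
    stock_diff = stock_diff.dropna()

    plt.figure()
    plt.plot(stock_diff)
    plt.title('First-order difference')
    plt.show()
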
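
The purpose of differencing is to remove the trend so that the remaining series looks stationary. As an illustration on synthetic data (a random walk with drift, chosen here as an assumption because its increments are known), the sketch below shows that `diff()` recovers those increments.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
steps = rng.normal(loc=0.1, scale=1.0, size=200)    # stationary increments with a small drift
walk = pd.Series(np.cumsum(steps),
                 index=pd.date_range('2000-01-03', periods=200, freq='W-MON'))

diff = walk.diff().dropna()                  # y[t] - y[t-1]
manual = (walk - walk.shift(1)).dropna()     # the same computation written out

print(np.allclose(diff.values, manual.values))
print(np.allclose(diff.values, steps[1:]))
```

This is also why the ARIMA order used later has d = 1: the model is fitted to the first difference rather than to the levels.
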

    # Autocorrelation function of the differenced series, up to lag 20
    acf = plot_acf(stock_diff, lags=20)
    plt.title("ACF")
    acf.show()

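
For intuition about what the ACF plot displays, the sketch below computes the sample autocorrelation by hand with NumPy on a simulated AR(1) series. The `autocorr` helper and the simulated data are illustrative assumptions, not part of the original notes or of statsmodels.

```python
import numpy as np


def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    n = len(centered)
    # Covariance between the series and its lagged copy, divided by the variance
    return float(np.dot(centered[lag:], centered[:n - lag]) / np.dot(centered, centered))


# Simulated AR(1) process x[t] = 0.7 * x[t-1] + noise (illustrative data only)
rng = np.random.default_rng(42)
noise = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + noise[t]

acf_values = [autocorr(x, k) for k in range(4)]
print(acf_values)   # lag 0 is 1 by construction; later lags decay roughly like 0.7 ** k
```

A slowly decaying ACF like this one, combined with a PACF that cuts off after the first lag, is the usual hint that a low-order autoregressive term is appropriate.
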

    # Partial autocorrelation function of the differenced series, up to lag 20
    pacf = plot_pacf(stock_diff, lags=20)
    plt.title("PACF")
    pacf.show()


    # ARIMA(p=1, d=1, q=1) on the weekly training series
    model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
    result = model.fit()
    # print(result.summary())

    # typ='levels' returns predictions on the original scale rather than as differences;
    # dynamic=True feeds earlier predictions, not observed values, into later ones
    pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')
    print(pred)
    # 2014-06-09    2.463559
    # 2014-06-16    2.455539
    # 2014-06-23    2.449569
    # 2014-06-30    2.444183
    # 2014-07-07    2.438962
    # 2014-07-14    2.433788
    # 2014-07-21    2.428627
    # 2014-07-28    2.423470
    # 2014-08-04    2.418315
    # 2014-08-11    2.413159
    # 2014-08-18    2.408004
    # 2014-08-25    2.402849
    # 2014-09-01    2.397693
    # 2014-09-08    2.392538
    # 2014-09-15    2.387383

    plt.figure(figsize=(6, 6))
    plt.xticks(rotation=45)
    plt.plot(pred)
    plt.plot(stock_train)


5 Summary

These notes were compiled for my own review, so the coverage is rough; please bear with me.
