Time Series Preprocessing and ARIMA-Based Trend Forecasting - Big Data ML Sample Set Case Study...
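Copyright notice: This technical column is a summary and distillation of the day-to-day work of the author, Qin Kaixin (秦凯新). It draws its cases from real commercial environments and offers tuning advice for commercial applications as well as cluster capacity planning. Please keep following this blog series. QQ email: 1120746959@qq.com; feel free to get in touch for any academic exchange.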
1 Data Preprocessing
- Generating time series data
    import datetime as dt

    import numpy as np
    import pandas as pd

    # date_range takes a start time, a number of periods and a frequency.
    # Frequency aliases: H = hour, D = day, M = month.
    # Accepted date formats include:
    #   2016 Jul 1, 7/1/2016, 1/7/2016, 2016-07-01, 2016/07/01
    rng = pd.date_range('2016-07-01', periods=10, freq='3D')
    rng
    # DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
    #                '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
    #                '2016-07-25', '2016-07-28'],
    #               dtype='datetime64[ns]', freq='3D')

    time = pd.Series(np.random.randn(20),
                     index=pd.date_range(dt.datetime(2016, 1, 1), periods=20))
    print(time)
    # 2016-01-01   -0.129379
    # 2016-01-02    0.164480
    # 2016-01-03   -0.639117
    # 2016-01-04   -0.427224
    # 2016-01-05    2.055133
    # 2016-01-06    1.116075
    # 2016-01-07    0.357426
    # 2016-01-08    0.274249
    # 2016-01-09    0.834405
    # 2016-01-10   -0.005444
    # 2016-01-11   -0.134409
    # 2016-01-12    0.249318
    # 2016-01-13   -0.297842
    # 2016-01-14   -0.128514
    # 2016-01-15    0.063690
    # 2016-01-16   -2.246031
    # 2016-01-17    0.359552
    # 2016-01-18    0.383030
    # 2016-01-19    0.402717
    # 2016-01-20   -0.694068
    # Freq: D, dtype: float64
- Filtering with truncate
    # Drop everything before 2016-01-10 (the boundary date itself is kept)
    time.truncate(before='2016-1-10')
    # 2016-01-10   -0.005444
    # 2016-01-11   -0.134409
    # 2016-01-12    0.249318
    # 2016-01-13   -0.297842
    # 2016-01-14   -0.128514
    # 2016-01-15    0.063690
    # 2016-01-16   -2.246031
    # 2016-01-17    0.359552
    # 2016-01-18    0.383030
    # 2016-01-19    0.402717
    # 2016-01-20   -0.694068
    # Freq: D, dtype: float64

    # Drop everything after 2016-01-10 (the boundary date itself is kept)
    time.truncate(after='2016-1-10')
    # 2016-01-01   -0.129379
    # 2016-01-02    0.164480
    # 2016-01-03   -0.639117
    # 2016-01-04   -0.427224
    # 2016-01-05    2.055133
    # 2016-01-06    1.116075
    # 2016-01-07    0.357426
    # 2016-01-08    0.274249
    # 2016-01-09    0.834405
    # 2016-01-10   -0.005444
    # Freq: D, dtype: float64

    # Slicing with date strings includes both endpoints
    print(time['2016-01-15':'2016-01-20'])
    # 2016-01-15    0.063690
    # 2016-01-16   -2.246031
    # 2016-01-17    0.359552
    # 2016-01-18    0.383030
    # 2016-01-19    0.402717
    # 2016-01-20   -0.694068
    # Freq: D, dtype: float64

    # Month-end dates between a start and an end date
    data = pd.date_range('2010-01-01', '2011-01-01', freq='M')
    print(data)
    # DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
    #                '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
    #                '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
    #               dtype='datetime64[ns]', freq='M')

    # Use a date range as the index of a Series
    rng = pd.date_range('2016 Jul 1', periods=10, freq='D')
    rng
    pd.Series(range(len(rng)), index=rng)
    # 2016-07-01    0
    # 2016-07-02    1
    # 2016-07-03    2
    # 2016-07-04    3
    # 2016-07-05    4
    # 2016-07-06    5
    # 2016-07-07    6
    # 2016-07-08    7
    # 2016-07-09    8
    # 2016-07-10    9
    # Freq: D, dtype: int32
- Specifying an index
    periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
    ts = pd.Series(np.random.randn(len(periods)), index=periods)
    ts
    # ts is indexed by the monthly periods 2016-01, 2016-02 and 2016-03;
    # its values are random.
- Converting between timestamps and periods
    ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods=10, freq='H'))
    ts
    # 2016-07-10 08:00:00    0
    # 2016-07-10 09:00:00    1
    # 2016-07-10 10:00:00    2
    # 2016-07-10 11:00:00    3
    # 2016-07-10 12:00:00    4
    # 2016-07-10 13:00:00    5
    # 2016-07-10 14:00:00    6
    # 2016-07-10 15:00:00    7
    # 2016-07-10 16:00:00    8
    # 2016-07-10 17:00:00    9
    # Freq: H, dtype: int32

    ts_period = ts.to_period()
    ts_period
    # 2016-07-10 08:00    0
    # 2016-07-10 09:00    1
    # 2016-07-10 10:00    2
    # 2016-07-10 11:00    3
    # 2016-07-10 12:00    4
    # 2016-07-10 13:00    5
    # 2016-07-10 14:00    6
    # 2016-07-10 15:00    7
    # 2016-07-10 16:00    8
    # 2016-07-10 17:00    9
    # Freq: H, dtype: int32

    # Period slice: 08:30 falls inside the 08:00 period, so that period is included
    ts_period['2016-07-10 08:30':'2016-07-10 11:45']
    # 2016-07-10 08:00    0
    # 2016-07-10 09:00    1
    # 2016-07-10 10:00    2
    # 2016-07-10 11:00    3
    # Freq: H, dtype: int32

    # Timestamp slice: only timestamps between 08:30 and 11:45 are kept
    ts['2016-07-10 08:30':'2016-07-10 11:45']
    # 2016-07-10 09:00:00    1
    # 2016-07-10 10:00:00    2
    # 2016-07-10 11:00:00    3
    # Freq: H, dtype: int32
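The same string bounds select different rows depending on whether the index holds timestamps or periods. The sketch below rebuilds a small hourly series and compares the two slices; it assumes a reasonably recent pandas, and the names `by_timestamp` and `by_period` are introduced here only for illustration.

```python
import pandas as pd

# Hourly timestamps from 08:00 to 17:00 on a single day.
idx = pd.date_range("2016-07-10 08:00", periods=10, freq="h")
ts = pd.Series(range(10), index=idx)
ts_period = ts.to_period()  # same values, indexed by hourly periods

start, end = "2016-07-10 08:30", "2016-07-10 11:45"

# Timestamp index: keeps the points 09:00, 10:00 and 11:00.
by_timestamp = ts[start:end]

# Period index: 08:30 lies inside the 08:00 period, so that period is kept too.
by_period = ts_period[start:end]

print(by_timestamp.tolist())  # [1, 2, 3]
print(by_period.tolist())     # [0, 1, 2, 3]
```

A timestamp marks an instant, while a period covers a span, which is why the period slice reaches back to 08:00.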
2 Data Resampling
- Converting time series data from one frequency to another
- Downsampling
- Upsampling
    rng = pd.date_range('1/1/2011', periods=90, freq='D')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ts.head()
    # 2011-01-01   -1.025562
    # 2011-01-02    0.410895
    # 2011-01-03    0.660311
    # 2011-01-04    0.710293
    # 2011-01-05    0.444985
    # Freq: D, dtype: float64

    # Downsampling: daily values summed into monthly totals
    ts.resample('M').sum()
    # 2011-01-31    2.510102
    # 2011-02-28    0.583209
    # 2011-03-31    2.749411
    # Freq: M, dtype: float64

    # Downsampling: daily values summed into 3-day totals
    ts.resample('3D').sum()
    # 2011-01-01    0.045643
    # 2011-01-04   -2.255206
    # 2011-01-07    0.571142
    # 2011-01-10    0.835032
    # 2011-01-13   -0.396766
    # 2011-01-16   -1.156253
    # 2011-01-19   -1.286884
    # 2011-01-22    2.883952
    # 2011-01-25    1.566908
    # 2011-01-28    1.435563
    # 2011-01-31    0.311565
    # 2011-02-03   -2.541235
    # 2011-02-06    0.317075
    # 2011-02-09    1.598877
    # 2011-02-12   -1.950509
    # 2011-02-15    2.928312
    # 2011-02-18   -0.733715
    # 2011-02-21    1.674817
    # 2011-02-24   -2.078872
    # 2011-02-27    2.172320
    # 2011-03-02   -2.022104
    # 2011-03-05   -0.070356
    # 2011-03-08    1.276671
    # 2011-03-11   -2.835132
    # 2011-03-14   -1.384113
    # 2011-03-17    1.517565
    # 2011-03-20   -0.550406
    # 2011-03-23    0.773430
    # 2011-03-26    2.244319
    # 2011-03-29    2.951082
    # Freq: 3D, dtype: float64

    day3Ts = ts.resample('3D').mean()
    day3Ts
    # 2011-01-01    0.015214
    # 2011-01-04   -0.751735
    # 2011-01-07    0.190381
    # 2011-01-10    0.278344
    # 2011-01-13   -0.132255
    # 2011-01-16   -0.385418
    # 2011-01-19   -0.428961
    # 2011-01-22    0.961317
    # 2011-01-25    0.522303
    # 2011-01-28    0.478521
    # 2011-01-31    0.103855
    # 2011-02-03   -0.847078
    # 2011-02-06    0.105692
    # 2011-02-09    0.532959
    # 2011-02-12   -0.650170
    # 2011-02-15    0.976104
    # 2011-02-18   -0.244572
    # 2011-02-21    0.558272
    # 2011-02-24   -0.692957
    # 2011-02-27    0.724107
    # 2011-03-02   -0.674035
    # 2011-03-05   -0.023452
    # 2011-03-08    0.425557
    # 2011-03-11   -0.945044
    # 2011-03-14   -0.461371
    # 2011-03-17    0.505855
    # 2011-03-20   -0.183469
    # 2011-03-23    0.257810
    # 2011-03-26    0.748106
    # 2011-03-29    0.983694
    # Freq: 3D, dtype: float64

    # Upsampling: 3-day values spread onto a daily index; the new days are NaN
    print(day3Ts.resample('D').asfreq())
    # 2011-01-01    0.015214
    # 2011-01-02         NaN
    # 2011-01-03         NaN
    # 2011-01-04   -0.751735
    # 2011-01-05         NaN
    # 2011-01-06         NaN
    # 2011-01-07    0.190381
    # 2011-01-08         NaN
    # 2011-01-09         NaN
    # 2011-01-10    0.278344
    # 2011-01-11         NaN
    # 2011-01-12         NaN
    # 2011-01-13   -0.132255
    # 2011-01-14         NaN
    # 2011-01-15         NaN
    # 2011-01-16   -0.385418
    # 2011-01-17         NaN
    # 2011-01-18         NaN
    # 2011-01-19   -0.428961
    # 2011-01-20         NaN
    # 2011-01-21         NaN
    # 2011-01-22    0.961317
    # Freq: D, Length: 88, dtype: float64
- ffill: fill a missing value with the value before it
- bfill: fill a missing value with the value after it
- interpolate: fill missing values by linear interpolation
    # Forward fill, at most one step per gap
    day3Ts.resample('D').ffill(limit=1)
    # 2011-01-01    0.015214
    # 2011-01-02    0.015214
    # 2011-01-03         NaN
    # 2011-01-04   -0.751735
    # 2011-01-05   -0.751735
    # 2011-01-06         NaN
    # 2011-01-07    0.190381
    # 2011-01-08    0.190381
    # 2011-01-09         NaN
    # 2011-01-10    0.278344
    # 2011-01-11    0.278344

    # Backward fill, at most one step per gap
    day3Ts.resample('D').bfill(limit=1)
    # 2011-01-01    0.015214
    # 2011-01-02         NaN
    # 2011-01-03   -0.751735
    # 2011-01-04   -0.751735
    # 2011-01-05         NaN
    # 2011-01-06    0.190381
    # 2011-01-07    0.190381
    # 2011-01-08         NaN
    # 2011-01-09    0.278344
    # 2011-01-10    0.278344
    # 2011-01-11         NaN
    # 2011-01-12   -0.132255
    # 2011-01-13   -0.132255

    # Linear interpolation between the known 3-day values
    day3Ts.resample('D').interpolate('linear')
    # 2011-01-01    0.015214
    # 2011-01-02   -0.240435
    # 2011-01-03   -0.496085
    # 2011-01-04   -0.751735
    # 2011-01-05   -0.437697
    # 2011-01-06   -0.123658
    # 2011-01-07    0.190381
    # 2011-01-08    0.219702
    # 2011-01-09    0.249023
    # 2011-01-10    0.278344
    # 2011-01-11    0.141478
    # 2011-01-12    0.004611
    # 2011-01-13   -0.132255
    # 2011-01-14   -0.216643
    # 2011-01-15   -0.301030
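Because the series above is random, the effect of each fill method is hard to check by eye. The sketch below repeats the same three calls on a small hand-made series (the values 0, 3, 6 and the name `coarse` are made up for illustration), so the results can be verified by hand.

```python
import pandas as pd

# Three known values spaced 3 days apart.
coarse = pd.Series(
    [0.0, 3.0, 6.0],
    index=pd.date_range("2011-01-01", periods=3, freq="3D"),
)

# Upsample to daily frequency and fill the gaps in three different ways.
forward = coarse.resample("D").ffill(limit=1)        # copy the previous value, 1 step at most
backward = coarse.resample("D").bfill(limit=1)       # copy the next value, 1 step at most
linear = coarse.resample("D").interpolate("linear")  # straight line between known points

print(forward.tolist())   # [0.0, 0.0, nan, 3.0, 3.0, nan, 6.0]
print(backward.tolist())  # [0.0, nan, 3.0, 3.0, nan, 6.0, 6.0]
print(linear.tolist())    # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

With `limit=1`, each two-day gap is only half filled, which is why one NaN remains per gap for `ffill` and `bfill`.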
3 Sliding Windows
- Rolling window calculations
    %matplotlib inline
    import matplotlib.pylab
    import numpy as np
    import pandas as pd

    df = pd.Series(np.random.randn(600),
                   index=pd.date_range('7/1/2016', freq='D', periods=600))
    df.head()
    # 2016-07-01   -0.192140
    # 2016-07-02    0.357953
    # 2016-07-03   -0.201847
    # 2016-07-04   -0.372230
    # 2016-07-05    1.414753
    # Freq: D, dtype: float64

    r = df.rolling(window=10)
    # Also available: r.max, r.median, r.std, r.skew, r.sum, r.var
    # The first 9 results are NaN because a full 10-value window is not yet available
    print(r.mean())
    # 2016-07-01         NaN
    # 2016-07-02         NaN
    # 2016-07-03         NaN
    # 2016-07-04         NaN
    # 2016-07-05         NaN
    # 2016-07-06         NaN
    # 2016-07-07         NaN
    # 2016-07-08         NaN
    # 2016-07-09         NaN
    # 2016-07-10    0.300133
    # 2016-07-11    0.284780
    # 2016-07-12    0.252831
    # 2016-07-13    0.220699
    # 2016-07-14    0.167137
    # 2016-07-15    0.018593
    # 2016-07-16   -0.061414
    # 2016-07-17   -0.134593
    # 2016-07-18   -0.153333
    # 2016-07-19   -0.218928
    # 2016-07-20   -0.169426
    # 2016-07-21   -0.219747
    # 2016-07-22   -0.181266
    # 2016-07-23   -0.173674
    # 2016-07-24   -0.130629
    # 2016-07-25   -0.166730
    # 2016-07-26   -0.233044
    # 2016-07-27   -0.256642
    # 2016-07-28   -0.280738
    # 2016-07-29   -0.289893
    # 2016-07-30   -0.379625
    # ...
    # 2018-01-22   -0.211467
    # 2018-01-23    0.034996
    # 2018-01-24   -0.105910
    # 2018-01-25   -0.145774
    # 2018-01-26   -0.089320
    # 2018-01-27   -0.164370
    # 2018-01-28   -0.110892
    # 2018-01-29   -0.205786
    # 2018-01-30   -0.101162
    # 2018-01-31   -0.034760
    # 2018-02-01    0.229333
    # 2018-02-02    0.043741
    # 2018-02-03    0.052837
    # 2018-02-04    0.057746
    # 2018-02-05   -0.071401
    # 2018-02-06   -0.011153
    # 2018-02-07   -0.045737
    # 2018-02-08   -0.021983
    # 2018-02-09   -0.196715
    # 2018-02-10   -0.063721
    # 2018-02-11   -0.289452
    # 2018-02-12   -0.050946
    # 2018-02-13   -0.047014
    # 2018-02-14    0.048754
    # 2018-02-15    0.143949
    # 2018-02-16    0.424823
    # 2018-02-17    0.361878
    # 2018-02-18    0.363235
    # 2018-02-19    0.517436
    # 2018-02-20    0.368020
    # Freq: D, Length: 600, dtype: float64
- Visualization
    import matplotlib.pyplot as plt
    %matplotlib inline

    plt.figure(figsize=(15, 5))
    df.plot(style='r--')                          # raw series: red dashed line
    df.rolling(window=10).mean().plot(style='b')  # 10-day rolling mean: blue line
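The leading NaN values in the rolling mean come from windows that are not yet full. A minimal sketch on a five-value series (the values and the names `full` and `partial` are made up for illustration) shows this, along with the `min_periods` argument of `rolling`, which lets shorter windows at the start produce a result.

```python
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2016-07-01", periods=5, freq="D"),
)

# Default: a result is produced only once 3 values are available.
full = s.rolling(window=3).mean()

# min_periods=1: average whatever is available, up to 3 values.
partial = s.rolling(window=3, min_periods=1).mean()

print(full.tolist())     # [nan, nan, 2.0, 3.0, 4.0]
print(partial.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```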
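4 ARIMA Forecasting
- Data preprocessing
    import datetime

    import matplotlib.pylab as plt
    import pandas as pd
    import pandas_datareader
    import seaborn as sns
    from matplotlib.pylab import style
    # Note: statsmodels.tsa.arima_model was deprecated in statsmodels 0.12 and
    # removed in 0.13; newer versions provide ARIMA in statsmodels.tsa.arima.model.
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    style.use('ggplot')
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
    plt.rcParams['axes.unicode_minus'] = False    # keep minus signs readable with that font

    # Load the data, using the first column as a parsed date index
    stockFile = 'data/T10yr.csv'
    stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])
    stock.head(10)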
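    # Resample the closing values to weekly means (weeks ending on Monday)
    stock_week = stock['Close'].resample('W-MON').mean()
    # Use 2000 through 2015 as the training set
    stock_train = stock_week['2000':'2015']

    stock_train.plot(figsize=(12, 8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Stock Close")
    sns.despine()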
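To see how `resample('W-MON')` groups the days, the sketch below applies it to two weeks of made-up daily values (the series `daily` is illustrative, not the stock data). With pandas' default settings for weekly frequencies, each bin ends on a Monday and is labeled with that Monday.

```python
import pandas as pd

# Two weeks of daily values, Tuesday 2016-07-05 through Monday 2016-07-18.
daily = pd.Series(
    [float(v) for v in range(14)],
    index=pd.date_range("2016-07-05", periods=14, freq="D"),
)

weekly = daily.resample("W-MON").mean()

print(weekly)
# Each label is the Monday that closes the week:
#   2016-07-11 -> mean of 0..6  = 3.0
#   2016-07-18 -> mean of 7..13 = 10.0
```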
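    # First-order differencing: each value minus the previous one
    stock_diff = stock_train.diff()
    stock_diff = stock_diff.dropna()  # the first difference is undefined

    plt.figure()
    plt.plot(stock_diff)
    plt.title('First-Order Difference')
    plt.show()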
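First-order differencing replaces each value with its change from the previous one, which removes a linear trend. A minimal sketch on made-up numbers (the series `s` and the name `restored` are illustrative) also shows how the original levels can be recovered from the differences:

```python
import pandas as pd

# A series with a perfectly linear trend: it rises by 2 every week.
s = pd.Series(
    [10.0, 12.0, 14.0, 16.0, 18.0],
    index=pd.date_range("2016-07-04", periods=5, freq="W-MON"),
)

# First-order difference: s[t] - s[t-1]; the first entry is NaN and is dropped.
diff = s.diff().dropna()

# Undo the differencing: cumulative sum of the changes plus the starting level.
restored = diff.cumsum() + s.iloc[0]

print(diff.tolist())      # [2.0, 2.0, 2.0, 2.0]
print(restored.tolist())  # [12.0, 14.0, 16.0, 18.0]
```

The differenced series is constant here, so the trend is gone; this is the role of d = 1 in an ARIMA order.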
    acf = plot_acf(stock_diff, lags=20)
    plt.title("ACF")
    acf.show()
    pacf = plot_pacf(stock_diff, lags=20)
    plt.title("PACF")
    pacf.show()
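The bars in an ACF plot are sample autocorrelations: the correlation between the series and a copy of itself shifted by k steps. The sketch below computes that quantity with NumPy only. `sample_acf` is a hypothetical helper written for illustration, not a statsmodels function, and it assumes the standard estimator that divides by the lag-0 sum of squares.

```python
import numpy as np


def sample_acf(x, nlags):
    """Return the sample autocorrelations r_0 .. r_nlags of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)  # lag-0 sum of squares
    n = len(x)
    return np.array(
        [np.dot(centered[k:], centered[: n - k]) / denom for k in range(nlags + 1)]
    )


# An alternating series: every value is the opposite of the previous one.
x = [1.0, -1.0] * 10
r = sample_acf(x, nlags=2)

print(r)  # lag 0 = 1.0, lag 1 = -0.95, lag 2 = 0.9
```

A common heuristic is to read a candidate moving-average order q from where the ACF cuts off and a candidate autoregressive order p from where the PACF cuts off; treat this as a starting point rather than a rule.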
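    # order=(p, d, q): 1 autoregressive term, first-order differencing, 1 moving-average term
    model = ARIMA(stock_train, order=(1, 1, 1), freq='W-MON')
    result = model.fit()
    # print(result.summary())

    # typ='levels' returns predictions on the original scale rather than as differences.
    # (The newer statsmodels.tsa.arima.model.ARIMA predicts on the original scale by default.)
    pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')
    print(pred)
    # 2014-06-09    2.463559
    # 2014-06-16    2.455539
    # 2014-06-23    2.449569
    # 2014-06-30    2.444183
    # 2014-07-07    2.438962
    # 2014-07-14    2.433788
    # 2014-07-21    2.428627
    # 2014-07-28    2.423470
    # 2014-08-04    2.418315
    # 2014-08-11    2.413159
    # 2014-08-18    2.408004
    # 2014-08-25    2.402849
    # 2014-09-01    2.397693
    # 2014-09-08    2.392538
    # 2014-09-15    2.387383

    # Plot the predictions together with the training data
    plt.figure(figsize=(6, 6))
    plt.xticks(rotation=45)
    plt.plot(pred)
    plt.plot(stock_train)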
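For reference, the model selected by `order=(1, 1, 1)` can be written out as follows. The symbols are notation introduced here rather than taken from the original post: $y_t$ is the weekly series, $w_t$ its first difference, $\varepsilon_t$ white noise, $c$ an optional constant, and $\phi_1$, $\theta_1$ the fitted AR and MA coefficients. The exact way the constant is parameterized depends on the library version.

```latex
\begin{aligned}
w_t &= y_t - y_{t-1}
    && \text{(first-order difference, } d = 1\text{)} \\
w_t &= c + \phi_1 w_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
    && \text{(ARMA(1,1) on the differences, } p = 1,\ q = 1\text{)} \\
\hat{y}_{t+1} &= y_t + \hat{w}_{t+1}
    && \text{(forecast mapped back to levels)}
\end{aligned}
```

The last line is why predictions on the original scale are obtained by adding the forecast differences back onto the last known level.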
5 Summary
These notes were put together for easy review; the content is rough, so please bear with me.
