Working Through Kaggle - Bike Sharing Demand

The Bike Sharing Demand problem

The code draws heavily on a YouTube course and on publicly shared kernels.

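The plan is to:

1. First skim a rough overview of the problem,

2. Make a first prediction and submission using a common, simple method, and then

3. Go over the analysis process and the algorithm.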

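To borrow a cooking analogy: rather than starting from an understanding of the ingredients and recipes, then learning how to use the tools, and only then cooking, it is sometimes more fun (and tastier) to eat what someone else has made, try the plating yourself, and only afterwards think about the ingredients and cook the dish. So I will study in the order submit first, analyze later.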

1. Analyzing the problem

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

In short, the task is to predict bike rental demand in Washington, D.C. from historical usage patterns and weather data.

2. Analyzing the data

Download all the CSV files and look at the data fields.

datetime - hourly date + timestamp
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - whether the day is considered a holiday
workingday - whether the day is neither a weekend nor holiday
weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
          2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
          3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
          4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated

count - number of total rentals
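To get a feel for these fields, here is a minimal sketch that parses a few rows in the same column layout with pandas. The three rows are made up for illustration and are not taken from the real train.csv; with the real file you would pass its path to pd.read_csv instead of the in-memory string.

```python
import io

import pandas as pd

# Made-up rows in the train.csv column layout (NOT real competition data).
SAMPLE_CSV = """datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,10.0,12.5,80,0.0,2,10,12
2011-01-01 01:00:00,1,0,0,1,9.5,12.0,78,6.0,4,20,24
2011-01-01 02:00:00,1,0,0,2,9.0,11.5,75,7.0,1,15,16
"""

# parse_dates turns the datetime strings into real timestamps.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["datetime"])

# One dtype per field: a timestamp column, integer codes, and float readings.
print(sample.dtypes)

# count is described as the total, so it should equal casual + registered.
totals_match = (sample["casual"] + sample["registered"] == sample["count"]).all()
print(totals_match)
```

Checking that casual + registered adds up to count is a cheap way to confirm the file was read with the columns in the expected order.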

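Roughly speaking, there are fields for time, season, holiday/working day, weather, temperature, humidity, wind speed, and whether the renter is a registered member, and count at the bottom tells us how many rentals there were.

I opened the downloaded train.csv in Excel and ran pivot tables to see how the fields relate to the number of rentals.

My rough guess is that people rent more bikes when the weather is good.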

Season	Rentals	Share
Spring	312,498	14.98%
Summer	588,282	28.21%
Fall	640,662	30.72%
Winter	544,034	26.09%

Weather	Rentals	Share
Clear	1,476,063	70.78%
Cloudy	507,160	24.32%
Light snow/rain	102,089	4.90%
Heavy snow/rain	164	0.01%
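The same kind of pivot can be done in pandas instead of Excel. The sketch below is only an illustration: rental_share is a helper name I made up, and the toy frame holds invented numbers rather than the real train.csv, so its output will not match the tables above.

```python
import pandas as pd


def rental_share(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Total rentals per group and each group's share of all rentals (in %)."""
    totals = df.groupby(key)["count"].sum()
    share = (totals / totals.sum() * 100).round(2)
    return pd.DataFrame({"rentals": totals, "share_pct": share})


# Tiny made-up frame standing in for train.csv (hypothetical values).
toy = pd.DataFrame({
    "season": [1, 1, 2, 3, 3, 4],
    "weather": [1, 2, 1, 1, 3, 1],
    "count": [10, 20, 50, 60, 20, 40],
})

by_season = rental_share(toy, "season")
by_weather = rental_share(toy, "weather")
print(by_season)
print(by_weather)
```

With the real data loaded into a frame named train, calling rental_share(train, "season") and rental_share(train, "weather") should give the two pivots.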

Temperature probably has an effect as well.
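One rough way to check the temperature hunch is to bucket temp into bands and compare the average count per band, or to look at the correlation between the two columns. This is a sketch on invented numbers, not a result from the real data; the 10-degree band edges are an arbitrary choice of mine.

```python
import pandas as pd

# Made-up hourly rows (hypothetical values, not real competition data).
toy = pd.DataFrame({
    "temp": [5.0, 8.0, 15.0, 18.0, 25.0, 28.0],
    "count": [10, 20, 60, 80, 150, 170],
})

# Bucket temperature into 10-degree bands and average rentals per band.
bands = pd.cut(toy["temp"], bins=[0, 10, 20, 30])
mean_by_band = toy.groupby(bands, observed=True)["count"].mean()
print(mean_by_band)

# A single-number summary: Pearson correlation between temp and count.
corr = toy["temp"].corr(toy["count"])
print(round(corr, 3))
```

A correlation near zero on the real data would argue against the hunch, while a clearly positive value would support using temp as a feature.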

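Some rough relationships have emerged, but honestly I have no idea how to turn them into a prediction of rental demand for a given date and time.

So let's at least make a submission, whatever the accuracy turns out to be.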

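3. Submitting first

Download all the CSV files, then on the Kernels tab click New Kernel -> Notebook.

A notebook editor opens, and the code goes in there.

Before doing any analysis or trying to understand anything, I used the code below to create a file for submission.

The explanation of the code and of how to submit continues in the next post.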

import warnings

import numpy as np
import pandas as pd

# Plotting libraries; not needed for this first submission.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# Notebook magic: render plots inline (Jupyter/Kaggle notebooks only).
%matplotlib inline

warnings.filterwarnings('ignore')

# Load the data, parsing the datetime column into real timestamps.
train = pd.read_csv("../input/train.csv", parse_dates=["datetime"])
test = pd.read_csv("../input/test.csv", parse_dates=["datetime"])

# Split datetime into its parts, identically for train and test.
for df in (train, test):
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    df["minute"] = df["datetime"].dt.minute
    df["second"] = df["datetime"].dt.second
    df["dayofweek"] = df["datetime"].dt.dayofweek

# Mark the discrete columns as categorical.
categorical_feature_names = ["season", "holiday", "workingday", "weather",
                             "dayofweek", "month", "year", "hour"]

for var in categorical_feature_names:
    train[var] = train[var].astype("category")
    test[var] = test[var].astype("category")

# Columns the model is trained on.
feature_names = ["season", "weather", "temp", "atemp", "humidity", "windspeed",
                 "year", "hour", "dayofweek", "holiday", "workingday"]

X_train = train[feature_names]
X_test = test[feature_names]

# The target is the total number of rentals.
label_name = "count"
y_train = train[label_name]

# Random forest with fixed settings; no tuning yet.
model = RandomForestRegressor(n_estimators=100,
                              n_jobs=-1,
                              random_state=0)

model.fit(X_train, y_train)

predictions = model.predict(X_test)

# Fill the sample submission with the predictions and save it.
submission = pd.read_csv("../input/sampleSubmission.csv")
submission["count"] = predictions
submission.to_csv("submission.csv", index=False)
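Even when submitting first, it can help to hold back part of the training data and score the model locally. The sketch below assumes the leaderboard metric is RMSLE (root mean squared logarithmic error), which is my understanding of the competition's evaluation page; the rmsle helper and the synthetic data are my own stand-ins, not the real train.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (assumed competition metric)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Synthetic stand-in for train.csv (hypothetical data, NOT the real file):
# demand rises with temperature and peaks around 17:00.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "hour": rng.integers(0, 24, size=n),
    "temp": rng.uniform(0, 35, size=n),
})
toy["count"] = (
    5 * toy["temp"]
    + 100 * np.exp(-((toy["hour"] - 17) ** 2) / 18)
    + rng.uniform(0, 10, size=n)
).round().astype(int)

# Hold back 20% of the rows for validation.
X_fit, X_valid, y_fit, y_valid = train_test_split(
    toy[["hour", "temp"]], toy["count"], test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

valid_score = rmsle(y_valid, model.predict(X_valid))
print(f"holdout RMSLE: {valid_score:.3f}")
```

A random split like this is only a rough check, since neighboring hours end up on both sides of the split, but it gives a number to compare against before spending a submission.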
 

License: Attribution, NonCommercial, NoDerivatives
