Data cleaning is an important task that comes before the algorithm stage in machine learning and deep learning. These are the 7 steps I usually follow:

  • Step 1: read csv

  • Step 2: preview data

  • Step 3: check null values for every column

  • Step 4: complete null values

  • Step 5: feature engineering

    • Step 5.1: delete some features

    • Step 5.2: create new features

  • Step 6: encode categorical columns

    • Step 6.1: Sklearn LabelEncoder

    • Step 6.2: Pandas get_dummies

  • Step 7: check the cleaned data

Today I use the Titanic dataset to walk through all 7 steps in detail.

1 Read the data

No surprise here: the first step is to read the data.

import pandas as pd

data_raw = pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw

Result:

     PassengerId  Survived  Pclass  Name                                               Sex      Age  SibSp  Parch  Ticket               Fare  Cabin  Embarked
0              1         0       3  Braund, Mr. Owen Harris                            male    22.0      1      0  A/5 21171          7.2500  NaN    S
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599          71.2833  C85    C
2              3         1       3  Heikkinen, Miss. Laina                             female  26.0      0      0  STON/O2. 3101282   7.9250  NaN    S
3              4         1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  113803            53.1000  C123   S
4              5         0       3  Allen, Mr. William Henry                           male    35.0      0      0  373450             8.0500  NaN    S
...
886          887         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  211536            13.0000  NaN    S
887          888         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  112053            30.0000  B42    S
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"           female   NaN      1      2  W./C. 6607        23.4500  NaN    S
889          890         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  111369            30.0000  C148   C
890          891         0       3  Dooley, Mr. Patrick                                male    32.0      0      0  370376             7.7500  NaN    Q
891 rows × 12 columns
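
Before pointing read_csv at the real file, it can help to try it on a tiny in-memory CSV. The sketch below is only an illustration: the second passenger is made up, and io.StringIO stands in for the file on disk. It shows that a quoted name containing a comma stays in one column and that an empty field is read as NaN.

```python
import io

import pandas as pd

# Two-row stand-in for train.csv; the second row is hypothetical
csv_text = """PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
2,1,"Doe, Miss. Jane",
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)                 # (rows, columns)
print(toy["Age"].isnull().sum()) # number of missing ages
print(toy.loc[0, "Name"])        # the quoted name keeps its comma
```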

2 Preview the data

data_raw.info()
data_raw.describe(include='all')

Result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
        PassengerId    Survived      Pclass                            Name   Sex         Age       SibSp       Parch  Ticket        Fare  Cabin  Embarked
count    891.000000  891.000000  891.000000                             891   891  714.000000  891.000000  891.000000     891  891.000000    204       889
unique          NaN         NaN         NaN                             891     2         NaN         NaN         NaN     681         NaN    147         3
top             NaN         NaN         NaN  Hakkarainen, Mr. Pekka Pietari  male         NaN         NaN         NaN    1601         NaN     G6         S
freq            NaN         NaN         NaN                               1   577         NaN         NaN         NaN       7         NaN      4       644
mean     446.000000    0.383838    2.308642                             NaN   NaN   29.699118    0.523008    0.381594     NaN   32.204208    NaN       NaN
std      257.353842    0.486592    0.836071                             NaN   NaN   14.526497    1.102743    0.806057     NaN   49.693429    NaN       NaN
min        1.000000    0.000000    1.000000                             NaN   NaN    0.420000    0.000000    0.000000     NaN    0.000000    NaN       NaN
25%      223.500000    0.000000    2.000000                             NaN   NaN   20.125000    0.000000    0.000000     NaN    7.910400    NaN       NaN
50%      446.000000    0.000000    3.000000                             NaN   NaN   28.000000    0.000000    0.000000     NaN   14.454200    NaN       NaN
75%      668.500000    1.000000    3.000000                             NaN   NaN   38.000000    1.000000    0.000000     NaN   31.000000    NaN       NaN
max      891.000000    1.000000    3.000000                             NaN   NaN   80.000000    8.000000    6.000000     NaN  512.329200    NaN       NaN

3 Check for null values

# Work on a deep copy so data_raw stays untouched
data1 = data_raw.copy(deep=True)
data1.isnull().sum()

Result:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The Age column has 177 nulls and Cabin has 687. With only 891 rows in total, Cabin probably has little value left! Embarked has 2.
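
Raw counts are easier to judge as a share of the rows: with the numbers above, Cabin is about 77% null (687 of 891) and Age about 20% (177 of 891). A minimal sketch of that calculation follows, using a made-up toy frame rather than the real data; the names toy and summary are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

null_count = toy.isnull().sum()    # nulls per column
null_ratio = toy.isnull().mean()   # share of rows that are null per column

summary = pd.DataFrame({"null_count": null_count, "null_ratio": null_ratio})
print(summary.sort_values("null_ratio", ascending=False))
```

Sorting by the ratio puts the columns that are candidates for dropping at the top.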

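4 Fill in null values

# Fill Age with the median and Embarked with the most frequent value.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection.
data1['Age'] = data1['Age'].fillna(data1['Age'].median())
data1['Embarked'] = data1['Embarked'].fillna(data1['Embarked'].mode()[0])
data1.isnull().sum()

Check after filling: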

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
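
Why the median for a numeric column and the mode for a categorical one? The post does not say, but a common reason is that the median is less affected by extreme values than the mean, and a categorical column has no median at all. The sketch below shows both fills on a made-up toy frame; it is an illustration, not the code used above.

```python
import numpy as np
import pandas as pd

# Made-up values; 80.0 plays the role of an outlier
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 80.0],
    "Embarked": ["S", "C", None, "S", "Q"],
})

age_mean = toy["Age"].mean()                # pulled upward by the outlier
age_median = toy["Age"].median()            # NaN is ignored by default
embarked_mode = toy["Embarked"].mode()[0]   # most frequent category

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

print(age_mean, age_median, embarked_mode)
print(toy.isnull().sum())
```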

5 Feature engineering

5.1 Drop 3 columns:

drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)

5.2 Add 3 columns

Add a FamilySize column:

# Siblings/spouses + parents/children + the passenger
data1['FamilySize'] = data1['SibSp'] + data1['Parch'] + 1
data1

Printed result:


     Survived  Pclass  Name                                               Sex      Age  SibSp  Parch     Fare  Embarked  FamilySize
0           0       3  Braund, Mr. Owen Harris                            male    22.0      1      0   7.2500  S                  2
1           1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  71.2833  C                  2
2           1       3  Heikkinen, Miss. Laina                             female  26.0      0      0   7.9250  S                  1
3           1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  53.1000  S                  2
4           0       3  Allen, Mr. William Henry                           male    35.0      0      0   8.0500  S                  1
...
886         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  13.0000  S                  1
887         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  30.0000  S                  1
888         0       3  Johnston, Miss. Catherine Helen "Carrie"           female  28.0      1      2  23.4500  S                  4
889         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  30.0000  C                  1
890         0       3  Dooley, Mr. Patrick                                male    32.0      0      0   7.7500  Q                  1
891 rows × 10 columns

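Create another column:

import numpy as np

# IsAlone is 1 when the passenger has no family aboard, otherwise 0
data1['IsAlone'] = np.where(data1['FamilySize'] > 1, 0, 1)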
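
To see what these two derived columns look like, here is a small sketch on three made-up rows. The alternative spelling at the end (a boolean comparison cast to int) is my own assumption of an equivalent form, not something from the original code.

```python
import numpy as np
import pandas as pd

# Three made-up passengers
toy = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 2]})

# Same formulas as in the post
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = np.where(toy["FamilySize"] > 1, 0, 1)

# Equivalent pandas-only spelling of IsAlone (assumed alternative)
is_alone_alt = (toy["FamilySize"] == 1).astype(int)

print(toy)
```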

Create one more column:

# Title is the text between ", " and the first "." in Name
data1['Title'] = data1['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
data1

Result:

     Survived  Pclass  Name                                               Sex      Age  SibSp  Parch     Fare  Embarked  FamilySize  IsAlone  Title
0           0       3  Braund, Mr. Owen Harris                            male    22.0      1      0   7.2500  S                  2        0  Mr
1           1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  71.2833  C                  2        0  Mrs
2           1       3  Heikkinen, Miss. Laina                             female  26.0      0      0   7.9250  S                  1        1  Miss
3           1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  53.1000  S                  2        0  Mrs
4           0       3  Allen, Mr. William Henry                           male    35.0      0      0   8.0500  S                  1        1  Mr
...
886         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  13.0000  S                  1        1  Rev
887         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  30.0000  S                  1        1  Miss
888         0       3  Johnston, Miss. Catherine Helen "Carrie"           female  28.0      1      2  23.4500  S                  4        0  Miss
889         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  30.0000  C                  1        1  Mr
890         0       3  Dooley, Mr. Patrick                                male    32.0      0      0   7.7500  Q                  1        1  Mr
891 rows × 12 columns
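
The double split used for Title works because every name has the form "Surname, Title. Given names". As a sketch of an alternative (my own assumption, not the post's method), the same piece can be pulled out with a single regular expression via str.extract. The three names below are taken from the table above.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# The post's approach: split on ", " and then on "."
split_title = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# Alternative sketch: capture everything between the comma and the first period
regex_title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(split_title.tolist())
print(regex_title.tolist())
```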

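5.3 Binning

# qcut: 4 quantile-based bins, each holding roughly the same number of rows
data1['FareCut'] = pd.qcut(data1['Fare'], 4)
# cut: 6 equal-width bins over the range of the integer ages
data1['AgeCut'] = pd.cut(data1['Age'].astype(int), 6)
data1

Result: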

     Survived  Pclass  Name                                               Sex      Age  SibSp  Parch     Fare  Embarked  FamilySize  IsAlone  Title  FareCut          AgeCut
0           0       3  Braund, Mr. Owen Harris                            male    22.0      1      0   7.2500  S                  2        0  Mr     (-0.001, 7.91]   (13.333, 26.667]
1           1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  71.2833  C                  2        0  Mrs    (31.0, 512.329]  (26.667, 40.0]
2           1       3  Heikkinen, Miss. Laina                             female  26.0      0      0   7.9250  S                  1        1  Miss   (7.91, 14.454]   (13.333, 26.667]
3           1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0      1      0  53.1000  S                  2        0  Mrs    (31.0, 512.329]  (26.667, 40.0]
4           0       3  Allen, Mr. William Henry                           male    35.0      0      0   8.0500  S                  1        1  Mr     (7.91, 14.454]   (26.667, 40.0]
...
886         0       2  Montvila, Rev. Juozas                              male    27.0      0      0  13.0000  S                  1        1  Rev    (7.91, 14.454]   (26.667, 40.0]
887         1       1  Graham, Miss. Margaret Edith                       female  19.0      0      0  30.0000  S                  1        1  Miss   (14.454, 31.0]   (13.333, 26.667]
888         0       3  Johnston, Miss. Catherine Helen "Carrie"           female  28.0      1      2  23.4500  S                  4        0  Miss   (14.454, 31.0]   (26.667, 40.0]
889         1       1  Behr, Mr. Karl Howell                              male    26.0      0      0  30.0000  C                  1        1  Mr     (14.454, 31.0]   (13.333, 26.667]
890         0       3  Dooley, Mr. Patrick                                male    32.0      0      0   7.7500  Q                  1        1  Mr     (-0.001, 7.91]   (26.667, 40.0]
891 rows × 14 columns
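
The two binning calls behave differently, which the bin edges above already hint at: the FareCut edges (7.91, 14.454, 31.0) are the 25%, 50% and 75% values from describe(), while the AgeCut edges step by 13.333, which is 80 / 6. The sketch below makes the difference visible on a made-up, skewed series; the values are chosen only for illustration.

```python
import pandas as pd

# Made-up skewed values: seven small numbers and one large outlier
fares = pd.Series([0, 1, 2, 3, 4, 5, 6, 100])

equal_freq = pd.qcut(fares, 4)   # edges at the quartiles
equal_width = pd.cut(fares, 4)   # edges evenly spaced over the range

freq_counts = equal_freq.value_counts().sort_index()
width_counts = equal_width.value_counts().sort_index()

print(freq_counts)    # every bin holds the same number of values
print(width_counts)   # almost everything lands in the first bin
```

With skewed data such as fares, quantile bins keep every bin populated, whereas equal-width bins can leave some bins empty.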

6 Encoding

6.1 The LabelEncoder approach

Use Sklearn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# Each fit_transform call refits the encoder on that column's values
data1['Sex_Code'] = label.fit_transform(data1['Sex'])
data1['Embarked_Code'] = label.fit_transform(data1['Embarked'])
data1['Title_Code'] = label.fit_transform(data1['Title'])
data1['AgeBin_Code'] = label.fit_transform(data1['AgeCut'])
data1['FareBin_Code'] = label.fit_transform(data1['FareCut'])
data1

Select certain columns from the resulting data1 and the model can finally recognize them. That took some effort!
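
One thing worth knowing about the code above: a single LabelEncoder is refit on every call, so afterwards it only remembers the last column it saw. If the mapping needs to be inspected or reversed later, one option is to keep one encoder per column. The sketch below assumes that design (the variable names are hypothetical) and uses made-up values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["male", "female", "female", "male"])
embarked = pd.Series(["S", "C", "Q", "S"])

# One encoder per column, so each mapping stays available
sex_encoder = LabelEncoder()
embarked_encoder = LabelEncoder()

sex_code = sex_encoder.fit_transform(sex)
embarked_code = embarked_encoder.fit_transform(embarked)

# classes_ holds the sorted labels; a label's position is its code
sex_mapping = {label: code for code, label in enumerate(sex_encoder.classes_)}
decoded = sex_encoder.inverse_transform(sex_code)

print(sex_mapping)
print(list(decoded))
```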

6.2 The get_dummies approach

get_dummies turns a long DataFrame into a wide one, with one indicator column per category:

pd.get_dummies(data1['Sex'])

Result:

     female  male
0         0     1
1         1     0
2         1     0
3         1     0
4         0     1
...
886       0     1
887       1     0
888       1     0
889       0     1
890       0     1
891 rows × 2 columns

LabelEncoder, by contrast, simply encodes female as 0 and male as 1.

# Show the column produced by label.fit_transform(data1['Sex']) in 6.1
data1['Sex_Code']

0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex_Code, Length: 891, dtype: int64
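
The difference matters more once a column has three or more categories, because integer codes suggest an order (0 < 1 < 2) that the categories may not have. The sketch below contrasts the two encodings on made-up Embarked values. To stay pandas-only it uses cat.codes as a stand-in for LabelEncoder; that swap is my assumption, chosen because both number the sorted labels from 0.

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# One-hot: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(embarked, prefix="Embarked", dtype=int)

# Integer codes: a single column, categories numbered in sorted order
codes = embarked.astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```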

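7 Check again

# Step 7: data cleaning check
# data1_x_alg and data1_dummy are not defined in the post; they are reconstructed here from the output below
data1_x_alg = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'SibSp', 'Parch', 'Age', 'Fare']
data1_dummy = pd.get_dummies(data1[['Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex', 'Embarked', 'Title']])
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()

Result: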

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Sex_Code         891 non-null int64
Pclass           891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
SibSp            891 non-null int64
Parch            891 non-null int64
Age              891 non-null float64
Fare             891 non-null float64
dtypes: float64(2), int64(6)
memory usage: 55.8 KB
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 29 columns):
Pclass                891 non-null int64
SibSp                 891 non-null int64
Parch                 891 non-null int64
Age                   891 non-null float64
Fare                  891 non-null float64
FamilySize            891 non-null int64
IsAlone               891 non-null int64
Sex_female            891 non-null uint8
Sex_male              891 non-null uint8
Embarked_C            891 non-null uint8
Embarked_Q            891 non-null uint8
Embarked_S            891 non-null uint8
Title_Capt            891 non-null uint8
Title_Col             891 non-null uint8
Title_Don             891 non-null uint8
Title_Dr              891 non-null uint8
Title_Jonkheer        891 non-null uint8
Title_Lady            891 non-null uint8
Title_Major           891 non-null uint8
Title_Master          891 non-null uint8
Title_Miss            891 non-null uint8
Title_Mlle            891 non-null uint8
Title_Mme             891 non-null uint8
Title_Mr              891 non-null uint8
Title_Mrs             891 non-null uint8
Title_Ms              891 non-null uint8
Title_Rev             891 non-null uint8
Title_Sir             891 non-null uint8
Title_the Countess    891 non-null uint8
dtypes: float64(2), int64(5), uint8(22)
memory usage: 68.0 KB
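
Reading info() by eye works for a handful of columns. As a sketch of how the same check could be automated, the hypothetical helper below (check_clean is my own name, not from the post) reports every column that still contains nulls or is not numeric. It is shown on made-up frames.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_clean(df):
    """Return a list of problems; an empty list means every column is numeric and null-free."""
    problems = []
    for col in df.columns:
        if df[col].isnull().any():
            problems.append(f"{col}: contains nulls")
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col}: not numeric")
    return problems


# Made-up examples: one clean frame, one with a null and a text column
clean = pd.DataFrame({"Age": [22.0, 28.0], "Sex_Code": [1, 0]})
dirty = pd.DataFrame({"Age": [22.0, None], "Sex": ["male", "female"]})

print(check_clean(clean))
print(check_clean(dirty))
```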