Missing, Messy, or Duplicated Data? The Most Complete Data Cleaning Guide to Make You Unstoppable (Part 6)
2023-05-21 Source: 飞速影视
    for col in numeric_cols:
        missing = df[col].isnull()
        num_missing = np.sum(missing)

        if num_missing > 0:  # only do the imputation for the columns that have missing values.
            print("imputing missing values for: {}".format(col))
            df["{}_ismissing".format(col)] = missing
            med = df[col].median()
            df[col] = df[col].fillna(med)
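Fortunately, the categorical features in the dataset used in this article have no missing values. If they did, we could apply a mode imputation strategy (filling with the most frequent value) to all categorical features at once.

    # impute the missing values and create the missing value indicator variables for each non-numeric column.
    df_non_numeric = df.select_dtypes(exclude=[np.number])
    non_numeric_cols = df_non_numeric.columns.values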
    for col in non_numeric_cols:
        missing = df[col].isnull()
        num_missing = np.sum(missing)

        if num_missing > 0:  # only do the imputation for the columns that have missing values.
            print("imputing missing values for: {}".format(col))
            df["{}_ismissing".format(col)] = missing

            top = df[col].describe()["top"]  # impute with the most frequent value.
            df[col] = df[col].fillna(top)
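Since the article's dataset has no missing categorical values, the loop above never actually fills anything. To see the mode strategy in action, here is a minimal sketch on a small made-up table. The DataFrame, its column names (`city`, `rooms`), and its values are hypothetical and are not taken from the article's dataset. The sketch also uses `Series.mode()` in place of `describe()["top"]`; both give a most frequent value, but this is a substitution, not the article's original call.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the article's dataset).
df = pd.DataFrame({
    "city": ["Moscow", None, "Moscow", "Kazan", None],
    "rooms": [2, 3, 1, 2, 4],
})

# Select the non-numeric (categorical) columns before adding any new columns.
non_numeric_cols = df.select_dtypes(exclude="number").columns

for col in non_numeric_cols:
    missing = df[col].isnull()
    if missing.any():  # only touch columns that actually have missing values
        # Keep a record of which rows were originally missing.
        df[f"{col}_ismissing"] = missing
        # mode() ignores NaN by default and returns a Series; take the first entry.
        top = df[col].mode().iloc[0]
        df[col] = df[col].fillna(top)

print(df)
```

In this toy table the two missing `city` entries are filled with "Moscow", and a `city_ismissing` column marks the rows that were imputed. When several values tie for most frequent, `mode()` returns all of them in sorted order, so `.iloc[0]` picks the first one; whether that tie-breaking is acceptable depends on the data.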