Common pandas preprocessing methods

news/2024/7/7 9:55:21
  1. Computing a mean when the column contains missing values:

    # mean_age comes out as nan: any arithmetic that involves a null value also yields a null value.
    # titanic_survival is assumed to be a DataFrame that has already been loaded.
    mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])
    print(mean_age)
    

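A minimal sketch of why the result is nan, using a made-up three-row DataFrame named `toy` (the values are illustrative and not taken from the Titanic data):

```python
import math

import pandas as pd

# Hypothetical toy data: one of the three ages is missing.
toy = pd.DataFrame({"Age": [22.0, None, 30.0]})

# Python's built-in sum() adds the values one by one, so the NaN propagates
# into the total and from there into the quotient.
naive_mean = sum(toy["Age"]) / len(toy["Age"])

print(math.isnan(naive_mean))
```

Note that `len()` also counts the missing row, so even a NaN-free total would be divided by the wrong number of passengers.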

  2. The correct mean

    import pandas as pd

    age = titanic_survival["Age"]
    # print(age.loc[0:10])
    # Boolean mask: True wherever Age is missing.
    age_is_null = pd.isnull(age)
    # Filter out the missing values before computing the mean.
    good_ages = age[~age_is_null]
    # print(good_ages)
    correct_mean_age = sum(good_ages) / len(good_ages)
    print(correct_mean_age)
    

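The masking step can be tried in isolation. The sketch below uses a hypothetical five-element Series and compares the boolean-mask filter with `Series.dropna()`, which should select the same rows:

```python
import pandas as pd

# Hypothetical toy column with two missing entries.
age = pd.Series([22.0, None, 30.0, None, 40.0])

is_null = age.isnull()          # True where the value is missing
kept_by_mask = age[~is_null]    # boolean-mask filtering, as in the step above
kept_by_dropna = age.dropna()   # the built-in shortcut

print(int(is_null.sum()))                   # how many values are missing
print(kept_by_mask.equals(kept_by_dropna))  # whether both filters agree
```

Both filters keep the original index labels, so the surviving rows can still be matched back to the full table.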

  3. mean()

    # Missing data is so common that many pandas methods skip it automatically.
    correct_mean_age = titanic_survival["Age"].mean()
    print(correct_mean_age)
    

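The skipping behaviour is controlled by the `skipna` parameter of `Series.mean()`, which defaults to `True`. A small sketch on made-up values shows both settings side by side:

```python
import math

import pandas as pd

# Hypothetical toy column: two known ages and one missing value.
age = pd.Series([22.0, None, 30.0])

mean_skipping_nan = age.mean()             # skipna=True is the default
mean_keeping_nan = age.mean(skipna=False)  # the NaN propagates, as with the manual sum

print(mean_skipping_nan)
print(mean_keeping_nan)
```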

  4. Computing the mean for each class

    # Mean fare for each passenger class.
    passenger_classes = [1, 2, 3]
    fares_by_class = {}
    for this_class in passenger_classes:
        # Select the rows that belong to this class, then average their fares.
        pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
        pclass_fares = pclass_rows["Fare"]
        fare_for_class = pclass_fares.mean()
        fares_by_class[this_class] = fare_for_class
    print(fares_by_class)
    

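The explicit loop can usually be replaced by `groupby`. The sketch below is an alternative to the loop, not the original author's method, and runs on a made-up five-row table:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic table.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 12.0],
})

# Group the rows by class, average the fares in each group,
# and convert the resulting Series to a plain dict.
fares_by_class = toy.groupby("Pclass")["Fare"].mean().to_dict()
print(fares_by_class)
```

A practical difference is that `groupby` discovers the classes from the data, so there is no hard-coded `[1, 2, 3]` list to keep in sync.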

  5. Pivot table: survival rate per class

    # index tells the method which column to group by.
    # values is the column the calculation is applied to.
    # aggfunc specifies the calculation to perform; the string "mean" avoids
    # needing numpy and the warning recent pandas versions emit for np.mean.
    passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
    print(passenger_survival)
    

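Because `Survived` is coded as 0 or 1, its mean within a group is the fraction of that group that survived. A sketch on made-up rows (the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical toy data: 2 passengers in class 1, 2 in class 2, 4 in class 3.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# One row per class; the single column holds the mean of Survived.
survival_rate = toy.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
print(survival_rate)
```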

  6. Mean age per class

    # aggfunc is omitted here; pivot_table defaults to the mean.
    passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")
    print(passenger_age)
    


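  7. Relating one grouping column to two value columns

    # Total fare and number of survivors for each port of embarkation.
    port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
    print(port_stats)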
    

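When `values` is a list, the pivot table gets one column per listed name. A sketch with made-up rows, assuming the same column names as the Titanic table:

```python
import pandas as pd

# Hypothetical toy data: three ports, five passengers.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "C"],
    "Fare":     [10.0, 20.0, 50.0, 8.0, 30.0],
    "Survived": [0, 1, 1, 0, 1],
})

# Sum both value columns within each port.
port_stats = toy.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc="sum")
print(port_stats)
```

Summing `Survived` counts survivors, whereas taking its mean, as in the survival-rate step, gives a proportion.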

  8. dropna

    # axis=1 (or axis="columns") drops every column that contains a null value.
    drop_na_columns = titanic_survival.dropna(axis=1)
    # axis=0 with subset drops only the rows where "Age" or "Sex" is null.
    new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Sex"])
    # print(new_titanic_survival)
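
The two calls behave quite differently, which is easier to see on a small table. The sketch below uses a made-up four-row DataFrame; the `Cabin` column is included only to show that nulls outside `subset` are ignored:

```python
import pandas as pd

# Hypothetical toy data with nulls scattered across three columns.
toy = pd.DataFrame({
    "Age":    [22.0, None, 30.0, 41.0],
    "Sex":    ["male", "female", None, "female"],
    "Cabin":  [None, "C85", None, None],
    "Pclass": [3, 1, 2, 1],
})

# Column-wise: only columns with no nulls at all survive.
complete_columns = toy.dropna(axis=1)

# Row-wise with subset: a row is dropped only if Age or Sex is null.
rows_with_age_and_sex = toy.dropna(axis=0, subset=["Age", "Sex"])

print(list(complete_columns.columns))
print(list(rows_with_age_and_sex.index))
```

Dropping columns is the more destructive option on real data, since a single missing cell removes the whole column.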
    
  9. The loc indexer

    # loc selects by index label: row label first, then column name.
    row_index_83_age = titanic_survival.loc[83, "Age"]
    row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
    print(row_index_83_age)
    print(row_index_766_pclass)
    

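`loc` looks rows up by label, not by position, and the two only coincide while the index is still in its default order. A sketch on made-up data that reorders the rows with `sort_values` to make the difference visible:

```python
import pandas as pd

# Hypothetical toy data with the default index labels 0, 1, 2.
toy = pd.DataFrame({"Age": [30.0, 5.0, 62.0]})

# After sorting by Age in descending order the index labels read 2, 0, 1.
by_age = toy.sort_values("Age", ascending=False)

age_at_label_0 = by_age.loc[0, "Age"]   # the row whose label is 0
age_at_position_0 = by_age.iloc[0, 0]   # the first row, whatever its label

print(age_at_label_0)
print(age_at_position_0)
```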


