Release time: Nov. 21, 2024, 6:41 a.m.
In [1]:
from env_helper import info; info()
Page update time: 2020-07-05 19:42:15
Operating system/OS: Linux-4.19.0-9-amd64-x86_64-with-debian-10.4 ;Python: 3.7.3
Statistical methods help to understand and analyze the behavior of data. Now we will learn some statistical functions that can be applied to Pandas objects.
Series and DataFrame objects both have the pct_change() method (as did Panel, which was removed in pandas 0.25). This method compares each element with the previous element and calculates the relative change as a fraction, so 1.0 means a 100% increase.
In [2]:
import pandas as pd
import numpy as np
s = pd.Series([1,2,3,4,5,4])
print (s.pct_change())
df = pd.DataFrame(np.random.randn(5, 2))
print (df.pct_change())
0 NaN
1 1.000000
2 0.500000
3 0.333333
4 0.250000
5 -0.200000
dtype: float64
0 1
0 NaN NaN
1 1.715647 -0.782998
2 -1.036604 -3.946877
3 5.822339 -0.319222
4 -1.642254 0.651260
By default, pct_change() works down each column, comparing every row with the row above it. To compare across each row instead, pass axis=1.
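The difference between the two directions is easier to see on a small frame with round numbers. The sketch below is illustrative only: the column names q1 to q3 and the values are made up, not taken from the example above.

```python
import pandas as pd

# Hypothetical data: each row is an item, each column a quarter.
df = pd.DataFrame(
    [[10.0, 20.0, 30.0], [4.0, 2.0, 3.0]],
    columns=["q1", "q2", "q3"],
)

# Default: each row is compared with the row above it, column by column.
by_column = df.pct_change()

# axis=1: each column is compared with the column to its left, row by row.
by_row = df.pct_change(axis=1)

print(by_column)
print(by_row)
```

With axis=1 the first column is all NaN, because it has no column to its left to compare with.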
Covariance is computed between two Series. The Series object has a cov() method that calculates the covariance between itself and another Series. NA values are excluded automatically.
In [3]:
import pandas as pd
import numpy as np
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))
0.11455098112976062
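To see what excluding NA means in practice, the following sketch compares cov() with a manual calculation that first drops every position where either value is missing. The data is invented for illustration, and the manual formula assumes the sample covariance (division by n - 1), which is the default for cov().

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each Series.
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Series.cov only uses positions where both values are present.
pandas_cov = s1.cov(s2)

# Manual check: keep complete pairs, then apply the sample covariance formula.
pairs = pd.concat([s1, s2], axis=1).dropna()
x, y = pairs[0], pairs[1]
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(pairs) - 1)

print(pandas_cov, manual_cov)
```

Only three of the five positions contribute here, since positions 2 and 3 each have a missing value on one side.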
When called on a DataFrame, cov() calculates the pairwise covariance between all columns and returns the result as a matrix.
In [4]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].cov(frame['b']))
print (frame.cov())
0.08799454145708029
a b c d e
a 0.505805 0.087995 -0.395936 -0.002622 0.385068
b 0.087995 0.674307 -0.096464 -0.410428 -0.151635
c -0.395936 -0.096464 0.989045 0.030377 -0.437175
d -0.002622 -0.410428 0.030377 1.201569 -0.439404
e 0.385068 -0.151635 -0.437175 -0.439404 1.258176
Note: the covariance between columns a and b printed by the first statement is the same as the a/b entry in the matrix returned by cov() on the DataFrame.
Correlation measures the strength of the relationship between two numeric Series. The corr() method supports several ways to calculate it: Pearson (the default, which measures linear relationships), Spearman, and Kendall (both rank-based).
In [5]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].corr(frame['b']))
print (frame.corr())
-0.0678453446172776
a b c d e
a 1.000000 -0.067845 -0.326843 0.000984 -0.227762
b -0.067845 1.000000 0.439217 -0.219488 0.078177
c -0.326843 0.439217 1.000000 0.138069 -0.297560
d 0.000984 -0.219488 0.138069 1.000000 0.002307
e -0.227762 0.078177 -0.297560 0.002307 1.000000
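If the DataFrame contains non-numeric columns, they are excluded automatically in the pandas version used here. Since pandas 2.0 this is no longer the default, and numeric_only=True must be passed to get the same behavior.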
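The choice of method matters when a relationship is monotonic but not linear. The sketch below uses made-up data where y is the cube of x, and compares the default Pearson result with method="spearman". It also checks the Spearman result against the Pearson correlation of the ranks. Kendall is selected the same way with method="kendall".

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line.
frame = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

pearson = frame.corr()                     # method="pearson" is the default
spearman = frame.corr(method="spearman")   # correlation computed on ranks

# Spearman correlation is the Pearson correlation of the ranked data.
rank_pearson = frame.rank().corr()

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
print(rank_pearson.loc["x", "y"])
```

Pearson stays below 1 because the points do not fall on a straight line, while Spearman reaches 1 because the ordering of x and y agrees perfectly.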
The rank() method assigns a rank to each element in the array. In the case of ties, each tied element receives the average of the ranks involved.
In [6]:
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print (s.rank())
a 5.0
b 3.5
c 2.0
d 3.5
e 1.0
dtype: float64
rank() accepts an optional ascending parameter that defaults to True. When it is False, the data is ranked in reverse, so larger values are assigned smaller ranks.
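A small fixed example makes the effect of ascending easier to follow than random data. The values below are invented for illustration, with a deliberate tie between b and d as in the cell above.

```python
import pandas as pd

# Hypothetical values; b and d are tied.
s = pd.Series([0.5, -1.2, 3.0, -1.2, 0.1], index=list("abcde"))

ascending_rank = s.rank()                   # smallest value gets rank 1
descending_rank = s.rank(ascending=False)   # largest value gets rank 1

print(ascending_rank)
print(descending_rank)
```

The tied elements share the average rank in both directions: 1.5 when ascending (ranks 1 and 2) and 4.5 when descending (ranks 4 and 5).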
rank() supports different tie-breaking methods, selected with the method parameter: