How to understand axis = 0 or 1 in pandas (Python)?

2018-07-03 09:05:12

From the documentation: "the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1)". The code is:

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                    "y": [3, 4, 5, 6, 7]},
                   index=['a', 'b', 'c', 'd', 'e'])

df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9],
                    "z": [9, 8, 7, 6, 5]},
                   index=['b', 'c', 'd', 'e', 'f'])
pd.concat([df1, df2], join='inner')  # by default axis=0

Since axis=0 (which I interpret as columns), I expected concat to consider only the columns that are found in both DataFrames. But the actual output considers rows that are found in both DataFrames (the only common row element is 'y'). So how should I understand axis=0 and axis=1 correctly?

Data:

In [55]: df1
Out[55]:
   x  y
a  1  3
b  2  4
c  3  5
d  4  6
e  5  7

In [56]: df2
Out[56]:
   y  z
b  1  9
c  3  8
d  5  7
e  7  6
f  9  5

Concatenated horizontally (axis=1), using index elements found in both DFs (aligned by indexes for joining):

In [57]: pd.concat([df1, df2], join='inner', axis=1)
Out[57]:
   x  y  y  z
b  2  4  1  9
c  3  5  3  8
d  4  6  5  7
e  5  7  7  6

Concatenated vertically (DEFAULT: axis=0), using columns found in both DFs:

In [58]: pd.concat([df1, df2], join='inner')
Out[58]:
   y
a  3
b  4
c  5
d  6
e  7
b  1
c  3
d  5
e  7
f  9

If you don't use join='inner' (the default is an outer join), you get:

In [62]: pd.concat([df1, df2])
Out[62]:
     x  y    z
a  1.0  3  NaN
b  2.0  4  NaN
c  3.0  5  NaN
d  4.0  6  NaN
e  5.0  7  NaN
b  NaN  1  9.0
c  NaN  3  8.0
d  NaN  5  7.0
e  NaN  7  6.0
f  NaN  9  5.0

In [63]: pd.concat([df1, df2], axis=1)
Out[63]:
     x    y    y    z
a  1.0  3.0  NaN  NaN
b  2.0  4.0  1.0  9.0
c  3.0  5.0  3.0  8.0
d  4.0  6.0  5.0  7.0
e  5.0  7.0  7.0  6.0
f  NaN  NaN  9.0  5.0

Interpret axis=0 as applying the algorithm down each column, that is, along the row labels (the index).

If you apply that general interpretation to your case, the algorithm here is concat. Thus axis=0 means:

for each column, take all the rows going down (across all the DataFrames passed to concat), and concatenate them when the column is common to all of them (because you selected join='inner').

So the idea is to take all the x columns and concatenate them down the rows, which stacks each chunk of rows one after another. However, x is not present in every DataFrame, so it is not kept in the final result. The same applies to z. The y column is kept because it is present in all the DataFrames. This is the result you got.
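To check the rule described above, here is a small sketch that reuses df1 and df2 from the question. The variable names (stacked, side_by_side, common_columns, common_index) are only for illustration. It compares the labels kept by join='inner' with a plain intersection of the labels on the other axis.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                    "y": [3, 4, 5, 6, 7]},
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame({"y": [1, 3, 5, 7, 9],
                    "z": [9, 8, 7, 6, 5]},
                   index=["b", "c", "d", "e", "f"])

# axis=0 (the default): rows are stacked, so the inner join
# intersects the column labels.
stacked = pd.concat([df1, df2], join="inner")
common_columns = df1.columns.intersection(df2.columns)
print(list(stacked.columns), list(common_columns))

# axis=1: frames are placed side by side, so the inner join
# intersects the index (row) labels.
side_by_side = pd.concat([df1, df2], join="inner", axis=1)
common_index = df1.index.intersection(df2.index)
print(list(side_by_side.index), list(common_index))
```

In both cases the axis argument names the direction in which the pieces are glued together, and the join is applied to the labels of the other axis.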

First, the OP confused the rows and columns of the DataFrame.

But the actual output considers rows that are found in both dataframes (the only common row element 'y')

The OP thought the label y belonged to a row. However, y is a column name.

df1 = pd.DataFrame(
         {"x":[1, 2, 3, 4, 5],  # <-- looks like row x but actually col x
          "y":[3, 4, 5, 6, 7]}, # <-- looks like row y but actually col y
          index=['a', 'b', 'c', 'd', 'e'])
print(df1)

            col   x    y
 index or row
          a       1     3   |   a
          b       2     4   v   x
          c       3     5   r   i
          d       4     6   o   s
          e       5     7   w   0

               -> column
                 a x i s 1

It is very easy to be misled, since in the dictionary x and y look like two rows.

If you build df1 from a list of lists, it should be more intuitive:

df1 = pd.DataFrame([[1,3], 
                    [2,4],
                    [3,5],
                    [4,6],
                    [5,7]],
                    index=['a', 'b', 'c', 'd', 'e'], columns=["x", "y"])
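If you want to convince yourself that the two constructions describe the same table, a quick check along these lines should do. The names from_dict and from_rows are just illustrative.

```python
import pandas as pd

labels = ["a", "b", "c", "d", "e"]

# Dictionary form: each key is a column name, each list is that column's values.
from_dict = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                          "y": [3, 4, 5, 6, 7]},
                         index=labels)

# List-of-lists form: each inner list is one row.
from_rows = pd.DataFrame([[1, 3], [2, 4], [3, 5], [4, 6], [5, 7]],
                         index=labels, columns=["x", "y"])

print(from_dict.equals(from_rows))
print(from_dict.shape)  # (number of rows, number of columns)
```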

So back to the problem: concat is short for concatenate (to link together in a series or chain). Performing concat along axis 0 means linking two objects along axis 0.

   1
   1   <-- series 1
   1
^  ^  ^
|  |  |               1
c  a  a               1
o  l  x               1
n  o  i   gives you   2
c  n  s               2
a  g  0               2
t  |  |
|  V  V
v 
   2
   2   <--- series 2
   2
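The picture above can be reproduced with two small Series. This is only a sketch; s1, s2, along_rows and along_cols are made-up names.

```python
import pandas as pd

s1 = pd.Series([1, 1, 1])
s2 = pd.Series([2, 2, 2])

# Linking along axis 0: the second Series is appended below the first.
along_rows = pd.concat([s1, s2], axis=0)
print(along_rows.tolist())

# Linking along axis 1: the two Series become side-by-side columns.
along_cols = pd.concat([s1, s2], axis=1)
print(along_cols.shape)
```

Note that the axis=0 result keeps the original index labels (0, 1, 2, 0, 1, 2) unless you pass ignore_index=True.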

So, you probably have a feeling for it now. What about the sum function in pandas? What does sum(axis=0) mean?

Suppose the data looks like

   1 2
   1 2
   1 2

You may guess that it means summing along axis 0. Yes!

^  ^  ^
|  |  |               
s  a  a               
u  l  x                
m  o  i   gives you two values 3 6 !
|  n  s               
v  g  0               
   |  |
   V  V
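Here is the same example as runnable code, with axis=1 added for comparison. The frame has no labels of its own, so pandas assigns the default integer ones.

```python
import pandas as pd

df = pd.DataFrame([[1, 2],
                   [1, 2],
                   [1, 2]])

# Summing along axis 0 walks down the rows: one total per column.
print(df.sum(axis=0).tolist())  # [3, 6]

# Summing along axis 1 walks across the columns: one total per row.
print(df.sum(axis=1).tolist())  # [3, 3, 3]
```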

What about dropna? Suppose you have the data

   1  2  NaN
  NaN 3   5
   2  4   6

and you only want to keep

2
3
4

The documentation says: "Return object with labels on given axis omitted where alternately any or all of the data are missing".

Should you use dropna(axis=0) or dropna(axis=1)? Think about it and try it out with:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 3, 5],
                   [2, 4, 6]])

# df.dropna(axis=0) or df.dropna(axis=1) ?

Hint: think about the word along.
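Once you have made your guess, one way to check it is to run both calls and compare. This is a sketch with made-up variable names (rows_dropped, cols_dropped). Keep in mind the wording of the documentation: the axis argument names the labels that may be omitted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, np.nan],
                   [np.nan, 3, 5],
                   [2, 4, 6]])

# axis=0: labels on axis 0 (rows) that contain a NaN are omitted.
rows_dropped = df.dropna(axis=0)
print(rows_dropped)

# axis=1: labels on axis 1 (columns) that contain a NaN are omitted.
cols_dropped = df.dropna(axis=1)
print(cols_dropped)
```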
