top | item 22144123

(no title)

qwhelan | 6 years ago

>Want to join two dataframes together like you'd join two database tables? df.join(other=df2, on='some_column') does the wrong thing, silently, what you really wanted was df.merge(right=df2, on='some_column')

Simply a matter of default type of join - join defaults to left while merge defaults to inner. They use the exact same internal join logic.

>What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

This was a long standing issue but is no longer true.

>Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists

discuss

jfim|6 years ago

> Simply a matter of default type of join - join defaults to left while merge defaults to inner.

No, join does an index merge. For example, if you try to join with string keys, it'll throw an error (because strings and numeric indexes aren't compatible).

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something_else": [4,3,1,2]})
  left.join(other=right, on="abcd")
  
  ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

If you try to join with numeric keys:

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [10,20,30,40]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [40,30,10,20]})
  
  left.join(other=right, on="something", rsuffix="_r")
  
    abcd  something abcd_r  something_r
  0    a         10    NaN          NaN
  1    b         20    NaN          NaN
  2    c         30    NaN          NaN
  3    d         40    NaN          NaN

Or even worse if your numeric values are within the range for indexes, which kind of looks right if you're not paying attention:

  left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
  right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [4,3,1,2]})
  left.join(other=right, on="something", rsuffix="_r")
  
    abcd  something abcd_r  something_r
  0    a          1      c          3.0
  1    b          2      a          1.0
  2    c          3      b          2.0
  3    d          4    NaN          NaN

Whereas merge does what one would expect:

  left.merge(right=right, on="something", suffixes=['', '_r'])
  
    abcd  something abcd_r
  0    a         10      a
  1    b         20      b
  2    c         30      c
  3    d         40      d

>> What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your integers to floating point values.

> This was a long standing issue but is no longer true.

Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

  pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807]})
  
              foo
  0  1.000000e+00
  1  2.000000e+00
  2  3.000000e+00
  3  4.000000e+00
  4           NaN
  5  9.223372e+18

It's also a lossy conversion if the integer values are large enough:

  df = pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]})
  
              foo
  0  1.000000e+00
  1  2.000000e+00
  2  3.000000e+00
  3  4.000000e+00
  4           NaN
  5  9.223372e+18
  6  9.223372e+18
  
  df["foo"].unique()
  
  array([1.00000000e+00, 2.00000000e+00, 3.00000000e+00, 4.00000000e+00, nan, 9.22337204e+18])

>> Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a dataframe into a truth value will throw ValueError.

> Those are 1D types where that's simple to reason about. It's not as straightforward in higher dimensions (what's the truth value of a (0, N) array?), which is why .empty exists

It's not very pythonic, though. A definition of "all dimensions greater than 0" would've been much less surprising.

qwhelan|6 years ago

> Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't mention such a change), so that would likely be still the case in the latest stable release.

It was released in 0.24.0: https://pandas.pydata.org/pandas-docs/stable/user_guide/inte...

For example:

    pd.DataFrame({"foo": [1,2,3,4,None]}, dtype=pd.Int64Dtype())

        foo
    0     1
    1     2
    2     3
    3     4
    4  <NA>

    pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]}, dtype=pd.Int64Dtype())

                       foo
    0                    1
    1                    2
    2                    3
    3                    4
    4                 <NA>
    5  9223372036854775807
    6  9223372036854775806