top | item 30087252

ryanmonroe | 4 years ago

To reference variables in the outer scope, you would do

    mutate(df, b = .env$a + 1)

And if you have a string (contained in a_var) that identifies a variable, you can do

    mutate(df, b = .data[[a_var]] + 1)

You could argue these feel clumsy, but I wouldn’t say it’s “hard” to do either of these things with dplyr.
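
For a concrete picture, here is a minimal sketch of both patterns side by side when a column and an outer variable share a name. The data frame, the outer `a`, and the output column names are made up for illustration.

```r
library(dplyr)

df <- tibble(a = c(1, 2, 3))  # hypothetical data with a column named `a`
a <- 100                      # an outer-scope variable with the same name
a_var <- "a"                  # a column name held as a string

out <- mutate(
  df,
  from_data = a + 1,              # bare `a` finds the column first
  from_env  = .env$a + 1,         # explicitly use the outer-scope `a`
  from_str  = .data[[a_var]] + 1  # look up the column named by the string
)
print(out)
```

Here `from_data` and `from_str` are computed from the column, while `from_env` is computed from the outer variable and recycled to every row.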

krumbie | 4 years ago

I don't think it's just about whether it's hard to do. Your syntax examples look short enough, and one can memorize these two patterns relatively quickly.

However, both patterns are another special case of how identifiers are resolved in the expression. Aren't `.env` and `.data` both valid variable and column names? So what happens if I have a column named `.data`?

Another example, which is the reason why we chose the `:column` style to refer to columns in `DataFramesMeta.jl` and `DataFrameMacros.jl`:

What happens if you have the expression `mutate(df, b = log(a))`? Both `log` and `a` are symbols, but `log` is not treated as a column. Maybe that's because it's used in a function-like fashion? Maybe because R looks at the value of `log` and `a` in their scope and sees that `log` is a function and `a` isn't?

In Julia DataFrames, it's totally valid to have a column that stores different functions. With the dplyr-like syntax rules it would not be possible to express a function call with a function stored in a column, if the pattern really is that function syntax means a symbol is not looked up in the dataframe anymore.

In Julia DataFrameMacros.jl for example, if you had a column named `:func` you could do `@transform(df, :b = :func(:a))` and it would be clear that `:func` resolves to a column.

This particular example might seem like a niche problem, but it's just one of these tradeoffs that you have to make when overloading syntax with a different meaning. I personally like it if there's a small rule set which is then consistently applied. I'd argue that's not always the case with dplyr.

ryanmonroe | 4 years ago

I hadn't thought of that tradeoff. After testing just now, if you have a column named `.data` or `.env` those constructs work as if there was no such column, and actually in that case `mutate(df, b = .data + 1)` is an error.

Personally I'll happily take not being able to use those as column names if it means I can avoid always typing : before every in-data variable, but your comment gave me a better understanding of why it would be bad for some other person or scenario, perhaps where short term ease-of-use is lower on the list of priorities.

For your second example, it doesn't come up in R because a data frame column cannot be a function. Columns must be vectors (including lists), and you could have a vector where one or all elements are functions, but the column itself cannot be a function (functions are not vectors), so there's no ambiguity there. To call a function stored in your data frame you'd have to access an element of the column, and any access method, e.g. `[[` or `$`, would make the resulting set of characters invalid as the name of an object (without backticks, which would then disambiguate the intent).

    df <- tibble(x = list(function(x) x + 1))
    df %>%
      mutate(y = x[[1]](3))

Separate from dplyr, in R when you use `(` to call a function it searches only for functions by that name.

    log <- 3
    log(1)
    # 0

    frog <- 3
    frog(3)
    # Error in frog(3) : could not find function "frog"
    
    log <- function(x) x^2
    log(1)
    # 1

pdeffebach | 4 years ago

It would be interesting to profile the 2nd version though. Assuming the non-standard evaluation has performance benefits (which it does in DataFramesMeta.jl), are you eliminating those benefits when you use

    .data[[a_var]]

?
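
One rough way to check would be to time both forms on the same data. This is only a sketch using base R's `system.time`; the data size and repetition count are arbitrary assumptions, and the numbers will vary by machine and dplyr version, so no result is claimed here.

```r
library(dplyr)

set.seed(1)
df <- tibble(a = runif(1e6))  # hypothetical data, sized so timings are visible
a_var <- "a"

# Repeat each call several times so the elapsed time is less noisy.
t_bare    <- system.time(for (i in 1:20) mutate(df, b = a + 1))
t_pronoun <- system.time(for (i in 1:20) mutate(df, b = .data[[a_var]] + 1))

print(t_bare)     # timing for the bare column reference
print(t_pronoun)  # timing for the .data[[a_var]] lookup
```

A dedicated benchmarking package would give more reliable measurements than a simple loop like this.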