data.table join and j-expression unexpected behavior

Published: 2021-01-17 05:00:58 | Category: Programming | Source: collected from the web


In R 2.15.0 and data.table 1.8.9:

d = data.table(a = 1:5,value = 2:6,key = "a")

d[J(3),value]
#   a value
#   3     4

d[J(3)][,value]
#   4

I would like both to produce the same output (the second one), and I believe they should.

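To rule out that this is a J syntax issue, the same expectation applies to the following expressions (which are identical to the ones above):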

t = data.table(a = 3,key = "a")
d[t,value]
d[t][,value]

I would expect both of the above to return exactly the same output.

So let me rephrase the question: why is data.table designed so that the key column is printed automatically in d[t, value]?

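Update (based on the answers and comments below): Thanks to @Arun and others, I now understand the design rationale. The key is printed above because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason it is designed this way seems to be the following: since a by operation has to be performed anyway when doing a merge, one can take advantage of it instead of running another by on the key of the merge.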

That said, I believe this is a syntax design flaw. The way I read the data.table syntax d[i, j, by = b] is

take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b

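The implicit by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is the by just the key of the merge, and so on). I believe this should be the job of data.table: the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in another way (for example, by checking internally whether the by expression is actually the key of the merge).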
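For readers on a newer data.table, here is a minimal sketch of how the hidden by can be spelled out. It assumes data.table 1.9.4 or later, where (to my knowledge) the per-row grouping on a join has to be requested explicitly with by = .EACHI; the thread itself predates that change, and the variable names res_eachi and res_chain are only for illustration.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")
t <- data.table(a = 3L, key = "a")   # 3L (not 3) so both key columns are integer

# Assumption: data.table >= 1.9.4. Grouping by each row of i is requested
# explicitly, so the join column "a" is returned alongside "value".
res_eachi <- d[t, value, by = .EACHI]

# Join first, then evaluate j on the joined table: a plain vector.
res_chain <- d[t][, value]
```

Under that assumption, res_eachi is a data.table with the columns a and value, while res_chain is a bare vector, which is the distinction the question is about.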

Answer

Edit number infinity: FAQ 1.12 answers your question exactly. (FAQ 1.13 is also useful and relevant, but is not pasted here.)

1.12 What is the difference between X[Y] and merge(X,Y)?
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index. Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index. merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ; whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same. BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards?
You may suggest merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that takes copies of the subsets of data, and it requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns only; the others are ignored. Memory is only created for the columns the j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge followed by a subset?
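The FAQ's comparison can be sketched with two toy tables. The tables X and Y, their columns and their values are made up for illustration. The snippet also assumes data.table 1.9.4 or later, where X[Y, j] without by evaluates j once over the whole join; in the 1.8.x versions discussed in this thread, j would be evaluated per row of Y instead.

```r
library(data.table)

# Hypothetical data: foo lives in X, bar lives in Y next to an unused column.
X <- data.table(id = 1:3, foo = c(2, 4, 6), key = "id")
Y <- data.table(id = c(1L, 3L), bar = c(10, 100), other = c("p", "q"), key = "id")

# One step: j is evaluated on the join, using only the columns it mentions.
joined_sum <- X[Y, sum(foo * bar)]

# Two steps: materialise the merged table first, then compute on it.
m <- merge(X, Y, by = "id")
merged_sum <- sum(m$foo * m$bar)
```

Both routes give the same number here (2 * 10 + 6 * 100); the difference the FAQ points to is that the second one builds the full merged table, including the unused column other, before computing anything.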

Old answer, which according to the OP's comment does not answer the OP's question; retained here because I believe it does.

When you give a value for j such as d[, 4] or d[, value] in a data.table, j is evaluated as an expression. From data.table FAQ 1.1 on accessing DT[, 5] (the very first FAQ):

Because, by default, unlike a data.frame, the 2nd argument is an expression which is evaluated within the scope of DT. 5 evaluates to 5.

So the first thing to understand is that, in your case:

d[,value] # produces a "vector"
# [1] 2 3 4 5 6

This is no different when the query in i is a basic index:

d[3,value] # produces a vector of length 1
# [1] 4

However, this is different when i is itself a data.table. From the data.table introduction (page 6):

d[J(3)] # is equivalent to d[data.table(a = 3)]

Here, you are performing a join. If you just do d[J(3)], then you get all the columns corresponding to that join. If you do,

d[J(3),value] # which is equivalent to d[J(3),list(value)]

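Since you say this answer does not answer your question, I'll point out where I believe the answer to your "rephrased" question lies: then you get only that column, but because you are performing a join, the key column is output as well (it is a join between two tables based on the key column).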

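Edit: Following your second edit, if your question is why this is so, then I would reluctantly (or rather, ignorantly) answer that Matthew Dowle designed it this way to differentiate between join-based subsetting and index-based subsetting in data.table.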

Your second syntax is equivalent to:

d[J(3)][,value] # is equivalent to:

dd <- d[J(3)]
dd[,value]

Again, in dd[, value], j is evaluated as an expression, and therefore you get a vector.
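The equivalence above can be checked directly. This is a small sketch using the same table d as in the question; the names chained, two_step and as_table are only for illustration.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# Chained form: join, then evaluate j on the joined table.
chained <- d[J(3)][, value]

# The same thing spelled out in two steps.
dd <- d[J(3)]
two_step <- dd[, value]

# Wrapping j in list() keeps the result as a data.table instead of a vector.
as_table <- dd[, list(value)]
```

Here chained and two_step should be the same vector, while as_table is a one-column data.table.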

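To answer your third modified question: for the third time, it is because this is a JOIN between two data.tables based on the key column. If I join two data.tables, I expect a data.table.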

From the data.table introduction, once again:

Passing a data.table into a data.table subset is analogous to A[B] syntax in base R where A is a matrix and B is a 2-column matrix. In fact, the A[B] syntax in base R inspired the data.table package.
