1. Introduction
Keyword search on the web (W-KwS), as offered by search engines such as Google, is remarkable because users only need to submit a single set of keywords to obtain useful information. The result of W-KwS is a ranked set that takes into account the importance of each item containing the keywords. From this paradigm, two interesting observations follow. (1) Each result of a query is accompanied by a snippet [1,17,18], i.e. a brief summary that may sometimes be included in the complete result. (2) The web pages corresponding to the keyword(s) (e.g. personal web pages) can potentially provide meaningful and ample information about the designated subject.
The adoption of keyword search paradigms in relational databases is motivated by the success of the W-KwS paradigm [2,3,4,20]. The relevant ranking paradigms take importance into account by letting it flow through relationships as weights. PageRank [4] is a powerful tool for ranking the global importance of web pages and underpins Google's success. However, keyword search in databases has its own characteristics that make the plain PageRank model inapplicable: each database implies different semantics for different attributes. In a database data graph, different relationships and attribute values are represented by different edge types, unlike the Web, where all edges are hyperlinks. ObjectRank [3] therefore extends and modifies PageRank appropriately. For instance, in a bibliographic database (e.g. DBLP), an author with many citations is normally more important than an author with fewer citations.
However, ObjectRank ignores the values of tuples' attributes, which can also affect global importance. For example, a customer with a high total purchase price should be ranked higher than another customer with the same number of orders but a lower total purchase price. Given this limitation, this paper proposes ValueRank, which also takes values into account. As with ObjectRank, one can specify patterns that determine how authority flows across the nodes of the database graph, while additionally considering tuple values.
Because PageRank, ObjectRank [7] and related techniques can only be applied to bibliographic-style databases, ranking database tuples and estimating their global importance scores (denoted Im(ti)) remains a challenging problem. Therefore, this paper introduces ValueRank, which also takes tuple values into account and can thus be applied to any class of database. In [6], ValueRank was introduced only for trading databases (such as the Northwind database) and without evaluation results. In this paper, a new definition of ValueRank is given, and the evaluation results verify that ValueRank produces more effective ranking results than ObjectRank on general databases such as DBLP.
Our previous work [5,19] proposed a new keyword search paradigm that introduces the concept of OSs, which gather all tuples of the dataset about a particular subject. More precisely, as described in the abstract, an OS is a multilevel tree data structure whose root is a tuple containing the keywords (e.g. the Author tuple "Peter Chen", denoted tDS) and whose descendant nodes [5] are its connecting (i.e. neighboring) items (carrying additional semantics such as his papers, years of publication, etc.). However, some OSs may be very large, which is unfriendly to users who only want to glance at a result and find out which "Faloutsos" they really want to browse, and which also makes generation expensive. Evidently, the effective and efficient generation of size-l OSs is necessary [6]. The exact concept of the object summary is described in the following section.
We highlight the contributions of this paper as follows: (1) the introduction of a new ranking method that extends our previous ValueRank algorithm [6] from commercial or trading domains to any dataset with numerical or normalizable values; (2) based on the concept of the Object Summary, a novel greedy algorithm (namely k-LASP) for the generation of size-l OSs.
The rest of this paper is structured as follows. Section 2 presents the background and related work. Section 3 introduces ValueRank. Section 4 presents the greedy algorithm k-LASP, whereas Section 5 provides our evaluation results. Finally, Section 6 discusses conclusions and future work.
2. Related work and research background
2.1 Object Summary
In this new keyword search paradigm, a keyword query is a set of keywords [5,19], and the result of a query is a set of OSs. OSs are constructed by combining graphs and SQL. The fundamental principle is that the relation containing the DS (denoted RDS) includes information about DSs, and the relations linked around RDS contain additional information about the particular DS. For each RDS, a Data Subject Schema Graph (GDS) is generated automatically; this is a directed labeled tree that captures a subset of the database schema rooted at RDS (Fig. 1 illustrates the schema of DBLP and Fig. 2 illustrates a respective GDS). The GDS is a "treealization" of the schema; examples of the resulting replications are the relations Paper (Cited by), Paper (Cites) and Co-Author on the Author GDS (see Fig. 2). In the GDS, affinity measures of relations (denoted Af(Ri)) are investigated, quantified and annotated; to create a good OS, only the relations of the GDS with the highest affinity to RDS need to be traversed [12]. The affinity of a relation Ri to RDS can be calculated with the following formula:
\(Affinity\left(R_{i}\right)=\sum_{j} m_{j} \cdot w_{j} \cdot Affinity\left(R_{Parent}\right)\) (1)
where j ranges over the metrics (m1, m2, ..., mn) with weights (w1, w2, ..., wn) respectively, and Affinity(RParent) (≤1) is the affinity of Ri's parent to RDS. The affinity metrics between Ri and RDS include (1) their distance and (2) their connectivity properties on the database schema and data graph (see [5] for details). Given an affinity threshold θ, we obtain a subset of the GDS denoted GDS(θ). Finally, object summaries are generated by traversing GDS(θ); more precisely, a BFS traversal of the corresponding GDS(θ) is used, starting from the tDS tuple as the root of the OS tree [5].
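For illustration, the following Python sketch applies Formula 1 top-down over a toy Author GDS and keeps GDS(θ). The relation names, metric values and weights are assumptions chosen so that the resulting affinities roughly match the annotations of Fig. 2 (e.g. 0.92 for Paper and 0.83 for Year); they are not the exact settings of [5].

```python
def affinity(parent_affinity, metrics, weights):
    """Affinity(Ri) = sum_j m_j * w_j * Affinity(R_parent)  (Formula 1)."""
    return sum(m * w for m, w in zip(metrics, weights)) * parent_affinity

# A toy Author GDS as parent -> children edges, with assumed per-relation metrics.
gds = {"Author": ["Paper", "Co-Author"], "Paper": ["Year", "Conference"]}
metrics = {                      # assumed (distance, connectivity) metric values
    "Paper": [1.0, 0.84], "Co-Author": [0.9, 0.7],
    "Year": [0.9, 0.9], "Conference": [0.8, 0.8],
}
weights = [0.5, 0.5]             # assumed metric weights w_j

def annotate(gds, root, theta):
    """Compute Affinity per relation top-down and keep only GDS(theta)."""
    af = {root: 1.0}             # the RDS itself has affinity 1 by definition
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in gds.get(parent, []):
            af[child] = affinity(af[parent], metrics[child], weights)
            if af[child] >= theta:        # prune relations (and their subtrees) below theta
                stack.append(child)
    return {r: round(a, 2) for r, a in af.items() if a >= theta}

print(annotate(gds, "Author", theta=0.7))
# e.g. {'Author': 1.0, 'Paper': 0.92, 'Co-Author': 0.8, 'Year': 0.83, 'Conference': 0.74}
```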
Fig. 1. The DBLP Database Schema
Fig. 2. The DBLP Author GDS (Annotated with Affinity)
To moderate the contribution of each tuple's global importance according to its affinity, the local importance score of each tuple ti in an object summary (namely Im(OS, ti)) is computed by the formula:
\(\operatorname{Im}\left(O S, t_{i}\right)=\operatorname{Im}\left(t_{i}\right) \cdot \operatorname{Affinity}\left(t_{i}\right)\) (2)
Im(ti) denotes the global importance score of ti in the database. The global importance is calculated by ValueRank, an importance ranking system (see Section 3); other tuple importance ranking systems, such as [7,9,10], could also be investigated. Note that IR-style techniques [13,14,15,16] are inappropriate for ranking OS tuples, because they miss important tuples that do not contain the keyword(s). As mentioned, an object summary usually contains the given keywords only once (i.e. in tDS), so IR-style techniques cannot rank the remaining tuples of the OS effectively. Thus, Formula 1 is used to calculate the affinity (alternatively, the Affinity(Ri) values can be set manually by domain experts) and Formula 2 is then used to calculate the local importance. For example, consider the Paper tuple ti "Efficient and Effective Querying by Image Content" with Im(ti) = 21.74 and Af(ti) = Affinity(RPaper) = 0.92 (see the affinity scores annotated on the Author GDS of Fig. 2); then Im(OS, ti) = 21.74 × 0.92 = 20.
Distinguishing tuples by the affinity of their relations is crucial. For example, comparing the Paper tuple "Efficient..." with global importance 21.74 and the Year tuple "1988" with global importance 21.64 (i.e. almost equal scores), their local importances become 20 (21.74 × 0.92) and 18 (21.64 × 0.83) respectively. Recall also that, due to the threshold θ, tuples of relations with low affinity may not be included in an object summary at all.
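Formula 2 and the two examples above can be reproduced with a few lines; the scores are the ones quoted in the text.

```python
def local_importance(global_importance, affinity):
    """Im(OS, ti) = Im(ti) * Affinity(ti)  (Formula 2)."""
    return global_importance * affinity

paper_score = local_importance(21.74, 0.92)   # Paper "Efficient and Effective ..."
year_score = local_importance(21.64, 0.83)    # Year "1988"
print(round(paper_score), round(year_score))  # -> 20 18
```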
2.2 size-l Object Summary
According to [11], given a complete object summary and an integer l, a candidate size-l object summary is any subset of the object summary consisting of l connected tuples rooted at the tuple containing the given keywords. The resulting size-l object summary satisfies two criteria: (1) all l tuples are connected to the tDS of the multilevel tree, and (2) the total importance score Im(OS, size-l), i.e. the sum of the Im(OS, ti) of the selected tuples, is maximal. The first criterion ensures that the size-l object summary includes self-descriptive semantics for the keywords. The authors of [11] argued that an appropriate size-l object summary should be a standalone, meaningful synopsis of the most important information about the particular data subject, easy for users to understand without redundant information. Thus, the constraint of connecting nodes to tDS guarantees that the size-l OS remains standalone. For instance, consider the path RAuthor→RPaper→R(Co-)Author (in the DBLP database): even if a paper's local importance is lower than that of its co-author, we cannot select the co-author and exclude the paper, since excluding the Paper tuple would discard the semantic association between the authors who co-authored that paper.
Also note that, because of criterion (1), the l tuples with the highest importance scores may not all be included in a size-l object summary. For example, consider the path RAuthor→RPaper→RYear→RConference with corresponding tuple scores 0.9, 0.2, 0.7 and 0.6: the Conference tuple, although it has a larger importance score than the Paper tuple, may be excluded from the size-l object summary whilst the Paper tuple remains. Also, Im(OS, size-l) does not represent the maximum importance of any l tuples, but the maximum sum over l tuples connected to tDS.
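A toy sketch of criterion (1) on this example path (scores as quoted above) makes the constraint explicit; the helper name is hypothetical.

```python
# On a single path, the only connected candidates rooted at t_DS are its prefixes,
# so the top-scoring tuples alone may not form a valid size-l OS.
path_scores = [("Author", 0.9), ("Paper", 0.2), ("Year", 0.7), ("Conference", 0.6)]

def size_l_candidate(path, l):
    """The only connected size-l candidate on a single path: its first l nodes."""
    prefix = path[:l]
    return prefix, round(sum(score for _, score in prefix), 2)

print(size_l_candidate(path_scores, 2))
# ([('Author', 0.9), ('Paper', 0.2)], 1.1): Conference (0.6) is excluded although it
# outscores Paper (0.2), because including it would also require Paper and Year.
```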
2.3 Rest of the Related Work
Recently, document summarization techniques have attracted extensive research interest [1,18]. A web snippet is an example of a document summary: a search result preview used by web keyword search. Snippets can be static (e.g. consisting of the first few words of a document or descriptive metadata) or query-biased (e.g. consisting of sentences containing several of the keywords) [18]. Applying these technologies directly to databases, and to OSs in particular, is ineffective (e.g. the relational associations and semantics of the displayed tuples are ignored). For example, papers authored by Chen (although they do not contain the keyword "Chen") are as important as their authors and citations, and this is ignored by document summarization. On the other hand, the general idea of entity summarization in semantic knowledge graphs is similar to ours. More precise concepts are given in [21]: given a semantic knowledge graph and an entity represented by a node q of the graph, a summary of q is a size-l subgraph whose nodes surround q [21].
RELIN [22] is another related work, which uses random walks on a graph to describe an entity's features. Different from such document summarization studies and existing works, our proposed Object Summary generation approach targets each standalone data subject: we use tuples to further explain and support the tuple containing the query keywords, in order to distinguish the results from one another, while its relational tuple ranking methodology is an authority-transfer-based approach that considers the corresponding 'values' in relational datasets, specifically for keyword search in relational databases.
A similar approach that uses OS-like structures to capture semantics on the web was proposed in [23], namely the information unit. There, the result of a web keyword search is a logical document consisting of a group of linked web pages containing all the keywords, rather than a single physical document. SphereSearch, proposed in [24], is a keyword search engine for heterogeneous semi-structured, unstructured and structured data. These works search for associations of nodes that contain the keywords in order to provide the semantics of relational keyword search. Moreover, ranking algorithms, keyword search, value-based analysis and related techniques have been widely studied in cloud computing [25], fog computing [26], big data and other settings.
3. ValueRank
PageRank-style approaches (such as ObjectRank [8]) are considered the most effective for databases whose relationship edges are associated with authority flow semantics. However, for trading databases such as Northwind, PageRank-style ranking rewards nodes with many references (i.e. high scores for highly linked nodes) but ignores the values of tuples. For example, consider two customers C1 (with 100 orders) and C2 (with 5 orders): if C2's orders have a higher total order price, C2 may be more important than C1. Consequently, in such databases OSs must be ranked according to the values of some of their tuples. This paper proposes and investigates a more versatile solution, namely ValueRank, that is applicable to any database.
ValueRank is based on the concept of ObjectRank [8], but also takes values into account when calculating the authority transfer rates and the Base Set: both consider not only the number of a tuple's incoming or outgoing links but also its values. The Base Set is the set of tuples whose values are deemed to have a significant impact on the authority of other nodes; e.g. in the DBLP database, all tuples of Paper and Year influence other tuples. Furthermore, the authority transfer rate from Paper to Year (and so on) can be defined as a function of these (normalized) values. For example, consider paper P1 (published in 2016) with one reference published in 1996, and paper P2 (published in 2017) with one reference published in 2016. The authority transfer rate between Papers and Years can be a function of these values, so that according to this function P2 obtains a higher ValueRank.
More precisely, the dataset is modeled as a data graph D(VD, ED) whilst the Schema Graph G(VG, EG) describes its schema structure. From G, the corresponding Authority Transfer Schema Graph GA is created, which dictates how authority flows through the edges of the graph (e.g. Fig. 3): for each edge eG = (vi→vj) of EG, two Authority Transfer Edges are created, a forward edge eGf and a backward edge eGb, annotated with authority transfer rates α(eGf) and α(eGb). The corresponding Authority Transfer Data Graph DA(VD, EDA) is then derived as follows: for each edge (u→v) of ED, DA has two edges, a forward edge ef = (u→v) and a backward edge eb = (v→u), annotated with the authority transfer rates a(ef) and a(eb) respectively, where \(a(e^{f})=\alpha(e_{G}^{f}) / OutDeg(u, e_{G}^{f})\) if \(OutDeg(u, e_{G}^{f})>0\) (\(OutDeg(u, e_{G}^{f})\) is the total number of outgoing edges of u of type \(e_{G}^{f}\)), and \(a(e^{f})=0\) otherwise (\(a(e^{b})\) is defined accordingly).
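The derivation of the per-edge rates can be sketched as follows; the relation pair names, the schema-level α values and the toy data edges below are assumptions for illustration only.

```python
from collections import defaultdict

schema_rate = {("Paper", "Author"): 0.2, ("Author", "Paper"): 0.7}   # assumed alpha(e_G)

# Data-graph edges typed by their schema edge: (source tuple, target tuple, edge type).
data_edges = [
    ("p1", "a1", ("Paper", "Author")),
    ("p1", "a2", ("Paper", "Author")),
    ("a1", "p1", ("Author", "Paper")),
]

out_deg = defaultdict(int)              # OutDeg(u, e_G): outgoing edges of u per edge type
for u, _, etype in data_edges:
    out_deg[(u, etype)] += 1

def transfer_rate(u, etype):
    """a(e^f) = alpha(e_G^f) / OutDeg(u, e_G^f), or 0 if u has no outgoing edge of that type."""
    deg = out_deg[(u, etype)]
    return schema_rate[etype] / deg if deg > 0 else 0.0

print(transfer_rate("p1", ("Paper", "Author")))   # 0.2 shared over 2 Authors -> 0.1
```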
Fig. 3. The GAs for the DBLP database
Instead of using the whole VD, any subset S of nodes can be used as the Base Set, which increases the authority associated with those nodes. S is a subset of the tuples containing the keywords.
The value si of a node vi describes its relative score and can be computed by a function of the normalized attribute values of vi. The si of a node vi in S is defined by the equation:
\(s_{i}=a \cdot f\left(v_{i}\right)\) (3)
where ɑ is a tuning constant with ɑ ≤ 1, and f(vi) is a normalizing function of the value of vi with 0 ≤ f(vi) ≤ 1. Thus si lies in the range [0, 1], rather than being just 0 or 1 as in ObjectRank. For example, for a tuple vi of ROrderDetails, si = f(OrderDetails.Price × OrderDetails.Quantity). si may also be a function of the attributes of neighbouring nodes; for instance, for a tuple of Orders, si = f(OrderDetails.Price × OrderDetails.Quantity) over its Order Details. Combining vi's values with the Authority Transfer Edges yields more dynamic transfer rates; the intuition is that a tuple's different attribute values may have a different impact on each of its edges. Whether forward or backward, the resulting Authority Transfer Edge rate, denoted ɑ(e)′, is calculated by the following formula:
\(a(e)^{\prime}=\beta+\gamma \times f\left(v_{i} \rightarrow v_{j}\right)\) (4)
where β and γ are tuning constants with β + γ ≤ 1, and f(vi → vj) is a normalizing function of vi and vj whose values lie in the range [0, 1]. Fig. 4 illustrates the GA for the Microsoft Northwind database. Similarly to the ObjectRank calculations, the authority transfer rates, the Base Set S and the tuning constants are treated as experimental variables.
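A minimal sketch of Formulas 3 and 4 follows, assuming a simple min-max style normalizing function and illustrative tuning constants (a = 0.85, β = 0.7, γ = 0.3); the attribute values are invented for the example.

```python
def normalise(value, max_value):
    """A simple normaliser so that f(.) stays within [0, 1]."""
    return min(value / max_value, 1.0)

def base_set_score(value, max_value, a=0.85):
    """Formula 3: s_i = a * f(v_i), with a <= 1."""
    return a * normalise(value, max_value)

def edge_rate(edge_value, max_value, beta=0.7, gamma=0.3):
    """Formula 4: a(e)' = beta + gamma * f(v_i -> v_j), with beta + gamma <= 1."""
    return beta + gamma * normalise(edge_value, max_value)

# e.g. an OrderDetails tuple with Price * Quantity = 1200 against a maximum of 5000
print(round(base_set_score(1200, 5000), 3))   # 0.204
print(round(edge_rate(1200, 5000), 3))        # 0.772
```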
Fig. 4. The graph for the Microsoft Northwind dataset
The Base Set S includes nodes whose Jaccard coefficients (Jc) are considered to have a significant effect on the authority of their connected tuples. For the DBLP dataset, the corresponding Authority Transfer Schema Graph GA (Fig. 3) is created. For the Paper → Paper edge, a paper that is cited by an important paper with which it has a high Jaccard coefficient is clearly important. For an author, the papers of his main research areas should obtain higher ValueRank. Suppose A1 has three papers P1, P2 and P3, and P1 cites P2 and P3; we compute the Jaccard coefficients of P1 with P2 and of P1 with P3 (namely s1 and s2), and since the coefficient of P1 with P2 is higher than that of P1 with P3, the Jaccard value of the edge A1 → P1 is s1. The Jaccard value J(vi → vj) of an Authority Transfer Edge is calculated as:
\(J\left(v_{i} \rightarrow v_{j}\right)=\max [(A-e)(j,:)]\) (5)
where A is an n × n matrix with Aij = Jc(ni, nj) (the Jaccard coefficient of ni with nj), ni, nj ∈ RPaper(nAuthor) (i.e. all papers of an author), e is the n × n identity matrix, and max[(A−e)(j,:)] is the maximum score in row j of the matrix A−e. J produces values in the range [0, 1); it never reaches 1 because no two papers are exactly alike. ɑ(e)′ is then calculated by Formula 4, where f(vi → vj) is a normalization function of J(vi → vj).
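For illustration, the following sketch computes Formula 5 for one author, using made-up keyword sets as a stand-in for whatever paper features the Jaccard coefficient is computed over; all names and sets are assumptions.

```python
import numpy as np

papers = {                                   # assumed keyword sets per paper of one author
    "P1": {"database", "keyword", "search"},
    "P2": {"database", "ranking", "search"},
    "P3": {"learning", "vision"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

names = list(papers)
n = len(names)
A = np.array([[jaccard(papers[x], papers[y]) for y in names] for x in names])
J = (A - np.eye(n)).max(axis=1)              # Formula 5: row-wise max with self excluded
print({name: round(float(v), 2) for name, v in zip(names, J)})
# {'P1': 0.5, 'P2': 0.5, 'P3': 0.0}
```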
This paper also proposes a Time Decrement (TD) factor for the Paper → Paper edge. More precisely, the rate TD(vi → vj) at which a paper vi transfers authority to a cited paper vj is calculated by:
\(T D\left(v_{i} \rightarrow v_{j}\right)=\frac{\frac{1}{A_{v_{j}}+b}}{\sum_{v_{k} \in P_{v_{i}}} \frac{1}{A_{v_{k}}+b}}\) (6)
where \(P_{v_{i}}\) is the set of papers cited by paper vi, \(A_{v_{j}}\) is the "age" of vj when cited by vi (calculated as \(A_{v_{j}}=y_{v_{i}}-y_{v_{j}}+1\), where \(y_{v_{i}}\) is the publication year of paper vi), and b is a tuning constant that adjusts the transfer flow towards cited papers of different ages. It prevents a very recently cited paper from obtaining an excessively large weight: b takes a smaller value for fields whose papers age fast and a larger value for fields whose papers age slowly. Regarding the tuning constant b, take b = 5 and consider, for instance, a Computer Science paper published in 1989, "A Knowledge Level Analysis of Belief Revision" (denoted PA), which cites two papers: "Investigations into a Theory of Knowledge Base Revision" (PB), published in 1988, and "Learning at the Knowledge Level" (PC), published in 1986. PC's age is 4 and PB's age is 2, and using Formula 6 the corresponding TD(PA → PB) and TD(PA → PC) are 0.562 and 0.438 respectively. The combined transfer rate a(e), an extension of Formula 4 that incorporates several such factors, is calculated by:
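The worked example above can be reproduced directly from Formula 6; the small helper below is only an illustration, keyed by publication year for brevity.

```python
def time_decrement(citing_year, cited_years, b=5):
    """TD(v_i -> v_j) for every cited paper v_j, using age A_vj = y_vi - y_vj + 1."""
    inv = {y: 1.0 / ((citing_year - y + 1) + b) for y in cited_years}
    total = sum(inv.values())
    return {y: w / total for y, w in inv.items()}

# P_A (1989) cites P_B (1988) and P_C (1986), with b = 5 as in the text.
td = time_decrement(1989, [1988, 1986], b=5)
print({y: round(w, 3) for y, w in td.items()})   # -> {1988: 0.562, 1986: 0.438}
```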
\(a(e)=\beta+\gamma\left(\sum_{s=1}^{n} a_{s} f_{s}\left(v_{i} \rightarrow v_{j}\right)\right)\) (7)
where as is the weight of the s-th factor considered in the transfer rate (such as J and TD), fs(vi → vj) is its corresponding normalization function, and \(\sum_{s=1}^{n} a_{s}=1\). Fig. 3(b) illustrates the resulting GA for the DBLP database.
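A minimal sketch of Formula 7 follows, assuming already-normalized Jc and TD factors and the β = 0.7, γ = 0.3 split used later in Section 5; the factor values are illustrative.

```python
def combined_rate(factors, weights, beta=0.7, gamma=0.3):
    """a(e) = beta + gamma * sum_s a_s * f_s(v_i -> v_j), with sum_s a_s = 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return beta + gamma * sum(a * f for a, f in zip(weights, factors))

# e.g. f(Jc) = 0.5 and f(TD) = 0.562 for a Paper -> Paper edge, equal factor weights
print(round(combined_rate([0.5, 0.562], [0.5, 0.5]), 3))   # 0.859
```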
Let r denote the vector of ValueRank scores ri of the nodes vi; then r is calculated as:
\(r=d A r+(1-d) \frac{s}{|S|}\) (8)
where Aij = a(e) if there is an edge e = (vi → vj) in EDA and 0 otherwise, d controls the importance of the Base Set, and s = [s1, ..., sn]T is the Base Set vector for S; si and a(e) are calculated by Formulas 3 and 7 respectively.
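As an illustration of Formula 8, the following sketch runs a plain power iteration on a tiny hand-made authority transfer data graph; the matrix entries, Base Set scores and node count are assumptions for the example, not values from our datasets.

```python
import numpy as np

# Row i of A collects the rates a(e) of the edges that carry authority into node i.
A = np.array([
    [0.0, 0.3, 0.0],
    [0.7, 0.0, 0.2],
    [0.3, 0.7, 0.0],
])
s = np.array([0.0, 0.8, 0.4])          # Base Set scores s_i (Formula 3); 0 outside S
d = 0.85                               # the default damping factor used in Section 5

def value_rank(A, s, d=0.85, iters=100):
    """Iterate r = d*A*r + (1 - d)*s/|S| (Formula 8) until it stabilises."""
    S_size = np.count_nonzero(s)       # |S|, the number of Base Set tuples
    r = np.full(len(s), 1.0 / len(s))  # uniform starting vector
    for _ in range(iters):
        r = d * (A @ r) + (1 - d) * s / S_size
    return r

print(value_rank(A, s, d).round(3))    # ValueRank scores of the three toy nodes
```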
Table 1 gives ValueRank scores produced by the GA for the Microsoft Northwind trading database with d = 0.85 (i.e. the default setting described in Section 5), whereas the ObjectRank scores were generated by the corresponding "ObjectRank version" of this GA (denoted GA3), i.e. no Base Set was used and a(e) = β for all edges (see Section 5 for details). As reported in [6], ValueRank interestingly provides better comparative scores than ObjectRank, and we make the following observations: in the Northwind database, ObjectRank is highly correlated with the total number of Order_Details, Orders, etc., whereas ValueRank is highly correlated with the summed values of Freight, Orders, etc.
For example, Cus_SA has 31 Orders in total, so its ObjectRank score (0.70) is higher than that of Cus_QU (0.62); on the other hand, Cus_QU has higher values (considering its Orders), so under ValueRank Cus_QU (0.69) scores higher than Cus_SA (0.65). Moreover, Prod_59 has a greater total number of Orders (i.e. 54) and therefore a higher ObjectRank score than Prod_38; however, by also considering their corresponding values, the ValueRank scores better balance the number of orders and their values.
Table 1. Selected examples in Microsoft trading dataset (ObjectRank against ValueRank scores)
4. Greedy Algorithm: k-LASP
Note that the cost of the dynamic programming algorithm [11] becomes high when the required l is large, so this paper proposes the following greedy algorithm, which exploits interesting properties of OSs for better efficiency, although it produces approximate results (see Section 5).
We define a greedy algorithm named k-LASP (k-Largest Averaged Score Path). It extends LASP [12], which uses a priority queue (PQ) to build the size-l OS by repeatedly expanding the path with the largest averaged score ending at the current tuple. However, during size-l generation LASP must update all remaining nodes whenever a path (or node) is added to the size-l object summary, so its cost becomes high when |OS| is very large. k-LASP instead considers the largest averaged score path over at most k nodes: for each node ti it calculates the score w(ti) and its average with the scores of the n−1 (n = min(k, path length)) ancestor nodes on the path from ti to the root (denoted APk(ti)). The APk(ti) of each node ti during size-l OS generation is calculated by:
\(AP_{k}\left(t_{i}\right)=\frac{w\left(t_{i}\right)+\sum_{j=1}^{n-1} w\left(R_{j}\right)}{n}\) (9)
where n = min(k, path length), the Rj are ti's ancestor nodes that have not yet been selected into the size-l OS, and w(Rj) is the corresponding score. More precisely (see Algorithm 1), the inputs of the algorithm are l (the number of tuples returned, i.e. the size of the output), tDS (the keyword tuple of the search) and GDS, which encodes the relations linked to tDS that contain additional information about the particular DS. First, the initial OS (i.e. the complete OS) is generated and the APk(ti) values are calculated with Equation 9 (line 1); the initial score of each tuple in the complete OS is calculated based on ValueRank (Equation 8). Then the node with the largest AP score is extracted from the PQ, and the first (l − |size-l OS|) nodes of its path pi are added to the size-l object summary (lines 3 and 4). Next, the nodes of pi are removed from the OS and the PQ; the OS tree becomes a forest whose roots are children of nodes of pi, and the affected nodes vi of this forest must have their APk(vi) updated (lines 6-8). The process is repeated as long as fewer than l nodes have been selected. Fig. 5 illustrates the algorithm with a 3-LASP example where tDS is node t1. Fig. 5(a) shows the complete OS generated with t1 as input; t6 has the largest value in deQueue(PQ), so the path p1 is t1→t6 and its first two nodes are added to the size-10 OS (lines 3-4). Now |size-10 OS| = 2, so t1 and t6 are removed from the OS and the PQ, and for each descendant node ti (within n levels, n = min(k−1, path length)) of the nodes in p1, APk(ti) is updated on the OS tree and the PQ (lines 6-8). In Fig. 5(b), t9 has the largest value in deQueue(PQ), so p2 is t3→t9 and |size-10 OS| becomes 4; lines 6-8 are executed again, and so on. Fig. 5(d) shows the resulting size-10 OS for this example.
Algorithm 1: k-LASP Algorithm

k-LASP (l, tDS, GDS)
Input: l, tDS
Output: size-l OS
1. generate the initial OS and the initial PQ  // the initial OS is the complete OS; PQ is the priority queue of the OS's leaf nodes; calculate APk(ti)
2. while (|size-l OS| < l) do
3.     pi = path from deQueue(PQ)  // the node with the largest AP value in the PQ
4.     add the first (l − |size-l OS|) nodes of pi to the size-l OS
5.     if (|size-l OS| < l) then
6.         delete the selected path pi from the OS tree and the PQ
7.         for each descendant node ti (within n levels, n = min(k−1, path length)) of the nodes in pi do
8.             update APk(ti) on the OS tree and the PQ
9. return size-l OS  // i.e. the output size-l OS
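To make the selection loop concrete, the following Python sketch implements a simplified version of Algorithm 1: the Node class and helper names are hypothetical, and APk is recomputed for every remaining leaf in each round instead of being maintained incrementally in a priority queue as lines 6-8 do, so it illustrates the greedy choice rather than the exact cost behaviour.

```python
class Node:
    def __init__(self, nid, score, parent=None):
        self.id, self.score, self.parent, self.children = nid, score, parent, []
        if parent:
            parent.children.append(self)

def leaves(root):
    """Collect the leaf nodes of the (complete) OS tree."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            out.append(node)
    return out

def unselected_path(node, selected):
    """Path from `node` up to (excluding) the nearest already selected ancestor."""
    path = []
    while node is not None and node.id not in selected:
        path.append(node)
        node = node.parent
    return path                                    # node first, then its ancestors

def ap_k(node, selected, k):
    """AP_k(t_i): average of w(t_i) and up to k-1 unselected ancestors (Formula 9)."""
    window = unselected_path(node, selected)[:k]
    return sum(n.score for n in window) / len(window)

def k_lasp(root, l, k):
    """Greedily build a size-l OS by repeatedly taking the largest-AP_k path."""
    selected, frontier = set(), leaves(root)
    while len(selected) < l and frontier:
        best = max(frontier, key=lambda n: ap_k(n, selected, k))
        path = list(reversed(unselected_path(best, selected)))    # root side first
        for node in path[: l - len(selected)]:                    # keep connectivity
            selected.add(node.id)
        frontier = [n for n in frontier if n.id not in selected]
    return selected
```

Adding each path root-side first preserves criterion (1) of Section 2.2 even when the remaining budget truncates the selected path.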
In the worst case, k-LASP costs O(l(bk + n log2 n)) to generate a size-l object summary, where l is the required size of the summary, b is the number of paths in the complete OS tree and n is the number of nodes of the complete OS. Every time k-LASP selects nodes to add to the size-l OS it spends O(bk) on updates, i.e. the AP values of descendant nodes on up to b paths must be updated and at most k nodes' AP values are updated per path, while maintaining the ordering costs O(n log2 n). In the worst case these updates are performed l times, so the time complexity of k-LASP is O(l(bk + n log2 n)).
Fig. 5. The 2-LASP Algorithm: Under construction size-5 OSs and their corresponding PQs. (Shaded nodes are the selected ones).
5. Evaluations
We conduct our evaluation on two aspects, namely effectiveness and efficiency, and compare the results generated by our proposed k-LASP algorithm under different settings. The global importance scores are evaluated with both ValueRank and ObjectRank. Regarding the setup, we first investigate the effectiveness of ValueRank through our selected evaluators; we then comparatively investigate the results of ObjectRank and ValueRank; finally, we analyse the quality of the object summaries produced by the greedy heuristic algorithm k-LASP.
Two databases are used in this paper, a bibliographic one and a trading one: the DBLP and MS Northwind databases contain 2,959,511 and 3,209 tuples and occupy about 500MB and 1MB of disk space respectively. The global importance of each tuple of the DBLP and Northwind databases was generated with ObjectRank (i.e. the global importance of [8]) and with ValueRank separately. The experiments were run with a cold cache on a PC with an Intel Core i5-4590 3.30 GHz processor and 8GB of memory.
5.1 Effectiveness
The effectiveness of ValueRank is thoroughly investigated against ObjectRank through evaluators. As the Northwind trading database and the DBLP database have rich schemas (comprising many relationships, constraints and attributes) and easily understandable instances, this paper uses them to present and evaluate the techniques. We measure the effect of d and of the transfer rates in different graphs, imitating and extending the parameter settings used to evaluate ObjectRank [3]. More precisely, in [3] the effect of the variable d is investigated (d = 0.10, 0.85, 0.99, with 0.85 set as the default).
We use three different graphs for the Northwind database and four different graphs for the DBLP database. For the Northwind database, the default graph is Graph1 of Fig. 4. For the DBLP database, this paper introduces the two factors Jc and TD, so Equation 7 becomes \(\alpha(e)=0.7+0.3\left(a_{1} f_{s}(Jc)+a_{2} f_{s}(TD)\right)\); the corresponding GAI is the GA of Fig. 3(b) with a1 = 0.5 and a2 = 0.5, GAII has a1 = 0.1 and a2 = 0.9, and GAIII has a1 = 0.9 and a2 = 0.1. GAIV has all α set to 0 (and consequently si = 0), hence producing ObjectRank values. Table 2 lists the experimental variables and default settings, and Table 3 presents the evaluation of ValueRank's effectiveness.
Table 2. Experimental variable and default settings
Table 3. Evaluation of ValueRank effectiveness
To evaluate ValueRank's quality, a survey similar to the one used for ObjectRank [3] was conducted: five professors and researchers from our university participated. Lists of 10 tuples were selected at random, and each participant compared and ranked them, giving each a score from 1 to 10; for each tuple, a set of descriptive details and statistics was also provided. In general, the evaluators gave better scores to Graph1 and Graph2 under the different settings of d; on the other hand, the last group of settings, i.e. Graph2-d1, did not satisfy the evaluators compared with the remaining groups, because it does not consider values.
ValueRank also gives better comparative rankings than ObjectRank on the DBLP database. For instance, let RG1, RG2, RG3 and RG4 denote the ranks among all Author tuples (341,623) under ObjectRank (GAIV), ValueRank (GAI), ValueRank (GAII) and ValueRank (GAIII) respectively. Author A1's RG1 is 4 but his RG2 is 2. The rank rises because A1's papers are mainly concentrated in the database area; if nur denotes the number of his papers with little or no relation to that area and nsum the number of all his papers, then the ratio ri = nur/nsum determines the effect: the larger ri is, the more the rank drops. Conversely, Author A2's RG2 is higher than his RG1 because his papers are relatively concentrated in computer science and technology. Authors' RG3 and RG4 change accordingly relative to RG2: RG3 emphasizes TD more and RG4 emphasizes Jc more. The changes in the rankings of Paper tuples are similar to those of Authors. Paper A1's RG2 is higher than its RG3 because both its Jc is higher than others' (i.e. the paper is strongly related to the papers it cites) and its TD is higher (i.e. the paper is younger), which makes it more useful to users. Similarly, when more attention is paid to TD, Papers B1 and B2 change correspondingly, and when more attention is paid to Jc, Papers C1 and C2 change correspondingly. Table 4 illustrates the impact of the GA on tuple rankings in the DBLP database.
Table 4. Samples of ObjectRank and ValueRank scores in DBLP database
5.2 Efficiency
In this subsection we mainly compare the overall importance of the size-l OSs generated by the greedy method (i.e. our proposed k-LASP algorithm). Fig. 6(a) shows the approximation quality under the default settings, namely the achieved overall object summary importance (i.e. Im(size-l)), averaged over 10 random object summaries. The results show that the scores of 4-LASP and 6-LASP are always at least as high as those of 2-LASP. This is exactly what we expect: a node with a lower score may be selected because its ancestors have higher scores, and conversely a node with a higher score may be skipped because its ancestors have lower scores. For instance, let Im(OS, P1) = 0.4 (P1 is the Paper tuple 'On Total Functions...') with a child Year tuple Y1 with Im(OS, Y1) = 1.2, and Im(OS, P2) = 0.9 (P2 is the Paper tuple 'A deterministic...') with a child Year tuple Y2 with Im(OS, Y2) = 1.0. A naive method would choose Y1 and P1, but users may actually want to see more Paper tuples; under 2-LASP, Y2's averaged score ((1.0 + 0.9)/2 = 0.95) exceeds Y1's ((1.2 + 0.4)/2 = 0.8). The data subject graph is the same as Fig. 2 with the setting θ = 0.7, so the results of 4-LASP and 6-LASP have the same importance. Since the running cost of an exhaustive (blind) search is prohibitive, we do not report its optimal results.
On the other hand, we also consider the total run-time performance of our proposed algorithm for different values of k in Fig. 6(b). The same object summaries as in Fig. 6(a) (i.e. the same 10 object summaries) are used, and the global importance of the tuples is generated with the default settings.
Fig. 6(b) shows the cost of our algorithm when using different k values to compute size-l OSs for different values of l, excluding the time required to generate the complete OS. We can see that the cost increases as k increases, so the cost of 2-LASP is the lowest.
Fig. 6. Approximation Quality and Efficiency on DBLP (average |OS| = 1116)
6. Conclusion and Future Works
In this paper, building on our previous work, we first extended the ValueRank approach to also take values from non-business datasets (e.g. the DBLP database) into account, in order to provide a more precise authority transfer flow when calculating neighboring relations. We also provided a faster object summary generation algorithm, the k-LASP algorithm, which considers not only the averaged score of a path but restricts the averaging to at most k nodes of the path towards the root. The evaluation shows that our proposed methods produce effective results for relational keyword search.
As future work, we will extend our proposed techniques to more complex datasets, e.g. XML and OWL, and further investigate spatio-temporal datasets, taking location and time as values combined with keyword search.
References
- Turpin, A., Tsegay, Y., Hawking, D., & Williams, H. E, "Fast generation of result snippets in web search," in Proc. of International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp.127-134, 2007.
- Aditya, B., Bhalotia, G., Chakrabarti, S., Nakhe, C., & Parag, P., et al., "BANKS: browsing and keyword searching in relational databases," in Proc. of International Conference on Very Large Data Bases, VLDB, pp.1083-1086, 2002.
- Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., & Sudarshan, S., "Keyword searching and browsing in databases using BANKS," in Proc. of International Conference on Data Engineering, 2002. Proceedings IEEE, pp.431-440, 2002.
- Hristidis, V., & Papakonstantinou, Y., "Discover: keyword search in relational databases," in Proc. of International Conference on Very Large Data Bases, VLDB, pp.670-681, 2002.
- Fakas, G. J., "Automated generation of object summaries from relational databases: A novel keyword searching paradigm," in Proc. of IEEE, International Conference on Data Engineering, DBRank Workshop, IEEE Computer Society, pp.564-567, 2008.
- Fakas, G. J., Cai, Z., "Ranking of object summaries," IEEE International Conference on Data Engineering, DBRank Workshop, pp.1580-1583, 2009.
- Sehgal, U., Kaur, K., & Kumar, P., "Notice of Violation of IEEE Publication Principles The Anatomy of a Large-Scale Hyper Textual Web Search Engine," in Proc. of International Conference on Computer & Electrical Engineering, IEEE Computer Society, Vol.2, pp.491-495, 2009.
- Hwang, H., Hristidis, V., & Papakonstantinou, Y., "Objectrank: authority-based keyword search in databases," ACM Transactions on Database Systems (TODS), vol. 33, no. 1, 2008.
- Varadarajan, R., Hristidis, V., & Raschid, L., "Explaining and Reformulating Authority Flow Queries," in Proc. of IEEE, International Conference on Data Engineering, IEEE, pp.883-892, 2007.
- Huang, X. F., "Tuplerank and implicit relationship discovery in relational databases," Lecture Notes in Computer Science, 2762, 445-457, 2003.
- Fakas, G. J., Cai, Z., & Mamoulis, N., "Versatile size-l object summaries for relational keyword search," IEEE Transactions on Knowledge & Data Engineering, Vol.26 No.4, pp.1026-1038, 2014.
- Fakas, G., Cai, Z., & Mamoulis, N., "Diverse and Proportional Size-l Object Summaries for Keyword Search," in Proc. of ACM SIGMOD International Conference, pp.363-375, 2015.
- Hristidis, V., Papakonstantinou, Y., & Gravano, L, "Efficient IR-Style Keyword Search over Relational Databases," in Proc. of 2003 VLDB Conference, pp.850-861, 2003.
- Liu, F., Yu, C., Chowdhury, A., & Chowdhury, A., "Effective keyword search in relational databases," in Proc. of ACM SIGMOD International Conference on Management of Data, ACM, pp.563-574, 2006.
- Yi Luo, Wei Wang, Xuemin Lin, & Xiaofang Zhou., "Spark2: top-k keyword query in relational databases," in Proc. of Knowledge and Data Engineering, IEEE Transactions on, Vol.23 No.12, pp.763-1780, 2011.
- Yu, B., Li, G., Sollins, K., & Tung, A. K. H., "Effective keyword-based selection of relational databases," in Proc. of ACM SIGMOD International Conference on Management of Data, ACM, pp.139-150, 2007.
- Huang, Y., Liu, Z., & Chen, Y., "Query biased snippet generation in XML search," in Proc. of ACM SIGMOD International Conference on Management of Data, ACM, pp.315-326, 2008.
- Tombros, Anastasios, & Sanderson, Mark., "Advantages of query biased summaries in information retrieval," in Proc. of SIGIR '98: Proceedings of the, International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, pp.2-10, 1998.
- Fakas, G. J., "A novel keyword search paradigm in relational databases: object summaries," Data & Knowledge Engineering, Vol.70 No.2, pp.208-229, 2011. https://doi.org/10.1016/j.datak.2010.11.003
- Markowetz, A., Yang, Y., & Papadias, D., "Keyword search over relational tables and streams," Acm Transactions on Database Systems, Vol.34 No.3, pp.1-51, 2009. https://doi.org/10.1145/1567274.1567279
- Sydow, M., Pikula, M., & Schenkel, R., "The notion of diversity in graphical entity summarisation on semantic knowledge graphs," Journal of Intelligent Information Systems, Vol.41 No.2, pp.109-149, 2013. https://doi.org/10.1007/s10844-013-0239-6
- Cheng, G., Tran, T., & Qu, Y., "RELIN: Relatedness and Informativeness-Based Centrality for Entity Summarization," The Semantic Web - ISWC 2011 -, International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, DBLP, Vol.7031, pp.114-129, 2011.
- Li, W. S., K. Selcuk Candan, Vu, Q., & Agrawal, D., "Retrieving and organizing web pages by 'Information Unit'," in Proc. of International World Wide Web Conference, pp.230-244, 2001.
- Graupmann, J., Schenkel, R., Weikum, G., Böhm, K., Jensen, C. S., & Haas, L. M., et al., "The spheresearch engine for unified ranked retrieval of heterogeneous xml and web documents," 2005.
- Vinueza Naranjo, Paola & Shojafar, Mohammad & Vaca-Cardenas, Leticia & Canali, Claudia & Lancellotti, Riccardo & Baccarelli, Enzo, "Data Over SmartGrid -A Fog Computing Perspective," 2016.
- Baccarelli, Enzo & Scarpiniti, Michele & Vinueza Naranjo, Paola & Vaca-Cardenas, Leticia, "Fog of Social IoT: When the Fog Becomes Social," IEEE Network, 32(4), 68-80, 2018. https://doi.org/10.1109/MNET.2018.1700031