1. Introduction
Nowadays, with the rapid development of network technology, online social networks have been widely applied in our lives. For example, we can do many things on online social networks, such as instant messaging, shopping, mobile payment, live streaming and hotel booking, which make our daily lives more and more convenient [1]. More importantly, since a large number of user behaviors are recorded on online social networks, these networks store a large amount of data, which can be analyzed and mined by many companies to provide users with better services. However, these data contain a great deal of sensitive information about personal social relations, salary, financial transactions, diseases, spatio-temporal activities, religious beliefs, political opinions, etc., which can lead to privacy leakage in the case of illegal use [2]. For instance, as the biggest social network platform, Facebook has been involved in many scandals in recent years, in which a large amount of user data was illegally used and a large amount of personal privacy information was breached [3]. Therefore, it is necessary to provide sufficient privacy preserving for online social networks.
In particular, large amounts of data in social networks are often represented as graphs, which can be utilized in many graph analysis tasks, such as information propagation, link prediction, community detection, etc. [4]. At the same time, malicious attackers can also use graph analysis methods to mine personal privacy information from these graph data. To protect these graph data, many modification methods have been designed, which can be grouped into three categories: (1) edge and vertex modification methods, which modify (add, delete or switch) edges/nodes in a graph; (2) generalization methods, which group vertices and edges into super-vertices and super-edges; (3) uncertain graph methods, which inject uncertainty into the edges of the graph. In the first category, it is easy to protect the original graph by randomly adding or deleting a node or edge, but this random modification has insufficient data utility. To improve data utility, K-anonymity methods have been widely applied to protect sensitive nodes and edges. In the K-degree-anonymity method, each node shares the same degree with at least k-1 other nodes, so the probability of identifying each node is at most 1/k [5]. Furthermore, the l-diversity method [6], the t-closeness method [7] and the k-anonymity with edge selection method [8] have been presented to resist various attacks. In the second category, to improve privacy preserving and anti-attack capability, the generalization methods cluster similar nodes together to generate super-nodes, which hide the link relationships between the nodes within each super-node, and super-edges, which combine the edges between these super-nodes [9]. In the third category, by injecting uncertainty semantics into the graph, the uncertain graph method can protect the sensitive relationships between nodes while keeping a structure similar to the original graph, thus achieving notably better data utility.
As a special graph modification method, the uncertain graph method consists of two steps to generate an uncertain graph. In particular, the first step modifies the original graph by adding/deleting some edges while maintaining its structure as much as possible, and the second step injects uncertainty into the modified graph to obtain an uncertain graph that protects the privacy of the original graph. However, existing uncertain graph methods have some disadvantages. For example, although the (k,ε1)-obfuscation method has better data utility [10], it is not able to resist the rounding attack. In the random walk method [11], the structure of the original graph is modified by deleting many edges, which leads to insufficient data utility. In the UGDP method [12], differential privacy can improve the privacy preserving of edges, but the method is vulnerable to attacks based on eliminating edge probability.
Because differential privacy can provide a rigorous privacy guarantee against attacks based on background knowledge, many differential privacy based methods have been proposed to protect graph structure data since differential privacy was developed by Cynthia Dwork [13]. In particular, randomized response with differential privacy provides better privacy preserving than centralized differential privacy. To address the disadvantages of the uncertain graph methods and accomplish the preservation of link privacy, a novel uncertain graph method based on node random response is devised to generate an uncertain graph. In this method, to improve privacy preserving, the random response is adopted to modify the edges on nodes, and node differential privacy injects noise on the edges to generate an uncertain graph. In addition, to minimize the perturbation to the original graph, the original graph is decomposed into many sub-graphs, and some sub-graphs with a larger number of edges, selected by the exponent mechanism, are modified. Therefore, the proposed uncertain graph method can preserve the link privacy of social networks while maintaining data utility.
The major contributions of this paper are as follows:
(1) A general framework to generate an uncertain graph is proposed, which can achieve the trade-off between privacy preserving and data utility. In this framework, after the original graph is decomposed into many sub-graphs, some sub-graphs with a large number of edges are obtained through the exponent mechanism. After that, all obtained sub-graphs are modified. In the end, an uncertain graph is generated by combining all the sub-graphs.
(2) An uncertain graph method based on node random response is presented to provide sufficient privacy preserving for the link privacy of social networks. In this method, the random response mechanism is adopted to modify the edges and the uncertainty is injected on edges through node differential privacy.
(3) The experiments are performed on synthetic and real data sets to demonstrate the effectiveness of our method regarding privacy preserving and data utility. Compared with other methods, the result demonstrates that our method can preserve the link privacy of the original graph with a high level of data utility.
The organization of this paper is described as follows: Section 2 concentrates on the graph modification methods and differential privacy based methods. Some basic knowledge and definitions are introduced in Section 3. Section 4 demonstrates the model of the devised uncertain method, describes the algorithms in detail and explicitly analyzes the privacy guarantees of this method. The experiments are shown in Section 5 to evaluate the proposed method. Finally, the conclusion and future work are described in Section 6.
2. Related Work
To preserve sensitive information in social networks, many graph modification methods have been widely adopted for privacy preserving before a social network is released. In general, these methods include edge and node modification methods, generalization methods and uncertain graph methods.
Among the existing edge and node modification methods, to overcome the insufficient data utility caused by random perturbation, X. Ying [14] developed two algorithms that could protect the original graph while maintaining its spectral properties as much as possible. In [15], only the most important edges were protected to achieve a better trade-off between privacy preserving and data utility. As a useful privacy preserving technique, k-degree anonymity has usually been employed to protect social networks through anonymized graphs. J. Casas in [16] developed a k-degree anonymity method that anonymized the degree sequence of a graph by using univariate micro-aggregation to achieve the desired data utility. On the basis of this method, edge relevance was considered to design a k-degree anonymity method that minimized edge perturbation to enhance data utility. To resist structural attacks, the k-isomorphism method was introduced in [17] to protect social networks, which achieved strong anonymity while maintaining data utility. In addition, by using graph similarity detection to obtain subgraphs, [18] proposed a subgraph K+-isomorphism method that satisfied k-isomorphism while reducing information loss.
Different from edge and node modification methods, generalization methods focus on how to generate super-nodes and super-edges, which can hide the details of individuals. In [19], the mutual information of each node was first calculated according to data field theory, then the nodes with high mutual information were selected as key nodes. In the end, the key nodes were used as core nodes to cluster similar nodes and generate a clustered graph. To simultaneously protect the characteristics of nodes and communities, F. Yu [20] proposed a clustering algorithm that adopted some perturbation strategies to reduce privacy leakage while maintaining data utility.
In order to protect social networks with better data utility, Boldi in [10] presented a (k,ε1)-obfuscation method which generated an uncertain graph by injecting uncertainty into the edges of social networks. Due to insufficient obfuscation, the uncertain graph generated by this method could easily be re-identified through the rounding attack. Compared with the (k,ε1)-obfuscation method, the Rand-Walk method [11] was able to provide strong privacy preserving but with insufficient data utility. In [21], Nguyen proposed a Maximum Variance method for a better trade-off between privacy and utility, which utilized quadratic programming to assign the probability values of edges. Based on the above work, [22] devised a generalized obfuscation model that could keep the degrees of nodes unchanged and obtain an uncertain graph by using uncertain adjacency matrices. To provide strong privacy preserving for the link privacy of social networks, J. Hu in [12] utilized edge differential privacy to design an uncertain graph method that also met the requirements of data utility. [23] introduced a method based on the triadic closure to generate an uncertain graph, which was suitable for small social networks.
In comparison with the graph modification methods, it is noted that differential privacy has some advantages: it can resist attacks based on background knowledge and provides a rigorous mathematical guarantee [24]. Owing to these advantages, many differential privacy methods have been developed to protect social networks since C. Dwork created differential privacy. In general, these methods adopt differential privacy either to protect specific sensitive statistics of graphs or to publish synthetic graphs. When publishing the degree distribution of a graph, Day in [25] proposed two node differential privacy methods which used aggregation and a cumulative histogram respectively to reduce the error caused by the noise. To release the node strength histogram with fewer errors, [26] designed an edge differential privacy based method that aggregated the original histogram of the graph by using sequence-aware and local-density-based clustering approaches. In [27], to counter sub-graph based attacks, Nguyen introduced a method that perturbed all k-vertices linking some sub-graphs by adding noise to some edges. In addition, other statistics in social networks, including triangle counts, node centrality and shortest paths, have been protected by many differential privacy based methods when they are published [28,29]. Beyond statistical data, differential privacy has also been employed to obtain a synthetic graph that can protect the original graph. In [30], V. Karwa used a graphical degree partition of the original graph and perturbed it by differential privacy to obtain a synthetic graph. In addition, [31] developed the LDPGen method, which clusters structurally-similar users together through multiple iterations to construct a synthetic graph in a distributed environment.
Particularly, it is well known that randomized response is an input perturbation algorithm that perturbs the input value by a probability mechanism. Such designs based on randomized response have been widely used and studied to protect respondents' answers to sensitive questions in surveys [32]. For example, [33] controlled the statistical disclosure by using randomized response when publishing data in the form of contingency tables. In addition, it has also been used to release network data protected by differential privacy in [34], which offered rigorous privacy guarantees for the original network. In this paper, a randomized response mechanism under differential privacy is designed to modify the edges of a graph and construct a perturbed graph that protects the original graph.
3. Preliminaries
In this section, some definitions used throughout the paper are introduced. In particular, a social network is abstracted as a simple undirected graph G=(V, E), where V denotes the set of nodes and E represents the set of edges.
Definition 1 (Uncertain graph [10]).
Given a graph G=(V, E) and a function P: E’ → [0, 1] that assigns a probability to each edge in E’, we obtain an uncertain graph G’=(V, E’, EP), where E’ is attained by modifying E and EP represents the probabilities of the edges. Compared with the graph G, the uncertain graph G’ has the same nodes as G but different edges. In a deterministic graph, the probabilities of all edges are 1.
Definition 2 (Neighboring graph[12]).
For two graphs Ga=(Va, Ea) and Gb=(Vb, Eb), if Ga has one more node than Gb, that is, |Va| = |Vb| + 1 and Eb ⊆ Ea, then Ga and Gb are neighboring graphs.
As illustrated in Fig. 1, Fig. 1(a) has a different node and three different edges compared with Fig. 1(b), so Fig. 1(a) and Fig. 1(b) are neighboring graphs.
Fig. 1. An example of neighboring graphs
In addition, if there is one different edge between Ga and Gb, |Ea| = |Eb| + 1, Ga and Gb are also neighboring graphs.
Definition 3 (Sensitivity[12]).
Let F be a sequence of queries F: G → E; the sensitivity of F is:
\(\begin{align}\Delta f=\max _{G_{a}, G_{b}}\left\|F\left(G_{a}\right)-F\left(G_{b}\right)\right\|_{1}\end{align}\) (1)
The Hamming distance is used to calculate the sensitivity of F. If Ga differs from Gb by one node, the sensitivity of F is dmax, where dmax is the maximum degree of nodes in the graph.
Definition 4 (Differential Privacy[12]).
Let ε ≥ 0, a randomized algorithm M satisfies ε-differential privacy if for any two neighboring graphs Ga and Gb and all S ⊆ Range(M), the following holds:
Pr[M(Ga) ∈ S] ≤ eε × Pr[M(Gb) ∈ S] (2)
where Ga and Gb are neighbors, ε denotes a privacy preserving level. To achieve ε-differential privacy for graphs, two ways including the Laplace mechanism and the Exponential mechanism have been adopted to perturb the outputs of M.
Definition 5 (Laplace Mechanism[12]).
Let F be a sequence of queries F: G → E, and let M be a randomized algorithm applied on G; then:
M (G) = F(G) + lap(∆f/ε) (3)
where lap(∆f/ε) denotes the Laplace noise with µ =0, b = ∆f/ε.
The Laplace mechanism adds Laplace noise to F(G) to ensure that the algorithm M satisfies ε-differential privacy.
In addition, Eq. (4) describes the Laplace noise distribution.
\(\begin{align}L(x)=\frac{1}{2 b} \exp \left(-\frac{|x-\mu|}{b}\right)\end{align}\) (4)
where µ represents a position parameter, b is a scale parameter and x denotes a random variable.
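To make Eq. (3) and Eq. (4) concrete, the following is a minimal sketch of the Laplace mechanism in Python, assuming numpy is available; the query result is represented here by a single numeric value, and delta_f and epsilon are the sensitivity and the privacy budget.

```python
import numpy as np

def laplace_mechanism(true_value, delta_f, epsilon, rng=np.random.default_rng()):
    """Eq. (3): return F(G) + lap(delta_f / epsilon), i.e. Laplace noise with mu = 0, b = delta_f / epsilon."""
    b = delta_f / epsilon
    return true_value + rng.laplace(loc=0.0, scale=b)

# Example: perturb a query answer with sensitivity d_max = 4 and privacy budget epsilon = 1.
noisy_answer = laplace_mechanism(10.0, delta_f=4, epsilon=1.0)
```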
Definition 6 (Exponential Mechanism[13]).
For a data set D, let r be an output of the function F, where r ∈ R, let U: (D, r) → R be the scoring function of r on D, and let ΔU be the global sensitivity of the scoring function. If the random algorithm A(D, U, R) selects and outputs r ∈ R with probability proportional to exp(\(\begin{align}\frac{\varepsilon U(D, r)}{2 \Delta U}\end{align}\)), the random algorithm A is said to satisfy ε-differential privacy. The implementation process of this algorithm is called the exponential mechanism.
Definition 7 (Randomized Response[32]).
The randomized response mechanism is defined as follows:
P(yi = k|xi = j) = Pij (5)
where xi is an input that equals j, and Pij is the probability that the output yi equals k. Here the values of j and k belong to {0, 1}, i ∈ [1, N], and N is the number of inputs.
The design matrix Pm of the 2-dimensional randomized response is defined as follows:
\(\begin{align}P_{m}=\left(\begin{array}{ll}p_{00} & p_{01} \\ p_{10} & p_{11}\end{array}\right)\end{align}\)
where P00 indicates the probability that the random output equals 0 when the real input is 0, and P01 represents the probability that the random output is 1 when the real input equals 0, where P00 and P01 are in [0, 1]. Similarly, P10 represents the probability that the random output equals 0 when the real input is 1, and P11 is the probability that the random output is 1 when the real input equals 1, where P10 and P11 are in [0, 1]. In particular, as the probabilities in each row sum to 1, the design matrix can be simplified to
\(\begin{align}P_{m}=\left(\begin{array}{cc}p_{00} & 1-p_{00} \\ 1-p_{11} & p_{11}\end{array}\right)\end{align}\)
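As an illustration, the following is a minimal Python sketch of the two-dimensional randomized response, assuming the symmetric parameter choice p00 = p11 = eε/(1+eε); this particular choice is our assumption rather than a requirement of the paper, and it satisfies the ε-differential privacy condition of Definition 8 below.

```python
import math
import random

def rr_perturb(x, epsilon):
    """Perturb a binary input x in {0, 1}; keep it with probability p00 = p11 = e^eps / (1 + e^eps)."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return x if random.random() < p_keep else 1 - x

# Example: randomize the presence bit of an edge with epsilon = 1.
y = rr_perturb(1, epsilon=1.0)
```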
Definition 8 (Randomized Response satisfying ε-Differential Privacy).
Given a parameter ε, if max{P00/P10, P10/P00, P01/P11, P11/P01} ≤ eε, the randomized response scheme based on the design matrix Pm in Definition 7 achieves ε-differential privacy.
Definition 9 (Post-Processing [12]).
Assume a randomized algorithm M that satisfies ε-differential privacy; when a graph G is input to M, the output of M is G’, which protects the graph G. Let N be an arbitrary randomized mapping; when N is applied on G’ to obtain G”, the composite algorithm N∘M: G → G” satisfies ε-differential privacy.
Definition 10 (Parallel composition properties[13]).
Let {A1, A2, ..., An} be a sequence of algorithms, and assume that each algorithm Ai satisfies εi-differential privacy. When these algorithms are applied respectively to n disjoint subsets of the database D, the combination of all algorithms satisfies max{εi}-differential privacy; this is called the parallel composition property of differential privacy.
4. Framework and Method
4.1 A general framework
The proposed general framework consists of five steps, three of which are the main steps: decomposing and selecting sub-graphs, edge modification, and merging sub-graphs. The details of this framework are described in the following.
As shown in Fig. 2, step 1 inputs an original graph that denotes a social network. In step 2, the original graph is divided into many sub-graphs by using the Louvain algorithm, then the exponent mechanism selects some sub-graphs with a large number of edges, so that minimal edge modification can be realized on the original graph under a given privacy budget. Then, step 3 utilizes the randomized response and differential privacy to modify each selected sub-graph through edge modification. In this step, the randomized response is applied on nodes to modify each sub-graph. In particular, this process only adds or deletes edges between one node and its neighbor nodes and second-order adjacent nodes. After that, node differential privacy adds Laplace noise on the edges and the post-processing mechanism injects the uncertainty into these edges through a modulo operation. Finally, a set of uncertain sub-graphs is generated. In step 4, all uncertain sub-graphs and the unselected sub-graphs are merged to generate an uncertain graph. In the end, step 5 outputs an uncertain graph that achieves differential privacy preserving for link privacy.
Fig. 2. A general framework to generate an uncertain graph
In summary, a general framework based on node random response is devised, which can preserve the link privacy of the social network while obtaining sound data utility.
4.2 Methods and Algorithms
4.2.1 UGNRR (Uncertain graph based on node random response) method
As shown in the general framework, the key task of this framework is to generate an uncertain graph that can provide strong privacy preserving for the link privacy of the original graph. Therefore, the UGNRR method based on this framework is proposed, which includes three algorithms: the SGEM (Selecting sub-graph based on exponent mechanism) algorithm, the EMNR (Edge modification based on node random response mechanism) algorithm and the UNDP (Uncertain graph based on node differential privacy) algorithm. In this method, the SGEM algorithm utilizes the exponent mechanism to obtain a set of sub-graphs with a large number of edges. For each sub-graph in this set, the EMNR algorithm modifies the edges of the sub-graph according to the random response, then the UNDP algorithm transforms each modified sub-graph into an uncertain sub-graph through node differential privacy. At last, this method generates an uncertain graph that achieves privacy preserving for the link privacy of the original graph.
Algorithm 1. The UGNRR algorithm
To realize the proposed method, the UGNRR algorithm is presented above. In line 1, the input undirected graph is decomposed into a set of sub-graphs Ss. Line 2 selects a set of sub-graphs Ssub from Ss through the SGEM algorithm. Each sub-graph SGi is processed by two algorithms from line 4 to line 7: line 5 adds and deletes edges by the EMNR algorithm, then the UNDP algorithm generates an uncertain sub-graph in line 6. Lastly, the two sub-graph sets SGu and Sr are merged to generate an uncertain graph in line 9.
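Since Algorithm 1 mainly orchestrates the three sub-algorithms, the following is a high-level Python sketch of the pipeline, assuming networkx (version 2.8 or later, which provides louvain_communities) and the helper functions sgem_select, emnr_modify and undp_uncertain sketched in the following subsections; how the privacy budget ε is split among the three steps is not fixed here.

```python
import networkx as nx
from networkx.algorithms import community

def ugnrr(G, epsilon):
    # Step 1: decompose the original graph into sub-graphs with the Louvain algorithm.
    communities = community.louvain_communities(G, seed=0)
    subgraphs = [G.subgraph(c).copy() for c in communities]
    # Step 2: select sub-graphs with a large number of edges via the exponent mechanism (Algorithm 2).
    selected, rest = sgem_select(subgraphs, epsilon)
    # Step 3: modify the edges of each selected sub-graph (Algorithm 3) and inject uncertainty (Algorithm 4).
    uncertain = [undp_uncertain(emnr_modify(sg, epsilon), epsilon) for sg in selected]
    # Steps 4-5: merge all sub-graphs and output the uncertain graph.
    return nx.compose_all(uncertain + rest)
```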
4.2.2 SGEM(Selecting sub-graph based on exponent mechanism) algorithm
After an original graph is decomposed into many sub-graphs, edge modification is used to modify these sub-graphs. In the UGNRR method, the number of edges in each sub-graph denotes the size of this sub-graph. Due to the different sizes of these sub-graphs, the edge modification perturbs them differently: when the same edge-modification perturbation is added to each sub-graph, the smaller the size of the sub-graph, the larger the relative perturbation. To reduce the perturbation, some small-size sub-graphs are excluded and the perturbation is added only to the remaining sub-graphs. In this way, there are two kinds of perturbation: one is caused by the edge modification, the other is brought by the excluded sub-graphs. In order to attain the minimum perturbation, the exponent mechanism is utilized to select some larger-size sub-graphs. In this exponent mechanism, the Laplace noise denotes the perturbation of the edge modification.
Given an undirected graph G, the Louvain algorithm divides it into n sub-graphs. For each sub-graph Si, Swi, which is the number of edges in Si, denotes the size of Si. Then there is a sequence WG, described as [Sw1, Sw2, ..., Swn].
Algorithm 2. The SGEM algorithm
After sorting WG from largest to smallest, we select the first m units and add noise to them. Then we obtain Error(WG), which is defined as follows:
Error(WG) = DE(WG) + LE(WG)
where DE(WG) represents the error caused by the excluded units, and LE(WG) is the added Laplace noise.
\(\begin{align}\begin{array}{l}D E\left(W_{G}\right)=E\left(\sqrt{\sum_{i=m+1}^{n}\left|S w_{i}\right|^{2}}\right) \\ L E\left(W_{G}\right)=E\left(\sqrt{\sum_{i=1}^{m} \operatorname{lap}(\Delta f / \varepsilon)^{2}}\right) \\ D E\left(W_{G}\right)+L E\left(W_{G}\right)=\sqrt{\sum_{i=m+1}^{n}\left|S w_{i}\right|^{2}}+\sqrt{2 * m} * \frac{\Delta f}{\varepsilon}\end{array}\end{align}\)
Here, the query function is f: G → WG.
∆f = |f(G) - f(G')| = |WG - WG'| = dmax
where ∆f is the sensitivity of the query function f, dmax is the maximum degree of nodes in G, and G and G’ are neighboring graphs differing by one node.
In order to gain a minimum value of Error(WG), the exponent mechanism selects a best threshold m which can be used to select some sub-graphs. Thus, a scoring function U is set up:
\(\begin{align}U(G, m)=\sqrt{\sum_{i=m+1}^{n}\left|S w_{i}\right|^{2}}+\sqrt{2 * m} * \frac{\Delta f}{\varepsilon}\end{align}\)
In this algorithm, the node differential privacy is applied to realize strong privacy preserving. Therefore, the ∆U is:
∆U = U(G, m) - U(G', m) = ∆RE + ∆LE
where the two terms are bounded as follows:
\(\begin{align}\begin{array}{l}\Delta R E \leq \max \left|\sqrt{\sum_{i=m+1}^{n}\left|S w_{i}\right|^{2}}-\sqrt{\sum_{i=m+1}^{n}\left|S w_{i}^{\prime}\right|^{2}}\right| \\ \quad \leq \max \left|\sum_{i=m+1}^{n}\right| S w_{i}\left|-\sum_{i=m+1}^{n}\right| S w_{i}^{\prime}|| \leq d_{\max } \\ \Delta L E=\Delta f \\ \Delta U=\Delta R E+\Delta L E \leq 2 d_{\max }\end{array}\\\end{align}\)
The probability to select the threshold m is
\(\begin{align}p_{r}(m)=\frac{\exp \left(-\frac{\varepsilon * U(G, m)}{2 \Delta U}\right)}{\sum_{i=1}^{n} \exp \left(-\frac{\varepsilon * U(G, i)}{2 \Delta U}\right)}\end{align}\)
Then the best threshold m is used to truncate the WG. Finally, a set of sub-graphs Sm is obtained, which can be utilized to realize the minimal noise perturbation in the original graph.
Line 1 obtains the number of sub-graphs and line 2 builds the sequence WG that records the size of each sub-graph. From line 3 to line 5, the exponent mechanism obtains a threshold m. According to m, line 6 truncates the sequence WG and line 7 selects a set of sub-graphs Sm from the set of sub-graphs Ss. In the end, line 8 returns the set of sub-graphs Ssub.
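A minimal sketch of this selection step is given below, assuming networkx sub-graphs and numpy; the scoring function U(G, m), the sensitivity ΔU = 2dmax and the selection probability follow the formulas above, while subtracting the minimum score is only a numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def sgem_select(subgraphs, epsilon, rng=np.random.default_rng()):
    # Sort sub-graphs by size (number of edges), from largest to smallest.
    subgraphs = sorted(subgraphs, key=lambda sg: sg.number_of_edges(), reverse=True)
    sw = np.array([sg.number_of_edges() for sg in subgraphs], dtype=float)
    n = len(sw)
    d_max = max(max(dict(sg.degree()).values(), default=0) for sg in subgraphs)
    delta_f, delta_u = d_max, 2 * d_max
    # Scoring function U(G, m) for every candidate threshold m = 1, ..., n.
    scores = np.array([np.sqrt(np.sum(sw[m:] ** 2)) + np.sqrt(2 * m) * delta_f / epsilon
                       for m in range(1, n + 1)])
    # Exponent mechanism: Pr(m) proportional to exp(-eps * U(G, m) / (2 * delta_U)).
    weights = np.exp(-epsilon * (scores - scores.min()) / (2 * delta_u))
    m = int(rng.choice(np.arange(1, n + 1), p=weights / weights.sum()))
    # Keep the m largest sub-graphs; the remaining sub-graphs are left unmodified.
    return subgraphs[:m], subgraphs[m:]
```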
4.2.3 EMNR (Edge modification based on node random response mechanism) algorithm
For each sub-graph in the set Ssub, the random response applied on nodes is used to add and delete edges. To maintain data utility, this algorithm only adds and deletes edges between a node and its adjacent nodes and second-order adjacent nodes. In order to add some edges to SGi, an edge sequence is created first, in which each edge links node i and one of its second-order adjacent nodes and is assigned the value 0. Then, the value of each edge is input into the random response mechanism; if the value of an edge becomes 1, this edge is added to the sub-graph. In addition, when deleting edges from the graph, another edge sequence is generated, in which each edge links node i and one of its adjacent nodes and is assigned the value 1. After the value of an edge is entered into the random response mechanism, this edge is deleted from the sub-graph SGi if its value becomes 0. Since deleting edges destroys the structure of the graph, only one i-th of the selected edges are actually deleted, where i is usually set to 3. Finally, the edge modification modifies each sub-graph SGi by the node random response mechanism. The detail of the EMNR algorithm is described in Algorithm 3.
Algorithm 3. The EMNR algorithm
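As Algorithm 3 is not reproduced here, the following is a minimal sketch of the edge modification described above, assuming networkx sub-graphs, the rr_perturb helper from the randomized response sketch in Section 3, and i = 3 as the deletion fraction.

```python
import random

def emnr_modify(sg, epsilon, i=3):
    sg = sg.copy()
    for v in list(sg.nodes()):
        neighbours = set(sg.neighbors(v))
        # Candidate additions: edges between v and its second-order adjacent nodes, initial value 0.
        second_order = {w for u in neighbours for w in sg.neighbors(u)} - neighbours - {v}
        for w in second_order:
            if rr_perturb(0, epsilon) == 1:
                sg.add_edge(v, w)
        # Candidate deletions: existing edges between v and its adjacent nodes, initial value 1.
        flipped = [u for u in neighbours if rr_perturb(1, epsilon) == 0]
        # Only one i-th of the selected edges are actually removed to limit structural damage.
        for u in random.sample(flipped, len(flipped) // i):
            if sg.has_edge(v, u):
                sg.remove_edge(v, u)
    return sg
```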
4.2.4 UNDP (Uncertain graph based on node differential privacy) algorithm
To generate an uncertain sub-graph, the UNDP algorithm shown in Fig. 3 contains three steps. First of all, according to the Laplace mechanism, Laplace noise is added to each edge of a sub-graph SnGi. In this process, node differential privacy is applied, which provides better privacy preserving than edge differential privacy. After that, each edge of SnGi becomes a noised edge with a noise value, so SnGi is transformed into a noised sub-graph. Finally, according to the principle of post-processing, the value of each edge is reduced modulo 1 and the remainder, which lies in [0, 1), is regarded as a probability value and assigned to the corresponding edge of SnGi. Therefore, an uncertain sub-graph SmGi is generated. The detail of the UNDP algorithm is described in Algorithm 4.
Fig. 3. The UNDP algorithm
Algorithm 4. The UNDP algorithm
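A minimal sketch of Algorithm 4 is shown below, assuming networkx sub-graphs and numpy; the node-level sensitivity is taken as dmax following Definition 3, and the edge probability is stored in an attribute named "p" (the attribute name is our own convention).

```python
import numpy as np

def undp_uncertain(sg, epsilon, rng=np.random.default_rng()):
    sg = sg.copy()
    d_max = max(dict(sg.degree()).values(), default=1)
    b = d_max / epsilon                               # Laplace scale under node differential privacy
    for u, v in sg.edges():
        noised = 1.0 + rng.laplace(loc=0.0, scale=b)  # edge indicator plus Laplace noise
        sg[u][v]["p"] = noised % 1.0                  # post-processing: remainder modulo 1 lies in [0, 1)
    return sg
```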
4.3 The analysis of method
Theorem: The UGNRR method satisfies ε-differential privacy.
Proof: In this method, the exponent mechanism, which satisfies differential privacy, is adopted in the SGEM algorithm; the EMNR algorithm uses randomized response with differential privacy; and the UNDP algorithm utilizes node differential privacy. Therefore, all three algorithms satisfy differential privacy. To achieve minimal edge modification of the original graph under a given privacy budget, the SGEM algorithm selects some sub-graphs from the original graph, which satisfies differential privacy. Then, according to the parallel composition property of differential privacy, the process in which the EMNR algorithm and the UNDP algorithm are applied to disjoint sub-graphs to generate uncertain sub-graphs also satisfies differential privacy. In the end, the process of merging all uncertain sub-graphs is post-processing, which consumes no additional privacy budget. In summary, the UGNRR method satisfies ε-differential privacy.
5. Experimental Analysis
The developed method is evaluated in this section. First, some experiment data sets are introduced. Then, the developed method is analyzed from different aspects. Finally, the proposed method is also compared with other uncertain graph approaches.
5.1 Data set
In our experiments, two kinds of experimental data are utilized: synthetic data sets and real data sets. The synthetic data sets are obtained from ER graphs, which contain 500 and 1000 nodes. The real data sets include Facebook data with 4039 nodes and 63731 nodes, and the Enron email network with 36692 nodes, which are from [35].
To evaluate the proposed method, the (k,ε1)-obfuscation method, the Rand-Walk method and the UGDP method are adopted for comparison. All simulation experiments run on an HP computer with an Intel Core i5-8500 at 3.00 GHz and 12 GB memory. The programs are written in Python on the Microsoft Windows 7 operating system.
5.2 Privacy evaluation
5.2.1. Privacy measurement
When a graph is converted into an uncertain graph, there is a certain gap between them, which can be measured by the edit distance. Because the edges in an uncertain graph are uncertain, the expected edit distance (EED) is introduced to measure the gap between an original graph and an uncertain graph, which can also be used to evaluate privacy preserving. The larger the EED, the better the privacy preserving.
It is well-known that the definition of edit distance between two deterministic graphs G1, G2 is: D(G1, G2) = |E1\E2| + |E2\E1|
According to the formula above, the expected edit distance between the uncertain graph G’’ and the deterministic graph G is:
\(\begin{align}E E D\left[D\left(G, G^{\prime \prime}\right)\right]=\sum_{G_{1}^{\prime}} P_{r}\left(G_{1}^{\prime}\right) D\left(G, G_{1}^{\prime}\right)=\sum_{e_{i} \in G}\left(1-p_{i}\right)+\sum_{e_{i} \notin G} p_{i}\end{align}\)
where G’1 is sampled from G", Pr(G’1) indicates the probability of obtaining G’1 from the uncertain graph G".
In UGNRR algorithm, when we get an uncertain graph Gu, the expected edit distance between Gu and the graph G is:
EED[D(G, Gu)] = EED[D(G, G')] + EED[D(G', Gu)]
where G’ is obtained by the EMNR algorithm, and Gu is generated by the UNDP algorithm.
EED[D(G, G')] = ek
where ek equals the edit distance between two deterministic graphs G and G’, which is calculated by the following formula: ek = |Eα| + |Ed|
where |Eα| denotes the number of edges added to G, and |Ed| is the number of edges deleted from G.
Since no edges are added or removed in the UNDP algorithm, the expected edit distance between Gu and G’ is
\(\begin{align}E E D\left[D\left(G^{\prime}, G u\right)\right]=\sum_{e_{i} \in G^{\prime}}\left(1-p_{i}\right)\end{align}\)
where ei belongs to the edges set of Gu, pi is the probability of the edge ei.
The expected edit distance (EED) between Gu and G is as follows:
\(\begin{align}E E D\left[D\left(G, G u\right)\right]=e_{k}+\sum_{e_{i} \in G^{\prime}}\left(1-p_{i}\right)\end{align}\)
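Under the same conventions as the sketches in Section 4 (edge probabilities stored in the attribute "p", probability 1 for unperturbed edges), the EED above can be computed as follows; G_mod denotes the modified graph G′ produced by the EMNR step.

```python
def expected_edit_distance(G, G_mod, Gu):
    # e_k: number of edges added to and deleted from G by the edge modification step.
    edges_g = {frozenset(e) for e in G.edges()}
    edges_mod = {frozenset(e) for e in G_mod.edges()}
    e_k = len(edges_mod - edges_g) + len(edges_g - edges_mod)
    # Expected edit distance between G' and Gu: sum of (1 - p_i) over the edges of Gu.
    return e_k + sum(1.0 - p for _, _, p in Gu.edges(data="p", default=1.0))
```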
5.2.2. Privacy analysis
To evaluate the privacy preserving of the different uncertain graph methods, the EED is used. The greater the EED, the better the privacy preserving an uncertain graph method achieves. All data sets are processed 10 times by the proposed method and the other methods to obtain the average results.
In the comparative experiments, the parameter settings of the four methods are shown in Table 1. In the (k,ε1)-obfuscation method, the obfuscation level k is 10 or 20, the tolerance parameter ε1 equals 0.1, the multiplier factor c is 1 and the white noise q is equal to 0.01. In the Rand-Walk method, the parameter t denotes the size of the noise. In addition, the privacy budget ε takes values in {0.2, 0.5, 1, 1.5, 2} in the UGNRR method.
Table 1. The EED values of four methods in differential data sets
The resulting EED values can be seen in Table 1. In Table 1, the EED values of the UGNRR method are shown from the first to the fifth row, where the EED increases as the value of ε decreases, which means that the privacy preserving of the UGNRR method becomes stronger. For example, in the Facebook data set with 4039 nodes, when ε is 2, the value of EED is 75354. As ε decreases to 0.5, the value of EED rises to 76243, which means that the privacy preserving of the UGNRR method is improved. Additionally, as the number of nodes in the original graph increases, the EED of the UGNRR method rises simultaneously, which indicates that the UGNRR method is able to provide privacy preserving for different social networks. For instance, in Table 1, when ε is 1, it is clear that the EED of the UGNRR method increases from 18687 to 675156 as the number of nodes changes from 500 to 63731, which illustrates that this method can be applied to different social networks.
As shown in Table 1, the EED of the (k,ε1)-obfuscation method is shown in the sixth and seventh rows, while the eighth and ninth rows indicate the EED values of the Rand-walk method. In addition, the details of the UGDP method are described in the remaining rows. Compared with the other three methods on the same data set, the value of EED obtained by the UGNRR method is larger than that of the (k,ε1)-obfuscation method and the UGDP method, while it is smaller than that of the Rand-Walk method. For example, in the Facebook data set with 4039 nodes, when ε is 0.5, the value of EED in the UGNRR method is 76243, which is larger than that in the (k,ε1)-obfuscation method with k=20 and that in the UGDP method, while being less than that in the Rand-Walk method with t=10. In particular, compared with the UGDP method, the results show that the edge modification and node differential privacy applied in the UGNRR method have a positive effect on the value of EED. Therefore, according to the definition of EED, it is clear that the UGNRR method can provide stronger privacy preserving than the (k,ε1)-obfuscation method and the UGDP method, but weaker than the Rand-Walk method.
5.3. Utility evaluation
5.3.1. Utility metrics
In order to evaluate the data utility, the number of edges (NE), the average degree (AD) and the degree variance (DV) are used in our experiments. Due to the uncertainty of edges in an uncertain graph, the degree of a node in an uncertain graph is the expected degree, which equals the sum of the probabilities of its adjacent edges. Therefore, the definitions of NE, AD and DV are as follows:
\(\begin{align}\begin{array}{rl}d_{v}=\sum p(i, j) & N E=1 / 2 \sum_{v \in V} d_{v} \\ A D=1 / n \sum_{v \in V} d_{v} & D V=1 / n \sum_{v \in V}\left(d_{v}-A D\right)^{2}\end{array}\end{align}\)
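A minimal sketch of these three metrics for an uncertain graph is given below, assuming networkx and the edge-probability attribute "p" used in the earlier sketches.

```python
def utility_metrics(Gu):
    # Expected degree of each node: sum of the probabilities of its adjacent edges.
    d = {v: sum(p for _, _, p in Gu.edges(v, data="p", default=1.0)) for v in Gu.nodes()}
    n = Gu.number_of_nodes()
    ne = 0.5 * sum(d.values())                             # NE: expected number of edges
    ad = sum(d.values()) / n                               # AD: average expected degree
    dv = sum((dv_ - ad) ** 2 for dv_ in d.values()) / n    # DV: degree variance
    return ne, ad, dv
```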
Then, several structural measures are adopted. The first one is the diameter (SDiam), which denotes the maximum distance among all path-connected pairs of nodes. The second measure is the average distance (SAPD), which is the average shortest distance among all path-connected pairs of nodes.
Furthermore, the Utility function defined as follows is utilized to measure the data utility of each method. The greater the Utility, the better the data utility of the method.
\(\begin{align}Utility=\left(1-\frac{|U V-R V|}{R V}\right) \times 100 \%\end{align}\)
where UV is the graph metrics in uncertain graphs achieved by different methods, RV is the real metrics in the original graphs.
Finally, to compare the UGNRR method with the other three methods in terms of data utility, the error on one graph metric is used, which is defined as follows:
∆q(G, Gu) = |q(G) − q(Gu)|
where q represents one graph metric.
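Both measures are straightforward to evaluate; the sketch below uses, purely as an example, the NE values reported in Section 5.3.2 for the Facebook data set with 63731 nodes.

```python
def utility(uv, rv):
    # Utility in percent: larger means the uncertain-graph metric uv is closer to the real metric rv.
    return (1.0 - abs(uv - rv) / rv) * 100.0

def delta_q(uv, rv):
    # Absolute error on one graph metric q.
    return abs(uv - rv)

# NE of the original graph vs. NE under the UGNRR method (epsilon = 0.2), from Section 5.3.2.
print(utility(697090, 817090), delta_q(697090, 817090))   # roughly 85.3 and 120000
```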
5.3.2. Utility analysis
To evaluate the data utility of the uncertain graph methods, the experimental results are obtained by averaging the results of 10 runs. Table 2 illustrates the graph metrics of the original graph and the UGNRR method, Table 3 shows the graph metrics of the original graph, the (k,ε1)-obfuscation method and the Rand-walk method, while Table 4 demonstrates the graph utility metrics of the UGDP method.
Table 2. The metrics in the UGNRR method
Table 3. The metrics in (k,ε l )-obfuscation method and Rand-walk method
Table 4. The metrics in the UGDP method
As shown in Table 2, the value of NE in the five data sets decreases as ε rises, and so does the value of AD. For instance, in the Facebook data set with 4039 nodes, the value of NE descends from 72846 to 70895 as ε changes from 0.2 to 2, while the value of AD decreases from 36.01 to 35.10. In addition, the value of DV descends from 4789.76 to 4084.23, while the SAPD rises to 2.17. In the UGNRR method, the smaller ε is, the more edges are modified in the original graph, and thus the greater the values of NE and AD. On the contrary, the larger ε is, the fewer edges are modified, thus the value of DV becomes smaller and the SAPD gets closer to that of the original graph. Therefore, the UGNRR method can provide sufficient data utility regardless of the privacy budget ε.
Then the Utility is used to evaluate the data utility of UGNRR method. As shown in Fig. 4 (a), the maximum Utility of NE is 85%. In Fig. 4 (b), the highest Utility of AD can reach 87%, the lowest is 75%, so the average Utility of AD is about 81%. According to the results in Table 2, in the Facebook data set with 4039 nodes, the highest Utility of SDiam is about 74%, while that of SAPD is 72%. Especially, the highest Utility of SAPD can reach 84% in the Facebook data set with 63731 nodes. Therefore, the data utility of UGNRR method is feasible.
Fig. 4. The Utility of NE and AD in UGNRR method
Furthermore, ∆q is utilized to compare the UGNRR method with the (k,ε1)-obfuscation method, the Rand-walk method and the UGDP method in data utility. In the Facebook data set with 63731 nodes, the NE of the original graph is 817090, the NE obtained by the UGNRR method is 697090 (ε=0.2), while the NE values of the other three methods are 816286 (k=10), 425702 (t=5) and 612833 (ε=0.2) respectively. Thus, the ∆q of NE obtained by the UGNRR method is larger than that of the (k,ε1)-obfuscation method, but less than that of the UGDP method and the Rand-walk method. The result indicates that the UGNRR method is not better than the (k,ε1)-obfuscation method in data utility, but it is better than the UGDP method and the Rand-walk method. Additionally, the details of the ∆q of other graph metrics are described in Fig. 5, where Fig. 5 (a) shows the ∆q of AD for the different methods, while Fig. 5 (b) demonstrates the ∆q of SAPD. According to the results, the (k,ε1)-obfuscation method has better data utility than the UGNRR method, while the UGNRR method is better than the Rand-walk method. In addition, the UGNRR method is better than the UGDP method in some graph metrics, such as AD.
Fig. 5. The comparison of different methods
5.4. Computational complexity evaluation
Given a social network G=(V, E), where the number of nodes is n and the number of edges is m, the Louvain algorithm is adopted in the UGNRR method to decompose G into k sub-graphs Gs(Vs, Es), where |Vs|=ns and |Es|=ms. As the UGNRR method consists of three main steps, its computational complexity is the total execution time of these three steps. For step 1, G is decomposed into k sub-graphs through the Louvain algorithm, then the SGEM algorithm is used to select kt sub-graphs, so the computational complexity is O(nlogn)+O(k). For step 2, the EMNR algorithm and the UNDP algorithm are utilized to obtain kt uncertain sub-graphs, so the computational complexity is kt*(O(ns*d²max)+O(me)), where dmax is the maximum degree of nodes and me is the number of edges of each modified sub-graph. For step 3, as all sub-graphs are merged into an uncertain graph, the operation takes constant time and the computational complexity is O(1). Therefore, the total computational complexity is O(nlogn)+O(k)+kt*(O(ns*d²max)+O(me))+O(1). In particular, ignoring the lower-order terms, the UGNRR method finally has the computational complexity O(nlogn)+kt*(O(ns*d²max)+O(me)).
Among the other methods, the UGDP method has computational complexity O(m), while the computational complexities of the (k,ε1)-obfuscation method and the Rand-walk method are O((1+|c|)*m) and O(n) respectively. Compared with the UGDP method, after spending O(nlogn) to decompose the original graph, the UGNRR method spends kt*(O(ns*d²max)+O(me)) to obtain an uncertain graph, which is smaller than O(m). In addition, the computational complexity of the UGNRR method to obtain an uncertain graph is also smaller than that of the (k,ε1)-obfuscation method and larger than that of the Rand-walk method. Therefore, as the computational complexity O(nlogn) for decomposing the original graph is feasible, the complexity O(nlogn)+kt*(O(ns*d²max)+O(me)) of the UGNRR method is viable for generating an uncertain graph.
In summary, the results of experiments show that the UGNRR method can not only provide sufficient privacy preserving, but also maintain data utility when it is applied in practice.
6. Conclusion
With the increasing attention to individual privacy, many graph modification methods and differential privacy methods have been widely adopted to protect the graph structure data in social networks, which contain sensitive personal link privacy. As a particularly useful graph modification method, the uncertain graph methods provide effective privacy preserving while maintaining data utility. To improve the privacy preserving of uncertain graph methods, an uncertain graph method based on node random response is developed, which can provide stronger privacy preserving than other uncertain graph methods. In particular, the random response is utilized to modify the original graph and node differential privacy is applied to inject uncertainty on edges. In addition, to maintain data utility, after the original graph is decomposed into many sub-graphs, some sub-graphs with a larger number of edges are selected through the exponent mechanism and are modified by the edge modification. In particular, the edge modification only adds and deletes edges between each node and its neighbor nodes and second-order adjacent nodes in each sub-graph. According to the properties of differential privacy, the proposed uncertain graph method satisfies differential privacy. Meanwhile, the experimental results indicate that the proposed method offers better privacy preserving while attaining sound data utility. Therefore, the developed uncertain graph method can be widely applied to protect the link privacy of social networks.
Although the presented method achieves a good balance between privacy preserving and data utility, in the future we will investigate whether it can be applied to more complex networks, such as dynamic graphs and directed graphs.
References
- S.R.Sahoo, B.B.Gupta, "Multiple features based approach for automatic fake news detection on social networks using deep learning," Applied Soft Computing, vol.100, Mar.2021.
- A.K.Jain, S.R.Sahoo, J.Kaubiyal, "Online social networks security and privacy: comprehensive review and analysis," Complex & Intelligent Systems, vol.7, no.5, pp.2157-2177, Jun. 2021. https://doi.org/10.1007/s40747-021-00409-7
- J. Isaak, M. J. Hanna, "User Data Privacy: Facebook, Cambridge Analytica, and Privacy Protection," Computer, vol.51, no.8, pp.56-59, Aug.2018. https://doi.org/10.1109/MC.2018.3191268
- K.Swati. [Online]. Available: https://thehackernews.com/2018/03/facebook-cambridge-analytica.html.
- L.Sweeney, "k-Anonymity: a model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol.10, no.5, pp.557-570, Oct.2002. https://doi.org/10.1142/S0218488502001648
- B. Zhou, J. Pei, "The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighbourhood attacks," Knowledge and Information Systems, vol.28, no.1, pp.47-77, Jul. 2011. https://doi.org/10.1007/s10115-010-0311-2
- S.Chester, M.B.Kapron, G.Srivastava, "Complexity of social network anonymization," Social Network Analysis and Mining, vol.3, pp.151-166, Jun.2013. https://doi.org/10.1007/s13278-012-0059-7
- J.Casas-Roma, J.Herrera-Joancomarti, V.Torra. "k-Degree anonymity and edge selection: improving data utility in large networks," Knowledge and Information Systems, vol.50, no.2, pp.447-474, Feb.2017. https://doi.org/10.1007/s10115-016-0947-7
- K.R.Langari, S.Sardar, A.A.S.Mousavi, "Combined fuzzy clustering and firefly algorithm for privacy preserving in social networks," Expert Systems with Applications, vol.141, pp.1-12, Mar.2020. https://doi.org/10.1016/j.eswa.2019.112968
- P. Boldi, F. Bonchi, A. Gionis, "Injecting uncertainty in graphs for identity obfuscation," Proceedings of the VLDB Endowment, vol.5, No.11, pp.1376-1387, Aug.2012. https://doi.org/10.14778/2350229.2350254
- P.Mittal, C.Papamanthou, D. Song, "Preserving Link Privacy in Social Network Based Systems," in Proc. of NDSSS, San Diego, USA, pp.1-16, Feb. 2013.
- J.Hu, J.Yan, Z Wu, "A Privacy-Preserving Approach in Friendly-Correlations of Graph Based on Edge-Differential Privacy," Journal of Information Science and Engineering, vol.35, no.4, pp.821-837, Jul.2019.
- C. Dwork, "Differential Privacy," in Proc. of ICALP, Venice, Italy, pp.1-12, Jul. 2006.
- X.Ying, X.Wu, "Randomizing social networks: a spectrum preserving approach," in Proc. of SDM, Atlanta, GA, USA, pp.739-750, Apr.2008.
- R.Casas, "Privacy-Preserving on Graphs Using Randomization and Edge-Relevance," in Proc. of MDAI, Tokyo, Japan, pp.204-216, Oct.2014.
- J.Casas, J.Herrera, V.Torra, "An Algorithm For k-Degree Anonymity On Large Networks," in Proc. of ASONAM, Niagara, Ontario, Canada, pp.671-675, Aug.2013.
- J.Cheng, A.W.Fu, J.Liu, "K-isomorphism: privacy preserving network publication against structural attacks," in Proc. of ICMD, Indianapolis, Indiana, USA, pp.459-470, Jun. 2010.
- H.Rong, T.Ma, M.Tang, "A novel subgraph K+-isomorphism method in social network based on graph similarity detection," Soft Computing, vol.22, no.8, pp.2583-2601, Apr. 2018. https://doi.org/10.1007/s00500-017-2513-y
- Y.Liu, J.Jin, Y.Zhang, "A new clustering algorithm based on data field in complex networks," Journal of Supercomputing, vol.67, no.3, pp.723-737, Mar. 2014. https://doi.org/10.1007/s11227-013-0984-x
- F.Yu, M.Chen, B.Yu, "Privacy preservation based on clustering perturbation algorithm for social network," Multimedia Tools and Applications, vol.77, no.9, pp.11241-11258, 2018. https://doi.org/10.1007/s11042-017-5502-3
- H. H. Nguyen, A.Imine, M. Rusinowitch, "A Maximum Variance Approach for Graph Anonymization," in Proc. of FPS, Montreal, Canada, pp.49-64, Nov.2014.
- H.H.Nguyen, A.Imine, M.Rusinowitch, "Anonymizing Social Graphs via Uncertainty Semantics," in Proc. of ICCS, Singapore, pp.495-506, Apr. 2015.
- J.Yan, L.Zhang, C.W.Shi, "Uncertain Graph Method Based on Triadic Closure Improving Privacy Preserving in Social Network," in Proc. of NaNA, Kathmandu, Nepal, pp.190-195, Oct.2017.
- C. Dwork, "Differential privacy: a survey of results," in Proc. of TAMODELSC, Xi'an, China, pp.1-19, Apr. 2008.
- R.K.Macwan, J.S.Patel, "Node differential privacy in social graph degree publishing," Procedia computer science, vol.143, pp.786-793, 2018. https://doi.org/10.1016/j.procs.2018.10.388
- Q.Qian, Z.Li, P.Zhao, "Publishing Graph Node Strength Histogram with Edge Differential Privacy," in Proc. of DSAA, Gold Coast, QLD, Australia, pp.75-91, May, 2018.
- B.P.Nguyen, H.Ngo, J.Kim, "Publishing Graph Data with Subgraph Differential Privacy," in Proc. of DSAA, Hanoi, Vietnam, pp.134-145, Apr. 2015.
- T.Dong, Y.Zeng, H.Z.Liu, "A Differential Privacy Topology Scheme for Average Path Length Query," Journal of Information Science & Engineering, vol.37, no.4, pp.134-145, Jul. 2021.
- H.Jiang, J.Pei, D.Yu, "Applications of Differential Privacy in Social Network Analysis: A Survey," IEEE Transactions on Knowledge and Data Engineering, vol.35, no.1, pp.108-127, 2023.
- V.Karwa, A.B.Slavkovic, "Differentially private graphical degree sequences and synthetic graphs," in Proc. of ICPSD, Palermo, Italy, pp.273-285, Sep. 2012.
- Z.Qin, T.Yu, Y.Yang, "Generating Synthetic Decentralized Social Graphs with Local Differential Privacy," in Proc. of CCS, Dallas, Texas, USA, pp.425-438, Oct. 2017.
- C.Liu, S.Chen, S.Zhou, "A general framework for privacy-preserving of data publication based on randomized response techniques," Information Systems, vol.96, pp.1-12, Feb. 2021. https://doi.org/10.1016/j.is.2020.101648
- A.van den Hout, P.G.M.van der Heijden, "Randomized response, statistical disclosure control and misclassification: a review," International Statistical Review, vol.70, no.2, pp.269-288, 2002. https://doi.org/10.1111/j.1751-5823.2002.tb00363.x
- V.Karwa, A.B.Slavkovic, P.Krivitsky, "Differentially private exponential random graphs," in Proc. of ICPSD, Ibiza, Spain, pp.143-155, Sep. 2014.
- Stanford Large Network Dataset Collection. [Online]. Available: http://snap.stanford.edu/data/.