Lee and Tsai presented a data hiding scheme for HTML files by using different space codes to represent a space in an HTML file [11]. The main idea of Lee and Tsai’s method is to collect all the codes that represent a white space in the Microsoft Internet Explorer browser and then use different codes to conceal secret data. Table 1 summarizes the white space codes. The key steps of Lee and Tsai’s method are as follows: First, collect all space character sequences and transform them into 8-bit ASCII codes. Then encode the space character by considering its corresponding secret bits. Lee and Tsai’s method has good performance in terms of embedding capacity. However, the embedding capacity of data hiding in a webpage can be further improved.
Table 1. Special space code representations in HTML [11]
In this paper a novel data hiding scheme using HTML files is presented. The proposed method embeds secret data in an HTML file by using special space codes for between-word location segments. The main components include special space codes, a character coding table, and embedding rules. In order to improve the embedding capacity of webpage data hiding, we observed the source code in HTML files and found that codes “ ” and “ ” can also be used for concealing secret data in HTML files. Fig. 2 shows an example by testing different codes represented in Microsoft Internet Explorer. We added the special space codes “ ” and “ ” to generate a new index table, shown in Table 2.
Fig. 2. Example of the special space codes in an HTML file
Table 2. Index coding table
The proposed scheme is inspired by the concept of special space codes [11] and uses word segments to conceal secret data in the content sentences of an HTML file. Here, a word segment is obtained by dividing a sentence into several segments, each of which contains two between-word locations. For example, “The unanimous Declaration of the thirteen United States of America” is a sentence from the United States Declaration of Independence. The sentence can be divided into five segments: “The unanimous ”, “Declaration of ”, “the thirteen ”, “United States ”, and “of America”.
The character codes used for embedding are summarized in Table 3; they are selected from the ASCII codes 32 to 126. A new index number is assigned to every character. For instance, the index “30” is assigned to the character ‘A’. Note that the special character ‘se code’ is used to mark the start or the end of an embedded secret message. Because of the ‘se code’ markers, there is no need to record the total length of the secret message embedded in an HTML file.
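The segmentation described above can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function name `split_into_segments` is hypothetical, and the sketch assumes that words are separated by single spaces.

```python
def split_into_segments(sentence: str) -> list[str]:
    """Split a sentence into segments of two words each, keeping trailing spaces."""
    words = sentence.split(" ")
    segments = []
    for start in range(0, len(words), 2):
        text = " ".join(words[start:start + 2])
        is_last = start + 2 >= len(words)
        # Every segment except the last ends with the space that follows it.
        segments.append(text if is_last else text + " ")
    return segments


sentence = "The unanimous Declaration of the thirteen United States of America"
segments = split_into_segments(sentence)
for segment in segments:
    print(repr(segment))
```

Joining the segments reproduces the original sentence, so the segmentation itself does not alter the cover text.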
Table 3. Character codes table
The key steps of data embedding are as follows. First, the secret data is encrypted with any encryption algorithm (e.g., DES, AES, or RSA). The cipher text is then converted to indices according to Table 3. Next, each space in a segment is encoded with the special space code that corresponds to the index of the cipher message character. For example, the index of the cipher message character ‘k’ is 69 (see Table 3); when it is embedded into the first segment “The unanimous ”, the encoded HTML code becomes “The unanimous ” (see Table 2).
For ease of explanation, let the cipher message be represented as M = {mk | k = 1,2,…, Nm}, where Nm is the number of characters in M. Further, S = {si | i = 1,2,…, NS} represents the set of segments in a cover HTML file H, where NS is the total number of segments in H. Every si contains two between-word locations, denoted sij where j ∈ {0, 1}. Table 4 summarizes the notation used in the proposed method.
Table 4. The definition of notations
In the proposed method, every segment can be used to conceal one character mk. The following is the embedding procedure:
Data Embedding Procedure:
Input: A cover HTML file H and cipher secret messages M.
Output: A stego HTML file H’.
Step 1: Let i = 1 and k = 1, where i = 1,2, …, NS and k = 1,2, …, Nm.
Step 2: Segment cover text into non-overlapping segments such that every segment contains two between-word locations.
Step 3: Encode the “se code” into the first segment by replacing s10 with “ ” and s11 with “ ”, respectively. Then set i = i + 1.
Step 4: Encode si0 and si1 by using the codes in the special coding table (i.e., Table 2) that correspond to the tens digit and the units digit of the index of mk, respectively.
Step 5: If all of M has been embedded in H, then embed the “se code” into the next segment to mark the end of M and go to Step 6; otherwise, set i = i + 1 and k = k + 1 and go to Step 4.
Step 6: Generate the remaining part of HTML code with no change. Output the stego HTML file H’.
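As a rough illustration of the embedding procedure above, the following Python sketch embeds a message into the between-word spaces of a plain-text sentence. It is not the authors' implementation. The contents of Table 2 and Table 3 are not reproduced here, so `SPACE_CODES` (ten numeric character references to U+0020, indexed by decimal digit), the mapping `char_index` (ASCII code minus 32), and the marker value `SE_INDEX = 99` are all hypothetical stand-ins. The sketch also assumes that the message has already been encrypted, that words are separated by single spaces, and that the cover text contains no HTML tags.

```python
import html

# Hypothetical stand-in for Table 2: one space representation per decimal digit.
SPACE_CODES = [" ", "&#32;", "&#x20;", "&#032;", "&#x020;",
               "&#0032;", "&#x0020;", "&#00032;", "&#x00020;", "&#000032;"]
# Hypothetical index for the "se code" start/end marker.
SE_INDEX = 99


def char_index(ch: str) -> int:
    """Hypothetical stand-in for Table 3: printable ASCII 32..126 -> 0..94."""
    code = ord(ch)
    if not 32 <= code <= 126:
        raise ValueError(f"character {ch!r} is outside ASCII 32..126")
    return code - 32


def embed(cover_text: str, cipher_text: str) -> str:
    """Encode each index as two space codes (tens digit, then units digit)."""
    words = cover_text.split(" ")
    locations = len(words) - 1
    indices = [SE_INDEX] + [char_index(c) for c in cipher_text] + [SE_INDEX]
    digits = [d for index in indices for d in divmod(index, 10)]
    if len(digits) > locations:
        raise ValueError("cover text has too few between-word locations")
    pieces = []
    for pos, word in enumerate(words):
        pieces.append(word)
        if pos < locations:
            # Locations beyond the payload keep an ordinary space (Step 6).
            pieces.append(SPACE_CODES[digits[pos]] if pos < len(digits) else " ")
    return "".join(pieces)


cover = "The unanimous Declaration of the thirteen United States of America"
stego = embed(cover, "k")
rendered = html.unescape(stego)  # text after character references are resolved
print(stego)
```

Because every code in this hypothetical table refers to the same space character, resolving the character references in `stego` gives back the cover sentence.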
A receiver can extract the secret data from H’ by using the extracting procedure. The extracted data is a cipher message; thus, the receiver needs to decrypt it with a pre-shared key to obtain the original message. From this point of view, it will be very challenging for an unauthorized user to determine the hidden message without the decryption key, even if the message in H’ can be extracted. The following is the extracting procedure:
The Extracting Procedure:
Input: A stego HTML file H’.
Output: Cipher message M.
Step 1: Let i = 1 and k = 1, where i = 1,2, …, NS and k = 1,2, …, Nm.
Step 2: Parse the H’ to find the first two codes corresponding to “ ” and “ ” to mark the beginning of the secret message. Set i = i + 1.
Step 3: Parse the remaining code to find the next two adjacent codes contained in Table 2.
Step 4: Treat the two codes as the tens digit and the units digit of an index, and extract the secret character mk by looking up that index in Table 3.
Step 5: If mk is the “se code”, then all of the secret characters have been extracted; go to Step 6. Otherwise, set i = i + 1 and k = k + 1 and go to Step 3.
Step 6: Concatenate the extracted secret characters and output message.
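The extracting procedure can be sketched in the same spirit. Again, this is an illustrative sketch under assumptions rather than the authors' code: `SPACE_CODES`, `SE_INDEX = 99`, and the index mapping (index plus 32 gives the ASCII code) are hypothetical stand-ins for Table 2 and Table 3, and the input is assumed to be plain text whose first between-word location starts the first segment.

```python
import re

# Hypothetical stand-in for Table 2: one space representation per decimal digit.
SPACE_CODES = [" ", "&#32;", "&#x20;", "&#032;", "&#x020;",
               "&#0032;", "&#x0020;", "&#00032;", "&#x00020;", "&#000032;"]
# Hypothetical index for the "se code" start/end marker.
SE_INDEX = 99

CODE_TO_DIGIT = {code: digit for digit, code in enumerate(SPACE_CODES)}
# Try longer codes first so that a plain space is matched only as a last resort.
CODE_PATTERN = re.compile(
    "|".join(re.escape(code) for code in sorted(SPACE_CODES, key=len, reverse=True))
)


def extract(stego_text: str) -> str:
    """Recover the cipher message located between the two "se code" markers."""
    digits = [CODE_TO_DIGIT[m.group(0)] for m in CODE_PATTERN.finditer(stego_text)]
    # Each segment holds two locations: tens digit, then units digit.
    indices = [10 * tens + units for tens, units in zip(digits[0::2], digits[1::2])]
    if SE_INDEX not in indices:
        raise ValueError("start marker not found")
    chars = []
    for index in indices[indices.index(SE_INDEX) + 1:]:
        if index == SE_INDEX:
            return "".join(chars)
        if index > 94:
            raise ValueError(f"index {index} is not in the character table")
        # Hypothetical stand-in for Table 3: index 0..94 -> ASCII 32..126.
        chars.append(chr(index + 32))
    raise ValueError("end marker not found")


stego = ("The&#000032;unanimous&#000032;Declaration&#00032;of&#0032;"
         "the&#000032;thirteen&#000032;United States of America")
message = extract(stego)
print(message)
```

The receiver would still need to decrypt `message` with the pre-shared key to obtain the original secret data.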
In order to evaluate the performance of the proposed method in terms of embedding capacity, Sui and Luo’s method [18], Yang and Yang’s method [22], Lee and Tsai’s method [11], and the proposed method were implemented using Octave. The embedding capacity is measured by counting the total number of secret characters embedded in a stego HTML file. We use eleven U.S. presidential inaugural addresses (listed in Table 5) as the cover HTML files and the “US Declaration of Independence” (63,456 characters in total) as the secret message.
The proposed method embeds eight bits in every segment; that is, each between-word location conceals four secret bits. The proposed method therefore conceals one more secret bit than Lee and Tsai’s method at every between-word location. The embedding capacities of both the proposed method and Lee and Tsai’s method depend on the number of between-word locations. Fig. 3 illustrates the performance comparison. The experimental results show that the proposed method can conceal more secret data than Lee and Tsai’s method.
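The per-location figures above translate into a simple capacity estimate. The following Python sketch is only a back-of-the-envelope illustration: the function names are hypothetical, the three bits per location for Lee and Tsai’s method follows from the "one more secret bit" statement above, and reserving two segments for the start and end markers is an assumption about how the ‘se code’ overhead is counted.

```python
def proposed_capacity_bits(locations: int, marker_segments: int = 2) -> int:
    """Eight bits per two-location segment, minus segments used by 'se code' markers."""
    payload_segments = max(locations // 2 - marker_segments, 0)
    return payload_segments * 8


def lee_tsai_capacity_bits(locations: int) -> int:
    """Three bits per between-word location."""
    return locations * 3


locations = 1000  # hypothetical number of between-word locations in a cover file
proposed_bits = proposed_capacity_bits(locations)
lee_tsai_bits = lee_tsai_capacity_bits(locations)
print(proposed_bits, lee_tsai_bits)
```

Under these assumptions, the marker overhead is a fixed cost, so the advantage of the proposed method grows with the number of between-word locations in the cover file.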
Table 5. Eleven cover HTML files
Fig. 3. The capacity comparison of the proposed method and Lee and Tsai’s method
Fig. 4 demonstrates the stego HTML generated by the proposed method. The experimental results show that it would be very challenging for a user to distinguish the difference between the original HTML and the stego HTML by using only the human eye.
Fig. 4. The visual quality of cover HTML and the source code
Fig. 5 shows the browsing results by using other popular browsers, Firefox and Google Chrome. From Fig. 5, it is hard to distinguish the difference by using the human eye.
Fig. 5. The stego HTML browsed by different browsers (2013 USA President inaugural address)
Sui and Luo’s method uses the case of tag letters to conceal secret data. Thus, the embedding capacity of Sui and Luo’s method is limited by the number of tags. To carry more secret data, using more redundant tags is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally a tag contains 4 letters, then it requires 300 * 8 / 4 = 600 tags to conceal the secret data.
On the other hand, Yang and Yang’s method uses different quotation types to conceal secret data. Thus, the embedding capacity of Yang and Yang’s method may also be limited by the number of attribute settings. To carry more secret data, using more redundant attributes is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally an attribute setting can conceal one secret bit, then it requires 300 * 8 = 2400 attribute settings to conceal the secret data.
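The two worked examples above follow the same arithmetic, which can be sketched in Python as follows. The function name `carriers_needed` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def carriers_needed(num_chars: int, bits_per_carrier: int, bits_per_char: int = 8) -> int:
    """Number of tags or attribute settings required to carry the secret data."""
    return math.ceil(num_chars * bits_per_char / bits_per_carrier)


# Sui and Luo's method: a 4-letter tag carries 4 bits (one bit per letter case).
tags_needed = carriers_needed(300, bits_per_carrier=4)
# Yang and Yang's method: one attribute setting carries 1 bit.
attributes_needed = carriers_needed(300, bits_per_carrier=1)
print(tags_needed, attributes_needed)
```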
Ideally, a stego HTML file would leave every space character unchanged; however, that is impossible when the secret data is carried by the space codes. Our goal is therefore to increase both the amount of embedded secret data and the number of ordinary “space” characters that are preserved. The proposed method conceals secret data by using the index coding table together with the character coding table; thus, how often each index code is used depends strongly on the secret content. Fig. 6 shows the histogram of the index codes used to conceal the secret data. As can be seen, the “type space” code is used more often than the others.
Fig. 6. The count of special codes embedded in test cover HTML files using the proposed method
Fig. 7 shows the histogram of special codes embedded in the stego HTML by using Lee and Tsai’s method. From Fig. 7, it can be seen that the frequency with which the “space” character is used to conceal secret data is similar to that of the other codes.
Fig. 7. The count of special codes embedded into test cover HTML files using Lee and Tsai’s method
Fig. 8 compares how frequently the “space” character is used by the proposed method and by Lee and Tsai’s method. From this point of view, the proposed method not only increases the embedding capacity but also increases the number of preserved “space” characters. In addition, the proposed character coding table can be permuted and pre-shared between the sender and the receiver. In other words, the proposed method is more secure than Lee and Tsai’s method.
Fig. 9 shows the run time comparison. The proposed method takes more time to conceal secret data than Lee and Tsai’s method, because it must scan the character coding table to determine the corresponding tens digits and units digits. However, given the computational power of web servers, the run time of the proposed method is still acceptable.
Fig. 8. The comparison of the count of “Type space” between the proposed method and Lee and Tsai’s method
Fig. 9. The run time comparison between the proposed method and Lee and Tsai’s method
Employing a web page as a cover medium is a good method of secret message delivery, because the web page is a very popular way of sharing knowledge and advertising a company’s information. According to the property of encoding English sentences in a web page, the type space can be represented by several different special codes. Lee and Tsai’s method uses eight different special codes to represent the “type space” and conceal the secret message. The proposed method described here improves the embedding capacity of web page data hiding. The proposed method utilizes eleven special space codes and sentence segmentation to increase the embedding capacity. In the proposed method, every between-word location can conceal one more secret bit than in Lee and Tsai’s method.