Data Hiding for HTML Files Using Character Coding Table and Index Coding Table

Chou, Yung-Chen; Hsu, Ping-Kun; Lin, Iuon-Chang

doi:10.3837/tiis.2013.11.021

Lee and Tsai presented a data hiding scheme for HTML files by using different space codes to represent a space in an HTML file [11]. The main idea of Lee and Tsai’s method is to collect all the codes that represent a white space in the Microsoft Internet Explorer browser and then use different codes to conceal secret data. Table 1 summarizes the white space codes. The key steps of Lee and Tsai’s method are as follows: First, collect all space character sequences and transform them into 8-bit ASCII codes. Then encode the space character by considering its corresponding secret bits. Lee and Tsai’s method has good performance in terms of embedding capacity. However, the embedding capacity of data hiding in a webpage can be further improved.

3. The Proposed Method

3.1 The Embedding Phase

In this paper a novel data hiding scheme using HTML files is presented. The proposed method embeds secret data in an HTML file by using special space codes for between-word location segments. The main components include special space codes, a character coding table, and embedding rules. In order to improve the embedding capacity of webpage data hiding, we observed the source code in HTML files and found that codes “ ” and “ ” can also be used for concealing secret data in HTML files. Fig. 2 shows an example by testing different codes represented in Microsoft Internet Explorer. We added the special space codes “ ” and “ ” to generate a new index table, shown in Table 2.

The proposed scheme is inspired by the special space codes of [11] and uses the concept of word segments to conceal secret data in the content sentences of an HTML file. Here, word segmentation means dividing a sentence into several segments, each of which contains two between-word locations. For example, “The unanimous Declaration of the thirteen United States of America” is a sentence from the United States Declaration of Independence. The sentence can be divided into five segments: “The unanimous ”, “Declaration of ”, “the thirteen ”, “United States ”, and “of America”.
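The segmentation described above can be sketched in Python as follows. This is a minimal illustration, assuming that words are separated by single spaces and that the last segment may lack its trailing between-word location; the function name `split_into_segments` is hypothetical and does not come from the paper.

```python
def split_into_segments(sentence: str) -> list[str]:
    """Split a sentence into two-word segments (assumed single-space separators)."""
    words = sentence.split()
    segments = []
    for i in range(0, len(words), 2):
        segment = " ".join(words[i:i + 2])
        # Every segment except the last keeps its trailing between-word space,
        # so that it carries two between-word locations.
        if i + 2 < len(words):
            segment += " "
        segments.append(segment)
    return segments


sentence = "The unanimous Declaration of the thirteen United States of America"
segments = split_into_segments(sentence)
```

For the example sentence, `segments` holds the five segments listed in the text.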

Before embedding, each character is assigned a code from the character coding table summarized in Table 3, which covers the characters with ASCII codes from 32 to 126. A new index number is assigned to every character. For instance, index “30” is assigned to the character ‘A’. Note that the special ‘se code’ entry is used to mark the start and the end of an embedded secret message. Because of the ‘se code’ markers, there is no need to record the total length of the secret message embedded in an HTML file.

The key steps of data embedding are as follows. First, the secret data is encrypted with an encryption algorithm (e.g., DES, AES, or RSA). The cipher text is then converted into indices according to the symbols in Table 3. Next, the spaces in each segment are encoded with the special space codes that correspond to the index of the cipher message character. For example, the index of the cipher message character ‘k’ is 69 (referring to Table 3). When it is embedded into the first segment “The unanimous ”, the encoded HTML code becomes “The unanimous ” (referring to Table 2).
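The step from a character index to the two digits that select the space codes can be sketched as below. Table 3 is not reproduced here, so `KNOWN_INDICES` contains only the two entries stated in the text (‘A’ is 30 and ‘k’ is 69); the names `KNOWN_INDICES` and `index_to_digits` are hypothetical.

```python
# Only the two table entries quoted in the text; the full Table 3 is not assumed.
KNOWN_INDICES = {"A": 30, "k": 69}


def index_to_digits(index: int) -> tuple[int, int]:
    """Split a two-digit index into (tens digit, units digit)."""
    if not 0 <= index <= 99:
        raise ValueError("index must have at most two decimal digits")
    return divmod(index, 10)


tens, units = index_to_digits(KNOWN_INDICES["k"])  # tens = 6, units = 9
```

The tens digit would select the code for the first between-word location of a segment and the units digit the code for the second.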

For ease of explanation, let the cipher message be represented as M = {mk | k = 1,2,…, Nm}, where Nm is the number of characters in M. Further, S = {si | i = 1,2,…, NS} represents the set of segments in a cover HTML file H, where NS is the total number of segments in H. Every si contains two between-word locations, denoted as sij where j ∈ {0, 1}. Table 4 summarizes the notations used in the proposed method.

In the proposed method, every segment can be used to conceal one character mk. The following is the embedding procedure:

Step 2: Segment the cover text into non-overlapping segments such that every segment contains two between-word locations.

Step 3: Encode the “se code” into first segment by replacing s10 as “ ” and s11 as “ ”, respectively.

Step 4: Encode si0 and si1 by using the codes in the special coding table (i.e., Table 2) that correspond to the tens digit and the units digit of mk’s index, respectively.

Step 5: If M has been embedded in H, then embed the “se code” into a segment to mark the end of M and go to Step 6,

Step 6: Generate the remaining part of HTML code with no change. Output the stego HTML file H’.
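The embedding steps can be sketched in Python as follows. This is an illustration under loudly stated assumptions, not the authors' implementation: `DIGIT_CODES` is a placeholder for Table 2 (one numeric character reference per decimal digit), `CHAR_INDEX` is a placeholder for Table 3 (printable ASCII 32 to 126 mapped to 1 to 95), and `SE_INDEX = 0` is a placeholder for the “se code”. None of these assignments is taken from the paper, and the encryption step is omitted, so `cipher_text` is assumed to be already encrypted.

```python
# Placeholder for Table 2: one space-like character reference per decimal digit.
DIGIT_CODES = ["&#32;", "&#160;", "&#8194;", "&#8195;", "&#8196;",
               "&#8197;", "&#8198;", "&#8199;", "&#8200;", "&#8201;"]
# Placeholder for Table 3: printable ASCII 32..126 mapped to indices 1..95.
CHAR_INDEX = {chr(c): c - 31 for c in range(32, 127)}
SE_INDEX = 0  # placeholder index for the "se code" start/end marker


def embed(cover_text: str, cipher_text: str) -> str:
    """Replace between-word spaces with digit codes, two per character."""
    words = cover_text.split(" ")
    # Start marker, one index per cipher character, end marker.
    indices = [SE_INDEX] + [CHAR_INDEX[ch] for ch in cipher_text] + [SE_INDEX]
    # Each index occupies one segment: tens digit, then units digit.
    digits = [d for index in indices for d in divmod(index, 10)]
    if len(digits) > len(words) - 1:
        raise ValueError("cover text has too few between-word locations")
    out = [words[0]]
    for pos, word in enumerate(words[1:]):
        # Locations after the end marker keep their plain space unchanged.
        separator = DIGIT_CODES[digits[pos]] if pos < len(digits) else " "
        out.append(separator + word)
    return "".join(out)


stego = embed("a b c d e f g h", "k")
```

With these placeholder tables, ‘k’ has index 76 rather than the 69 of Table 3, which illustrates that the stego output depends entirely on the tables shared by sender and receiver.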

3.2 The Extracting Phase

A receiver can extract the secret data from H’ by using the extracting procedure. The extracted data is a cipher message; thus, the receiver needs to decrypt it with a pre-shared key to obtain the original message. From this point of view, it will be very challenging for an unintended user to determine the hidden message without the decryption key, even if the message in H’ can be extracted. The following is the extracting procedure:

Step 2: Parse the H’ to find the first two codes corresponding to “ ” and “ ” to mark the beginning of the secret message. Set i = i + 1.

Step 3: Parse the remaining code to find the next two adjacent codes contained in Table 2.

Step 5: If mk is the “se code”, then all secret characters have been extracted; go to Step 6. Otherwise, set i = i + 1 and k = k + 1, and go to Step 3.
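The extracting steps can be sketched in Python as follows. The sketch rests on assumptions that are not from the paper: `DIGIT_CODES` stands in for Table 2, `INDEX_CHAR` stands in for the inverse of Table 3 (indices 1 to 95 mapped to printable ASCII 32 to 126), `SE_INDEX = 0` stands in for the “se code”, and the stego text is assumed to contain no other numeric character references before the start marker. Decryption of the recovered cipher message is omitted.

```python
import re

# Placeholder for Table 2: one space-like character reference per decimal digit.
DIGIT_CODES = ["&#32;", "&#160;", "&#8194;", "&#8195;", "&#8196;",
               "&#8197;", "&#8198;", "&#8199;", "&#8200;", "&#8201;"]
CODE_TO_DIGIT = {code: digit for digit, code in enumerate(DIGIT_CODES)}
# Placeholder for the inverse of Table 3: indices 1..95 to printable ASCII.
INDEX_CHAR = {c - 31: chr(c) for c in range(32, 127)}
SE_INDEX = 0  # placeholder index for the "se code" start/end marker


def extract(stego_text: str) -> str:
    """Recover the cipher message delimited by two "se code" markers."""
    codes = re.findall(r"&#\d+;", stego_text)
    digits = [CODE_TO_DIGIT[c] for c in codes if c in CODE_TO_DIGIT]
    # Pair consecutive digits as (tens, units) to rebuild each index.
    indices = [10 * t + u for t, u in zip(digits[0::2], digits[1::2])]
    if not indices or indices[0] != SE_INDEX:
        raise ValueError("start marker not found")
    chars = []
    for index in indices[1:]:
        if index == SE_INDEX:
            return "".join(chars)  # end marker reached
        chars.append(INDEX_CHAR[index])
    raise ValueError("end marker not found")


message = extract("a&#32;b&#32;c&#8199;d&#8198;e&#32;f&#32;g h")
```

Because the end marker terminates the loop, the receiver does not need to know the message length in advance, which matches the role of the “se code” described earlier.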

4. Experiment Result and Analysis

4.1 Experimental Results

In order to evaluate the performance of the proposed method in terms of embedding capacity, Sui and Luo’s method [18], Yang and Yang’s method [22], Lee and Tsai’s method [11], and the proposed method were implemented using Octave. The embedding capacity is measured by counting the total number of secret characters embedded in a stego HTML file. Eleven U.S. presidential inaugural addresses (listed in Table 5) serve as the cover HTML files, and the “US Declaration of Independence” (63,456 characters in total) serves as the secret message.

The proposed method embeds eight bits in every segment, which means that each between-word location conceals four secret bits. The proposed method therefore conceals one more secret bit per between-word location than Lee and Tsai’s method. The embedding capacity of both the proposed method and Lee and Tsai’s method depends on the number of between-word locations. Fig. 3 illustrates the performance comparison. The experimental results show that the proposed method can conceal more secret data than Lee and Tsai’s method.
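The per-location figures above translate into a simple capacity estimate, sketched below. The count of 1,000 between-word locations is a hypothetical input, not a value from the experiments, and the overhead of the two “se code” segments is ignored; the function and constant names are likewise hypothetical.

```python
# Eight bits per segment, two between-word locations per segment.
PROPOSED_BITS_PER_LOCATION = 8 // 2
# The text states that Lee and Tsai's method hides one bit fewer per location.
LEE_TSAI_BITS_PER_LOCATION = PROPOSED_BITS_PER_LOCATION - 1


def capacity_bits(between_word_locations: int, bits_per_location: int) -> int:
    """Estimated capacity in bits, ignoring start/end marker overhead."""
    return between_word_locations * bits_per_location


locations = 1000  # hypothetical cover text size
proposed = capacity_bits(locations, PROPOSED_BITS_PER_LOCATION)
lee_tsai = capacity_bits(locations, LEE_TSAI_BITS_PER_LOCATION)
```

Under these assumptions, the capacity ratio between the two methods is 4 to 3 for any cover text, since both scale linearly with the number of between-word locations.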

Fig. 3. The capacity comparison of the proposed method and Lee and Tsai’s method

Fig. 4 demonstrates the stego HTML generated by the proposed method. The experimental results show that it would be very challenging for a user to distinguish the difference between the original HTML and the stego HTML by using only the human eye.

Fig. 5 shows the browsing results by using other popular browsers, Firefox and Google Chrome. From Fig. 5, it is hard to distinguish the difference by using the human eye.

Fig. 5. The stego HTML browsed by different browsers (2013 USA President inaugural address)

Sui and Luo’s method uses the case of tag letters to conceal secret data. Thus, the embedding capacity of Sui and Luo’s method is limited by the number of tags. To carry more secret data, using more redundant tags is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally a tag contains 4 letters, then it requires 300 * 8 / 4 = 600 tags to conceal the secret data.

On the other hand, Yang and Yang’s method uses different quotation types to conceal secret data. Thus, the embedding capacity of Yang and Yang’s method may also be limited by the number of attribute settings. To carry more secret data, using more redundant attributes is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally an attribute setting can conceal one secret bit, then it requires 300 * 8 = 2400 attribute settings to conceal the secret data.
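The two worked examples above follow the same arithmetic, which can be sketched as a small helper. The function name `carriers_needed` and its parameters are hypothetical; the inputs (300 characters, 8 bits per character, 4 bits per tag, 1 bit per attribute setting) are the ones used in the text.

```python
def carriers_needed(num_chars: int, bits_per_carrier: int,
                    bits_per_char: int = 8) -> int:
    """Number of carriers (tags or attribute settings) needed for a message."""
    total_bits = num_chars * bits_per_char
    # Ceiling division: a partly used carrier still has to be present.
    return -(-total_bits // bits_per_carrier)


tags_needed = carriers_needed(300, bits_per_carrier=4)        # Sui and Luo
attributes_needed = carriers_needed(300, bits_per_carrier=1)  # Yang and Yang
```

The results reproduce the 600 tags and 2400 attribute settings quoted above.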

4.2 Experimental Analysis

The ideal situation for a stego HTML file is one in which none of the blank characters change. However, that is impossible when data is embedded. Our goal is therefore to increase both the amount of embedded secret data and the number of preserved “space” characters. The proposed method conceals secret data by using the index coding table together with the character coding table; thus, the utilization of the index coding codes is highly dependent on the secret content. Fig. 6 shows the histogram of the index coding codes used to conceal the secret data. As can be seen, the count of the “type space” code is higher than that of the other codes.

Fig. 6. The count of special codes embedded in the test cover HTML files using the proposed method

Fig. 7 shows the histogram of the special codes embedded in the stego HTML by Lee and Tsai’s method. From Fig. 7, it can be seen that the frequency of the “space” character used to conceal secret data is similar to that of the other codes.

Fig. 7. The count of special codes embedded in the test cover HTML files using Lee and Tsai’s method

Fig. 8 compares how often the “space” character is used by the proposed method and by Lee and Tsai’s method. From this point of view, the proposed method not only increases the embedding capacity but also increases the number of preserved “space” characters. In addition, the proposed character coding table can be permuted and pre-shared between the sender and the receiver. In other words, the proposed method is more secure than Lee and Tsai’s method.
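One way to picture a permuted character coding table is sketched below. The paper states only that the table can be permuted and pre-shared; deriving the permutation from a shared integer seed, the index range 1 to 95, and the name `permuted_table` are all assumptions made for this illustration. Python's `random` module is not cryptographically secure, so this shows the idea rather than a secure construction.

```python
import random


def permuted_table(shared_seed: int) -> dict[str, int]:
    """Map printable ASCII 32..126 to a seed-dependent permutation of 1..95."""
    chars = [chr(c) for c in range(32, 127)]
    indices = list(range(1, len(chars) + 1))
    # A dedicated generator keeps the shuffle reproducible for a given seed.
    random.Random(shared_seed).shuffle(indices)
    return dict(zip(chars, indices))


sender_table = permuted_table(2013)
receiver_table = permuted_table(2013)
```

Both parties obtain the same table from the same seed, while a party without the seed faces one of 95! possible assignments.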

Fig. 9 shows the run time comparison. The proposed method takes more time to conceal secret data than Lee and Tsai’s method because it must scan the character coding table to determine the corresponding tens digits and units digits. However, given the computational power of web servers, the run time of the proposed method is still acceptable.

Fig. 8. The comparison of the count of “type space” between the proposed method and Lee and Tsai’s method

Fig. 9. The run time comparison between the proposed method and Lee and Tsai’s method

5. Conclusion

Employing a web page as a cover medium is a good method of secret message delivery, because the web page is a very popular way of sharing knowledge and advertising a company’s information. According to the property of encoding English sentences in a web page, the type space can be represented by several different special codes. Lee and Tsai’s method uses eight different special codes to represent the “type space” and conceal the secret message. The proposed method described here improves the embedding capacity of web page data hiding. The proposed method utilizes eleven special space codes and sentence segmentation to increase the embedding capacity. In the proposed method, every between-word location can conceal one more secret bit than in Lee and Tsai’s method.

References

M. Barni and F. Bartolini, "Data Hiding for Fighting Piracy," IEEE Signal Processing Magazine, vol. 21, no. 2, 2004, pp. 28-39.
C.C. Chang, C.C. Wu, and I.C. Lin, "A Data Hiding Method for Text Documents Using Multiple-Base Encoding," High Performance Networking, Computing, Communication Systems, and Mathematical Foundations, (Yanwen Wu, Qi Luo Eds.), Springer-Verlag Berlin Heidelberg, Sanya, Hainan Island, China, vol. 66, 2010, pp. 101-109. https://doi.org/10.1007/978-3-642-11618-6_15
C. Chen, S.Z. Wang, and X.P. Zhang, "Information Hiding in Text Using Typesetting Tools with Stego-Encoding," in Proc. of the First International Conference on Innovative Computing, Information and Control, Beijing, China, vol. 1, Aug. 2006, pp. 459-462.
I.J. Cox, J. Kilian, F. Thomson Leighton, and T. Shamoon, "Secure Spread Spectrum Watermarking for Multimedia," IEEE Transactions on Image Processing, vol. 6, no. 12, 1997, pp. 1673-1687. https://doi.org/10.1109/83.650120
S. Dey, H. Al-Qaheri, and S. Sanyal, "Embedding Secret Data in HTML Web Page," Image Processing & Communications Challenges, (Ryszard S. Choraś, Antoni Zabludowski Eds.), Academy Publishing House EXIT Warsaw, 2009, pp. 474-481.
M. Grosvald and C. Orhan Orgun, "Free from the Cover Text: A Human-generated Natural Language Approach to Text-based Steganography," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 2, 2011, pp. 133-141.
H.J. Huang, X.M. Sun, Z.S. Li, and G. Sun, "Detection of Hidden Information in Webpage," in Proc. of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Haikou, China, vol. 4, Aug. 2007, pp. 317-321.
H.J. Huang, S.H. Zhong, and X.M. Sun, "An Algorithm of Webpage Information Hiding Based on Attributes Permutation," in Proc. of the Fourth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, China, Aug. 2008, pp. 257-260.
Y.W. Kim, K.A. Moon, and I.S. Oh, "A Text Watermarking Algorithm Based on Word Classification and Inter-word Space Statistics," in Proc. of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, Scotland, Aug. 2003, pp. 775-779.
I.S. Lee and W.H. Tsai, "Data Hiding in Emails and Applications Using Unused ASCII Control Codes," Journal of Information Technology and Applications, vol. 3, no. 1, 2008, pp. 13-24.
I.S. Lee and W.H. Tsai, "Secret Communication through Web Pages Using Special Space Codes in HTML Files," International Journal of Applied Science and Engineering, vol. 6, no. 2, 2008, pp. 141-149.
I.S. Lee and W.H. Tsai, "A New Approach to Covert Communication via PDF Files," Signal Processing, vol. 90, no. 2, 2010, pp. 557-565. https://doi.org/10.1016/j.sigpro.2009.07.022
I.S. Lee and W.H. Tsai, "Security Protection of Software Programs by Information Sharing and Authentication Techniques Using Invisible ASCII Control Codes," International Journal of Network Security, vol. 10, no. 1, 2010, pp. 1-10.
B. Li, J. He, J. Huang, and Y.Q. Shi, "A Survey on Image Steganography and Steganalysis," Journal of Information Hiding and Multimedia Signal Processing, vol. 2, no. 2, 2011, pp. 142-172.
I.C. Lin and P.K. Hsu, "A Data Hiding Scheme on Word Documents Using Multiple-base Notation System," in Proc. of the Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Darmstadt, Germany, Oct. 2010, pp. 31-33.
T.Y. Liu and W.H. Tsai, "A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique," IEEE Transactions on Information Forensics and Security, vol. 2, no. 1, 2007, pp. 24-30. https://doi.org/10.1109/TIFS.2006.890310
M.A. Qadir and I. Ahmad, "Digital Text Watermarking: Secure Content Delivery and Data Hiding in Digital Documents," IEEE Aerospace and Electronic Systems Magazine, vol. 21, no. 11, 2006, pp. 18-21.
X.G. Sui and H. Luo, "A New Steganography Method Based on Hypertext," in Proc. of Asia-Pacific Radio Science Conference, Qingdao, China, Aug. 2004, pp. 181-184.
X.M. Sun, G. Lou, and H.J. Huang, "Component-based Digital Watermarking of Chinese Texts," in Proc. of the Third International Conference on Information Security, Shanghai, China, Nov. 2004, vol. 85, pp. 76-81.
Z.H. Wang, C.C. Chang, C.C. Lin, and M.C. Li, "A Reversible Information Hiding Scheme Using Left-Right and Up-Down Chinese Character Representation," Journal of Systems and Software, vol. 82, no. 8, 2009, pp. 1362-1369. https://doi.org/10.1016/j.jss.2009.04.045
S. Weng, Y. Zhao, J.S. Pan, and R. Ni, "Reversible Watermarking based on Invariability and Adjustment on Pixel Pairs," IEEE Signal Processing Letters, vol. 15, 2008, pp. 721-724. https://doi.org/10.1109/LSP.2008.2001984
Y.J. Yang and Y.M. Yang, "An Efficient Webpage Information Hiding Method Based on Tag Attributes," in Proc. of the Seventh International Conference on Fuzzy Systems and Knowledge Discovery, Yantai, China, Aug. 2010, pp. 1181-1184.
X.P. Zhang and S.Z. Wang, "Steganography Using Multiple-Base Notational System and Human Vision Sensitivity," IEEE Signal Processing Letters, vol. 12, no. 1, 2005, pp. 67-70. https://doi.org/10.1109/LSP.2004.838214
Y. Zhao, R. Ni, and Z. Zhu, "RST Transforms Resistant Image Watermarking based on Centroid and Sector-shaped Partition," Science in China: Series F Information Science, vol. 55, no. 3, 2012.
S.P. Zhong, X.Q. Cheng, and T.R. Chen, "Data Hiding in a Kind of PDF Texts for Secret Communication," International Journal of Network Security, vol. 4, no. 1, 2007, pp. 17-26.