Watermarking Schemes Evaluation

Fabien A. P. Petitcolas, Microsoft Research

Digital watermarking has been presented as a solution to copy protection of multimedia objects, and dozens of schemes and algorithms have been proposed. Two main problems seriously darken the future of this technology, though. Firstly, the large number of attacks and weaknesses, which appear as fast as new algorithms are proposed, emphasizes the limits of this technology and in particular the fact that it may not match users' expectations. Secondly, the requirements, tools and methodologies to assess the current technologies are almost non-existent. The lack of benchmarking of current algorithms is blatant. This confuses rights holders as well as software and hardware manufacturers and prevents them from using the solution appropriate to their needs. Indeed, basing long-lived protection schemes on badly tested watermarking technology does not make sense. In this paper we discuss how one could solve the second problem by having a public benchmarking service, and we examine the challenges behind such a service.

I. INTRODUCTION

Digital watermarking remains a largely untested field, and only very few large industrial consortiums have published requirements against which watermarking algorithms should be tested [1,2]. For instance, the International Federation of the Phonographic Industry led one of the first large-scale comparative tests of watermarking algorithms for audio. In general, a number of broad claims have been made about the ‘robustness’ of various digital watermarking or fingerprinting methods, but very few researchers or companies have published extensive tests on their systems.
The growing number of attacks against watermarking systems (e.g., [3, 4, 5]) has shown that far more research is required to improve the quality of existing watermarking methods so that, for instance, the coming JPEG 2000 (and new multimedia standards) can be more widely used within electronic commerce applications.

We already pointed out in [6] that most papers have used their own limited series of tests, their own pictures and their own methodology, and that consequently comparison was impossible without re-implementing the methods and testing them separately. But then the implementation might be very different from, and probably weaker than, that of the original authors. This led us to suggest that methodologies for evaluating existing watermarking algorithms were urgently required, and we proposed a simple benchmark for still image marking algorithms.

With a common benchmark, authors and watermarking software providers would just need to provide a more or less detailed table of results, which would give a good and reliable summary of the performance of the proposed scheme. End users can then check whether their basic requirements are satisfied; researchers can compare different algorithms and see how a method can be improved, or whether a newly added feature actually improves the reliability of the whole method; and the industry can properly evaluate the risks associated with the use of a particular solution by knowing which level of reliability can be achieved by each contender. Watermarking system designers can also use such an evaluation to identify possible weak points during the early development phase of the system.

Evaluation per se is not a new problem: significant work has been done to evaluate, for instance, image compression algorithms or the security of information systems [7], and we believe that some of it may be re-used for watermarking.

In section II we will explain the scope of the evaluation we envisage.
Section III will review the types of watermarking schemes that an automated evaluation service could deal with. In section IV we will review the basic functionalities that need to be evaluated. Section V will examine how each functionality can be tested. Finally, section VI will argue the need for a third-party evaluation service and briefly sketch its architecture.
II. SCOPE OF THE EVALUATION

Watermarking algorithms are often used in larger systems designed to achieve certain goals (e.g., prevention of illegal copying). For instance, Herrigel et al. [ ] presented a system for trading images; this system uses watermarking technologies but relies heavily on cryptographic protocols. Such systems may be flawed for reasons other than watermarking itself; for instance, the protocol that uses the watermark may be wrong, or the random number generator used by the watermark embedder may not be good. In this paper we are only concerned with the evaluation of watermarking (that is, the signal processing aspects) within the larger system, not with the effectiveness of the full system in achieving its goals.

III. TARGET OF EVALUATION

The first step in the evaluation process is to clearly identify the target of evaluation, that is, the watermarking scheme (the set of algorithms required for embedding and extraction) subject to evaluation, and its purpose. The purpose of a scheme is defined by one or more objectives and an operational environment. For instance, we may wish to evaluate a watermarking scheme that allows automatic monitoring of audio tracks broadcast over radio. Typical objectives found across the watermarking and copy protection literature include:
IV. BASIC FUNCTIONALITIES

The objectives of the scheme and its operational environment dictate several immediate constraints (a set of minimal requirements) on the algorithm. In the case of automated radio monitoring, for instance, the watermark should clearly withstand distortions introduced by the radio channel. Similarly, in the case of MPEG video broadcast, the watermark detector must be fast enough to allow real-time detection and simple in terms of the number of gates required for a hardware implementation. One or more of the following general functionalities can be used:

A. Perceptibility

The hidden mark should not noticeably degrade the perceived quality of the medium.

B. Level of reliability

There are two main aspects to reliability:
C. Capacity

Knowing how much information can reliably be hidden in the signal is very important to users, especially when the scheme gives them the ability to change this amount. Knowing the watermarking-access-unit (or granularity) is also very important; indeed, spreading the mark over a full sound track prevents audio streaming, for instance.

D. Speed

As we mentioned earlier, some applications require real-time embedding and/or detection.

E. Statistical undetectability

For some private watermarking systems, that is, schemes requiring the original signal, one may wish to have a perfectly hidden watermark. In this case it should not be possible for an attacker to find any significant statistical difference between an unmarked signal and a marked signal. As a consequence, an attacker could never know whether an attack succeeded or not; otherwise he could still try something similar to the ‘oracle’ attack [ ]. Note that this option is mandatory for steganographic systems.

F. Asymmetry

Private-key watermarking algorithms require the same secret key for both embedding and extraction. This may not be good enough: if the secret key has to be embedded in every watermark detector (which may be found in any consumer electronics device or multimedia player software), malicious attackers may extract it and post it on the Internet, allowing anyone to remove the mark. In these cases the party that embeds a mark may wish to allow another party to check its presence without revealing its embedding key. This can be achieved using asymmetric techniques. Unfortunately, robust asymmetric systems are currently unknown, and the current solution (which does not fully solve the problem) is to embed two marks: a private one and a public one.

Other functionality classes may be defined, but the ones listed above seem to include most requirements used in the recent literature. The first three functionalities are strongly linked together, and the choice of any two of them imposes the third one.
In fact, when considering the three-parameter (perceptibility, capacity and reliability) watermarking model the most important parameter to keep is the imperceptibility. Then two approaches can be considered: emphasise capacity over robustness or favour robustness at the expense of low capacity. This clearly depends on the purpose of the marking scheme and this should be reflected in the way the system is evaluated.
V. EVALUATION

A full scheme is defined as a collection of functionality services to which a level of assurance is globally applied and for each of which a specific level of strength is selected. So a proper evaluation has to ensure that all the selected requirements are met to a certain level of assurance.

The number of levels of assurance cannot be justified precisely. On the one hand, it should be clear that a large number of levels makes the evaluation very complicated and unusable for particular purposes. On the other hand, too few levels prevent scheme providers from finding an evaluation close enough to their needs. We are also limited by the accuracy of the methods available for rating.

Information technology security evaluation has been using six or seven levels, for the reasons just mentioned but also for historical reasons. This seems to be a reasonable number for robustness evaluation. For perceptibility we preferred to use fewer levels and hence follow more or less the market segmentation for electronic equipment. Moreover, given the roughness of existing quality metrics, it is hard to see how one could reasonably increase the number of assurance levels.

A. Perceptibility

Perceptibility can be assessed to different levels of assurance. The problem here is very similar to the evaluation of compression algorithms. The watermark could be just slightly perceptible but not annoying, or not perceptible under domestic/consumer viewing/listening conditions. Another level is non-perceptibility in comparison with the original under studio conditions. Finally, the best assurance is obtained when the watermarked media are assessed by a panel of individuals who are asked to look or listen carefully to the media under the above conditions (Table 1). However, as stated, this cannot be automated, and one may wish to use less stringent levels.
In fact, various levels of assurance can also be achieved by using various quality measures based on human perceptual models. Since there are various models and metrics available, an average of them could be used. Current metrics do not really take into account geometric distortions, which remain a challenging attack against many watermarking schemes.

Table 1. Summary of the possible perceptibility assurance levels. These levels may seem vague, but this is the best we can achieve as long as we do not have good and satisfactory quality metrics.
B. Reliability

Although robustness and capacity are linked, in the sense that schemes with high capacity are usually easy to defeat, we believe that it is enough to evaluate them separately. Watermarking schemes are defined for a particular application, and each application only requires a certain fixed payload, so we are only concerned with the robustness of the scheme for this given payload.

1) Robustness

Robustness can be assessed by measuring the detection probability of the mark and the bit error rate for a set of criteria that are relevant to the application under consideration.

Level zero: no special robustness features have been added to the scheme apart from the ones needed to fulfil the basic constraints imposed by the purpose and operational environment of the scheme. So if we go back to the radio-monitoring example, the minimal robustness feature should make sure that the mark survives the distortions of the radio link in normal conditions.

Moderate robustness: achieved when more expensive tools are required, as well as some basic knowledge of watermarking. So if we keep the previous example, the end user would need tools such as Adobe Photoshop and would have to apply more processing to the image to disable the mark.

Moderately high robustness: tools are available, but special skills and knowledge are required and attacks may be unsuccessful. Several attempts and operations may be required, and the attacker has to work on the attack.

High robustness: all known attacks have been unsuccessful. Some research by a team of specialists is necessary. The cost of the attack may be much higher than what it is worth, and its success is uncertain.

Provable robustness: it should be computationally (or, even more stringently, theoretically) infeasible for a wilful opponent to disable the mark. This is similar to what we have in cryptography, where some algorithms are based on a difficult mathematical problem.
The first levels of robustness can be assessed automatically by applying a simple benchmark algorithm similar to [6]:

Table 2. Evaluation profile sample.
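A benchmark of this kind can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of [6]: the steps (embed a random payload, apply each attack, attempt extraction, record the bit error rate) are assumed, and the toy spread-spectrum `embed`/`extract` pair, the constant `CHIPS`, the zero-mean test signals and the noise attacks are all hypothetical stand-ins for a real scheme, real media and a real attack set.

```python
import random

CHIPS = 256  # samples carrying one payload bit (assumed value)

def embed(signal, payload, key, strength=4.0):
    """Toy spread-spectrum embedder (hypothetical stand-in for a real scheme)."""
    rng = random.Random(key)
    marked = list(signal)
    for i, bit in enumerate(payload):
        sign = 1.0 if bit else -1.0
        for j in range(i * CHIPS, (i + 1) * CHIPS):
            marked[j] += strength * sign * rng.choice((-1.0, 1.0))
    return marked

def extract(signal, n_bits, key):
    """Blind extraction: correlate each segment with the keyed carrier."""
    rng = random.Random(key)
    bits = []
    for i in range(n_bits):
        corr = sum(signal[j] * rng.choice((-1.0, 1.0))
                   for j in range(i * CHIPS, (i + 1) * CHIPS))
        bits.append(1 if corr > 0 else 0)
    return bits

def bit_error_rate(sent, received):
    return sum(a != b for a, b in zip(sent, received)) / len(sent)

def add_noise(sigma):
    """Return an attack that adds Gaussian noise of standard deviation sigma."""
    def attack(signal, rng):
        return [x + rng.gauss(0.0, sigma) for x in signal]
    return attack

def run_benchmark(media, attacks, n_bits=16, repeats=5, seed=0):
    """Average bit error rate per attack over all media and random payloads."""
    rng = random.Random(seed)
    results = {}
    for name, attack in attacks.items():
        rates = []
        for medium in media:
            for _ in range(repeats):  # fresh random payload and key each time
                payload = [rng.randint(0, 1) for _ in range(n_bits)]
                key = rng.getrandbits(32)
                distorted = attack(embed(medium, payload, key), rng)
                rates.append(bit_error_rate(payload, extract(distorted, n_bits, key)))
        results[name] = sum(rates) / len(rates)
    return results

# Three synthetic zero-mean "media" (assumption: real tests use real images or audio).
media_rng = random.Random(1)
media = [[media_rng.gauss(0.0, 10.0) for _ in range(16 * CHIPS)] for _ in range(3)]
attacks = {
    "none": lambda signal, rng: list(signal),
    "noise sigma=5": add_noise(5.0),
    "noise sigma=1000": add_noise(1000.0),
}
results = run_benchmark(media, attacks)
for name, ber in results.items():
    print(f"{name}: mean bit error rate = {ber:.3f}")
```

Sweeping the noise level instead of using two fixed values would give the data for a bit-error-rate versus attack-strength curve.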
This procedure must be repeated several times, since the hidden information is random and a test may be successful by chance. Levels of robustness differ by the number and strength of the attacks applied and the number of media they are measured on. The set of tests and media will also depend on the purpose of the watermarking scheme and is defined in evaluation profiles. An evaluation profile sample is given in Table 2. For instance, schemes used in medical systems need only be tested on medical images, while watermarking algorithms for owner identification have to be tested on a large panel of images. The first levels of robustness can be defined using a finite and precise set of robustness criteria (e.g., S.D.M.I., IFPI or E.B.U. requirements), and one just needs to check them.

2) False positives

False positives are difficult to measure, and current solutions use a model to estimate their rate. This has two major problems: first, ‘real world’ watermarking schemes are difficult to model accurately; secondly, modelling the scheme requires access to details of the algorithm. Despite the fact that not publishing algorithms breaches Kerckhoffs’ principles [ ], details of algorithms are still considered trade secrets, and getting access to them is not always possible. So one (naïve) way to estimate the false alarm rate is to count the number of false alarms over a large sample of data. This may turn out to be another very difficult problem, as some applications require 1 error in 10^8 or even 10^12.

C. Capacity

In most applications the capacity will be a fixed constraint of the system, so robustness tests will be done with a random payload of given size. While developing a watermarking scheme, however, knowing the trade-off between the basic requirements is very useful, and graphs with two varying requirements, the others being fixed, are a simple way to achieve this.
In the basic three-parameter watermarking model, for instance, one can study the relation between robustness and strength of the attack when the quality of the watermarked medium is fixed, between the strength of the attack and the visual quality, or between the robustness and the visual quality [6]. The first one is probably the most important graph. For a given attack and a given visual quality, it shows the bit error rate as a function of the strength of the attack. The second one shows the maximum attack that the watermarking algorithm can tolerate. This is useful from a user's point of view: the performance is fixed (we want only 5% of the bits to be corrupted, so we can use error correction codes to recover all the information we wanted to hide), and so it helps to define what kind of attacks the scheme will survive if the user accepts a given quality degradation.

D. Speed

Speed is very dependent on the type of implementation: software or hardware. In the automated evaluation service we propose in the next section, we are not concerned with hardware implementations. For these, complexity is an important criterion, and some applications impose a limit on the maximum number of gates that can be used, the amount of required memory, etc. [15]. For a software implementation, speed also depends very much on the hardware used to run it, but comparing performance results obtained on the same platform (usually the typical platform of end users) provides a reliable measure.

E. Statistical undetectability

All methods of steganography and watermarking substitute part of the cover signal, which has some particular statistical properties, with another signal with different statistical properties; in fact, embedding processes usually do not pay attention to the difference in statistical properties between the original cover-signal and the stego-signal. This leads to possible detection attacks [16].
As for false positives, evaluating such functionality is not trivial, but fortunately very few watermarking schemes require it, so we will not consider it in the next section.
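To give a sense of why the naïve counting approach to false alarms in section V.B.2 is so costly, the following sketch computes how many unmarked samples must be tested without a single false alarm before a given rate can be claimed. It assumes independent trials and a one-sided binomial bound, which are assumptions of this illustration rather than part of the evaluation procedure; the function names are hypothetical.

```python
import math

def trials_needed(target_rate, confidence=0.95):
    """Error-free trials needed so that the upper confidence bound on the
    false alarm rate is at most target_rate.

    With zero false alarms in n independent trials, the bound p satisfies
    (1 - p)**n = 1 - confidence.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-target_rate))

def upper_bound(n_trials, confidence=0.95):
    """Upper confidence bound on the rate after n_trials with no false alarm."""
    return -math.expm1(math.log(1.0 - confidence) / n_trials)

for rate in (1e-8, 1e-12):
    n = trials_needed(rate)
    print(f"target rate {rate:g}: about {n:.3g} error-free detections needed")
```

Under these assumptions, roughly three divided by the target rate detections are required at 95% confidence, so a target of 1 error in 10^12 is out of reach of direct counting on any realistic media collection.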
VI. METHODOLOGY – NEED FOR THIRD PARTY

To gain trust in the reliability of a watermarking scheme, its qualities must be rated. This can be done by:
Only the third option provides an objective solution to the problem, but the general acceptance of the evaluation methodology implies that the evaluation itself is as transparent as possible. This was the aim of StirMark, and this remains the aim of the project to build a next generation of StirMark Benchmark. This is why the source code and methodology must be public, so that one can reproduce the results easily.

A question one may ask is: does the watermarking system manufacturer need to submit any program at all, or can everything be done remotely using some interactive proof? Indeed, watermarking system developers are not always willing to give out software or code for evaluation, or company policy on intellectual property prevents them from doing this quickly. Unfortunately, there is no protocol by which an outsider can evaluate such systems using a modified version of the above robustness testing procedure.

One could imagine that the verifier sends an image I to be watermarked to the prover. After receiving the marked image Ĩ, the verifier would apply a transformation f to the image and send either J := f(I) or J := f(Ĩ) to the prover, who would just say ‘I can detect the mark’ or ‘I cannot detect the mark’. The verifier would always accept a ‘no’ answer, but a ‘yes’ answer only with a certain probability. After several iterations of the protocol the verifier would be convinced. Unfortunately, in this case most f are invertible or almost invertible; even if f is a random geometric distortion, such as the one implemented in StirMark, it can be inverted using the original image. So the prover can always approximate f⁻¹ by comparing J to I and Ĩ, try to detect the mark in f⁻¹(J), and so always cheat. The conclusion is that the verifier must have at least a copy of the detection or extraction software.
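The cheating strategy can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not a model of StirMark: the ‘image’ is a one-dimensional list of integers, the transformation f is a secret cyclic shift, the mark is a ±1 pattern, and the prover's detector is deliberately non-robust (it only works on a signal aligned with the original). All names are hypothetical.

```python
import random

def shift(signal, k):
    """Cyclic shift by k samples: a toy stand-in for the transformation f."""
    k %= len(signal)
    return signal[-k:] + signal[:-k]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def naive_detect(signal, original, pattern):
    """Non-robust detector: only reliable when signal is aligned with original."""
    corr = sum((s - o) * p for s, o, p in zip(signal, original, pattern))
    return corr > 0.5 * len(pattern)

def cheating_prover(received, original, marked, pattern):
    """Estimate f by aligning J against I and the marked image, undo it, then detect."""
    best_k = min(
        range(len(received)),
        key=lambda k: min(distance(received, shift(original, k)),
                          distance(received, shift(marked, k))),
    )
    return naive_detect(shift(received, -best_k), original, pattern)

rng = random.Random(42)
n = 64
original = [rng.randrange(256) for _ in range(n)]   # image I
pattern = [rng.choice((-1, 1)) for _ in range(n)]   # secret mark
marked = [o + p for o, p in zip(original, pattern)]  # marked image

rounds, correct = 50, 0
for _ in range(rounds):
    send_marked = rng.random() < 0.5        # verifier's secret choice
    k = rng.randrange(1, n)                 # verifier's secret transformation
    j = shift(marked if send_marked else original, k)
    if cheating_prover(j, original, marked, pattern) == send_marked:
        correct += 1

cheat_success_rate = correct / rounds
print(f"cheating prover answered correctly in {correct}/{rounds} rounds")
```

In this toy setting the prover answers every round correctly even though its detector has no robustness to shifts at all, which is why the protocol tells the verifier nothing about the scheme itself.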
So we propose, as a first step towards a widely accepted way to evaluate watermarking schemes, to implement an automated benchmark server for digital watermarking schemes. The idea is to allow users to send a binary library of their scheme to the server, which in turn runs a series of tests on this library and keeps the results in a database accessible to the scheme owner and/or to all ‘watermarkers’. One may consider this service as the next generation of the StirMark benchmark: fully automated evaluation with real-time access to data.

In order to be widely accepted, this service must have a simple interface with existing watermarking libraries; in the implementation we propose, we have exported only three functions (scheme information, embedding and detection). The service must also, as we described earlier, take into account the application of the watermarking scheme by proposing different evaluation profiles (tests and sets of media samples) and strengths; this will be achieved by the use of different evaluation profile configuration files. The service must be easy to use:
Finally, all evaluation procedures, profiles and code must be publicly available. Although our current implementation only supports image-watermarking schemes, the general architecture we have chosen will allow us to support other media in the near future.
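The three-function interface and the profile-driven evaluation described above might look as follows. This is a hedged sketch only: the actual service works with binary libraries, and the method names, the profile file layout, the test names and the toy least-significant-bit scheme are all assumptions made for illustration.

```python
import configparser
from typing import Protocol, Sequence

class WatermarkingScheme(Protocol):
    """Hypothetical shape of the three exported functions."""
    def info(self) -> dict: ...
    def embed(self, medium: Sequence[int], payload: Sequence[int], key: int) -> list: ...
    def detect(self, medium: Sequence[int], key: int) -> list: ...

class LsbScheme:
    """Toy scheme: payload stored in the least significant bits (key unused)."""
    def info(self):
        return {"name": "toy-lsb", "media": "image", "payload_bits": 8}

    def embed(self, medium, payload, key):
        marked = list(medium)
        for i, bit in enumerate(payload):
            marked[i] = (marked[i] & ~1) | bit
        return marked

    def detect(self, medium, key):
        return [medium[i] & 1 for i in range(self.info()["payload_bits"])]

# Hypothetical test set; a real profile would name real attacks.
TESTS = {
    "identity": lambda m: list(m),
    "brighten": lambda m: [min(255, x + 2) for x in m],
    "requantise": lambda m: [(x // 4) * 4 for x in m],
}

# Hypothetical evaluation profile configuration file.
PROFILE_TEXT = """
[profile]
name = owner-identification
tests = identity, brighten, requantise
"""

def evaluate(scheme, profile_text, media, payload, key=1234):
    """Run every test named in the profile; report whether the payload survives."""
    config = configparser.ConfigParser()
    config.read_string(profile_text)
    names = [t.strip() for t in config["profile"]["tests"].split(",")]
    report = {}
    for name in names:
        transform = TESTS[name]
        report[name] = all(
            scheme.detect(transform(scheme.embed(m, payload, key)), key) == list(payload)
            for m in media
        )
    return report

media = [[(37 * i + 11 * n) % 200 for i in range(64)] for n in range(3)]
report = evaluate(LsbScheme(), PROFILE_TEXT, media, payload=[1, 0, 1, 1, 0, 0, 1, 0])
print(report)
```

Keeping the tests and media sets in a profile file rather than in code means a new application only needs a new profile, not a new benchmark.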
VII. CONCLUSIONS AND FUTURE WORK

In this paper we have used a duality approach to the watermarking evaluation problem by splitting the evaluation criteria into two (independent) groups: functionality and assurance. The first group represents a set of requirements that can be verified using an agreed series of tests; the second is a set of levels to which each functionality is evaluated. These levels go from zero or low to very high. We are investigating how evaluation profiles can be defined for different applications and how importance sampling techniques could be used to evaluate the false alarm rate in an automated way. Hopefully this new generation of watermarking testing tool (in the continuation of the StirMark benchmark [17]) will be very useful to the watermarking community!

VIII. REFERENCES