Lincoln08 tec blog

Applications and the Rise of the Robots

Written By: admin - Oct• 29•11

 

Today CAPTCHAs are widely used in Internet environments to prevent resource abuse by bots, although a few other marginal applications exist, which will be described later in Section 2.4.2. Usually, Web sites are designed and intended for human use. According to Basso and Sicco [14], Web robots, or bots for short, can be defined as computer programs that run automated tasks over the Internet without the need for human interaction. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. From the point of view of the Web server, it is impossible to tell from the request alone whether it originated from a human user or from a bot: the HTTP (Hypertext Transfer Protocol) requests look exactly the same. However, a bot can repeatedly perform Web-related activities that were conceived and created as prerogatives of human beings, much more rapidly than a human user. As will be seen later in Section 6, these differences in behavior can become alternative ways to distinguish between human and machine users over the Internet.

Given that it is impossible to distinguish human users from machine users based solely on the HTTP protocol, CAPTCHAs provide a security barrier by posing a puzzle that human users can pass but machine users cannot. The CAPTCHA must be solved before the user can proceed. It works as the gatekeeper to the Web resource coveted by the attacker.

In an effort to defeat all attempts to stop the proliferation of bots, automated tools are evolving into more complex and sophisticated programs, which possess ever-increasing intelligence and can reproduce human actions with a high degree of fidelity.

The actions of bots can be driven by legitimate purposes or by malicious plans. Therefore, robots can accomplish two opposite goals [14]:

- Help human beings in carrying out repetitive and time-consuming operations.

- Undertake hostile or illegal activities, becoming a serious threat to Web application security.

Legitimate Purposes of Robots

Currently, there are several situations in which using automated tools is mandatory, due to the large amount of data to process. Some examples are:

- Web spidering or crawling [53]. A Web crawler is a computer program that browses the World Wide Web in a methodical, automated, and orderly manner. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Legitimate Web bots identify themselves through the User-agent field of the HTTP request when they make a request to a Web server; for instance, Yahoo!’s Web crawler Slurp is identified with the following string: Mozilla/5.0 (compatible; Yahoo! Slurp; hxx://help.yahoo.com/help/us/ysearch/slurp). Legitimate Web spiders usually respect the resources of Web servers according to the robots exclusion protocol, also known as the robots.txt protocol [93], that is, a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers.

- Web site mirroring [63]. For instance, the Internet Archive is a nonprofit organization that was founded to build an Internet library. Its purposes include offering permanent access for researchers, historians, scholars, people with disabilities, and the general public to historical collections that exist in digital format. The Internet Archive includes texts, audio, moving images, and software as well as archived Web pages in its collections, and it features a crawler named Heritrix, identified by the user agent field archive.org_bot.

- Vulnerability assessment [49]. This is the process of performing a security review of a Web application by searching for design flaws, vulnerabilities, and inherent weaknesses. It can be automated by using software that retrieves Web site pages and builds specific requests to find unvalidated inputs, improper error handling, cross-site scripting, etc. An example of an automated Web site vulnerability assessment tool is WhiteHat Sentinel [136].

- Chat and instant messaging system management. For instance, an Internet relay chat (IRC) bot is a set of scripts or an independent program that connects to IRC as a client to perform automated functions, which include preventing malicious users from taking over the channel, logging what happens in an IRC channel, giving out information on demand, creating statistics, hosting trivia games, etc.
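As a rough illustration of the robots exclusion protocol mentioned in the crawling item above, the following Python sketch shows how a well-behaved crawler might consult robots.txt rules before fetching a page. It uses the standard library module urllib.robotparser. The rules, the host example.com, and the bot name ExampleBot are made up for this example, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site's /robots.txt URL instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("http://example.com/index.html",
                "http://example.com/private/report.html"):
        verdict = "allowed" if may_fetch(rules, "ExampleBot", url) else "disallowed"
        print(url, "->", verdict)
```

Compliance is voluntary: the file only tells cooperative crawlers what to avoid, and a malicious bot can simply ignore it.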

 

Unfortunately, the uses above are not always legitimate. For instance, Web crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (most often for spamming). These bots identify themselves as legitimate users or as search engine bots to disguise their ultimate goal. Another example is malicious IRC bots [19], designed for the purpose of infecting other users with viruses, sending spam, or controlling botnets for spamming and Denial of Service attacks.

There are also some activities on the fringes of legality. For instance, gaming bots can be used for fair purposes (e.g., as competitors or collaborators in a game, featuring weak AI functions), or for unfair purposes (like those that help the user collect resources, increase the experience of the player’s avatar, etc.). As another example, automated trading systems help stock brokers to, for example, make sales or purchases under certain conditions. However, they can also be used for artificially manipulating stock prices.

Implementation and Deployment

Written By: admin - Oct• 29•11

 

When Web designers want to install a CAPTCHA to protect a Web form, they have three main possibilities:

- Programming their own CAPTCHA. This is the most costly approach, as the designer has to program all the software (probably reusing existing graphic and voice libraries), ensure that the challenge has all the desirable properties of a CAPTCHA, and test it intensively.

- Installing available CAPTCHA software, or activating the one provided by the CMS he or she is using. Many commercial and open source CMSs currently either include their own CAPTCHAs or allow them to be added as plugins. For instance, the open source Java-based OpenCMS [2] includes CAPTCHAs as a part of the system; the open source systems Drupal (PHP-based) [39] and Plone (Python-based) [90] accept CAPTCHAs as plugins; the proprietary IBM WebSphere server that powers the Lotus Web Content Management solution [62] also allows CAPTCHA plugins, etc. Plugins and software packages include Securimage [89], WebSpamProtect [134], and WP Captcha-Free [138].

- Subscribing to an external CAPTCHA service and integrating it in the CMS/Web server [116]. The generation and validation of the CAPTCHA are done on a separate server (the service provider’s), and small pieces of code must be included in the Web site to be protected. Example services include reCAPTCHA [47], captchas.net [20], and the WebSpamProtect service cited above.

Implementing a CAPTCHA from scratch is not a trivial task. First of all, it is necessary to implement all the steps of the Generalized Classical CAPTCHA algorithm [76]:

 1. Computer generates a test instance.

2. Test is shown to the human/bot.

3. Human/bot attempts to solve the test.

4. Human/bot reports supposed solution to the computer.

5. Computer evaluates the submitted solution.

6. Computer reports the result of evaluation to the human/bot and allows  or blocks access to a resource based on the result.
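The six steps above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions: the function names generate_challenge and evaluate, the in-memory store, and the six-character alphanumeric solution are all hypothetical choices, and the hard part of step 2, rendering the text as a distorted image that resists OCR, is left out entirely.

```python
import secrets
import string
from typing import Dict, Tuple

# Pending challenges: token -> expected solution. A real deployment would
# need expiry and shared storage; a dict is enough for a sketch.
_pending: Dict[str, str] = {}

ALPHABET = string.ascii_uppercase + string.digits


def generate_challenge(length: int = 6) -> Tuple[str, str]:
    """Step 1: create a test instance and return (token, text to render)."""
    solution = "".join(secrets.choice(ALPHABET) for _ in range(length))
    token = secrets.token_urlsafe(16)
    _pending[token] = solution
    return token, solution


def evaluate(token: str, answer: str) -> bool:
    """Steps 5 and 6: grade the reported answer and decide on access."""
    # pop() makes every challenge single-use, so neither a solved nor a
    # failed instance can be replayed later.
    expected = _pending.pop(token, None)
    if expected is None:
        return False
    submitted = answer.strip().upper()
    return secrets.compare_digest(expected.encode(), submitted.encode())


if __name__ == "__main__":
    # Steps 2 to 4 (show, solve, report) happen on the client side;
    # here the "user" simply echoes the text back.
    token, shown_text = generate_challenge()
    print("Challenge text (would be rendered as a distorted image):", shown_text)
    print("Correct answer accepted:", evaluate(token, shown_text))
    print("Replay of the same token accepted:", evaluate(token, shown_text))
```

Discarding the challenge after a single evaluation is one simple way to keep the evaluation step from becoming a weak point, since the same solved instance cannot be submitted twice.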

For the generation of the test instance, all the desirable properties of a CAPTCHA must be considered. Libraries and Application Programming Interfaces that help a programmer to write the test generation are very scarce. Moreover, while most programmers focus on the test generation, careless implementation of the remaining steps makes the programmed CAPTCHA easy to defeat, leaving it vulnerable to side-channel attacks. If the user fails a CAPTCHA, the processing must be performed carefully to keep all the data he or she has already entered in the Web form, as it is very annoying for users to type these data again and again.

The design and implementation of CAPTCHAs by nonexperts is typically weak, because they are not aware of the current methods of CAPTCHA solving or of the typical flaws present in CAPTCHA design. Instead of inventing a new CAPTCHA, it is a better security decision to use one whose robustness has already been tested. Installing existing CAPTCHA software comes with the same considerations as installing any packaged security control, like a firewall or antivirus. To be effective, the software makers have to update the software frequently to patch vulnerabilities in previous versions and to combat cracks.

The integration of an existing and tested CAPTCHA service in a Web server can be a relatively simple task. For instance, reCAPTCHA offers an e-mail address protection service named Mailhide, intended to hide an e-mail address in a Web page until the user has demonstrated that he or she is a human being. Integrating the Mailhide service involves including a few lines of simple HTML/JavaScript code in the Web page, as shown in Fig. 4. The code shown includes a hyperlink (<a>) for a part of the e-mail address to hide (...), which leads to the execution of a script when the user clicks on it (onClick). The script opens a window (window.open()) with a number of parameters (the address at which the CAPTCHA is generated, the parameters of the CAPTCHA view, the size and appearance of the window, etc.), in which the CAPTCHA is shown; once it is solved, the complete e-mail address is displayed.

General Description of CAPTCHAs

Written By: admin - Oct• 29•11

 

The general goal of a CAPTCHA is to defend an application against the automated actions of undesirable and malicious bot programs on the Internet. To achieve this goal, CAPTCHAs are programs designed to generate and grade tests that most humans can pass, but current computer programs cannot. A CAPTCHA works as a simple two-round authentication protocol as follows [141]:

S(ervice) → C(lient): a CAPTCHA challenge

C → S: response

 

 

Such challenges are based on hard, open AI problems [129]. It is important to remark that in this context “hard” is defined in terms of the consensus of a community: an AI problem is said to be hard if the people working on it agree that it is hard [129]. This notion is very similar to that of computational complexity theory, in which a problem is regarded as inherently difficult if solving it requires a large amount of resources, whatever the algorithm used. The security of most public key cryptosystems is likewise based on assumptions agreed upon by the community, such as the existence of one-way functions. For instance, RSA’s security relies on the assumption that 1024-bit integers are infeasible to factor with today’s available computing resources and number theory advances, although it has not yet been proven that one-way functions, that is, functions that no efficient algorithm can invert, exist at all.

 

For all practical purposes, it is assumed that the adversary creating the malicious bot cannot solve the underlying AI problem proposed by the CAPTCHA with higher accuracy than what is currently known to the AI community. Given that at present there is no way to prove that a program cannot pass a test which a human can pass, all CAPTCHAs can do is to present evidence that with the current state of the art in AI it is hard to write a program that can pass the test, just as it happens with many cryptographic primitives.

 

Such hard AI problems are hypothesized to include [100]:

- Computer vision and subproblems such as object recognition. In fact, these problems are used in many CAPTCHAs, as will be covered in Sections 3.1 and 3.2.

- Natural language understanding and subproblems such as text mining, machine translation, and word sense disambiguation. This type of problem is also used in many CAPTCHAs, as will be explained in Section 3.4.

 

A CAPTCHA is not just an isolated hard AI problem to be solved. In real-world scenarios, CAPTCHAs are deployed in Internet environments subject to heavy load from thousands or even millions of users, and most likely under attack by hackers when the protected resource is appealing enough. As a consequence, choosing an interesting AI problem is not enough to make a good CAPTCHA. A usable CAPTCHA for security purposes should satisfy most, if not all, of the following properties:

 

1. Efficient: It should be taken quickly and easily by human users. While it would be ideal to have 100% recognition accuracy for human users, it is safe to assume that human users will tolerate a CAPTCHA that is somewhat less easy. For instance, a human user might not be overly concerned with having to retry a CAPTCHA once in 10 attempts. This fact can be exploited while designing CAPTCHAs: some human recognition performance can be deliberately sacrificed if it degrades machine performance by a considerable amount. However, it must have very few false negatives (a false negative takes place when a legitimate human user is flagged as a bot). It should reject very few human users.

2. Usable: It should accept all human users with high reliability, without any discrimination based on mental or physical disabilities, race, language, gender, or age. This property might require the combined use of different AI problems.

3. Economical: It is also desirable that its generation consumes a small amount of network and computational resources in the client and that the area it takes to present it to the user is small, to allow for hand-held devices.

4. Secure: Virtually no machine should be able to solve it. The only way to pass the CAPTCHA should be to solve the underlying AI problem. A secure implementation should avoid side-channel attacks, described later in Section 5.1.4. It should have very few false positives (a false positive takes place when a bot is flagged as a legitimate human user). It should reject virtually all machine users.

5. Public: It should be difficult to write a computer program that can solve the AI problem posed by a CAPTCHA even if its algorithm and its data are known: the only hidden information is a small amount of randomness used to generate the tests. In contrast to “security through obscurity,” this transparency is known in cryptography as Kerckhoffs’ principle: a cryptosystem should be secure even if everything about the system, except the key, is public knowledge [81]. Another argument for disclosure is that, like cryptographic systems, CAPTCHAs benefit from peer review, which is usually successful at identifying weaknesses. This public scrutiny also allows researchers to compete with each other in an attempt to find CAPTCHAs with increasing levels of security, thus making the field advance.

6. Robust: The test should resist automatic attack for many years even as technology advances. Depending on the actual implementation, changing a CAPTCHA might be a costly and time-consuming process, leaving the application unprotected in the meantime. Such upgrades should happen rarely, if at all.

7. Automated: It must be generated and evaluated automatically. Obviously, a CAPTCHA that requires human supervision to evaluate answers would be impractical for large-scale deployment. A computer program should be able both to generate the challenges and to mark the answers as qualifying or not.

8. Random: The random piece of information used to create the CAPTCHA should be generated in a truly random, unpredictable way. Otherwise, an adversary might attack the sequence generator instead of solving the AI puzzle.

9. Large space: It should be immune to brute force guessing attacks. This means that the CAPTCHA solution must occupy a space large enough so that both simple dictionary attacks and exhaustive search attacks become impractical. Good CAPTCHAs rely on a completely random system of generation based on creating random word images or choosing files from a database consisting of many names, images, and other files. This database should be large enough to deter efforts to mine it all. The database used to create the CAPTCHA should not contain the solutions, because hackers could break into the database and obtain the solutions to the tests.

10. Human cost: A CAPTCHA is successful if the cost of answering challenges with a machine is higher than the cost of soliciting humans to perform the same task.

Throughout their history, a number of CAPTCHAs have been proposed that do not meet some of these requirements and, depending on their popularity, they have been broken (automatically solved) quite often. Moreover, the security requirement, that is, the premise in CAPTCHA design that building an automated system that solves a CAPTCHA should amount to a fundamental step toward solving a relevant AI problem, has been violated in practice, with no relevant contribution to AI [77]. This is because, for the most part, the challenges in question are largely artificial, having little basis in a real-world AI problem.
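The “Random” and “Large space” properties listed above can be made concrete with a small Python sketch. It assumes a text CAPTCHA whose solution is drawn uniformly from a 36-symbol alphabet (an assumption made for this example), uses the standard secrets module as the unpredictable source, and models an attacker who only guesses blindly and receives a fresh challenge after every failure. The function names are hypothetical, and the model says nothing about attacks that actually solve the underlying AI problem.

```python
import secrets
import string

# Assumed alphabet for this example: 26 lowercase letters plus 10 digits.
ALPHABET = string.ascii_lowercase + string.digits


def random_solution(length: int) -> str:
    """Draw a solution from the OS-backed generator behind the secrets module,
    rather than from a predictable pseudo-random sequence."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def solution_space(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of equally likely solutions of the given length."""
    return alphabet_size ** length


def blind_guess_success(length: int, attempts: int) -> float:
    """Chance that at least one of `attempts` independent blind guesses
    succeeds, each one made against a freshly generated challenge."""
    miss = 1.0 - 1.0 / solution_space(length)
    return 1.0 - miss ** attempts


if __name__ == "__main__":
    for length in (2, 4, 6):
        print(length, random_solution(length), solution_space(length),
              blind_guess_success(length, attempts=1000))
```

Under these assumptions, each extra character multiplies the space by 36, which is why very short solutions or small answer databases are the first thing a brute force attacker looks for.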

captchas part 2

Written By: admin - Oct• 29•11

 

1. It is easy to generate many instances of the problem, together with their unambiguous solution, without requiring human intervention at all.

2. Humans can solve a given instance effortlessly with very few errors. Providing the answer should also be easy.

3. The best known programs for solving such problems fail on a nonnegligible fraction of the problems, even if the method of generating instances is known. The number of instances in a challenge will depend on this fraction.

4. An instance specification is succinct both in the amount of communication needed to describe it and in the area it takes to present it to the user.

Naor suggested a number of areas from vision and natural language processing as possible candidates for such problems: gender recognition, facial expression understanding, finding body parts, deciding nudity, naive drawing understanding, handwriting understanding, speech recognition, filling in words, and disambiguation. As will be seen later in this chapter, many of his suggestions have been applied over the years to develop automated Turing Tests. Luis von Ahn et al. would later formalize and substantiate Naor’s conceptual model in what would be known as CAPTCHA [129].
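Property 3 links the number of instances in a challenge to the fraction of instances a program can solve. As a hedged illustration, assume that each instance is solved independently, with probability p by the best program and h by a human; a challenge made of k instances is then passed with probability p^k and h^k, respectively. The helper names and the numeric values below are assumptions made for this example, not part of Naor’s proposal.

```python
def pass_probability(per_instance: float, instances: int) -> float:
    """Probability of solving all `instances` independent instances."""
    return per_instance ** instances


def instances_needed(program_rate: float, target: float) -> int:
    """Smallest k such that program_rate ** k is at or below `target`."""
    if not 0.0 < program_rate < 1.0:
        raise ValueError("program_rate must lie strictly between 0 and 1")
    if target <= 0.0:
        raise ValueError("target must be positive")
    k = 1
    while program_rate ** k > target:
        k += 1
    return k


if __name__ == "__main__":
    p, h = 0.1, 0.95  # assumed per-instance success rates (program, human)
    for k in (1, 2, 3):
        print(k, pass_probability(p, k), pass_probability(h, k))
    print("instances needed for p=0.5, target=0.1:", instances_needed(0.5, 0.1))
```

Repetition also costs legitimate users something: with an assumed human rate of 0.95 per instance, passing two instances in a row succeeds about 90% of the time, so the number of instances has to balance both effects.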

 

Major differences between the Turing test and Naor’s proposal include:

- In a Naor test, the judge is also a computer, and the goal is not to verify that the other party in the communication is a computer as proficient as a person, but to confirm that he or she is actually a person. That is the reason why these tests are often called reverse Turing tests.

- In the Turing test, the communication is conversational. However, Naor’s proposal involves a variety of sensory inputs.

- In the Turing test, the conversation lasts until the player is able to make a (possibly wrong) decision. However, in these reverse Turing tests, the player has only one chance (posing a challenge to the user and screening the answer) to make the decision. The challenge must be difficult enough to ensure that only a person can solve it, but not so difficult that an average human user is unable to solve it.

Rather than asking the prover to solve the problem once, he or she can be asked to solve the problem twice. If the prover gets good at solving the problem twice, he or she can be asked to solve the problem three times, etc. If, for example, the best computer program has a success probability of 0.1 against a given problem, then when it is asked to pass the problem twice in a row, its success probability will be reduced to 0.01, etc.

The first known application of reverse Turing tests (named CAPTCHAs from now on) was developed by a technical team at the search engine AltaVista. In 1997, AltaVista sought ways to block or discourage the automatic submission of URLs to their search engine. This free “add-URL” service was important to AltaVista since it broadened its search coverage. Yet some users were abusing the service by automating the submission of large numbers of URLs, in an effort to skew AltaVista’s importance ranking algorithms. Andrei Broder, Chief Scientist of AltaVista, and his colleagues developed a filter. Their method was to generate an image of printed text randomly so that machine vision (optical character recognition, OCR) systems cannot read it but humans still can. In January 2002, Broder stated that the system had been in use for “over a year” and had reduced the number of “spam add-URL” by “over 95%.” A U.S. patent was issued in April 2001.

In September 2000, Udi Manber of Yahoo! described the “chat room problem” to researchers at Carnegie Mellon University (CMU): “bots” were joining online chat rooms and irritating the people there by pointing them to advertising sites. How could all bots be refused entry to chat rooms? CMU’s Prof. Manuel Blum, Luis A. von Ahn, and John Langford developed a (hard) GIMPY CAPTCHA, which picked English words at random and rendered them as images of printed text under a wide variety of shape deformations and image occlusions, the word images often overlapping. The user was asked to transcribe some number of the words correctly. A simplified version of GIMPY (EZ-GIMPY), using only one word-image at a time, was installed by Yahoo! and used in their chat rooms to restrict access to human users only. Several sample GIMPY CAPTCHA images are shown in Fig. 1.

Other researchers were stimulated by this particular problem, and the Principal Scientist of Palo Alto Research Center, Henry Baird, led the development of PessimalPrint, a CAPTCHA that uses a model of document image degradations that approximates ten aspects of the physics of machine-printing and imaging of text. This model included spatial sampling rate and error, affine spatial deformations, jitter, speckle, blurring, thresholding, and symbol size. Their paper [32] was the first refereed technical publication on CAPTCHAs.

These works and many others discussed in this chapter have been widely used by the World Wide Web-related industry, from search engines like Yahoo! or Google, to Web e-mail providers (Gmail, Hotmail, etc.), and ultimately, nearly all major social networking sites like Facebook or MySpace, to prevent automated subscription, posting, and in general, automated misuse of services intended for human users. CAPTCHAs have evolved to become a commodity, as they are integrated as plugins in Content Management Systems (CMSs), for example, Drupal, or even offered as Web Services. A major instance of a reverse Turing test provided as a Web Service is reCAPTCHA [47] (shown in Fig. 2), started by the previously mentioned researcher Luis von Ahn. This service can be integrated in any Web page, typically to protect a Web form, and it has become very popular, serving more than 30 million instances of the challenge every day. This service is now provided by Google.

Moreover, CAPTCHAs are so popular that companies have emerged to take advantage of them, by using them as a platform for serving ads to consumers [117], as shown in Fig. 3. A very interesting feature of CAPTCHA ads is that it is the user who types the brand message into the text field provided, so he or she effectively reads and even learns the message.

On the other hand, the popularization of CAPTCHA technologies has prompted a number of (quite often successful) attempts to circumvent these kinds of challenges. Attacks include automated bots able to understand the CAPTCHA itself, attacks on the programming structure of the form to avoid solving the CAPTCHA, and crowdsourcing.

 

AI and Captchas

Written By: admin - Oct• 29•11

 

For more than 60 years, the purpose of Artificial Intelligence (AI) has been to design and build a machine able to think like a human being. The term AI was coined by John McCarthy in the summer of 1956, during a historic meeting held at Dartmouth College by him and other researchers who would lead the field for decades. While the first works in AI were surprising, and some of these and other researchers believed this problem would be solved in 20 years, it was not. Evaluating how far we are today from meeting this ambitious goal is out of the scope of this chapter, but no one can deny that this field has provided hundreds of ideas that have been at the heart of Computer Science, and have often driven it. For instance, Semantic Networks emerged as a mechanism to formalize the semantics of Natural Languages to enable Machine Translation and, ultimately, to make computers understand and express themselves through human languages. While machines still do not have this ability, Semantic Networks are a representation language that has, for instance, evolved into languages used to describe software models, like class diagrams in Object Oriented Programming. Semantic Networks are also the underlying mechanism of the Semantic Web and several other technologies. Computer Science is full of by-products of AI, and Information Technologies (IT) Security is no exception: modern spam filters include Machine Learning and Natural Language Processing techniques; Machine Learning is also used in antivirus research, intrusion detection, etc.; and there are many other examples.

In 1950, Alan Turing, one of the fathers of AI, faced the question: “Can machines think?” He conceived a test to evaluate it, the Turing Test. Theoretically, using this test it is possible to decide if a machine has reached human levels of communication and reasoning. Current computers are still not able to pass that test, but as with Semantic Networks, other applications of this test have emerged.

Perhaps the most prominent application of the Turing Test to the field of Information Security is the concept of CAPTCHA. This word, which stands for Completely Automated Public Turing test to tell Computers and Humans Apart, refers to an automated test aimed at confirming that the user of an application is a human being instead of a robot, that is, a program which would probably misuse a service or resource. The concept, often also called reverse Turing test or Human Interactive Proof, has emerged in the context of Web security. This can be explained by the huge and ever-increasing popularity of Web-based services that range from low-level ones like online data and information storage, to more complex and extremely popular ones like Web-based e-mail, word processors, image processors, etc., and ultimately, Social Networks as the ultimate communication platform for hundreds of millions of users. However, the concept can be applied to any other kind of IT service or tool.

1.1 The Turing Test and the Origin of CAPTCHAs

The Turing Test was conceived by Alan Turing and presented in a rather philosophical paper [124] discussing the properties of a machine able to think like a human being. The test is based on the “imitation game,” described as follows: given a man and a woman, an interrogator (the player or judge) has to guess who is who by addressing questions to them. The goal of one of them is to make the interrogator fail, while the other has to help the player to make a correct guess. The experiment or game is carried out using typewritten text, so that voices cannot help the player. Questions asked by the interrogator may include, for example, “Will you please tell me the length of your hair?” and the answers may be absolutely correct, partly false, etc. To answer the question “Can machines think?” Turing proposed replacing the man or the woman with a computer. If, after the game, the player makes a wrong guess, then we can deduce that the machine has reached human levels of performance at communication and intelligence.

This test has been considered a real “think proof” for many years, though it is not exempt from criticism, as human intelligence is still hard to define (e.g., Refs. [99,101]). Moreover, no machine has yet passed the test, although a public contest, the Loebner Prize in AI [75], has been held yearly since 1991, and computers and programs have greatly evolved since Turing posed his question. However, a number of systems have been inspired by this test, Eliza [135] being the most popular one. Weizenbaum’s Eliza is a relatively simple pattern-matching program able to simulate the conversation of a therapist or psychoanalyst. For instance, given a sentence by a person: “I believe I do not love my mother,” the program might answer: “Please tell me more about your mother,” based on a rule that fires when the person writes the word “mother.” Conversational programs have evolved into commercial products that are used, for example, for Customer Relationship Management [128].
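The kind of keyword rule described above can be sketched with a couple of regular expressions. This is not Weizenbaum’s actual script: the rule list, the default reply, and the function name reply are invented for illustration, and the sketch only aims to show how a rule “fires” when a word such as “mother” appears.

```python
import re

# Hypothetical rules in the spirit of Eliza: (pattern, reply template).
# The first rule whose pattern matches the input is the one that fires.
RULES = [
    (re.compile(r"\b(mother|father|sister|brother)\b", re.IGNORECASE),
     "Please tell me more about your {0}."),
    (re.compile(r"\bI feel (.+?)[.!?]*$", re.IGNORECASE),
     "Why do you feel {0}?"),
]

DEFAULT_REPLY = "Please go on."


def reply(sentence: str) -> str:
    """Return the response of the first matching rule, or a default one."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            return template.format(match.group(1).lower())
    return DEFAULT_REPLY


if __name__ == "__main__":
    for line in ("I believe I do not love my mother",
                 "I feel tired.",
                 "The weather is nice"):
        print(line, "->", reply(line))
```

No understanding is involved: the program reuses a fragment of the input inside a canned template, which is why such systems can sound plausible for a few turns without coming close to passing the Turing Test.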

The first mention of ideas related to Automated Turing Tests seems to appear in an unpublished manuscript by Moni Naor [85], who in 1996 proposed a theoretical framework that would serve as the first approach to testing humanity by automated means. In Naor’s humanity test, the human interrogator from the original Turing Test was replaced by a computer program. The original goal of his proposal was to present a scheme that would discourage computer software robots from misusing services originally intended to be used by humans only, in much the same sense as stopping an automation-based attack through human identification. Basically, he proposed an adaptation of the way identification is handled in cryptographic settings to deal with this situation. In cryptography, when one party A wants to prove its identity to another party B, the process is a proof that A can effectively compute a (keyed) function that a different user (not having the key) cannot compute. The identification process consists of a challenge selected by B and the response computed by A. What would replace the keyed cryptographic function in the proposed setting would be a task at which humans excel, but at which machines have a hard time competing with the performance of a 3-year-old. By successfully performing such a task, the user proves that he or she is human.
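The cryptographic identification scheme that Naor adapts can be illustrated with a short Python sketch. It is only one possible instance: HMAC-SHA256 is chosen here as the keyed function, the key is assumed to be shared in advance between A and B, and the function names are hypothetical. The text above does not specify any of these details.

```python
import hashlib
import hmac
import secrets


def new_challenge() -> bytes:
    """B selects a fresh, unpredictable challenge."""
    return secrets.token_bytes(16)


def respond(key: bytes, challenge: bytes) -> bytes:
    """A computes the keyed function of the challenge (HMAC-SHA256 here)."""
    return hmac.new(key, challenge, hashlib.sha256).digest()


def verify(key: bytes, challenge: bytes, response: bytes) -> bool:
    """B recomputes the expected response and compares in constant time."""
    expected = respond(key, challenge)
    return hmac.compare_digest(expected, response)


if __name__ == "__main__":
    shared_key = secrets.token_bytes(32)    # known to A and B only
    impostor_key = secrets.token_bytes(32)  # a user without the real key

    challenge = new_challenge()
    print("A accepted:", verify(shared_key, challenge,
                                respond(shared_key, challenge)))
    print("Impostor accepted:", verify(shared_key, challenge,
                                       respond(impostor_key, challenge)))
```

In the humanity test, the role of the secret key is played by a human ability: the “response” is something a person can produce easily and a program, by assumption, cannot.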

In Naor’s work, the collection of problems should possess the following  properties: