Captcha recognition: The war

Started by farter, October 20, 2013, 10:54:19 AM


i know no forum with many programmers so i just post here..  
move me to block with hole plz if it's not proper to post this here.

wtf is CAPTCHA

so i'm the one who MAKES A MESS WOOOOO (generating captchas), not the one who designs the recognition.

there are many networks of human captcha typers, and i'm not going to talk about them.

i wrote one with relatively precise tuning here: (sorry cssdeck, iono css) (still updating).

the main idea is to cut off the "unimportant part" of bold letters, and to mix thick strokes on the background with border-only (outline) letters.

i searched the internet and got [link missing], who claims:
Quote: This is the place where OCRs are born.
machine recognition yoooo

and there are reports showing that it recognizes re-captcha pictures very accurately and efficiently.

so i requested some free trial.
here are some results:

so... any idea on either side?
are most of my captchas solvable for you?
any enhancement for distortion or recognition?

Edit: something updated


Cool idea. Tbh, I couldn't figure out some of the letters myself, though.


De-captcha services are written to attack existing captchas. If you introduce a new style, of course none of the existing OCR stuff is likely to succeed. I believe de-captcher is backed by humans, as well, so perhaps your 1 in 8 statistic is 1 in 8 humans.

The real test is to have someone try and break it (algorithmically, with OCR or whatever). Personally I find these difficult to comprehend a significant portion of the time.

By the way, simple domain-specific questions make pretty decent spam blockers. For example, on the TG forums I use a bunch of Tetris-related questions and have had no spam registrations since, while with ReCaptcha I was getting quite a few.
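
A minimal sketch of that kind of check, in Python. The questions, the accepted answers, and the function names are made up for illustration; the thread does not say which questions the TG forums actually use.

```python
import random

# Hypothetical question bank (assumption): each question maps to the set of
# accepted answers, already lowercased.
QUESTIONS = {
    "How many cells make up one tetromino?": {"4", "four"},
    "How many columns wide is a standard Tetris playfield?": {"10", "ten"},
}


def pick_question(rng=random):
    """Pick one question to show on the registration form."""
    return rng.choice(sorted(QUESTIONS))


def check_answer(question, answer):
    """Return True if the answer matches one of the accepted answers."""
    accepted = QUESTIONS.get(question, set())
    return answer.strip().lower() in accepted
```

Matching is case-insensitive and ignores surrounding whitespace, so " Four " passes for the first question while "5" does not.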


Quote from: myndzi
De-captcha services are written to attack existing captchas. If you introduce a new style, of course none of the existing OCR stuff is likely to succeed. I believe de-captcher is backed by humans, as well, so perhaps your 1 in 8 statistic is 1 in 8 humans.

The real test is to have someone try and break it (algorithmically, with OCR or whatever). Personally I find these difficult to comprehend a significant portion of the time.

By the way, simple domain-specific questions make pretty decent spam blockers. For example, on the TG forums I use a bunch of Tetris-related questions and have had no spam registrations since, while with ReCaptcha I was getting quite a few.

but see the FBWC - FBVG: [image missing]


It's a pretty cool approach, but yeah, I agree with myndzi: any captcha usually gets broken once it's adopted by enough services to make breaking it worth someone's while.

Also notice that the captchas from reCAPTCHA have an audio version that is vulnerable. (Actually I use this more and more now because it's sometimes easier to figure out than the image.) The reason for the audio component is that U.S. law requires websites to be accessible to people who are blind.

I'm not an expert in captcha breaking, but it looks like finding the bold G wouldn't be so difficult. Part of the difficulty of your captcha lies in cropping the top and bottom, but that also restricts you to a standardized letter size. The third letter in your example of FB?G is probably the most difficult but tbh i think most real users will also not be able to identify it.



Quote from: vipjun
The third letter in your example of FB?G is probably the most difficult but tbh i think most real users will also not be able to identify it.

i updated the generator a bit, same url on cssdeck, now the image is zoomed twice as big. and a bit more distortion is added on the foreground.
does it seem easier now?

btw the three in the image you attached are all resized versions from the harddrop page... click to enlarge. the full-sized version (the captcha is 75*14) is far easier, because it's already very small and can't be shrunk any further (at first i also considered confusing AI with sub-pixel grey-scale lines...)

also seems another strong AI recognizer was just released with demo video:
Quote: HERE COMES A NEW CHALLENGER!


Quote from: farter
i updated the generator a bit, same url on cssdeck, now the image is zoomed twice as big. and a bit more distortion is added on the foreground.
does it seem easier now?

btw the three in the image you attached are all resized versions from the harddrop page... click to enlarge. the full-sized version (the captcha is 75*14) is far easier, because it's already very small and can't be shrunk any further (at first i also considered confusing AI with sub-pixel grey-scale lines...)
also seems another strong AI recognizer was just released with demo video:

With the new version of the captcha I was about 75-80% correct at first, and after looking at about 10 of them, maybe 90-95%. The difficult ones were sequences of letters that are all outlines on white backgrounds, and an E/F or B/P as the last letter (not a big deal for AI; IMO any letter that reaches >60% confidence is pretty good).

the vicarious link looks like some kind of unsupervised learning, how well do you think your captcha will fare against that?


Is there such a thing as chinese captcha ? could you show some images of that if there are any.


Quote from: vipjun
Is there such a thing as chinese captcha ? could you show some images of that if there are any.
sure there are. baidu forum(tieba, 贴吧) uses chinese captcha.

at first it was in the simplest form, asking the user to type the characters (only used in a few subforums; most were still using latin letters at that time), looking like this.

because typing is a bit slow for users, it has now switched to clicking: the user is given a picture containing 4 characters plus 9 single pieces, which include the right answers and other ones that look similar, and has to choose. like this.
latin letter captcha has been wholly replaced by that now.

Quote from: vipjun
With the new version of the captcha I was about 75-80% correct at first, and after looking at about 10 of them, maybe 90-95%. The difficult ones were sequences of letters that are all outlines on white backgrounds, and an E/F or B/P as the last letter (not a big deal for AI; IMO any letter that reaches >60% confidence is pretty good).

the vicarious link looks like some kind of unsupervised learning, how well do you think your captcha will fare against that?

i want the cropping to be just tight enough that it's always difficult, but mostly recognizable if you pay a bit more attention to the detail. so the size of the picture is tuned precisely so that if it's an E, the bottommost horizontal stroke must be visible, same for P-B-R. also B-E and N-M on the right side.
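
a rough sketch of that tuning rule in python, just to show the idea. this is not the actual generator code: the 5x9 glyph bitmaps, `crop`, and `max_bottom_crop` are all made up for illustration, and both glyphs are assumed to have the same height.

```python
# Hypothetical bold 5x9 glyphs (assumption, not the real font); '#' is ink.
E = ["#####", "#####", "##...", "##...", "####.",
     "##...", "##...", "#####", "#####"]
F = ["#####", "#####", "##...", "##...", "####.",
     "##...", "##...", "##...", "##..."]


def crop(glyph, top, bottom):
    """Cut `top` rows off the top and `bottom` rows off the bottom."""
    return glyph[top:len(glyph) - bottom]


def max_bottom_crop(a, b, top=0):
    """Largest bottom crop at which glyphs a and b still look different.

    Returns -1 if the two glyphs are identical even with no bottom crop.
    Assumes a and b have the same number of rows.
    """
    best = -1
    for bottom in range(len(a) - top):
        if crop(a, top, bottom) != crop(b, top, bottom):
            best = bottom
    return best


# With 2 rows cut from the top, only 1 bottom row can go before E turns into F.
print(max_bottom_crop(E, F, top=2))
```

with these bitmaps the bottom bar of the E is 2 rows thick, so cutting 1 row still leaves it visible and cutting 2 makes E and F identical. the same check could be run over P-B-R or any other pair to pick a crop that keeps them apart.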

that's why i think 95%+ is no problem orz.......... the only really bad pair that i admit appears often is C-O.... and some people say L-V and M-N sometimes get unrecognizable..

for the strong AI....
solid ones might be easier, so the thick white line segments are there to make it harder; i guess 30-60% accuracy per single letter.
and cropped outline letters should be fairly unbeatable, 20%-ish per letter imo...
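
to see what those per-letter guesses mean for a whole captcha, here's a quick python calculation. it assumes 4 letters per captcha (like FBWC / FBVG) and that the recognizer's errors on each letter are independent, which is a simplification; `whole_captcha_accuracy` is just an illustrative name.

```python
def whole_captcha_accuracy(per_letter, n_letters=4):
    """Chance of getting every letter right, assuming independent letters."""
    return per_letter ** n_letters


for p in (0.2, 0.3, 0.6):
    print(f"{p:.0%} per letter -> {whole_captcha_accuracy(p):.2%} per captcha")
# 20% per letter -> 0.16% per captcha
# 30% per letter -> 0.81% per captcha
# 60% per letter -> 12.96% per captcha
```

so even the optimistic 60% per letter would solve only about 1 captcha in 8 under these assumptions.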