April 2018

First attempt: accuracy is about 80%. The next question is whether Tesseract can be trained to improve on that.
Script

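#!/usr/bin/python
# -*- coding: utf-8 -*-
from PIL import Image
import pytesseract

# Recognize Simplified Chinese text; requires chi_sim.traineddata to be installed.
text = pytesseract.image_to_string(Image.open('/usr/tmp/dd.jpg'), lang='chi_sim')
print(text)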

Image
dd.jpg

Recognition result

风急天高猿啸衷′ … 冒沙^鸟飞巩
无边落木萧萧下′ 不尽长江滚滚来。
万 悲秋常作害′ 百仨多病独登台】
艰难苦恨寰霜鬓′ 渣倒新停浊酒杯】
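
The accuracy figure can be checked rather than estimated by eye. The sketch below is one possible way to do it, not the method used for the 80% figure above: it compares the OCR output with a reference text character by character using `difflib`. The reference string is an assumption (the standard text of Du Fu's "登高", which the image appears to contain), and `strip_noise` and `char_accuracy` are hypothetical helper names.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import difflib

# Assumed ground truth: the standard text of Du Fu's "登高" (not part of the original note).
REFERENCE = (
    "风急天高猿啸哀,渚清沙白鸟飞回。"
    "无边落木萧萧下,不尽长江滚滚来。"
    "万里悲秋常作客,百年多病独登台。"
    "艰难苦恨繁霜鬓,潦倒新停浊酒杯。"
)

# OCR output copied verbatim from the recognition result above.
OCR_OUTPUT = (
    "风急天高猿啸衷′ … 冒沙^鸟飞巩\n"
    "无边落木萧萧下′ 不尽长江滚滚来。\n"
    "万 悲秋常作害′ 百仨多病独登台】\n"
    "艰难苦恨寰霜鬓′ 渣倒新停浊酒杯】\n"
)


def strip_noise(text):
    """Keep only CJK ideographs so punctuation and spacing are ignored."""
    return "".join(ch for ch in text if "\u4e00" <= ch <= "\u9fff")


def char_accuracy(reference, recognized):
    """Fraction of reference characters that also appear, in order, in the OCR text."""
    ref = strip_noise(reference)
    hyp = strip_noise(recognized)
    if not ref:
        return 0.0
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / float(len(ref))


print("character accuracy: {:.0%}".format(char_accuracy(REFERENCE, OCR_OUTPUT)))
```

Punctuation is ignored on purpose, since Tesseract's stray marks (′, ^, 】) would otherwise dominate the score. Running the same measurement before and after any training gives a like-for-like comparison.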

Deployment notes

pip install --upgrade pip
# pip install PIL
# On Python 2.7, install Pillow instead:
pip install pillow
pip install pytesseract

yum install epel-release
yum install tesseract
# Running the script at this point fails with:
pytesseract.pytesseract.TesseractError: (1, u'Tesseract Open Source OCR Engine v3.04.00 with Leptonica Error opening data file /usr/share/tesseract/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language \'chi_sim\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')

# Download the matching language pack from https://github.com/tesseract-ocr/tessdata
wget https://raw.githubusercontent.com/tesseract-ocr/tessdata/master/chi_sim.traineddata
mv chi_sim.traineddata /usr/share/tesseract/tessdata/
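
To catch the missing language file before Tesseract raises the error shown above, a small pre-flight check can help. This is a minimal sketch: the default directory is taken from the error message and may differ on other systems, and `missing_languages` is a hypothetical helper, not part of pytesseract.

```python
from __future__ import print_function
import os

# Directory taken from the error message above; other distributions may use a different path.
DEFAULT_TESSDATA_DIR = "/usr/share/tesseract/tessdata"


def missing_languages(langs, tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return the language codes whose <code>.traineddata file is absent."""
    return [
        lang for lang in langs
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


missing = missing_languages(["chi_sim"])
if missing:
    print("missing language data:", ", ".join(missing))
else:
    print("all requested language data files are present")
```

Calling this check before `image_to_string` turns the long `TesseractError` into a short message that names the file to download.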