테서렉트 사용 방법

Google Tesseract 사용 방법 정리

설치 방법¶

Windows: Tesseract at UB Mannheim에서 설치파일 다운로드 가능

Linux: 아래 명령어들로 다운로드 및 설치 가능

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev

apt install -y tesseract-ocr

사용 방법¶

주요 언어별 tesseract 사용 보조 패키지

Python: pytesseract

CLI에서 Tesseract 설치 위치에 접근하여 tesseract.exe 파일을 실행하면 아래와 같이 기초 사용법을 확인할 수 있음

tesseract.exe

Usage:
  tesseract.exe --help | --help-extra | --version
  tesseract.exe --list-langs
  tesseract.exe imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

전체 명령어 확인 방법

tesseract.exe --help-extra

Usage:
tesseract.exe --help | --help-extra | --help-psm | --help-oem | --version
tesseract.exe --list-langs [--tessdata-dir PATH]
tesseract.exe --print-fonts-table [options...] [configfile...]
tesseract.exe --print-parameters [options...] [configfile...]
tesseract.exe imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
--tessdata-dir PATH   Specify the location of tessdata path.
--user-words PATH     Specify the location of user words file.
--user-patterns PATH  Specify the location of user patterns file.
--dpi VALUE           Specify DPI for input image.
--loglevel LEVEL      Specify logging level. LEVEL can be
                        ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL or OFF.
-l LANG[+LANG]        Specify language(s) used for OCR.
-c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
--psm NUM             Specify page segmentation mode.
--oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
0    Orientation and script detection (OSD) only.
1    Automatic page segmentation with OSD.
2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
3    Fully automatic page segmentation, but no OSD. (Default)
4    Assume a single column of text of variable sizes.
5    Assume a single uniform block of vertically aligned text.
6    Assume a single uniform block of text.
7    Treat the image as a single text line.
8    Treat the image as a single word.
9    Treat the image as a single word in a circle.
10    Treat the image as a single character.
11    Sparse text. Find as much text as possible in no particular order.
12    Sparse text with OSD.
13    Raw line. Treat the image as a single text line,
    bypassing hacks that are Tesseract-specific.

OCR Engine modes:
0    Legacy engine only.
1    Neural nets LSTM engine only.
2    Legacy + LSTM engines.
3    Default, based on what is available.

Single options:
-h, --help            Show minimal help message.
--help-extra          Show extra help for advanced users.
--help-psm            Show page segmentation modes.
--help-oem            Show OCR Engine modes.
-v, --version         Show version information.
--list-langs          List available languages for tesseract engine.
--print-fonts-table   Print tesseract fonts table.
--print-parameters    Print tesseract parameters.

실행 옵션¶

PSM

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

OEM

OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.

인식 언어 관련¶

현재 환경에 설치 된 인식 언어 확인 명령어

tesseract.exe --list-langs

List of available languages in "C:\programming\Tesseract-OCR/tessdata/" (3):
eng
kor
osd

Tesseract 인식 대상 언어 추가 방법
1. tesseract-ocr 저장소에서 원하는 언어의 *.traineddata 파일 다운로드
2. 설치 경로의 tessdata 폴더에 추가