Diffstat (limited to 'tesseract/README.md')
-rw-r--r-- | tesseract/README.md | 136 |
1 files changed, 136 insertions, 0 deletions
diff --git a/tesseract/README.md b/tesseract/README.md
new file mode 100644
index 00000000..30fa8ea0
--- /dev/null
+++ b/tesseract/README.md
@@ -0,0 +1,136 @@
+# Tesseract OCR
+
+[![Build Status](https://travis-ci.org/tesseract-ocr/tesseract.svg?branch=master)](https://travis-ci.org/tesseract-ocr/tesseract)
+[![Build status](https://ci.appveyor.com/api/projects/status/miah0ikfsf0j3819/branch/master?svg=true)](https://ci.appveyor.com/project/zdenop/tesseract/)
+![Build status](https://github.com/tesseract-ocr/tesseract/workflows/sw/badge.svg)<br>
+[![Coverity Scan Build Status](https://scan.coverity.com/projects/tesseract-ocr/badge.svg)](https://scan.coverity.com/projects/tesseract-ocr)
+[![Code Quality: Cpp](https://img.shields.io/lgtm/grade/cpp/g/tesseract-ocr/tesseract.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/tesseract-ocr/tesseract/context:cpp)
+[![Total Alerts](https://img.shields.io/lgtm/alerts/g/tesseract-ocr/tesseract.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/tesseract-ocr/tesseract/alerts)
+[![OSS-Fuzz](https://img.shields.io/badge/oss--fuzz-fuzzing-brightgreen)](https://bugs.chromium.org/p/oss-fuzz/issues/list?sort=-opened&can=2&q=proj:tesseract-ocr)
+<br/>
+[![GitHub license](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](https://raw.githubusercontent.com/tesseract-ocr/tesseract/master/LICENSE)
+[![Downloads](https://img.shields.io/badge/download-all%20releases-brightgreen.svg)](https://github.com/tesseract-ocr/tesseract/releases/)
+
+## About
+
+This package contains an **OCR engine** - `libtesseract` and a **command line program** - `tesseract`.
+Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused
+on line recognition, but also still supports the legacy Tesseract OCR engine of
+Tesseract 3 which works by recognizing character patterns. Compatibility with
+Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0).
+It also needs [traineddata](https://tesseract-ocr.github.io/tessdoc/Data-Files.html) files which support the legacy engine, for example
+those from the tessdata repository.
+
+The lead developer is Ray Smith. The maintainer is Zdenko Podobny.
+For a list of contributors see [AUTHORS](https://github.com/tesseract-ocr/tesseract/blob/master/AUTHORS)
+and GitHub's log of [contributors](https://github.com/tesseract-ocr/tesseract/graphs/contributors).
+
+Tesseract has **unicode (UTF-8) support**, and can **recognize more than 100 languages** "out of the box".
+
+Tesseract supports **various output formats**: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
+
+You should note that in many cases, in order to get better OCR results,
+you'll need to **[improve the quality](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html) of the image** you are giving Tesseract.
+
+This project **does not include a GUI application**.
+If you need one, please see the [3rdParty](https://tesseract-ocr.github.io/tessdoc/User-Projects-%E2%80%93-3rdParty.html) documentation.
+
+Tesseract **can be trained to recognize other languages**.
+See [Tesseract Training](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html) for more information.
+
+## Brief history
+
+Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
+more changes made in 1996 to port to Windows, and some C++izing in 1998.
+In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.
+
+The latest (LSTM based) stable version is **[4.1.1](https://github.com/tesseract-ocr/tesseract/releases/tag/4.1.1)**, released on December 26, 2019.
+Latest source code is available from [master branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/master).
+Open issues can be found in [issue tracker](https://github.com/tesseract-ocr/tesseract/issues),
+and [planning documentation](https://tesseract-ocr.github.io/tessdoc/Planning.html).
+
+The latest 3.0x version is **[3.05.02](https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.02)**, released on June 19, 2018. Latest source code for 3.05 is available from [3.05 branch on GitHub](https://github.com/tesseract-ocr/tesseract/tree/3.05).
+There is no development for this version, but it can be used for special cases (e.g. see [Regression of features from 3.0x](https://tesseract-ocr.github.io/tessdoc/Planning.html#regression-of-features-from-30x)).
+
+See **[Release Notes](https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html)**
+and **[Change Log](https://github.com/tesseract-ocr/tesseract/blob/master/ChangeLog)** for more details of the releases.
+
+## Installing Tesseract
+
+You can either [Install Tesseract via pre-built binary package](https://tesseract-ocr.github.io/tessdoc/Home.html)
+or [build it from source](https://tesseract-ocr.github.io/tessdoc/Compiling.html).
+
+Supported Compilers are:
+
+* GCC 4.8 and above
+* Clang 3.4 and above
+* MSVC 2015, 2017, 2019
+
+Other compilers might work, but are not officially supported.
+
+## Running Tesseract
+
+Basic **[command line usage](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html)**:
+
+    tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles...]
+
+For more information about the various command line options use `tesseract --help` or `man tesseract`.
+
+Examples can be found in the [documentation](https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html#simplest-invocation-to-ocr-an-image).
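The command-line synopsis in the "Running Tesseract" section of the README can also be assembled from a script. The following is a minimal sketch, not part of the commit: `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and the only flags used are the ones the synopsis itself lists (`-l`, `--oem`, `--psm`, and trailing config file names). It assumes a `tesseract` executable is on `PATH` when `run_tesseract` is called.

```python
import shutil
import subprocess


def build_tesseract_command(image, outputbase, lang=None, oem=None, psm=None, configfiles=()):
    """Assemble an argv list in the order given by the README synopsis.

    Hypothetical helper: optional arguments are only emitted when set.
    """
    cmd = ["tesseract", str(image), str(outputbase)]
    if lang is not None:
        cmd += ["-l", lang]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config file names, if any, come last.
    cmd += list(configfiles)
    return cmd


def run_tesseract(image, outputbase, **options):
    """Run the assembled command; raises if tesseract is not installed."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    return subprocess.run(build_tesseract_command(image, outputbase, **options), check=True)
```

For example, `build_tesseract_command("page.png", "page", lang="eng", oem=0)` yields the argument list for running the legacy engine mode mentioned in the About section (`--oem 0`) on a placeholder image named `page.png`. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.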
+
+## For developers
+
+Developers can use `libtesseract` [C](https://github.com/tesseract-ocr/tesseract/blob/master/include/tesseract/capi.h) or
+[C++](https://github.com/tesseract-ocr/tesseract/blob/master/include/tesseract/baseapi.h) API to build their own application.
+If you need bindings to `libtesseract` for other programming languages, please see the
+[wrapper](https://tesseract-ocr.github.io/tessdoc/AddOns.html#tesseract-wrappers) section in the AddOns documentation.
+
+Documentation of Tesseract generated from source code by doxygen can be found on [tesseract-ocr.github.io](https://tesseract-ocr.github.io/).
+
+## Support
+
+Before you submit an issue, please review **[the guidelines for this repository](https://github.com/tesseract-ocr/tesseract/blob/master/CONTRIBUTING.md)**.
+
+For support, first read the [documentation](https://tesseract-ocr.github.io/tessdoc/),
+particularly the [FAQ](https://tesseract-ocr.github.io/tessdoc/FAQ.html) to see if your problem is addressed there.
+If not, search the [Tesseract user forum](https://groups.google.com/d/forum/tesseract-ocr), the [Tesseract developer forum](https://groups.google.com/d/forum/tesseract-dev) and [past issues](https://github.com/tesseract-ocr/tesseract/issues), and if you still can't find what you need, ask for support in the mailing-lists.
+
+Mailing-lists:
+* [tesseract-ocr](https://groups.google.com/d/forum/tesseract-ocr) - For tesseract users.
+* [tesseract-dev](https://groups.google.com/d/forum/tesseract-dev) - For tesseract developers.
+
+Please report an issue only for a **bug**, not for asking questions.
+
+## License
+
+    The code in this repository is licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+
+**NOTE**: This software depends on other packages that may be licensed under different open source licenses.
+
+Tesseract uses [Leptonica library](http://leptonica.com/) which essentially
+uses a [BSD 2-clause license](http://leptonica.com/about-the-license.html).
+
+## Dependencies
+
+Tesseract uses [Leptonica library](https://github.com/DanBloomberg/leptonica)
+for opening input images (e.g. not documents like pdf).
+It is suggested to use leptonica with built-in support for [zlib](https://zlib.net),
+[png](https://sourceforge.net/projects/libpng) and
+[tiff](http://www.simplesystems.org/libtiff) (for multipage tiff).
+
+## Latest Version of README
+
+For the latest online version of the README.md see:
+
+https://github.com/tesseract-ocr/tesseract/blob/master/README.md