20120254181 | TEXT, CHARACTER ENCODING AND LANGUAGE RECOGNITION - A method is disclosed, for recognizing whether some electronic data is the digital representation of a piece of text and, if so, in which character encoding it has been encoded. A fingerprint is constructed from the data, wherein the fingerprint comprises, for each of a plurality of predetermined character encoding schemes, at least one confidence value, representing a confidence that the data was encoded using said character encoding scheme. The fingerprint also comprises a frequency value for each of a subset of byte values, each frequency value representing the frequency of occurrence of a respective byte value in the data. A statistical classification of the data is then performed based on the fingerprint. | 10-04-2012 |