unicode - preg_match and UTF-8 in PHP

ID : 131377

Tags : php, unicode, utf-8, pcre

Top 5 Answers for unicode - preg_match and UTF-8 in PHP

95

Although the u modifier makes both the pattern and subject be interpreted as UTF-8, the captured offsets are still counted in bytes.

You can use mb_strlen on the part of the string that precedes the match to get the offset in UTF-8 characters rather than bytes:

    $str = "\xC2\xA1Hola!";
    preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
    echo mb_strlen(substr($str, 0, $a_matches[0][1]));

84

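Try adding (*UTF8) at the very start of the pattern, inside the delimiters:

    preg_match('/(*UTF8)H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);

This puts PCRE into UTF-8 mode in the same way as the u modifier, so the offsets returned with PREG_OFFSET_CAPTURE are still counted in bytes. Credit to a comment in https://www.php.net/manual/function.preg-match.php#95828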

78

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

The 'u' switch only has meaning for PCRE; PHP itself is unaware of it.

From PHP's point of view, strings are byte sequences, so returning a byte offset seems logical (I don't say "correct").
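
To make the byte-versus-character distinction concrete, here is a minimal sketch that prints both views of the same string. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// "¡Hola!" written with escapes: "¡" occupies two bytes in UTF-8.
$str = "\xC2\xA1Hola!";

preg_match('/H/u', $str, $matches, PREG_OFFSET_CAPTURE);

// Offset exactly as preg_match reports it (counted in bytes).
$byteOffset = $matches[0][1];

// Same position counted in characters: measure the prefix before the match.
$charOffset = mb_strlen(substr($str, 0, $byteOffset), 'UTF-8');

printf(
    "strlen=%d mb_strlen=%d byte offset=%d char offset=%d\n",
    strlen($str),
    mb_strlen($str, 'UTF-8'),
    $byteOffset,
    $charOffset
);
// prints: strlen=7 mb_strlen=6 byte offset=2 char offset=1
```

The byte offset is 2 because the leading "¡" takes two bytes, while the character offset is 1.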

61

Excuse the necropost, but maybe somebody will find this useful: the code below works as a replacement for both preg_match and preg_match_all and returns the matches with correct character offsets for UTF-8-encoded strings.

    mb_internal_encoding('UTF-8');

    /**
     * Returns array of matches in same format as preg_match or preg_match_all,
     * with offsets counted in characters rather than bytes.
     * @param bool   $matchAll If true, execute preg_match_all, otherwise preg_match
     * @param string $pattern  The pattern to search for, as a string.
     * @param string $subject  The input string.
     * @param int    $offset   The place from which to start the search (in bytes).
     * @return array
     */
    function pregMatchCapture($matchAll, $pattern, $subject, $offset = 0)
    {
        $matchInfo = array();
        $method    = 'preg_match';
        $flag      = PREG_OFFSET_CAPTURE;
        if ($matchAll) {
            $method .= '_all';
        }
        $n = $method($pattern, $subject, $matchInfo, $flag, $offset);
        $result = array();
        if ($n !== 0 && !empty($matchInfo)) {
            if (!$matchAll) {
                $matchInfo = array($matchInfo);
            }
            foreach ($matchInfo as $matches) {
                $positions = array();
                foreach ($matches as $match) {
                    $matchedText = $match[0];
                    $byteOffset  = $match[1]; // offset in bytes, as reported by PCRE
                    $positions[] = array(
                        $matchedText,
                        // count the characters in the prefix that precedes the match
                        mb_strlen(mb_strcut($subject, 0, $byteOffset))
                    );
                }
                $result[] = $positions;
            }
            if (!$matchAll) {
                $result = $result[0];
            }
        }
        return $result;
    }

    $s1 = 'Попробуем русскую строку для теста';
    $s2 = 'Try english string for test';

    var_dump(pregMatchCapture(true, '/обу/', $s1));
    var_dump(pregMatchCapture(false, '/обу/', $s1));

    var_dump(pregMatchCapture(true, '/lish/', $s2));
    var_dump(pregMatchCapture(false, '/lish/', $s2));

Output of my example:

    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(6) "обу"
          [1]=>
          int(4)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(6) "обу"
        [1]=>
        int(4)
      }
    }
    array(1) {
      [0]=>
      array(1) {
        [0]=>
        array(2) {
          [0]=>
          string(4) "lish"
          [1]=>
          int(7)
        }
      }
    }
    array(1) {
      [0]=>
      array(2) {
        [0]=>
        string(4) "lish"
        [1]=>
        int(7)
      }
    }

50

You can calculate the real UTF-8 character offset by cutting the string at the offset returned by preg_match with the byte-counting substr and then measuring that prefix with the character-counting mb_strlen.

$utf8Offset = mb_strlen(substr($text, 0, $offsetFromPregMatch), 'UTF-8'); 
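
The same one-liner can be wrapped in a small function and applied to every match from preg_match_all. This is only a sketch: utf8CharOffset is a hypothetical helper name, not a PHP built-in, and it assumes the subject is valid UTF-8 and that mbstring is available.

```php
<?php
// Hypothetical helper (not part of PHP): converts a byte offset into a
// character offset by measuring the prefix that precedes it.
// Assumes $text is valid UTF-8 and the mbstring extension is loaded.
function utf8CharOffset(string $text, int $byteOffset): int
{
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

$text = 'Попробуем русскую строку для теста';

// Collect every occurrence of "у" together with its byte offset.
preg_match_all('/у/u', $text, $matches, PREG_OFFSET_CAPTURE);

$charOffsets = [];
foreach ($matches[0] as [$matchedText, $byteOffset]) {
    $charOffsets[] = utf8CharOffset($text, $byteOffset);
}

echo implode(', ', $charOffsets), "\n";
// prints: 6, 11, 15, 23
```

Each call re-scans the prefix from the start of the string, so this approach is simple but does repeated work on long subjects with many matches.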
