PHP: Using Levenshtein distance to match words

pengyingh  Views: 26  2024-09-07 23:24:14  Comments: 0

I have been reading about and testing some of the examples for PHP's levenshtein(). Comparing $input against $words gives the output below.

Comparison

$input = 'hw r u my dear angel'; 
 
    // array of words to check against 
    $words  = array('apple','pineapple','banana','orange','how are you', 
                    'radish','carrot','pea','bean','potato','hw are you'); 

Output

Input word: hw r u my dear angel 
Did you mean: hw are you? 

Comparison, with hw are you removed from the array

$input = 'hw r u my dear angel'; 
 
    // array of words to check against 
    $words  = array('apple','pineapple','banana','orange','how are you', 
                    'radish','carrot','pea','bean','potato'); 

Output of the second run, with hw are you removed from the array

Input word: hw r u my dear angel 
Did you mean: orange?  

Whereas with similar_text():

    echo '<br/>how are you:'.similar_text($input,'how are you'); 
    echo '<br/>orange:'.similar_text($input,'orange'); 
    echo '<br/>hw are you:'.similar_text($input,'hw are you'); 
 
how are you:6 
orange:5 
hw are you:6 

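In the second comparison, why does it output orange, when how are you also has 6 similar characters, the same as hw are you? Is there any way to improve this, or a better approach? I also keep all possible inputs in a database. Should I query them, store them in an array, and then use foreach to compute the Levenshtein distance for each one? With millions of rows, that would be slow.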

Code

  <?php 
    // input misspelled word 
    $input = 'hw r u my dear angel'; 
 
    // array of words to check against 
    $words  = array('apple','pineapple','banana','orange','how are you', 
                    'radish','carrot','pea','bean','potato','hw are you'); 
 
 
    // no shortest distance found, yet 
    $shortest = -1; 
 
    $closest = closest($input,$words,$shortest); 
 
 
    echo "Input word: $input<br/>"; 
    if ($shortest == 0) { 
        echo "Exact match found: $closest\n"; 
    } else { 
        echo "Did you mean: $closest?\n"; 
    } 
    echo '<br/><br/>'; 
 
    $shortest = -1; 
    $words  = array('apple','pineapple','banana','orange','how are you', 
                    'radish','carrot','pea','bean','potato'); 
    $closest = closest($input,$words,$shortest); 
    echo "Input word: $input<br/>"; 
    if ($shortest == 0) { 
        echo "Exact match found: $closest\n"; 
    } else { 
        echo "Did you mean: $closest?\n"; 
    } 
 
    echo '<br/><br/>'; 
    echo 'Similar text'; 
    echo '<br/>how are you:'.similar_text($input,'how are you'); 
    echo '<br/>orange:'.similar_text($input,'orange'); 
    echo '<br/>hw are you:'.similar_text($input,'hw are you'); 
 
 
 
    function closest($input,$words,&$shortest){ 
        // stays null if $words is empty 
        $closest = null; 
 
        // loop through words to find the closest 
        foreach ($words as $word) { 
 
            // calculate the distance between the input word, 
            // and the current word 
            $lev = levenshtein($input, $word); 
 
            // check for an exact match 
            if ($lev == 0) { 
 
                // closest word is this one (exact match) 
                $closest = $word; 
                $shortest = 0; 
 
                // break out of the loop; we've found an exact match 
                break; 
            } 
 
            // if this distance is less than or equal to the shortest 
            // distance found so far (so a later word wins a tie), OR if 
            // no shortest distance has been found yet 
            if ($lev <= $shortest || $shortest < 0) { 
                // set the closest match, and shortest distance 
                $closest  = $word; 
                $shortest = $lev; 
            } 
        } 
        return $closest; 
    } 
    ?> 

Please refer to the following approach:

First of all, it does not matter what similar_text() outputs, because it uses a different algorithm to compute the similarity between strings.
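To make the difference concrete, here is a small sketch that prints both measures side by side for the three candidates from the question. similar_text() returns a count of matching characters, so higher means closer, while levenshtein() returns a count of edits, so lower means closer, and the two rankings need not agree. The helper name compareMeasures is made up for this illustration.

```php
<?php
// Hypothetical helper for illustration only: returns both measures per candidate.
function compareMeasures(string $input, array $candidates): array
{
    $rows = [];
    foreach ($candidates as $candidate) {
        $rows[$candidate] = [
            // number of matching characters: higher means more similar
            'similar_text' => similar_text($input, $candidate),
            // number of single-character edits: lower means more similar
            'levenshtein'  => levenshtein($input, $candidate),
        ];
    }
    return $rows;
}

$input = 'hw r u my dear angel';
foreach (compareMeasures($input, ['how are you', 'orange', 'hw are you']) as $word => $m) {
    echo "$word: similar_text={$m['similar_text']}, levenshtein={$m['levenshtein']}\n";
}
```

Because the two functions measure different things, a word that scores well on one can still lose on the other, which is what happens to how are you here.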

Let's try to understand why levenshtein() considers hw r u my dear angel to be closer to orange than to how are you. Wikipedia has a good definition of what Levenshtein distance is:

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.
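The definition can be turned into code with the usual dynamic-programming table. The sketch below is only an illustration of the idea, not how PHP implements levenshtein() internally; the function name editDistance is an assumption made for this example, and it compares bytes, so it is only meaningful for single-byte text.

```php
<?php
// Illustrative re-implementation; prefer the built-in levenshtein() in real code.
function editDistance(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);

    // $prev[$j] holds the distance between the first $i - 1 bytes of $a
    // and the first $j bytes of $b. Row 0 is just 0, 1, 2, ... insertions.
    $prev = range(0, $lenB);

    for ($i = 1; $i <= $lenA; $i++) {
        // turning the first $i bytes of $a into an empty string takes $i deletions
        $curr = [$i];
        for ($j = 1; $j <= $lenB; $j++) {
            $substitution = $prev[$j - 1] + ($a[$i - 1] === $b[$j - 1] ? 0 : 1);
            $deletion     = $prev[$j] + 1;
            $insertion    = $curr[$j - 1] + 1;
            $curr[$j] = min($substitution, $deletion, $insertion);
        }
        $prev = $curr;
    }

    return $prev[$lenB];
}

echo editDistance('kitten', 'sitting'), "\n"; // 3, the classic textbook example
```

Only two rows of the table are kept at a time, since each cell depends only on the current row and the one above it.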

Now let's count how many edits we need to turn hw r u my dear angel into orange.

  1. hw r u my dear angel → hw r u my dear ange (delete the last character)
  2. hw r u my dear ange → hw r u my dearange (delete the last space)
  3. hw r u my dearange → arange (delete the first 12 characters)
  4. arange → orange (substitute o for a)

So in total it takes 1 + 1 + 12 + 1 = 15 edits to turn hw r u my dear angel into orange.

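Here is the transformation of hw r u my dear angel into how are you.

  1. hw r u my dear angel → how r u my dear angel (insert the character o)
  2. how r u my dear angel → how dear angel (delete 7 characters)
  3. how dear angel → how ar angel (delete 2 characters)
  4. how ar angel → how are angel (insert the character e)
  5. how are angel → how are ang (delete the last 2 characters)
  6. how are ang → how are you (substitute the last 3 characters)

1 + 7 + 2 + 1 + 2 + 3 = 16 edits. So, as you can see, in terms of Levenshtein distance orange is closer to hw r u my dear angel ;-)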
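The hand counts above can be checked by listing every candidate with its distance. The following is a minimal sketch; rankByDistance is a hypothetical helper name, not something from the question's code. Working through the same kind of count for hw are you should also give 15, a tie with orange, and since closest() in the question compares with <=, the later array entry wins a tie. That would be consistent with the first run printing hw are you and the second printing orange.

```php
<?php
// Hypothetical helper: maps each candidate to its distance from $input, closest first.
function rankByDistance(string $input, array $words): array
{
    $distances = [];
    foreach ($words as $word) {
        $distances[$word] = levenshtein($input, $word);
    }
    asort($distances); // sort by distance, keeping the word => distance pairs
    return $distances;
}

$input = 'hw r u my dear angel';
$words = ['apple', 'pineapple', 'banana', 'orange', 'how are you',
          'radish', 'carrot', 'pea', 'bean', 'potato', 'hw are you'];

foreach (rankByDistance($input, $words) as $word => $distance) {
    echo "$distance  $word\n";
}
```

Seeing the whole ranking rather than a single winner makes ties and near misses visible, which is useful when deciding whether a suggestion is good enough to show at all.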


Tags: PHP