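This article draws on the blog post: https://blog.csdn.net/henshuia/article/details/111498753?utm_medium=distribute.pc_relevant.none-task-blog-2defaultbaidujs_title~default-0.pc_relevant_default&spm=1001.2101.3001.4242.1&utm_relevant_index=3
Thanks again to the author for sharing!
The usual way to implement sensitive-word filtering is to iterate over the sensitive-word list and try each word against the text one by one. This is inefficient, and it fails as soon as special symbols or spaces are mixed into the sensitive word. A slightly more advanced approach is regular-expression matching, but that is slow as well, so this article uses the DFA algorithm instead.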
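Introduction to DFA
Among the algorithms used for text filtering, DFA is one of the better choices. DFA stands for Deterministic Finite Automaton: given an event and the current state, it yields the next state, i.e. event + state = nextstate. The figure below shows the state transitions.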
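To make event + state = nextstate concrete, here is a minimal sketch of a DFA that accepts exactly the string 黄色. The class name, the integer state numbering, and the DEAD state are assumptions made for this illustration; they do not come from the referenced blog.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: a tiny DFA that accepts exactly the string "黄色".
class DfaTransitionSketch {
    static final int START = 0;   // nothing read yet
    static final int ACCEPT = 2;  // "黄色" has been read
    static final int DEAD = -1;   // no transition exists for (state, event)

    // TRANSITIONS.get(state).get(event) is the next state.
    static final Map<Integer, Map<Character, Integer>> TRANSITIONS = new HashMap<>();
    static {
        Map<Character, Integer> fromStart = new HashMap<>();
        fromStart.put('黄', 1);
        TRANSITIONS.put(START, fromStart);

        Map<Character, Integer> fromOne = new HashMap<>();
        fromOne.put('色', ACCEPT);
        TRANSITIONS.put(1, fromOne);
    }

    // event + state = nextstate; unknown pairs lead to DEAD.
    static int next(int state, char event) {
        Map<Character, Integer> row = TRANSITIONS.get(state);
        if (row == null) {
            return DEAD;
        }
        Integer target = row.get(event);
        return target == null ? DEAD : target;
    }

    // Feeds the input one character at a time and checks the final state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = next(state, input.charAt(i));
            if (state == DEAD) {
                return false;
            }
        }
        return state == ACCEPT;
    }
}
```

Because each (state, event) pair maps to at most one next state, the automaton never has to backtrack while reading a character, which is what makes the approach attractive for filtering.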
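If the sensitive words are 黄色, 黄赌毒, 黄色丝袜 and 丝袜, the data structure looks as follows.
To filter an article, scan it from the first character onward. If the current character exists under the root node, match the character that follows it against that node's child nodes, and keep going in the same way. If a node is reached and its end=1, a sensitive word has been found.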
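The structure and the scan described above can be sketched with a small typed tree. This is only an illustration: the class WordTreeSketch, its Node type, and the method names are hypothetical, and it uses a boolean end flag where the article's own code stores "end" inside nested maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the word tree, using a typed node instead of nested maps.
class WordTreeSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end; // true when the path from the root to this node spells a word
    }

    final Node root = new Node();

    // Adds one sensitive word, creating nodes along its path as needed.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        if (node != root) {
            node.end = true;
        }
    }

    // Returns the shortest sensitive word that starts at index start, or null.
    String matchAt(String text, int start) {
        Node node = root;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) {
                return null; // no child for this character: not a sensitive word
            }
            if (node.end) {
                return text.substring(start, i + 1);
            }
        }
        return null; // text ended in the middle of a word
    }

    // Scans from the first character onward and returns the first hit, or null.
    String findFirst(String text) {
        for (int i = 0; i < text.length(); i++) {
            String hit = matchAt(text, i);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

Note that this sketch stops at the first node with end set, so for the text 黄色丝袜 it reports 黄色 rather than the longer word. Reporting the longest match instead would mean remembering the last end node and continuing the walk.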
First, process the sensitive words and write them into a nested map in the format described above.
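// Needs java.lang.ref.SoftReference, java.util.HashMap, java.util.List and java.util.Map.
// log and JSONObject come from the surrounding class and its dependencies.
private SoftReference<Map<String, Map>> sensitiveWordReference = null;

// Builds the nested-map word tree. Child nodes are stored under Character keys;
// the String key "end" is "1" when a sensitive word ends at that node, otherwise "0".
@SuppressWarnings({"rawtypes", "unchecked"})
private void setWordMap(List<String> result) {
    Map<String, Map> sensitiveWordMap = new HashMap<>(result.size());
    for (String key : result) {
        Map nowMap = sensitiveWordMap; // every word starts at the root
        for (int i = 0; i < key.length(); i++) {
            char keyChar = key.charAt(i);
            if (nowMap.containsKey(keyChar)) {
                // The character already exists at this level: descend into it.
                nowMap = (Map) nowMap.get(keyChar);
            } else {
                // Otherwise create a new node, initially not the end of a word.
                Map newWordMap = new HashMap();
                newWordMap.put("end", "0");
                nowMap.put(keyChar, newWordMap);
                nowMap = newWordMap;
            }
            if (i == key.length() - 1) {
                nowMap.put("end", "1"); // last character: mark the end of a word
            }
        }
    }
    log.info("Sensitive-word map: {}", JSONObject.toJSONString(sensitiveWordMap));
    sensitiveWordReference = new SoftReference<>(sensitiveWordMap);
}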
The sensitive words are: 毒品, 黑色, 黄色丝袜, 色情, 丝袜, 黄色, 黄赌毒
Sensitive-word map:
{
  "黑": {"色": {"end": "1"}, "end": "0"},
  "毒": {"品": {"end": "1"}, "end": "0"},
  "色": {"情": {"end": "1"}, "end": "0"},
  "黄": {
    "色": {"end": "1", "丝": {"end": "0", "袜": {"end": "1"}}},
    "end": "0",
    "赌": {"毒": {"end": "1"}, "end": "0"}
  },
  "丝": {"end": "0", "袜": {"end": "1"}}
}
Check the text and return the first sensitive word that matches.
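// Returns the first sensitive word found in the text (including any special
// characters skipped inside it), or an empty string if there is none.
@SuppressWarnings("rawtypes")
private String checkWords(String sensitiveWords) {
    if (StringUtils.isEmpty(sensitiveWords)) {
        return StringUtils.EMPTY;
    }
    Map<String, Map> sensitiveWordMap = sensitiveWordReference.get();
    if (sensitiveWordMap == null) {
        // A SoftReference can be cleared by the garbage collector.
        return StringUtils.EMPTY;
    }
    for (int i = 0; i < sensitiveWords.length(); i++) {
        char first = sensitiveWords.charAt(i);
        // Only a Chinese character that exists under the root can start a match.
        if (!isChinese(first) || !sensitiveWordMap.containsKey(first)) {
            continue;
        }
        Map nowMap = sensitiveWordMap.get(first);
        StringBuilder sb = new StringBuilder().append(first);
        if ("1".equals(nowMap.get("end"))) {
            return sb.toString(); // single-character sensitive word
        }
        // MAX_MAP_LENGTH is defined elsewhere in the class; it is assumed here to
        // cap how many characters past position i a single match may extend.
        for (int j = i + 1; j < sensitiveWords.length() && j - i <= MAX_MAP_LENGTH; j++) {
            char c = sensitiveWords.charAt(j);
            if (!isChinese(c)) {
                sb.append(c); // keep the special character in the result, but do not match on it
                continue;
            }
            nowMap = (Map) nowMap.get(c);
            if (CollectionUtils.isEmpty(nowMap)) {
                break; // dead end: try the next start position
            }
            sb.append(c);
            if ("1".equals(nowMap.get("end"))) {
                return sb.toString();
            }
        }
    }
    return StringUtils.EMPTY; // nothing matched
}

// Same range as the original per-character regex check, without the regex cost.
private static boolean isChinese(char c) {
    return c >= '\u4E00' && c <= '\u9FA5';
}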
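This article adds handling for special characters: while matching a sensitive word, special characters are skipped and only Chinese characters are compared.
Input text to check: 发 丝 0 袜送的发达色法士大夫黄 的色是打发士大 夫是
Returned sensitive word: 丝 0 袜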
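For readers who want to reproduce this run without the surrounding project, the following is a self-contained sketch that depends only on the JDK. It is not the article's code: the class SkipSymbolsSketch and its methods are hypothetical, it uses a typed node rather than nested maps, and it leaves out the MAX_MAP_LENGTH limit on how far a match may extend.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, JDK-only sketch of matching that skips non-Chinese characters.
class SkipSymbolsSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean end; // true when a sensitive word ends at this node
    }

    private final Node root = new Node();

    SkipSymbolsSketch(String... words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            if (node != root) {
                node.end = true;
            }
        }
    }

    // Covers U+4E00 to U+9FA5, the same range the article's regex checks.
    static boolean isChinese(char c) {
        return c >= '\u4E00' && c <= '\u9FA5';
    }

    // Returns the first match, including skipped characters inside it, or "" if none.
    String firstMatch(String text) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root.children.get(text.charAt(i));
            if (node == null) {
                continue; // this character cannot start a sensitive word
            }
            if (node.end) {
                return text.substring(i, i + 1);
            }
            for (int j = i + 1; j < text.length(); j++) {
                char c = text.charAt(j);
                if (!isChinese(c)) {
                    continue; // skip spaces, digits and symbols inside a candidate
                }
                node = node.children.get(c);
                if (node == null) {
                    break; // dead end: move on to the next start position
                }
                if (node.end) {
                    return text.substring(i, j + 1);
                }
            }
        }
        return "";
    }
}
```

Built with the seven words listed earlier and given the input text above, firstMatch returns 丝 0 袜, the same result as the article's run. Since the sketch has no length cap, a candidate that starts with a matching character can keep skipping symbols until the end of the text, so a real filter would probably want a limit similar to the one in checkWords.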