Hot-Reloading the IK Analyzer Dictionary

Customizing the IK analyzer to hot-reload its dictionaries

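While building a recent project I used Elasticsearch as the search engine and chose the IK analyzer for tokenization. Out of the box, many newly coined words are split into single characters, which gives poor results. The only remedy is to add an extension dictionary by hand. After adding an extension dictionary and a few hot words manually, the tokenization improved.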

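Stop words remained a problem, though: many meaningless words were still emitted as tokens. New words also appear all the time, and editing the extension dictionary by hand every time one shows up is tedious. So I decided to modify the IK analyzer's dictionary-loading logic: customize the plugin, add the MySQL driver, and hot-reload the dictionary by querying a database. After that, adding a new word only requires inserting a row into the database. The analyzer picks it up shortly afterwards, and dictionary updates become easier to automate in code.

This project uses Elasticsearch 7.17.7, and the IK analyzer version must match. Download and unpack the source of the matching IK release, open it in IDEA, and load it as a Maven project.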

Add the Maven coordinates for the MySQL driver:

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.29</version>
        </dependency>

First, a look at the project structure. All dictionary-related code lives in the org.wltea.analyzer.dic package. The Monitor class there is what watches a remote dictionary for changes.

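We can model a new monitor on it, one that watches a database-backed dictionary for changes.

The rough logic is to read the dictionary rows from the database. To avoid pulling a large amount of data on every poll and putting load on the system, the monitor records the latest modification time it has seen, and the next poll only fetches rows modified since then. Because rows are selected by time, a non-unique index on the time column speeds up the query.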
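The incremental polling idea can be tried without a database. The sketch below is only an illustration: `WatermarkPollSketch`, `Row` and `poll` are hypothetical names, and an in-memory list stands in for the table and the `moditime >= ?` query.

```java
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WatermarkPollSketch {
    /** One dictionary row: the word plus its last modification time. */
    static final class Row {
        final String word;
        final Timestamp moditime;

        Row(String word, String moditime) {
            this.word = word;
            this.moditime = Timestamp.valueOf(moditime);
        }
    }

    /** Latest modification time seen so far; starts well before any real row. */
    private Timestamp watermark = Timestamp.valueOf("2022-01-01 00:00:00");

    /** Stand-in for "WHERE moditime >= ?": returns the changed words and advances the watermark. */
    List<String> poll(List<Row> table) {
        List<String> changed = new ArrayList<>();
        Timestamp newest = watermark;
        for (Row row : table) {
            if (row.moditime.before(watermark)) {
                continue; // older than the watermark, already loaded by an earlier poll
            }
            changed.add(row.word);
            if (row.moditime.after(newest)) {
                newest = row.moditime;
            }
        }
        watermark = newest;
        return changed;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>(Arrays.asList(
                new Row("alpha", "2024-03-12 10:00:00"),
                new Row("beta", "2024-03-12 10:05:00")));
        WatermarkPollSketch monitor = new WatermarkPollSketch();
        System.out.println(monitor.poll(table)); // [alpha, beta]
        table.add(new Row("gamma", "2024-03-12 10:06:00"));
        System.out.println(monitor.poll(table)); // [beta, gamma]
    }
}
```

Because the comparison is inclusive, the row that set the watermark is read again on the next poll. That costs one redundant row per poll but avoids missing a second row written with the same timestamp.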

The monitor code is as follows:

package org.wltea.analyzer.dic;

import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.SpecialPermission;
import org.wltea.analyzer.cfg.JdbcConfig;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

/**
 * Polls MySQL for dictionary changes and applies them to the in-memory dictionaries.
 */
public class JdbcMinitor implements Runnable {
    // Declared before the static block so the logger is already initialized when the block runs.
    private static final Logger logger = ESPluginLoggerFactory.getLogger(JdbcMinitor.class.getName());

    static {
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
        } catch (Exception e) {
            // The original called e.getStackTrace(), which discards the error silently.
            logger.error("failed to load the MySQL JDBC driver", e);
        }
    }

    /** JDBC configuration. */
    private final JdbcConfig jdbcConfig;
    /** Latest modification time seen for main-dictionary words. */
    private final Timestamp mainLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");
    /** Latest modification time seen for stop words. */
    private final Timestamp stopLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");

    public JdbcMinitor(JdbcConfig jdbcConfig) {
        this.jdbcConfig = jdbcConfig;
    }

    @Override
    public void run() {
        SpecialPermission.check();
        AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
            this.runUnprivileged();
            return null;
        });
    }

    /** Loads main words and stop words. */
    public void runUnprivileged() {
        loadWords();
    }

    private void loadWords() {
        List<String> mainWords = new ArrayList<>();
        List<String> delMainWords = new ArrayList<>();
        List<String> stopWords = new ArrayList<>();
        List<String> delStopWords = new ArrayList<>();
        setAllWordList(mainWords, delMainWords, stopWords, delStopWords);
        // fillSegmentMain, disableSegmentMain, fillSegmentStop and disableSegmentStop are not
        // shown in this post; they are assumed to be helpers added to Dictionary that add or
        // disable a single word in the main or stop dictionary.
        Dictionary dictionary = Dictionary.getSingleton();
        mainWords.forEach(dictionary::fillSegmentMain);
        delMainWords.forEach(dictionary::disableSegmentMain);
        stopWords.forEach(dictionary::fillSegmentStop);
        delStopWords.forEach(dictionary::disableSegmentStop);
        logger.info("ik dic refresh from db. mainLastModitime: {} stopLastModitime: {}", mainLastModitime, stopLastModitime);
    }

    /**
     * Opens one connection and fetches the changed main words and stop words.
     * Words to add go into mainWords/stopWords, words to disable into delMainWords/delStopWords.
     */
    private void setAllWordList(List<String> mainWords, List<String> delMainWords, List<String> stopWords, List<String> delStopWords) {
        // try-with-resources closes the connection even when a query fails
        try (Connection connection = DriverManager.getConnection(jdbcConfig.getUrl(), jdbcConfig.getUsername(), jdbcConfig.getPassword())) {
            setWordList(connection, jdbcConfig.getMainWordSql(), mainLastModitime, mainWords, delMainWords);
            setWordList(connection, jdbcConfig.getStopWordSql(), stopLastModitime, stopWords, delStopWords);
        } catch (SQLException e) {
            logger.error("jdbc load words failed: mainLastModitime-{} stopLastModitime-{}", mainLastModitime, stopLastModitime);
            logger.error("jdbc load words failed", e);
        }
    }

    /**
     * Runs one dictionary query and sorts the rows into words to add and words to disable.
     * lastModitime is updated in place to the newest moditime seen.
     */
    private void setWordList(Connection connection, String sql, Timestamp lastModitime, List<String> words, List<String> delWords) {
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setTimestamp(1, lastModitime);
            try (ResultSet result = statement.executeQuery()) {
                while (result.next()) {
                    String word = result.getString("word");
                    Timestamp moditime = result.getTimestamp("moditime");
                    String ifdel = result.getString("ifdel");

                    if ("1".equals(ifdel)) {
                        delWords.add(word);
                    } else {
                        words.add(word);
                    }

                    // Keep the newest modification time as the starting point for the next poll.
                    if (moditime.after(lastModitime)) {
                        lastModitime.setTime(moditime.getTime());
                    }
                }
            }
        } catch (SQLException e) {
            logger.error("jdbc load words failed: {}", lastModitime);
            logger.error("jdbc load words failed", e);
        }
    }
}

Stop-word table:

CREATE TABLE `es_dic_stop`  (
  `id` int NOT NULL AUTO_INCREMENT,
  `word` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT 'stop word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `time`(`moditime`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 2434 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci COMMENT = 'stop words' ROW_FORMAT = Dynamic;

Main dictionary table:

CREATE TABLE `es_dic_main`  (
  `id` int NOT NULL AUTO_INCREMENT,
  `word` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL COMMENT 'word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `time`(`moditime`) USING BTREE COMMENT 'time'
) ENGINE = InnoDB AUTO_INCREMENT = 5593806 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci COMMENT = 'main words' ROW_FORMAT = Dynamic;

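The two tables are nearly identical. The ifdel column marks a word as blocked: the monitor disables rows with ifdel = '1' in the dictionary instead of adding them.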
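To make the ifdel semantics concrete, here is a small sketch of a dictionary trie where disabling a word only clears a flag instead of removing nodes. It is a simplified stand-in, not IK's actual DictSegment; the class `SoftDeleteTrieSketch` and its methods are hypothetical names chosen for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class SoftDeleteTrieSketch {
    private final Map<Character, SoftDeleteTrieSketch> children = new HashMap<>();
    /** True when the path from the root to this node spells an active word. */
    private boolean enabledWord;

    /** Adds the word, or re-enables it if it was disabled earlier. */
    void fill(String word) {
        setState(word, true);
    }

    /** Marks the word as disabled; nodes stay in place so longer words with the same prefix still match. */
    void disable(String word) {
        setState(word, false);
    }

    private void setState(String word, boolean enabled) {
        SoftDeleteTrieSketch node = this;
        for (char c : word.toCharArray()) {
            SoftDeleteTrieSketch next = node.children.get(c);
            if (next == null) {
                if (!enabled) {
                    return; // disabling an unknown word is a no-op
                }
                next = new SoftDeleteTrieSketch();
                node.children.put(c, next);
            }
            node = next;
        }
        node.enabledWord = enabled;
    }

    boolean contains(String word) {
        SoftDeleteTrieSketch node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.enabledWord;
    }

    /** Applies one table row the way the monitor does: ifdel "1" disables, anything else adds. */
    void apply(String word, String ifdel) {
        if ("1".equals(ifdel)) {
            disable(word);
        } else {
            fill(word);
        }
    }

    public static void main(String[] args) {
        SoftDeleteTrieSketch dict = new SoftDeleteTrieSketch();
        dict.apply("hot", "0");
        dict.apply("hotfix", "0");
        dict.apply("hot", "1"); // the row was later flagged as deleted
        System.out.println(dict.contains("hot"));    // false
        System.out.println(dict.contains("hotfix")); // true
    }
}
```

Flagging a row rather than deleting it matters for the polling approach: a deleted row would simply vanish from the query result, while a flagged row gets a fresh moditime and is picked up by the next poll.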

Add the configuration class the monitor needs and put it in the org.wltea.analyzer.cfg package:

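package org.wltea.analyzer.cfg;

/**
 * Connection settings and queries used by the JDBC dictionary monitor.
 */
public class JdbcConfig {
    private final String url;
    private final String username;
    private final String password;
    private final String mainWordSql;
    private final String stopWordSql;
    /** Polling interval. */
    private final Integer interval;

    public JdbcConfig(String url, String username, String password, String mainWordSql, String stopWordSql, Integer interval) {
        this.url = url;
        this.username = username;
        this.password = password;
        this.mainWordSql = mainWordSql;
        this.stopWordSql = stopWordSql;
        this.interval = interval;
    }

    public String getUrl() { return url; }
    public String getUsername() { return username; }
    public String getPassword() { return password; }
    public String getMainWordSql() { return mainWordSql; }
    public String getStopWordSql() { return stopWordSql; }
    public Integer getInterval() { return interval; }
}

The getters are what the monitor calls. The fields are final here, so there are no setters; drop final and add setters if the configuration has to change after construction.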

The corresponding configuration file, jdbc.properties, looks like this:

jdbc.url=jdbc:mysql://127.0.0.1:3306/ik?useUnicode=true&characterEncoding=utf8&autoReconnect=true&useSSL=false&serverTimezone=Asia/Shanghai
jdbc.username=root
jdbc.password=1111111
main.word.sql=SELECT * FROM es_dic_main WHERE moditime >= ?
stop.word.sql=SELECT * FROM es_dic_stop WHERE moditime >= ?
# polling interval (seconds)
interval=10

Replace the connection settings above with your real connection details. The file goes in the plugin's top-level config directory.
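A typo in this file is easy to miss until the first poll fails, so it can help to validate it up front. The following is a minimal sketch under my own assumptions: `JdbcPropertiesCheck`, `parse` and `interval` are hypothetical names, and the default of 10 is simply taken from the sample file above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class JdbcPropertiesCheck {
    /** Keys the monitor cannot work without (same names as in jdbc.properties above). */
    static final String[] REQUIRED = {
            "jdbc.url", "jdbc.username", "jdbc.password", "main.word.sql", "stop.word.sql"};

    /** Parses properties text and fails fast when a required key is missing or blank. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalArgumentException("unreadable properties", e);
        }
        for (String key : REQUIRED) {
            String value = properties.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("missing property: " + key);
            }
        }
        return properties;
    }

    /** Polling interval; falls back to 10 (an assumed default) when the key is absent. */
    static int interval(Properties properties) {
        int value = Integer.parseInt(properties.getProperty("interval", "10").trim());
        if (value <= 0) {
            throw new IllegalArgumentException("interval must be positive: " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "jdbc.url=jdbc:mysql://127.0.0.1:3306/ik?useSSL=false",
                "jdbc.username=root",
                "jdbc.password=secret",
                "main.word.sql=SELECT * FROM es_dic_main WHERE moditime >= ?",
                "stop.word.sql=SELECT * FROM es_dic_stop WHERE moditime >= ?",
                "interval=10");
        Properties properties = parse(sample);
        System.out.println(properties.getProperty("main.word.sql"));
        System.out.println(interval(properties)); // 10
    }
}
```

In the plugin the text would come from the file on disk rather than a string; the check itself stays the same.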

Next, the dictionary manager has to load the JDBC configuration file. In org.wltea.analyzer.dic.Dictionary, add the logic that reads the file as the last line of the constructor:

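	// Assumes a PATH_JDBC_CONFIG constant (presumably "jdbc.properties") and a
	// jdbcConfig field have been added to Dictionary; neither is shown in this post.
	private void setJdbcConfig() {
		Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_JDBC_CONFIG);
		Properties properties = new Properties();
		// try-with-resources closes the stream, which the original left open
		try (InputStream in = new FileInputStream(file.toFile())) {
			properties.load(in);
		} catch (Exception e) {
			// With an unreadable file the settings below stay null and every poll
			// logs a connection error until the file is fixed.
			logger.error("load jdbc.properties failed", e);
		}
		jdbcConfig = new JdbcConfig(
				properties.getProperty("jdbc.url"),
				properties.getProperty("jdbc.username"),
				properties.getProperty("jdbc.password"),
				properties.getProperty("main.word.sql"),
				properties.getProperty("stop.word.sql"),
				// Default avoids a NumberFormatException when the key is missing.
				Integer.valueOf(properties.getProperty("interval", "10").trim())
		);
	}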
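The post does not show where the monitor gets started. Stock IK schedules its remote-dictionary Monitor from Dictionary.initial() with a ScheduledExecutorService, so the JDBC monitor is presumably scheduled the same way, using the interval from jdbc.properties. The sketch below shows that pattern in isolation; `MonitorSchedulingSketch`, `guarded` and `start` are hypothetical names, and a counter stands in for the real monitor.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MonitorSchedulingSketch {
    /** Wraps a task so that an unchecked exception in one run does not cancel the later runs. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.err.println("dictionary poll failed: " + e.getMessage());
            }
        };
    }

    /** Starts periodic polling: first run after initialDelay, then once every interval. */
    static ScheduledFuture<?> start(ScheduledExecutorService pool, Runnable monitor,
                                    long initialDelay, long interval, TimeUnit unit) {
        return pool.scheduleAtFixedRate(guarded(monitor), initialDelay, interval, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        AtomicInteger polls = new AtomicInteger();
        // Stand-in for the JDBC monitor; the first poll fails on purpose.
        Runnable monitor = () -> {
            if (polls.incrementAndGet() == 1) {
                throw new IllegalStateException("database unreachable");
            }
        };
        start(pool, monitor, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        pool.shutdownNow();
        System.out.println("polls ran: " + polls.get());
    }
}
```

The guard is there because scheduleAtFixedRate suppresses all subsequent executions once a run throws. In the plugin, the call would pass a JdbcMinitor built from jdbcConfig and use TimeUnit.SECONDS with jdbcConfig.getInterval() as the period, assuming the interval is meant in seconds.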

Then update the permissions in resources/plugin-security.policy:

grant {
  // needed because of the hot reload functionality
  permission java.net.SocketPermission "*", "connect,resolve";
  permission java.lang.RuntimePermission "setContextClassLoader";
};

After that, package the plugin and place it in the Elasticsearch plugins directory so it gets loaded.