SGK

处理10e

Posted by 3thernet on July 6, 2024

仅供学习

大小:34,302,066,688 B

格式:name,cardNo

常用命令(windows):

1
2
3
get-content 10e -totalcount 10
get-content 10e -tail 10
(get-content 10e -totalcount 10|Out-String) -replace '\r','aaa' -replace '\n', 'bbb'

思路:按照province_code建立分区表,然后建立关于姓名和生日的索引。分区不仅可以加快根据省份同名查询的速度,还能够筛选掉部分格式不正确的数据

建表:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
CREATE TABLE `china_mainland` (
  `cardNo` char(18) COLLATE utf8_unicode_ci NOT NULL,
  `name` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
  `province_code` char(2) COLLATE utf8_unicode_ci GENERATED ALWAYS AS (substr(`cardNo`,1,2)) STORED NOT NULL,
  `address_code` char(6) COLLATE utf8_unicode_ci GENERATED ALWAYS AS (substr(`cardNo`,1,6)) VIRTUAL,
  `birth_date` date GENERATED ALWAYS AS (str_to_date(substr(`cardNo`,7,8),'%Y%m%d')) STORED,
  `gender` enum('M','F') COLLATE utf8_unicode_ci GENERATED ALWAYS AS (if(((substr(`cardNo`,17,1) % 2) = 1),'M','F')) STORED,
  `address` text COLLATE utf8_unicode_ci,
  `mobile` varchar(11) COLLATE utf8_unicode_ci DEFAULT '',
  UNIQUE KEY `unique_info` (`cardNo`,`province_code`,`name`,`mobile`,`address`(100)),
  KEY `idx_name` (`name`),
  KEY `idx_mobile` (`mobile`),
  KEY `idx_address` (`address`(100))
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY LIST  COLUMNS(province_code)
(PARTITION p11 VALUES IN ('11') ENGINE = MyISAM,
 PARTITION p12 VALUES IN ('12') ENGINE = MyISAM,
 PARTITION p13 VALUES IN ('13') ENGINE = MyISAM,
 PARTITION p14 VALUES IN ('14') ENGINE = MyISAM,
 PARTITION p15 VALUES IN ('15') ENGINE = MyISAM,
 PARTITION p16 VALUES IN ('16') ENGINE = MyISAM,
 PARTITION p17 VALUES IN ('17') ENGINE = MyISAM,
 PARTITION p18 VALUES IN ('18') ENGINE = MyISAM,
 PARTITION p19 VALUES IN ('19') ENGINE = MyISAM,
 PARTITION p20 VALUES IN ('20') ENGINE = MyISAM,
 PARTITION p21 VALUES IN ('21') ENGINE = MyISAM,
 PARTITION p22 VALUES IN ('22') ENGINE = MyISAM,
 PARTITION p23 VALUES IN ('23') ENGINE = MyISAM,
 PARTITION p24 VALUES IN ('24') ENGINE = MyISAM,
 PARTITION p25 VALUES IN ('25') ENGINE = MyISAM,
 PARTITION p26 VALUES IN ('26') ENGINE = MyISAM,
 PARTITION p27 VALUES IN ('27') ENGINE = MyISAM,
 PARTITION p28 VALUES IN ('28') ENGINE = MyISAM,
 PARTITION p29 VALUES IN ('29') ENGINE = MyISAM,
 PARTITION p30 VALUES IN ('30') ENGINE = MyISAM,
 PARTITION p31 VALUES IN ('31') ENGINE = MyISAM,
 PARTITION p32 VALUES IN ('32') ENGINE = MyISAM,
 PARTITION p33 VALUES IN ('33') ENGINE = MyISAM,
 PARTITION p34 VALUES IN ('34') ENGINE = MyISAM,
 PARTITION p35 VALUES IN ('35') ENGINE = MyISAM,
 PARTITION p36 VALUES IN ('36') ENGINE = MyISAM,
 PARTITION p37 VALUES IN ('37') ENGINE = MyISAM,
 PARTITION p38 VALUES IN ('38') ENGINE = MyISAM,
 PARTITION p39 VALUES IN ('39') ENGINE = MyISAM,
 PARTITION p40 VALUES IN ('40') ENGINE = MyISAM,
 PARTITION p41 VALUES IN ('41') ENGINE = MyISAM,
 PARTITION p42 VALUES IN ('42') ENGINE = MyISAM,
 PARTITION p43 VALUES IN ('43') ENGINE = MyISAM,
 PARTITION p44 VALUES IN ('44') ENGINE = MyISAM,
 PARTITION p45 VALUES IN ('45') ENGINE = MyISAM,
 PARTITION p46 VALUES IN ('46') ENGINE = MyISAM,
 PARTITION p47 VALUES IN ('47') ENGINE = MyISAM,
 PARTITION p48 VALUES IN ('48') ENGINE = MyISAM,
 PARTITION p49 VALUES IN ('49') ENGINE = MyISAM,
 PARTITION p50 VALUES IN ('50') ENGINE = MyISAM,
 PARTITION p51 VALUES IN ('51') ENGINE = MyISAM,
 PARTITION p52 VALUES IN ('52') ENGINE = MyISAM,
 PARTITION p53 VALUES IN ('53') ENGINE = MyISAM,
 PARTITION p54 VALUES IN ('54') ENGINE = MyISAM,
 PARTITION p55 VALUES IN ('55') ENGINE = MyISAM,
 PARTITION p56 VALUES IN ('56') ENGINE = MyISAM,
 PARTITION p57 VALUES IN ('57') ENGINE = MyISAM,
 PARTITION p58 VALUES IN ('58') ENGINE = MyISAM,
 PARTITION p59 VALUES IN ('59') ENGINE = MyISAM,
 PARTITION p60 VALUES IN ('60') ENGINE = MyISAM,
 PARTITION p61 VALUES IN ('61') ENGINE = MyISAM,
 PARTITION p62 VALUES IN ('62') ENGINE = MyISAM,
 PARTITION p63 VALUES IN ('63') ENGINE = MyISAM,
 PARTITION p64 VALUES IN ('64') ENGINE = MyISAM,
 PARTITION p65 VALUES IN ('65') ENGINE = MyISAM,
 PARTITION p71 VALUES IN ('71') ENGINE = MyISAM,
 PARTITION p81 VALUES IN ('81') ENGINE = MyISAM,
 PARTITION p82 VALUES IN ('82') ENGINE = MyISAM);

原始文件有一些不规范数据,比如加密过的姓名、身份证,还有一部分的数据格式不是(name, cardNo),而是(name, mobile, address),因此需要对原始数据进行清洗

顺便说一下为什么要拆分,我在导入超过2gb的文件时会无法导入(无报错但是0 records),搜索了一下有人说修改max_allowed_packet,但这个其实是单次传输文件的大小,实际上应该修改exec_mem_limit导入内存限制(默认2gb,低版本疑似无法修改)。然而大文件的导入速度实在太慢,如果中间出错很可能功亏一篑,因此最好先拆分成小文件

参考:MYSQL-LOAD - Apache Doris

在linux上可以使用split命令进行拆分,windows下老老实实用python吧:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import os

def validate_format(line):
    parts = line.strip().split(',')
    # 期望的格式是(name, identity_number)
    if len(parts) == 2 and len(parts[1]) == 18:
        return True
    return False

def split_large_csv(input_file, output_dir, max_size_gb):
    # 计算最大文件大小(以字节为单位)
    max_size_bytes = max_size_gb * 1024 * 1024 * 1024

    # 初始化当前文件大小和块计数器
    current_size = 0
    chunk_count = 1

    # 打开文件保存格式不正确的行
    error_file = open(os.path.join(output_dir, 'error_lines'), 'w', encoding='utf-8')

    # 逐行读取输入文件
    with open(input_file, 'r', encoding='utf-8') as infile:
        outfile = None

        for line in infile:
            if not validate_format(line):
                # 将不正确的行写入错误文件
                error_file.write(line)
                continue  # 跳过后续处理,直接处理下一行

            line_size = len(line.encode('utf-8'))

            if outfile is None or current_size + line_size > max_size_bytes:
                if outfile:
                    outfile.close()
                outfile = open(os.path.join(output_dir, f'chunk_{chunk_count}.csv'), 'w', encoding='utf-8')
                chunk_count += 1
                current_size = 0

            outfile.write(line)
            current_size += line_size

        if outfile:
            outfile.close()

# 调用函数
split_large_csv('10e', 'renkou', max_size_gb=1)

由于myisam是表级锁,无法多线程导入,因此直接用批处理脚本导入即可:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
@echo off
setlocal

set USER=your_username
set PASSWORD=your_password
set DATABASE=your_database
set TABLE=your_table
set HOST=localhost
set DATA_DIR=./renkou/
set MYSQL_PATH=mysql

for %%f in (%DATA_DIR%\*.csv) do (
    echo Loading %%f into MySQL
    "%MYSQL_PATH%" -h %HOST% -u %USER% -p%PASSWORD% %DATABASE% -e ^
    "LOAD DATA LOCAL INFILE '%%f' INTO TABLE %TABLE% FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' (name, cardNo);"
)

echo Done.
endlocal

然后导入完发现只有1亿多条有效数据,再三验证发现这个所谓10e是重复了9遍的1e

查看各分区表中数据:

1
2
3
SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_name='your_table' and table_schema='your_database';