仅供学习
大小:34,302,066,688 B
格式:name,cardNo
常用命令(windows):
1
2
3
get-content 10e -totalcount 10
get-content 10e -tail 10
(get-content 10e -totalcount 10|Out-String) -replace '\r','aaa' -replace '\n', 'bbb'
思路:按照province_code
建立分区表,然后建立关于姓名和生日的索引。分区不仅可以加快根据省份同名查询的速度,还能够筛选掉部分格式不正确的数据
建表:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
CREATE TABLE `china_mainland` (
`cardNo` char(18) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`province_code` char(2) COLLATE utf8_unicode_ci GENERATED ALWAYS AS (substr(`cardNo`,1,2)) STORED NOT NULL,
`address_code` char(6) COLLATE utf8_unicode_ci GENERATED ALWAYS AS (substr(`cardNo`,1,6)) VIRTUAL,
`birth_date` date GENERATED ALWAYS AS (str_to_date(substr(`cardNo`,7,8),'%Y%m%d')) STORED,
`gender` enum('M','F') COLLATE utf8_unicode_ci GENERATED ALWAYS AS (if(((substr(`cardNo`,17,1) % 2) = 1),'M','F')) STORED,
`address` text COLLATE utf8_unicode_ci,
`mobile` varchar(11) COLLATE utf8_unicode_ci DEFAULT '',
UNIQUE KEY `unique_info` (`cardNo`,`province_code`,`name`,`mobile`,`address`(100)),
KEY `idx_name` (`name`),
KEY `idx_mobile` (`mobile`),
KEY `idx_address` (`address`(100))
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY LIST COLUMNS(province_code)
(PARTITION p11 VALUES IN ('11') ENGINE = MyISAM,
PARTITION p12 VALUES IN ('12') ENGINE = MyISAM,
PARTITION p13 VALUES IN ('13') ENGINE = MyISAM,
PARTITION p14 VALUES IN ('14') ENGINE = MyISAM,
PARTITION p15 VALUES IN ('15') ENGINE = MyISAM,
PARTITION p16 VALUES IN ('16') ENGINE = MyISAM,
PARTITION p17 VALUES IN ('17') ENGINE = MyISAM,
PARTITION p18 VALUES IN ('18') ENGINE = MyISAM,
PARTITION p19 VALUES IN ('19') ENGINE = MyISAM,
PARTITION p20 VALUES IN ('20') ENGINE = MyISAM,
PARTITION p21 VALUES IN ('21') ENGINE = MyISAM,
PARTITION p22 VALUES IN ('22') ENGINE = MyISAM,
PARTITION p23 VALUES IN ('23') ENGINE = MyISAM,
PARTITION p24 VALUES IN ('24') ENGINE = MyISAM,
PARTITION p25 VALUES IN ('25') ENGINE = MyISAM,
PARTITION p26 VALUES IN ('26') ENGINE = MyISAM,
PARTITION p27 VALUES IN ('27') ENGINE = MyISAM,
PARTITION p28 VALUES IN ('28') ENGINE = MyISAM,
PARTITION p29 VALUES IN ('29') ENGINE = MyISAM,
PARTITION p30 VALUES IN ('30') ENGINE = MyISAM,
PARTITION p31 VALUES IN ('31') ENGINE = MyISAM,
PARTITION p32 VALUES IN ('32') ENGINE = MyISAM,
PARTITION p33 VALUES IN ('33') ENGINE = MyISAM,
PARTITION p34 VALUES IN ('34') ENGINE = MyISAM,
PARTITION p35 VALUES IN ('35') ENGINE = MyISAM,
PARTITION p36 VALUES IN ('36') ENGINE = MyISAM,
PARTITION p37 VALUES IN ('37') ENGINE = MyISAM,
PARTITION p38 VALUES IN ('38') ENGINE = MyISAM,
PARTITION p39 VALUES IN ('39') ENGINE = MyISAM,
PARTITION p40 VALUES IN ('40') ENGINE = MyISAM,
PARTITION p41 VALUES IN ('41') ENGINE = MyISAM,
PARTITION p42 VALUES IN ('42') ENGINE = MyISAM,
PARTITION p43 VALUES IN ('43') ENGINE = MyISAM,
PARTITION p44 VALUES IN ('44') ENGINE = MyISAM,
PARTITION p45 VALUES IN ('45') ENGINE = MyISAM,
PARTITION p46 VALUES IN ('46') ENGINE = MyISAM,
PARTITION p47 VALUES IN ('47') ENGINE = MyISAM,
PARTITION p48 VALUES IN ('48') ENGINE = MyISAM,
PARTITION p49 VALUES IN ('49') ENGINE = MyISAM,
PARTITION p50 VALUES IN ('50') ENGINE = MyISAM,
PARTITION p51 VALUES IN ('51') ENGINE = MyISAM,
PARTITION p52 VALUES IN ('52') ENGINE = MyISAM,
PARTITION p53 VALUES IN ('53') ENGINE = MyISAM,
PARTITION p54 VALUES IN ('54') ENGINE = MyISAM,
PARTITION p55 VALUES IN ('55') ENGINE = MyISAM,
PARTITION p56 VALUES IN ('56') ENGINE = MyISAM,
PARTITION p57 VALUES IN ('57') ENGINE = MyISAM,
PARTITION p58 VALUES IN ('58') ENGINE = MyISAM,
PARTITION p59 VALUES IN ('59') ENGINE = MyISAM,
PARTITION p60 VALUES IN ('60') ENGINE = MyISAM,
PARTITION p61 VALUES IN ('61') ENGINE = MyISAM,
PARTITION p62 VALUES IN ('62') ENGINE = MyISAM,
PARTITION p63 VALUES IN ('63') ENGINE = MyISAM,
PARTITION p64 VALUES IN ('64') ENGINE = MyISAM,
PARTITION p65 VALUES IN ('65') ENGINE = MyISAM,
PARTITION p71 VALUES IN ('71') ENGINE = MyISAM,
PARTITION p81 VALUES IN ('81') ENGINE = MyISAM,
PARTITION p82 VALUES IN ('82') ENGINE = MyISAM);
原始文件有一些不规范数据,比如加密过的姓名、身份证,还有一部分的数据格式不是(name, cardNo),而是(name, mobile, address),因此需要对原始数据进行清洗
顺便说一下为什么要拆分,我在导入超过2gb的文件时会无法导入(无报错但是0 records),搜索了一下有人说修改max_allowed_packet
,但这个其实是单次传输文件的大小,实际上应该修改exec_mem_limit
导入内存限制(默认2gb,低版本疑似无法修改)。然而大文件的导入速度实在太慢,如果中间出错很可能功亏一篑,因此最好先拆分成小文件
在linux上可以使用split
命令进行拆分,windows下老老实实用python吧:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import os
def validate_format(line):
parts = line.strip().split(',')
# 期望的格式是(name, identity_number)
if len(parts) == 2 and len(parts[1]) == 18:
return True
return False
def split_large_csv(input_file, output_dir, max_size_gb):
# 计算最大文件大小(以字节为单位)
max_size_bytes = max_size_gb * 1024 * 1024 * 1024
# 初始化当前文件大小和块计数器
current_size = 0
chunk_count = 1
# 打开文件保存格式不正确的行
error_file = open(os.path.join(output_dir, 'error_lines'), 'w', encoding='utf-8')
# 逐行读取输入文件
with open(input_file, 'r', encoding='utf-8') as infile:
outfile = None
for line in infile:
if not validate_format(line):
# 将不正确的行写入错误文件
error_file.write(line)
continue # 跳过后续处理,直接处理下一行
line_size = len(line.encode('utf-8'))
if outfile is None or current_size + line_size > max_size_bytes:
if outfile:
outfile.close()
outfile = open(os.path.join(output_dir, f'chunk_{chunk_count}.csv'), 'w', encoding='utf-8')
chunk_count += 1
current_size = 0
outfile.write(line)
current_size += line_size
if outfile:
outfile.close()
# 调用函数
split_large_csv('10e', 'renkou', max_size_gb=1)
由于myisam是表级锁,无法多线程导入,因此直接用批处理脚本导入即可:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
@echo off
setlocal
set USER=your_username
set PASSWORD=your_password
set DATABASE=your_database
set TABLE=your_table
set HOST=localhost
set DATA_DIR=./renkou/
set MYSQL_PATH=mysql
for %%f in (%DATA_DIR%\*.csv) do (
echo Loading %%f into MySQL
"%MYSQL_PATH%" -h %HOST% -u %USER% -p%PASSWORD% %DATABASE% -e ^
"LOAD DATA LOCAL INFILE '%%f' INTO TABLE %TABLE% FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' (name, cardNo);"
)
echo Done.
endlocal
然后导入完发现只有1亿多条有效数据,再三验证发现这个所谓10e是重复了9遍的1e
查看各分区表中数据:
1
2
3
SELECT partition_name, table_rows
FROM information_schema.partitions
WHERE table_name='your_table' and table_schema='your_database';