How to download all disclosure PDF files from the National Equities Exchange and Quotations (NEEQ) system


How can you download every disclosure PDF published on the National Equities Exchange and Quotations (NEEQ) website?

http://www.neeq.com.cn/disclosure/examine.html

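First open the page above, enter the date range you want to download, and click Query to find the total number of result pages. Then run the following command at the Windows command prompt: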

for /l %i in (0,1,579) do curl -d "disclosureType=9&page=%i&companyCd=%E5%85%AC%E5%8F%B8%E5%90%8D%E7%A7%B0%2F%E6%8B%BC%E9%9F%B3%2F%E4%BB%A3%E7%A0%81&keyword=%E5%85%B3%E9%94%AE%E5%AD%97&startTime=2016-01-01&endTime=2016-05-13" "http://www.neeq.com.cn/disclosureInfoController/infoResult.do?callback=jQuery183009154627335424248_1463152016217" -o "%i.json"

Remember to replace 579 in (0,1,579) with the total page count you saw, and to replace the dates after startTime and endTime with the date range you need!

When the command finishes, the current directory contains one JSON file per result page.
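For readers working in bash rather than the Windows command prompt, the request body can be assembled by a small helper. This is only a sketch: build_post_body is a hypothetical name that does not appear in the original, and the companyCd and keyword values are the URL-encoded placeholder strings copied unchanged from the command above.

```shell
# Hypothetical helper (not part of the original): print the POST body
# for one result page.
# Usage: build_post_body PAGE START_DATE END_DATE   (dates as YYYY-MM-DD)
build_post_body() {
    # Passing the encoded values as %s arguments keeps their percent
    # signs literal instead of being read as printf format codes.
    printf 'disclosureType=9&page=%s' "$1"
    printf '&companyCd=%s' '%E5%85%AC%E5%8F%B8%E5%90%8D%E7%A7%B0%2F%E6%8B%BC%E9%9F%B3%2F%E4%BB%A3%E7%A0%81'
    printf '&keyword=%s' '%E5%85%B3%E9%94%AE%E5%AD%97'
    printf '&startTime=%s&endTime=%s\n' "$2" "$3"
}

# Example: show the body that would be sent for page 0.
build_post_body 0 2016-01-01 2016-05-13
```

The output could then be handed to curl with -d "$(build_post_body "$i" 2016-01-01 2016-05-13)", keeping the URL and -o option from the command above.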

Start bash (Linux bash) and define the following function:

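# parse_json JSON KEY: print every value stored under KEY in JSON,
# joined as "v1""v2"... with the outermost quotes stripped.
function parse_json()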
{
    echo $1 | \
    sed -e 's/[{}]/''/g' | \
    sed -e 's/", "/'\",\"'/g' | \
    sed -e 's/" ,"/'\",\"'/g' | \
    sed -e 's/" , "/'\",\"'/g' | \
    sed -e 's/","/'\"---SEPERATOR---\"'/g' | \
    awk -F=':' -v RS='---SEPERATOR---' "\$1~/\"$2\"/ {print}" | \
    sed -e "s/\"$2\"://" | \
    tr -d "\n\t" | \
    sed -e 's/\\"/"/g' | \
    sed -e 's/\\\\/\\/g' | \
    sed -e 's/^[ \t]*//g' | \
    sed -e 's/^"//'  -e 's/"$//'
}

Then run the following command at the bash prompt:

for f in *.json; do parse_json "$(cat "$f")" destFilePath >>result.txt; done

Wait for the command to finish. It creates result.txt in the current directory, containing the paths of all the PDF files.
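As a hedged alternative to parse_json, the same paths can often be pulled out with grep and cut. The sketch below assumes each entry appears exactly as "destFilePath":"..." with no whitespace around the colon and no escaped quotes inside the path; extract_dest_paths is a hypothetical name, and the sample input is made up for illustration rather than copied from a real response.

```shell
# Hypothetical helper (not part of the original): read JSON text on
# stdin and print each destFilePath value on its own line.
extract_dest_paths() {
    # grep -o emits every match on a separate line; splitting on the
    # double quote makes the path the fourth field.
    grep -o '"destFilePath":"[^"]*"' | cut -d '"' -f 4
}

# Example with made-up sample data:
printf '%s\n' '[{"destFilePath":"/disclosure/2016/a.pdf","title":"x"},{"destFilePath":"/disclosure/2016/b.pdf"}]' |
    extract_dest_paths
```

Because this prints one path per line, its output would not contain the "" separators that parse_json leaves between entries.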

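This file needs some cleanup to produce the final list.

Open result.txt, find every "" sequence and replace it with a CR/LF line break (Notepad++ can do this). After that, each line holds one file path. Finally, find /disclosure and replace it with http://www.neeq.com.cn/disclosure. The resulting file list looks like this: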

http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463125510_729375.pdf
http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463125646_149808.pdf
http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463126203_041625.pdf
http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463124670_146310.pdf
http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463124671_227049.pdf
http://www.neeq.com.cn/disclosure/2016/2016-05-13/1463124672_461095.pdf
.....
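The manual find-and-replace steps described above can also be sketched as a small shell filter. normalize_paths is a hypothetical name, and the sketch assumes the input has the shape parse_json produces, that is, paths separated by "" on a single line.

```shell
# Hypothetical helper (not part of the original): read parse_json
# output on stdin and print one full URL per line.
normalize_paths() {
    # awk turns each "" separator into a line break; sed then prefixes
    # the site address to every line that starts with /disclosure.
    awk '{ gsub(/""/, "\n"); print }' |
        sed 's|^/disclosure|http://www.neeq.com.cn/disclosure|'
}

# Example with made-up sample data:
printf '%s\n' '/disclosure/2016/a.pdf""/disclosure/2016/b.pdf' | normalize_paths
```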

Open this file in Excel, use the Remove Duplicates feature to drop repeated entries, and save the result as download.txt.
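The Excel step can also be approximated in the shell. The sketch below uses a hypothetical dedupe_lines helper that keeps the first occurrence of each line in its original order. It also strips carriage returns and skips blank lines, on the assumption that the file was edited on Windows and may contain CR/LF line endings.

```shell
# Hypothetical helper (not part of the original): read lines on stdin,
# drop carriage returns, blank lines, and repeated lines.
dedupe_lines() {
    # NF skips empty lines; seen[$0]++ is zero only the first time a
    # given line appears, so later duplicates are filtered out.
    tr -d '\r' | awk 'NF && !seen[$0]++'
}

# Example with made-up sample data:
printf 'a.pdf\r\nb.pdf\r\n\r\na.pdf\r\n' | dedupe_lines
```

Run as dedupe_lines < result.txt > download.txt, this would produce the list used in the next step.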

Finally, run the following command to start downloading all the files:

wget -i download.txt

The following script automates all of the steps above. It requires bash:

#!/bin/bash

mkdir aabbcc
cd aabbcc

# Replace 600 with the page count shown on the website!
# Replace startTime and endTime with the dates you need!
for i in {0..600}; do
  curl -d "disclosureType=9&page=$i&companyCd=%E5%85%AC%E5%8F%B8%E5%90%8D%E7%A7%B0%2F%E6%8B%BC%E9%9F%B3%2F%E4%BB%A3%E7%A0%81&keyword=%E5%85%B3%E9%94%AE%E5%AD%97&startTime=2016-01-01&endTime=2016-05-13" "http://www.neeq.com.cn/disclosureInfoController/infoResult.do?callback=jQuery183009154627335424248_1463152016217" -o "$i.json"
done

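# parse_json JSON KEY: print every value stored under KEY in JSON,
# joined as "v1""v2"... with the outermost quotes stripped.
function parse_json()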
{
    echo $1 | \
    sed -e 's/[{}]/''/g' | \
    sed -e 's/", "/'\",\"'/g' | \
    sed -e 's/" ,"/'\",\"'/g' | \
    sed -e 's/" , "/'\",\"'/g' | \
    sed -e 's/","/'\"---SEPERATOR---\"'/g' | \
    awk -F=':' -v RS='---SEPERATOR---' "\$1~/\"$2\"/ {print}" | \
    sed -e "s/\"$2\"://" | \
    tr -d "\n\t" | \
    sed -e 's/\\"/"/g' | \
    sed -e 's/\\\\/\\/g' | \
    sed -e 's/^[ \t]*//g' | \
    sed -e 's/^"//'  -e 's/"$//'
}

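# Process every JSON file and extract destFilePath, i.e. the file path
for f in *.json; do parse_json "$(cat "$f")" destFilePath >>result.txt; done
# The next two lines normalize the file to one path per line
# (plain \n is used so that no stray carriage return ends up in the URLs)
sed -i 's/""/\n/g' result.txt
sed -i 's/xls\//xls\n\//g' result.txt
# Turn each site-relative path into a full URL
sed -i 's|/disclosure|http://www.neeq.com.cn/disclosure|g' result.txt
# Remove duplicates, then download (-c resumes partially downloaded files)
sort -u result.txt > download.txt
wget -c -i download.txt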
