
How to set up a simple pyspider server, with notes from crawler practice

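Source code: https://github.com/binux/pyspider. Clone it to your local machine with git.

The steps below install and run pyspider from the source tree. setup.py installs the dependencies, so pip is not needed.
# python setup.py install
# ./run.py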

Configuration file: /root/tools/pyspider/config.json

{
    "taskdb": "mysql+taskdb://pyspider:wisetop@127.0.0.1:3306/taskdb",
    "projectdb": "mysql+projectdb://pyspider:wisetop@127.0.0.1:3306/projectdb",
    "resultdb": "mysql+resultdb://pyspider:wisetop@127.0.0.1:3306/resultdb",
    "message_queue": "redis://127.0.0.1:6379/db",
    "webui": {
        "username": "admin",
        "password": "pyspider",
        "need-auth": true
    }
}
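Before starting pyspider it can help to check that the database URLs in the config parse the way you expect. The following is a minimal sketch, not part of pyspider itself: the function name summarize_db_urls is made up for illustration, and the config text is inlined (with the same values as above) so the snippet runs on its own.

```python
import json

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7, as used in this article

# Same values as the config file above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "taskdb": "mysql+taskdb://pyspider:wisetop@127.0.0.1:3306/taskdb",
    "projectdb": "mysql+projectdb://pyspider:wisetop@127.0.0.1:3306/projectdb",
    "resultdb": "mysql+resultdb://pyspider:wisetop@127.0.0.1:3306/resultdb"
}
"""


def summarize_db_urls(config):
    """Return {key: (engine, dbtype, host, port, database)} for the three DB URLs.

    Hypothetical helper for illustration; it only splits the URL strings and
    never connects to MySQL.
    """
    summary = {}
    for key in ("taskdb", "projectdb", "resultdb"):
        parsed = urlparse(config[key])
        # The scheme looks like "mysql+taskdb": engine, then the pyspider db type.
        engine, _, dbtype = parsed.scheme.partition("+")
        summary[key] = (engine, dbtype, parsed.hostname, parsed.port,
                        parsed.path.lstrip("/"))
    return summary


for name, parts in sorted(summarize_db_urls(json.loads(CONFIG_TEXT)).items()):
    print("%s -> %s" % (name, parts))
```

To check the real file instead, replace json.loads(CONFIG_TEXT) with json.load(open("/root/tools/pyspider/config.json")).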


Create the MySQL user that the config file refers to:
CREATE USER 'pyspider'@'%' IDENTIFIED BY 'wisetop';
SET PASSWORD FOR 'pyspider'@'%' = PASSWORD('wisetop');



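Other useful commands:

Check what is using a port: lsof -i:25555
Check the MySQL processes: ps -ef | grep mysql
Start pyspider: /usr/local/bin/pyspider -c /root/tools/pyspider/config.json
Open the firewall for the WebUI port (5000), then print the resulting rules with iptables-save (it only writes them to standard output; redirect the output to a file if you want to keep them):
iptables -I INPUT 4 -p tcp -m state --state NEW -m tcp --dport 5000 -j ACCEPT
iptables-save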
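To confirm from Python that a port such as 5000 is really accepting connections, a plain TCP connect is enough. This is a small sketch using only the standard library; port_is_open is a name chosen here for illustration, and the host and port in the last line are assumptions you should adjust to your server.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, filtered by the firewall, or timed out.
        return False
    conn.close()
    return True


# Assumed target: the pyspider WebUI on this machine.
print("port 5000 reachable: %s" % port_is_open("127.0.0.1", 5000))
```

Running the same check from another machine against the server's public address tells you whether the firewall rule took effect.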

Installed pyspider libs directory: /usr/local/lib/python2.7/dist-packages/pyspider/libs

Related articles:

http://blog.csdn.net/sinat_33871437/article/details/50599735 [How to set up a simple pyspider server]
http://blog.csdn.net/dabpop139/article/details/51167149 [Tinkering with the PySpider crawler framework]
http://www.tuicool.com/articles/6RJ3qqn [Notes on web crawling in practice with pyspider]
http://blog.csdn.net/jxnu_xiaobing/article/details/44671653 [An application of a pyspider crawler]


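Connecting to MySQL remotely:

If your account can only log in from localhost and not from remote hosts, log in to MySQL on the server itself and change the "host" field in the "user" table of the "mysql" database from "localhost" to "%".
Table-edit method:
mysql>use mysql;
mysql>update user set host = '%' where user = 'root';
mysql>select host, user from user;
mysql>flush privileges;
Grant method:
For example, to let root connect to the MySQL server from any host with the password KeYpZrZx:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'KeYpZrZx' WITH GRANT OPTION;
FLUSH PRIVILEGES;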


Troubleshooting:

Problem 1: After installing pyspider on Ubuntu 12.04, running pyspider fails with the following error:
Traceback (most recent call last):
  File "/usr/local/bin/pyspider", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2707, in <module>
    working_set.require(__requires__)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 686, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 584, in resolve
    raise DistributionNotFound(req)
pkg_resources.DistributionNotFound: click>=3.3
Answer: Solutions found online say that Click is not installed, so I updated Click.
setuptools also needs to be upgraded. In my experience, many installation problems come down to outdated packages.
#pip install -U setuptools
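The traceback comes from pkg_resources checking the requirement click>=3.3, and the same check can be run by hand to see whether an upgrade fixed it. The sketch below assumes pkg_resources (shipped with setuptools) is importable; requirement_status is a helper name invented for this example.

```python
def requirement_status(requirement):
    """Describe whether a requirement string such as "click>=3.3" is satisfied."""
    try:
        import pkg_resources  # shipped with setuptools
    except ImportError:
        return "pkg_resources unavailable"
    try:
        # Same call that fails in the traceback above.
        pkg_resources.require(requirement)
    except pkg_resources.ResolutionError as exc:
        # Covers both DistributionNotFound and VersionConflict.
        return "missing or conflicting: %r" % (exc,)
    return "ok"


for req in ("click>=3.3", "setuptools"):
    print("%s -> %s" % (req, requirement_status(req)))
```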


Problem 2:
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /usr/bin/python -c "import setuptools, tokenize;__file__='/tmp/pip_build_root/pycurl/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-AphElC-record/install-record.txt --single-version-externally-managed --compile failed with error code 1 in /tmp/pip_build_root/pycurl
Storing debug log for failure in /root/.pip/pip.log
Answer: Reinstall the build dependencies first:
sudo apt-get build-dep python-lxml
sudo pip install lxml --upgrade
sudo apt-get install libpcap-dev libpq-dev
The package that fails to compile in this log is pycurl, so the Python and libcurl development headers (for example python-dev and libcurl4-openssl-dev) are probably needed as well.

Problem 3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyspider-0.3.10_dev-py2.7.egg/pyspider/scheduler/scheduler.py", line 679, in xmlrpc_run
    from pyspider.libs.wsgi_xmlrpc import WSGIXMLRPCApplication
  File "/usr/local/lib/python2.7/dist-packages/pyspider-0.3.10_dev-py2.7.egg/pyspider/libs/wsgi_xmlrpc.py", line 18, in <module>
    from six.moves.xmlrpc_server import SimpleXMLRPCDispatcher
ImportError: No module named xmlrpc_server
Answer: The installed version of six is too old. Upgrade it with: pip install -U six
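A quick way to tell whether the installed six is new enough is to attempt the exact import that fails in the traceback. This is only a sketch; six_has_xmlrpc_server is a name made up here, and the snippet also works when six is not installed at all.

```python
def six_has_xmlrpc_server():
    """Return True if the installed six provides six.moves.xmlrpc_server."""
    try:
        # The import that raises ImportError in the traceback above.
        from six.moves.xmlrpc_server import SimpleXMLRPCDispatcher  # noqa: F401
    except ImportError:
        return False
    return True


try:
    import six
    print("six version: %s" % six.__version__)
except ImportError:
    print("six is not installed")

print("six.moves.xmlrpc_server importable: %s" % six_has_xmlrpc_server())
```

If the last line prints False, run the upgrade command and try again.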
